This series implements an event-driven cpufreq governor that scales cpu frequency as a function of cfs runqueue utilization. The intent of this RFC is to get some discussion going about how the scheduler can become the policy engine for selecting cpu frequency, what limitations exist, and what design we want to pursue to get to a solution.
This series depends on having frequency-invariant representations for load. This requires Vincent's recently merged cpu capacity rework patches, as well as two patches posted by Morten in his energy aware scheduling v3 series. The latter two patches are included in this series for posterity, but discussion around them probably belongs in the v3 eas series or the forthcoming v4 series.
Thanks to Juri Lelli juri.lelli@arm.com for contributing to the development of the governor.
A git branch with these patches can be pulled from here: https://git.linaro.org/people/mike.turquette/linux.git sched-freq
Smoke testing has been done on an OMAP4 Pandaboard and an Exynos 5800 Chromebook2.
---8<---
eas-dev,
Please let me know what you think of this series, including the code as well as the cover letter and commit log text. I was not able to finish the irq_work additions to the governor in time to submit this tonight (these changes remove the periodic behavior of calling cap_gov_kick_thread from run_rebalance_domains). I'll focus on the irq_work stuff tomorrow and post an addendum to this series asap.
I have not done any benchmark testing with this series. That is also on my todo list for this week and any help there would be appreciated.
Freedom & Howard, if you are bored and feel like measuring power across some benchmarks on your non-buggy EVBs then please do. I can only measure power on the A53s right now which limits me to a single cluster with two cores.
Regards, Mike
Michael Turquette (4):
  sched: sched feature for cpu frequency selection
  cpufreq: add per-governor private data
  sched: export get_cpu_usage in sched.h
  sched: cap_gov: PELT-based cpu frequency scaling

Morten Rasmussen (2):
  cpufreq: Architecture specific callback for frequency changes
  arm: Frequency invariant scheduler load-tracking support

 arch/arm/include/asm/topology.h |   4 +
 arch/arm/kernel/topology.c      |  41 +++++
 drivers/cpufreq/Kconfig         |  22 +++
 drivers/cpufreq/cpufreq.c       |  13 +-
 include/linux/cpufreq.h         |   6 +
 kernel/sched/Makefile           |   1 +
 kernel/sched/cap_gov.c          | 361 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c             |  26 ++-
 kernel/sched/features.h         |   6 +
 kernel/sched/sched.h            |  10 ++
 10 files changed, 488 insertions(+), 2 deletions(-)
 create mode 100644 kernel/sched/cap_gov.c
-- 1.9.1
From: Morten Rasmussen Morten.Rasmussen@arm.com
Architectures that don't have any other means for tracking cpu frequency changes need a callback from cpufreq to implement a scaling factor to enable scale-invariant per-entity load-tracking in the scheduler.
To compute the scale invariance correction factor the architecture would need to know both the max frequency and the current frequency. This patch defines weak functions for setting both from cpufreq.
Related architecture specific functions use weak function definitions. The same approach is followed here.
These callbacks can be used to implement frequency scaling of cpu capacity later.
Cc: Rafael J. Wysocki rjw@rjwysocki.net
Cc: Viresh Kumar viresh.kumar@linaro.org
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
---
 drivers/cpufreq/cpufreq.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 28e59a4..3c6398a 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -280,6 +280,10 @@ static void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
 #endif
 }
 
+void __weak arch_scale_set_curr_freq(int cpu, unsigned long freq) {}
+
+void __weak arch_scale_set_max_freq(int cpu, unsigned long freq) {}
+
 static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
 		struct cpufreq_freqs *freqs, unsigned int state)
 {
@@ -317,6 +321,7 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
 		pr_debug("FREQ: %lu - CPU: %lu\n",
 			 (unsigned long)freqs->new, (unsigned long)freqs->cpu);
 		trace_cpu_frequency(freqs->new, freqs->cpu);
+		arch_scale_set_curr_freq(freqs->cpu, freqs->new);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
@@ -2148,7 +2153,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
 				struct cpufreq_policy *new_policy)
 {
 	struct cpufreq_governor *old_gov;
-	int ret;
+	int ret, cpu;
 
 	pr_debug("setting new policy for CPU %u: %u - %u kHz\n",
 		 new_policy->cpu, new_policy->min, new_policy->max);
@@ -2186,6 +2191,12 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
 	policy->min = new_policy->min;
 	policy->max = new_policy->max;
 
+	for_each_cpu(cpu, policy->cpus) {
+		arch_scale_set_max_freq(cpu, policy->max);
+		/* Workaround for corner cases where notifiers don't fire */
+		arch_scale_set_curr_freq(cpu, policy->cur);
+	}
+
 	pr_debug("new min and max freqs are %u - %u kHz\n",
 		 policy->min, policy->max);
-- 1.9.1
On 4/15/2015 22:29, Michael Turquette wrote:
[...]
Just curious why we need these new callbacks? I don't think they are providing any new information not already given by existing cpufreq policy and transition notifications.
Regards, - Junjie
On Fri, Apr 17, 2015 at 10:34:21PM +0100, Wu, Junjie wrote:
[...]
Just curious why we need these new callbacks? I don't think they are providing any new information not already given by existing cpufreq policy and transition notifications.
Right. They are providing the same information I think. I went with the __weak function callbacks as this method is used for the existing scaling functions used by the scheduler (arch_scale_{cpu,freq}_capacity()). However, that is changing now, so I should give it a try and see if we can use the notifiers instead. There might be some initialization problems though, as the notifier has to be registered before cpufreq initializes the first (default) policy for us to know what the max frequency is.
Morten
On Mon, Apr 20, 2015 at 12:54:15PM +0100, Morten Rasmussen wrote:
On Fri, Apr 17, 2015 at 10:34:21PM +0100, Wu, Junjie wrote:
[...]
Just curious why we need these new callbacks? I don't think they are providing any new information not already given by existing cpufreq policy and transition notifications.
Right. They are providing the same information I think. I went with the __weak function callbacks as this method is used for the existing scaling functions used by the scheduler (arch_scale_{cpu,freq}_capacity()). However, that is changing now, so I should give it a try and see if we can use the notifiers instead. There might be some initialization problems though, as the notifier has to be registered before cpufreq initializes the first (default) policy for us to know what the max frequency is.
I think the patch below gives us what we need using the cpufreq notifiers instead. It is just a single patch, no need to touch cpufreq, so the patch below replaces both patch 1 and 2.

I haven't found any initialization problems on TC2. I'm not 100% sure that policy->cur is set at initialization of all cpufreq drivers. Drivers are not required to have a get-function, which seems required for policy->cur to be set.
I haven't tested with hotplug yet.
Mike: I have pushed the patch to linux-arm.org as well.
Morten
From 0fe329c77782acde0290954779a2ee17920a3dad Mon Sep 17 00:00:00 2001
From: Morten Rasmussen Morten.Rasmussen@arm.com
Date: Mon, 22 Sep 2014 17:24:03 +0100
Subject: [PATCH] arm: Frequency invariant scheduler load-tracking support
Implements arch-specific function to provide the scheduler with a frequency scaling correction factor for more accurate load-tracking. The factor is:
(current_freq(cpu) << SCHED_CAPACITY_SHIFT) / max_freq(cpu)
This implementation only provides frequency invariance. No micro-architecture invariance yet.
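For a concrete sense of the factor (example numbers, not from the patch): a cpu running at 800000 kHz with a 1000000 kHz max would get (800000 << SCHED_CAPACITY_SHIFT) / 1000000 = 819, i.e. about 80% of SCHED_CAPACITY_SCALE (1024).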
Cc: Russell King linux@arm.linux.org.uk
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
---
 arch/arm/include/asm/topology.h |  7 ++++++
 arch/arm/kernel/smp.c           | 53 +++++++++++++++++++++++++++++++++++++++--
 arch/arm/kernel/topology.c      | 17 +++++++++++++
 3 files changed, 75 insertions(+), 2 deletions(-)
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 2fe85ff..4b985dc 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -24,6 +24,13 @@ void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
 
+#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
+struct sched_domain;
+extern
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
+DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+
 #else
 
 static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 86ef244..297ce1b 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -672,12 +672,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
 static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
 static unsigned long global_l_p_j_ref;
 static unsigned long global_l_p_j_ref_freq;
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling through arch_scale_freq_capacity()
+ * (implemented in topology.c).
+ */
+static inline
+void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
+{
+	unsigned long capacity;
+
+	if (!max)
+		return;
+
+	capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
+	atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
+}
 
 static int cpufreq_callback(struct notifier_block *nb,
 					unsigned long val, void *data)
 {
 	struct cpufreq_freqs *freq = data;
 	int cpu = freq->cpu;
+	unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
 
 	if (freq->flags & CPUFREQ_CONST_LOOPS)
 		return NOTIFY_OK;
@@ -702,6 +724,9 @@ static int cpufreq_callback(struct notifier_block *nb,
 					per_cpu(l_p_j_ref_freq, cpu),
 					freq->new);
 	}
+
+	scale_freq_capacity(cpu, freq->new, max);
+
 	return NOTIFY_OK;
 }
 
@@ -709,11 +734,35 @@ static struct notifier_block cpufreq_notifier = {
 	.notifier_call  = cpufreq_callback,
 };
 
+static int cpufreq_policy_callback(struct notifier_block *nb,
+					unsigned long val, void *data)
+{
+	struct cpufreq_policy *policy = data;
+	int i;
+
+	for_each_cpu(i, policy->cpus) {
+		scale_freq_capacity(i, policy->cur, policy->max);
+		atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block cpufreq_policy_notifier = {
+	.notifier_call = cpufreq_policy_callback,
+};
+
 static int __init register_cpufreq_notifier(void)
 {
-	return cpufreq_register_notifier(&cpufreq_notifier,
+	int ret;
+
+	ret = cpufreq_register_notifier(&cpufreq_notifier,
 						CPUFREQ_TRANSITION_NOTIFIER);
+	if (ret)
+		return ret;
+
+	return cpufreq_register_notifier(&cpufreq_policy_notifier,
+						CPUFREQ_POLICY_NOTIFIER);
 }
 core_initcall(register_cpufreq_notifier);
-
 #endif
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 08b7847..9c09e6e 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -169,6 +169,23 @@ static void update_cpu_capacity(unsigned int cpu)
 		cpu, arch_scale_cpu_capacity(NULL, cpu));
 }
 
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling (arch_scale_freq_capacity()). The scaling
+ * factor is updated in smp.c
+ */
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+	unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
+
+	if (!curr)
+		return SCHED_CAPACITY_SCALE;
+
+	return curr;
+}
+
 #else
 static inline void parse_dt_topology(void) {}
 static inline void update_cpu_capacity(unsigned int cpuid) {}
--
1.9.1
On 4/24/2015 7:50, Morten Rasmussen wrote:
[...]
I think the patch below gives us what we need using the cpufreq notifiers instead. It is just a single patch, no need to touch cpufreq, so the patch below replaces both patch 1 and 2.

I haven't found any initialization problems on TC2. I'm not 100% sure that policy->cur is set at initialization of all cpufreq drivers. Drivers are not required to have a get-function, which seems required for policy->cur to be set.
Even if a cpufreq driver's init doesn't fill in policy->cur, cpufreq_init_policy() would call cpufreq_set_policy(), which sets the frequency. You would receive both policy and transition notifications, where the POSTCHANGE notification would have the right frequency information. I think your patch would be fine.
I haven't tested with hotplug yet.
Mike: I have pushed the patch to linux-arm.org as well.
Morten
[...]

+static int cpufreq_policy_callback(struct notifier_block *nb,
+					unsigned long val, void *data)
+{
+	struct cpufreq_policy *policy = data;
+	int i;
+
+	for_each_cpu(i, policy->cpus) {
+		scale_freq_capacity(i, policy->cur, policy->max);
+		atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
+	}
+
+	return NOTIFY_OK;
+}
You need to skip all other events except CPUFREQ_NOTIFY. CPUFREQ_ADJUST and CPUFREQ_INCOMPATIBLE are meant for drivers to change policy->min/max, and it is not uncommon to have policy->max adjusted by kernel thermal drivers. CPUFREQ_NOTIFY is the final result of the new limits.
- Junjie
On Fri, Apr 24, 2015 at 06:32:02PM +0100, Wu, Junjie wrote:
[...]
Even if a cpufreq driver's init doesn't fill in policy->cur, cpufreq_init_policy() would call cpufreq_set_policy(), which sets the frequency. You would receive both policy and transition notifications, where the POSTCHANGE notification would have the right frequency information. I think your patch would be fine.
AFAICT, you are not guaranteed a transition notification during init. I don't get one on TC2 if I use the userspace governor as default. But Viresh has confirmed that somebody will set policy->cur in all cases.
[...]
@@ -709,11 +734,35 @@ static struct notifier_block cpufreq_notifier = {
 	.notifier_call  = cpufreq_callback,
 };
 
+static int cpufreq_policy_callback(struct notifier_block *nb,
+					unsigned long val, void *data)
+{
+	struct cpufreq_policy *policy = data;
+	int i;
+
+	for_each_cpu(i, policy->cpus) {
+		scale_freq_capacity(i, policy->cur, policy->max);
+		atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
+	}
+
+	return NOTIFY_OK;
+}
You need to skip all other events except CPUFREQ_NOTIFY. CPUFREQ_ADJUST and CPUFREQ_INCOMPATIBLE are meant for drivers to change policy->min/max, and it is not uncommon to have policy->max adjusted by kernel thermal drivers. CPUFREQ_NOTIFY is the final result of the new limits.
Good point, thanks. No reason to let all the noise pass through to the scheduler. I have added:
+	if (val != CPUFREQ_NOTIFY)
+		return NOTIFY_OK;
before the loop above.
Mike: I have force-updated the branch to contain this fix.
Thanks, Morten
Quoting Morten Rasmussen (2015-04-27 06:52:10)
On Fri, Apr 24, 2015 at 06:32:02PM +0100, Wu, Junjie wrote:
[...]
You need to skip all other events except CPUFREQ_NOTIFY. CPUFREQ_ADJUST and CPUFREQ_INCOMPATIBLE are meant for drivers to change policy->min/max, and it is not uncommon to have policy->max adjusted by kernel thermal drivers. CPUFREQ_NOTIFY is the final result of the new limits.
Good point, thanks. No reason to let all the noise pass through to the scheduler. I have added:
+	if (val != CPUFREQ_NOTIFY)
+		return NOTIFY_OK;
before the loop above.
Mike: I have force-updated the branch to contain this fix.
Morten,
Thanks. I went ahead and pushed my RFC v2 to eas-dev yesterday. I'll rebase it onto your patch after we complete this comments/review cycle.
Regards, Mike
On 24 April 2015 at 20:20, Morten Rasmussen morten.rasmussen@arm.com wrote:
I haven't found any initialization problems on TC2. I'm not 100% sure that policy->cur is set at initialization of all cpufreq drivers. Drivers
Yes it is.
are not required to have a get-function, which seems required for policy->cur to be set.
Yeah, in that case drivers must initialize it from their ->init() callbacks.
On Mon, Apr 27, 2015 at 06:18:28AM +0100, Viresh Kumar wrote:
[...]
Thanks for clarifying.
Morten
From: Morten Rasmussen Morten.Rasmussen@arm.com
Implements arch-specific function to provide the scheduler with a frequency scaling correction factor for more accurate load-tracking. The factor is:
current_freq(cpu) * SCHED_CAPACITY_SCALE / max_freq(cpu)
This implementation only provides frequency invariance. No micro-architecture invariance yet.
Cc: Russell King linux@arm.linux.org.uk
Signed-off-by: Morten Rasmussen morten.rasmussen@arm.com
---
 arch/arm/include/asm/topology.h |  4 ++++
 arch/arm/kernel/topology.c      | 41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 45 insertions(+)
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 2fe85ff..86acd06 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -24,6 +24,10 @@ void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
 
+#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
+struct sched_domain;
+extern unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
 #else
 
 static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 08b7847..eccc634 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -169,6 +169,47 @@ static void update_cpu_capacity(unsigned int cpu)
 		cpu, arch_scale_cpu_capacity(NULL, cpu));
 }
 
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling.
+ */
+
+static DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+
+/* cpufreq callback function setting current cpu frequency */
+void arch_scale_set_curr_freq(int cpu, unsigned long freq)
+{
+	unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
+	unsigned long curr;
+
+	if (!max)
+		return;
+
+	curr = (freq * SCHED_CAPACITY_SCALE) / max;
+
+	atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), curr);
+}
+
+/* cpufreq callback function setting max cpu frequency */
+void arch_scale_set_max_freq(int cpu, unsigned long freq)
+{
+	atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
+}
+
+/* arch_scale_freq_capacity() implementation called from scheduler */
+unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+	unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
+
+	if (!curr)
+		return SCHED_CAPACITY_SCALE;
+
+	return curr;
+}
+
 #else
 static inline void parse_dt_topology(void) {}
 static inline void update_cpu_capacity(unsigned int cpuid) {}
--
1.9.1
On 16 April 2015 at 07:29, Michael Turquette mturquette@linaro.org wrote:
[...]
+/* cpufreq callback function setting max cpu frequency */
+void arch_scale_set_max_freq(int cpu, unsigned long freq)
+{
+	atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
You should update per_cpu(cpu_freq_capacity, cpu) too
+}
On Wed, Apr 22, 2015 at 05:20:07PM +0100, Vincent Guittot wrote:
On 16 April 2015 at 07:29, Michael Turquette mturquette@linaro.org wrote:
[...]
+/* cpufreq callback function setting max cpu frequency */
+void arch_scale_set_max_freq(int cpu, unsigned long freq)
+{
+	atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
You should update per_cpu(cpu_freq_capacity, cpu) too
Right. I made the assumption that we wouldn't change max very often. But I guess we could do something like:
	unsigned long curr, old_max;

	if (!freq)
		return;

	old_max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
	curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
	curr = (curr * old_max) / freq;

	atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
	atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), curr);
We would get some rounding errors though if we change max. For example:
	curr	max	capacity
	200	1000	204
	200	800	255	(as proposed for arch_scale_set_max_freq())
	200	800	256	(if computed by arch_scale_set_curr_freq())
This can only be avoided by having another per_cpu variable storing the curr freq too. It shouldn't be a big deal. I can fix that.
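(To make the table concrete: the stored capacity starts as (200 * 1024) / 1000 = 204; when max drops to 800, rescaling the stored value gives (204 * 1000) / 800 = 255, while recomputing from the raw current frequency gives (200 * 1024) / 800 = 256.)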
Thanks, Morten
On 23 April 2015 at 16:37, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Wed, Apr 22, 2015 at 05:20:07PM +0100, Vincent Guittot wrote:
[...]
You should update per_cpu(cpu_freq_capacity, cpu) too
Right. I made the assumption that we wouldn't change max very often. But I guess we could do something like:
	unsigned long curr, old_max;

	if (!freq)
		return;

	old_max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
	curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));
	curr = (curr * old_max) / freq;

	atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
	atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), curr);
We would get some rounding errors though if we change max. For example:
	curr	max	capacity
	200	1000	204
	200	800	255	(as proposed for arch_scale_set_max_freq())
	200	800	256	(if computed by arch_scale_set_curr_freq())
This can only be avoided by having another per_cpu variable storing the curr freq too. It shouldn't be a big deal. I can fix that.
I don't have a strong opinion about which proposal is best, but the use of another per-cpu variable makes the code much simpler and more readable IMHO.
Vincent
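A minimal sketch of that alternative (editorial illustration: the extra per-cpu variable cpu_curr_freq and the helper update_freq_capacity() are assumed names, not taken from the posted patches; cpu_max_freq and cpu_freq_capacity are the patch's existing variables):

static DEFINE_PER_CPU(atomic_long_t, cpu_curr_freq);

/* Recompute the capacity factor from the stored curr and max frequencies */
static void update_freq_capacity(int cpu)
{
	unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
	unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));

	if (!max)
		return;

	atomic_long_set(&per_cpu(cpu_freq_capacity, cpu),
			(curr * SCHED_CAPACITY_SCALE) / max);
}

void arch_scale_set_curr_freq(int cpu, unsigned long freq)
{
	atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
	update_freq_capacity(cpu);
}

void arch_scale_set_max_freq(int cpu, unsigned long freq)
{
	atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
	update_freq_capacity(cpu);
}

This avoids the rounding drift entirely, because the capacity is always derived from the raw frequencies rather than rescaled from a previously rounded capacity value.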
This patch introduces the SCHED_ENERGY_FREQ sched feature, which is implemented using jump labels when SCHED_DEBUG is defined. It is statically set to false when SCHED_DEBUG is not defined and thus disabled by default.
Signed-off-by: Michael Turquette mturquette@linaro.org
---
 kernel/sched/fair.c     | 5 +++++
 kernel/sched/features.h | 6 ++++++
 2 files changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46855d0..75aec8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4207,6 +4207,11 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+static inline bool sched_energy_freq(void)
+{
+	return sched_feat(SCHED_ENERGY_FREQ);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 91e33cd..77381cf 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
  */
 SCHED_FEAT(NUMA_RESIST_LOWER, false)
 #endif
+
+/*
+ * Scheduler-driven CPU frequency selection aimed to save energy based on
+ * load tracking
+ */
+SCHED_FEAT(SCHED_ENERGY_FREQ, false)
--
1.9.1
Hi Mike,
On 16/04/15 06:29, Michael Turquette wrote:
[...]
Do we really need this? I understand that you don't want to add overhead in the enqueue/dequeue/etc. paths, but to me it looks a bit redundant, as to enable the governor we have to both set it in scaling_governor and enable this sched feature.
Thanks,
- Juri
Quoting Juri Lelli (2015-04-16 09:53:28)
[...]
Do we really need this? I understand that you don't want to add overhead in enqueue/dequeue/etc paths, but to me it looks a bit redundant as to enable the governor we have to both set it in scaling_governor and enable this sched feature.
I do not think it is redundant. If we remove this but the governor is not active then there is still overhead: cap_gov_update_cpu will call cpufreq_cpu_get(), fetch the policy pointer, and only after that will we realize that there is no policy->gov_data, at which point we bail.
I'd prefer to avoid running through this code every time we enter {en,de}queue_task_fair and task_tick_fair.
An alternative might be to keep per-cpu pointers to gov_data. Then we only need to check for !per_cpu(cap_gov_data, cpu) and bail quickly.
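Roughly something like this (untested sketch, reusing the struct gov_data fields from this series):

static DEFINE_PER_CPU(struct gov_data *, cap_gov_data);

/*
 * hot path helper: a single per-cpu load tells us whether the governor
 * is active on this cpu, so the inactive case costs almost nothing
 */
void cap_gov_update_cpu(int cpu)
{
	struct gov_data *gd = per_cpu(cap_gov_data, cpu);

	if (!gd)
		return;

	/* bail early if we are throttled */
	if (ktime_before(ktime_get(), gd->throttle))
		return;

	atomic_set(&gd->need_wake_task, 1);
}

The per-cpu pointers would be populated in cap_gov_start and cleared in cap_gov_stop.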
Regards, Mike
Hi,
On 16/04/15 19:41, Michael Turquette wrote:
Quoting Juri Lelli (2015-04-16 09:53:28)
Hi Mike,
On 16/04/15 06:29, Michael Turquette wrote:
This patch introduces the SCHED_ENERGY_FREQ sched feature, which is implemented using jump labels when SCHED_DEBUG is defined. It is statically set to false when SCHED_DEBUG is not defined and thus disabled by default.
Signed-off-by: Michael Turquette mturquette@linaro.org
---
 kernel/sched/fair.c     | 5 +++++
 kernel/sched/features.h | 6 ++++++
 2 files changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 46855d0..75aec8d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4207,6 +4207,11 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif

+static inline bool sched_energy_freq(void)
+{
+	return sched_feat(SCHED_ENERGY_FREQ);
+}
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 91e33cd..77381cf 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -96,3 +96,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
  */
 SCHED_FEAT(NUMA_RESIST_LOWER, false)
 #endif
+
+/*
+ * Scheduler-driven CPU frequency selection aimed to save energy based on
+ * load tracking
+ */
+SCHED_FEAT(SCHED_ENERGY_FREQ, false)
Do we really need this? I understand that you don't want to add overhead in enqueue/dequeue/etc paths, but to me it looks a bit redundant as to enable the governor we have to both set it in scaling_governor and enable this sched feature.
I do not think it is redundant. If we remove this but the governor is not active then there is still overhead. cap_gov_update_cpu will call cpufreq_cpu_get(cpu), fetch the policy pointer and only after that will we realize that there is no policy->gov_data, at which point we bail.
I'd prefer to avoid running through this code every time we enter {en,de}queue_task_fair and task_tick_fair.
Completely agree. That's exactly what I meant with "I understand that you don't want...", thanks for clarifying :).
An alternative might be to keep per-cpu pointers to gov_data. Then we only need to check for !per_cpu(cap_gov_data, cpu) and bail quickly.
Yeah, something like this might work better IMHO, so that we have a single switch for the whole thing.
Thanks,
- Juri
Add private data for the cpufreq governor's use on a per-policy basis. This reduces the need for per-cpu variables to track private data, as is done in some legacy governors, and is analogous to the per-policy driver_data already provided for the cpufreq driver's internal use.
Cc: Viresh Kumar viresh.kumar@linaro.org
Cc: Rafael J. Wysocki rjw@rjwysocki.net
Signed-off-by: Michael Turquette mturquette@linaro.org
---
 include/linux/cpufreq.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 2ee4888..7cdf63a 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -116,6 +116,9 @@ struct cpufreq_policy {

 	/* For cpufreq driver's internal use */
 	void *driver_data;
+
+	/* For cpufreq governor's internal use */
+	void *gov_data;
 };

 /* Only for ACPI */
--
1.9.1
get_cpu_usage is useful to a cpu frequency scaling policy which is based on CFS load tracking and cpu capacity metrics. Expose this call in sched.h so that it can be used in such a policy.
Signed-off-by: Michael Turquette mturquette@linaro.org
---
 kernel/sched/fair.c  | 2 +-
 kernel/sched/sched.h | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 75aec8d..b066a61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4801,7 +4801,7 @@ done:
  * Without capping the usage, a group could be seen as overloaded (CPU0 usage
  * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
  */
-static int get_cpu_usage(int cpu)
+int get_cpu_usage(int cpu)
 {
 	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
 	unsigned long capacity = capacity_orig_of(cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 91c6736..0fe57ba 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1396,6 +1396,8 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 }
 #endif

+int get_cpu_usage(int cpu);
+
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
--
1.9.1
Scheduler-driven cpu frequency selection is desirable as part of the on-going effort to make the scheduler better aware of energy consumption. No piece of the Linux kernel has a better view of the factors that affect a cpu frequency selection policy than the scheduler[0], and this patch is an attempt to get that discussion going again.
This patch implements a cpufreq governor, cap_gov, that directly accesses scheduler statistics, in particular the pelt data from cfs via the get_cpu_usage() function.
Put plainly, cap_gov selects the lowest cpu frequency that will prevent a runqueue from being over-utilized (until we hit the highest frequency of course).
cap_gov converts available cpu frequencies into capacity states. When the utilization of a cfs runqueue changes then the policy selects the capacity state which is the floor of the new usage.
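For example (made-up numbers): a policy with frequencies {400, 600, 800, 1000} MHz and policy->max = 1000 MHz maps to capacities of roughly {410, 614, 819, 1024} (freq * SCHED_CAPACITY_SCALE / max). If runqueue usage rises to 700 then we pick the 800 MHz state, since 819 is the smallest capacity that still covers a usage of 700.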
Unlike the previous posting from 2014[1] this governor implements no policy of its own (e.g. with tunable thresholds for determining when to scale frequency), but instead implements a "follow the usage" method, where usage is defined as the cpu frequency-invariant product of utilization_load_avg and cpu_capacity_orig.
This governor is event-driven. There is no polling loop to check cpu idle time, or any other method which is unsynchronized with the scheduler. The entry points for this policy are in fair.c: enqueue_task_fair, dequeue_task_fair and task_tick_fair. run_rebalance_domains is used to kick the worker thread to prevent fatally re-entering the scheduler.
This policy is implemented using the cpufreq governor interface for two main reasons:
1) re-using the cpufreq machine drivers without using the governor interface is hard.
2) using the cpufreq interface allows us to switch between the scheduler-driven policy and legacy cpufreq governors such as ondemand at run-time. This is very useful for comparative testing and tuning.
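For example, switching between this governor and ondemand at run-time is just a matter of:

  # echo cap_gov > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  # echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor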
Finally, it is worth mentioning that this approach neglects all scheduling classes except for cfs. It is possible to add support for deadline and other classes here, but I also wonder if a multi-governor approach would be a more maintainable solution, where the cpufreq core aggregates the constraints set by multiple governors. Supporting such an approach in the cpufreq core would also allow for peripheral devices to place constraints on cpu frequency without having to hack such behavior in at the governor level.
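As a strawman, and purely illustrative (nothing like this is implemented in this series), such aggregation could be as simple as the cpufreq core taking the maximum of the frequency floors requested by each governor or peripheral constraint:

static unsigned int cpufreq_aggregate_floors(unsigned int *floors, int n)
{
	unsigned int freq = 0;
	int i;

	/* every requester gets at least its floor; the max satisfies all */
	for (i = 0; i < n; i++)
		freq = max(freq, floors[i]);

	return freq;
}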
Thanks to Juri Lelli juri.lelli@arm.com for doing a good bit of testing, bug fixing and contributing towards the design.
[0] http://article.gmane.org/gmane.linux.kernel/1499836 [1] https://lkml.org/lkml/2014/10/22/22
Signed-off-by: Michael Turquette mturquette@linaro.org
---
 drivers/cpufreq/Kconfig |  22 +++
 include/linux/cpufreq.h |   3 +
 kernel/sched/Makefile   |   1 +
 kernel/sched/cap_gov.c  | 361 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c     |  19 +++
 kernel/sched/sched.h    |   8 ++
 6 files changed, 414 insertions(+)
 create mode 100644 kernel/sched/cap_gov.c

diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index a171fef..654d70a 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	  Be aware that not all cpufreq drivers support the conservative
 	  governor. If unsure have a look at the help section of the driver.
 	  Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_CAP_GOV
+	bool "cap_gov"
+	select CPU_FREQ_GOV_CAP_GOV
+	select CPU_FREQ_GOV_PERFORMANCE
+	help
+	  Use the CPUfreq governor 'cap_gov' as default. This scales cpu
+	  frequency from the scheduler as per-entity load tracking
+	  statistics are updated.
 endchoice

 config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +192,19 @@ config CPU_FREQ_GOV_CONSERVATIVE

 	  If in doubt, say N.

+config CPU_FREQ_GOV_CAP_GOV
+	tristate "'capacity governor' cpufreq governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	help
+	  'cap_gov' - this governor scales cpu frequency from the
+	  scheduler as a function of cpu capacity utilization. It does
+	  not evaluate utilization on a periodic basis (unlike ondemand)
+	  but instead is invoked from CFS when updating per-entity load
+	  tracking statistics.
+
+	  If in doubt, say N.
+
 comment "CPU frequency scaling drivers"

 config CPUFREQ_DT
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 7cdf63a..4fc066f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
+extern struct cpufreq_governor cpufreq_gov_cap_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_cap_gov)
 #endif

 /*********************************************************************
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..da601d5 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o
diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
new file mode 100644
index 0000000..72873ab
--- /dev/null
+++ b/kernel/sched/cap_gov.c
@@ -0,0 +1,361 @@
+/*
+ * Copyright (C) 2014 Michael Turquette mturquette@linaro.org
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+
+#include "sched.h"
+
+#define UP_THRESHOLD		95
+#define THROTTLE_NSEC		50000000 /* 50ms default */
+
+/*
+ * per-cpu pointer to atomic_t gov_data->cap_gov_wake_task
+ * used in scheduler hot paths {en,de}queueu, task_tick without having to
+ * access struct cpufreq_policy and struct gov_data
+ */
+static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
+
+/**
+ * gov_data - per-policy data internal to the governor
+ * @throttle: time until throttling period expires. Derived from THROTTLE_NSEC
+ * @task: worker task for dvfs transition that may block/sleep
+ * @need_wake_task: flag the governor to wake this policy's worker thread
+ *
+ * struct gov_data is the per-policy cap_gov-specific data structure. A
+ * per-policy instance of it is created when the cap_gov governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct gov_data {
+	ktime_t throttle;
+	unsigned int throttle_nsec;
+	struct task_struct *task;
+	atomic_t need_wake_task;
+};
+
+/**
+ * cap_gov_select_freq - pick the next frequency for a cpu
+ * @cpu: the cpu whose frequency may be changed
+ *
+ * cap_gov_select_freq works in a way similar to the ondemand governor. First
+ * we inspect the utilization of all of the cpus in this policy to find the
+ * most utilized cpu. This is achieved by calling get_cpu_usage, which returns
+ * frequency-invarant capacity utilization.
+ *
+ * This max utilization is compared against the up_threshold (default 95%
+ * utilization). If the max cpu utilization is greater than this threshold then
+ * we scale the policy up to the max frequency. Othewise we find the lowest
+ * frequency (smallest cpu capacity) that is still larger than the max capacity
+ * utilization for this policy.
+ *
+ * Returns frequency selected.
+ */
+static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy)
+{
+	int cpu = 0;
+	struct gov_data *gd;
+	int index;
+	unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
+	struct cpufreq_frequency_table *pos;
+
+	if (!policy->gov_data)
+		goto out;
+
+	gd = policy->gov_data;
+
+	/*
+	 * get_cpu_usage is called without locking the runqueues. This is the
+	 * same behavior used by find_busiest_cpu in load_balance. We are
+	 * willing to accept occasionally stale data here in exchange for
+	 * lockless behavior.
+	 */
+	for_each_cpu(cpu, policy->cpus) {
+		usage = get_cpu_usage(cpu);
+		trace_printk("cpu = %d usage = %lu", cpu, usage);
+		if (usage > max_usage)
+			max_usage = usage;
+	}
+	trace_printk("max_usage = %lu", max_usage);
+
+	/* find the utilization threshold at which we scale up frequency */
+	index = cpufreq_frequency_table_get_index(policy, policy->cur);
+
+	/*
+	 * converge towards max_usage. We want the lowest frequency whose
+	 * capacity is >= to max_usage. In other words:
+	 *
+	 * find capacity == floor(usage)
+	 *
+	 * Sadly cpufreq freq tables are not guaranteed to be ordered by
+	 * frequency...
+	 */
+	freq = policy->max;
+	cpufreq_for_each_entry(pos, policy->freq_table) {
+		cap = pos->frequency * SCHED_CAPACITY_SCALE /
+			policy->max;
+		if (max_usage < cap && pos->frequency < freq)
+			freq = pos->frequency;
+		trace_printk("cpu = %u max_usage = %lu cap = %lu \
+				table_freq = %u freq = %lu",
+				cpumask_first(policy->cpus), max_usage, cap,
+				pos->frequency, freq);
+	}
+
+out:
+	trace_printk("cpu %d final freq %lu", cpu, freq);
+	return freq;
+}
+
+/*
+ * we pass in struct cpufreq_policy. This is safe because changing out the
+ * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
+ * which tears down all of the data structures and __cpufreq_governor(policy,
+ * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
+ * new policy pointer
+ */
+static int cap_gov_thread(void *data)
+{
+	struct sched_param param;
+	struct cpufreq_policy *policy;
+	struct gov_data *gd;
+	unsigned long freq;
+	int ret;
+
+	policy = (struct cpufreq_policy *) data;
+	if (!policy) {
+		pr_warn("%s: missing policy\n", __func__);
+		do_exit(-EINVAL);
+	}
+
+	gd = policy->gov_data;
+	if (!gd) {
+		pr_warn("%s: missing governor data\n", __func__);
+		do_exit(-EINVAL);
+	}
+
+	param.sched_priority = 0;
+	sched_setscheduler(current, SCHED_FIFO, &param);
+	set_cpus_allowed_ptr(current, policy->related_cpus);
+
+	/* main loop of the per-policy kthread */
+	do {
+		down_write(&policy->rwsem);
+		if (!atomic_read(&gd->need_wake_task)) {
+			if (kthread_should_stop())
+				break;
+			trace_printk("NOT waking up kthread (%d)", gd->task->pid);
+			up_write(&policy->rwsem);
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule();
+			continue;
+		}
+
+		trace_printk("kthread %d requested freq switch", gd->task->pid);
+
+		freq = cap_gov_select_freq(policy);
+
+		ret = __cpufreq_driver_target(policy, freq,
+				CPUFREQ_RELATION_H);
+		if (ret)
+			pr_debug("%s: __cpufreq_driver_target returned %d\n",
+					__func__, ret);
+
+		trace_printk("kthread %d requested freq switch", gd->task->pid);
+
+		gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
+		atomic_set(&gd->need_wake_task, 0);
+		up_write(&policy->rwsem);
+	} while (!kthread_should_stop());
+
+	do_exit(0);
+}
+
+static void cap_gov_wake_up_process(struct task_struct *task)
+{
+	/* this is null during early boot */
+	if (IS_ERR_OR_NULL(task)) {
+		return;
+	}
+
+	wake_up_process(task);
+}
+
+void cap_gov_kick_thread(int cpu)
+{
+	struct cpufreq_policy *policy;
+	struct gov_data *gd = NULL;
+
+	policy = cpufreq_cpu_get(cpu);
+	if (IS_ERR_OR_NULL(policy))
+		return;
+
+	gd = policy->gov_data;
+	if (!gd)
+		goto out;
+
+	/* per-cpu access not needed here since we have gd */
+	if (atomic_read(&gd->need_wake_task)) {
+		trace_printk("waking up kthread (%d)", gd->task->pid);
+		cap_gov_wake_up_process(gd->task);
+	}
+
+out:
+	cpufreq_cpu_put(policy);
+}
+
+/**
+ * cap_gov_update_cpu - interface to scheduler for changing capacity values
+ * @cpu: cpu whose capacity utilization has recently changed
+ *
+ * cap_gov_udpate_cpu is an interface exposed to the scheduler so that the
+ * scheduler may inform the governor of updates to capacity utilization and
+ * make changes to cpu frequency. Currently this interface is designed around
+ * PELT values in CFS. It can be expanded to other scheduling classes in the
+ * future if needed.
+ *
+ * The semantics of this call vary based on the cpu frequency scaling
+ * characteristics of the hardware.
+ *
+ * If kicking off a dvfs transition is an operation that might block or sleep
+ * in the cpufreq driver then we set the need_wake_task flag in this function
+ * and return. Selecting a frequency and programming it is done in a dedicated
+ * kernel thread which will be woken up from rebalance_domains. See
+ * cap_gov_kick_thread above.
+ *
+ * If kicking off a dvfs transition is an operation that returns quickly in the
+ * cpufreq driver and will never sleep then we select the frequency in this
+ * function and program the hardware for it in the scheduler hot path. No
+ * dedicated kthread is needed.
+ */
+void cap_gov_update_cpu(int cpu)
+{
+	struct cpufreq_policy *policy;
+	struct gov_data *gd;
+
+	/* XXX put policy pointer in per-cpu data? */
+	policy = cpufreq_cpu_get(cpu);
+	if (IS_ERR_OR_NULL(policy)) {
+		return;
+	}
+
+	if (!policy->gov_data) {
+		trace_printk("missing governor data");
+		goto out;
+	}
+
+	gd = policy->gov_data;
+
+	/* bail early if we are throttled */
+	if (ktime_before(ktime_get(), gd->throttle)) {
+		trace_printk("THROTTLED");
+		goto out;
+	}
+
+	atomic_set(per_cpu(cap_gov_wake_task, cpu), 1);
+
+out:
+	cpufreq_cpu_put(policy);
+	return;
+}
+
+static void cap_gov_start(struct cpufreq_policy *policy)
+{
+	int cpu;
+	struct gov_data *gd;
+
+	/* prepare per-policy private data */
+	gd = kzalloc(sizeof(*gd), GFP_KERNEL);
+	if (!gd) {
+		pr_debug("%s: failed to allocate private data\n", __func__);
+		return;
+	}
+
+	/*
+	 * Don't ask for freq changes at an higher rate than what
+	 * the driver advertises as transition latency.
+	 */
+	gd->throttle_nsec = policy->cpuinfo.transition_latency ?
+			policy->cpuinfo.transition_latency :
+			THROTTLE_NSEC;
+	pr_debug("%s: throttle threshold = %u [ns]\n",
+			__func__, gd->throttle_nsec);
+
+	/* save per-cpu pointer to per-policy need_wake_task */
+	for_each_cpu(cpu, policy->related_cpus)
+		per_cpu(cap_gov_wake_task, cpu) = &gd->need_wake_task;
+
+	/* init per-policy kthread */
+	gd->task = kthread_create(cap_gov_thread, policy, "kcap_gov_task");
+	if (IS_ERR_OR_NULL(gd->task))
+		pr_err("%s: failed to create kcap_gov_task thread\n", __func__);
+
+	policy->gov_data = gd;
+}
+
+static void cap_gov_stop(struct cpufreq_policy *policy)
+{
+	struct gov_data *gd;
+
+	gd = policy->gov_data;
+	policy->gov_data = NULL;
+
+	kthread_stop(gd->task);
+
+	/* FIXME replace with devm counterparts? */
+	kfree(gd);
+}
+
+static int cap_gov_setup(struct cpufreq_policy *policy, unsigned int event)
+{
+	switch (event) {
+	case CPUFREQ_GOV_START:
+		/* Start managing the frequency */
+		cap_gov_start(policy);
+		return 0;
+
+	case CPUFREQ_GOV_STOP:
+		cap_gov_stop(policy);
+		return 0;
+
+	case CPUFREQ_GOV_LIMITS:	/* unused */
+	case CPUFREQ_GOV_POLICY_INIT:	/* unused */
+	case CPUFREQ_GOV_POLICY_EXIT:	/* unused */
+		break;
+	}
+	return 0;
+}
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV
+static
+#endif
+struct cpufreq_governor cpufreq_gov_cap_gov = {
+	.name			= "cap_gov",
+	.governor		= cap_gov_setup,
+	.owner			= THIS_MODULE,
+};
+
+static int __init cap_gov_init(void)
+{
+	return cpufreq_register_governor(&cpufreq_gov_cap_gov);
+}
+
+static void __exit cap_gov_exit(void)
+{
+	cpufreq_unregister_governor(&cpufreq_gov_cap_gov);
+}
+
+/* Try to make this the default governor */
+fs_initcall(cap_gov_init);
+
+MODULE_LICENSE("GPL");
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b066a61..2ec2dc7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4257,6 +4257,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_rq_runnable_avg(rq, rq->nr_running);
 		add_nr_running(rq, 1);
 	}
+
+	if(sched_energy_freq())
+		cap_gov_update_cpu(cpu_of(rq));
+
 	hrtick_update(rq);
 }

@@ -4318,6 +4322,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		sub_nr_running(rq, 1);
 		update_rq_runnable_avg(rq, 1);
 	}
+
+	if(sched_energy_freq())
+		cap_gov_update_cpu(cpu_of(rq));
+
 	hrtick_update(rq);
 }

@@ -7768,6 +7776,14 @@ static void run_rebalance_domains(struct softirq_action *h)
 	 */
 	nohz_idle_balance(this_rq, idle);
 	rebalance_domains(this_rq, idle);
+
+	/*
+	 * FIXME some hardware does not require this, but current CPUfreq
+	 * locking prevents us from changing cpu frequency with rq locks held
+	 * and interrupts disabled
+	 */
+	if (sched_energy_freq())
+		cap_gov_kick_thread(cpu_of(this_rq));
 }

 /*
@@ -7821,6 +7837,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		task_tick_numa(rq, curr);

 	update_rq_runnable_avg(rq, 1);
+
+	if(sched_energy_freq())
+		cap_gov_update_cpu(cpu_of(rq));
 }

 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fe57ba..c45f1ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1398,6 +1398,14 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)

 int get_cpu_usage(int cpu);

+#ifdef CONFIG_CPU_FREQ_GOV_CAP_GOV
+void cap_gov_update_cpu(int cpu);
+void cap_gov_kick_thread(int cpu);
+#else
+static inline void cap_gov_update_cpu(int cpu) {}
+static inline void cap_gov_kick_thread(int cpu) {}
+#endif
+
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
--
1.9.1
On Thu, Apr 16, 2015 at 10:59 AM, Michael Turquette mturquette@linaro.org wrote:
Scheduler-driven cpu frequency selection is desirable as part of the on-going effort to make the scheduler better aware of energy consumption. No piece of the Linux kernel has a better view of the factors that affect a cpu frequency selection policy than the scheduler[0], and this patch is an attempt to get that discussion going again.
This patch implements a cpufreq governor, cap_gov, that directly accesses scheduler statistics, in particular the pelt data from cfs via the get_cpu_usage() function.
Put plainly, cap_gov selects the lowest cpu frequency that will prevent a runqueue from being over-utilized (until we hit the highest frequency of course).
cap_gov converts available cpu frequencies into capacity states. When the utilization of a cfs runqueue changes then the policy selects the capacity state which is the floor of the new usage.
Unlike the previous posting from 2014[1] this governor implements no policy of its own (e.g. with tunable thresholds for determining when to scale frequency), but instead implements a "follow the usage" method, where usage is defined as the cpu frequency-invariant product of utilization_load_avg and cpu_capacity_orig.
This governor is event-driven. There is no polling loop to check cpu idle time, or any other method which is unsynchronized with the scheduler. The entry points for this policy are in fair.c: enqueue_task_fair, dequeue_task_fair and task_tick_fair. run_rebalance_domains is used to kick the worker thread to prevent fatally re-entering the scheduler.
This policy is implemented using the cpufreq governor interface for two main reasons:
1) re-using the cpufreq machine drivers without using the governor interface is hard.

2) using the cpufreq interface allows us to switch between the scheduler-driven policy and legacy cpufreq governors such as ondemand at run-time. This is very useful for comparative testing and tuning.
Finally, it is worth mentioning that this approach neglects all scheduling classes except for cfs. It is possible to add support for deadline and other classes here, but I also wonder if a multi-governor approach would be a more maintainable solution, where the cpufreq core aggregates the constraints set by multiple governors. Supporting such an approach in the cpufreq core would also allow for peripheral devices to place constraints on cpu frequency without having to hack such behavior in at the governor level.
Thanks to Juri Lelli juri.lelli@arm.com for doing a good bit of testing, bug fixing and contributing towards the design.
[0] http://article.gmane.org/gmane.linux.kernel/1499836 [1] https://lkml.org/lkml/2014/10/22/22
Signed-off-by: Michael Turquette mturquette@linaro.org
 drivers/cpufreq/Kconfig |  22 +++
 include/linux/cpufreq.h |   3 +
 kernel/sched/Makefile   |   1 +
 kernel/sched/cap_gov.c  | 361 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c     |  19 +++
 kernel/sched/sched.h    |   8 ++
 6 files changed, 414 insertions(+)
 create mode 100644 kernel/sched/cap_gov.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index a171fef..654d70a 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE Be aware that not all cpufreq drivers support the conservative governor. If unsure have a look at the help section of the driver. Fallback governor will be the performance governor.
+config CPU_FREQ_DEFAULT_GOV_CAP_GOV
bool "cap_gov"
select CPU_FREQ_GOV_CAP_GOV
select CPU_FREQ_GOV_PERFORMANCE
help
Use the CPUfreq governor 'cap_gov' as default. This scales cpu
frequency from the scheduler as per-entity load tracking
statistics are updated.
endchoice
config CPU_FREQ_GOV_PERFORMANCE @@ -183,6 +192,19 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_CAP_GOV
Two GOVs are redundant here and make it hard to read. A few name suggestions for your baby:
CPU_FREQ_GOV_SCHED_CAP CPU_FREQ_GOV_SCHED_STATS
tristate "'capacity governor' cpufreq governor"
depends on CPU_FREQ
select CPU_FREQ_GOV_COMMON
help
'cap_gov' - this governor scales cpu frequency from the
same as above
scheduler as a function of cpu capacity utilization. It does
not evaluate utilization on a periodic basis (unlike ondemand)
but instead is invoked from CFS when updating per-entity load
tracking statistics.
perhaps add something to the effect that it is more responsive than existing governors to really sell it? :)
If in doubt, say N.
comment "CPU frequency scaling drivers"
config CPUFREQ_DT diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 7cdf63a..4fc066f 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand; #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE) extern struct cpufreq_governor cpufreq_gov_conservative; #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative) +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV) +extern struct cpufreq_governor cpufreq_gov_cap_gov; +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov) #endif
/********************************************************************* diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 46be870..da601d5 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c new file mode 100644 index 0000000..72873ab --- /dev/null +++ b/kernel/sched/cap_gov.c @@ -0,0 +1,361 @@ +/*
 * Copyright (C) 2014 Michael Turquette mturquette@linaro.org
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 */
+#include <linux/cpufreq.h> +#include <linux/module.h> +#include <linux/kthread.h> +#include <linux/percpu.h>
+#include "sched.h"
+#define UP_THRESHOLD 95
A comment that this probably belongs as a sysfs tunable
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+/*
 * per-cpu pointer to atomic_t gov_data->cap_gov_wake_task
s/cap_gov_wake_task/need_wake_task/
 * used in scheduler hot paths {en,de}queueu, task_tick without having to
 * access struct cpufreq_policy and struct gov_data
 */
+static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
+/**
 * gov_data - per-policy data internal to the governor
 * @throttle: time until throttling period expires. Derived from THROTTLE_NSEC
@throttle_nsec ?
 * @task: worker task for dvfs transition that may block/sleep
 * @need_wake_task: flag the governor to wake this policy's worker thread
 *
 * struct gov_data is the per-policy cap_gov-specific data structure. A
 * per-policy instance of it is created when the cap_gov governor receives
 * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
 * member of struct cpufreq_policy.
 *
 * Readers of this data must call down_read(policy->rwsem). Writers must
 * call down_write(policy->rwsem).
 */
+struct gov_data {
ktime_t throttle;
unsigned int throttle_nsec;
struct task_struct *task;
atomic_t need_wake_task;
+};
+/**
 * cap_gov_select_freq - pick the next frequency for a cpu
 * @cpu: the cpu whose frequency may be changed
 *
 * cap_gov_select_freq works in a way similar to the ondemand governor. First
 * we inspect the utilization of all of the cpus in this policy to find the
 * most utilized cpu. This is achieved by calling get_cpu_usage, which returns
 * frequency-invarant capacity utilization.
 *
 * This max utilization is compared against the up_threshold (default 95%
 * utilization). If the max cpu utilization is greater than this threshold then
 * we scale the policy up to the max frequency. Othewise we find the lowest
 * frequency (smallest cpu capacity) that is still larger than the max capacity
 * utilization for this policy.
 *
 * Returns frequency selected.
 */
+static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy) +{
int cpu = 0;
struct gov_data *gd;
int index;
unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
struct cpufreq_frequency_table *pos;
if (!policy->gov_data)
goto out;
gd = policy->gov_data;
/*
* get_cpu_usage is called without locking the runqueues. This is the
* same behavior used by find_busiest_cpu in load_balance. We are
* willing to accept occasionally stale data here in exchange for
* lockless behavior.
*/
for_each_cpu(cpu, policy->cpus) {
usage = get_cpu_usage(cpu);
trace_printk("cpu = %d usage = %lu", cpu, usage);
if (usage > max_usage)
max_usage = usage;
}
trace_printk("max_usage = %lu", max_usage);
/* find the utilization threshold at which we scale up frequency */
index = cpufreq_frequency_table_get_index(policy, policy->cur);
/*
* converge towards max_usage. We want the lowest frequency whose
* capacity is >= to max_usage. In other words:
*
* find capacity == floor(usage)
*
* Sadly cpufreq freq tables are not guaranteed to be ordered by
* frequency...
*/
freq = policy->max;
cpufreq_for_each_entry(pos, policy->freq_table) {
cap = pos->frequency * SCHED_CAPACITY_SCALE /
policy->max;
if (max_usage < cap && pos->frequency < freq)
freq = pos->frequency;
trace_printk("cpu = %u max_usage = %lu cap = %lu \
table_freq = %u freq = %lu",
cpumask_first(policy->cpus), max_usage, cap,
pos->frequency, freq);
}
+out:
trace_printk("cpu %d final freq %lu", cpu, freq);
return freq;
+}
+/*
 * we pass in struct cpufreq_policy. This is safe because changing out the
 * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
 * which tears down all of the data structures and __cpufreq_governor(policy,
 * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
 * new policy pointer
 */
+static int cap_gov_thread(void *data) +{
struct sched_param param;
struct cpufreq_policy *policy;
struct gov_data *gd;
unsigned long freq;
int ret;
policy = (struct cpufreq_policy *) data;
if (!policy) {
pr_warn("%s: missing policy\n", __func__);
do_exit(-EINVAL);
}
gd = policy->gov_data;
if (!gd) {
pr_warn("%s: missing governor data\n", __func__);
do_exit(-EINVAL);
}
param.sched_priority = 0;
sched_setscheduler(current, SCHED_FIFO, &param);
set_cpus_allowed_ptr(current, policy->related_cpus);
/* main loop of the per-policy kthread */
do {
down_write(&policy->rwsem);
if (!atomic_read(&gd->need_wake_task)) {
if (kthread_should_stop())
break;
trace_printk("NOT waking up kthread (%d)", gd->task->pid);
up_write(&policy->rwsem);
set_current_state(TASK_INTERRUPTIBLE);
schedule();
continue;
}
trace_printk("kthread %d requested freq switch", gd->task->pid);
freq = cap_gov_select_freq(policy);
ret = __cpufreq_driver_target(policy, freq,
CPUFREQ_RELATION_H);
if (ret)
pr_debug("%s: __cpufreq_driver_target returned %d\n",
__func__, ret);
trace_printk("kthread %d requested freq switch", gd->task->pid);
gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
atomic_set(&gd->need_wake_task, 0);
up_write(&policy->rwsem);
} while (!kthread_should_stop());
do_exit(0);
+}
+static void cap_gov_wake_up_process(struct task_struct *task) +{
/* this is null during early boot */
if (IS_ERR_OR_NULL(task)) {
return;
}
wake_up_process(task);
+}
+void cap_gov_kick_thread(int cpu) +{
struct cpufreq_policy *policy;
struct gov_data *gd = NULL;
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy))
return;
gd = policy->gov_data;
if (!gd)
goto out;
/* per-cpu access not needed here since we have gd */
if (atomic_read(&gd->need_wake_task)) {
trace_printk("waking up kthread (%d)", gd->task->pid);
cap_gov_wake_up_process(gd->task);
}
+out:
cpufreq_cpu_put(policy);
+}
+/**
 * cap_gov_update_cpu - interface to scheduler for changing capacity values
 * @cpu: cpu whose capacity utilization has recently changed
 *
 * cap_gov_udpate_cpu is an interface exposed to the scheduler so that the
 * scheduler may inform the governor of updates to capacity utilization and
 * make changes to cpu frequency. Currently this interface is designed around
 * PELT values in CFS. It can be expanded to other scheduling classes in the
 * future if needed.
 *
 * The semantics of this call vary based on the cpu frequency scaling
 * characteristics of the hardware.
 *
 * If kicking off a dvfs transition is an operation that might block or sleep
 * in the cpufreq driver then we set the need_wake_task flag in this function
The comment here isn't obvious since at first glance you don't touch need_wake_task. Perhaps clarify it as follows?
we set the need_wake_task (cap_gov_wake_task is a pointer to it)
 * and return. Selecting a frequency and programming it is done in a dedicated
 * kernel thread which will be woken up from rebalance_domains. See
 * cap_gov_kick_thread above.
 *
 * If kicking off a dvfs transition is an operation that returns quickly in the
 * cpufreq driver and will never sleep then we select the frequency in this
 * function and program the hardware for it in the scheduler hot path. No
 * dedicated kthread is needed.
 */
+void cap_gov_update_cpu(int cpu) +{
struct cpufreq_policy *policy;
struct gov_data *gd;
/* XXX put policy pointer in per-cpu data? */
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy)) {
return;
}
if (!policy->gov_data) {
trace_printk("missing governor data");
goto out;
}
gd = policy->gov_data;
/* bail early if we are throttled */
if (ktime_before(ktime_get(), gd->throttle)) {
trace_printk("THROTTLED");
goto out;
}
atomic_set(per_cpu(cap_gov_wake_task, cpu), 1);
+out:
cpufreq_cpu_put(policy);
return;
+}
+static void cap_gov_start(struct cpufreq_policy *policy) +{
int cpu;
struct gov_data *gd;
/* prepare per-policy private data */
gd = kzalloc(sizeof(*gd), GFP_KERNEL);
if (!gd) {
pr_debug("%s: failed to allocate private data\n", __func__);
return;
}
/*
* Don't ask for freq changes at an higher rate than what
* the driver advertises as transition latency.
*/
gd->throttle_nsec = policy->cpuinfo.transition_latency ?
policy->cpuinfo.transition_latency :
THROTTLE_NSEC;
pr_debug("%s: throttle threshold = %u [ns]\n",
__func__, gd->throttle_nsec);
/* save per-cpu pointer to per-policy need_wake_task */
for_each_cpu(cpu, policy->related_cpus)
per_cpu(cap_gov_wake_task, cpu) = &gd->need_wake_task;
/* init per-policy kthread */
gd->task = kthread_create(cap_gov_thread, policy, "kcap_gov_task");
if (IS_ERR_OR_NULL(gd->task))
pr_err("%s: failed to create kcap_gov_task thread\n", __func__);
policy->gov_data = gd;
+}
+static void cap_gov_stop(struct cpufreq_policy *policy) +{
struct gov_data *gd;
gd = policy->gov_data;
policy->gov_data = NULL;
kthread_stop(gd->task);
/* FIXME replace with devm counterparts? */
kfree(gd);
+}
+static int cap_gov_setup(struct cpufreq_policy *policy, unsigned int event) +{
switch (event) {
case CPUFREQ_GOV_START:
/* Start managing the frequency */
cap_gov_start(policy);
return 0;
case CPUFREQ_GOV_STOP:
cap_gov_stop(policy);
return 0;
case CPUFREQ_GOV_LIMITS: /* unused */
case CPUFREQ_GOV_POLICY_INIT: /* unused */
case CPUFREQ_GOV_POLICY_EXIT: /* unused */
break;
}
return 0;
+}
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV +static +#endif +struct cpufreq_governor cpufreq_gov_cap_gov = {
.name = "cap_gov",
.governor = cap_gov_setup,
.owner = THIS_MODULE,
+};
+static int __init cap_gov_init(void) +{
return cpufreq_register_governor(&cpufreq_gov_cap_gov);
+}
+static void __exit cap_gov_exit(void) +{
cpufreq_unregister_governor(&cpufreq_gov_cap_gov);
+}
+/* Try to make this the default governor */ +fs_initcall(cap_gov_init);
module_exit(cap_gov_exit) to allow switching governors
+MODULE_LICENSE("GPL");
Fill in MODULE_AUTHOR, MODULE_DESCRIPTION
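i.e. something along these lines (untested):

module_exit(cap_gov_exit);

MODULE_AUTHOR("Michael Turquette <mturquette@linaro.org>");
MODULE_DESCRIPTION("'cap_gov' - a capacity-based, scheduler-driven cpufreq governor");
MODULE_LICENSE("GPL");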
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b066a61..2ec2dc7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4257,6 +4257,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) update_rq_runnable_avg(rq, rq->nr_running); add_nr_running(rq, 1); }
if(sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -4318,6 +4322,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) sub_nr_running(rq, 1); update_rq_runnable_avg(rq, 1); }
if(sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -7768,6 +7776,14 @@ static void run_rebalance_domains(struct softirq_action *h) */ nohz_idle_balance(this_rq, idle); rebalance_domains(this_rq, idle);
/*
* FIXME some hardware does not require this, but current CPUfreq
* locking prevents us from changing cpu frequency with rq locks held
* and interrupts disabled
*/
if (sched_energy_freq())
cap_gov_kick_thread(cpu_of(this_rq));
}
/* @@ -7821,6 +7837,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued) task_tick_numa(rq, curr);
update_rq_runnable_avg(rq, 1);
if(sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
}
/* diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 0fe57ba..c45f1ee 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1398,6 +1398,14 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
int get_cpu_usage(int cpu);
+#ifdef CONFIG_CPU_FREQ_GOV_CAP_GOV +void cap_gov_update_cpu(int cpu); +void cap_gov_kick_thread(int cpu); +#else +static inline void cap_gov_update_cpu(int cpu) {} +static inline void cap_gov_kick_thread(int cpu) {} +#endif
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq)); -- 1.9.1
Hi Amit, Mike,
On 16/04/15 08:11, Amit Kucheria wrote:
On Thu, Apr 16, 2015 at 10:59 AM, Michael Turquette mturquette@linaro.org wrote:
Scheduler-driven cpu frequency selection is desirable as part of the on-going effort to make the scheduler better aware of energy consumption. No piece of the Linux kernel has a better view of the factors that affect a cpu frequency selection policy than the scheduler[0], and this patch is an attempt to get that discussion going again.
This patch implements a cpufreq governor, cap_gov, that directly accesses scheduler statistics, in particular the pelt data from cfs via the get_cpu_usage() function.
Put plainly, cap_gov selects the lowest cpu frequency that will prevent a runqueue from being over-utilized (until we hit the highest frequency of course).
cap_gov converts available cpu frequencies into capacity states. When the utilization of a cfs runqueue changes then the policy selects the capacity state which is the floor of the new usage.
Unlike the previous posting from 2014[1] this governor implements no policy of its own (e.g. with tunable thresholds for determining when to scale frequency), but instead implements a "follow the usage" method, where usage is defined as the cpu frequency-invariant product of utilization_load_avg and cpu_capacity_orig.
This governor is event-driven. There is no polling loop to check cpu idle time, or any other method which is unsynchronized with the scheduler. The entry points for this policy are in fair.c: enqueue_task_fair, dequeue_task_fair and task_tick_fair. run_rebalance_domains is used to kick the worker thread to prevent fatally re-entering the scheduler.
This policy is implemented using the cpufreq governor interface for two main reasons:
1) re-using the cpufreq machine drivers without using the governor interface is hard.

2) using the cpufreq interface allows us to switch between the scheduler-driven policy and legacy cpufreq governors such as ondemand at run-time. This is very useful for comparative testing and tuning.
Finally, it is worth mentioning that this approach neglects all scheduling classes except for cfs. It is possible to add support for deadline and other classes here, but I also wonder if a multi-governor approach would be a more maintainable solution, where the cpufreq core aggregates the constraints set by multiple governors. Supporting such an approach in the cpufreq core would also allow for peripheral devices to place constraints on cpu frequency without having to hack such behavior in at the governor level.
Thanks to Juri Lelli juri.lelli@arm.com for doing a good bit of testing, bug fixing and contributing towards the design.
[0] http://article.gmane.org/gmane.linux.kernel/1499836 [1] https://lkml.org/lkml/2014/10/22/22
Signed-off-by: Michael Turquette mturquette@linaro.org
 drivers/cpufreq/Kconfig |  22 +++
 include/linux/cpufreq.h |   3 +
 kernel/sched/Makefile   |   1 +
 kernel/sched/cap_gov.c  | 361 ++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c     |  19 +++
 kernel/sched/sched.h    |   8 ++
 6 files changed, 414 insertions(+)
 create mode 100644 kernel/sched/cap_gov.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig index a171fef..654d70a 100644 --- a/drivers/cpufreq/Kconfig +++ b/drivers/cpufreq/Kconfig @@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE Be aware that not all cpufreq drivers support the conservative governor. If unsure have a look at the help section of the driver. Fallback governor will be the performance governor.
+config CPU_FREQ_DEFAULT_GOV_CAP_GOV
bool "cap_gov"
select CPU_FREQ_GOV_CAP_GOV
select CPU_FREQ_GOV_PERFORMANCE
help
Use the CPUfreq governor 'cap_gov' as default. This scales cpu
frequency from the scheduler as per-entity load tracking
statistics are updated.
endchoice
config CPU_FREQ_GOV_PERFORMANCE @@ -183,6 +192,19 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_CAP_GOV
Two GOVs are redundant here and make it hard to read. A few name suggestions for your baby:
CPU_FREQ_GOV_SCHED_CAP CPU_FREQ_GOV_SCHED_STATS
How about simply CPU_FREQ_GOV_SCHEDULER ?
In the end this is all about controlling cpufreq from the scheduler. How we do it and what we use to do it might even change in the future (today is capacity/usage.. tomorrow who knows :)).
Also, the name of cap_gov.c might be changed to cpufreq_scheduler.c, to be consistent with the cpufreq namespace.
Thanks,
- Juri
tristate "'capacity governor' cpufreq governor"
depends on CPU_FREQ
select CPU_FREQ_GOV_COMMON
help
'cap_gov' - this governor scales cpu frequency from the
same as above
scheduler as a function of cpu capacity utilization. It does
not evaluate utilization on a periodic basis (unlike ondemand)
but instead is invoked from CFS when updating per-entity load
tracking statistics.
perhaps add something to the effect that it is more responsive than existing governors to really sell it? :)
If in doubt, say N.
comment "CPU frequency scaling drivers"
config CPUFREQ_DT diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 7cdf63a..4fc066f 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand; #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE) extern struct cpufreq_governor cpufreq_gov_conservative; #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative) +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV) +extern struct cpufreq_governor cpufreq_gov_cap_gov; +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov) #endif
/********************************************************************* diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile index 46be870..da601d5 100644 --- a/kernel/sched/Makefile +++ b/kernel/sched/Makefile @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o obj-$(CONFIG_SCHEDSTATS) += stats.o obj-$(CONFIG_SCHED_DEBUG) += debug.o obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o +obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c new file mode 100644 index 0000000..72873ab --- /dev/null +++ b/kernel/sched/cap_gov.c @@ -0,0 +1,361 @@ +/*
- Copyright (C) 2014 Michael Turquette mturquette@linaro.org
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License version 2 as
- published by the Free Software Foundation.
- */
+#include <linux/cpufreq.h> +#include <linux/module.h> +#include <linux/kthread.h> +#include <linux/percpu.h>
+#include "sched.h"
+#define UP_THRESHOLD 95
A comment that this probably belong as a sysfs tunable
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+/*
- per-cpu pointer to atomic_t gov_data->cap_gov_wake_task
s/cap_gov_wake_task/need_wake_task/
- used in scheduler hot paths {en,de}queueu, task_tick without having to
- access struct cpufreq_policy and struct gov_data
- */
+static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
+/**
- gov_data - per-policy data internal to the governor
- @throttle: time until throttling period expires. Derived from THROTTLE_NSEC
@throttle_nsec ?
- @task: worker task for dvfs transition that may block/sleep
- @need_wake_task: flag the governor to wake this policy's worker thread
- struct gov_data is the per-policy cap_gov-specific data structure. A
- per-policy instance of it is created when the cap_gov governor receives
- the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
- member of struct cpufreq_policy.
- Readers of this data must call down_read(policy->rwsem). Writers must
- call down_write(policy->rwsem).
- */
+struct gov_data {
ktime_t throttle;
unsigned int throttle_nsec;
struct task_struct *task;
atomic_t need_wake_task;
+};
+/**
- cap_gov_select_freq - pick the next frequency for a cpu
- @cpu: the cpu whose frequency may be changed
- cap_gov_select_freq works in a way similar to the ondemand governor. First
- we inspect the utilization of all of the cpus in this policy to find the
- most utilized cpu. This is achieved by calling get_cpu_usage, which returns
- frequency-invarant capacity utilization.
- This max utilization is compared against the up_threshold (default 95%
- utilization). If the max cpu utilization is greater than this threshold then
- we scale the policy up to the max frequency. Othewise we find the lowest
- frequency (smallest cpu capacity) that is still larger than the max capacity
- utilization for this policy.
- Returns frequency selected.
- */
+static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy) +{
int cpu = 0;
struct gov_data *gd;
int index;
unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
struct cpufreq_frequency_table *pos;
if (!policy->gov_data)
goto out;
gd = policy->gov_data;
/*
* get_cpu_usage is called without locking the runqueues. This is the
* same behavior used by find_busiest_cpu in load_balance. We are
* willing to accept occasionally stale data here in exchange for
* lockless behavior.
*/
for_each_cpu(cpu, policy->cpus) {
usage = get_cpu_usage(cpu);
trace_printk("cpu = %d usage = %lu", cpu, usage);
if (usage > max_usage)
max_usage = usage;
}
trace_printk("max_usage = %lu", max_usage);
/* find the utilization threshold at which we scale up frequency */
index = cpufreq_frequency_table_get_index(policy, policy->cur);
/*
* converge towards max_usage. We want the lowest frequency whose
* capacity is >= to max_usage. In other words:
*
* find capacity == floor(usage)
*
* Sadly cpufreq freq tables are not guaranteed to be ordered by
* frequency...
*/
freq = policy->max;
cpufreq_for_each_entry(pos, policy->freq_table) {
cap = pos->frequency * SCHED_CAPACITY_SCALE /
policy->max;
if (max_usage < cap && pos->frequency < freq)
freq = pos->frequency;
trace_printk("cpu = %u max_usage = %lu cap = %lu \
table_freq = %u freq = %lu",
cpumask_first(policy->cpus), max_usage, cap,
pos->frequency, freq);
}
+out:
trace_printk("cpu %d final freq %lu", cpu, freq);
return freq;
+}
+/*
- we pass in struct cpufreq_policy. This is safe because changing out the
- policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
- which tears down all of the data structures and __cpufreq_governor(policy,
- CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
- new policy pointer
- */
+static int cap_gov_thread(void *data) +{
struct sched_param param;
struct cpufreq_policy *policy;
struct gov_data *gd;
unsigned long freq;
int ret;
policy = (struct cpufreq_policy *) data;
if (!policy) {
pr_warn("%s: missing policy\n", __func__);
do_exit(-EINVAL);
}
gd = policy->gov_data;
if (!gd) {
pr_warn("%s: missing governor data\n", __func__);
do_exit(-EINVAL);
}
param.sched_priority = 0;
sched_setscheduler(current, SCHED_FIFO, ¶m);
set_cpus_allowed_ptr(current, policy->related_cpus);
/* main loop of the per-policy kthread */
do {
down_write(&policy->rwsem);
if (!atomic_read(&gd->need_wake_task)) {
if (kthread_should_stop())
break;
trace_printk("NOT waking up kthread (%d)", gd->task->pid);
up_write(&policy->rwsem);
set_current_state(TASK_INTERRUPTIBLE);
schedule();
continue;
}
trace_printk("kthread %d requested freq switch", gd->task->pid);
freq = cap_gov_select_freq(policy);
ret = __cpufreq_driver_target(policy, freq,
CPUFREQ_RELATION_H);
if (ret)
pr_debug("%s: __cpufreq_driver_target returned %d\n",
__func__, ret);
trace_printk("kthread %d requested freq switch", gd->task->pid);
gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
atomic_set(&gd->need_wake_task, 0);
up_write(&policy->rwsem);
} while (!kthread_should_stop());
do_exit(0);
+}
+static void cap_gov_wake_up_process(struct task_struct *task) +{
/* this is null during early boot */
if (IS_ERR_OR_NULL(task)) {
return;
}
wake_up_process(task);
+}
+void cap_gov_kick_thread(int cpu) +{
struct cpufreq_policy *policy;
struct gov_data *gd = NULL;
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy))
return;
gd = policy->gov_data;
if (!gd)
goto out;
/* per-cpu access not needed here since we have gd */
if (atomic_read(&gd->need_wake_task)) {
trace_printk("waking up kthread (%d)", gd->task->pid);
cap_gov_wake_up_process(gd->task);
}
+out:
cpufreq_cpu_put(policy);
+}
+/**
- cap_gov_update_cpu - interface to scheduler for changing capacity values
- @cpu: cpu whose capacity utilization has recently changed
- cap_gov_udpate_cpu is an interface exposed to the scheduler so that the
- scheduler may inform the governor of updates to capacity utilization and
- make changes to cpu frequency. Currently this interface is designed around
- PELT values in CFS. It can be expanded to other scheduling classes in the
- future if needed.
- The semantics of this call vary based on the cpu frequency scaling
- characteristics of the hardware.
- If kicking off a dvfs transition is an operation that might block or sleep
- in the cpufreq driver then we set the need_wake_task flag in this function
The comment here isn't obvious since first glance you don't touch need_wake_task. Perhaps clarify it as follows?
we set the need_wake_task (cap_gov_wake_task is a pointer to it)
- and return. Selecting a frequency and programming it is done in a dedicated
- kernel thread which will be woken up from rebalance_domains. See
- cap_gov_kick_thread above.
- If kicking off a dvfs transition is an operation that returns quickly in the
- cpufreq driver and will never sleep then we select the frequency in this
- function and program the hardware for it in the scheduler hot path. No
- dedicated kthread is needed.
- */
+void cap_gov_update_cpu(int cpu) +{
struct cpufreq_policy *policy;
struct gov_data *gd;
/* XXX put policy pointer in per-cpu data? */
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy)) {
return;
}
if (!policy->gov_data) {
trace_printk("missing governor data");
goto out;
}
gd = policy->gov_data;
/* bail early if we are throttled */
if (ktime_before(ktime_get(), gd->throttle)) {
trace_printk("THROTTLED");
goto out;
}
atomic_set(per_cpu(cap_gov_wake_task, cpu), 1);
+out:
cpufreq_cpu_put(policy);
return;
+}
+static void cap_gov_start(struct cpufreq_policy *policy) +{
int cpu;
struct gov_data *gd;
/* prepare per-policy private data */
gd = kzalloc(sizeof(*gd), GFP_KERNEL);
if (!gd) {
pr_debug("%s: failed to allocate private data\n", __func__);
return;
}
/*
* Don't ask for freq changes at an higher rate than what
* the driver advertises as transition latency.
*/
gd->throttle_nsec = policy->cpuinfo.transition_latency ?
policy->cpuinfo.transition_latency :
THROTTLE_NSEC;
pr_debug("%s: throttle threshold = %u [ns]\n",
__func__, gd->throttle_nsec);
/* save per-cpu pointer to per-policy need_wake_task */
for_each_cpu(cpu, policy->related_cpus)
per_cpu(cap_gov_wake_task, cpu) = &gd->need_wake_task;
/* init per-policy kthread */
gd->task = kthread_create(cap_gov_thread, policy, "kcap_gov_task");
if (IS_ERR_OR_NULL(gd->task))
pr_err("%s: failed to create kcap_gov_task thread\n", __func__);
policy->gov_data = gd;
+}
+static void cap_gov_stop(struct cpufreq_policy *policy) +{
struct gov_data *gd;
gd = policy->gov_data;
policy->gov_data = NULL;
kthread_stop(gd->task);
/* FIXME replace with devm counterparts? */
kfree(gd);
+}
+static int cap_gov_setup(struct cpufreq_policy *policy, unsigned int event) +{
switch (event) {
case CPUFREQ_GOV_START:
/* Start managing the frequency */
cap_gov_start(policy);
return 0;
case CPUFREQ_GOV_STOP:
cap_gov_stop(policy);
return 0;
case CPUFREQ_GOV_LIMITS: /* unused */
case CPUFREQ_GOV_POLICY_INIT: /* unused */
case CPUFREQ_GOV_POLICY_EXIT: /* unused */
break;
}
return 0;
+}
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV +static +#endif +struct cpufreq_governor cpufreq_gov_cap_gov = {
.name = "cap_gov",
.governor = cap_gov_setup,
.owner = THIS_MODULE,
+};
+static int __init cap_gov_init(void) +{
return cpufreq_register_governor(&cpufreq_gov_cap_gov);
+}
+static void __exit cap_gov_exit(void) +{
cpufreq_unregister_governor(&cpufreq_gov_cap_gov);
+}
+/* Try to make this the default governor */ +fs_initcall(cap_gov_init);
module_exit(cap_gov_exit) to allow switching governors
+MODULE_LICENSE("GPL");
Fill in MODULE_AUTHOR, MODULE_DESCRIPTION
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b066a61..2ec2dc7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4257,6 +4257,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) update_rq_runnable_avg(rq, rq->nr_running); add_nr_running(rq, 1); }
if(sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -4318,6 +4322,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags) sub_nr_running(rq, 1); update_rq_runnable_avg(rq, 1); }
if(sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -7768,6 +7776,14 @@ static void run_rebalance_domains(struct softirq_action *h) */ nohz_idle_balance(this_rq, idle); rebalance_domains(this_rq, idle);
/*
* FIXME some hardware does not require this, but current CPUfreq
* locking prevents us from changing cpu frequency with rq locks held
* and interrupts disabled
*/
if (sched_energy_freq())
cap_gov_kick_thread(cpu_of(this_rq));
}
/*

@@ -7821,6 +7837,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
		task_tick_numa(rq, curr);
update_rq_runnable_avg(rq, 1);
if (sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
}
/*

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fe57ba..c45f1ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1398,6 +1398,14 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
int get_cpu_usage(int cpu);
+#ifdef CONFIG_CPU_FREQ_GOV_CAP_GOV
+void cap_gov_update_cpu(int cpu);
+void cap_gov_kick_thread(int cpu);
+#else
+static inline void cap_gov_update_cpu(int cpu) {}
+static inline void cap_gov_kick_thread(int cpu) {}
+#endif
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));

--
1.9.1
Quoting Amit Kucheria (2015-04-16 00:11:22)
On Thu, Apr 16, 2015 at 10:59 AM, Michael Turquette mturquette@linaro.org wrote:
+config CPU_FREQ_GOV_CAP_GOV
Two GOVs are redundant here and make it hard to read. A few name suggestions for your baby:
CPU_FREQ_GOV_SCHED_CAP
CPU_FREQ_GOV_SCHED_STATS
I don't want the name to be too generic, since we're only dealing with cfs right now. Perhaps your SCHED_CAP variant or maybe CPU_FREQ_GOV_SCHED_CFS? That leaves room for SCHED_DL and others later on.
tristate "'capacity governor' cpufreq governor"
depends on CPU_FREQ
select CPU_FREQ_GOV_COMMON
help
'cap_gov' - this governor scales cpu frequency from the
same as above
scheduler as a function of cpu capacity utilization. It does
not evaluate utilization on a periodic basis (unlike ondemand)
but instead is invoked from CFS when updating per-entity load
tracking statistics.
perhaps add something to the effect that it is more responsive than existing governors to really sell it? :)
Good idea. I'll add,
"Response to changes in load is improved over polling governors due to its event-driven design"
If in doubt, say N.
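For concreteness, the amended entry might end up reading as follows — a sketch only, folding the new sentence into the existing help text, with the final wording left open:

config CPU_FREQ_GOV_CAP_GOV
	tristate "'capacity governor' cpufreq governor"
	depends on CPU_FREQ
	select CPU_FREQ_GOV_COMMON
	help
	  'cap_gov' - this governor scales cpu frequency from the
	  scheduler as a function of cpu capacity utilization. It does
	  not evaluate utilization on a periodic basis (unlike ondemand)
	  but instead is invoked from CFS when updating per-entity load
	  tracking statistics. Response to changes in load is improved
	  over polling governors due to its event-driven design.

	  If in doubt, say N.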
comment "CPU frequency scaling drivers"
config CPUFREQ_DT

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 7cdf63a..4fc066f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
+extern struct cpufreq_governor cpufreq_gov_cap_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov)
 #endif

/*********************************************************************

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..da601d5 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o
diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
new file mode 100644
index 0000000..72873ab
--- /dev/null
+++ b/kernel/sched/cap_gov.c
@@ -0,0 +1,361 @@
+/*
- Copyright (C) 2014 Michael Turquette mturquette@linaro.org
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License version 2 as
- published by the Free Software Foundation.
- */
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include "sched.h"
+#define UP_THRESHOLD 95
A comment that this probably belongs as a sysfs tunable?
Doh, this shouldn't be here at all. I don't use any up or down thresholds in this version.
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+/*
- per-cpu pointer to atomic_t gov_data->cap_gov_wake_task
s/cap_gov_wake_task/need_wake_task/
Ack. I might also be able to get rid of this entirely with the irq_work stuff.
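For reference, a minimal sketch of that irq_work direction, assuming a new struct irq_work member in struct gov_data (member and callback names here are hypothetical, not code from this series):

#include <linux/irq_work.h>

/* assumed addition to struct gov_data: struct irq_work irq_work; */
static void cap_gov_irq_work(struct irq_work *work)
{
	struct gov_data *gd = container_of(work, struct gov_data, irq_work);

	/* irq_work callbacks run from hard irq context, where waking
	 * the per-policy kthread is safe */
	wake_up_process(gd->task);
}

/* in cap_gov_start():      init_irq_work(&gd->irq_work, cap_gov_irq_work);
 * in cap_gov_update_cpu(): irq_work_queue(&gd->irq_work); */

That would remove the dependency on run_rebalance_domains for kicking the thread.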
- used in scheduler hot paths {en,de}queue, task_tick without having to
- access struct cpufreq_policy and struct gov_data
- */
+static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
+/**
- gov_data - per-policy data internal to the governor
- @throttle: time until throttling period expires. Derived from THROTTLE_NSEC
@throttle_nsec ?
Ack.
- @task: worker task for dvfs transition that may block/sleep
- @need_wake_task: flag the governor to wake this policy's worker thread
- struct gov_data is the per-policy cap_gov-specific data structure. A
- per-policy instance of it is created when the cap_gov governor receives
- the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
- member of struct cpufreq_policy.
- Readers of this data must call down_read(policy->rwsem). Writers must
- call down_write(policy->rwsem).
- */
+struct gov_data {
ktime_t throttle;
unsigned int throttle_nsec;
struct task_struct *task;
atomic_t need_wake_task;
+};
+/**
- cap_gov_select_freq - pick the next frequency for a cpu
- @cpu: the cpu whose frequency may be changed
- cap_gov_select_freq works in a way similar to the ondemand governor. First
- we inspect the utilization of all of the cpus in this policy to find the
- most utilized cpu. This is achieved by calling get_cpu_usage, which returns
- frequency-invariant capacity utilization.
- This max utilization is compared against the up_threshold (default 95%
- utilization). If the max cpu utilization is greater than this threshold then
- we scale the policy up to the max frequency. Otherwise we find the lowest
- frequency (smallest cpu capacity) that is still larger than the max capacity
- utilization for this policy.
- Returns frequency selected.
- */
+static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy)
+{
int cpu = 0;
struct gov_data *gd;
int index;
unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
struct cpufreq_frequency_table *pos;
if (!policy->gov_data)
goto out;
gd = policy->gov_data;
/*
* get_cpu_usage is called without locking the runqueues. This is the
* same behavior used by find_busiest_cpu in load_balance. We are
* willing to accept occasionally stale data here in exchange for
* lockless behavior.
*/
for_each_cpu(cpu, policy->cpus) {
usage = get_cpu_usage(cpu);
trace_printk("cpu = %d usage = %lu", cpu, usage);
if (usage > max_usage)
max_usage = usage;
}
trace_printk("max_usage = %lu", max_usage);
/* find the utilization threshold at which we scale up frequency */
index = cpufreq_frequency_table_get_index(policy, policy->cur);
/*
* converge towards max_usage. We want the lowest frequency whose
* capacity is >= to max_usage. In other words:
*
* find capacity == floor(usage)
*
* Sadly cpufreq freq tables are not guaranteed to be ordered by
* frequency...
*/
freq = policy->max;
cpufreq_for_each_entry(pos, policy->freq_table) {
cap = pos->frequency * SCHED_CAPACITY_SCALE /
policy->max;
if (max_usage < cap && pos->frequency < freq)
freq = pos->frequency;
trace_printk("cpu = %u max_usage = %lu cap = %lu table_freq = %u freq = %lu",
	     cpumask_first(policy->cpus), max_usage, cap,
	     pos->frequency, freq);
}
+out:
trace_printk("cpu %d final freq %lu", cpu, freq);
return freq;
+}
+/*
- we pass in struct cpufreq_policy. This is safe because changing out the
- policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
- which tears down all of the data structures and __cpufreq_governor(policy,
- CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
- new policy pointer
- */
+static int cap_gov_thread(void *data)
+{
struct sched_param param;
struct cpufreq_policy *policy;
struct gov_data *gd;
unsigned long freq;
int ret;
policy = (struct cpufreq_policy *) data;
if (!policy) {
pr_warn("%s: missing policy\n", __func__);
do_exit(-EINVAL);
}
gd = policy->gov_data;
if (!gd) {
pr_warn("%s: missing governor data\n", __func__);
do_exit(-EINVAL);
}
param.sched_priority = 0;
sched_setscheduler(current, SCHED_FIFO, &param);
set_cpus_allowed_ptr(current, policy->related_cpus);
/* main loop of the per-policy kthread */
do {
down_write(&policy->rwsem);
if (!atomic_read(&gd->need_wake_task)) {
if (kthread_should_stop())
break;
trace_printk("NOT waking up kthread (%d)", gd->task->pid);
up_write(&policy->rwsem);
set_current_state(TASK_INTERRUPTIBLE);
schedule();
continue;
}
trace_printk("kthread %d requested freq switch", gd->task->pid);
freq = cap_gov_select_freq(policy);
ret = __cpufreq_driver_target(policy, freq,
CPUFREQ_RELATION_H);
if (ret)
pr_debug("%s: __cpufreq_driver_target returned %d\n",
__func__, ret);
trace_printk("kthread %d requested freq switch", gd->task->pid);
gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
atomic_set(&gd->need_wake_task, 0);
up_write(&policy->rwsem);
} while (!kthread_should_stop());
do_exit(0);
+}
+static void cap_gov_wake_up_process(struct task_struct *task)
+{
/* this is null during early boot */
if (IS_ERR_OR_NULL(task)) {
return;
}
wake_up_process(task);
+}
+void cap_gov_kick_thread(int cpu)
+{
struct cpufreq_policy *policy;
struct gov_data *gd = NULL;
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy))
return;
gd = policy->gov_data;
if (!gd)
goto out;
/* per-cpu access not needed here since we have gd */
if (atomic_read(&gd->need_wake_task)) {
trace_printk("waking up kthread (%d)", gd->task->pid);
cap_gov_wake_up_process(gd->task);
}
+out:
cpufreq_cpu_put(policy);
+}
+/**
- cap_gov_update_cpu - interface to scheduler for changing capacity values
- @cpu: cpu whose capacity utilization has recently changed
- cap_gov_update_cpu is an interface exposed to the scheduler so that the
- scheduler may inform the governor of updates to capacity utilization and
- make changes to cpu frequency. Currently this interface is designed around
- PELT values in CFS. It can be expanded to other scheduling classes in the
- future if needed.
- The semantics of this call vary based on the cpu frequency scaling
- characteristics of the hardware.
- If kicking off a dvfs transition is an operation that might block or sleep
- in the cpufreq driver then we set the need_wake_task flag in this function
The comment here isn't obvious since at first glance you don't touch need_wake_task. Perhaps clarify it as follows?
we set the need_wake_task (cap_gov_wake_task is a pointer to it)
I can do that. Additionally the kerneldoc description should remove all of the text about hardware that has async/non-blocking dvfs transition. This version of the patch ALWAYS kicks the kthread and the previous "driver_might_sleep" bool has been removed.
Trying to keep the submission as simple and not-over-engineered as possible.
Regards, Mike
Hi Mike,
On 16/04/15 06:29, Michael Turquette wrote:
Scheduler-driven cpu frequency selection is desirable as part of the on-going effort to make the scheduler better aware of energy consumption. No piece of the Linux kernel has a better view of the factors that affect a cpu frequency selection policy than the scheduler[0], and this patch is an attempt to get that discussion going again.
This patch implements a cpufreq governor, cap_gov, that directly accesses scheduler statistics, in particular the pelt data from cfs via the get_cpu_usage() function.
Put plainly, cap_gov selects the lowest cpu frequency that will prevent a runqueue from being over-utilized (until we hit the highest frequency of course).
cap_gov converts available cpu frequencies into capacity states. When the utilization of a cfs runqueue changes then the policy selects the capacity state which is the floor of the new usage.
Unlike the previous posting from 2014[1] this governor implements no policy of its own (e.g. with tunable thresholds for determining when to scale frequency), but instead implements a "follow the usage" method, where usage is defined as the cpu frequency-invariant product of utilization_load_avg and cpu_capacity_orig.
This governor is event-driven. There is no polling loop to check cpu idle time, or any other method which is unsynchronized with the scheduler. The entry points for this policy are in fair.c: enqueue_task_fair, dequeue_task_fair and task_tick_fair. run_rebalance_domains is used to kick the worker thread, to avoid fatally re-entering the scheduler.
This policy is implemented using the cpufreq governor interface for two main reasons:
- re-using the cpufreq machine drivers without using the governor
interface is hard.
- using the cpufreq interface allows us to switch between the
scheduler-driven policy and legacy cpufreq governors such as ondemand at run-time. This is very useful for comparative testing and tuning.
Finally, it is worth mentioning that this approach neglects all scheduling classes except for cfs. It is possible to add support for deadline and other classes here, but I also wonder if a multi-governor approach would be a more maintainable solution, where the cpufreq core aggregates the constraints set by multiple governors. Supporting such an approach in the cpufreq core would also allow for peripheral devices to place constraints on cpu frequency without having to hack such behavior in at the governor level.
Thanks to Juri Lelli juri.lelli@arm.com for doing a good bit of testing, bug fixing and contributing towards the design.
[0] http://article.gmane.org/gmane.linux.kernel/1499836 [1] https://lkml.org/lkml/2014/10/22/22
Signed-off-by: Michael Turquette mturquette@linaro.org
drivers/cpufreq/Kconfig | 22 +++ include/linux/cpufreq.h | 3 + kernel/sched/Makefile | 1 + kernel/sched/cap_gov.c | 361 ++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 19 +++ kernel/sched/sched.h | 8 ++ 6 files changed, 414 insertions(+) create mode 100644 kernel/sched/cap_gov.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index a171fef..654d70a 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
	  Be aware that not all cpufreq drivers support the conservative
	  governor. If unsure have a look at the help section of the
	  driver. Fallback governor will be the performance governor.
+config CPU_FREQ_DEFAULT_GOV_CAP_GOV
bool "cap_gov"
select CPU_FREQ_GOV_CAP_GOV
select CPU_FREQ_GOV_PERFORMANCE
help
Use the CPUfreq governor 'cap_gov' as default. This scales cpu
frequency from the scheduler as per-entity load tracking
statistics are updated.
endchoice
config CPU_FREQ_GOV_PERFORMANCE

@@ -183,6 +192,19 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_CAP_GOV
tristate "'capacity governor' cpufreq governor"
depends on CPU_FREQ
select CPU_FREQ_GOV_COMMON
help
'cap_gov' - this governor scales cpu frequency from the
scheduler as a function of cpu capacity utilization. It does
not evaluate utilization on a periodic basis (unlike ondemand)
but instead is invoked from CFS when updating per-entity load
tracking statistics.
If in doubt, say N.
comment "CPU frequency scaling drivers"
config CPUFREQ_DT

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 7cdf63a..4fc066f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
+extern struct cpufreq_governor cpufreq_gov_cap_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov)
 #endif

/*********************************************************************

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..da601d5 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o
diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
new file mode 100644
index 0000000..72873ab
--- /dev/null
+++ b/kernel/sched/cap_gov.c
@@ -0,0 +1,361 @@
+/*
- Copyright (C) 2014 Michael Turquette mturquette@linaro.org
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License version 2 as
- published by the Free Software Foundation.
- */
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include "sched.h"
+#define UP_THRESHOLD 95
Is this a leftover? In the changelog you say that you moved away from thresholds. Anyway, since we scale utilization by freq, I'm not sure we can live without some sort of up_threshold. The problem is that if you are running a task flat out on a CPU at a certain freq, let's say the lower one, you'll always get a usage for that CPU that corresponds to the current capacity of that CPU at that freq. As you use the usage signal to decide when to ramp up, you will never ramp up in this situation because the signal won't cross the capacity at the lower frequency.
We could solve this problem by putting the up threshold back. As soon as you cross it you go to max, and then adapt, choosing the right capacity for the actual, non capped, utilization of the task.
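To make the proposal concrete, a sketch of how such a threshold might be folded back into cap_gov_select_freq(), right after max_usage has been computed (the placement and the exact comparison are assumptions, not code from this series):

	/*
	 * sketch: compare max_usage against UP_THRESHOLD percent of the
	 * capacity at the current frequency; go to max when crossed and
	 * let subsequent evaluations converge back down
	 */
	unsigned long cur_cap = policy->cur * SCHED_CAPACITY_SCALE /
				policy->max;

	if (max_usage * 100 > cur_cap * UP_THRESHOLD)
		return policy->max;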
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+/*
- per-cpu pointer to atomic_t gov_data->cap_gov_wake_task
- used in scheduler hot paths {en,de}queue, task_tick without having to
- access struct cpufreq_policy and struct gov_data
- */
+static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
+/**
- gov_data - per-policy data internal to the governor
- @throttle: time until throttling period expires. Derived from THROTTLE_NSEC
- @task: worker task for dvfs transition that may block/sleep
- @need_wake_task: flag the governor to wake this policy's worker thread
- struct gov_data is the per-policy cap_gov-specific data structure. A
- per-policy instance of it is created when the cap_gov governor receives
- the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
- member of struct cpufreq_policy.
- Readers of this data must call down_read(policy->rwsem). Writers must
- call down_write(policy->rwsem).
- */
+struct gov_data {
ktime_t throttle;
unsigned int throttle_nsec;
struct task_struct *task;
atomic_t need_wake_task;
+};
+/**
- cap_gov_select_freq - pick the next frequency for a cpu
- @cpu: the cpu whose frequency may be changed
- cap_gov_select_freq works in a way similar to the ondemand governor. First
- we inspect the utilization of all of the cpus in this policy to find the
- most utilized cpu. This is achieved by calling get_cpu_usage, which returns
- frequency-invariant capacity utilization.
- This max utilization is compared against the up_threshold (default 95%
- utilization). If the max cpu utilization is greater than this threshold then
- we scale the policy up to the max frequency. Otherwise we find the lowest
- frequency (smallest cpu capacity) that is still larger than the max capacity
- utilization for this policy.
- Returns frequency selected.
- */
+static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy)
+{
int cpu = 0;
struct gov_data *gd;
int index;
unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
struct cpufreq_frequency_table *pos;
if (!policy->gov_data)
goto out;
gd = policy->gov_data;
/*
* get_cpu_usage is called without locking the runqueues. This is the
* same behavior used by find_busiest_cpu in load_balance. We are
* willing to accept occasionally stale data here in exchange for
* lockless behavior.
*/
for_each_cpu(cpu, policy->cpus) {
usage = get_cpu_usage(cpu);
trace_printk("cpu = %d usage = %lu", cpu, usage);
Here and below, do you want to post the patches with trace_printks?
if (usage > max_usage)
max_usage = usage;
}
trace_printk("max_usage = %lu", max_usage);
/* find the utilization threshold at which we scale up frequency */
index = cpufreq_frequency_table_get_index(policy, policy->cur);
/*
* converge towards max_usage. We want the lowest frequency whose
* capacity is >= to max_usage. In other words:
*
* find capacity == floor(usage)
*
* Sadly cpufreq freq tables are not guaranteed to be ordered by
* frequency...
*/
freq = policy->max;
cpufreq_for_each_entry(pos, policy->freq_table) {
cap = pos->frequency * SCHED_CAPACITY_SCALE /
policy->max;
if (max_usage < cap && pos->frequency < freq)
freq = pos->frequency;
trace_printk("cpu = %u max_usage = %lu cap = %lu table_freq = %u freq = %lu",
	     cpumask_first(policy->cpus), max_usage, cap,
	     pos->frequency, freq);
}
+out:
trace_printk("cpu %d final freq %lu", cpu, freq);
return freq;
+}
+/*
- we pass in struct cpufreq_policy. This is safe because changing out the
- policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
- which tears down all of the data structures and __cpufreq_governor(policy,
- CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
- new policy pointer
- */
+static int cap_gov_thread(void *data)
+{
struct sched_param param;
struct cpufreq_policy *policy;
struct gov_data *gd;
unsigned long freq;
int ret;
policy = (struct cpufreq_policy *) data;
if (!policy) {
pr_warn("%s: missing policy\n", __func__);
do_exit(-EINVAL);
}
gd = policy->gov_data;
if (!gd) {
pr_warn("%s: missing governor data\n", __func__);
do_exit(-EINVAL);
}
param.sched_priority = 0;
sched_setscheduler(current, SCHED_FIFO, &param);
set_cpus_allowed_ptr(current, policy->related_cpus);
We should check return values of these functions, use the in-kernel version of setscheduler and set a true RT prio for kthreads, something like:
-	param.sched_priority = 0;
-	sched_setscheduler(current, SCHED_FIFO, &param);
-	set_cpus_allowed_ptr(current, policy->related_cpus);
+	param.sched_priority = 50;
+	ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
+	if (ret) {
+		pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
+		do_exit(-EINVAL);
+	} else {
+		pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
+			 __func__, gd->task->pid);
+	}
+
+	ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus);
+	if (ret) {
+		pr_warn("%s: failed to set allowed ptr\n", __func__);
+		do_exit(-EINVAL);
+	}
/* main loop of the per-policy kthread */
do {
down_write(&policy->rwsem);
if (!atomic_read(&gd->need_wake_task)) {
if (kthread_should_stop())
break;
trace_printk("NOT waking up kthread (%d)", gd->task->pid);
up_write(&policy->rwsem);
set_current_state(TASK_INTERRUPTIBLE);
schedule();
continue;
}
trace_printk("kthread %d requested freq switch", gd->task->pid);
freq = cap_gov_select_freq(policy);
ret = __cpufreq_driver_target(policy, freq,
CPUFREQ_RELATION_H);
if (ret)
pr_debug("%s: __cpufreq_driver_target returned %d\n",
__func__, ret);
trace_printk("kthread %d requested freq switch", gd->task->pid);
gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
atomic_set(&gd->need_wake_task, 0);
up_write(&policy->rwsem);
} while (!kthread_should_stop());
do_exit(0);
+}
+static void cap_gov_wake_up_process(struct task_struct *task)
+{
/* this is null during early boot */
if (IS_ERR_OR_NULL(task)) {
return;
}
wake_up_process(task);
+}
+void cap_gov_kick_thread(int cpu)
+{
struct cpufreq_policy *policy;
struct gov_data *gd = NULL;
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy))
return;
gd = policy->gov_data;
if (!gd)
goto out;
/* per-cpu access not needed here since we have gd */
if (atomic_read(&gd->need_wake_task)) {
trace_printk("waking up kthread (%d)", gd->task->pid);
cap_gov_wake_up_process(gd->task);
}
+out:
cpufreq_cpu_put(policy);
+}
+/**
- cap_gov_update_cpu - interface to scheduler for changing capacity values
- @cpu: cpu whose capacity utilization has recently changed
- cap_gov_update_cpu is an interface exposed to the scheduler so that the
- scheduler may inform the governor of updates to capacity utilization and
- make changes to cpu frequency. Currently this interface is designed around
- PELT values in CFS. It can be expanded to other scheduling classes in the
- future if needed.
- The semantics of this call vary based on the cpu frequency scaling
- characteristics of the hardware.
- If kicking off a dvfs transition is an operation that might block or sleep
- in the cpufreq driver then we set the need_wake_task flag in this function
- and return. Selecting a frequency and programming it is done in a dedicated
- kernel thread which will be woken up from rebalance_domains. See
- cap_gov_kick_thread above.
- If kicking off a dvfs transition is an operation that returns quickly in the
- cpufreq driver and will never sleep then we select the frequency in this
- function and program the hardware for it in the scheduler hot path. No
- dedicated kthread is needed.
This is not something that we already have, right? This is of course fine, but IMHO we have to highlight this "problem" a bit more. Also, clearly state that code for that case is not part of this patchset.
Thanks,
- Juri
- */
+void cap_gov_update_cpu(int cpu)
+{
struct cpufreq_policy *policy;
struct gov_data *gd;
/* XXX put policy pointer in per-cpu data? */
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy)) {
return;
}
if (!policy->gov_data) {
trace_printk("missing governor data");
goto out;
}
gd = policy->gov_data;
/* bail early if we are throttled */
if (ktime_before(ktime_get(), gd->throttle)) {
trace_printk("THROTTLED");
goto out;
}
atomic_set(per_cpu(cap_gov_wake_task, cpu), 1);
+out:
cpufreq_cpu_put(policy);
return;
+}
+static void cap_gov_start(struct cpufreq_policy *policy)
+{
int cpu;
struct gov_data *gd;
/* prepare per-policy private data */
gd = kzalloc(sizeof(*gd), GFP_KERNEL);
if (!gd) {
pr_debug("%s: failed to allocate private data\n", __func__);
return;
}
/*
* Don't ask for freq changes at a higher rate than what
* the driver advertises as transition latency.
*/
gd->throttle_nsec = policy->cpuinfo.transition_latency ?
policy->cpuinfo.transition_latency :
THROTTLE_NSEC;
pr_debug("%s: throttle threshold = %u [ns]\n",
__func__, gd->throttle_nsec);
/* save per-cpu pointer to per-policy need_wake_task */
for_each_cpu(cpu, policy->related_cpus)
per_cpu(cap_gov_wake_task, cpu) = &gd->need_wake_task;
/* init per-policy kthread */
gd->task = kthread_create(cap_gov_thread, policy, "kcap_gov_task");
if (IS_ERR_OR_NULL(gd->task))
pr_err("%s: failed to create kcap_gov_task thread\n", __func__);
policy->gov_data = gd;
+}
+static void cap_gov_stop(struct cpufreq_policy *policy)
+{
struct gov_data *gd;
gd = policy->gov_data;
policy->gov_data = NULL;
kthread_stop(gd->task);
/* FIXME replace with devm counterparts? */
kfree(gd);
+}
+static int cap_gov_setup(struct cpufreq_policy *policy, unsigned int event)
+{
switch (event) {
case CPUFREQ_GOV_START:
/* Start managing the frequency */
cap_gov_start(policy);
return 0;
case CPUFREQ_GOV_STOP:
cap_gov_stop(policy);
return 0;
case CPUFREQ_GOV_LIMITS: /* unused */
case CPUFREQ_GOV_POLICY_INIT: /* unused */
case CPUFREQ_GOV_POLICY_EXIT: /* unused */
break;
}
return 0;
+}
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV
+static
+#endif
+struct cpufreq_governor cpufreq_gov_cap_gov = {
.name = "cap_gov",
.governor = cap_gov_setup,
.owner = THIS_MODULE,
+};
+static int __init cap_gov_init(void)
+{
return cpufreq_register_governor(&cpufreq_gov_cap_gov);
+}
+static void __exit cap_gov_exit(void)
+{
cpufreq_unregister_governor(&cpufreq_gov_cap_gov);
+}
+/* Try to make this the default governor */
+fs_initcall(cap_gov_init);
+MODULE_LICENSE("GPL");
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b066a61..2ec2dc7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4257,6 +4257,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
		update_rq_runnable_avg(rq, rq->nr_running);
		add_nr_running(rq, 1);
	}
if (sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -4318,6 +4322,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
		sub_nr_running(rq, 1);
		update_rq_runnable_avg(rq, 1);
	}
if (sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -7768,6 +7776,14 @@ static void run_rebalance_domains(struct softirq_action *h)
	 */
	nohz_idle_balance(this_rq, idle);
	rebalance_domains(this_rq, idle);
/*
* FIXME some hardware does not require this, but current CPUfreq
* locking prevents us from changing cpu frequency with rq locks held
* and interrupts disabled
*/
if (sched_energy_freq())
cap_gov_kick_thread(cpu_of(this_rq));
}
/*

@@ -7821,6 +7837,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
		task_tick_numa(rq, curr);
update_rq_runnable_avg(rq, 1);
if (sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
}
/*

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fe57ba..c45f1ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1398,6 +1398,14 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
int get_cpu_usage(int cpu);
+#ifdef CONFIG_CPU_FREQ_GOV_CAP_GOV
+void cap_gov_update_cpu(int cpu);
+void cap_gov_kick_thread(int cpu);
+#else
+static inline void cap_gov_update_cpu(int cpu) {}
+static inline void cap_gov_kick_thread(int cpu) {}
+#endif
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));

--
1.9.1
Quoting Juri Lelli (2015-04-16 09:46:47)
Hi Mike,
On 16/04/15 06:29, Michael Turquette wrote:
Scheduler-driven cpu frequency selection is desirable as part of the on-going effort to make the scheduler better aware of energy consumption. No piece of the Linux kernel has a better view of the factors that affect a cpu frequency selection policy than the scheduler[0], and this patch is an attempt to get that discussion going again.
This patch implements a cpufreq governor, cap_gov, that directly accesses scheduler statistics, in particular the pelt data from cfs via the get_cpu_usage() function.
Put plainly, cap_gov selects the lowest cpu frequency that will prevent a runqueue from being over-utilized (until we hit the highest frequency of course).
cap_gov converts available cpu frequencies into capacity states. When the utilization of a cfs runqueue changes then the policy selects the capacity state which is the floor of the new usage.
Unlike the previous posting from 2014[1] this governor implements no policy of its own (e.g. with tunable thresholds for determining when to scale frequency), but instead implements a "follow the usage" method, where usage is defined as the cpu frequency-invariant product of utilization_load_avg and cpu_capacity_orig.
This governor is event-driven. There is no polling loop to check cpu idle time, or any other method which is unsynchronized with the scheduler. The entry points for this policy are in fair.c: enqueue_task_fair, dequeue_task_fair and task_tick_fair. run_rebalance_domains is used to kick the worker thread, to avoid fatally re-entering the scheduler.
This policy is implemented using the cpufreq governor interface for two main reasons:
- re-using the cpufreq machine drivers without using the governor
interface is hard.
- using the cpufreq interface allows us to switch between the
scheduler-driven policy and legacy cpufreq governors such as ondemand at run-time. This is very useful for comparative testing and tuning.
Finally, it is worth mentioning that this approach neglects all scheduling classes except for cfs. It is possible to add support for deadline and other classes here, but I also wonder if a multi-governor approach would be a more maintainable solution, where the cpufreq core aggregates the constraints set by multiple governors. Supporting such an approach in the cpufreq core would also allow for peripheral devices to place constraints on cpu frequency without having to hack such behavior in at the governor level.
Thanks to Juri Lelli juri.lelli@arm.com for doing a good bit of testing, bug fixing and contributing towards the design.
[0] http://article.gmane.org/gmane.linux.kernel/1499836 [1] https://lkml.org/lkml/2014/10/22/22
Signed-off-by: Michael Turquette mturquette@linaro.org
drivers/cpufreq/Kconfig | 22 +++ include/linux/cpufreq.h | 3 + kernel/sched/Makefile | 1 + kernel/sched/cap_gov.c | 361 ++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched/fair.c | 19 +++ kernel/sched/sched.h | 8 ++ 6 files changed, 414 insertions(+) create mode 100644 kernel/sched/cap_gov.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index a171fef..654d70a 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
	  Be aware that not all cpufreq drivers support the conservative
	  governor. If unsure have a look at the help section of the
	  driver. Fallback governor will be the performance governor.
+config CPU_FREQ_DEFAULT_GOV_CAP_GOV
bool "cap_gov"
select CPU_FREQ_GOV_CAP_GOV
select CPU_FREQ_GOV_PERFORMANCE
help
Use the CPUfreq governor 'cap_gov' as default. This scales cpu
frequency from the scheduler as per-entity load tracking
statistics are updated.
endchoice
config CPU_FREQ_GOV_PERFORMANCE

@@ -183,6 +192,19 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_CAP_GOV
tristate "'capacity governor' cpufreq governor"
depends on CPU_FREQ
select CPU_FREQ_GOV_COMMON
help
'cap_gov' - this governor scales cpu frequency from the
scheduler as a function of cpu capacity utilization. It does
not evaluate utilization on a periodic basis (unlike ondemand)
but instead is invoked from CFS when updating per-entity load
tracking statistics.
If in doubt, say N.
comment "CPU frequency scaling drivers"
config CPUFREQ_DT

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 7cdf63a..4fc066f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
+extern struct cpufreq_governor cpufreq_gov_cap_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov)
 #endif

/*********************************************************************

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..da601d5 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o
diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
new file mode 100644
index 0000000..72873ab
--- /dev/null
+++ b/kernel/sched/cap_gov.c
@@ -0,0 +1,361 @@
+/*
- Copyright (C) 2014 Michael Turquette mturquette@linaro.org
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License version 2 as
- published by the Free Software Foundation.
- */
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include "sched.h"
+#define UP_THRESHOLD 95
Is this a leftover? In the changelog you say that you moved away from thresholds. Anyway, since we scale utilization by freq, I'm not sure we can live without some sort of up_threshold. The problem is that if you are running a task flat out on a CPU at a certain freq, let's say the lower one, you'll always get a usage for that CPU that corresponds to the current capacity of that CPU at that freq. As you use the usage signal to decide when to ramp up, you will never ramp up in this situation because the signal won't cross the capacity at the lower frequency.
We could solve this problem by putting the up threshold back. As soon as you cross it you go to max, and then adapt, choosing the right capacity for the actual, non capped, utilization of the task.
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+/*
- per-cpu pointer to atomic_t gov_data->cap_gov_wake_task
- used in scheduler hot paths {en,de}queue, task_tick without having to
- access struct cpufreq_policy and struct gov_data
- */
+static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
+/**
- gov_data - per-policy data internal to the governor
- @throttle: time until throttling period expires. Derived from THROTTLE_NSEC
- @task: worker task for dvfs transition that may block/sleep
- @need_wake_task: flag the governor to wake this policy's worker thread
- struct gov_data is the per-policy cap_gov-specific data structure. A
- per-policy instance of it is created when the cap_gov governor receives
- the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
- member of struct cpufreq_policy.
- Readers of this data must call down_read(policy->rwsem). Writers must
- call down_write(policy->rwsem).
- */
+struct gov_data {
ktime_t throttle;
unsigned int throttle_nsec;
struct task_struct *task;
atomic_t need_wake_task;
+};
+/**
- cap_gov_select_freq - pick the next frequency for a cpu
- @cpu: the cpu whose frequency may be changed
- cap_gov_select_freq works in a way similar to the ondemand governor. First
- we inspect the utilization of all of the cpus in this policy to find the
- most utilized cpu. This is achieved by calling get_cpu_usage, which returns
- frequency-invariant capacity utilization.
- This max utilization is compared against the up_threshold (default 95%
- utilization). If the max cpu utilization is greater than this threshold then
- we scale the policy up to the max frequency. Otherwise we find the lowest
- frequency (smallest cpu capacity) that is still larger than the max capacity
- utilization for this policy.
- Returns frequency selected.
- */
+static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy)
+{
int cpu = 0;
struct gov_data *gd;
int index;
unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
struct cpufreq_frequency_table *pos;
if (!policy->gov_data)
goto out;
gd = policy->gov_data;
/*
* get_cpu_usage is called without locking the runqueues. This is the
* same behavior used by find_busiest_cpu in load_balance. We are
* willing to accept occasionally stale data here in exchange for
* lockless behavior.
*/
for_each_cpu(cpu, policy->cpus) {
usage = get_cpu_usage(cpu);
trace_printk("cpu = %d usage = %lu", cpu, usage);
Here and below, do you want to post the patches with trace_printks?
Good catch. Will remove.
Proper tracepoint support can show up in a later patch.
if (usage > max_usage)
max_usage = usage;
}
trace_printk("max_usage = %lu", max_usage);
/* find the utilization threshold at which we scale up frequency */
index = cpufreq_frequency_table_get_index(policy, policy->cur);
/*
* converge towards max_usage. We want the lowest frequency whose
* capacity is >= to max_usage. In other words:
*
* find capacity == floor(usage)
*
* Sadly cpufreq freq tables are not guaranteed to be ordered by
* frequency...
*/
freq = policy->max;
cpufreq_for_each_entry(pos, policy->freq_table) {
cap = pos->frequency * SCHED_CAPACITY_SCALE /
policy->max;
if (max_usage < cap && pos->frequency < freq)
freq = pos->frequency;
trace_printk("cpu = %u max_usage = %lu cap = %lu table_freq = %u freq = %lu",
	     cpumask_first(policy->cpus), max_usage, cap,
	     pos->frequency, freq);
}
+out:
trace_printk("cpu %d final freq %lu", cpu, freq);
return freq;
+}
+/*
- we pass in struct cpufreq_policy. This is safe because changing out the
- policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
- which tears down all of the data structures and __cpufreq_governor(policy,
- CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
- new policy pointer
- */
+static int cap_gov_thread(void *data)
+{
struct sched_param param;
struct cpufreq_policy *policy;
struct gov_data *gd;
unsigned long freq;
int ret;
policy = (struct cpufreq_policy *) data;
if (!policy) {
pr_warn("%s: missing policy\n", __func__);
do_exit(-EINVAL);
}
gd = policy->gov_data;
if (!gd) {
pr_warn("%s: missing governor data\n", __func__);
do_exit(-EINVAL);
}
param.sched_priority = 0;
sched_setscheduler(current, SCHED_FIFO, &param);
set_cpus_allowed_ptr(current, policy->related_cpus);
We should check return values of these functions, use the in-kernel version of setscheduler and set a true RT prio for kthreads, something like:
param.sched_priority = 0;
sched_setscheduler(current, SCHED_FIFO, &param);
set_cpus_allowed_ptr(current, policy->related_cpus);
param.sched_priority = 50;
ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
if (ret) {
pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
do_exit(-EINVAL);
} else {
pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
__func__, gd->task->pid);
}
ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus);
if (ret) {
pr_warn("%s: failed to set allowed ptr\n", __func__);
do_exit(-EINVAL);
}
Yes, I had rolled in your code to do this in a previous version. I'll bring it back in.
/* main loop of the per-policy kthread */
do {
down_write(&policy->rwsem);
if (!atomic_read(&gd->need_wake_task)) {
if (kthread_should_stop())
break;
trace_printk("NOT waking up kthread (%d)", gd->task->pid);
up_write(&policy->rwsem);
set_current_state(TASK_INTERRUPTIBLE);
schedule();
continue;
}
trace_printk("kthread %d requested freq switch", gd->task->pid);
freq = cap_gov_select_freq(policy);
ret = __cpufreq_driver_target(policy, freq,
CPUFREQ_RELATION_H);
if (ret)
pr_debug("%s: __cpufreq_driver_target returned %d\n",
__func__, ret);
trace_printk("kthread %d requested freq switch", gd->task->pid);
gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
atomic_set(&gd->need_wake_task, 0);
up_write(&policy->rwsem);
} while (!kthread_should_stop());
do_exit(0);
+}
+static void cap_gov_wake_up_process(struct task_struct *task)
+{
/* this is null during early boot */
if (IS_ERR_OR_NULL(task)) {
return;
}
wake_up_process(task);
+}
+void cap_gov_kick_thread(int cpu)
+{
struct cpufreq_policy *policy;
struct gov_data *gd = NULL;
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy))
return;
gd = policy->gov_data;
if (!gd)
goto out;
/* per-cpu access not needed here since we have gd */
if (atomic_read(&gd->need_wake_task)) {
trace_printk("waking up kthread (%d)", gd->task->pid);
cap_gov_wake_up_process(gd->task);
}
+out:
cpufreq_cpu_put(policy);
+}
+/**
- cap_gov_update_cpu - interface to scheduler for changing capacity values
- @cpu: cpu whose capacity utilization has recently changed
- cap_gov_update_cpu is an interface exposed to the scheduler so that the
- scheduler may inform the governor of updates to capacity utilization and
- make changes to cpu frequency. Currently this interface is designed around
- PELT values in CFS. It can be expanded to other scheduling classes in the
- future if needed.
- The semantics of this call vary based on the cpu frequency scaling
- characteristics of the hardware.
- If kicking off a dvfs transition is an operation that might block or sleep
- in the cpufreq driver then we set the need_wake_task flag in this function
- and return. Selecting a frequency and programming it is done in a dedicated
- kernel thread which will be woken up from rebalance_domains. See
- cap_gov_kick_thread above.
- If kicking off a dvfs transition is an operation that returns quickly in the
- cpufreq driver and will never sleep then we select the frequency in this
- function and program the hardware for it in the scheduler hot path. No
- dedicated kthread is needed.
This is not something that we already have, right? This is of course fine, but IMHO we have to highlight this "problem" a bit more. Also, clearly state that code for that case is not part of this patchset.
As I stated in reply to Amit, I'm thinking of removing some of the above text since I removed support for "driver_might_sleep".
I want to keep the patch set as simple as possible, and regardless of whether or not we have async dvfs hardware it is still not possible to call __cpufreq_driver_target from within the schedule() context, so for now it is a moot point.
Regards, Mike
Thanks,
- Juri
- */
+void cap_gov_update_cpu(int cpu)
+{
struct cpufreq_policy *policy;
struct gov_data *gd;
/* XXX put policy pointer in per-cpu data? */
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy)) {
return;
}
if (!policy->gov_data) {
trace_printk("missing governor data");
goto out;
}
gd = policy->gov_data;
/* bail early if we are throttled */
if (ktime_before(ktime_get(), gd->throttle)) {
trace_printk("THROTTLED");
goto out;
}
atomic_set(per_cpu(cap_gov_wake_task, cpu), 1);
+out:
cpufreq_cpu_put(policy);
return;
+}
+static void cap_gov_start(struct cpufreq_policy *policy)
+{
int cpu;
struct gov_data *gd;
/* prepare per-policy private data */
gd = kzalloc(sizeof(*gd), GFP_KERNEL);
if (!gd) {
pr_debug("%s: failed to allocate private data\n", __func__);
return;
}
/*
* Don't ask for freq changes at a higher rate than what
* the driver advertises as transition latency.
*/
gd->throttle_nsec = policy->cpuinfo.transition_latency ?
policy->cpuinfo.transition_latency :
THROTTLE_NSEC;
pr_debug("%s: throttle threshold = %u [ns]\n",
__func__, gd->throttle_nsec);
/* save per-cpu pointer to per-policy need_wake_task */
for_each_cpu(cpu, policy->related_cpus)
per_cpu(cap_gov_wake_task, cpu) = &gd->need_wake_task;
/* init per-policy kthread */
gd->task = kthread_create(cap_gov_thread, policy, "kcap_gov_task");
if (IS_ERR_OR_NULL(gd->task))
pr_err("%s: failed to create kcap_gov_task thread\n", __func__);
policy->gov_data = gd;
+}
+static void cap_gov_stop(struct cpufreq_policy *policy)
+{
struct gov_data *gd;
gd = policy->gov_data;
policy->gov_data = NULL;
kthread_stop(gd->task);
/* FIXME replace with devm counterparts? */
kfree(gd);
+}
+static int cap_gov_setup(struct cpufreq_policy *policy, unsigned int event)
+{
switch (event) {
case CPUFREQ_GOV_START:
/* Start managing the frequency */
cap_gov_start(policy);
return 0;
case CPUFREQ_GOV_STOP:
cap_gov_stop(policy);
return 0;
case CPUFREQ_GOV_LIMITS: /* unused */
case CPUFREQ_GOV_POLICY_INIT: /* unused */
case CPUFREQ_GOV_POLICY_EXIT: /* unused */
break;
}
return 0;
+}
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV
+static
+#endif
+struct cpufreq_governor cpufreq_gov_cap_gov = {
.name = "cap_gov",
.governor = cap_gov_setup,
.owner = THIS_MODULE,
+};
+static int __init cap_gov_init(void)
+{
return cpufreq_register_governor(&cpufreq_gov_cap_gov);
+}
+static void __exit cap_gov_exit(void)
+{
cpufreq_unregister_governor(&cpufreq_gov_cap_gov);
+}
+/* Try to make this the default governor */
+fs_initcall(cap_gov_init);
+MODULE_LICENSE("GPL");
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b066a61..2ec2dc7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4257,6 +4257,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
		update_rq_runnable_avg(rq, rq->nr_running);
		add_nr_running(rq, 1);
	}
if (sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -4318,6 +4322,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
		sub_nr_running(rq, 1);
		update_rq_runnable_avg(rq, 1);
	}
if (sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -7768,6 +7776,14 @@ static void run_rebalance_domains(struct softirq_action *h)
	 */
	nohz_idle_balance(this_rq, idle);
	rebalance_domains(this_rq, idle);
/*
* FIXME some hardware does not require this, but current CPUfreq
* locking prevents us from changing cpu frequency with rq locks held
* and interrupts disabled
*/
if (sched_energy_freq())
cap_gov_kick_thread(cpu_of(this_rq));
}
/*

@@ -7821,6 +7837,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
		task_tick_numa(rq, curr);
update_rq_runnable_avg(rq, 1);
if (sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
}
/*

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fe57ba..c45f1ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1398,6 +1398,14 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
int get_cpu_usage(int cpu);
+#ifdef CONFIG_CPU_FREQ_GOV_CAP_GOV
+void cap_gov_update_cpu(int cpu);
+void cap_gov_kick_thread(int cpu);
+#else
+static inline void cap_gov_update_cpu(int cpu) {}
+static inline void cap_gov_kick_thread(int cpu) {}
+#endif
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));

--
1.9.1
Quoting Juri Lelli (2015-04-16 09:46:47)
On 16/04/15 06:29, Michael Turquette wrote:
+#define UP_THRESHOLD 95
Is this a leftover? In the changelog you say that you moved away from thresholds. Anyway, since we scale utilization by freq, I'm not sure we can live without some sort of up_threshold. The problem is that if you are running a task flat out on a CPU at a certain freq, let's say the lower one, you'll always get a usage for that CPU that corresponds to the current capacity of that CPU at that freq. As you use the usage signal to decide when to ramp up, you will never ramp up in this situation because the signal won't cross the capacity at the lower frequency.
Juri & Morten,
Yes, the UP_THRESHOLD constant is a leftover.
We discussed the issue of usage being capped at the current capacity in our call yesterday but I have some doubts. Let's forget big.little for a moment and talk about an SMP system. On my pandaboard I clearly see usage values taken directly from get_cpu_usage that scale up and down through the whole range (and as a result the cpu frequencies selected cover the whole range).
My current testing involves short running tasks that are quickly queued and dequeued, not a long running task as you suggest. Is there a different behavior in the way cfs.utilization_load_avg is used depending on task length?
Can you please explain why you feel that the return value of get_cpu_usage will not exceed the current capacity? I do not observe this behavior. Do you see this when testing only my branch? Or do you see it when merging my branch with the eas v3 series?
Vincent,
The value of cfs.utilization_load_avg is already normalized against the max possible capacity, right? I do not believe that the return value of get_cpu_usage is capped at the current capacity, but please let me know if I have a misunderstanding.
We could solve this problem by putting the up threshold back. As soon as you cross it you go to max, and then adapt, choosing the right capacity for the actual, non capped, utilization of the task.
Juri,
In my testing so far I have not seen a reason to add a threshold back in. I'm OK to do so but I need to be convinced. I did not exactly understand your point on the call yesterday so maybe we can figure it out here on the list.
Thanks a lot, Mike
On Tue, Apr 21, 2015 at 05:58:03PM +0100, Mike Turquette wrote:
Quoting Juri Lelli (2015-04-16 09:46:47)
On 16/04/15 06:29, Michael Turquette wrote:
+#define UP_THRESHOLD 95
Is this a leftover? In the changelog you say that you moved away from thresholds. Anyway, since we scale utilization by freq, I'm not sure we can live without some sort of up_threshold. The problem is that if you are running a task flat out on a CPU at a certain freq, let's say the lower one, you'll always get a usage for that CPU that corresponds to the current capacity of that CPU at that freq. As you use the usage signal to decide when to ramp up, you will never ramp up in this situation because the signal won't cross the capacity at the lower frequency.
Juri & Morten,
Yes, the UP_THRESHOLD constant is a leftover.
We discussed the issue of usage being capped at the current capacity in our call yesterday but I have some doubts. Let's forget big.little for a moment and talk about an SMP system. On my pandaboard I clearly see usage values taken directly from get_cpu_usage that scale up and down through the whole range (and as a result the cpu frequencies selected cover the whole range).
Let me clarify that 'capped' was the wrong word. It is converging towards the current capacity. Sorry for the confusion.
cfs.utilization_load_avg is the sum of PELT utilization for all tasks on the rq. Utilization is running time tracking which means that the sum can only temporarily and under special circumstances (such as task migration and fork) go above 100% (1024) if we ignore frequency invariance. If it goes above it will converge to 100% over time. It happens fairly quickly for forked tasks as their avg_period is small in the early life of a new task.
In Vincent's patch set, my patch 'sched: Make sched entity usage tracking scale-invariant' changes this a bit. In __update_entity_runnable_avg() we now scale the utilization PELT signal by freq_curr/freq_max. The sum (cfs.utilization_load_avg) is therefore also converging towards freq_curr/freq_max (*1024). For example, running at 300 MHz and freq_max = 1000 MHz, the sum is converging towards 307. Without any migrations or new tasks, the utilization will be in the range 0..307 no matter how many tasks are on the rq. Just as before, the sum may temporarily go above if you have new tasks being forked or tasks being migrated to the rq.
Let's take an example where you have a task an existing task waking up with a low utilization, say 100. It could be a webpage rendering thread that did minor updates to some webpage already loaded last time it was scheduled, but this time it is being scheduled to render a new webpage. The task PELT utilization is added to cfs.utilization_load_avg when it is enqueued, so the sum is now 100. freq_curr = 300 MHz. The task will start rendering the webpage and run for quite a while during which it will built up its PELT utilization. It will ramp up quickly in the beginning and converge towards 307 due to the freq_curr/freq_max scaling in __update_entity_runnable_avg(). Due to the properties of the geometric series it will converge slower and slower the closer we get to 307. PJT defined:
#define LOAD_AVG_MAX_N 345 /* number of full periods to produce LOAD_MAX_AVG */
where a period here is 1024 us. So if you don't have any other tasks causing any noise it may take quite a while to get to 307. Worst case 345 ms. If you do have noise you may not see this delay, but I wouldn't rely on it for determining when to increase the frequency.
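As a stand-alone sketch of that arithmetic (userspace C, not kernel code; the decay factor and LOAD_AVG_MAX follow the in-kernel definitions, everything else is illustrative):

#include <stdio.h>
#include <math.h>

#define LOAD_AVG_MAX	47742	/* limit of the geometric series */

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* decay: y^32 = 0.5 */
	double util_sum = 0.0;
	double freq_curr = 300.0, freq_max = 1000.0;
	int period;

	/* an always-running task contributes 1024 * freq_curr/freq_max
	 * per 1024us period; older contributions decay by y */
	for (period = 1; period <= 345; period++) {
		util_sum = util_sum * y + 1024.0 * freq_curr / freq_max;
		if (period % 69 == 0)
			printf("period %3d: usage ~ %3.0f\n", period,
			       util_sum * 1024.0 / LOAD_AVG_MAX);
	}
	/* output climbs towards 1024 * 300/1000 ~ 307, slowing as it
	 * approaches, matching the ~345-period worst case above */
	return 0;
}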
In Vincent's patches get_cpu_usage() returns a somewhat modified metric.
static int get_cpu_usage(int cpu)
{
	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
	unsigned long capacity = capacity_orig_of(cpu);

	if (usage >= SCHED_LOAD_SCALE)
		return capacity;

	return (usage * capacity) >> SCHED_LOAD_SHIFT;
}
The utilization is scaled and capped by cpu capacity. capacity_orig_of(cpu) is 1024 for non-SMT and non-big.LITTLE systems in which case get_cpu_usage() just enforces an upper limit for cfs.utilization_load_avg at 1024. For such systems get_cpu_usage() can be compared to normalized frequency (freq_curr*1024/freq_max). If you are running at 300 MHz, your normalized frequency is 300*1024/1000 = 307 and get_cpu_usage() will eventually return 307 if you have at least one always-running tasks on the cpu.
In mainline Linux, capacity != 1024 for SMT systems (determined by 1178/#hw_threads) and big.LITTLE systems with the clock-frequency property set in DT (which enables Vincent's capacity scaling code in topology.c and it enabled in exynos5420.dtsi). In this case get_cpu_usage() scales utilization to the range 0..capacity_orig_of(cpu).
If we take the example from before but now have an SMT system with two hw-threads per core, capacity_orig_of() = 589. If you have an always-running task and you are at 300 MHz, cfs.utilization_load_avg = 307 (as before), but get_cpu_usage() returns 307*589/1024 = 176. 307 is the convergence target again and won't go above unless other tasks show up and due to the capacity scaling in get_cpu_usage() it will never go above 176. If you were running at 1000 MHz (freq_max), get_cpu_usage() would return 589. You would never go above 589 despite your normalized frequency freq_curr*1024/freq_max = 1024*1024/1024 = 1024. So here you would be comparing usage on one scale (0..589) to frequency 'capacity' on another scale (0..1024). That is broken in my opinion. The same scaling must be applied on both sides. Either apply capacity_orig_of() scaling to the frequency or have a non-scaling version of get_cpu_usage().
The issue is the same for big.LITTLE systems. If you enabled Vincent's cpu_efficiency code for TC2 by setting the clock-frequency properties in the DT (as they are in the LSK tree), the A7 capacity_orig_of() = 606.
While I don't want big.LITTLE to be part of the sched/dvfs integration discussion, IMHO, we are working towards a goal of better scheduling and power management on all systems including big.LITTLE. So I think we should keep those in mind too and avoid cutting corners where we know it will cause trouble for some systems. I'm not asking for you to do big.LITTLE specific modifications or even mention it in the patch set, I'm just asking for minor changes that allows us to extend this to work for big.LITTLE as well.
My current testing involves short-running tasks that are quickly queued and dequeued, not a long-running task as you suggest. Is there a different behavior in the way cfs.utilization_load_avg is used depending on task length?
PELT utilization tracks the running time of the tasks. cfs.utilization_load_avg is the sum of the PELT utilization of all tasks on the rq. The PELT utilization builds up when a task is running and decays when it is blocked/sleeping. Keep in mind that PELT utilization is initialized to max but with a very short history, so the utilization value is very sensitive in the early life of a task.
Can you please explain why you feel that the return value of get_cpu_usage will not exceed the current capacity? I do not observe this behavior. Do you see this when testing only my branch? Or do you see it when merging my branch with the eas v3 series?
I think it is covered above. I haven't tested the patches myself, but Juri has confirmed that get_cpu_usage() is converging towards freq_curr*1024/freq_max using the user-space governor.
Vincent,
The value of cfs.utilization_load_avg is already normalized against the max possible capacity, right? I do not believe that the return value of get_cpu_usage is capped at the current capacity, but please let me know if I have a misunderstanding.
As said above, it is not capped but converging towards freq_curr*capacity_orig_of(cpu)/freq_max.
I hope that answers your questions, please let me know if it doesn't.
Thanks, Morten
On 22 April 2015 at 13:10, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Tue, Apr 21, 2015 at 05:58:03PM +0100, Mike Turquette wrote:
Quoting Juri Lelli (2015-04-16 09:46:47)
On 16/04/15 06:29, Michael Turquette wrote:
+#define UP_THRESHOLD 95
Is this a leftover? In the changelog you say that you moved away from thresholds. Anyway, since we scale utilization by freq, I'm not sure we can live without some sort of up_threshold. The problem is that if you are running a task flat out on a CPU at a certain freq, say the lowest one, you'll always get a usage for that CPU that corresponds to the current capacity of that CPU at that freq. Since you use the usage signal to decide when to ramp up, you will never ramp up in this situation, because the signal won't cross the capacity at the lower frequency.
Juri & Morten,
Yes, the UP_THRESHOLD constant is a leftover.
We discussed the issue of usage being capped at the current capacity in our call yesterday but I have some doubts. Let's forget big.LITTLE for a moment and talk about an SMP system. On my pandaboard I clearly see usage values taken directly from get_cpu_usage that scale up and down through the whole range (and as a result the selected cpu frequencies cover the whole range).
Let me clarify that 'capped' was the wrong word. It is converging towards the current capacity. Sorry for the confusion.
cfs.utilization_load_avg is the sum of the PELT utilization of all tasks on the rq. Utilization tracks running time, which means that, ignoring frequency invariance, the sum can go above 100% (1024) only temporarily and under special circumstances (such as task migration and fork). If it does go above, it will converge back to 100% over time. That happens fairly quickly for forked tasks, as their avg_period is small in the early life of a new task.
In Vincent's patch set, my patch 'sched: Make sched entity usage tracking scale-invariant' changes this a bit. In __update_entity_runnable_avg() we now scale the utilization PELT signal by freq_curr/freq_max. The sum (cfs.utilization_load_avg) is therefore also converging towards freq_curr/freq_max (*1024). For example, running at 300 MHz with freq_max = 1000 MHz, the sum converges towards 307. Without any migrations or new tasks, the utilization will be in the range 0..307 no matter how many tasks are on the rq. Just as before, the sum may temporarily go above that if you have new tasks being forked or tasks being migrated to the rq.
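A minimal sketch of that scaling step (the real code lives in __update_entity_runnable_avg(); the helper name here is invented for illustration): each chunk of running time is weighted by the current frequency before it is accumulated, which is why the sum converges towards freq_curr/freq_max * 1024.

static u64 scale_running_delta(struct sched_domain *sd, int cpu, u64 delta)
{
        /* arch_scale_freq_capacity() returns freq_curr * 1024 / freq_max */
        unsigned long scale_freq = arch_scale_freq_capacity(sd, cpu);

        return (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
}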
[...]
I agree with Morten that you have to use capacity_orig_of(CPU) instead of SCHED_CAPACITY_SCALE when you compare the compute capacity of a frequency with the current usage of the CPU.
get_cpu_usage is in the range [0..capacity_orig_of(CPU)], so you have to scale the compute capacity of the frequency point into the same range. As Morten points out, in SMP systems capacity_orig_of(CPU) is SCHED_CAPACITY_SCALE, but directly using this default value is a shortcut.
Regards, Vincent
On 22/04/15 12:10, Morten Rasmussen wrote:
[...]
Oh, I guess I have not much to add to Morten's reply :). I'll just add another example, hope it helps. Please find it attached (and commented below).
[...]
I think it is covered above. I haven't tested the patches myself, but Juri has confirmed that get_cpu_usage() is converging towards freq_curr*1024/freq_max using the user-space governor.
Right. So, I ran this simple example: a single task with a 100ms period and 90% duty cycle (at max freq) on one of the big cores of TC2 (freq_max = 1200MHz, freq_min = 500MHz). I also removed the clock-frequency property from the DT to make things simpler, so that we have capacity_orig_of = 1024. We then have:
freq_min * 1024 / freq_max = 500 * 1024 / 1200 = 426
and this is the value at which the task's utilization eventually converges (see Fig. 1). When we run at max instead, we can see the full range of utilization (Fig. 3 shows the utilization of the same task at max freq). I also plot running_avg_sum for both cases to highlight how the signal is scaled (or not) by frequency.
Hope this helps in addition to Morten's explanation.
Best,
- Juri
On 21 April 2015 at 18:58, Michael Turquette mturquette@linaro.org wrote:
[...]
Vincent,
The value of cfs.utilization_load_avg is already normalized against the max possible capacity, right? I do not believe that the return value of get_cpu_usage is capped at the current capacity, but please let me know if I have a misunderstanding.
You're right, get_cpu_usage is only capped by max capacity. Nevertheless, with the frequency invariance patches, the utilization of a sched entity is capped by the current capacity, so the usage, which is a sum of sched entity utilization, will also be capped by the current capacity.
We could solve this problem by putting the up threshold back. As soon as you cross it you go to max, and then adapt, choosing the right capacity for the actual, non-capped utilization of the task.
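A rough sketch of that idea, assuming usage and the current capacity are expressed on the same scale (pick_freq() and freq_floor_of_usage() are hypothetical names, not code from this series):

static unsigned long pick_freq(struct cpufreq_policy *policy,
                               unsigned long usage,
                               unsigned long capacity_cur)
{
        /*
         * Near the current capacity we cannot tell a saturated cpu from a
         * merely busy one, so jump to fmax and re-evaluate from there with
         * a signal that is no longer capped by the low frequency.
         */
        if (usage * 100 >= capacity_cur * UP_THRESHOLD)
                return policy->max;

        /* otherwise pick the lowest frequency whose capacity covers usage */
        return freq_floor_of_usage(policy, usage);
}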
Juri,
In my testing so far I have not seen a reason to add a threshold back in. I'm OK to do so but I need to be convinced. I did not exactly understand your point on the call yesterday so maybe we can figure it out here on the list.
Thanks a lot, Mike
On 16 April 2015 at 07:29, Michael Turquette mturquette@linaro.org wrote:
Scheduler-driven cpu frequency selection is desirable as part of the on-going effort to make the scheduler better aware of energy consumption. No piece of the Linux kernel has a better view of the factors that affect a cpu frequency selection policy than the scheduler[0], and this patch is an attempt to get that discussion going again.
This patch implements a cpufreq governor, cap_gov, that directly accesses scheduler statistics, in particular the pelt data from cfs via the get_cpu_usage() function.
Put plainly, cap_gov selects the lowest cpu frequency that will prevent a runqueue from being over-utilized (until we hit the highest frequency of course).
cap_gov converts available cpu frequencies into capacity states. When the utilization of a cfs runqueue changes then the policy selects the capacity state which is the floor of the new usage.
Unlike the previous posting from 2014[1] this governor implements no policy of its own (e.g. with tunable thresholds for determining when to scale frequency), but instead implements a "follow the usage" method, where usage is defined as the cpu frequency-invariant product of utilization_load_avg and cpu_capacity_orig.
This governor is event-driven. There is no polling loop to check cpu idle time, or any other method which is unsynchronized with the scheduler. The entry points for this policy are in fair.c: enqueue_task_fair, dequeue_task_fair and task_tick_fair. run_rebalance_domains is used to kick the worker thread, which avoids fatally re-entering the scheduler.
This policy is implemented using the cpufreq governor interface for two main reasons:
- re-using the cpufreq machine drivers without using the governor interface is hard.
- using the cpufreq interface allows us to switch between the scheduler-driven policy and legacy cpufreq governors such as ondemand at run-time. This is very useful for comparative testing and tuning.
Finally, it is worth mentioning that this approach neglects all scheduling classes except for cfs. It is possible to add support for deadline and other classes here, but I also wonder if a multi-governor approach would be a more maintainable solution, where the cpufreq core aggregates the constraints set by multiple governors. Supporting such an approach in the cpufreq core would also allow peripheral devices to place constraints on cpu frequency without having to hack such behavior in at the governor level.
Thanks to Juri Lelli juri.lelli@arm.com for doing a good bit of testing, bug fixing and contributing towards the design.
[0] http://article.gmane.org/gmane.linux.kernel/1499836 [1] https://lkml.org/lkml/2014/10/22/22
Signed-off-by: Michael Turquette mturquette@linaro.org
 drivers/cpufreq/Kconfig |  22 +++
 include/linux/cpufreq.h |   3 +
 kernel/sched/Makefile   |   1 +
 kernel/sched/cap_gov.c  | 361 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c     |  19 +++
 kernel/sched/sched.h    |   8 ++
 6 files changed, 414 insertions(+)
 create mode 100644 kernel/sched/cap_gov.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index a171fef..654d70a 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
          Be aware that not all cpufreq drivers support the conservative
          governor. If unsure have a look at the help section of the
          driver. Fallback governor will be the performance governor.
+config CPU_FREQ_DEFAULT_GOV_CAP_GOV
bool "cap_gov"
select CPU_FREQ_GOV_CAP_GOV
select CPU_FREQ_GOV_PERFORMANCE
help
Use the CPUfreq governor 'cap_gov' as default. This scales cpu
frequency from the scheduler as per-entity load tracking
statistics are updated.
endchoice
config CPU_FREQ_GOV_PERFORMANCE @@ -183,6 +192,19 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_CAP_GOV
tristate "'capacity governor' cpufreq governor"
depends on CPU_FREQ
select CPU_FREQ_GOV_COMMON
help
'cap_gov' - this governor scales cpu frequency from the
scheduler as a function of cpu capacity utilization. It does
not evaluate utilization on a periodic basis (unlike ondemand)
but instead is invoked from CFS when updating per-entity load
tracking statistics.
If in doubt, say N.
comment "CPU frequency scaling drivers"
config CPUFREQ_DT

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 7cdf63a..4fc066f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
+extern struct cpufreq_governor cpufreq_gov_cap_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov)
 #endif
 /*********************************************************************

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..da601d5 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o

diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
new file mode 100644
index 0000000..72873ab
--- /dev/null
+++ b/kernel/sched/cap_gov.c
@@ -0,0 +1,361 @@
+/*
+ * Copyright (C) 2014 Michael Turquette <mturquette@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include "sched.h"
+#define UP_THRESHOLD 95
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+/*
+ * per-cpu pointer to the policy's atomic_t gov_data->need_wake_task,
+ * used in scheduler hot paths {en,de}queue and task_tick without having to
+ * access struct cpufreq_policy and struct gov_data
+ */
+static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
+/**
+ * gov_data - per-policy data internal to the governor
+ * @throttle: time until throttling period expires. Derived from THROTTLE_NSEC
+ * @task: worker task for dvfs transition that may block/sleep
+ * @need_wake_task: flag the governor to wake this policy's worker thread
+ *
+ * struct gov_data is the per-policy cap_gov-specific data structure. A
+ * per-policy instance of it is created when the cap_gov governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct gov_data {
ktime_t throttle;
unsigned int throttle_nsec;
struct task_struct *task;
atomic_t need_wake_task;
+};
+/**
+ * cap_gov_select_freq - pick the next frequency for a cpu
+ * @policy: the cpufreq policy whose frequency may be changed
+ *
+ * cap_gov_select_freq works in a way similar to the ondemand governor. First
+ * we inspect the utilization of all of the cpus in this policy to find the
+ * most utilized cpu. This is achieved by calling get_cpu_usage, which returns
+ * frequency-invariant capacity utilization.
+ *
+ * This max utilization is compared against the up_threshold (default 95%
+ * utilization). If the max cpu utilization is greater than this threshold
+ * then we scale the policy up to the max frequency. Otherwise we find the
+ * lowest frequency (smallest cpu capacity) that is still larger than the
+ * max capacity utilization for this policy.
+ *
+ * Returns the selected frequency.
+ */
+static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy)
+{
int cpu = 0;
struct gov_data *gd;
int index;
unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
struct cpufreq_frequency_table *pos;
if (!policy->gov_data)
goto out;
gd = policy->gov_data;
/*
* get_cpu_usage is called without locking the runqueues. This is the
* same behavior used by find_busiest_cpu in load_balance. We are
* willing to accept occasionally stale data here in exchange for
* lockless behavior.
*/
for_each_cpu(cpu, policy->cpus) {
usage = get_cpu_usage(cpu);
trace_printk("cpu = %d usage = %lu", cpu, usage);
if (usage > max_usage)
max_usage = usage;
}
trace_printk("max_usage = %lu", max_usage);
/* find the utilization threshold at which we scale up frequency */
index = cpufreq_frequency_table_get_index(policy, policy->cur);
/*
* converge towards max_usage. We want the lowest frequency whose
* capacity is >= to max_usage. In other words:
*
* find capacity == floor(usage)
*
* Sadly cpufreq freq tables are not guaranteed to be ordered by
* frequency...
*/
freq = policy->max;
cpufreq_for_each_entry(pos, policy->freq_table) {
cap = pos->frequency * SCHED_CAPACITY_SCALE /
policy->max;
if (max_usage < cap && pos->frequency < freq)
freq = pos->frequency;
trace_printk("cpu = %u max_usage = %lu cap = %lu \
table_freq = %u freq = %lu",
cpumask_first(policy->cpus), max_usage, cap,
pos->frequency, freq);
}
+out:
trace_printk("cpu %d final freq %lu", cpu, freq);
return freq;
+}
+/*
+ * we pass in struct cpufreq_policy. This is safe because changing out the
+ * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
+ * which tears down all of the data structures and __cpufreq_governor(policy,
+ * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
+ * new policy pointer
+ */
+static int cap_gov_thread(void *data)
+{
struct sched_param param;
struct cpufreq_policy *policy;
struct gov_data *gd;
unsigned long freq;
int ret;
policy = (struct cpufreq_policy *) data;
if (!policy) {
pr_warn("%s: missing policy\n", __func__);
do_exit(-EINVAL);
}
gd = policy->gov_data;
if (!gd) {
pr_warn("%s: missing governor data\n", __func__);
do_exit(-EINVAL);
}
param.sched_priority = 0;
sched_setscheduler(current, SCHED_FIFO, ¶m);
set_cpus_allowed_ptr(current, policy->related_cpus);
/* main loop of the per-policy kthread */
do {
down_write(&policy->rwsem);
if (!atomic_read(&gd->need_wake_task)) {
if (kthread_should_stop())
break;
trace_printk("NOT waking up kthread (%d)", gd->task->pid);
up_write(&policy->rwsem);
set_current_state(TASK_INTERRUPTIBLE);
schedule();
continue;
}
trace_printk("kthread %d requested freq switch", gd->task->pid);
freq = cap_gov_select_freq(policy);
ret = __cpufreq_driver_target(policy, freq,
CPUFREQ_RELATION_H);
if (ret)
pr_debug("%s: __cpufreq_driver_target returned %d\n",
__func__, ret);
trace_printk("kthread %d requested freq switch", gd->task->pid);
gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
atomic_set(&gd->need_wake_task, 0);
up_write(&policy->rwsem);
} while (!kthread_should_stop());
do_exit(0);
+}
+static void cap_gov_wake_up_process(struct task_struct *task)
+{
/* this is null during early boot */
if (IS_ERR_OR_NULL(task)) {
return;
}
wake_up_process(task);
+}
+void cap_gov_kick_thread(int cpu)
+{
struct cpufreq_policy *policy;
struct gov_data *gd = NULL;
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy))
return;
gd = policy->gov_data;
if (!gd)
goto out;
/* per-cpu access not needed here since we have gd */
if (atomic_read(&gd->need_wake_task)) {
trace_printk("waking up kthread (%d)", gd->task->pid);
cap_gov_wake_up_process(gd->task);
}
+out:
cpufreq_cpu_put(policy);
+}
+/**
+ * cap_gov_update_cpu - interface to scheduler for changing capacity values
+ * @cpu: cpu whose capacity utilization has recently changed
+ *
+ * cap_gov_update_cpu is an interface exposed to the scheduler so that the
+ * scheduler may inform the governor of updates to capacity utilization and
+ * make changes to cpu frequency. Currently this interface is designed around
+ * PELT values in CFS. It can be expanded to other scheduling classes in the
+ * future if needed.
+ *
+ * The semantics of this call vary based on the cpu frequency scaling
+ * characteristics of the hardware.
+ *
+ * If kicking off a dvfs transition is an operation that might block or sleep
+ * in the cpufreq driver then we set the need_wake_task flag in this function
+ * and return. Selecting a frequency and programming it is done in a dedicated
+ * kernel thread which will be woken up from rebalance_domains. See
+ * cap_gov_kick_thread above.
+ *
+ * If kicking off a dvfs transition is an operation that returns quickly in
+ * the cpufreq driver and will never sleep then we select the frequency in
+ * this function and program the hardware for it in the scheduler hot path.
+ * No dedicated kthread is needed.
+ */
+void cap_gov_update_cpu(int cpu)
+{
struct cpufreq_policy *policy;
struct gov_data *gd;
/* XXX put policy pointer in per-cpu data? */
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy)) {
return;
}
if (!policy->gov_data) {
trace_printk("missing governor data");
goto out;
}
gd = policy->gov_data;
/* bail early if we are throttled */
if (ktime_before(ktime_get(), gd->throttle)) {
trace_printk("THROTTLED");
goto out;
}
atomic_set(per_cpu(cap_gov_wake_task, cpu), 1);
+out:
cpufreq_cpu_put(policy);
return;
+}
+static void cap_gov_start(struct cpufreq_policy *policy)
+{
int cpu;
struct gov_data *gd;
/* prepare per-policy private data */
gd = kzalloc(sizeof(*gd), GFP_KERNEL);
if (!gd) {
pr_debug("%s: failed to allocate private data\n", __func__);
return;
}
/*
* Don't ask for freq changes at a higher rate than what
* the driver advertises as transition latency.
*/
gd->throttle_nsec = policy->cpuinfo.transition_latency ?
policy->cpuinfo.transition_latency :
THROTTLE_NSEC;
pr_debug("%s: throttle threshold = %u [ns]\n",
__func__, gd->throttle_nsec);
/* save per-cpu pointer to per-policy need_wake_task */
for_each_cpu(cpu, policy->related_cpus)
per_cpu(cap_gov_wake_task, cpu) = &gd->need_wake_task;
/* init per-policy kthread */
gd->task = kthread_create(cap_gov_thread, policy, "kcap_gov_task");
if (IS_ERR_OR_NULL(gd->task))
pr_err("%s: failed to create kcap_gov_task thread\n", __func__);
policy->gov_data = gd;
+}
+static void cap_gov_stop(struct cpufreq_policy *policy)
+{
struct gov_data *gd;
gd = policy->gov_data;
policy->gov_data = NULL;
kthread_stop(gd->task);
/* FIXME replace with devm counterparts? */
kfree(gd);
+}
+static int cap_gov_setup(struct cpufreq_policy *policy, unsigned int event)
+{
switch (event) {
case CPUFREQ_GOV_START:
/* Start managing the frequency */
cap_gov_start(policy);
return 0;
case CPUFREQ_GOV_STOP:
cap_gov_stop(policy);
return 0;
case CPUFREQ_GOV_LIMITS: /* unused */
case CPUFREQ_GOV_POLICY_INIT: /* unused */
case CPUFREQ_GOV_POLICY_EXIT: /* unused */
break;
}
return 0;
+}
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV
+static
+#endif
+struct cpufreq_governor cpufreq_gov_cap_gov = {
.name = "cap_gov",
.governor = cap_gov_setup,
.owner = THIS_MODULE,
+};
+static int __init cap_gov_init(void)
+{
return cpufreq_register_governor(&cpufreq_gov_cap_gov);
+}
+static void __exit cap_gov_exit(void)
+{
cpufreq_unregister_governor(&cpufreq_gov_cap_gov);
+}
+/* Try to make this the default governor */
+fs_initcall(cap_gov_init);
+MODULE_LICENSE("GPL");

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b066a61..2ec2dc7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4257,6 +4257,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
                update_rq_runnable_avg(rq, rq->nr_running);
                add_nr_running(rq, 1);
        }
+	if (sched_energy_freq())
+		cap_gov_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -4318,6 +4322,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
                sub_nr_running(rq, 1);
                update_rq_runnable_avg(rq, 1);
        }
+	if (sched_energy_freq())
+		cap_gov_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -7768,6 +7776,14 @@ static void run_rebalance_domains(struct softirq_action *h)
         */
        nohz_idle_balance(this_rq, idle);
        rebalance_domains(this_rq, idle);
/*
* FIXME some hardware does not require this, but current CPUfreq
* locking prevents us from changing cpu frequency with rq locks held
* and interrupts disabled
*/
if (sched_energy_freq())
cap_gov_kick_thread(cpu_of(this_rq));
}
 /*

@@ -7821,6 +7837,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
                task_tick_numa(rq, curr);
update_rq_runnable_avg(rq, 1);
+	if (sched_energy_freq())
+		cap_gov_update_cpu(cpu_of(rq));
}
IIUC, you set a boolean each time a task is queued/dequeued on the cfs rq or when the tick fires. Then you wake up a thread that will choose the best freq for the current load the next time the SCHED_SOFTIRQ is raised. I see one potential issue: the load balance period can be large, 128ms (32*4) for a busy CPU of a quad core system. So in the worst case you can wait up to 128ms before being able to update the freq of a CPU.
As an example: a quad core system is idle. Task A wakes up on CPU0, and the tick fires just after the wake up and raises a load balance, which wakes up your thread. Your thread sets the frequency according to the current usage (10%). Task A has to compute something equivalent to 300ms at max CPU0 capacity, so you might wait up to 128ms before updating the frequency of CPU0, which is quite a long delay IMHO.
You might need another way to periodically wake up your thread. Why haven't you used a hrtimer, like the bandwidth limiter feature does?
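For reference, a minimal sketch of what an hrtimer-based kick could look like (loosely modelled on the CFS bandwidth timers; the gd->timer member and the callback are assumptions, not code from this series):

static enum hrtimer_restart cap_gov_timer_fn(struct hrtimer *timer)
{
        /* assumes struct gov_data gains a 'struct hrtimer timer' member */
        struct gov_data *gd = container_of(timer, struct gov_data, timer);

        /* bound the worst-case delay instead of waiting for SCHED_SOFTIRQ */
        if (atomic_read(&gd->need_wake_task))
                wake_up_process(gd->task);

        hrtimer_forward_now(timer, ns_to_ktime(gd->throttle_nsec));
        return HRTIMER_RESTART;
}

cap_gov_start() would then arm the timer with hrtimer_init() and hrtimer_start(), and cap_gov_stop() would cancel it.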
Regards, Vincent
 /*

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fe57ba..c45f1ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1398,6 +1398,14 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
int get_cpu_usage(int cpu);
+#ifdef CONFIG_CPU_FREQ_GOV_CAP_GOV
+void cap_gov_update_cpu(int cpu);
+void cap_gov_kick_thread(int cpu);
+#else
+static inline void cap_gov_update_cpu(int cpu) {}
+static inline void cap_gov_kick_thread(int cpu) {}
+#endif
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
        rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));

--
1.9.1
Quoting Vincent Guittot (2015-04-17 06:11:10)
[...]
IIUC, you set a boolean each time a task is queued/dequeued on the cfs rq or when the tick fires. Then you wake up a thread that will choose the best freq for the current load the next time the SCHED_SOFTIRQ is raised.
Thanks for the review Vincent!
Your understanding is correct. And the "event-driven" behavior is actually periodic in nature, due to the fact that we only wake the thread from run_rebalance_domains on the tick.
I see one potential issue: the load balance period can be large, 128ms (32*4) for a busy CPU of a quad core system. So in the worst case you can wait up to 128ms before being able to update the freq of a CPU.
As an example: a quad core system is idle. Task A wakes up on CPU0, and the tick fires just after the wake up and raises a load balance, which wakes up your thread. Your thread sets the frequency according to the current usage (10%). Task A has to compute something equivalent to 300ms at max CPU0 capacity, so you might wait up to 128ms before updating the frequency of CPU0, which is quite a long delay IMHO.
You might need another way to periodically wake up your thread. Why haven't you used a hrtimer, like the bandwidth limiter feature does?
Because I don't know about the bandwidth limiter stuff? ;-)
Thanks for the hint. I will look into that. Additionally, I'm trying to replace gov_data.need_wake_task and the cap_gov_kick_thread function with an irq_work call at the end of cap_gov_update_cpu. This irq_work worker would call wake_up_process.
This may get rid of the atomic and drop the weird periodic behavior of cap_gov_kick_thread.
Your hrtimer idea might be better though. I'm looking into it now.
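To sketch the hrtimer idea concretely (illustrative only, in the style of the CFS bandwidth period timer; the gd->timer field and cap_gov_timer() are assumptions, not code from this thread):

static enum hrtimer_restart cap_gov_timer(struct hrtimer *timer)
{
	struct gov_data *gd = container_of(timer, struct gov_data, timer);

	/* runs in hard irq context; only wake the worker, which may sleep */
	wake_up_process(gd->task);

	hrtimer_forward_now(timer, ns_to_ktime(gd->throttle_nsec));
	return HRTIMER_RESTART;
}

/* in cap_gov_start(), once the per-policy worker exists: */
hrtimer_init(&gd->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
gd->timer.function = cap_gov_timer;
hrtimer_start(&gd->timer, ns_to_ktime(gd->throttle_nsec), HRTIMER_MODE_REL);

This would decouple the wakeup from the load balance period, at the cost of a periodic timer per policy.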
Regards, Mike
Regards, Vincent
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fe57ba..c45f1ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1398,6 +1398,14 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 
 int get_cpu_usage(int cpu);
 
+#ifdef CONFIG_CPU_FREQ_GOV_CAP_GOV
+void cap_gov_update_cpu(int cpu);
+void cap_gov_kick_thread(int cpu);
+#else
+static inline void cap_gov_update_cpu(int cpu) {}
+static inline void cap_gov_kick_thread(int cpu) {}
+#endif
+
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
-- 
1.9.1
Hi Mike,
On 16 April 2015 at 01:29, Michael Turquette mturquette@linaro.org wrote:
Scheduler-driven cpu frequency selection is desirable as part of the on-going effort to make the scheduler better aware of energy consumption. No piece of the Linux kernel has a better view of the factors that affect a cpu frequency selection policy than the scheduler[0], and this patch is an attempt to get that discussion going again.
This patch implements a cpufreq governor, cap_gov, that directly accesses scheduler statistics, in particular the pelt data from cfs via the get_cpu_usage() function.
Put plainly, cap_gov selects the lowest cpu frequency that will prevent a runqueue from being over-utilized (until we hit the highest frequency of course).
cap_gov converts available cpu frequencies into capacity states. When the utilization of a cfs runqueue changes then the policy selects the capacity state which is the floor of the new usage.
Unlike the previous posting from 2014[1] this governor implements no policy of its own (e.g. with tunable thresholds for determining when to scale frequency), but instead implements a "follow the usage" method, where usage is defined as the cpu frequency-invariant product of utilization_load_avg and cpu_capacity_orig.
This governor is event-driven. There is no polling loop to check cpu idle time, nor any other method which is unsynchronized with the scheduler. The entry points for this policy are in fair.c: enqueue_task_fair, dequeue_task_fair and task_tick_fair. run_rebalance_domains is used to kick the worker thread to prevent fatally re-entering the scheduler.
This policy is implemented using the cpufreq governor interface for two main reasons:
- re-using the cpufreq machine drivers without using the governor interface is hard.
- using the cpufreq interface allows us to switch between the scheduler-driven policy and legacy cpufreq governors such as ondemand at run-time. This is very useful for comparative testing and tuning.
Finally, it is worth mentioning that this approach neglects all scheduling classes except for cfs. It is possible to add support for deadline and other classes here, but I also wonder if a multi-governor approach would be a more maintainable solution, where the cpufreq core aggregates the constraints set by multiple governors. Supporting such an approach in the cpufreq core would also allow peripheral devices to place constraints on cpu frequency without having to hack such behavior in at the governor level.
Thanks to Juri Lelli juri.lelli@arm.com for doing a good bit of testing, bug fixing and contributing towards the design.
[0] http://article.gmane.org/gmane.linux.kernel/1499836 [1] https://lkml.org/lkml/2014/10/22/22
Signed-off-by: Michael Turquette mturquette@linaro.org
 drivers/cpufreq/Kconfig |  22 +++
 include/linux/cpufreq.h |   3 +
 kernel/sched/Makefile   |   1 +
 kernel/sched/cap_gov.c  | 361 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c     |  19 +++
 kernel/sched/sched.h    |   8 ++
 6 files changed, 414 insertions(+)
 create mode 100644 kernel/sched/cap_gov.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index a171fef..654d70a 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
 	  Be aware that not all cpufreq drivers support the conservative
 	  governor. If unsure have a look at the help section of the driver.
 	  Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_CAP_GOV
+	bool "cap_gov"
+	select CPU_FREQ_GOV_CAP_GOV
+	select CPU_FREQ_GOV_PERFORMANCE
+	help
+	  Use the CPUfreq governor 'cap_gov' as default. This scales cpu
+	  frequency from the scheduler as per-entity load tracking
+	  statistics are updated.
 endchoice
 
 config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +192,19 @@ config CPU_FREQ_GOV_CONSERVATIVE
 	  If in doubt, say N.
 
+config CPU_FREQ_GOV_CAP_GOV
+	tristate "'capacity governor' cpufreq governor"
+	depends on CPU_FREQ
+	select CPU_FREQ_GOV_COMMON
+	help
+	  'cap_gov' - this governor scales cpu frequency from the
+	  scheduler as a function of cpu capacity utilization. It does
+	  not evaluate utilization on a periodic basis (unlike ondemand)
+	  but instead is invoked from CFS when updating per-entity load
+	  tracking statistics.
+
+	  If in doubt, say N.
+
 comment "CPU frequency scaling drivers"
 
 config CPUFREQ_DT
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 7cdf63a..4fc066f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
+extern struct cpufreq_governor cpufreq_gov_cap_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_cap_gov)
 #endif
 
 /*********************************************************************
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..da601d5 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o
diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
new file mode 100644
index 0000000..72873ab
--- /dev/null
+++ b/kernel/sched/cap_gov.c
@@ -0,0 +1,361 @@
+/*
+ * Copyright (C) 2014 Michael Turquette <mturquette@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+
+#include "sched.h"
+
+#define UP_THRESHOLD		95
+#define THROTTLE_NSEC		50000000 /* 50ms default */
+
+/*
+ * per-cpu pointer to atomic_t gov_data->cap_gov_wake_task
+ * used in scheduler hot paths {en,de}queue, task_tick without having to
+ * access struct cpufreq_policy and struct gov_data
+ */
+static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
+
+/**
+ * gov_data - per-policy data internal to the governor
+ * @throttle: time until throttling period expires. Derived from THROTTLE_NSEC
+ * @task: worker task for dvfs transition that may block/sleep
+ * @need_wake_task: flag the governor to wake this policy's worker thread
+ *
+ * struct gov_data is the per-policy cap_gov-specific data structure. A
+ * per-policy instance of it is created when the cap_gov governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct gov_data {
+	ktime_t throttle;
+	unsigned int throttle_nsec;
+	struct task_struct *task;
+	atomic_t need_wake_task;
+};
+
+/**
+ * cap_gov_select_freq - pick the next frequency for a cpu
+ * @cpu: the cpu whose frequency may be changed
+ *
+ * cap_gov_select_freq works in a way similar to the ondemand governor. First
+ * we inspect the utilization of all of the cpus in this policy to find the
+ * most utilized cpu. This is achieved by calling get_cpu_usage, which returns
+ * frequency-invariant capacity utilization.
+ *
+ * This max utilization is compared against the up_threshold (default 95%
+ * utilization). If the max cpu utilization is greater than this threshold then
+ * we scale the policy up to the max frequency. Otherwise we find the lowest
+ * frequency (smallest cpu capacity) that is still larger than the max capacity
+ * utilization for this policy.
+ *
+ * Returns the selected frequency.
+ */
+static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy)
+{
+	int cpu = 0;
+	struct gov_data *gd;
+	int index;
+	unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
+	struct cpufreq_frequency_table *pos;
+
+	if (!policy->gov_data)
+		goto out;
+
+	gd = policy->gov_data;
+
+	/*
+	 * get_cpu_usage is called without locking the runqueues. This is the
+	 * same behavior used by find_busiest_cpu in load_balance. We are
+	 * willing to accept occasionally stale data here in exchange for
+	 * lockless behavior.
+	 */
+	for_each_cpu(cpu, policy->cpus) {
+		usage = get_cpu_usage(cpu);
+		trace_printk("cpu = %d usage = %lu", cpu, usage);
+		if (usage > max_usage)
+			max_usage = usage;
+	}
+	trace_printk("max_usage = %lu", max_usage);
+
+	/* find the utilization threshold at which we scale up frequency */
+	index = cpufreq_frequency_table_get_index(policy, policy->cur);
+
+	/*
+	 * converge towards max_usage. We want the lowest frequency whose
+	 * capacity is >= to max_usage. In other words:
+	 *
+	 * find capacity == floor(usage)
+	 *
+	 * Sadly cpufreq freq tables are not guaranteed to be ordered by
+	 * frequency...
+	 */
+	freq = policy->max;
+	cpufreq_for_each_entry(pos, policy->freq_table) {
+		cap = pos->frequency * SCHED_CAPACITY_SCALE /
+			policy->max;
+		if (max_usage < cap && pos->frequency < freq)
+			freq = pos->frequency;
+		trace_printk("cpu = %u max_usage = %lu cap = %lu \
+				table_freq = %u freq = %lu",
+				cpumask_first(policy->cpus), max_usage, cap,
+				pos->frequency, freq);
+	}
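To make the table-to-capacity conversion above concrete, with made-up numbers: if policy->max is 2000000 kHz and the table holds {500000, 1000000, 1500000, 2000000}, the computed caps are {256, 512, 768, 1024}; a max_usage of 600 then selects 1500000 kHz, the lowest frequency whose capacity (768) still covers the usage.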
This code assumes all backend drivers will support a frequency table. I think this may not always be true (e.g. pcc-cpufreq.c, which is already upstream, or even the WIP CPPC driver). What do you think about detecting whether the backend supports a freq table, and if not, just passing max_usage down to the driver and letting the driver handle it from there?
Regards, Ashwin.
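A minimal sketch of the fallback Ashwin suggests, assuming policy->freq_table is NULL when the backend has no table (the check and its placement in cap_gov_select_freq() are illustrative, not part of the posted patch):

	/*
	 * sketch: no frequency table, so request a usage-proportional
	 * frequency and let the cpufreq backend snap it to whatever
	 * granularity it supports
	 */
	if (!policy->freq_table) {
		freq = max_usage * policy->max / SCHED_CAPACITY_SCALE;
		goto out;
	}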
On Fri, Apr 17, 2015 at 03:34:30PM +0100, Ashwin Chaugule wrote:
Hi Mike,
On 16 April 2015 at 01:29, Michael Turquette mturquette@linaro.org wrote:
[...]
This code assumes all backend drivers will support a frequency table. I think this may not always be true (e.g. pcc-cpufreq.c, which is already upstream, or even the WIP CPPC driver). What do you think about detecting whether the backend supports a freq table, and if not, just passing max_usage down to the driver and letting the driver handle it from there?
Unless you change the driver interface it has to be a frequency passed to the driver.
Do you actually need to consider the OPP frequencies at all? It looks for the lowest frequency with enough capacity for max_usage, but I don't think it is necessary to find a valid OPP frequency. __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_H) should be able to handle that as long as we provide it a minimum frequency (freq) which is sufficient for max_usage.
Can't we get away with just returning:
(max_usage * policy->max)/SCHED_CAPACITY_SCALE
and avoid having to look at the frequency table?
It isn't as simple as that, though; some margin or threshold has to be added so that we don't have to wait until max_usage == cap before we ask for a higher OPP. It takes ages to get there (100+ ms).
Btw, I noted that __cpufreq_driver_target() is deprecated. Are you planning on using that for the CPPC driver? Based on my rather limited understanding of CPPC I would have guessed you would go with setpolicy() instead?
Morten
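A table-free selection along the lines Morten describes might look like the sketch below; the ~25% headroom is an arbitrary illustration, not a value agreed in this thread:

static unsigned long usage_to_freq(struct cpufreq_policy *policy,
				   unsigned long max_usage)
{
	/*
	 * add ~25% headroom so a higher OPP is requested well before
	 * usage saturates at the capacity of the current one
	 */
	unsigned long target = max_usage + (max_usage >> 2);

	if (target > SCHED_CAPACITY_SCALE)
		target = SCHED_CAPACITY_SCALE;

	return target * policy->max / SCHED_CAPACITY_SCALE;
}

The result could then be handed to __cpufreq_driver_target() and snapped to a real OPP by the backend.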
Hi Morten,
On 17 April 2015 at 11:37, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, Apr 17, 2015 at 03:34:30PM +0100, Ashwin Chaugule wrote:
Hi Mike,
On 16 April 2015 at 01:29, Michael Turquette mturquette@linaro.org wrote:
[...]
This code assumes all backend drivers will support a frequency table. I think this may not always be true (e.g. pcc-cpufreq.c, which is already upstream, or even the WIP CPPC driver). What do you think about detecting whether the backend supports a freq table, and if not, just passing max_usage down to the driver and letting the driver handle it from there?
Unless you change the driver interface it has to be a frequency passed to the driver.
Sure, but the pcc-cpufreq driver doesn't export a freq table, although it seemingly accepts freq requests. I'm not sure if there are other similar examples. Why not pass the next freq request proportional to the load, like ondemand does, and let the backend take care of snapping the request to a table if it supports one? That would keep things consistent with the current design. (Sorry if I misunderstood cpu usage; still grokking it.)
Do you actually need to consider the OPP frequencies at all? It looks for the lowest frequency with enough capacity for max_usage, but I don't think it is necessary to find a valid OPP frequency. __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_H) should be able to handle that as long as we provide it a minimum frequency (freq) which is sufficient for max_usage.
Can't we get away with just returning:
(max_usage * policy->max)/SCHED_CAPACITY_SCALE
and avoid having to look at the frequency table?
It isn't as simple as that, though; some margin or threshold has to be added so that we don't have to wait until max_usage == cap before we ask for a higher OPP. It takes ages to get there (100+ ms).
Btw, I noted that __cpufreq_driver_target() is deprecated. Are you planning on using that for the CPPC driver? Based on my rather limited understanding of CPPC I would have guessed you would go with setpolicy() instead?
I saw that comment too, but I don't think all drivers have (or can have) a discretized freq table, so I don't see how that call can be removed entirely. For CPPC, setpolicy() was the initial approach with PID as the governor, but after some experiments and talking to Rafael, we found that using the target() method + ondemand is a better solution, at least initially.
Regards, Ashwin.
Quoting Ashwin Chaugule (2015-04-17 11:37:44)
On 17 April 2015 at 11:37, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Fri, Apr 17, 2015 at 03:34:30PM +0100, Ashwin Chaugule wrote:
On 16 April 2015 at 01:29, Michael Turquette mturquette@linaro.org wrote:
[...]
This code assumes all backend drivers will support a frequency table. I think this may not always be true (e.g. pcc-cpufreq.c, which is already upstream, or even the WIP CPPC driver). What do you think about detecting whether the backend supports a freq table, and if not, just passing max_usage down to the driver and letting the driver handle it from there?
Ashwin, you are correct that this code assumes a freq table. Just a bit of laziness on my part to get the code out there.
Unless you change the driver interface it has to be a frequency passed to the driver.
Sure, but the pcc-cpufreq driver doesn't export a freq table, although it seemingly accepts freq requests. I'm not sure if there are other similar examples. Why not pass the next freq request proportional to the load, like ondemand does, and let the backend take care of snapping the request to a table if it supports one? That would keep things consistent with the current design. (Sorry if I misunderstood cpu usage; still grokking it.)
Do you actually need to consider the OPP frequencies at all? It looks for the lowest frequency with enough capacity for max_usage, but I don't think it is necessary to find a valid OPP frequency. __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_H) should be able to handle that as long as we provide it a minimum frequency (freq) which is sufficient for max_usage.
Can't we get away with just returning:
(max_usage * policy->max)/SCHED_CAPACITY_SCALE
and avoid having to look at the frequency table?
The 2014 sched-freq rfc had something like this. I switched to the freq table for a variety of reasons, some of which seemed important early on and are not important now.
But I still think it might be an optimization for the driver to use the table if it exists, instead of passing an arbitrary frequency which must get mapped to the table in the driver backend.
It isn't as simple as that, though; some margin or threshold has to be added so that we don't have to wait until max_usage == cap before we ask for a higher OPP. It takes ages to get there (100+ ms).
This isn't always true. First, I'd like to avoid a threshold if possible. I'm still not convinced it is necessary.
Secondly, for workloads that create new tasks and place them on the runqueue, we ramp up almost immediately because the load of a new task is maxed out by default (I hope I got that right).
In my testing, which mostly involves interactive scripts that I have written, I see near-instantaneous responses when I am idle and then I start some work.
Testing with a real-world use case would be great, especially something like Android.
Amit, can the backport stuff help us here? Is there an Android build that uses the eas 3.10 backport? Any testing using it?
Btw, I noted that __cpufreq_driver_target() is deprecated. Are you planning on using that for the CPPC driver? Based on my rather limited understanding of CPPC I would have guessed you would go with setpolicy() instead?
I saw that comment too, but I don't think all drivers have (or can have) a discretized freq table, so I don't see how that call can be removed entirely. For CPPC, setpolicy() was the initial approach with PID as the governor, but after some experiments and talking to Rafael, we found that using the target() method + ondemand is a better solution, at least initially.
I guess we can sort this bit out when the series is posted to LKML.
Regards, Mike
Regards, Ashwin.
On Tue, Apr 21, 2015 at 10:41 PM, Michael Turquette mturquette@linaro.org wrote:
Quoting Ashwin Chaugule (2015-04-17 11:37:44)
[...]
Amit, can the backport stuff help us here? Is there an Android build that uses the eas 3.10 backport? Any testing using it?
ARM has enabled Android on Juno with a subset of that backport. Juri and Morten should be able to help test.
/Amit
Quoting Amit Kucheria (2015-04-21 11:02:29)
On Tue, Apr 21, 2015 at 10:41 PM, Michael Turquette mturquette@linaro.org wrote:
Quoting Ashwin Chaugule (2015-04-17 11:37:44)
[...]
Amit, can the backport stuff help us here? Is there an Android build that uses the eas 3.10 backport? Any testing using it?
ARM has enabled Android on Juno with a subset of that backport. Juri and Morten should be able to help test.
Thanks. The important thing is to test cap_gov with a minimum of dependency patches. Basically Vincent's cpu capacity rework and the cpufreq capacity patches that are the first two patches in this series.
Adding in the energy model makes it an apples to oranges comparison.
Regards, Mike
/Amit
Hi,
On 16/04/15 06:29, Michael Turquette wrote:
[...]
Please let me know what you think of this series, including code as well as cover letter and commitlog text. I was not able to finish the irq_work additions to the governor in time for me to submit this tonight (these changes remove the periodic behavior of calling cap_gov_kick_thread from run_rebalance_domains). I'll focus on the irq_work stuff tomorrow and post an addenedum to this series asap.
IMHO, we should try to have this bit as well for the posting on LKML. I would like to receive early feedback on it, so that we can also ask for advice on alternative solutions if the thing is not going to fly :).
Thanks for this post Mike.
Best,
- Juri
Quoting Juri Lelli (2015-04-16 10:00:33)
Hi,
On 16/04/15 06:29, Michael Turquette wrote:
[...]
Please let me know what you think of this series, including code as well as cover letter and commitlog text. I was not able to finish the irq_work additions to the governor in time for me to submit this tonight (these changes remove the periodic behavior of calling cap_gov_kick_thread from run_rebalance_domains). I'll focus on the irq_work stuff tomorrow and post an addenedum to this series asap.
IMHO, we should try to have this bit as well for the posting on LKML. I would like to receive early feedback on it, so that we can also ask for advice on alternative solutions if the thing is not going to fly :).
I agree. I'm inclined to keep it as a separate patch just to give some options and different ways of doing things. Also, we need to benchmark the performance costs/benefits of raising the IPI versus doing the atomic operations on need_wake_task.
What do you think about keeping the irq_work stuff separate?
Regards, Mike
Hi,
On 16/04/15 19:52, Michael Turquette wrote:
Quoting Juri Lelli (2015-04-16 10:00:33)
Hi,
On 16/04/15 06:29, Michael Turquette wrote:
[...]
Please let me know what you think of this series, including code as well as cover letter and commitlog text. I was not able to finish the irq_work additions to the governor in time for me to submit this tonight (these changes remove the periodic behavior of calling cap_gov_kick_thread from run_rebalance_domains). I'll focus on the irq_work stuff tomorrow and post an addenedum to this series asap.
IMHO, we should try to have this bit as well for the posting on LKML. I would like to receive early feedback on it, so that we can also ask for advice on alternative solutions if the thing is not going to fly :).
I agree. I'm inclined to keep it as a separate patch just to give some options and different ways of doing things. Also, we need to benchmark the performance costs/benefits of raising the IPI versus doing the atomic operations on need_wake_task.
What do you think about keeping the irq_work stuff separate?
Yes, no problem with that if the plan is to post it as part of this patchset.
Thanks,
- Juri
This patch demonstrates how to queue up irq_work from within schedule() context, which we then use to queue up a normal work_struct to select frequency and program it.
The call graph looks like this:

    {en,de}queue_task_fair or task_tick_fair
      -> cap_gov_update_cpu (same as before)
        -> cap_gov_irq_work (new irq_work callback)
          -> cap_gov_work (new work_struct callback)
I've removed cap_gov_thread and the corresponding kernel thread that implemented its own loop. This also gets rid of the atomic need_wake_task variable and the need to call schedule() manually in the kthread loop.
The work_struct runs to completion, no loop necessary. We queue it on the same cpu that ran the *_task_fair function that kicked off the whole chain.
I'm not sure this is better and I don't understand the implications of queueing up this work all the time. I assume it is Not Good for scheduler stats? Maybe we should filter that out?
Note that this avoids the reentrancy issues faced by Morten back in October 2013 since we're not queueing up work_struct with rq locks held.
This patch is only smoke tested. The GOV_STOP path doesn't work, so switching away from cap_gov to ondemand will fail.
Thoughts?
Signed-off-by: Michael Turquette <mturquette@linaro.org>
---
 kernel/sched/cap_gov.c | 119 ++++++++++++++++---------------------------------
 kernel/sched/fair.c    |   8 ----
 2 files changed, 39 insertions(+), 88 deletions(-)

diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
index 72873ab..8380564 100644
--- a/kernel/sched/cap_gov.c
+++ b/kernel/sched/cap_gov.c
@@ -10,6 +10,8 @@
 #include <linux/module.h>
 #include <linux/kthread.h>
 #include <linux/percpu.h>
+#include <linux/irq_work.h>
+#include <linux/workqueue.h>
 
 #include "sched.h"
 
@@ -42,6 +44,9 @@ struct gov_data {
 	unsigned int throttle_nsec;
 	struct task_struct *task;
 	atomic_t need_wake_task;
+	struct irq_work irq_work;
+	struct work_struct work;
+	struct cpufreq_policy *policy;
 };
 
 /**
@@ -117,101 +122,56 @@ out:
 	return freq;
 }
 
-/*
- * we pass in struct cpufreq_policy. This is safe because changing out the
- * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
- * which tears down all of the data structures and __cpufreq_governor(policy,
- * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
- * new policy pointer
- */
-static int cap_gov_thread(void *data)
+static void cap_gov_work(struct work_struct *work)
 {
-	struct sched_param param;
 	struct cpufreq_policy *policy;
 	struct gov_data *gd;
 	unsigned long freq;
 	int ret;
 
-	policy = (struct cpufreq_policy *) data;
+	gd = container_of(work, struct gov_data, work);
+	if (!gd) {
+		pr_err("%s: here0.5: no gd!\n", __func__);
+		return;
+	}
+
+	policy = gd->policy;
 	if (!policy) {
+		pr_err("%s: here1.5!\n", __func__);
 		pr_warn("%s: missing policy\n", __func__);
-		do_exit(-EINVAL);
+		return;
 	}
 
-	gd = policy->gov_data;
-	if (!gd) {
-		pr_warn("%s: missing governor data\n", __func__);
-		do_exit(-EINVAL);
-	}
+	down_write(&policy->rwsem);
 
-	param.sched_priority = 0;
-	sched_setscheduler(current, SCHED_FIFO, &param);
-	set_cpus_allowed_ptr(current, policy->related_cpus);
-
-	/* main loop of the per-policy kthread */
-	do {
-		down_write(&policy->rwsem);
-		if (!atomic_read(&gd->need_wake_task)) {
-			if (kthread_should_stop())
-				break;
-			trace_printk("NOT waking up kthread (%d)", gd->task->pid);
-			up_write(&policy->rwsem);
-			set_current_state(TASK_INTERRUPTIBLE);
-			schedule();
-			continue;
-		}
-
-		trace_printk("kthread %d requested freq switch", gd->task->pid);
-
-		freq = cap_gov_select_freq(policy);
-
-		ret = __cpufreq_driver_target(policy, freq,
-				CPUFREQ_RELATION_H);
-		if (ret)
-			pr_debug("%s: __cpufreq_driver_target returned %d\n",
-					__func__, ret);
-
-		trace_printk("kthread %d requested freq switch", gd->task->pid);
-
-		gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
-		atomic_set(&gd->need_wake_task, 0);
-		up_write(&policy->rwsem);
-	} while (!kthread_should_stop());
-
-	do_exit(0);
-}
+	trace_printk("kthread %d requested freq switch", gd->task->pid);
 
-static void cap_gov_wake_up_process(struct task_struct *task)
-{
-	/* this is null during early boot */
-	if (IS_ERR_OR_NULL(task)) {
-		return;
-	}
+	freq = cap_gov_select_freq(policy);
+
+	ret = __cpufreq_driver_target(policy, freq,
+			CPUFREQ_RELATION_H);
+	if (ret)
+		pr_err("%s: __cpufreq_driver_target returned %d\n",
+				__func__, ret);
 
-	wake_up_process(task);
+	trace_printk("kthread %d requested freq switch", gd->task->pid);
+
+	gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
+	up_write(&policy->rwsem);
 }
 
-void cap_gov_kick_thread(int cpu)
+static void cap_gov_irq_work(struct irq_work *irq_work)
 {
 	struct cpufreq_policy *policy;
-	struct gov_data *gd = NULL;
+	struct gov_data *gd;
 
-	policy = cpufreq_cpu_get(cpu);
-	if (IS_ERR_OR_NULL(policy))
+	gd = container_of(irq_work, struct gov_data, irq_work);
+	if (!gd) {
+		pr_err("%s: here0.5: no gd!\n", __func__);
 		return;
-
-	gd = policy->gov_data;
-	if (!gd)
-		goto out;
-
-	/* per-cpu access not needed here since we have gd */
-	if (atomic_read(&gd->need_wake_task)) {
-		trace_printk("waking up kthread (%d)", gd->task->pid);
-		cap_gov_wake_up_process(gd->task);
 	}
 
-out:
-	cpufreq_cpu_put(policy);
+	schedule_work_on(raw_smp_processor_id(), &gd->work);
 }
 
 /**
@@ -262,7 +222,8 @@ void cap_gov_update_cpu(int cpu)
 		goto out;
 	}
 
-	atomic_set(per_cpu(cap_gov_wake_task, cpu), 1);
+	if (irq_work_queue_on(&gd->irq_work, cpu))
+		trace_printk("could not queue irq_work");
 
 out:
 	cpufreq_cpu_put(policy);
@@ -295,12 +256,10 @@ static void cap_gov_start(struct cpufreq_policy *policy)
 	for_each_cpu(cpu, policy->related_cpus)
 		per_cpu(cap_gov_wake_task, cpu) = &gd->need_wake_task;
 
-	/* init per-policy kthread */
-	gd->task = kthread_create(cap_gov_thread, policy, "kcap_gov_task");
-	if (IS_ERR_OR_NULL(gd->task))
-		pr_err("%s: failed to create kcap_gov_task thread\n", __func__);
-
+	init_irq_work(&gd->irq_work, cap_gov_irq_work);
+	INIT_WORK(&gd->work, cap_gov_work);
 	policy->gov_data = gd;
+	gd->policy = policy;
 }
 
 static void cap_gov_stop(struct cpufreq_policy *policy)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2ec2dc7..16d73a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7776,14 +7776,6 @@ static void run_rebalance_domains(struct softirq_action *h)
 	 */
 	nohz_idle_balance(this_rq, idle);
 	rebalance_domains(this_rq, idle);
-
-	/*
-	 * FIXME some hardware does not require this, but current CPUfreq
-	 * locking prevents us from changing cpu frequency with rq locks held
-	 * and interrupts disabled
-	 */
-	if (sched_energy_freq())
-		cap_gov_kick_thread(cpu_of(this_rq));
 }
 
 /*
-- 
1.9.1
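On the broken GOV_STOP path, a sketch of the missing teardown; the posting does not show cap_gov_stop(), so its shape here is an assumption:

static void cap_gov_stop(struct cpufreq_policy *policy)
{
	struct gov_data *gd = policy->gov_data;

	irq_work_sync(&gd->irq_work);	/* wait out a pending hard-irq kick */
	cancel_work_sync(&gd->work);	/* flush the frequency-change work */

	policy->gov_data = NULL;
	kfree(gd);
}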
Hi Mike,
On 22/04/15 06:16, Michael Turquette wrote:
This patch demonstrates how to queue up irq_work from within schedule() context, which we then use to queue up a normal work_struct to select frequency and program it.
Call graph looks like this:
{en,de}queue_task_fair or task_tick_fair -> cap_gov_update_cpu (same as before) -> cap_gov_irq_work (new irq_work callback) -> cap_gov_work (new work_struct callback)
I've removed cap_gov_thread and the corresponding kernel thread that implemented its own loop. This also gets rid of the atomic need_wake_task variable and the need to call schedule() manually in the kthread loop.
Can you expand a bit more on why you think completely moving away from the kthread is better/more desirable?
The work_struct runs to completion, no loop necessary. We queue it on the same cpu that ran the *_task_fair function that kicked off the whole chain.
I'm not sure this is better and I don't understand the implications of queueing up this work all the time. I assume it is Not Good for scheduler stats? Maybe we should filter that out?
So, the implications I see in removing the kthread are that the work we queue ends up in the same bucket as everyone else using that queue. It also seems we still have some reentrancy issues: usually we don't want to trigger anything when the thing being scheduled is the one responsible for changing frequency. With the kthread we can temporarily work around this by setting the kthread to some RT prio, and then maybe associate some special flag with it; the filtering seems simpler if we just know that we have to filter out cap_gov's kthreads.
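Purely to illustrate the filter being discussed here, a minimal sketch; demo_should_kick() is an invented helper, while gd->task, current and struct gov_data come from the patch.

/*
 * Hypothetical reentrancy filter, assuming the per-policy kthread is
 * kept: skip utilization updates triggered by scheduling the governor's
 * own worker, so a frequency change can never recursively kick another.
 */
static bool demo_should_kick(struct gov_data *gd)
{
	/* "current" is the task whose enqueue/dequeue triggered the update */
	return !IS_ERR_OR_NULL(gd->task) && current != gd->task;
}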
Note that this avoids the reentrancy issues faced by Morten back in October 2013 since we're not queueing up work_struct with rq locks held.
IMHO, we still have some issues, as said above.
This patch is only smoke tested. The GOV_STOP path doesn't work, so switching away from cap_gov to ondemand will fail.
Thoughts?
Comments on the code follow.
Signed-off-by: Michael Turquette mturquette@linaro.org
 kernel/sched/cap_gov.c | 119 ++++++++++++++++---------------------------------
 kernel/sched/fair.c    |   8 ----
 2 files changed, 39 insertions(+), 88 deletions(-)
diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
index 72873ab..8380564 100644
--- a/kernel/sched/cap_gov.c
+++ b/kernel/sched/cap_gov.c
@@ -10,6 +10,8 @@
 #include <linux/module.h>
 #include <linux/kthread.h>
 #include <linux/percpu.h>
+#include <linux/irq_work.h>
+#include <linux/workqueue.h>
 
 #include "sched.h"
 
@@ -42,6 +44,9 @@ struct gov_data {
 	unsigned int throttle_nsec;
 	struct task_struct *task;
 	atomic_t need_wake_task;
We can remove these two fields above.
+	struct irq_work irq_work;
+	struct work_struct work;
+	struct cpufreq_policy *policy;
 };
 
 /**
@@ -117,101 +122,56 @@ out:
 	return freq;
 }
 
-/*
- * we pass in struct cpufreq_policy. This is safe because changing out the
- * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
- * which tears down all of the data structures and __cpufreq_governor(policy,
- * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
- * new policy pointer
- */
-static int cap_gov_thread(void *data)
+static void cap_gov_work(struct work_struct *work)
 {
-	struct sched_param param;
 	struct cpufreq_policy *policy;
 	struct gov_data *gd;
 	unsigned long freq;
 	int ret;
 
-	policy = (struct cpufreq_policy *) data;
+	gd = container_of(work, struct gov_data, work);
+	if (!gd) {
+		pr_err("%s: here0.5: no gd!\n", __func__);
+		return;
+	}
+
+	policy = gd->policy;
 	if (!policy) {
-		pr_warn("%s: missing policy\n", __func__);
-		do_exit(-EINVAL);
+		pr_err("%s: here1.5!\n", __func__);
+		return;
 	}
 
-	gd = policy->gov_data;
-	if (!gd) {
-		pr_warn("%s: missing governor data\n", __func__);
-		do_exit(-EINVAL);
-	}
+	down_write(&policy->rwsem);
 
-	param.sched_priority = 0;
-	sched_setscheduler(current, SCHED_FIFO, &param);
-	set_cpus_allowed_ptr(current, policy->related_cpus);
-
-	/* main loop of the per-policy kthread */
-	do {
-		down_write(&policy->rwsem);
-		if (!atomic_read(&gd->need_wake_task)) {
-			if (kthread_should_stop())
-				break;
-			trace_printk("NOT waking up kthread (%d)", gd->task->pid);
-			up_write(&policy->rwsem);
-			set_current_state(TASK_INTERRUPTIBLE);
-			schedule();
-			continue;
-		}
-
-		trace_printk("kthread %d requested freq switch", gd->task->pid);
-
-		freq = cap_gov_select_freq(policy);
-
-		ret = __cpufreq_driver_target(policy, freq,
-				CPUFREQ_RELATION_H);
-		if (ret)
-			pr_debug("%s: __cpufreq_driver_target returned %d\n",
-					__func__, ret);
-
-		trace_printk("kthread %d requested freq switch", gd->task->pid);
-		gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
-		atomic_set(&gd->need_wake_task, 0);
-		up_write(&policy->rwsem);
-	} while (!kthread_should_stop());
-
-	do_exit(0);
-}
-
+	trace_printk("kthread %d requested freq switch", gd->task->pid);
This goes away too (quite apart from it being a trace_printk(), gd->task no longer exists).
-static void cap_gov_wake_up_process(struct task_struct *task)
-{
-	/* this is null during early boot */
-	if (IS_ERR_OR_NULL(task)) {
-		return;
-	}
+	freq = cap_gov_select_freq(policy);
+
+	ret = __cpufreq_driver_target(policy, freq,
+			CPUFREQ_RELATION_H);
+	if (ret)
+		pr_err("%s: __cpufreq_driver_target returned %d\n",
+				__func__, ret);
 
-	wake_up_process(task);
+	trace_printk("kthread %d requested freq switch", gd->task->pid);
Ditto.
+	gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
+	up_write(&policy->rwsem);
 }
 
-void cap_gov_kick_thread(int cpu)
+static void cap_gov_irq_work(struct irq_work *irq_work)
 {
 	struct cpufreq_policy *policy;
-	struct gov_data *gd = NULL;
+	struct gov_data *gd;
 
-	policy = cpufreq_cpu_get(cpu);
-	if (IS_ERR_OR_NULL(policy))
+	gd = container_of(irq_work, struct gov_data, irq_work);
+	if (!gd) {
+		pr_err("%s: here0.5: no gd!\n", __func__);
 		return;
-
-	gd = policy->gov_data;
-	if (!gd)
-		goto out;
-
-	/* per-cpu access not needed here since we have gd */
-	if (atomic_read(&gd->need_wake_task)) {
-		trace_printk("waking up kthread (%d)", gd->task->pid);
-		cap_gov_wake_up_process(gd->task);
 	}
 
-out:
-	cpufreq_cpu_put(policy);
+	schedule_work_on(raw_smp_processor_id(), &gd->work)
Missing semicolon :).
-	schedule_work_on(raw_smp_processor_id(), &gd->work)
+	schedule_work_on(raw_smp_processor_id(), &gd->work);
 }
 
 /**
@@ -262,7 +222,8 @@ void cap_gov_update_cpu(int cpu)
 		goto out;
 	}
 
-	atomic_set(per_cpu(cap_gov_wake_task, cpu), 1);
+	if (irq_work_queue_on(&gd->irq_work, cpu))
+		trace_printk("could not queue irq_work");
Maybe a pr_err() here instead?
 out:
 	cpufreq_cpu_put(policy);
@@ -295,12 +256,10 @@ static void cap_gov_start(struct cpufreq_policy *policy)
 	for_each_cpu(cpu, policy->related_cpus)
 		per_cpu(cap_gov_wake_task, cpu) = &gd->need_wake_task;
 
-	/* init per-policy kthread */
-	gd->task = kthread_create(cap_gov_thread, policy, "kcap_gov_task");
-	if (IS_ERR_OR_NULL(gd->task))
-		pr_err("%s: failed to create kcap_gov_task thread\n", __func__);
-
+	init_irq_work(&gd->irq_work, cap_gov_irq_work);
+	INIT_WORK(&gd->work, cap_gov_work);
 	policy->gov_data = gd;
+	gd->policy = policy;
 }
You should also add something like this:
@@ -269,7 +270,7 @@ static void cap_gov_stop(struct cpufreq_policy *policy)
 	gd = policy->gov_data;
 	policy->gov_data = NULL;
 
-	kthread_stop(gd->task);
 
 	/* FIXME replace with devm counterparts? */
 	kfree(gd);
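For what it's worth, a sketch of what the reworked GOV_STOP path might look like once the kthread is gone. This is untested and only illustrative: the use of irq_work_sync()/cancel_work_sync() here is a suggestion, not anything in the patch, assuming the only in-flight work is the irq_work and the work_struct.

/*
 * Possible shape of the stop path without the kthread: flush anything
 * queued by cap_gov_update_cpu() before freeing gd, so neither the
 * irq_work nor the work_struct can run on freed memory.
 */
static void cap_gov_stop(struct cpufreq_policy *policy)
{
	struct gov_data *gd = policy->gov_data;

	policy->gov_data = NULL;

	irq_work_sync(&gd->irq_work);	/* wait for a pending irq_work */
	cancel_work_sync(&gd->work);	/* and for the queued work_struct */

	/* FIXME replace with devm counterparts? */
	kfree(gd);
}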
Thanks,
- Juri