Quoting Juri Lelli (2015-04-16 09:46:47)
Hi Mike,
On 16/04/15 06:29, Michael Turquette wrote:
Scheduler-driven cpu frequency selection is desirable as part of the on-going effort to make the scheduler better aware of energy consumption. No piece of the Linux kernel has a better view of the factors that affect a cpu frequency selection policy than the scheduler[0], and this patch is an attempt to get that discussion going again.
This patch implements a cpufreq governor, cap_gov, that directly accesses scheduler statistics, in particular the pelt data from cfs via the get_cpu_usage() function.
Put plainly, cap_gov selects the lowest cpu frequency that will prevent a runqueue from being over-utilized (until we hit the highest frequency of course).
cap_gov converts the available cpu frequencies into capacity states. When the utilization of a cfs runqueue changes, the governor selects the capacity state which is the floor of the new usage.
Unlike the previous posting from 2014[1] this governor implements no policy of its own (e.g. with tunable thresholds for determining when to scale frequency), but instead implements a "follow the usage" method, where usage is defined as the cpu frequency-invariant product of utilization_load_avg and cpu_capacity_orig.
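For illustration, the usage value amounts to roughly the following (a sketch only; the hypothetical name sketch_get_cpu_usage and the exact scaling are illustrative, and the real get_cpu_usage() in fair.c is the authoritative definition):

	/*
	 * Sketch: frequency-invariant utilization of a cpu's cfs runqueue,
	 * scaled by that cpu's original (maximum) capacity.
	 */
	static unsigned long sketch_get_cpu_usage(int cpu)
	{
		unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
		unsigned long capacity = capacity_orig_of(cpu);

		/* utilization is capped at full capacity */
		if (usage >= SCHED_LOAD_SCALE)
			return capacity;

		return (usage * capacity) >> SCHED_LOAD_SHIFT;
	}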
This governor is event-driven. There is no polling loop to check cpu idle time, nor any other method that is unsynchronized with the scheduler. The entry points for this policy are in fair.c: enqueue_task_fair, dequeue_task_fair and task_tick_fair. run_rebalance_domains is used to kick the worker thread, which avoids fatally re-entering the scheduler.
This policy is implemented using the cpufreq governor interface for two main reasons:
- re-using the cpufreq machine drivers without using the governor
interface is hard.
- using the cpufreq interface allows us to switch between the
scheduler-driven policy and legacy cpufreq governors such as ondemand at run-time. This is very useful for comparative testing and tuning.
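(Concretely that is just the usual cpufreq sysfs interface; assuming the standard layout, writing "cap_gov" or "ondemand" to /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor switches the policy for that cpu at run-time.)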
Finally, it is worth mentioning that this approach neglects all scheduling classes except for cfs. It is possible to add support for deadline and other classes here, but I also wonder if a multi-governor approach would be a more maintainable solution, where the cpufreq core aggregates the constraints set by multiple governors. Supporting such an approach in the cpufreq core would also allow peripheral devices to place constraints on cpu frequency without having to hack such behavior in at the governor level.
Thanks to Juri Lelli <juri.lelli@arm.com> for doing a good bit of testing, bug fixing and contributing towards the design.
[0] http://article.gmane.org/gmane.linux.kernel/1499836
[1] https://lkml.org/lkml/2014/10/22/22
Signed-off-by: Michael Turquette <mturquette@linaro.org>
 drivers/cpufreq/Kconfig |  22 +++
 include/linux/cpufreq.h |   3 +
 kernel/sched/Makefile   |   1 +
 kernel/sched/cap_gov.c  | 361 ++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c     |  19 +++
 kernel/sched/sched.h    |   8 ++
 6 files changed, 414 insertions(+)
 create mode 100644 kernel/sched/cap_gov.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index a171fef..654d70a 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
	  Be aware that not all cpufreq drivers support the conservative
	  governor. If unsure have a look at the help section of the
	  driver. Fallback governor will be the performance governor.
+config CPU_FREQ_DEFAULT_GOV_CAP_GOV
bool "cap_gov"
select CPU_FREQ_GOV_CAP_GOV
select CPU_FREQ_GOV_PERFORMANCE
help
Use the CPUfreq governor 'cap_gov' as default. This scales cpu
frequency from the scheduler as per-entity load tracking
statistics are updated.
endchoice
 config CPU_FREQ_GOV_PERFORMANCE

@@ -183,6 +192,19 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_CAP_GOV
tristate "'capacity governor' cpufreq governor"
depends on CPU_FREQ
select CPU_FREQ_GOV_COMMON
help
'cap_gov' - this governor scales cpu frequency from the
scheduler as a function of cpu capacity utilization. It does
not evaluate utilization on a periodic basis (unlike ondemand)
but instead is invoked from CFS when updating per-entity load
tracking statistics.
If in doubt, say N.
comment "CPU frequency scaling drivers"
 config CPUFREQ_DT

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 7cdf63a..4fc066f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
+extern struct cpufreq_governor cpufreq_gov_cap_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_cap_gov)
 #endif
 /*********************************************************************

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..da601d5 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o

diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
new file mode 100644
index 0000000..72873ab
--- /dev/null
+++ b/kernel/sched/cap_gov.c
@@ -0,0 +1,361 @@
+/*
+ * Copyright (C) 2014 Michael Turquette <mturquette@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include "sched.h"
+#define UP_THRESHOLD 95
Is this a leftover? In the changelog you say that you moved away from thresholds. Anyway, since we scale utilization by frequency, I'm not sure we can live without some sort of up_threshold. The problem is that if you are running a task flat out on a CPU at a certain freq, let's say the lowest one, you'll always get a usage for that CPU that corresponds to the current capacity of that CPU at that freq. Since you use the usage signal to decide when to ramp up, you will never ramp up in this situation, because the signal won't cross the capacity available at the lower frequency.
We could solve this problem by putting the up threshold back. As soon as you cross it you go to max, and then adapt, choosing the right capacity for the actual, uncapped, utilization of the task.
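Something along these lines, just to illustrate the idea (a sketch only, reusing the UP_THRESHOLD define above; the helper name cap_gov_should_go_max is made up and this is not meant as code for this patch):

	/*
	 * Sketch: true when the most utilized cpu sits above UP_THRESHOLD
	 * percent of the capacity available at the current frequency, in
	 * which case the policy should jump straight to policy->max.
	 */
	static bool cap_gov_should_go_max(struct cpufreq_policy *policy,
					  unsigned long max_usage)
	{
		unsigned long cap_cur = policy->cur * SCHED_CAPACITY_SCALE /
					policy->max;

		return max_usage * 100 >= cap_cur * UP_THRESHOLD;
	}

Once that triggers you go to max, and the next evaluation sees the uncapped utilization and can settle on the right capacity.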
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+/*
+ * per-cpu pointer to atomic_t gov_data->cap_gov_wake_task
+ *
+ * used in scheduler hot paths {en,de}queue and task_tick without having to
+ * access struct cpufreq_policy and struct gov_data
+ */
+static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
+/**
+ * gov_data - per-policy data internal to the governor
+ * @throttle: time until throttling period expires. Derived from THROTTLE_NSEC
+ * @task: worker task for dvfs transition that may block/sleep
+ * @need_wake_task: flag the governor to wake this policy's worker thread
+ *
+ * struct gov_data is the per-policy cap_gov-specific data structure. A
+ * per-policy instance of it is created when the cap_gov governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct gov_data {
ktime_t throttle;
unsigned int throttle_nsec;
struct task_struct *task;
atomic_t need_wake_task;
+};
+/**
+ * cap_gov_select_freq - pick the next frequency for a cpu
+ * @policy: the cpufreq policy whose frequency may be changed
+ *
+ * cap_gov_select_freq works in a way similar to the ondemand governor. First
+ * we inspect the utilization of all of the cpus in this policy to find the
+ * most utilized cpu. This is achieved by calling get_cpu_usage, which returns
+ * frequency-invariant capacity utilization.
+ *
+ * This max utilization is compared against the up_threshold (default 95%
+ * utilization). If the max cpu utilization is greater than this threshold then
+ * we scale the policy up to the max frequency. Otherwise we find the lowest
+ * frequency (smallest cpu capacity) that is still larger than the max capacity
+ * utilization for this policy.
+ *
+ * Returns the selected frequency.
+ */
+static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy)
+{
int cpu = 0;
struct gov_data *gd;
int index;
unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
struct cpufreq_frequency_table *pos;
if (!policy->gov_data)
goto out;
gd = policy->gov_data;
/*
* get_cpu_usage is called without locking the runqueues. This is the
* same behavior used by find_busiest_cpu in load_balance. We are
* willing to accept occasionally stale data here in exchange for
* lockless behavior.
*/
for_each_cpu(cpu, policy->cpus) {
usage = get_cpu_usage(cpu);
trace_printk("cpu = %d usage = %lu", cpu, usage);
Here and below, do you want to post the patches with trace_printks?
Good catch. Will remove.
Proper tracepoint support can show up in a later patch.
if (usage > max_usage)
max_usage = usage;
}
trace_printk("max_usage = %lu", max_usage);
/* find the utilization threshold at which we scale up frequency */
index = cpufreq_frequency_table_get_index(policy, policy->cur);
/*
* converge towards max_usage. We want the lowest frequency whose
* capacity is >= to max_usage. In other words:
*
* find capacity == floor(usage)
*
* Sadly cpufreq freq tables are not guaranteed to be ordered by
* frequency...
*/
freq = policy->max;
cpufreq_for_each_entry(pos, policy->freq_table) {
cap = pos->frequency * SCHED_CAPACITY_SCALE /
policy->max;
if (max_usage < cap && pos->frequency < freq)
freq = pos->frequency;
trace_printk("cpu = %u max_usage = %lu cap = %lu \
table_freq = %u freq = %lu",
cpumask_first(policy->cpus), max_usage, cap,
pos->frequency, freq);
}
+out:
trace_printk("cpu %d final freq %lu", cpu, freq);
return freq;
+}
+/*
+ * we pass in struct cpufreq_policy. This is safe because changing out the
+ * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
+ * which tears down all of the data structures and __cpufreq_governor(policy,
+ * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
+ * new policy pointer
+ */
+static int cap_gov_thread(void *data)
+{
struct sched_param param;
struct cpufreq_policy *policy;
struct gov_data *gd;
unsigned long freq;
int ret;
policy = (struct cpufreq_policy *) data;
if (!policy) {
pr_warn("%s: missing policy\n", __func__);
do_exit(-EINVAL);
}
gd = policy->gov_data;
if (!gd) {
pr_warn("%s: missing governor data\n", __func__);
do_exit(-EINVAL);
}
param.sched_priority = 0;
sched_setscheduler(current, SCHED_FIFO, &param);
set_cpus_allowed_ptr(current, policy->related_cpus);
We should check return values of these functions, use the in-kernel version of setscheduler and set a true RT prio for kthreads, something like:
param.sched_priority = 0;
sched_setscheduler(current, SCHED_FIFO, &param);
set_cpus_allowed_ptr(current, policy->related_cpus);
param.sched_priority = 50;
ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
if (ret) {
pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
do_exit(-EINVAL);
} else {
pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
__func__, gd->task->pid);
}
ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus);
if (ret) {
pr_warn("%s: failed to set allowed ptr\n", __func__);
do_exit(-EINVAL);
}
Yes, I had rolled in your code to do this in a previous version. I'll bring it back in.
/* main loop of the per-policy kthread */
do {
down_write(&policy->rwsem);
if (!atomic_read(&gd->need_wake_task)) {
if (kthread_should_stop())
break;
trace_printk("NOT waking up kthread (%d)", gd->task->pid);
up_write(&policy->rwsem);
set_current_state(TASK_INTERRUPTIBLE);
schedule();
continue;
}
trace_printk("kthread %d requested freq switch", gd->task->pid);
freq = cap_gov_select_freq(policy);
ret = __cpufreq_driver_target(policy, freq,
CPUFREQ_RELATION_H);
if (ret)
pr_debug("%s: __cpufreq_driver_target returned %d\n",
__func__, ret);
trace_printk("kthread %d requested freq switch", gd->task->pid);
gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
atomic_set(&gd->need_wake_task, 0);
up_write(&policy->rwsem);
} while (!kthread_should_stop());
do_exit(0);
+}
+static void cap_gov_wake_up_process(struct task_struct *task)
+{
/* this is null during early boot */
if (IS_ERR_OR_NULL(task)) {
return;
}
wake_up_process(task);
+}
+void cap_gov_kick_thread(int cpu)
+{
struct cpufreq_policy *policy;
struct gov_data *gd = NULL;
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy))
return;
gd = policy->gov_data;
if (!gd)
goto out;
/* per-cpu access not needed here since we have gd */
if (atomic_read(&gd->need_wake_task)) {
trace_printk("waking up kthread (%d)", gd->task->pid);
cap_gov_wake_up_process(gd->task);
}
+out:
cpufreq_cpu_put(policy);
+}
+/**
+ * cap_gov_update_cpu - interface to scheduler for changing capacity values
+ * @cpu: cpu whose capacity utilization has recently changed
+ *
+ * cap_gov_update_cpu is an interface exposed to the scheduler so that the
+ * scheduler may inform the governor of updates to capacity utilization and
+ * make changes to cpu frequency. Currently this interface is designed around
+ * PELT values in CFS. It can be expanded to other scheduling classes in the
+ * future if needed.
+ *
+ * The semantics of this call vary based on the cpu frequency scaling
+ * characteristics of the hardware.
+ *
+ * If kicking off a dvfs transition is an operation that might block or sleep
+ * in the cpufreq driver then we set the need_wake_task flag in this function
+ * and return. Selecting a frequency and programming it is done in a dedicated
+ * kernel thread which will be woken up from rebalance_domains. See
+ * cap_gov_kick_thread above.
+ *
+ * If kicking off a dvfs transition is an operation that returns quickly in the
+ * cpufreq driver and will never sleep then we select the frequency in this
+ * function and program the hardware for it in the scheduler hot path. No
+ * dedicated kthread is needed.
This is not something that we already have, right? This is of course fine, but IMHO we have to highlight this "problem" a bit more. Also, clearly state that code for that case is not part of this patchset.
As I stated in reply to Amit, I'm thinking of removing some of the above text since I removed support for "driver_might_sleep".
I want to keep the patch set as simple as possible, and regardless of whether or not we have async dvfs hardware, it is still not possible to call __cpufreq_driver_target() from within the schedule() context, so for now it is a moot point.
Regards, Mike
Thanks,
- Juri
+ */
+void cap_gov_update_cpu(int cpu)
+{
struct cpufreq_policy *policy;
struct gov_data *gd;
/* XXX put policy pointer in per-cpu data? */
policy = cpufreq_cpu_get(cpu);
if (IS_ERR_OR_NULL(policy)) {
return;
}
if (!policy->gov_data) {
trace_printk("missing governor data");
goto out;
}
gd = policy->gov_data;
/* bail early if we are throttled */
if (ktime_before(ktime_get(), gd->throttle)) {
trace_printk("THROTTLED");
goto out;
}
atomic_set(per_cpu(cap_gov_wake_task, cpu), 1);
+out:
cpufreq_cpu_put(policy);
return;
+}
+static void cap_gov_start(struct cpufreq_policy *policy)
+{
int cpu;
struct gov_data *gd;
/* prepare per-policy private data */
gd = kzalloc(sizeof(*gd), GFP_KERNEL);
if (!gd) {
pr_debug("%s: failed to allocate private data\n", __func__);
return;
}
/*
* Don't ask for freq changes at a higher rate than what
* the driver advertises as transition latency.
*/
gd->throttle_nsec = policy->cpuinfo.transition_latency ?
policy->cpuinfo.transition_latency :
THROTTLE_NSEC;
pr_debug("%s: throttle threshold = %u [ns]\n",
__func__, gd->throttle_nsec);
/* save per-cpu pointer to per-policy need_wake_task */
for_each_cpu(cpu, policy->related_cpus)
per_cpu(cap_gov_wake_task, cpu) = &gd->need_wake_task;
/* init per-policy kthread */
gd->task = kthread_create(cap_gov_thread, policy, "kcap_gov_task");
if (IS_ERR_OR_NULL(gd->task))
pr_err("%s: failed to create kcap_gov_task thread\n", __func__);
policy->gov_data = gd;
+}
+static void cap_gov_stop(struct cpufreq_policy *policy)
+{
struct gov_data *gd;
gd = policy->gov_data;
policy->gov_data = NULL;
kthread_stop(gd->task);
/* FIXME replace with devm counterparts? */
kfree(gd);
+}
+static int cap_gov_setup(struct cpufreq_policy *policy, unsigned int event)
+{
switch (event) {
case CPUFREQ_GOV_START:
/* Start managing the frequency */
cap_gov_start(policy);
return 0;
case CPUFREQ_GOV_STOP:
cap_gov_stop(policy);
return 0;
case CPUFREQ_GOV_LIMITS: /* unused */
case CPUFREQ_GOV_POLICY_INIT: /* unused */
case CPUFREQ_GOV_POLICY_EXIT: /* unused */
break;
}
return 0;
+}
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV
+static
+#endif
+struct cpufreq_governor cpufreq_gov_cap_gov = {
.name = "cap_gov",
.governor = cap_gov_setup,
.owner = THIS_MODULE,
+};
+static int __init cap_gov_init(void)
+{
return cpufreq_register_governor(&cpufreq_gov_cap_gov);
+}
+static void __exit cap_gov_exit(void)
+{
cpufreq_unregister_governor(&cpufreq_gov_cap_gov);
+}
+/* Try to make this the default governor */
+fs_initcall(cap_gov_init);
+MODULE_LICENSE("GPL"); diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b066a61..2ec2dc7 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -4257,6 +4257,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) update_rq_runnable_avg(rq, rq->nr_running); add_nr_running(rq, 1); }
if(sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -4318,6 +4322,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		sub_nr_running(rq, 1);
 		update_rq_runnable_avg(rq, 1);
 	}
if(sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -7768,6 +7776,14 @@ static void run_rebalance_domains(struct softirq_action *h)
 	 */
 	nohz_idle_balance(this_rq, idle);
 	rebalance_domains(this_rq, idle);
/*
* FIXME some hardware does not require this, but current CPUfreq
* locking prevents us from changing cpu frequency with rq locks held
* and interrupts disabled
*/
if (sched_energy_freq())
cap_gov_kick_thread(cpu_of(this_rq));
}
 /*

@@ -7821,6 +7837,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		task_tick_numa(rq, curr);
update_rq_runnable_avg(rq, 1);
if(sched_energy_freq())
cap_gov_update_cpu(cpu_of(rq));
}
 /*

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fe57ba..c45f1ee 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1398,6 +1398,14 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
int get_cpu_usage(int cpu);
+#ifdef CONFIG_CPU_FREQ_GOV_CAP_GOV
+void cap_gov_update_cpu(int cpu);
+void cap_gov_kick_thread(int cpu);
+#else
+static inline void cap_gov_update_cpu(int cpu) {}
+static inline void cap_gov_kick_thread(int cpu) {}
+#endif
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));

--
1.9.1