Hi Mike,
On 27/04/15 08:46, Michael Turquette wrote:
Scheduler-driven cpu frequency selection is desirable as part of the on-going effort to make the scheduler better aware of energy consumption. No piece of the Linux kernel has a better view of the factors that affect a cpu frequency selection policy than the scheduler[0], and this patch is an attempt to get that discussion going again.
This patch implements a cpufreq governor, sched_cfs, that directly accesses scheduler statistics, in particular the pelt data from cfs via the get_cpu_usage() function.
Put plainly, sched_cfs selects the lowest cpu frequency that will prevent a runqueue from being over-utilized (until we hit the highest frequency of course). This is done by requestiong a frequency which is
requesting ^
equivalent to the current capacity utilization, plus a margin.
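As a plain-C illustration of that usage-plus-margin mapping (not the patch's code; `select_freq()` and its arguments are invented here to mirror the description):

```c
/* Illustrative sketch of the usage-plus-margin mapping described
 * above; select_freq() and its parameters are invented for this
 * example and are not part of the patch. */
#define MARGIN_PCT 125	/* request 25% headroom over current usage */

unsigned long select_freq(unsigned long usage,		/* busiest cpu usage */
			  unsigned long max_freq,	/* policy max, in kHz */
			  unsigned long capacity_orig)	/* full cpu capacity */
{
	/* add the 25% margin, then scale usage into the frequency range */
	usage = usage * MARGIN_PCT / 100;
	return usage * max_freq / capacity_orig;
}
```

A cpu at half utilization (512 of 1024) on a 2 GHz part would request 1.25 GHz; near full utilization the result exceeds the policy maximum and relies on cpufreq clamping it.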
Unlike the previous posting from 2014[1] this governor implements a "follow the usage" method, where usage is defined as the cpu frequency-invariant product of utilization_load_avg and cpu_capacity_orig.
This governor is event-driven. There is no polling loop to check cpu idle time, or any other method which is unsynchronized with the scheduler. The entry points for this policy are in fair.c: enqueue_task_fair, dequeue_task_fair and task_tick_fair.
This policy is implemented using the cpufreq governor interface for two main reasons:
- re-using the cpufreq machine drivers without using the governor
interface is hard.
- using the cpufreq interface allows us to switch between the
scheduler-driven policy and legacy cpufreq governors such as ondemand at run-time. This is very useful for comparative testing and tuning.
Finally, it is worth mentioning that this approach neglects all scheduling classes except for cfs. It is possible to add support for deadline and other classes here, but I also wonder if a multi-governor approach would be a more maintainable solution, where the cpufreq core aggregates the constraints set by multiple governors. Supporting such an approach in the cpufreq core would also allow for peripheral devices to place constraints on cpu frequency without having to hack such behavior in at the governor level.
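To make the multi-governor idea concrete, here is a hypothetical sketch of how a core could aggregate per-client minimum-frequency constraints; none of these names exist in the cpufreq core today:

```c
/* Hypothetical constraint aggregation: each governor (or peripheral
 * driver) publishes a minimum acceptable cpu frequency and the core
 * requests the largest of them. Purely illustrative. */
struct freq_constraint {
	unsigned long min_freq;	/* in kHz, 0 means "no constraint" */
};

unsigned long aggregate_min_freq(const struct freq_constraint *c, int n)
{
	unsigned long freq = 0;
	int i;

	/* the most demanding client wins */
	for (i = 0; i < n; i++)
		if (c[i].min_freq > freq)
			freq = c[i].min_freq;

	return freq;
}
```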
Thanks to Juri Lelli <juri.lelli@arm.com> for doing a good bit of testing, bug fixing and contributing towards the design.
[0] http://article.gmane.org/gmane.linux.kernel/1499836
[1] https://lkml.org/lkml/2014/10/22/22
Signed-off-by: Michael Turquette <mturquette@linaro.org>
changes since internal v1:
- renamed everything
- fixed possible deadlock between gov_cfs_thread and gov_cfs_stop
- replaced direct usage-to-frequency mapping with usage+margin-to-frequency mapping. This functions like an up_threshold and allows us to easily work with non-discretized frequency ranges
- usage-to-frequency calculation now uses capacity_orig instead of SCHED_LOAD_SCALE to handle SMT and asymmetric cpu use cases
- dropped workqueue method due to instability
- kthread is woken up by irq_work handler. This removes the need for cap_gov_kick_thread() from v1
 drivers/cpufreq/Kconfig          |  24 +++
 include/linux/cpufreq.h          |   3 +
 kernel/sched/Makefile            |   1 +
 kernel/sched/cpufreq_sched_cfs.c | 314 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c              |  11 ++
 kernel/sched/sched.h             |   6 +
 6 files changed, 359 insertions(+)
 create mode 100644 kernel/sched/cpufreq_sched_cfs.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index a171fef..35ba9c3 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,15 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
	  Be aware that not all cpufreq drivers support the conservative
	  governor. If unsure have a look at the help section of the
	  driver. Fallback governor will be the performance governor.
+config CPU_FREQ_DEFAULT_GOV_SCHED_CFS
+	bool "sched_cfs"
+	select CPU_FREQ_GOV_SCHED_CFS
+	select CPU_FREQ_GOV_PERFORMANCE
+	help
+	  Use the CPUfreq governor 'sched_cfs' as default. This scales
+	  cpu frequency from the scheduler as per-entity load tracking
+	  statistics are updated.

 endchoice
 config CPU_FREQ_GOV_PERFORMANCE

@@ -183,6 +192,21 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_SCHED_CFS
+	tristate "'sched cfs' cpufreq governor"
+	depends on CPU_FREQ
Also CONFIG_IRQ_WORK is a dependency.
+	select CPU_FREQ_GOV_COMMON
+	help
+	  'sched_cfs' - this governor scales cpu frequency from the
+	  scheduler as a function of cpu capacity utilization. It does
+	  not evaluate utilization on a periodic basis (as ondemand
+	  does) but instead is invoked from the completely fair
+	  scheduler when updating per-entity load tracking statistics.
+	  Latency to respond to changes in load is improved over polling
+	  governors due to its event-driven design.
+
+	  If in doubt, say N.
comment "CPU frequency scaling drivers"
 config CPUFREQ_DT

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 2ee4888..62e8152 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -485,6 +485,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
 #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
 extern struct cpufreq_governor cpufreq_gov_conservative;
 #define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
+extern struct cpufreq_governor cpufreq_gov_cap_gov;
+#define CPUFREQ_DEFAULT_GOVERNOR	(&cpufreq_gov_cap_gov)
 #endif
 /*********************************************************************

diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 46be870..003b592 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHED_CFS) += cpufreq_sched_cfs.o

diff --git a/kernel/sched/cpufreq_sched_cfs.c b/kernel/sched/cpufreq_sched_cfs.c
new file mode 100644
index 0000000..746b220
--- /dev/null
+++ b/kernel/sched/cpufreq_sched_cfs.c
@@ -0,0 +1,314 @@
+/*
 * Copyright (C) 2015 Michael Turquette <mturquette@linaro.org>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 */
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
We don't need this anymore (at least for now), right?
+#include <linux/irq_work.h>
+#include "sched.h"
+#define MARGIN_PCT		125	/* taken from imbalance_pct = 125 */
+#define THROTTLE_NSEC		50000000 /* 50ms default */
+/**
 * gov_data - per-policy data internal to the governor
 * @throttle: next throttling period expiry. Derived from throttle_nsec
 * @throttle_nsec: throttle period length in nanoseconds
 * @task: worker thread for dvfs transition that may block/sleep
 * @irq_work: callback used to wake up worker thread
 *
 * struct gov_data is the per-policy gov_cfs-specific data structure. A
 * per-policy instance of it is created when the gov_cfs governor receives
 * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
 * member of struct cpufreq_policy.
 *
 * Readers of this data must call down_read(policy->rwsem). Writers must
 * call down_write(policy->rwsem).
 */
+struct gov_data {
	ktime_t throttle;
	unsigned int throttle_nsec;
	struct task_struct *task;
	struct irq_work irq_work;
	struct cpufreq_policy *policy;
+};
+/**
 * gov_cfs_select_freq - pick the next frequency for a cpu
 * @policy: the cpufreq policy whose frequency may be changed
 *
 * gov_cfs_select_freq selects a frequency based on pelt load statistics
 * tracked by cfs. First it finds the most utilized cpu in the policy and then
 * maps that utilization value onto a cpu frequency and returns it.
 *
 * Additionally, gov_cfs_select_freq adds a margin to the cpu utilization value
 * before converting it to a frequency. The margin is derived from MARGIN_PCT,
 * which itself is inspired by imbalance_pct in cfs. This is needed to
 * proactively increase frequency in the case of increasing load.
utilization? ^
 * This approach attempts to maintain headroom of 25% unutilized cpu capacity.
 * A traditional way of doing this is to take 75% of the current capacity and
 * check if current utilization crosses that threshold. The only problem with
 * that approach is determining the next cpu frequency target if that threshold
 * is crossed.
 *
 * Instead of using the 75% threshold, gov_cfs_select_freq adds a 25%
 * utilization margin to the utilization and converts that to a frequency. This
 * removes conditional logic around checking thresholds and better supports
 * drivers that use non-discretized frequency ranges (i.e. no pre-defined
 * frequency tables or operating points).
 *
 * Returns frequency selected.
 */
+static unsigned long gov_cfs_select_freq(struct cpufreq_policy *policy)
+{
	int cpu = 0;
	struct gov_data *gd;
	unsigned long freq = 0, max_usage = 0, usage = 0;

	if (!policy->governor_data)
		goto out;

	gd = policy->governor_data;

	/*
	 * get_cpu_usage is called without locking the runqueues. This is the
	 * same behavior used by find_busiest_cpu in load_balance. We are
	 * willing to accept occasionally stale data here in exchange for
	 * lockless behavior.
	 */
	for_each_cpu(cpu, policy->cpus) {
		usage = get_cpu_usage(cpu);
		if (usage > max_usage)
			max_usage = usage;
	}

	/* add margin to max_usage based on imbalance_pct */
	max_usage = max_usage * MARGIN_PCT / 100;

	cpu = cpumask_first(policy->cpus);

	/* freq is current utilization + 25% */
	freq = max_usage * policy->max / capacity_orig_of(cpu);

+out:
	return freq;
+}
+/*
 * we pass in struct cpufreq_policy. This is safe because changing out the
 * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
 * which tears down all of the data structures and __cpufreq_governor(policy,
 * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
 * new policy pointer
 */
+static int gov_cfs_thread(void *data)
+{
	struct sched_param param;
	struct cpufreq_policy *policy;
	struct gov_data *gd;
	unsigned long freq;
	int ret;

	policy = (struct cpufreq_policy *) data;
	if (!policy) {
		pr_warn("%s: missing policy\n", __func__);
		do_exit(-EINVAL);
	}

	gd = policy->governor_data;
	if (!gd) {
		pr_warn("%s: missing governor data\n", __func__);
		do_exit(-EINVAL);
	}

	param.sched_priority = 50;
	ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
	if (ret) {
		pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
		do_exit(-EINVAL);
	} else {
		pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
				__func__, gd->task->pid);
	}

	ret = set_cpus_allowed_ptr(gd->task, policy->related_cpus);
	if (ret) {
		pr_warn("%s: failed to set allowed ptr\n", __func__);
		do_exit(-EINVAL);
	}

	/* main loop of the per-policy kthread */
	do {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule();

		if (kthread_should_stop())
			break;

		/* avoid race with gov_cfs_stop */
		if (!down_write_trylock(&policy->rwsem))
			continue;

		freq = gov_cfs_select_freq(policy);

		ret = __cpufreq_driver_target(policy, freq,
				CPUFREQ_RELATION_H);
I think we should use CPUFREQ_RELATION_L here. From the comments I read:
#define CPUFREQ_RELATION_L 0  /* lowest frequency at or above target */
#define CPUFREQ_RELATION_H 1  /* highest frequency below or at target */
So we have to tell the driver to select a frequency with enough capacity (above the current one).
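The difference matters whenever the target falls between two table entries. A toy lookup over an ascending frequency table (invented for illustration, not the driver code) shows it:

```c
/* Toy frequency table, ascending, in kHz (invented for illustration) */
static const unsigned long freq_table[] = { 400000, 800000, 1200000 };
#define NR_FREQS 3

/* CPUFREQ_RELATION_L: lowest table frequency at or above target */
unsigned long pick_relation_l(unsigned long target)
{
	int i;

	for (i = 0; i < NR_FREQS; i++)
		if (freq_table[i] >= target)
			return freq_table[i];
	return freq_table[NR_FREQS - 1];	/* clamp to max */
}

/* CPUFREQ_RELATION_H: highest table frequency at or below target */
unsigned long pick_relation_h(unsigned long target)
{
	int i;

	for (i = NR_FREQS - 1; i >= 0; i--)
		if (freq_table[i] <= target)
			return freq_table[i];
	return freq_table[0];			/* clamp to min */
}
```

For a 900000 kHz capacity request, RELATION_H yields 800000 (under-provisioned) while RELATION_L yields 1200000, which is the behavior this governor wants.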
		if (ret)
			pr_debug("%s: __cpufreq_driver_target returned %d\n",
					__func__, ret);

		gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);

		up_write(&policy->rwsem);
	} while (!kthread_should_stop());

	do_exit(0);
+}
+static void gov_cfs_irq_work(struct irq_work *irq_work)
+{
	struct gov_data *gd;

	gd = container_of(irq_work, struct gov_data, irq_work);
	if (!gd) {
		return;
	}
No brackets?
	wake_up_process(gd->task);
So, we always wake up the kthread, even when we know that we won't need a frequency change. This might, I fear, be an almost certain source of reasonable complaints and pushback. I understand that we might not want to start optimizing things yet, but IMHO this point deserves some more thought before posting. Don't you think we could do some level of aggregation before kicking the kthread? In task_tick_fair(), for example, we could check whether we are beyond the 25% threshold and kick the kthread only in that case.
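One possible shape for that filter, sketched in plain C with invented names (not part of the patch): estimate the capacity delivered at the current frequency and kick only when usage plus the 25% margin exceeds it.

```c
#define MARGIN_PCT 125	/* same 25% headroom used by the governor */

/* Return nonzero when a frequency change is worth requesting: usage
 * plus margin no longer fits in the capacity delivered at the current
 * frequency. All names and the linear scaling are hypothetical. */
int should_kick_thread(unsigned long usage, unsigned long capacity_orig,
		       unsigned long cur_freq, unsigned long max_freq)
{
	/* capacity currently provided, scaled linearly with frequency */
	unsigned long cur_cap = capacity_orig * cur_freq / max_freq;

	return usage * MARGIN_PCT / 100 > cur_cap;
}
```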
+}
+/**
 * gov_cfs_update_cpu - interface to scheduler for changing capacity values
 * @cpu: cpu whose capacity utilization has recently changed
 *
 * gov_cfs_update_cpu is an interface exposed to the scheduler so that the
 * scheduler may inform the governor of updates to capacity utilization and
 * make changes to cpu frequency. Currently this interface is designed around
 * PELT values in CFS. It can be expanded to other scheduling classes in the
 * future if needed.
 *
 * gov_cfs_update_cpu raises an IPI. The irq_work handler for that IPI wakes up
 * the thread that does the actual work, gov_cfs_thread.
 */
+void gov_cfs_update_cpu(int cpu)
+{
	struct cpufreq_policy *policy;
	struct gov_data *gd;

	/* XXX put policy pointer in per-cpu data? */
	policy = cpufreq_cpu_get(cpu);
	if (IS_ERR_OR_NULL(policy)) {
		return;
	}

	if (!policy->governor_data) {
		goto out;
	}

	gd = policy->governor_data;

	/* bail early if we are throttled */
	if (ktime_before(ktime_get(), gd->throttle)) {
		goto out;
	}
No brackets in the 3 ifs above?
Thanks,
- Juri
	irq_work_queue_on(&gd->irq_work, cpu);

+out:
	cpufreq_cpu_put(policy);
	return;
+}
+static void gov_cfs_start(struct cpufreq_policy *policy)
+{
	struct gov_data *gd;

	/* prepare per-policy private data */
	gd = kzalloc(sizeof(*gd), GFP_KERNEL);
	if (!gd) {
		pr_debug("%s: failed to allocate private data\n", __func__);
		return;
	}

	/*
	 * Don't ask for freq changes at a higher rate than what
	 * the driver advertises as transition latency.
	 */
	gd->throttle_nsec = policy->cpuinfo.transition_latency ?
			policy->cpuinfo.transition_latency :
			THROTTLE_NSEC;
	pr_debug("%s: throttle threshold = %u [ns]\n",
			__func__, gd->throttle_nsec);

	/* init per-policy kthread */
	gd->task = kthread_run(gov_cfs_thread, policy, "kgov_cfs_task");
	if (IS_ERR_OR_NULL(gd->task))
		pr_err("%s: failed to create kgov_cfs_task thread\n", __func__);

	init_irq_work(&gd->irq_work, gov_cfs_irq_work);

	policy->governor_data = gd;
	gd->policy = policy;
+}
+static void gov_cfs_stop(struct cpufreq_policy *policy)
+{
	struct gov_data *gd;

	gd = policy->governor_data;

	kthread_stop(gd->task);

	policy->governor_data = NULL;

	/* FIXME replace with devm counterparts? */
	kfree(gd);
+}
+static int gov_cfs_setup(struct cpufreq_policy *policy, unsigned int event)
+{
	switch (event) {
	case CPUFREQ_GOV_START:
		/* Start managing the frequency */
		gov_cfs_start(policy);
		return 0;

	case CPUFREQ_GOV_STOP:
		gov_cfs_stop(policy);
		return 0;

	case CPUFREQ_GOV_LIMITS:	/* unused */
	case CPUFREQ_GOV_POLICY_INIT:	/* unused */
	case CPUFREQ_GOV_POLICY_EXIT:	/* unused */
		break;
	}

	return 0;
+}
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED_CFS
+static
+#endif
+struct cpufreq_governor cpufreq_gov_cfs = {
	.name			= "gov_cfs",
	.governor		= gov_cfs_setup,
	.owner			= THIS_MODULE,
+};
+static int __init gov_cfs_init(void)
+{
	return cpufreq_register_governor(&cpufreq_gov_cfs);
+}

+static void __exit gov_cfs_exit(void)
+{
	cpufreq_unregister_governor(&cpufreq_gov_cfs);
+}
+/* Try to make this the default governor */
+fs_initcall(gov_cfs_init);
+MODULE_LICENSE("GPL");

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 393fc36..a7b97f9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4257,6 +4257,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
		update_rq_runnable_avg(rq, rq->nr_running);
		add_nr_running(rq, 1);
	}
	if (sched_energy_freq())
		gov_cfs_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -4318,6 +4322,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
		sub_nr_running(rq, 1);
		update_rq_runnable_avg(rq, 1);
	}
	if (sched_energy_freq())
		gov_cfs_update_cpu(cpu_of(rq));
hrtick_update(rq);
}
@@ -7821,6 +7829,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
		task_tick_numa(rq, curr);
update_rq_runnable_avg(rq, 1);
	if (sched_energy_freq())
		gov_cfs_update_cpu(cpu_of(rq));
}
 /*

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 63a8be9..ec23523 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1399,6 +1399,12 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 int get_cpu_usage(int cpu);
 unsigned long capacity_orig_of(int cpu);
+#ifdef CONFIG_CPU_FREQ_GOV_SCHED_CFS
+void gov_cfs_update_cpu(int cpu);
+#else
+static inline void gov_cfs_update_cpu(int cpu) {}
+#endif
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));

--
1.9.1
_______________________________________________
eas-dev mailing list
eas-dev@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/eas-dev