Re: [Eas-dev] [RFC 6/6] sched: cap_gov: PELT-based cpu frequency scaling

16 Apr 2015

Quoting Amit Kucheria (2015-04-16 00:11:22)
> On Thu, Apr 16, 2015 at 10:59 AM, Michael Turquette
> <mturquette@linaro.org> wrote:
> > +config CPU_FREQ_GOV_CAP_GOV
>
> Two GOVs are redundant here and make it hard to read. A few name
> suggestions for your baby:
> CPU_FREQ_GOV_SCHED_CAP
> CPU_FREQ_GOV_SCHED_STATS

I don't want the name to be too generic, since we're only dealing with
cfs right now. Perhaps your SCHED_CAP variant or maybe
CPU_FREQ_GOV_SCHED_CFS? That leaves room for SCHED_DL and others later
on.

> > +       tristate "'capacity governor' cpufreq governor"
> > +       depends on CPU_FREQ
> > +       select CPU_FREQ_GOV_COMMON
> > +       help
> > +         'cap_gov' - this governor scales cpu frequency from the
>
> same as above
>
> > +         scheduler as a function of cpu capacity utilization. It does
> > +         not evaluate utilization on a periodic basis (unlike ondemand)
> > +         but instead is invoked from CFS when updating per-entity load
> > +         tracking statistics.
>
> perhaps add something to the effect that it is more responsive than
> existing governors to really sell it? :)

Good idea. I'll add,

"Response to changes in load is improved over polling governors due to
its event-driven design"

> > +
> > +         If in doubt, say N.
> > +
> >  comment "CPU frequency scaling drivers"
> >  config CPUFREQ_DT
> >
> > diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> > index 7cdf63a..4fc066f 100644
> > --- a/include/linux/cpufreq.h
> > +++ b/include/linux/cpufreq.h
> > @@ -488,6 +488,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
> >  #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
> >  extern struct cpufreq_governor cpufreq_gov_conservative;
> >  #define CPUFREQ_DEFAULT_GOVERNOR       (&cpufreq_gov_conservative)
> > +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CAP_GOV)
> > +extern struct cpufreq_governor cpufreq_gov_cap_gov;
> > +#define CPUFREQ_DEFAULT_GOVERNOR       (&cpufreq_gov_cap_gov)
> >  #endif
> >
> >  /*********************************************************************
> > diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> > index 46be870..da601d5 100644
> > --- a/kernel/sched/Makefile
> > +++ b/kernel/sched/Makefile
> > @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
> >  obj-$(CONFIG_SCHEDSTATS) += stats.o
> >  obj-$(CONFIG_SCHED_DEBUG) += debug.o
> >  obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
> > +obj-$(CONFIG_CPU_FREQ_GOV_CAP_GOV) += cap_gov.o
> > diff --git a/kernel/sched/cap_gov.c b/kernel/sched/cap_gov.c
> > new file mode 100644
> > index 0000000..72873ab
> > --- /dev/null
> > +++ b/kernel/sched/cap_gov.c
> > @@ -0,0 +1,361 @@
> > +/*
> > + *  Copyright (C)  2014 Michael Turquette <mturquette@linaro.org>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/cpufreq.h>
> > +#include <linux/module.h>
> > +#include <linux/kthread.h>
> > +#include <linux/percpu.h>
> > +
> > +#include "sched.h"
> > +
> > +#define UP_THRESHOLD           95
>
> A comment that this probably belongs as a sysfs tunable

Doh, this shouldn't be here at all. I don't use any up or down
thresholds in this version.

> > +#define THROTTLE_NSEC          50000000 /* 50ms default */
> > +
> > +/*
> > + * per-cpu pointer to atomic_t gov_data->cap_gov_wake_task
>
> s/cap_gov_wake_task/need_wake_task/

Ack. I might also be able to get rid of this entirely with the irq_work
stuff.

> > + * used in scheduler hot paths {en,de}queue, task_tick without having to
> > + * access struct cpufreq_policy and struct gov_data
> > + */
> > +static DEFINE_PER_CPU(atomic_t *, cap_gov_wake_task);
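
As a rough user-space illustration of the indirection that comment describes (not code from the patch), each CPU slot can hold a pointer to its policy's flag, so a hot path sets the flag without looking up the policy or its governor data. NR_CPUS, gov_data_sketch, wake_task_ptr, attach_policy and request_update are names invented for this sketch, and C11 atomics stand in for the kernel's atomic_t and per-cpu variables.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NR_CPUS 4	/* assumed CPU count for this sketch only */

/* Hypothetical reduced form of the patch's struct gov_data. */
struct gov_data_sketch {
	atomic_int need_wake_task;
};

/* One slot per CPU, standing in for DEFINE_PER_CPU(atomic_t *, ...). */
static atomic_int *wake_task_ptr[NR_CPUS];

/* Point every CPU that belongs to a policy at that policy's flag. */
static void attach_policy(struct gov_data_sketch *gd, const int *cpus, size_t n)
{
	for (size_t i = 0; i < n; i++)
		wake_task_ptr[cpus[i]] = &gd->need_wake_task;
}

/* Hot-path side: ask for a frequency re-evaluation of this CPU's policy. */
static void request_update(int cpu)
{
	atomic_int *flag = wake_task_ptr[cpu];

	if (flag)	/* NULL until a governor has been attached */
		atomic_store(flag, 1);
}
```

With two policies covering CPUs {0, 1} and {2, 3}, calling request_update(1) sets only the first policy's flag. The hot path touches a single pointer and a single atomic, which seems to be the motivation given in the comment.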



> > +/**
> > + * gov_data - per-policy data internal to the governor
> > + * @throttle: time until throttling period expires. Derived from THROTTLE_NSEC
>
> @throttle_nsec ?

Ack.

> > + * @task: worker task for dvfs transition that may block/sleep
> > + * @need_wake_task: flag the governor to wake this policy's worker thread
> > + *
> > + * struct gov_data is the per-policy cap_gov-specific data structure. A
> > + * per-policy instance of it is created when the cap_gov governor receives
> > + * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
> > + * member of struct cpufreq_policy.
> > + *
> > + * Readers of this data must call down_read(policy->rwsem). Writers must
> > + * call down_write(policy->rwsem).
> > + */
> > +struct gov_data {
> > +       ktime_t throttle;
> > +       unsigned int throttle_nsec;
> > +       struct task_struct *task;
> > +       atomic_t need_wake_task;
> > +};
> > +
> > +/**
> > + * cap_gov_select_freq - pick the next frequency for a cpu
> > + * @cpu: the cpu whose frequency may be changed
> > + *
> > + * cap_gov_select_freq works in a way similar to the ondemand governor. First
> > + * we inspect the utilization of all of the cpus in this policy to find the
> > + * most utilized cpu. This is achieved by calling get_cpu_usage, which returns
> > + * frequency-invariant capacity utilization.
> > + *
> > + * This max utilization is compared against the up_threshold (default 95%
> > + * utilization). If the max cpu utilization is greater than this threshold then
> > + * we scale the policy up to the max frequency. Otherwise we find the lowest
> > + * frequency (smallest cpu capacity) that is still larger than the max capacity
> > + * utilization for this policy.
> > + *
> > + * Returns frequency selected.
> > + */
> > +static unsigned long cap_gov_select_freq(struct cpufreq_policy *policy)
> > +{
> > +       int cpu = 0;
> > +       struct gov_data *gd;
> > +       int index;
> > +       unsigned long freq = 0, max_usage = 0, cap = 0, usage = 0;
> > +       struct cpufreq_frequency_table *pos;
> > +
> > +       if (!policy->gov_data)
> > +               goto out;
> > +
> > +       gd = policy->gov_data;
> > +
> > +       /*
> > +        * get_cpu_usage is called without locking the runqueues. This is the
> > +        * same behavior used by find_busiest_cpu in load_balance. We are
> > +        * willing to accept occasionally stale data here in exchange for
> > +        * lockless behavior.
> > +        */
> > +       for_each_cpu(cpu, policy->cpus) {
> > +               usage = get_cpu_usage(cpu);
> > +               trace_printk("cpu = %d usage = %lu", cpu, usage);
> > +               if (usage > max_usage)
> > +                       max_usage = usage;
> > +       }
> > +       trace_printk("max_usage = %lu", max_usage);
> > +
> > +       /* find the utilization threshold at which we scale up frequency */
> > +       index = cpufreq_frequency_table_get_index(policy, policy->cur);
> > +
> > +       /*
> > +        * converge towards max_usage. We want the lowest frequency whose
> > +        * capacity is >= to max_usage. In other words:
> > +        *
> > +        *      find capacity == floor(usage)
> > +        *
> > +        * Sadly cpufreq freq tables are not guaranteed to be ordered by
> > +        * frequency...
> > +        */
> > +       freq = policy->max;
> > +       cpufreq_for_each_entry(pos, policy->freq_table) {
> > +               cap = pos->frequency * SCHED_CAPACITY_SCALE /
> > +                       policy->max;
> > +               if (max_usage < cap && pos->frequency < freq)
> > +                       freq = pos->frequency;
> > +               trace_printk("cpu = %u max_usage = %lu cap = %lu \
> > +                               table_freq = %u freq = %lu",
> > +                               cpumask_first(policy->cpus), max_usage, cap,
> > +                               pos->frequency, freq);
> > +       }
> > +
> > +out:
> > +       trace_printk("cpu %d final freq %lu", cpu, freq);
> > +       return freq;
> > +}
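
To make the table walk in that function easier to follow, here is a minimal stand-alone sketch of the same selection rule outside the kernel. It mirrors the quoted loop, which uses a strict comparison (max_usage < cap) even though its comment says ">=". select_freq and CAPACITY_SCALE are names made up for this sketch; the value 1024 is assumed to match the kernel's SCHED_CAPACITY_SCALE, and a plain array replaces struct cpufreq_frequency_table.

```c
#include <stddef.h>

/* Assumed to match the kernel's SCHED_CAPACITY_SCALE. */
#define CAPACITY_SCALE 1024UL

/*
 * Hypothetical stand-in for the patch's table walk: return the lowest
 * frequency whose capacity is strictly greater than max_usage, or
 * max_freq when no entry qualifies. The table may be in any order.
 */
static unsigned long select_freq(const unsigned int *table, size_t n,
				 unsigned long max_freq,
				 unsigned long max_usage)
{
	unsigned long freq = max_freq;
	size_t i;

	for (i = 0; i < n; i++) {
		/* capacity of this entry, scaled to [0, CAPACITY_SCALE] */
		unsigned long cap = table[i] * CAPACITY_SCALE / max_freq;

		/* keep the smallest frequency that still exceeds the usage */
		if (max_usage < cap && table[i] < freq)
			freq = table[i];
	}
	return freq;
}
```

For an unordered table of 1000000, 500000, 2000000 and 1500000 kHz with a 2000000 kHz maximum, the capacities are 512, 256, 1024 and 768. A max usage of 600 then selects 1500000, and a usage of 1024 falls through to the maximum because no capacity is strictly larger.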



> > +/*
> > + * we pass in struct cpufreq_policy. This is safe because changing out the
> > + * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
> > + * which tears down all of the data structures and __cpufreq_governor(policy,
> > + * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
> > + * new policy pointer
> > + */
> > +static int cap_gov_thread(void *data)
> > +{
> > +       struct sched_param param;
> > +       struct cpufreq_policy *policy;
> > +       struct gov_data *gd;
> > +       unsigned long freq;
> > +       int ret;
> > +
> > +       policy = (struct cpufreq_policy *) data;
> > +       if (!policy) {
> > +               pr_warn("%s: missing policy\n", __func__);
> > +               do_exit(-EINVAL);
> > +       }
> > +
> > +       gd = policy->gov_data;
> > +       if (!gd) {
> > +               pr_warn("%s: missing governor data\n", __func__);
> > +               do_exit(-EINVAL);
> > +       }
> > +
> > +       param.sched_priority = 0;
> > +       sched_setscheduler(current, SCHED_FIFO, &param);
> > +       set_cpus_allowed_ptr(current, policy->related_cpus);
> > +
> > +       /* main loop of the per-policy kthread */
> > +       do {
> > +               down_write(&policy->rwsem);
> > +               if (!atomic_read(&gd->need_wake_task))  {
> > +                       if (kthread_should_stop())
> > +                               break;
> > +                       trace_printk("NOT waking up kthread (%d)", gd->task->pid);
> > +                       up_write(&policy->rwsem);
> > +                       set_current_state(TASK_INTERRUPTIBLE);
> > +                       schedule();
> > +                       continue;
> > +               }
> > +
> > +               trace_printk("kthread %d requested freq switch", gd->task->pid);
> > +
> > +               freq = cap_gov_select_freq(policy);
> > +
> > +               ret = __cpufreq_driver_target(policy, freq,
> > +                               CPUFREQ_RELATION_H);
> > +               if (ret)
> > +                       pr_debug("%s: __cpufreq_driver_target returned %d\n",
> > +                                       __func__, ret);
> > +
> > +               trace_printk("kthread %d requested freq switch", gd->task->pid);
> > +
> > +               gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
> > +               atomic_set(&gd->need_wake_task, 0);
> > +               up_write(&policy->rwsem);
> > +       } while (!kthread_should_stop());
> > +
> > +       do_exit(0);
> > +}
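
The loop above ends each transition by pushing gd->throttle one throttle period into the future. The code that consults that deadline is not part of the text quoted in this mail, so the following is only a sketch of how such a check could work, using plain nanosecond counters instead of ktime_t. throttle_deadline and throttled are hypothetical helpers; only the 50 ms value comes from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define THROTTLE_NSEC 50000000ULL	/* 50 ms, the patch's default */

/* Deadline recorded after a transition: now plus the throttle period. */
static uint64_t throttle_deadline(uint64_t now_ns, uint64_t throttle_nsec)
{
	return now_ns + throttle_nsec;
}

/* True while the throttling period has not yet expired (assumed check). */
static bool throttled(uint64_t now_ns, uint64_t deadline_ns)
{
	return now_ns < deadline_ns;
}
```

A caller on the scheduler side could skip setting need_wake_task while throttled() returns true, which would bound the rate of frequency transitions to one per throttle period. Whether the patch does exactly this is an assumption here.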



> > +static void cap_gov_wake_up_process(struct task_struct *task)
> > +{
> > +       /* this is null during early boot */
> > +       if (IS_ERR_OR_NULL(task)) {
> > +               return;
> > +       }
> > +
> > +       wake_up_process(task);
> > +}
> > +
> > +void cap_gov_kick_thread(int cpu)
> > +{
> > +       struct cpufreq_policy *policy;
> > +       struct gov_data *gd = NULL;
> > +
> > +       policy = cpufreq_cpu_get(cpu);
> > +       if (IS_ERR_OR_NULL(policy))
> > +               return;
> > +
> > +       gd = policy->gov_data;
> > +       if (!gd)
> > +               goto out;
> > +
> > +       /* per-cpu access not needed here since we have gd */
> > +       if (atomic_read(&gd->need_wake_task)) {
> > +               trace_printk("waking up kthread (%d)", gd->task->pid);
> > +               cap_gov_wake_up_process(gd->task);
> > +       }
> > +
> > +out:
> > +       cpufreq_cpu_put(policy);
> > +}
> > +
> > +/**
> > + * cap_gov_update_cpu - interface to scheduler for changing capacity values
> > + * @cpu: cpu whose capacity utilization has recently changed
> > + *
> > + * cap_gov_update_cpu is an interface exposed to the scheduler so that the
> > + * scheduler may inform the governor of updates to capacity utilization and
> > + * make changes to cpu frequency. Currently this interface is designed around
> > + * PELT values in CFS. It can be expanded to other scheduling classes in the
> > + * future if needed.
> > + *
> > + * The semantics of this call vary based on the cpu frequency scaling
> > + * characteristics of the hardware.
> > + *
> > + * If kicking off a dvfs transition is an operation that might block or sleep
> > + * in the cpufreq driver then we set the need_wake_task flag in this function
>
> The comment here isn't obvious since at first glance you don't touch
> need_wake_task. Perhaps clarify it as follows?
>
> we set the need_wake_task (cap_gov_wake_task is a pointer to it)

I can do that. Additionally the kerneldoc description should remove all
of the text about hardware that has async/non-blocking dvfs transitions.
This version of the patch ALWAYS kicks the kthread and the previous
"driver_might_sleep" bool has been removed.

Trying to keep the submission as simple and not-over-engineered as
possible.

Regards,
Mike
    
