Currently, there can't be multiple instances of a single governor type. If we have a multi-package system, where we have multiple instances of struct cpufreq_policy (one per package), we can't have multiple instances of the same governor, i.e. we can't run a separate instance of the ondemand governor for each package.

The governor's directory in sysfs is created at /sys/devices/system/cpu/cpufreq/governor-name/, which again reflects that there can be only one instance of a governor type in the system.

This is a bottleneck for multi-cluster systems, where we want different packages to use the same governor type but with different tunables.

This patchset fixes this limitation.
Viresh Kumar (4):
  cpufreq: Don't check cpu_online(policy->cpu)
  cpufreq: stats: Get rid of CPUFREQ_STATDEVICE_ATTR
  cpufreq: Add per policy governor-init/exit infrastructure
  cpufreq: governor: Implement per policy instances of governors

 drivers/cpufreq/cpufreq.c              |  41 ++++---
 drivers/cpufreq/cpufreq_conservative.c | 142 +++++++++++++----------
 drivers/cpufreq/cpufreq_governor.c     | 140 +++++++++++++---------
 drivers/cpufreq/cpufreq_governor.h     |  42 ++++---
 drivers/cpufreq/cpufreq_ondemand.c     | 205 +++++++++++++++++++--------------
 drivers/cpufreq/cpufreq_stats.c        |  18 +--
 drivers/cpufreq/cpufreq_userspace.c    |   2 -
 drivers/cpufreq/freq_table.c           |   6 -
 include/linux/cpufreq.h                |  26 +----
 9 files changed, 346 insertions(+), 276 deletions(-)
policy->cpu, as well as any CPU in policy->cpus, can't be offline anymore, so we no longer need to check whether they are online.
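To illustrate the invariant this relies on (a sketch only, not part of the patch; setup_cpu_dbs_info() is a hypothetical helper): any loop over policy->cpus may now assume every CPU it visits is online, so per-CPU setup can be done unconditionally.

	/*
	 * Sketch: the cpufreq core keeps policy->cpus limited to online CPUs,
	 * so no cpu_online() check is needed inside the loop.
	 */
	for_each_cpu(j, policy->cpus)
		setup_cpu_dbs_info(j, policy);	/* hypothetical per-CPU setup */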
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq.c           | 17 +++--------------
 drivers/cpufreq/cpufreq_governor.c  |  2 +-
 drivers/cpufreq/cpufreq_userspace.c |  2 --
 drivers/cpufreq/freq_table.c        |  6 ------
 4 files changed, 4 insertions(+), 23 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 9656420..e619f4f 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -76,10 +76,6 @@ static int lock_policy_rwsem_##mode \ int policy_cpu = per_cpu(cpufreq_policy_cpu, cpu); \ BUG_ON(policy_cpu == -1); \ down_##mode(&per_cpu(cpu_policy_rwsem, policy_cpu)); \ - if (unlikely(!cpu_online(cpu))) { \ - up_##mode(&per_cpu(cpu_policy_rwsem, policy_cpu)); \ - return -1; \ - } \ \ return 0; \ } @@ -719,8 +715,6 @@ static int cpufreq_add_dev_symlink(unsigned int cpu,
if (j == cpu) continue; - if (!cpu_online(j)) - continue;
pr_debug("CPU %u already managed, adding link\n", j); managed_policy = cpufreq_cpu_get(cpu); @@ -777,8 +771,6 @@ static int cpufreq_add_dev_interface(unsigned int cpu,
spin_lock_irqsave(&cpufreq_driver_lock, flags); for_each_cpu(j, policy->cpus) { - if (!cpu_online(j)) - continue; per_cpu(cpufreq_cpu_data, j) = policy; per_cpu(cpufreq_policy_cpu, j) = policy->cpu; } @@ -1005,11 +997,8 @@ static void update_policy_cpu(struct cpufreq_policy *policy, unsigned int cpu) policy->last_cpu = policy->cpu; policy->cpu = cpu;
- for_each_cpu(j, policy->cpus) { - if (!cpu_online(j)) - continue; + for_each_cpu(j, policy->cpus) per_cpu(cpufreq_policy_cpu, j) = cpu; - }
#ifdef CONFIG_CPU_FREQ_TABLE cpufreq_frequency_table_update_policy_cpu(policy); @@ -1468,7 +1457,7 @@ int __cpufreq_driver_target(struct cpufreq_policy *policy, if (target_freq == policy->cur) return 0;
- if (cpu_online(policy->cpu) && cpufreq_driver->target) + if (cpufreq_driver->target) retval = cpufreq_driver->target(policy, target_freq, relation);
return retval; @@ -1506,7 +1495,7 @@ int __cpufreq_driver_getavg(struct cpufreq_policy *policy, unsigned int cpu) if (cpufreq_disabled()) return ret;
- if (!(cpu_online(cpu) && cpufreq_driver->getavg)) + if (!cpufreq_driver->getavg) return 0;
policy = cpufreq_cpu_get(policy->cpu); diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c index 79795c4..e4a306c 100644 --- a/drivers/cpufreq/cpufreq_governor.c +++ b/drivers/cpufreq/cpufreq_governor.c @@ -225,7 +225,7 @@ int cpufreq_governor_dbs(struct dbs_data *dbs_data,
switch (event) { case CPUFREQ_GOV_START: - if ((!cpu_online(cpu)) || (!policy->cur)) + if (!policy->cur) return -EINVAL;
mutex_lock(&dbs_data->mutex); diff --git a/drivers/cpufreq/cpufreq_userspace.c b/drivers/cpufreq/cpufreq_userspace.c index c8c3d29..bbeb9c0 100644 --- a/drivers/cpufreq/cpufreq_userspace.c +++ b/drivers/cpufreq/cpufreq_userspace.c @@ -118,8 +118,6 @@ static int cpufreq_governor_userspace(struct cpufreq_policy *policy,
switch (event) { case CPUFREQ_GOV_START: - if (!cpu_online(cpu)) - return -EINVAL; BUG_ON(!policy->cur); mutex_lock(&userspace_mutex);
diff --git a/drivers/cpufreq/freq_table.c b/drivers/cpufreq/freq_table.c index aa5bd39..d7a7966 100644 --- a/drivers/cpufreq/freq_table.c +++ b/drivers/cpufreq/freq_table.c @@ -63,9 +63,6 @@ int cpufreq_frequency_table_verify(struct cpufreq_policy *policy, pr_debug("request for verification of policy (%u - %u kHz) for cpu %u\n", policy->min, policy->max, policy->cpu);
- if (!cpu_online(policy->cpu)) - return -EINVAL; - cpufreq_verify_within_limits(policy, policy->cpuinfo.min_freq, policy->cpuinfo.max_freq);
@@ -121,9 +118,6 @@ int cpufreq_frequency_table_target(struct cpufreq_policy *policy, break; }
- if (!cpu_online(policy->cpu)) - return -EINVAL; - for (i = 0; (table[i].frequency != CPUFREQ_TABLE_END); i++) { unsigned int freq = table[i].frequency; if (freq == CPUFREQ_ENTRY_INVALID)
Macro "CPUFREQ_STATDEVICE_ATTR" is defined local to cpufreq_stats.c file and is almost a copy of the generic version present in cpufreq.h file. Lets use the generic version instead.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq_stats.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_stats.c b/drivers/cpufreq/cpufreq_stats.c index 572124c..a2dee4c 100644 --- a/drivers/cpufreq/cpufreq_stats.c +++ b/drivers/cpufreq/cpufreq_stats.c @@ -24,12 +24,6 @@
static spinlock_t cpufreq_stats_lock;
-#define CPUFREQ_STATDEVICE_ATTR(_name, _mode, _show) \ -static struct freq_attr _attr_##_name = {\ - .attr = {.name = __stringify(_name), .mode = _mode, }, \ - .show = _show,\ -}; - struct cpufreq_stats { unsigned int cpu; unsigned int total_trans; @@ -136,17 +130,17 @@ static ssize_t show_trans_table(struct cpufreq_policy *policy, char *buf) return PAGE_SIZE; return len; } -CPUFREQ_STATDEVICE_ATTR(trans_table, 0444, show_trans_table); +cpufreq_freq_attr_ro(trans_table); #endif
-CPUFREQ_STATDEVICE_ATTR(total_trans, 0444, show_total_trans); -CPUFREQ_STATDEVICE_ATTR(time_in_state, 0444, show_time_in_state); +cpufreq_freq_attr_ro(total_trans); +cpufreq_freq_attr_ro(time_in_state);
static struct attribute *default_attrs[] = { - &_attr_total_trans.attr, - &_attr_time_in_state.attr, + &total_trans.attr, + &time_in_state.attr, #ifdef CONFIG_CPU_FREQ_STAT_DETAILS - &_attr_trans_table.attr, + &trans_table.attr, #endif NULL };
Currently, there can't be multiple instances of a single governor type. If we have a multi-package system, where we have multiple instances of struct cpufreq_policy (one per package), we can't have multiple instances of the same governor, i.e. we can't run a separate instance of the ondemand governor for each package.

The governor's directory in sysfs is created at /sys/devices/system/cpu/cpufreq/governor-name/, which again reflects that there can be only one instance of a governor type in the system.

This is a bottleneck for multi-cluster systems, where we want different packages to use the same governor type but with different tunables.

This patch provides the required infrastructure. Because a governor's resources must now be allocated dynamically for each policy, the allocation and teardown have to happen when the policy is created and destroyed, hence the new CPUFREQ_GOV_POLICY_INIT/EXIT events.
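A sketch of how a governor is expected to use the two new events (illustrative only; struct my_tuners is hypothetical, real governors hang their per-policy data off the new policy->governor_data field):

	static int my_governor(struct cpufreq_policy *policy, unsigned int event)
	{
		switch (event) {
		case CPUFREQ_GOV_POLICY_INIT:
			/* allocate per-policy resources when the policy is created */
			policy->governor_data = kzalloc(sizeof(struct my_tuners),
							GFP_KERNEL);
			return policy->governor_data ? 0 : -ENOMEM;
		case CPUFREQ_GOV_POLICY_EXIT:
			/* free them when the policy is destroyed */
			kfree(policy->governor_data);
			policy->governor_data = NULL;
			return 0;
		default:
			/* CPUFREQ_GOV_START/STOP/LIMITS handled as before */
			return 0;
		}
	}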
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq.c | 20 +++++++++++++++++---
 include/linux/cpufreq.h   |  9 ++++++---
 2 files changed, 23 insertions(+), 6 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index e619f4f..1ae78d4 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -1076,6 +1076,7 @@ static int __cpufreq_remove_dev(struct device *dev, struct subsys_interface *sif
/* If cpu is last user of policy, free policy */ if (cpus == 1) { + __cpufreq_governor(data, CPUFREQ_GOV_POLICY_EXIT); lock_policy_rwsem_write(cpu); kobj = &data->kobj; cmp = &data->kobj_unregister; @@ -1655,7 +1656,7 @@ EXPORT_SYMBOL(cpufreq_get_policy); static int __cpufreq_set_policy(struct cpufreq_policy *data, struct cpufreq_policy *policy) { - int ret = 0; + int ret = 0, failed = 1;
pr_debug("setting new policy for CPU %u: %u - %u kHz\n", policy->cpu, policy->min, policy->max); @@ -1709,18 +1710,31 @@ static int __cpufreq_set_policy(struct cpufreq_policy *data, pr_debug("governor switch\n");
/* end old governor */ - if (data->governor) + if (data->governor) { __cpufreq_governor(data, CPUFREQ_GOV_STOP); + __cpufreq_governor(data, + CPUFREQ_GOV_POLICY_EXIT); + }
/* start new governor */ data->governor = policy->governor; - if (__cpufreq_governor(data, CPUFREQ_GOV_START)) { + if (!__cpufreq_governor(data, CPUFREQ_GOV_POLICY_INIT)) { + if (!__cpufreq_governor(data, CPUFREQ_GOV_START)) + failed = 0; + else + __cpufreq_governor(data, + CPUFREQ_GOV_POLICY_EXIT); + } + + if (failed) { /* new governor failed, so re-start old one */ pr_debug("starting governor %s failed\n", data->governor->name); if (old_gov) { data->governor = old_gov; __cpufreq_governor(data, + CPUFREQ_GOV_POLICY_INIT); + __cpufreq_governor(data, CPUFREQ_GOV_START); } ret = -EINVAL; diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index a22944c..3b822ce 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -106,6 +106,7 @@ struct cpufreq_policy { * governors are used */ unsigned int policy; /* see above */ struct cpufreq_governor *governor; /* see below */ + void *governor_data;
struct work_struct update; /* if update_policy() needs to be * called, but you're in IRQ context */ @@ -178,9 +179,11 @@ static inline unsigned long cpufreq_scale(unsigned long old, u_int div, u_int mu * CPUFREQ GOVERNORS * *********************************************************************/
-#define CPUFREQ_GOV_START 1 -#define CPUFREQ_GOV_STOP 2 -#define CPUFREQ_GOV_LIMITS 3 +#define CPUFREQ_GOV_START 1 +#define CPUFREQ_GOV_STOP 2 +#define CPUFREQ_GOV_LIMITS 3 +#define CPUFREQ_GOV_POLICY_INIT 4 +#define CPUFREQ_GOV_POLICY_EXIT 4
struct cpufreq_governor { char name[CPUFREQ_NAME_LEN];
Currently, there can't be multiple instances of a single governor type. If we have a multi-package system, where we have multiple instances of struct cpufreq_policy (one per package), we can't have multiple instances of the same governor, i.e. we can't run a separate instance of the ondemand governor for each package.

The governor's directory in sysfs is created at /sys/devices/system/cpu/cpufreq/governor-name/, which again reflects that there can be only one instance of a governor type in the system.

This is a bottleneck for multi-cluster systems, where we want different packages to use the same governor type but with different tunables.

This patch uses the infrastructure provided by the earlier patch and implements init/exit routines for the ondemand and conservative governors.
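The resulting split can be summarized as follows (a sketch with fields abbreviated and comments added for illustration): the static description of a governor type stays global, while everything tunable moves into a per-policy object reachable through policy->governor_data.

	/* Shared description of a governor type (one instance per governor) */
	struct common_dbs_data {
		int governor;		/* GOV_ONDEMAND or GOV_CONSERVATIVE */
		struct attribute_group *attr_group; /* sysfs group, now created per policy */
		int (*init)(struct dbs_data *dbs_data); /* allocates the tuners below */
		/* ... per-CPU accessors, timer and check_cpu callbacks ... */
	};

	/* Per-policy instance, hung off policy->governor_data */
	struct dbs_data {
		struct common_dbs_data *cdata;
		unsigned int min_sampling_rate;
		void *tuners;		/* od_dbs_tuners or cs_dbs_tuners */
		struct mutex mutex;
	};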
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq.c              |   4 -
 drivers/cpufreq/cpufreq_conservative.c | 142 +++++++++++++----------
 drivers/cpufreq/cpufreq_governor.c     | 138 +++++++++++++---------
 drivers/cpufreq/cpufreq_governor.h     |  42 ++++---
 drivers/cpufreq/cpufreq_ondemand.c     | 205 +++++++++++++++++++--------------
 include/linux/cpufreq.h                |  19 +--
 6 files changed, 314 insertions(+), 236 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 1ae78d4..7aacfbf 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -1551,9 +1551,6 @@ static int __cpufreq_governor(struct cpufreq_policy *policy, policy->cpu, event); ret = policy->governor->governor(policy, event);
- if (!policy->governor->initialized && (event == CPUFREQ_GOV_START)) - policy->governor->initialized = 1; - /* we keep one module reference alive for each CPU governed by this CPU */ if ((event != CPUFREQ_GOV_START) || ret) @@ -1577,7 +1574,6 @@ int cpufreq_register_governor(struct cpufreq_governor *governor)
mutex_lock(&cpufreq_governor_mutex);
- governor->initialized = 0; err = -EBUSY; if (__find_governor(governor->name) == NULL) { err = 0; diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c index e8bb915..6b13f9f 100644 --- a/drivers/cpufreq/cpufreq_conservative.c +++ b/drivers/cpufreq/cpufreq_conservative.c @@ -20,6 +20,7 @@ #include <linux/mutex.h> #include <linux/notifier.h> #include <linux/percpu-defs.h> +#include <linux/slab.h> #include <linux/sysfs.h> #include <linux/types.h>
@@ -31,17 +32,8 @@ #define DEF_SAMPLING_DOWN_FACTOR (1) #define MAX_SAMPLING_DOWN_FACTOR (10)
-static struct dbs_data cs_dbs_data; static DEFINE_PER_CPU(struct cs_cpu_dbs_info_s, cs_cpu_dbs_info);
-static struct cs_dbs_tuners cs_tuners = { - .up_threshold = DEF_FREQUENCY_UP_THRESHOLD, - .down_threshold = DEF_FREQUENCY_DOWN_THRESHOLD, - .sampling_down_factor = DEF_SAMPLING_DOWN_FACTOR, - .ignore_nice = 0, - .freq_step = 5, -}; - /* * Every sampling_rate, we check, if current idle time is less than 20% * (default), then we try to increase frequency Every sampling_rate * @@ -55,24 +47,26 @@ static void cs_check_cpu(int cpu, unsigned int load) { struct cs_cpu_dbs_info_s *dbs_info = &per_cpu(cs_cpu_dbs_info, cpu); struct cpufreq_policy *policy = dbs_info->cdbs.cur_policy; + struct dbs_data *dbs_data = policy->governor_data; + struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int freq_target;
/* * break out if we 'cannot' reduce the speed as the user might * want freq_step to be zero */ - if (cs_tuners.freq_step == 0) + if (cs_tuners->freq_step == 0) return;
/* Check for frequency increase */ - if (load > cs_tuners.up_threshold) { + if (load > cs_tuners->up_threshold) { dbs_info->down_skip = 0;
/* if we are already at full speed then break out early */ if (dbs_info->requested_freq == policy->max) return;
- freq_target = (cs_tuners.freq_step * policy->max) / 100; + freq_target = (cs_tuners->freq_step * policy->max) / 100;
/* max freq cannot be less than 100. But who knows.... */ if (unlikely(freq_target == 0)) @@ -92,8 +86,8 @@ static void cs_check_cpu(int cpu, unsigned int load) * support the current CPU usage without triggering the up policy. To be * safe, we focus 10 points under the threshold. */ - if (load < (cs_tuners.down_threshold - 10)) { - freq_target = (cs_tuners.freq_step * policy->max) / 100; + if (load < (cs_tuners->down_threshold - 10)) { + freq_target = (cs_tuners->freq_step * policy->max) / 100;
dbs_info->requested_freq -= freq_target; if (dbs_info->requested_freq < policy->min) @@ -119,11 +113,13 @@ static void cs_dbs_timer(struct work_struct *work) unsigned int cpu = dbs_info->cdbs.cur_policy->cpu; struct cs_cpu_dbs_info_s *core_dbs_info = &per_cpu(cs_cpu_dbs_info, cpu); - int delay = delay_for_sampling_rate(cs_tuners.sampling_rate); + struct dbs_data *dbs_data = dbs_info->cdbs.cur_policy->governor_data; + struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; + int delay = delay_for_sampling_rate(cs_tuners->sampling_rate);
mutex_lock(&core_dbs_info->cdbs.timer_mutex); - if (need_load_eval(&core_dbs_info->cdbs, cs_tuners.sampling_rate)) - dbs_check_cpu(&cs_dbs_data, cpu); + if (need_load_eval(&core_dbs_info->cdbs, cs_tuners->sampling_rate)) + dbs_check_cpu(dbs_data, cpu);
schedule_delayed_work_on(smp_processor_id(), dw, delay); mutex_unlock(&core_dbs_info->cdbs.timer_mutex); @@ -154,16 +150,11 @@ static int dbs_cpufreq_notifier(struct notifier_block *nb, unsigned long val, }
/************************** sysfs interface ************************/ -static ssize_t show_sampling_rate_min(struct kobject *kobj, - struct attribute *attr, char *buf) -{ - return sprintf(buf, "%u\n", cs_dbs_data.min_sampling_rate); -} - -static ssize_t store_sampling_down_factor(struct kobject *a, - struct attribute *b, - const char *buf, size_t count) +static ssize_t store_sampling_down_factor(struct cpufreq_policy *policy, + const char *buf, size_t count) { + struct dbs_data *dbs_data = policy->governor_data; + struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -171,13 +162,15 @@ static ssize_t store_sampling_down_factor(struct kobject *a, if (ret != 1 || input > MAX_SAMPLING_DOWN_FACTOR || input < 1) return -EINVAL;
- cs_tuners.sampling_down_factor = input; + cs_tuners->sampling_down_factor = input; return count; }
-static ssize_t store_sampling_rate(struct kobject *a, struct attribute *b, - const char *buf, size_t count) +static ssize_t store_sampling_rate(struct cpufreq_policy *policy, + const char *buf, size_t count) { + struct dbs_data *dbs_data = policy->governor_data; + struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -185,43 +178,49 @@ static ssize_t store_sampling_rate(struct kobject *a, struct attribute *b, if (ret != 1) return -EINVAL;
- cs_tuners.sampling_rate = max(input, cs_dbs_data.min_sampling_rate); + cs_tuners->sampling_rate = max(input, dbs_data->min_sampling_rate); return count; }
-static ssize_t store_up_threshold(struct kobject *a, struct attribute *b, - const char *buf, size_t count) +static ssize_t store_up_threshold(struct cpufreq_policy *policy, + const char *buf, size_t count) { + struct dbs_data *dbs_data = policy->governor_data; + struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; ret = sscanf(buf, "%u", &input);
- if (ret != 1 || input > 100 || input <= cs_tuners.down_threshold) + if (ret != 1 || input > 100 || input <= cs_tuners->down_threshold) return -EINVAL;
- cs_tuners.up_threshold = input; + cs_tuners->up_threshold = input; return count; }
-static ssize_t store_down_threshold(struct kobject *a, struct attribute *b, - const char *buf, size_t count) +static ssize_t store_down_threshold(struct cpufreq_policy *policy, + const char *buf, size_t count) { + struct dbs_data *dbs_data = policy->governor_data; + struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; ret = sscanf(buf, "%u", &input);
/* cannot be lower than 11 otherwise freq will not fall */ if (ret != 1 || input < 11 || input > 100 || - input >= cs_tuners.up_threshold) + input >= cs_tuners->up_threshold) return -EINVAL;
- cs_tuners.down_threshold = input; + cs_tuners->down_threshold = input; return count; }
-static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b, - const char *buf, size_t count) +static ssize_t store_ignore_nice(struct cpufreq_policy *policy, + const char *buf, size_t count) { + struct dbs_data *dbs_data = policy->governor_data; + struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input, j; int ret;
@@ -232,10 +231,10 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b, if (input > 1) input = 1;
- if (input == cs_tuners.ignore_nice) /* nothing to do */ + if (input == cs_tuners->ignore_nice) /* nothing to do */ return count;
- cs_tuners.ignore_nice = input; + cs_tuners->ignore_nice = input;
/* we need to re-evaluate prev_cpu_idle */ for_each_online_cpu(j) { @@ -243,16 +242,18 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b, dbs_info = &per_cpu(cs_cpu_dbs_info, j); dbs_info->cdbs.prev_cpu_idle = get_cpu_idle_time(j, &dbs_info->cdbs.prev_cpu_wall); - if (cs_tuners.ignore_nice) + if (cs_tuners->ignore_nice) dbs_info->cdbs.prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE]; } return count; }
-static ssize_t store_freq_step(struct kobject *a, struct attribute *b, - const char *buf, size_t count) +static ssize_t store_freq_step(struct cpufreq_policy *policy, + const char *buf, size_t count) { + struct dbs_data *dbs_data = policy->governor_data; + struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -267,7 +268,7 @@ static ssize_t store_freq_step(struct kobject *a, struct attribute *b, * no need to test here if freq_step is zero as the user might actually * want this, they would be crazy though :) */ - cs_tuners.freq_step = input; + cs_tuners->freq_step = input; return count; }
@@ -275,16 +276,17 @@ show_one(cs, sampling_rate, sampling_rate); show_one(cs, sampling_down_factor, sampling_down_factor); show_one(cs, up_threshold, up_threshold); show_one(cs, down_threshold, down_threshold); -show_one(cs, ignore_nice_load, ignore_nice); +show_one(cs, ignore_nice, ignore_nice); show_one(cs, freq_step, freq_step); +declare_show_sampling_rate_min();
-define_one_global_rw(sampling_rate); -define_one_global_rw(sampling_down_factor); -define_one_global_rw(up_threshold); -define_one_global_rw(down_threshold); -define_one_global_rw(ignore_nice_load); -define_one_global_rw(freq_step); -define_one_global_ro(sampling_rate_min); +cpufreq_freq_attr_rw(sampling_rate); +cpufreq_freq_attr_rw(sampling_down_factor); +cpufreq_freq_attr_rw(up_threshold); +cpufreq_freq_attr_rw(down_threshold); +cpufreq_freq_attr_rw(ignore_nice); +cpufreq_freq_attr_rw(freq_step); +cpufreq_freq_attr_ro(sampling_rate_min);
static struct attribute *dbs_attributes[] = { &sampling_rate_min.attr, @@ -292,7 +294,7 @@ static struct attribute *dbs_attributes[] = { &sampling_down_factor.attr, &up_threshold.attr, &down_threshold.attr, - &ignore_nice_load.attr, + &ignore_nice.attr, &freq_step.attr, NULL }; @@ -304,6 +306,29 @@ static struct attribute_group cs_attr_group = {
/************************** sysfs end ************************/
+static int cs_init(struct dbs_data *dbs_data) +{ + struct cs_dbs_tuners *tuners; + + tuners = kzalloc(sizeof(struct cs_dbs_tuners), GFP_KERNEL); + if (!tuners) { + pr_err("%s: kzalloc failed\n", __func__); + return -ENOMEM; + } + + tuners->up_threshold = DEF_FREQUENCY_UP_THRESHOLD; + tuners->down_threshold = DEF_FREQUENCY_DOWN_THRESHOLD; + tuners->sampling_down_factor = DEF_SAMPLING_DOWN_FACTOR; + tuners->ignore_nice = 0; + tuners->freq_step = 5; + + dbs_data->tuners = tuners; + dbs_data->min_sampling_rate = MIN_SAMPLING_RATE_RATIO * + jiffies_to_usecs(10); + mutex_init(&dbs_data->mutex); + return 0; +} + define_get_cpu_dbs_routines(cs_cpu_dbs_info);
static struct notifier_block cs_cpufreq_notifier_block = { @@ -314,21 +339,21 @@ static struct cs_ops cs_ops = { .notifier_block = &cs_cpufreq_notifier_block, };
-static struct dbs_data cs_dbs_data = { +static struct common_dbs_data cs_dbs_cdata = { .governor = GOV_CONSERVATIVE, .attr_group = &cs_attr_group, - .tuners = &cs_tuners, .get_cpu_cdbs = get_cpu_cdbs, .get_cpu_dbs_info_s = get_cpu_dbs_info_s, .gov_dbs_timer = cs_dbs_timer, .gov_check_cpu = cs_check_cpu, .gov_ops = &cs_ops, + .init = cs_init, };
static int cs_cpufreq_governor_dbs(struct cpufreq_policy *policy, unsigned int event) { - return cpufreq_governor_dbs(&cs_dbs_data, policy, event); + return cpufreq_governor_dbs(policy, &cs_dbs_cdata, event); }
#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE @@ -343,7 +368,6 @@ struct cpufreq_governor cpufreq_gov_conservative = {
static int __init cpufreq_gov_dbs_init(void) { - mutex_init(&cs_dbs_data.mutex); return cpufreq_register_governor(&cpufreq_gov_conservative); }
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c index e4a306c..545ae24 100644 --- a/drivers/cpufreq/cpufreq_governor.c +++ b/drivers/cpufreq/cpufreq_governor.c @@ -22,6 +22,7 @@ #include <linux/export.h> #include <linux/kernel_stat.h> #include <linux/mutex.h> +#include <linux/slab.h> #include <linux/tick.h> #include <linux/types.h> #include <linux/workqueue.h> @@ -65,7 +66,7 @@ EXPORT_SYMBOL_GPL(get_cpu_idle_time);
void dbs_check_cpu(struct dbs_data *dbs_data, int cpu) { - struct cpu_dbs_common_info *cdbs = dbs_data->get_cpu_cdbs(cpu); + struct cpu_dbs_common_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu); struct od_dbs_tuners *od_tuners = dbs_data->tuners; struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; struct cpufreq_policy *policy; @@ -73,7 +74,7 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu) unsigned int ignore_nice; unsigned int j;
- if (dbs_data->governor == GOV_ONDEMAND) + if (dbs_data->cdata->governor == GOV_ONDEMAND) ignore_nice = od_tuners->ignore_nice; else ignore_nice = cs_tuners->ignore_nice; @@ -87,7 +88,7 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu) unsigned int idle_time, wall_time, iowait_time; unsigned int load;
- j_cdbs = dbs_data->get_cpu_cdbs(j); + j_cdbs = dbs_data->cdata->get_cpu_cdbs(j);
cur_idle_time = get_cpu_idle_time(j, &cur_wall_time);
@@ -117,9 +118,9 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu) idle_time += jiffies_to_usecs(cur_nice_jiffies); }
- if (dbs_data->governor == GOV_ONDEMAND) { + if (dbs_data->cdata->governor == GOV_ONDEMAND) { struct od_cpu_dbs_info_s *od_j_dbs_info = - dbs_data->get_cpu_dbs_info_s(cpu); + dbs_data->cdata->get_cpu_dbs_info_s(cpu);
cur_iowait_time = get_cpu_iowait_time_us(j, &cur_wall_time); @@ -145,7 +146,7 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
load = 100 * (wall_time - idle_time) / wall_time;
- if (dbs_data->governor == GOV_ONDEMAND) { + if (dbs_data->cdata->governor == GOV_ONDEMAND) { int freq_avg = __cpufreq_driver_getavg(policy, j); if (freq_avg <= 0) freq_avg = policy->cur; @@ -157,7 +158,7 @@ void dbs_check_cpu(struct dbs_data *dbs_data, int cpu) max_load = load; }
- dbs_data->gov_check_cpu(cpu, max_load); + dbs_data->cdata->gov_check_cpu(cpu, max_load); } EXPORT_SYMBOL_GPL(dbs_check_cpu);
@@ -165,14 +166,14 @@ static inline void dbs_timer_init(struct dbs_data *dbs_data, int cpu, unsigned int sampling_rate) { int delay = delay_for_sampling_rate(sampling_rate); - struct cpu_dbs_common_info *cdbs = dbs_data->get_cpu_cdbs(cpu); + struct cpu_dbs_common_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
schedule_delayed_work_on(cpu, &cdbs->work, delay); }
static inline void dbs_timer_exit(struct dbs_data *dbs_data, int cpu) { - struct cpu_dbs_common_info *cdbs = dbs_data->get_cpu_cdbs(cpu); + struct cpu_dbs_common_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
cancel_delayed_work_sync(&cdbs->work); } @@ -196,31 +197,83 @@ bool need_load_eval(struct cpu_dbs_common_info *cdbs, } EXPORT_SYMBOL_GPL(need_load_eval);
-int cpufreq_governor_dbs(struct dbs_data *dbs_data, - struct cpufreq_policy *policy, unsigned int event) +static void set_sampling_rate(struct dbs_data *dbs_data, + unsigned int sampling_rate) +{ + if (dbs_data->cdata->governor == GOV_CONSERVATIVE) { + struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; + cs_tuners->sampling_rate = sampling_rate; + } else { + struct od_dbs_tuners *od_tuners = dbs_data->tuners; + od_tuners->sampling_rate = sampling_rate; + } +} + +int cpufreq_governor_dbs(struct cpufreq_policy *policy, + struct common_dbs_data *cdata, unsigned int event) { + struct dbs_data *dbs_data = policy->governor_data; struct od_cpu_dbs_info_s *od_dbs_info = NULL; struct cs_cpu_dbs_info_s *cs_dbs_info = NULL; struct cs_ops *cs_ops = NULL; struct od_ops *od_ops = NULL; - struct od_dbs_tuners *od_tuners = dbs_data->tuners; - struct cs_dbs_tuners *cs_tuners = dbs_data->tuners; + struct od_dbs_tuners *od_tuners = NULL; + struct cs_dbs_tuners *cs_tuners = NULL; struct cpu_dbs_common_info *cpu_cdbs; - unsigned int *sampling_rate, latency, ignore_nice, j, cpu = policy->cpu; + unsigned int sampling_rate, latency, ignore_nice, j, cpu = policy->cpu; int rc;
- cpu_cdbs = dbs_data->get_cpu_cdbs(cpu); + WARN_ON(!dbs_data && (event != CPUFREQ_GOV_POLICY_INIT));
- if (dbs_data->governor == GOV_CONSERVATIVE) { - cs_dbs_info = dbs_data->get_cpu_dbs_info_s(cpu); - sampling_rate = &cs_tuners->sampling_rate; + switch (event) { + case CPUFREQ_GOV_POLICY_INIT: + WARN_ON(dbs_data); + dbs_data = kzalloc(sizeof(*dbs_data), GFP_KERNEL); + if (!dbs_data) { + pr_err("%s: POLICY_INIT: kzalloc failed\n", __func__); + return -ENOMEM; + } + + dbs_data->cdata = cdata; + rc = cdata->init(dbs_data); + if (rc) { + pr_err("%s: POLICY_INIT: init() failed\n", __func__); + kfree(dbs_data); + return rc; + } + policy->governor_data = dbs_data; + + /* policy latency is in nS. Convert it to uS first */ + latency = policy->cpuinfo.transition_latency / 1000; + if (latency == 0) + latency = 1; + + /* Bring kernel and HW constraints together */ + dbs_data->min_sampling_rate = max(dbs_data->min_sampling_rate, + MIN_LATENCY_MULTIPLIER * latency); + set_sampling_rate(dbs_data, max(dbs_data->min_sampling_rate, + latency * LATENCY_MULTIPLIER)); + return 0; + case CPUFREQ_GOV_POLICY_EXIT: + kfree(dbs_data); + policy->governor_data = NULL; + return 0; + } + + cpu_cdbs = dbs_data->cdata->get_cpu_cdbs(cpu); + + if (dbs_data->cdata->governor == GOV_CONSERVATIVE) { + cs_tuners = dbs_data->tuners; + cs_dbs_info = dbs_data->cdata->get_cpu_dbs_info_s(cpu); + sampling_rate = cs_tuners->sampling_rate; ignore_nice = cs_tuners->ignore_nice; - cs_ops = dbs_data->gov_ops; + cs_ops = dbs_data->cdata->gov_ops; } else { - od_dbs_info = dbs_data->get_cpu_dbs_info_s(cpu); - sampling_rate = &od_tuners->sampling_rate; + od_tuners = dbs_data->tuners; + od_dbs_info = dbs_data->cdata->get_cpu_dbs_info_s(cpu); + sampling_rate = od_tuners->sampling_rate; ignore_nice = od_tuners->ignore_nice; - od_ops = dbs_data->gov_ops; + od_ops = dbs_data->cdata->gov_ops; }
switch (event) { @@ -232,7 +285,7 @@ int cpufreq_governor_dbs(struct dbs_data *dbs_data,
for_each_cpu(j, policy->cpus) { struct cpu_dbs_common_info *j_cdbs = - dbs_data->get_cpu_cdbs(j); + dbs_data->cdata->get_cpu_cdbs(j);
j_cdbs->cpu = j; j_cdbs->cur_policy = policy; @@ -244,11 +297,11 @@ int cpufreq_governor_dbs(struct dbs_data *dbs_data,
mutex_init(&j_cdbs->timer_mutex); INIT_DEFERRABLE_WORK(&j_cdbs->work, - dbs_data->gov_dbs_timer); + dbs_data->cdata->gov_dbs_timer); }
- rc = sysfs_create_group(cpufreq_global_kobject, - dbs_data->attr_group); + rc = sysfs_create_group(&policy->kobj, + dbs_data->cdata->attr_group); if (rc) { mutex_unlock(&dbs_data->mutex); return rc; @@ -258,51 +311,29 @@ int cpufreq_governor_dbs(struct dbs_data *dbs_data, * conservative does not implement micro like ondemand * governor, thus we are bound to jiffes/HZ */ - if (dbs_data->governor == GOV_CONSERVATIVE) { + if (dbs_data->cdata->governor == GOV_CONSERVATIVE) { cs_dbs_info->down_skip = 0; cs_dbs_info->enable = 1; cs_dbs_info->requested_freq = policy->cur; cpufreq_register_notifier(cs_ops->notifier_block, CPUFREQ_TRANSITION_NOTIFIER); - - if (!policy->governor->initialized) - dbs_data->min_sampling_rate = - MIN_SAMPLING_RATE_RATIO * - jiffies_to_usecs(10); } else { od_dbs_info->rate_mult = 1; od_dbs_info->sample_type = OD_NORMAL_SAMPLE; od_ops->powersave_bias_init_cpu(cpu); - - if (!policy->governor->initialized) - od_tuners->io_is_busy = od_ops->io_busy(); }
- if (policy->governor->initialized) - goto unlock; - - /* policy latency is in nS. Convert it to uS first */ - latency = policy->cpuinfo.transition_latency / 1000; - if (latency == 0) - latency = 1; - - /* Bring kernel and HW constraints together */ - dbs_data->min_sampling_rate = max(dbs_data->min_sampling_rate, - MIN_LATENCY_MULTIPLIER * latency); - *sampling_rate = max(dbs_data->min_sampling_rate, latency * - LATENCY_MULTIPLIER); -unlock: mutex_unlock(&dbs_data->mutex);
/* Initiate timer time stamp */ cpu_cdbs->time_stamp = ktime_get();
for_each_cpu(j, policy->cpus) - dbs_timer_init(dbs_data, j, *sampling_rate); + dbs_timer_init(dbs_data, j, sampling_rate); break;
case CPUFREQ_GOV_STOP: - if (dbs_data->governor == GOV_CONSERVATIVE) + if (dbs_data->cdata->governor == GOV_CONSERVATIVE) cs_dbs_info->enable = 0;
for_each_cpu(j, policy->cpus) @@ -311,9 +342,8 @@ unlock: mutex_lock(&dbs_data->mutex); mutex_destroy(&cpu_cdbs->timer_mutex);
- sysfs_remove_group(cpufreq_global_kobject, - dbs_data->attr_group); - if (dbs_data->governor == GOV_CONSERVATIVE) + sysfs_remove_group(&policy->kobj, dbs_data->cdata->attr_group); + if (dbs_data->cdata->governor == GOV_CONSERVATIVE) cpufreq_unregister_notifier(cs_ops->notifier_block, CPUFREQ_TRANSITION_NOTIFIER); mutex_unlock(&dbs_data->mutex); diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h index 16314b6..8ea106a 100644 --- a/drivers/cpufreq/cpufreq_governor.h +++ b/drivers/cpufreq/cpufreq_governor.h @@ -43,9 +43,11 @@ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE}; /* Macro creating sysfs show routines */ #define show_one(_gov, file_name, object) \ static ssize_t show_##file_name \ -(struct kobject *kobj, struct attribute *attr, char *buf) \ +(struct cpufreq_policy *policy, char *buf) \ { \ - return sprintf(buf, "%u\n", _gov##_tuners.object); \ + struct dbs_data *dbs_data = policy->governor_data; \ + struct _gov##_dbs_tuners *tuners = dbs_data->tuners; \ + return sprintf(buf, "%u\n", tuners->file_name); \ }
#define define_get_cpu_dbs_routines(_dbs_info) \ @@ -103,7 +105,7 @@ struct cs_cpu_dbs_info_s { unsigned int enable:1; };
-/* Governers sysfs tunables */ +/* Per policy Governers sysfs tunables */ struct od_dbs_tuners { unsigned int ignore_nice; unsigned int sampling_rate; @@ -123,31 +125,37 @@ struct cs_dbs_tuners { unsigned int freq_step; };
-/* Per Governer data */ -struct dbs_data { +/* Common Governer data across policies */ +struct dbs_data; +struct common_dbs_data { /* Common across governors */ #define GOV_ONDEMAND 0 #define GOV_CONSERVATIVE 1 int governor; - unsigned int min_sampling_rate; struct attribute_group *attr_group; - void *tuners; - - /* dbs_mutex protects dbs_enable in governor start/stop */ - struct mutex mutex;
struct cpu_dbs_common_info *(*get_cpu_cdbs)(int cpu); void *(*get_cpu_dbs_info_s)(int cpu); void (*gov_dbs_timer)(struct work_struct *work); void (*gov_check_cpu)(int cpu, unsigned int load); + int (*init)(struct dbs_data *dbs_data);
/* Governor specific ops, see below */ void *gov_ops; };
+/* Governer Per policy data */ +struct dbs_data { + struct common_dbs_data *cdata; + unsigned int min_sampling_rate; + void *tuners; + + /* dbs_mutex protects dbs_enable in governor start/stop */ + struct mutex mutex; +}; + /* Governor specific ops, will be passed to dbs_data->gov_ops */ struct od_ops { - int (*io_busy)(void); void (*powersave_bias_init_cpu)(int cpu); unsigned int (*powersave_bias_target)(struct cpufreq_policy *policy, unsigned int freq_next, unsigned int relation); @@ -169,10 +177,18 @@ static inline int delay_for_sampling_rate(unsigned int sampling_rate) return delay; }
+#define declare_show_sampling_rate_min() \ +static ssize_t show_sampling_rate_min(struct cpufreq_policy *policy, \ + char *buf) \ +{ \ + struct dbs_data *dbs_data = policy->governor_data; \ + return sprintf(buf, "%u\n", dbs_data->min_sampling_rate); \ +} + u64 get_cpu_idle_time(unsigned int cpu, u64 *wall); void dbs_check_cpu(struct dbs_data *dbs_data, int cpu); bool need_load_eval(struct cpu_dbs_common_info *cdbs, unsigned int sampling_rate); -int cpufreq_governor_dbs(struct dbs_data *dbs_data, - struct cpufreq_policy *policy, unsigned int event); +int cpufreq_governor_dbs(struct cpufreq_policy *policy, + struct common_dbs_data *cdata, unsigned int event); #endif /* _CPUFREQ_GOVERNER_H */ diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c index f38b8da..7e15bbe 100644 --- a/drivers/cpufreq/cpufreq_ondemand.c +++ b/drivers/cpufreq/cpufreq_ondemand.c @@ -20,6 +20,7 @@ #include <linux/module.h> #include <linux/mutex.h> #include <linux/percpu-defs.h> +#include <linux/slab.h> #include <linux/sysfs.h> #include <linux/tick.h> #include <linux/types.h> @@ -37,21 +38,12 @@ #define MIN_FREQUENCY_UP_THRESHOLD (11) #define MAX_FREQUENCY_UP_THRESHOLD (100)
-static struct dbs_data od_dbs_data; static DEFINE_PER_CPU(struct od_cpu_dbs_info_s, od_cpu_dbs_info);
#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND static struct cpufreq_governor cpufreq_gov_ondemand; #endif
-static struct od_dbs_tuners od_tuners = { - .up_threshold = DEF_FREQUENCY_UP_THRESHOLD, - .sampling_down_factor = DEF_SAMPLING_DOWN_FACTOR, - .down_differential = DEF_FREQUENCY_DOWN_DIFFERENTIAL, - .ignore_nice = 0, - .powersave_bias = 0, -}; - static void ondemand_powersave_bias_init_cpu(int cpu) { struct od_cpu_dbs_info_s *dbs_info = &per_cpu(od_cpu_dbs_info, cpu); @@ -97,6 +89,8 @@ static unsigned int powersave_bias_target(struct cpufreq_policy *policy, unsigned int jiffies_total, jiffies_hi, jiffies_lo; struct od_cpu_dbs_info_s *dbs_info = &per_cpu(od_cpu_dbs_info, policy->cpu); + struct dbs_data *dbs_data = policy->governor_data; + struct od_dbs_tuners *od_tuners = dbs_data->tuners;
if (!dbs_info->freq_table) { dbs_info->freq_lo = 0; @@ -107,7 +101,7 @@ static unsigned int powersave_bias_target(struct cpufreq_policy *policy, cpufreq_frequency_table_target(policy, dbs_info->freq_table, freq_next, relation, &index); freq_req = dbs_info->freq_table[index].frequency; - freq_reduc = freq_req * od_tuners.powersave_bias / 1000; + freq_reduc = freq_req * od_tuners->powersave_bias / 1000; freq_avg = freq_req - freq_reduc;
/* Find freq bounds for freq_avg in freq_table */ @@ -126,7 +120,7 @@ static unsigned int powersave_bias_target(struct cpufreq_policy *policy, dbs_info->freq_lo_jiffies = 0; return freq_lo; } - jiffies_total = usecs_to_jiffies(od_tuners.sampling_rate); + jiffies_total = usecs_to_jiffies(od_tuners->sampling_rate); jiffies_hi = (freq_avg - freq_lo) * jiffies_total; jiffies_hi += ((freq_hi - freq_lo) / 2); jiffies_hi /= (freq_hi - freq_lo); @@ -147,12 +141,15 @@ static void ondemand_powersave_bias_init(void)
static void dbs_freq_increase(struct cpufreq_policy *p, unsigned int freq) { - if (od_tuners.powersave_bias) + struct dbs_data *dbs_data = p->governor_data; + struct od_dbs_tuners *od_tuners = dbs_data->tuners; + + if (od_tuners->powersave_bias) freq = powersave_bias_target(p, freq, CPUFREQ_RELATION_H); else if (p->cur == p->max) return;
- __cpufreq_driver_target(p, freq, od_tuners.powersave_bias ? + __cpufreq_driver_target(p, freq, od_tuners->powersave_bias ? CPUFREQ_RELATION_L : CPUFREQ_RELATION_H); }
@@ -169,15 +166,17 @@ static void od_check_cpu(int cpu, unsigned int load_freq) { struct od_cpu_dbs_info_s *dbs_info = &per_cpu(od_cpu_dbs_info, cpu); struct cpufreq_policy *policy = dbs_info->cdbs.cur_policy; + struct dbs_data *dbs_data = policy->governor_data; + struct od_dbs_tuners *od_tuners = dbs_data->tuners;
dbs_info->freq_lo = 0;
/* Check for frequency increase */ - if (load_freq > od_tuners.up_threshold * policy->cur) { + if (load_freq > od_tuners->up_threshold * policy->cur) { /* If switching to max speed, apply sampling_down_factor */ if (policy->cur < policy->max) dbs_info->rate_mult = - od_tuners.sampling_down_factor; + od_tuners->sampling_down_factor; dbs_freq_increase(policy, policy->max); return; } @@ -192,11 +191,11 @@ static void od_check_cpu(int cpu, unsigned int load_freq) * support the current CPU usage without triggering the up policy. To be * safe, we focus 10 points under the threshold. */ - if (load_freq < (od_tuners.up_threshold - od_tuners.down_differential) * - policy->cur) { + if (load_freq < (od_tuners->up_threshold - od_tuners->down_differential) + * policy->cur) { unsigned int freq_next; - freq_next = load_freq / (od_tuners.up_threshold - - od_tuners.down_differential); + freq_next = load_freq / (od_tuners->up_threshold - + od_tuners->down_differential);
/* No longer fully busy, reset rate_mult */ dbs_info->rate_mult = 1; @@ -204,7 +203,7 @@ static void od_check_cpu(int cpu, unsigned int load_freq) if (freq_next < policy->min) freq_next = policy->min;
- if (!od_tuners.powersave_bias) { + if (!od_tuners->powersave_bias) { __cpufreq_driver_target(policy, freq_next, CPUFREQ_RELATION_L); } else { @@ -224,12 +223,14 @@ static void od_dbs_timer(struct work_struct *work) unsigned int cpu = dbs_info->cdbs.cur_policy->cpu; struct od_cpu_dbs_info_s *core_dbs_info = &per_cpu(od_cpu_dbs_info, cpu); + struct dbs_data *dbs_data = dbs_info->cdbs.cur_policy->governor_data; + struct od_dbs_tuners *od_tuners = dbs_data->tuners; int delay, sample_type = core_dbs_info->sample_type; bool eval_load;
mutex_lock(&core_dbs_info->cdbs.timer_mutex); eval_load = need_load_eval(&core_dbs_info->cdbs, - od_tuners.sampling_rate); + od_tuners->sampling_rate);
/* Common NORMAL_SAMPLE setup */ core_dbs_info->sample_type = OD_NORMAL_SAMPLE; @@ -241,13 +242,13 @@ static void od_dbs_timer(struct work_struct *work) CPUFREQ_RELATION_H); } else { if (eval_load) - dbs_check_cpu(&od_dbs_data, cpu); + dbs_check_cpu(dbs_data, cpu); if (core_dbs_info->freq_lo) { /* Setup timer for SUB_SAMPLE */ core_dbs_info->sample_type = OD_SUB_SAMPLE; delay = core_dbs_info->freq_hi_jiffies; } else { - delay = delay_for_sampling_rate(od_tuners.sampling_rate + delay = delay_for_sampling_rate(od_tuners->sampling_rate * core_dbs_info->rate_mult); } } @@ -257,13 +258,6 @@ static void od_dbs_timer(struct work_struct *work) }
/************************** sysfs interface ************************/ - -static ssize_t show_sampling_rate_min(struct kobject *kobj, - struct attribute *attr, char *buf) -{ - return sprintf(buf, "%u\n", od_dbs_data.min_sampling_rate); -} - /** * update_sampling_rate - update sampling rate effective immediately if needed. * @new_rate: new sampling rate @@ -277,12 +271,14 @@ static ssize_t show_sampling_rate_min(struct kobject *kobj, * reducing the sampling rate, we need to make the new value effective * immediately. */ -static void update_sampling_rate(unsigned int new_rate) +static void update_sampling_rate(struct dbs_data *dbs_data, + unsigned int new_rate) { + struct od_dbs_tuners *od_tuners = dbs_data->tuners; int cpu;
- od_tuners.sampling_rate = new_rate = max(new_rate, - od_dbs_data.min_sampling_rate); + od_tuners->sampling_rate = new_rate = max(new_rate, + dbs_data->min_sampling_rate);
for_each_online_cpu(cpu) { struct cpufreq_policy *policy; @@ -323,34 +319,39 @@ static void update_sampling_rate(unsigned int new_rate) } }
-static ssize_t store_sampling_rate(struct kobject *a, struct attribute *b, - const char *buf, size_t count) +static ssize_t store_sampling_rate(struct cpufreq_policy *policy, + const char *buf, size_t count) { + struct dbs_data *dbs_data = policy->governor_data; unsigned int input; int ret; ret = sscanf(buf, "%u", &input); if (ret != 1) return -EINVAL; - update_sampling_rate(input); + update_sampling_rate(dbs_data, input); return count; }
-static ssize_t store_io_is_busy(struct kobject *a, struct attribute *b, - const char *buf, size_t count) +static ssize_t store_io_is_busy(struct cpufreq_policy *policy, const char *buf, + size_t count) { + struct dbs_data *dbs_data = policy->governor_data; + struct od_dbs_tuners *od_tuners = dbs_data->tuners; unsigned int input; int ret;
ret = sscanf(buf, "%u", &input); if (ret != 1) return -EINVAL; - od_tuners.io_is_busy = !!input; + od_tuners->io_is_busy = !!input; return count; }
-static ssize_t store_up_threshold(struct kobject *a, struct attribute *b, - const char *buf, size_t count) +static ssize_t store_up_threshold(struct cpufreq_policy *policy, + const char *buf, size_t count) { + struct dbs_data *dbs_data = policy->governor_data; + struct od_dbs_tuners *od_tuners = dbs_data->tuners; unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -359,20 +360,22 @@ static ssize_t store_up_threshold(struct kobject *a, struct attribute *b, input < MIN_FREQUENCY_UP_THRESHOLD) { return -EINVAL; } - od_tuners.up_threshold = input; + od_tuners->up_threshold = input; return count; }
-static ssize_t store_sampling_down_factor(struct kobject *a, - struct attribute *b, const char *buf, size_t count) +static ssize_t store_sampling_down_factor(struct cpufreq_policy *policy, + const char *buf, size_t count) { + struct dbs_data *dbs_data = policy->governor_data; + struct od_dbs_tuners *od_tuners = dbs_data->tuners; unsigned int input, j; int ret; ret = sscanf(buf, "%u", &input);
if (ret != 1 || input > MAX_SAMPLING_DOWN_FACTOR || input < 1) return -EINVAL; - od_tuners.sampling_down_factor = input; + od_tuners->sampling_down_factor = input;
/* Reset down sampling multiplier in case it was active */ for_each_online_cpu(j) { @@ -383,9 +386,11 @@ static ssize_t store_sampling_down_factor(struct kobject *a, return count; }
-static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b, - const char *buf, size_t count) +static ssize_t store_ignore_nice(struct cpufreq_policy *policy, const char *buf, + size_t count) { + struct dbs_data *dbs_data = policy->governor_data; + struct od_dbs_tuners *od_tuners = dbs_data->tuners; unsigned int input; int ret;
@@ -398,10 +403,10 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b, if (input > 1) input = 1;
- if (input == od_tuners.ignore_nice) { /* nothing to do */ + if (input == od_tuners->ignore_nice) { /* nothing to do */ return count; } - od_tuners.ignore_nice = input; + od_tuners->ignore_nice = input;
/* we need to re-evaluate prev_cpu_idle */ for_each_online_cpu(j) { @@ -409,7 +414,7 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b, dbs_info = &per_cpu(od_cpu_dbs_info, j); dbs_info->cdbs.prev_cpu_idle = get_cpu_idle_time(j, &dbs_info->cdbs.prev_cpu_wall); - if (od_tuners.ignore_nice) + if (od_tuners->ignore_nice) dbs_info->cdbs.prev_cpu_nice = kcpustat_cpu(j).cpustat[CPUTIME_NICE];
@@ -417,9 +422,11 @@ static ssize_t store_ignore_nice_load(struct kobject *a, struct attribute *b, return count; }
-static ssize_t store_powersave_bias(struct kobject *a, struct attribute *b, - const char *buf, size_t count) +static ssize_t store_powersave_bias(struct cpufreq_policy *policy, + const char *buf, size_t count) { + struct dbs_data *dbs_data = policy->governor_data; + struct od_dbs_tuners *od_tuners = dbs_data->tuners; unsigned int input; int ret; ret = sscanf(buf, "%u", &input); @@ -430,7 +437,7 @@ static ssize_t store_powersave_bias(struct kobject *a, struct attribute *b, if (input > 1000) input = 1000;
- od_tuners.powersave_bias = input; + od_tuners->powersave_bias = input; ondemand_powersave_bias_init(); return count; } @@ -439,23 +446,24 @@ show_one(od, sampling_rate, sampling_rate); show_one(od, io_is_busy, io_is_busy); show_one(od, up_threshold, up_threshold); show_one(od, sampling_down_factor, sampling_down_factor); -show_one(od, ignore_nice_load, ignore_nice); +show_one(od, ignore_nice, ignore_nice); show_one(od, powersave_bias, powersave_bias); +declare_show_sampling_rate_min();
-define_one_global_rw(sampling_rate); -define_one_global_rw(io_is_busy); -define_one_global_rw(up_threshold); -define_one_global_rw(sampling_down_factor); -define_one_global_rw(ignore_nice_load); -define_one_global_rw(powersave_bias); -define_one_global_ro(sampling_rate_min); +cpufreq_freq_attr_rw(sampling_rate); +cpufreq_freq_attr_rw(io_is_busy); +cpufreq_freq_attr_rw(up_threshold); +cpufreq_freq_attr_rw(sampling_down_factor); +cpufreq_freq_attr_rw(ignore_nice); +cpufreq_freq_attr_rw(powersave_bias); +cpufreq_freq_attr_ro(sampling_rate_min);
static struct attribute *dbs_attributes[] = { &sampling_rate_min.attr, &sampling_rate.attr, &up_threshold.attr, &sampling_down_factor.attr, - &ignore_nice_load.attr, + &ignore_nice.attr, &powersave_bias.attr, &io_is_busy.attr, NULL @@ -468,30 +476,73 @@ static struct attribute_group od_attr_group = {
/************************** sysfs end ************************/
+static int od_init(struct dbs_data *dbs_data) +{ + struct od_dbs_tuners *tuners; + u64 idle_time; + int cpu; + + tuners = kzalloc(sizeof(struct od_dbs_tuners), GFP_KERNEL); + if (!tuners) { + pr_err("%s: kzalloc failed\n", __func__); + return -ENOMEM; + } + + cpu = get_cpu(); + idle_time = get_cpu_idle_time_us(cpu, NULL); + put_cpu(); + if (idle_time != -1ULL) { + /* Idle micro accounting is supported. Use finer thresholds */ + tuners->up_threshold = MICRO_FREQUENCY_UP_THRESHOLD; + tuners->down_differential = MICRO_FREQUENCY_DOWN_DIFFERENTIAL; + /* + * In nohz/micro accounting case we set the minimum frequency + * not depending on HZ, but fixed (very low). The deferred + * timer might skip some samples if idle/sleeping as needed. + */ + dbs_data->min_sampling_rate = MICRO_FREQUENCY_MIN_SAMPLE_RATE; + } else { + tuners->up_threshold = DEF_FREQUENCY_UP_THRESHOLD; + tuners->down_differential = DEF_FREQUENCY_DOWN_DIFFERENTIAL; + + /* For correct statistics, we need 10 ticks for each measure */ + dbs_data->min_sampling_rate = MIN_SAMPLING_RATE_RATIO * + jiffies_to_usecs(10); + } + + tuners->sampling_down_factor = DEF_SAMPLING_DOWN_FACTOR; + tuners->ignore_nice = 0; + tuners->powersave_bias = 0; + tuners->io_is_busy = should_io_be_busy(); + + dbs_data->tuners = tuners; + mutex_init(&dbs_data->mutex); + return 0; +} + define_get_cpu_dbs_routines(od_cpu_dbs_info);
static struct od_ops od_ops = { - .io_busy = should_io_be_busy, .powersave_bias_init_cpu = ondemand_powersave_bias_init_cpu, .powersave_bias_target = powersave_bias_target, .freq_increase = dbs_freq_increase, };
-static struct dbs_data od_dbs_data = { +static struct common_dbs_data od_dbs_cdata = { .governor = GOV_ONDEMAND, .attr_group = &od_attr_group, - .tuners = &od_tuners, .get_cpu_cdbs = get_cpu_cdbs, .get_cpu_dbs_info_s = get_cpu_dbs_info_s, .gov_dbs_timer = od_dbs_timer, .gov_check_cpu = od_check_cpu, .gov_ops = &od_ops, + .init = od_init, };
static int od_cpufreq_governor_dbs(struct cpufreq_policy *policy, unsigned int event) { - return cpufreq_governor_dbs(&od_dbs_data, policy, event); + return cpufreq_governor_dbs(policy, &od_dbs_cdata, event); }
#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND @@ -506,28 +557,6 @@ struct cpufreq_governor cpufreq_gov_ondemand = {
static int __init cpufreq_gov_dbs_init(void) { - u64 idle_time; - int cpu = get_cpu(); - - mutex_init(&od_dbs_data.mutex); - idle_time = get_cpu_idle_time_us(cpu, NULL); - put_cpu(); - if (idle_time != -1ULL) { - /* Idle micro accounting is supported. Use finer thresholds */ - od_tuners.up_threshold = MICRO_FREQUENCY_UP_THRESHOLD; - od_tuners.down_differential = MICRO_FREQUENCY_DOWN_DIFFERENTIAL; - /* - * In nohz/micro accounting case we set the minimum frequency - * not depending on HZ, but fixed (very low). The deferred - * timer might skip some samples if idle/sleeping as needed. - */ - od_dbs_data.min_sampling_rate = MICRO_FREQUENCY_MIN_SAMPLE_RATE; - } else { - /* For correct statistics, we need 10 ticks for each measure */ - od_dbs_data.min_sampling_rate = MIN_SAMPLING_RATE_RATIO * - jiffies_to_usecs(10); - } - return cpufreq_register_governor(&cpufreq_gov_ondemand); }
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 3b822ce..5dae7e6 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -183,11 +183,10 @@ static inline unsigned long cpufreq_scale(unsigned long old, u_int div, u_int mu #define CPUFREQ_GOV_STOP 2 #define CPUFREQ_GOV_LIMITS 3 #define CPUFREQ_GOV_POLICY_INIT 4 -#define CPUFREQ_GOV_POLICY_EXIT 4 +#define CPUFREQ_GOV_POLICY_EXIT 5
struct cpufreq_governor { char name[CPUFREQ_NAME_LEN]; - int initialized; int (*governor) (struct cpufreq_policy *policy, unsigned int event); ssize_t (*show_setspeed) (struct cpufreq_policy *policy, @@ -307,22 +306,6 @@ __ATTR(_name, _perm, show_##_name, NULL) static struct freq_attr _name = \ __ATTR(_name, 0644, show_##_name, store_##_name)
-struct global_attr { - struct attribute attr; - ssize_t (*show)(struct kobject *kobj, - struct attribute *attr, char *buf); - ssize_t (*store)(struct kobject *a, struct attribute *b, - const char *c, size_t count); -}; - -#define define_one_global_ro(_name) \ -static struct global_attr _name = \ -__ATTR(_name, 0444, show_##_name, NULL) - -#define define_one_global_rw(_name) \ -static struct global_attr _name = \ -__ATTR(_name, 0644, show_##_name, store_##_name) - struct cpufreq_policy *cpufreq_cpu_get(unsigned int cpu); void cpufreq_cpu_put(struct cpufreq_policy *data); const char *cpufreq_get_current_driver(void);
Hi,
On 02/04/2013 12:38 PM, Viresh Kumar wrote:
Currently, there can't be multiple instances of a single governor type. If we have a multi-package system, where we have multiple instances of struct cpufreq_policy (one per package), we can't have multiple instances of the same governor, i.e. we can't run a separate instance of the ondemand governor for each package.

The governor's directory in sysfs is created at /sys/devices/system/cpu/cpufreq/governor-name/, which again reflects that there can be only one instance of a governor type in the system.

This is a bottleneck for multi-cluster systems, where we want different packages to use the same governor type but with different tunables.

This patch uses the infrastructure provided by the earlier patch and implements init/exit routines for the ondemand and conservative governors.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq.c              |   4 -
 drivers/cpufreq/cpufreq_conservative.c | 142 +++++++++++++----------
 drivers/cpufreq/cpufreq_governor.c     | 138 +++++++++++++---------
 drivers/cpufreq/cpufreq_governor.h     |  42 ++++---
 drivers/cpufreq/cpufreq_ondemand.c     | 205 +++++++++++++++++++--------------
 include/linux/cpufreq.h                |  19 +--
 6 files changed, 314 insertions(+), 236 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 1ae78d4..7aacfbf 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1551,9 +1551,6 @@ static int __cpufreq_governor(struct cpufreq_policy *policy,
 						policy->cpu, event);
 	ret = policy->governor->governor(policy, event);
 
-	if (!policy->governor->initialized && (event == CPUFREQ_GOV_START))
-		policy->governor->initialized = 1;
-
 	/* we keep one module reference alive for
 			each CPU governed by this CPU */
 	if ((event != CPUFREQ_GOV_START) || ret)
@@ -1577,7 +1574,6 @@ int cpufreq_register_governor(struct cpufreq_governor *governor)
 
 	mutex_lock(&cpufreq_governor_mutex);
 
-	governor->initialized = 0;
 	err = -EBUSY;
 	if (__find_governor(governor->name) == NULL) {
 		err = 0;
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
[...]
+static int cs_init(struct dbs_data *dbs_data)
+{
+	struct cs_dbs_tuners *tuners;
+
+	tuners = kzalloc(sizeof(struct cs_dbs_tuners), GFP_KERNEL);
+	if (!tuners) {
+		pr_err("%s: kzalloc failed\n", __func__);
+		return -ENOMEM;
+	}
+
+	tuners->up_threshold = DEF_FREQUENCY_UP_THRESHOLD;
+	tuners->down_threshold = DEF_FREQUENCY_DOWN_THRESHOLD;
+	tuners->sampling_down_factor = DEF_SAMPLING_DOWN_FACTOR;
+	tuners->ignore_nice = 0;
+	tuners->freq_step = 5;
+
+	dbs_data->tuners = tuners;
dbs_data->tuners is never freed, which means there is a memory leak across CPUFREQ_GOV_POLICY_INIT and CPUFREQ_GOV_POLICY_EXIT events.
The same goes for the ondemand governor.
Regards, Francesco
On 11 February 2013 02:44, Francesco Lavra francescolavra.fl@gmail.com wrote:
dbs_data->tuners is never freed, which means there is a memory leak across CPUFREQ_GOV_POLICY_INIT and CPUFREQ_GOV_POLICY_EXIT events.
The same goes for the ondemand governor.
Thanks for pointing that out. It will be fixed in the next version.
On 11 February 2013 09:46, Viresh Kumar viresh.kumar@linaro.org wrote:
On 11 February 2013 02:44, Francesco Lavra francescolavra.fl@gmail.com wrote:
dbs_data->tuners is never freed, which means there is a memory leak across CPUFREQ_GOV_POLICY_INIT and CPUFREQ_GOV_POLICY_EXIT events.
The same goes for the ondemand governor.
Thanks for pointing that out. It will be fixed in the next version.
Adding the following to the original patch:
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 6b13f9f..5d03577 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -329,6 +329,11 @@ static int cs_init(struct dbs_data *dbs_data)
 	return 0;
 }
 
+static void cs_exit(struct dbs_data *dbs_data)
+{
+	kfree(dbs_data->tuners);
+}
+
 define_get_cpu_dbs_routines(cs_cpu_dbs_info);
 
 static struct notifier_block cs_cpufreq_notifier_block = {
@@ -348,6 +353,7 @@ static struct common_dbs_data cs_dbs_cdata = {
 	.gov_check_cpu = cs_check_cpu,
 	.gov_ops = &cs_ops,
 	.init = cs_init,
+	.exit = cs_exit,
 };
 
 static int cs_cpufreq_governor_dbs(struct cpufreq_policy *policy,
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 925f5b3..7722505 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -255,6 +255,7 @@ int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 					latency * LATENCY_MULTIPLIER));
 		return 0;
 	case CPUFREQ_GOV_POLICY_EXIT:
+		cdata->exit(dbs_data);
 		kfree(dbs_data);
 		policy->governor_data = NULL;
 		return 0;
diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h
index 8bd8df6..6301790 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -139,6 +139,7 @@ struct common_dbs_data {
 	void (*gov_dbs_timer)(struct work_struct *work);
 	void (*gov_check_cpu)(int cpu, unsigned int load);
 	int (*init)(struct dbs_data *dbs_data);
+	void (*exit)(struct dbs_data *dbs_data);
 
 	/* Governor specific ops, see below */
 	void *gov_ops;
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 5eda540..f2539e0 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -525,6 +525,11 @@ static int od_init(struct dbs_data *dbs_data)
 	return 0;
 }
 
+static void od_exit(struct dbs_data *dbs_data)
+{
+	kfree(dbs_data->tuners);
+}
+
 define_get_cpu_dbs_routines(od_cpu_dbs_info);
 
 static struct od_ops od_ops = {
@@ -542,6 +547,7 @@ static struct common_dbs_data od_dbs_cdata = {
 	.gov_check_cpu = od_check_cpu,
 	.gov_ops = &od_ops,
 	.init = od_init,
+	.exit = od_exit,
 };
 
 static int od_cpufreq_governor_dbs(struct cpufreq_policy *policy,
On Monday, February 04, 2013 05:08:50 PM Viresh Kumar wrote:
Currently, there can't be multiple instances of single governor_type. If we have a multi-package system, where we have multiple instances of struct policy (per package), we can't have multiple instances of same governor. i.e. We can't have multiple instances of ondemand governor for multiple packages.
Governors directory in sysfs is created at /sys/devices/system/cpu/cpufreq/ governor-name/. Which again reflects that there can be only one instance of a governor_type in the system.
This is a bottleneck for multicluster system, where we want different packages to use same governor type, but with different tunables.
This patchset is inclined towards fixing this issue.
Viresh Kumar (4): cpufreq: Don't check cpu_online(policy->cpu) cpufreq: stats: Get rid of CPUFREQ_STATDEVICE_ATTR cpufreq: Add per policy governor-init/exit infrastructure cpufreq: governor: Implement per policy instances of governors
Well, [1-2/4] are things I can take for v3.9 no problem. The other two I'd wait for the next cycle to be honest. We already have 30+ cpufreq patches scheduled for v3.9 and some of them quite subtle for that matter.
Thanks, Rafael
On 4 February 2013 17:47, Rafael J. Wysocki rjw@sisk.pl wrote:
Well, [1-2/4] are things I can take for v3.9 no problem. The other two I'd wait for the next cycle to be honest. We already have 30+ cpufreq patches scheduled for v3.9 and some of them quite subtle for that matter.
To be honest, i wanted to get these in 3.9 :). And that's why i hurried on them and got this full series working in a single day of work :)
Anyway, i understand your point and i also believe the last two patches require some testing/review before they get in.
One important point i would like to highlight is: governors directory would be present in cpu/cpu*/cpufreq/ now instead of cpu/cpufreq/.
On Mon, Feb 04, 2013 at 05:54:18PM +0530, Viresh Kumar wrote:
One important point i would like to highlight is: governors directory would be present in cpu/cpu*/cpufreq/ now instead of cpu/cpufreq/.
Uh, hold on, isn't this breaking a bunch of userspace with this move? Also, on all those other systems which don't need per-policy governors, we probably don't need this. So maybe this should be made optional, to be enabled by a config option IMO...
Thanks.
On 4 February 2013 18:02, Borislav Petkov bp@alien8.de wrote:
On Mon, Feb 04, 2013 at 05:54:18PM +0530, Viresh Kumar wrote:
One important point i would like to highlight is: governors directory would be present in cpu/cpu*/cpufreq/ now instead of cpu/cpufreq/.
Uh, hold on, isn't this breaking a bunch of userspace with this move? Also, on all those other systems which don't need per-policy governors, we probably don't need this. So maybe this should be made optional, to be enabled by a config option IMO...
That's why i am highlighting it again and again. :) What i believe is, the place where this directory was present earlier (cpu/cpufreq/) wasn't the right place. Everything else was in cpu/cpu*/cpufreq, then why this in cpu/cpufreq/ ?
I don't know how much of a pain it would be to fix userspace for it, but i know it wouldn't be that small.
I had another idea of doing this only for platforms where we have multiple struct policy alive at the same time. But i didn't want to implement it before discussing this further.
On Mon, Feb 04, 2013 at 06:24:19PM +0530, Viresh Kumar wrote:
That's why i am highlighting it again and again. :)
Ah, see, someone caught up with it :).
What i believe is, the place where this directory was present earlier (cpu/cpufreq/) wasn't the right place. Everything else was in cpu/cpu*/cpufreq, then why this in cpu/cpufreq/ ?
For the simple reason that the "cpu*" stuff is per-cpu - the "cpu/cpufreq" is per system, i.e. one governor for the whole system.
I don't know how much of a pain it would be to fix userspace for it, but i know it wouldn't be that small.
I wouldn't fix userspace but simply not touch it. You can add your per-policy stuff in "cpu/cpu*" as new sysfs nodes and no need to change anything. And, also, as I suggested earlier, you should make it configurable since this code wouldn't make sense on x86, for example, where one system-wide governor should suffice.
I had another idea of doing this only for platforms where we have multiple struct policy alive at the same time. But didn't wanted to implement it before discussing this further.
Simply put it behind a config option like CONFIG_CPU_IDLE_MULTIPLE_DRIVERS, call the whole menu "Multi-power-domain-policy" something and that should be modulary enough.
Thanks.
On 4 February 2013 18:34, Borislav Petkov bp@alien8.de wrote:
On Mon, Feb 04, 2013 at 06:24:19PM +0530, Viresh Kumar wrote:
What i believe is, the place where this directory was present earlier (cpu/cpufreq/) wasn't the right place. Everything else was in cpu/cpu*/cpufreq, then why this in cpu/cpufreq/ ?
For the simple reason that the "cpu*" stuff is per-cpu - the "cpu/cpufreq" is per system, i.e. one governor for the whole system.
That's not completely true. There is a cpufreq directory in cpu/cpu*/ too, where we have per-policy stuff, like policy tunables and stats. And the same is true for the governor too.
I don't know how much of a pain it would be to fix userspace for it, but i know it wouldn't be that small.
I wouldn't fix userspace but simply not touch it. You can add your per-policy stuff in "cpu/cpu*" as new sysfs nodes and no need to change anything.
That was slightly confusing to me :( The whole governor directory is per policy, i have to keep that in cpu/cpu*/cpufreq instead of cpu/cpufreq .
And, also, as I suggested earlier, you should make it configurable since this code wouldn't make sense on x86, for example, where one system-wide governor should suffice.
It's not only for multicluster systems, but any system where multiple cpus have separate clock control and hence multiple policy structures.
I had another idea of doing this only for platforms where we have multiple struct policy alive at the same time. But didn't wanted to implement it before discussing this further.
Simply put it behind a config option like CONFIG_CPU_IDLE_MULTIPLE_DRIVERS, call the whole menu "Multi-power-domain-policy" something and that should be modulary enough.
The problem with this is that it would fail for the single-image solutions everybody is working on. So, with multiple platforms compiled into a single image, this wouldn't work.
On Mon, Feb 04, 2013 at 06:55:25PM +0530, Viresh Kumar wrote:
That's not completely true. There lies cpufreq directory in cpu/cpu*/ too, where we have per policy stuff in cpu/cpu*/, like policy tunables and stats. And the same is true for governor too.
$ tree /sys/devices/system/cpu/cpu0/cpufreq/
/sys/devices/system/cpu/cpu0/cpufreq/
├── affected_cpus
├── bios_limit
├── cpb
├── cpuinfo_cur_freq
├── cpuinfo_max_freq
├── cpuinfo_min_freq
├── cpuinfo_transition_latency
├── related_cpus
├── scaling_available_frequencies
├── scaling_available_governors
├── scaling_cur_freq
├── scaling_driver
├── scaling_governor
├── scaling_max_freq
├── scaling_min_freq
├── scaling_setspeed
└── stats
    ├── time_in_state
    ├── total_trans
    └── trans_table
1 directory, 19 files
$ grep -r . /sys/devices/system/cpu/cpu0/cpufreq/*
/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus:0
/sys/devices/system/cpu/cpu0/cpufreq/bios_limit:4000000
/sys/devices/system/cpu/cpu0/cpufreq/cpb:1
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq:1400000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq:4000000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq:1400000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency:4000
/sys/devices/system/cpu/cpu0/cpufreq/related_cpus:0 1
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies:4000000 3400000 2800000 2100000 1400000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors:powersave userspace conservative ondemand perform
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:1400000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:acpi-cpufreq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:4000000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq:1400000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed:<unsupported>
/sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state:4000000 3089328
/sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state:3400000 47448
/sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state:2800000 67185
/sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state:2100000 92731
/sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state:1400000 11416914
/sys/devices/system/cpu/cpu0/cpufreq/stats/trans_table: From : To
/sys/devices/system/cpu/cpu0/cpufreq/stats/trans_table: : 4000000 3400000 2800000 2100000 140000
/sys/devices/system/cpu/cpu0/cpufreq/stats/trans_table: 4000000: 0 34756 46388 53179 21824
/sys/devices/system/cpu/cpu0/cpufreq/stats/trans_table: 3400000: 12938 0 3755 3555 1450
/sys/devices/system/cpu/cpu0/cpufreq/stats/trans_table: 2800000: 19940 0 0 4547 2565
/sys/devices/system/cpu/cpu0/cpufreq/stats/trans_table: 2100000: 18523 0 0 0 4275
/sys/devices/system/cpu/cpu0/cpufreq/stats/trans_table: 1400000: 301168 0 0 0
/sys/devices/system/cpu/cpu0/cpufreq/stats/total_trans:799918
Show me the policy tunables here.
That was slightly confusing to me :( The whole governor directory is per policy, i have to keep that in cpu/cpu*/cpufreq instead of cpu/cpufreq.
So make a /sys/devices/system/cpu/cpufreq/policies/ and add functionality to assign cpus to policies or whatever the design of this thing will be.
Its not only for multicluster system, but a system where multiple cpus have separate clock control and hence multiple policy structures.
What are those systems? Examples?
Problem with this is it would fail for single image solutions on which everybody is working on. So, with multiple platforms compiled into a single image, this wouldn't work.
Single-image solutions will enable that config option and get built with it - no problem at all.
On 4 February 2013 19:06, Borislav Petkov bp@alien8.de wrote:
On Mon, Feb 04, 2013 at 06:55:25PM +0530, Viresh Kumar wrote:
That's not completely true. There lies cpufreq directory in cpu/cpu*/ too, where we have per policy stuff in cpu/cpu*/, like policy tunables and stats. And the same is true for governor too.
Show me the policy tunables here.
All files which are directly present in cpu/cpu*/cpufreq/ folder. I am not talking about governor tunables but policy tunables. Things like scaling_[min]max_freq are policy tunables.
That was slightly confusing to me :( The whole governor directory is per policy, i have to keep that in cpu/cpu*/cpufreq instead of cpu/cpufreq.
So make a /sys/devices/system/cpu/cpufreq/policies/ and add functionality to assign cpus to policies or whatever the design of this thing will be.
Policies don't have a name associated with them and so cpu/cpufreq/policies doesn't make any sense. Rather one policy is related to multiple cpus and its tunables are linked in all the cpus that belong to it, like scaling_[min]max_freq.
Its not only for multicluster system, but a system where multiple cpus have separate clock control and hence multiple policy structures.
What are those systems? Examples?
I don't have examples of these, but there can be a few. On top of that, it is a must for multicluster systems, as clusters normally have separate clock control.
Problem with this is it would fail for single image solutions on which everybody is working on. So, with multiple platforms compiled into a single image, this wouldn't work.
Single-image solutions will enable that config option and get built with it - no problem at all.
But then we will get governor tunables in cpu/cpu*/cpufreq/ instead of cpu/cpufreq/. Will that not break userspace for other systems?
On Mon, Feb 04, 2013 at 07:28:16PM +0530, Viresh Kumar wrote:
All files which are directly present in cpu/cpu*/cpufreq/ folder. I am not talking about governor tunables but policy tunables. Things like scaling_[min]max_freq are policy tunables.
No, on x86 those are the P-states frequencies. They're defined by the hardware.
Policies don't have a name associated with them and so cpu/cpufreq/policies doesn't make any sense. Rather one policy is related to multiple cpus and its tunables are linked in all the cpus that belong to it, like scaling_[min]max_freq.
Then do the following:
cpu/cpufreq/policies/
|-> policy0
    |-> min_freq
    |-> max_freq
    |-> affected_cpus
    ...
or whatever needs to be a flexible interface for multi-policy cpufreq support.
Remember: once you do those, they're more or less cast in stone so take your time and do the design right, do not hurry those.
Don't have examples of these, but there can be few. Over that it is a must for multicluster systems as clusters normally have separate clock control.
Yeah, nice try. We only support real hardware in the kernel, not what could there be.
But then we will get governors tunables in cpu/cpu*/cpufreq/ instead of cpu/cpufreq/ . Will that not break userspace for other systems?
What's wrong with having both? The cpu/cpufreq/ governor will set the system-wide governor and the cpu/cpu*/cpufreq/ will add the different policies.
On 4 February 2013 19:39, Borislav Petkov bp@alien8.de wrote:
On Mon, Feb 04, 2013 at 07:28:16PM +0530, Viresh Kumar wrote:
All files which are directly present in cpu/cpu*/cpufreq/ folder. I am not talking about governor tunables but policy tunables. Things like scaling_[min]max_freq are policy tunables.
No, on x86 those are the P-states frequencies. They're defined by the hardware.
A question we need to answer: What is "policy"?
For me, "policy" is the hardware policy for a group of cpus, defined by the hardware. We call them P-states. But these are strictly per policy (i.e. per cpu group sharing clock line).
So, if we have two packages with two cpus each, we will have two copies of these P-states and every other file/directory in cpu/cpu*/cpufreq. One common copy would be shared between the cpu/cpu[0-1]/cpufreq directories and the other one between cpu/cpu[2-3]/cpufreq.
Policies don't have a name associated with them and so cpu/cpufreq/policies doesn't make any sense. Rather one policy is related to multiple cpus and its tunables are linked in all the cpus that belong to it, like scaling_[min]max_freq.
Then do the following:
cpu/cpufreq/policies/ |-> policy0 |-> min_freq |-> max_freq |-> affected_cpus ...
We correlate things with cpus rather than policies, and so the current directory structure of cpu/cpu*/cpufreq/*** is the best suited one.
or whatever needs to be a flexible interface for multi-policy cpufreq support.
The multi-policy part is already well implemented; we are talking about per-policy governors here.
Remember: once you do those, they're more or less cast in stone so take your time and do the design right, do not hurry those.
Yes, and that's why cpu/cpu*/cpufreq/ondemand/*** suits the best, with exactly the same logic that went for P-states or cpufreq-stats.
But then we will get governors tunables in cpu/cpu*/cpufreq/ instead of cpu/cpufreq/ . Will that not break userspace for other systems?
What's wrong with having both? The cpu/cpufreq/ governor will set the system-wide governor and the cpu/cpu*/cpufreq/ will add the different policies.
Hmm.. confused.. Consider two systems:
- A dual core system, with cores sharing clocks.
- A dual cluster system (dual core per cluster), with separate clocks per cluster.
Where will you keep governor directories for both of these configurations?
We need to select only one... cpu/cpufreq doesn't suit the second case at all as we need to use ondemand governor for both the clusters but with separate tunables. And so a single cpu/cpufreq/ondemand directory wouldn't solve the issue.
-- viresh
On Mon, Feb 04, 2013 at 07:51:33PM +0530, Viresh Kumar wrote:
We correlate things with cpus rather than policies and so the current directory structure of cpu/cpu*/cpufreq/*** is the best suited ones.
Ok, show me the details of that layout. How is that going to look?
One thing I've come to realize with the current interface is that if you want to change stuff, you need to iterate over all cpus instead of writing to a system-wide node.
And, in this case, if you can and need to change the policy per clock-domain, I wouldn't make it needlessly too granular per-cpu.
That's why I'm advocating the cpu/cpufreq/ path.
Yes, and that's why cpu/cpu*/cpufreq/ondemand/*** suits the best, with exactly the same logic that went for P-states or cpufreq-stats.
See above.
Hmm.. confused.. Consider two systems:
- A dual core system, with cores sharing clocks.
- A dual cluster system (dual core per cluster), with separate clocks
per cluster.
Where will you keep governor directories for both of these configurations?
Easy: as said above, make the policy granularity per clock-domain. On systems which have only one set of P-states - like it is the case with the overwhelming majority of systems running linux now - nothing should change.
We need to select only one... cpu/cpufreq doesn't suit the second case at all as we need to use ondemand governor for both the clusters but with separate tunables. And so a single cpu/cpufreq/ondemand directory wouldn't solve the issue.
Think of it this way: what is the highest granularity you need per clock-domain? If you want to control the policy per clock-domain, then cpu/cpufreq/ is what you want. If you want finer-grained control - and you need to think hard of what use cases are sensible for that finer-grained solution - then you're better off with cpu/cpu*/ layout.
In both cases though, having clear examples of why you've come up with the layout you're advocating would help reviewers a lot. If you simply come and say we need this because there might be systems out there who could use it, then that probably is not going to get you that far.
HTH.
On 4 February 2013 20:35, Borislav Petkov bp@alien8.de wrote:
On Mon, Feb 04, 2013 at 07:51:33PM +0530, Viresh Kumar wrote:
We correlate things with cpus rather than policies and so the current directory structure of cpu/cpu*/cpufreq/*** is the best suited ones.
Ok, show me the details of that layout. How is that going to look?
I don't have board right now to take the snapshot, but it would be like:
$ tree /sys/devices/system/cpu/cpu0/cpufreq/
/sys/devices/system/cpu/cpu0/cpufreq/
├── affected_cpus
├── bios_limit
├── cpb
├── cpuinfo_cur_freq
├── cpuinfo_max_freq
├── cpuinfo_min_freq
├── cpuinfo_transition_latency
├── related_cpus
├── scaling_available_frequencies
├── scaling_available_governors
├── scaling_cur_freq
├── scaling_driver
├── scaling_governor
├── scaling_max_freq
├── scaling_min_freq
├── scaling_setspeed
├── stats
│   ├── time_in_state
│   ├── total_trans
│   └── trans_table
└── ondemand
    ├── sampling_rate
    ├── up_threshold
    └── ignore_nice
etc..
One thing I've come to realize with the current interface is that if you want to change stuff, you need to iterate over all cpus instead of writing to a system-wide node.
Not really. Following is the way by which cpu/cpu*/cpufreq directories are created:
For policy->cpu: ret = kobject_init_and_add(&policy->kobj, &ktype_cpufreq, &dev->kobj, "cpufreq");
This creates cpufreq directory for policy in policy->cpu...
For all other cpus in policy->cpus, we do: ret = sysfs_create_link(&cpu_dev->kobj, &policy->kobj, "cpufreq");
And so whatever gets added in cpu/cpu0/cpufreq directory is reflected in all other policy->cpus.
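Pieced together from the two calls quoted above, the pattern looks roughly like the sketch below. This is illustrative only: the function name, error handling and surrounding context are assumptions, not the exact cpufreq.c code, and it presumes the usual cpufreq.c context (ktype_cpufreq, get_cpu_device()).

/*
 * Rough sketch: one real per-policy "cpufreq" directory under
 * policy->cpu, plus symlinks for the other CPUs in the policy.
 */
static int create_policy_sysfs(struct cpufreq_policy *policy,
			       struct device *dev)
{
	unsigned int j;
	int ret;

	/* Real directory lives only under policy->cpu: cpu/cpuX/cpufreq/ */
	ret = kobject_init_and_add(&policy->kobj, &ktype_cpufreq,
				   &dev->kobj, "cpufreq");
	if (ret)
		return ret;

	/* Every other CPU in the policy only gets a symlink to it, so
	 * anything added under the policy kobject shows up for all of
	 * policy->cpus. */
	for_each_cpu(j, policy->cpus) {
		struct device *cpu_dev;

		if (j == policy->cpu)
			continue;

		cpu_dev = get_cpu_device(j);
		if (!cpu_dev)
			continue;

		ret = sysfs_create_link(&cpu_dev->kobj, &policy->kobj,
					"cpufreq");
		if (ret)
			return ret;
	}

	return 0;
}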
And, in this case, if you can and need to change the policy per clock-domain, I wouldn't make it needlessly too-granulary per-cpu.
That's why I'm advocating the cpu/cpufreq/ path.
It's already like this, i.e. per policy or clock-domain. The other cpus just have a link. And that's why in my code i just add the governor directory in policy->cpu's cpufreq directory and it gets reflected in the other cpus of policy->cpus.
That's why i referred to P-states as policy tunables.
Hmm.. confused.. Consider two systems:
- A dual core system, with cores sharing clocks.
- A dual cluster system (dual core per cluster), with separate clocks
per cluster.
Where will you keep governor directories for both of these configurations?
Easy: as said above, make the policy granularity per clock-domain. On systems which have only one set of P-states - like it is the case with the overwhelming majority of systems running linux now - nothing should change.
Currently it's not per policy; only a single instance of any governor is supported. And it is present in cpu/cpufreq. That's why i said earlier that it isn't the right place for the governor's directory. It is very much related to a policy or clock-domain.
We need to select only one... cpu/cpufreq doesn't suit the second case at all as we need to use ondemand governor for both the clusters but with separate tunables. And so a single cpu/cpufreq/ondemand directory wouldn't solve the issue.
Think of it this way: what is the highest granularity you need per clock-domain? If you want to control the policy per clock-domain, then cpu/cpufreq/ is what you want. If you want finer-grained control - and you need to think hard of what use cases are sensible for that finer-grained solution - then you're better off with cpu/cpu*/ layout.
I want to control it per clock-domain, but can't get that in cpu/cpufreq/. Policies don't have numbers assigned to them.
In both cases though, having clear examples of why you've come up with the layout you're advocating would help reviewers a lot. If you simply come and say we need this because there might be systems out there who could use it, then that probably is not going to get you that far.
So, i am working on ARM's big.LITTLE system where we have two clusters. One of A15s and other of A7s. Because of their different power ratings or performance figures, we need to have separate set of ondemand tunables for them. And hence this patch. Though this patch is required for any multi-cluster system.
-- viresh
On Mon, Feb 04, 2013 at 09:07:11PM +0530, Viresh Kumar wrote:
I don't have board right now to take the snapshot, but it would be like:
$ tree /sys/devices/system/cpu/cpu0/cpufreq/ /sys/devices/system/cpu/cpu0/cpufreq/ ├── affected_cpus ├── bios_limit ├── cpb ├── cpuinfo_cur_freq ├── cpuinfo_max_freq ├── cpuinfo_min_freq ├── cpuinfo_transition_latency ├── related_cpus ├── scaling_available_frequencies ├── scaling_available_governors ├── scaling_cur_freq ├── scaling_driver ├── scaling_governor ├── scaling_max_freq ├── scaling_min_freq ├── scaling_setspeed └── stats ├── time_in_state ├── total_trans └── trans_table └── ondemand ├── sampling_rate ├── up_threshold └── ignore_nice
So this is adding the current governor as a per-cpu thing.
One thing I've come to realize with the current interface is that if you want to change stuff, you need to iterate over all cpus instead of writing to a system-wide node.
Not really. Following is the way by which cpu/cpu*/cpufreq directories are created:
That's not what I meant - I meant from userspace:
for i in $(grep processor /proc/cpuinfo | awk '{ print $3 }'); do
	echo "performance" > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor
done
Instead of
echo "performance" > /sys/devices/system/cpu/cpufreq/scaling_governor
which is hypothetical but sets it for the whole system without fuss.
[ … ]
I want to control it over clock-domain, but can't get that in cpu/cpufreq/. Policies don't have numbers assigned to them.
So, give them names.
So, i am working on ARM's big.LITTLE system where we have two clusters. One of A15s and other of A7s. Because of their different power ratings or performance figures, we need to have separate set of ondemand tunables for them. And hence this patch. Though this patch is required for any multi-cluster system.
So you want this (values after "="):
cpu/cpufreq/
|-> policy0
    |-> name = A15
    |-> min_freq = ...
    |-> max_freq = ...
    |-> affected_cpus = 0,1,2,...
    |-> ondemand
        |-> sampling_rate
        |-> up_threshold
        |-> ignore_nice
        ...
|-> policy1
    |-> name = A7
    |-> min_freq = ...
    |-> max_freq = ...
    |-> affected_cpus = n,n+1,n+2,...
    |-> performance
        |-> sampling_rate
        |-> up_threshold
        |-> ignore_nice
        ...
Other arches create other policies and that's it. If you need another policy added to the set, you simply add 'policyN++' and that's it.
I think this is cleaner but whatever - I don't care that much. My only strong concern is that this thing should be a Kconfig option and optional for arches where it doesn't apply.
Thanks.
On Mon, Feb 4, 2013 at 10:20 PM, Borislav Petkov bp@alien8.de wrote:
On Mon, Feb 04, 2013 at 09:07:11PM +0530, Viresh Kumar wrote:
└── ondemand ├── sampling_rate ├── up_threshold └── ignore_nice
So this is adding the current governor as a per-cpu thing.
It's per policy, but yes, it is replicated into all cpus as all policy->cpus share the same directory.
One thing I've come to realize with the current interface is that if you want to change stuff, you need to iterate over all cpus instead of writing to a system-wide node.
Not really. Following is the way by which cpu/cpu*/cpufreq directories are created:
That's not what I meant - I meant from userspace:
for $i in $(grep processor /proc/cpuinfo | awk '{ print $3 }'); do echo "performance" > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done
Instead of
echo "performance" > /sys/devices/system/cpu/cpufreq/scaling_governor
which is hypothetical but sets it for the whole system without fuss.
We actually need to do this only for policy->cpu, but yes the user might not be well aware of policy cpu-groups and may do what you wrote.
But that is true even today, when we need to change any policy tunables or P-states.
I want to control it over clock-domain, but can't get that in cpu/cpufreq/. Policies don't have numbers assigned to them.
So, give them names.
IMHO, names don't suit policies.
So, i am working on ARM's big.LITTLE system where we have two clusters. One of A15s and other of A7s. Because of their different power ratings or performance figures, we need to have separate set of ondemand tunables for them. And hence this patch. Though this patch is required for any multi-cluster system.
So you want this (values after "="):
cpu/cpufreq/ |-> policy0 |-> name = A15 |-> min_freq = ... |-> max_freq = ... |-> affected_cpus = 0,1,2,... |-> ondemand |-> sampling_rate |-> up_threshold |-> ignore_nice ... |-> policy1 |-> name = A7 |-> min_freq = ... |-> max_freq = ... |-> affected_cpus = n,n+1,n+2,... |-> performance |-> sampling_rate |-> up_threshold |-> ignore_nice ...
We may also have two clusters of A7s, in a non-big.LITTLE arch. Then these names become very confusing.
Other arches create other policies and that's it. If you need another policy added to the set, you simply add 'policyN++' and that's it.
For me, adding per-arch policy names increases complexity without much gain. We already have existing infrastructure where this info is present under the cpus, and that looks good to me.
I think this is cleaner but whatever - I don't care that much. My only strong concern is that this thing should be a Kconfig option and optional for arches where it doesn't apply.
Your concern is: we don't want to force userspace fixes for existing platforms where we have just a single cluster, and so a single struct policy, in the system.
So, a good solution instead of the Kconfig stuff is to add another field to the policy structure, filled by the platform-specific cpufreq driver's init() routine. Based on that we can decide whether to put the governor directory in cpu/cpufreq/<gov-name> or cpu/cpu*/cpufreq/<gov-name>.
But i am not sure if keeping both kinds of directory locations for separate platforms is a good idea. Userspace would need to adapt to these changes as well for multi-cluster platforms.
@Rafael: do you know of any other multi-cluster/multi-clock-domain platform, apart from big.LITTLE, where we have multiple struct policy instances?
On Tue, Feb 05, 2013 at 12:50:31PM +0530, Viresh Kumar wrote:
I think this is cleaner but whatever - I don't care that much. My only strong concern is that this thing should be a Kconfig option and optional for arches where it doesn't apply.
Your concern is: we don't want to fix userspace for existing platforms where we have just a single cluster and so struct policy in the system.
No, as I said so many times already and you're unwilling to understand it: multiple policies support in cpufreq should be optional and selectable in Kconfig so that systems which don't need that, don't have to see or use it. It is yet another feature which doesn't apply universally so we make such features optional. Like the rest of the gazillion things in the kernel already.
The existing sysfs layout cannot be changed because you're breaking userspace and we don't do that. It is that simple.
Concerning adding new sysfs entries, I told you to make it as easy as possible and as sensible as possible, dictated by the use cases. If you can't come up with some, then talk to the people who are going to use your design and ask them what makes sense the most.
*Then* write the code.
On 5 February 2013 14:45, Borislav Petkov bp@alien8.de wrote:
On Tue, Feb 05, 2013 at 12:50:31PM +0530, Viresh Kumar wrote:
I think this is cleaner but whatever - I don't care that much. My only strong concern is that this thing should be a Kconfig option and optional for arches where it doesn't apply.
Your concern is: we don't want to fix userspace for existing platforms where we have just a single cluster and so struct policy in the system.
No, as I said so many times already and you're unwilling to understand it:
I am willing to, but not able to :)
multiple policies support in cpufreq should be optional and selectable in Kconfig so that systems which don't need that, don't have to see or use it. It is yet another feature which doesn't apply universally so we make such features optional. Like the rest of the gazillion things in the kernel already.
I understand what Kconfig options are for, but i am not able to understand what the benefit of this option is here. For example: for single-image solutions we need to keep it enabled. And so, we would need some sort of logic in the cpufreq core & platform driver to decide where to create the governor's directory.
The code without Kconfig option would be as simple as:
platform_driver:

init(struct cpufreq_policy *policy)
{
	..
	policy->have_multiple_policies = true;
	..
}

cpufreq-core:

add_dev()
{
	if (policy->have_multiple_policies)
		create-folder-in-cpu/cpu*/cpufreq;
	else
		create-folder-in-cpu/cpufreq;
}
And so, platforms like Krait or big.LITTLE can set it to true from their cpufreq-drivers. And this wouldn't break any of the current platforms.
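Rendered as real C, the check sketched above could look roughly like this; the helper name here is made up for illustration, and the concrete patch Viresh posts later in this thread implements the same idea as get_governor_parent_kobj():

/*
 * Illustrative only: decide where a governor's tunables directory
 * should live, based on a flag the platform driver sets in ->init().
 */
static struct kobject *gov_sysfs_parent(struct cpufreq_policy *policy)
{
	/* Multi-clock-domain platforms opt in to per-policy directories,
	 * i.e. cpu/cpu<num>/cpufreq/<gov-name>/ */
	if (policy->have_multiple_policies)
		return &policy->kobj;

	/* Everyone else keeps the existing system-wide location,
	 * i.e. cpu/cpufreq/<gov-name>/ */
	return cpufreq_global_kobject;
}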
The existing sysfs layout cannot be changed because you're breaking userspace and we don't do that. It is that simple.
That's fine. I understood it already. :)
The problem i see is:
- both governor tunables, cpufreq-stats & policy tunables (P-states) have the same requirement. They are all per policy or clock-domain, instead of per cpu.
- I want to keep all of these at the same place, as they should be present in the same hierarchy.
- If we move everything to cpu/cpufreq/policy-names/ then also we would break existing userspace stuff for stats and P-states.
- If we move everything to cpu/cpu*/cpufreq/ then also we would break existing userspace stuff for governors.
Concerning adding new sysfs entries, I told you to make it as easy as possible and as sensible as possible, dictated by the use cases. If you can't come up with some, then talk to the people who are going to use your design and ask them what makes sense the most.
For me cpu/cpu*/ is the most sensible, as it is a very easy/convenient interface for users. I am the first one who is going to use it :)
@Rafael: What's your view on this discussion we are having? We probably need a few more "minds" to jump in, as we are not moving towards a conclusion. :)
-- viresh
On Tue, Feb 05, 2013 at 03:17:21PM +0530, Viresh Kumar wrote:
multiple policies support in cpufreq should be optional and selectable in Kconfig so that systems which don't need that, don't have to see or use it. It is yet another feature which doesn't apply universally so we make such features optional. Like the rest of the gazillion things in the kernel already.
I understand what Kconfig options are for, but i am not able to understand what's the benefit of this option here.
Are you kidding me? You're simply not reading what I'm saying to you: "... should be optional and selectable in Kconfig so that systems which don't need that, don't have to see or use it." Because on those systems it doesn't apply.
How about we add an x86-specific extension which is a big wad of code and is needlessly run on ARM just because it is easier?
That's why we do config options, so that code which doesn't apply on a specific system, doesn't run on it.
Goddammit, how hard is to understand that??!
For example: for single image solutions we need to keep it enabled.
So keep it enabled!
And so, would need some sort of logic in cpufreq core & platform driver to decide where to create the governors directory.
The code without Kconfig option would be as simple as:
platform_driver: init(struct cpufreq_policy *policy) { .. policy->have_multiple_policies = true; .. }
cpufreq-core:
add_dev() { if (policy->have_multiple_policies) create-folder-in-cpu/cpu*/cpufreq; else create-folder-in-cpu/cpufreq;
Yes, this is how it could be done.
And this is what I mean by making it optional:
You go and abstract away the "create-folder-in-cpu/cpu*/cpufreq" functionality. If this is a function called add_additional_sysfs_entries(), for example, it should do nothing when CONFIG_CPUFREQ_MULTIPLE_POLICIES is disabled. Otherwise, it will do your dance:
#ifdef CONFIG_CPUFREQ_MULTIPLE_POLICIES
static int add_additional_sysfs_entries(...)
{
	do all stuff required for multiple policies
}
#else /* CONFIG_CPUFREQ_MULTIPLE_POLICIES */
static int add_additional_sysfs_entries(...)
{
	return 0;
}
#endif /* CONFIG_CPUFREQ_MULTIPLE_POLICIES */
and all the rest of the stuff which is needed for multiple system policies, should be abstracted that way, more functions added, whatever. If it is starting to become more, you can create your own compilation unit. And so on and so on... the kernel is full of examples how to do stuff like that.
And so, platforms like Krait or big.LITTLE can set it to true from their cpufreq-drivers. And this wouldn't break any of the current platforms.
The existing sysfs layout cannot be changed because you're breaking userspace and we don't do that. It is that simple.
That's fine. I understood it already. :)
Not really, you obviously didn't.
The problem i see is:
- both governor tunables, cpufreq-stats & policy tunables (P-states) have the
same requirement. They are all per policy or clock-domain, instead of per cpu.
- I want to keep all of these at the same place, as they should be
present in the same hierarchy.
- If we move everything to cpu/cpufreq/policy-names/ then also we would break
existing userspace stuff for stats and P-states.
- If we move everything to cpu/cpu*/cpufreq/ then also we would break
existing userspace stuff for governors.
No, you're not allowed to change existing sysfs layout. FULLSTOP.
Simply add the new stuff to cpu/cpu*/cpufreq/ with code which is enabled when CONFIG_CPUFREQ_MULTIPLE_POLICIES is set. If CONFIG_CPUFREQ_MULTIPLE_POLICIES is not enabled, nothing changes.
On 5 February 2013 15:57, Borislav Petkov bp@alien8.de wrote:
Are you kidding me? You're simply not reading what I'm saying to you: "... should be optional and selectable in Kconfig so that systems which don't need that, don't have to see or use it." Because on those systems it doesn't apply.
How about we add an x86-specific extension which is a big wad of code and is needlessly run on ARM just because it is easier?
There isn't a lot of code that we have to keep inside the macro you suggest. It's just an if/else (with a single-line block), which would give the parent kobject. Nothing else.
I didn't want to create a macro for just that. For me an if/else is not that much code.
Anyway, if nobody else comes to my side i can create that macro for you. But personally i would prefer code without such macros.
-- viresh
On Tue, Feb 05, 2013 at 04:13:23PM +0530, Viresh Kumar wrote:
There isn't lot of code that we have to keep inside the macro you suggest. Its just an if else (with single line block), which would give the parent kobject. Nothing else.
I didn't wanted to create a macro for just that. For me an if/else is not that big code.
Yeah, I imagine for you it isn't, no.
Anyway, if nobody else comes on my side i can create that macro for you. But, personally i would prefer code without such macros.
Here's an even cleaner way:
platform_driver:

init(struct cpufreq_policy *policy)
{
	...
	add_additional_sysfs_entries(policy);
	...
}

...

static void add_additional_sysfs_entries(struct cpufreq_policy *policy)
{
#ifdef CONFIG_CPUFREQ_MULTIPLE_POLICIES
	create-folder-in-cpu/cpu*/cpufreq;
	...
#endif
}

and the platform driver will have in its Kconfig section:

config CPUFREQ_PLATFORM_DRIVER_X
	...
	select CPUFREQ_MULTIPLE_POLICIES
You don't need the policy->have_multiple_policies member even.
On 5 February 2013 16:34, Borislav Petkov bp@alien8.de wrote:
Here's an even cleaner way:
platform_driver: init(struct cpufreq_policy *policy) { ...
add_additional_sysfs_entries(policy); ...
}
...
static void add_additional_sysfs_entries(struct cpufreq_policy *policy) { #ifdef CONFIG_CPUFREQ_MULTIPLE_POLICIES create-folder-in-cpu/cpu*/cpufreq; ... #endif }
and the platform driver will have in its Kconfig section:
config CPUFREQ_PLATFORM_DRIVER_X ... select CPUFREQ_MULTIPLE_POLICIES
You don't need the policy->have_multiple_policies member even.
The tricky part is the name of this routine: add_additional_sysfs_entries().
I am not proposing an additional set of directories inside cpu/cpu*/cpufreq/, but either cpu/cpufreq/ or cpu/cpu*/cpufreq/ for the governor.
And so, keeping that additional variable looks like a better solution. Let me get a patch with just this change and see how it looks.
On Tue, Feb 05, 2013 at 04:42:23PM +0530, Viresh Kumar wrote:
Tricky part is the name of this routine: add_additional_sysfs_entries().
Now you're just being silly - this is just an example how to do it. If you want me to do it for ya, you need to send me your monthly salary.
And so, keeping that additional variable looks a better solution.
Yeah, I don't think you understand me at all. :(
On 5 February 2013 16:49, Borislav Petkov bp@alien8.de wrote:
On Tue, Feb 05, 2013 at 04:42:23PM +0530, Viresh Kumar wrote:
Tricky part is the name of this routine: add_additional_sysfs_entries().
Now you're just being silly - this is just an example how to do it. If you want me to do it for ya, you need to send me your monthly salary.
Haha... I don't want you to do it. I don't want such a routine to be exposed to cpufreq drivers, as this belongs to the core code.
Just some kind of indication from the platform driver is required about how/where it wants its governor directory to be present. And the variable suits this better.
Lets discuss it more on the next patch i send :)
-- viresh
On Tue, Feb 05, 2013 at 04:56:03PM +0530, Viresh Kumar wrote:
Just some kind of indication from platform driver is required about how/where it wants its governor directory to be present.
The indication is this:
config CPUFREQ_PLATFORM_DRIVER_X
	...
	select CPUFREQ_MULTIPLE_POLICIES
You really need to slow down and really look at what I'm proposing.
And the variable suits more here.
And this variable means nothing on other systems so why add it in the first place?
On 5 February 2013 17:02, Borislav Petkov bp@alien8.de wrote:
On Tue, Feb 05, 2013 at 04:56:03PM +0530, Viresh Kumar wrote:
Just some kind of indication from platform driver is required about how/where it wants its governor directory to be present.
The indication is this:
config CPUFREQ_PLATFORM_DRIVER_X ... select CPUFREQ_MULTIPLE_POLICIES
You really need to slow down and really look at what I'm proposing.
This indication isn't enough. With a single-image solution, we need to identify which system needs support for multiple policies, and i still feel we need that variable-type indication :)
On Tue, Feb 05, 2013 at 05:54:57PM +0530, Viresh Kumar wrote:
This indication isn't enough. On a single image solution, we need to identify the system which needs support for multiple policies and i still feel we need that variable type indication :)
If the image is going to run also on systems which support only a single policy, then I guess you can make it a bool, stuff it in struct cpufreq_policy and ifdef around it.
On 5 February 2013 18:52, Borislav Petkov bp@alien8.de wrote:
On Tue, Feb 05, 2013 at 05:54:57PM +0530, Viresh Kumar wrote:
This indication isn't enough. On a single image solution, we need to identify the system which needs support for multiple policies and i still feel we need that variable type indication :)
If the image is going to run also on systems which support only a single policy, then I guess you can make it a bool, stuff it in struct cpufreq_policy and ifdef around it.
That's what i proposed initially.
On 4 February 2013 19:06, Borislav Petkov bp@alien8.de wrote:
On Mon, Feb 04, 2013 at 06:55:25PM +0530, Viresh Kumar wrote:
Its not only for multicluster system, but a system where multiple cpus have separate clock control and hence multiple policy structures.
What are those systems? Examples?
Qualcomm's ARM based "krait". Currently shipping in millions of Android phones.
http://en.wikipedia.org/wiki/Krait_(CPU)
Thanks Charles for pointing it out, I knew there is one :)
-- viresh
Actually shooting myself in the foot here: Krait is not such a great example, because although you can make use of the difference between frequencies, you are less likely to use different tunables (not inconceivable, but unlikely). The best example systems are multi-cluster and heterogeneous systems, like the recently announced Samsung Exynos 5 Octa http://en.wikipedia.org/wiki/Exynos_(system_on_chip). We will see more systems like this appearing, sporting low power cores combined with high performance ones, all running at the same time. I appreciate this is all very new, but more will come, and the requirement to have different tunables per cluster is very real. At ARM, on our own multi-cluster test chip, using an experimental version of this approach, we have seen good improvements in power consumption without compromising performance.
(Apologies ahead for any bit my mail server appends, not much I can do about it)
-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
On Tue, Feb 05, 2013 at 11:29:04AM +0000, Charles Garcia-Tobin wrote:
Actually shooting myself in the foot here, Krait is not such a great example because although you can use difference between frequencies you are less likely to use different tunables (not inconceivable but unlikely). The best examples systems are multi cluster and hereterogeneous systems, like the recently announced Samsung Exynos 5 octa http://en.wikipedia.org/wiki/Exynos_(system_on_chip). We will see more systems like this appearing, sporting low power cores combined with high performance ones, all running at the same time. I appreciate this is all very new, but more will come, and the requirement to have different tunables per cluster is very real. In ARM on our own multi cluster test chip, using an experimental version of this approach, we have seen good improvements in power consumption without compromising performance.
Ok, thanks for giving this insight, this is useful.
Question: do you need the granularity of that control to be per cpu (with that I mean what linux understands under "cpu," i.e. logical or physical core) or does one governor suffice per a set of cores, or as you call it, a cluster?
(Apologies ahead for any bit my mail server appends, not much I can do about it)
Yeah, my condolences :-)
-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Leaving it in, in case you haven't seen how it looks like :-)
On Tue, Feb 05, 2013 at 11:29:04AM +0000, Charles Garcia-Tobin wrote:
Actually shooting myself in the foot here, Krait is not such a great example because although you can use difference between frequencies you are less likely to use different tunables (not inconceivable but unlikely). The best examples systems are multi cluster and hereterogeneous systems, like the recently announced Samsung Exynos 5 octa http://en.wikipedia.org/wiki/Exynos_(system_on_chip). We will see more systems like this appearing, sporting low power cores combined with high performance ones, all running at the same time. I appreciate this is all very new, but more will come, and the requirement to have different tunables per cluster is very real. In ARM on our own multi cluster test chip, using an experimental version of this approach, we have seen good improvements in power consumption without compromising performance.
Ok, thanks for giving this insight, this is useful.
Question: do you need the granularity of that control to be per cpu (with that I mean what linux understands under "cpu," i.e. logical or physical core) or does one governor suffice per a set of cores, or as you call it, a cluster?
The latter. Whatever you'd like to call it, but essentially a set of cpus, as linux understands them, that are logically related by the fact that you'd like to be able to use the same tuning policy and same governor across all of them.
Cheers
Charles
(apologies again for annoying addendums to follow)
-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
On Tue, Feb 05, 2013 at 06:38:07PM +0000, Charles Garcia-Tobin wrote:
Later. Whatever you'd like to call it, but essentially a set of cpus, as linux understands them, that are logically related by the fact that you'd like to be able to use the same tuning policy and same governor across all of them.
Right, policy will be applied to the whole set, yes, but can you imagine that per-core settings could also make sense at some point?
Thanks.
On 4 February 2013 17:08, Viresh Kumar viresh.kumar@linaro.org wrote:
Currently, there can't be multiple instances of single governor_type. If we have a multi-package system, where we have multiple instances of struct policy (per package), we can't have multiple instances of same governor. i.e. We can't have multiple instances of ondemand governor for multiple packages.
Governors directory in sysfs is created at /sys/devices/system/cpu/cpufreq/ governor-name/. Which again reflects that there can be only one instance of a governor_type in the system.
This is a bottleneck for multicluster system, where we want different packages to use same governor type, but with different tunables.
This patchset is inclined towards fixing this issue.
After a long & hot discussion over the changes made by this patchset, the following patches came out:
--------------x------------------x-------------------
commit 15b5548c9ccfb8088270f7574710d9d67edfe33b
Author: Viresh Kumar viresh.kumar@linaro.org
Date:   Tue Feb 5 21:29:05 2013 +0530

cpufreq: Make governors directory sysfs location based on have_multiple_policies

Until now, the directory for governor tunables was created in cpu/cpufreq/<gov-name>. With the introduction of the following patch: "cpufreq: governor: Implement per policy instances of governors"
this directory would be created in cpu/cpu<num>/cpufreq/<gov-name>. This might break userspace on existing platforms. Let's do this change only for platforms which need support for multiple policies and thus the above-mentioned patch.
From now on, such platforms are required to do the following from their init() routines:

policy->have_multiple_policies = true;

Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
---
 drivers/cpufreq/cpufreq_governor.c |  2 +-
 include/linux/cpufreq.h            | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 545ae24..41ee86f 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -300,7 +300,7 @@ int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 				dbs_data->cdata->gov_dbs_timer);
 	}
 
-	rc = sysfs_create_group(&policy->kobj,
+	rc = sysfs_create_group(get_governor_parent_kobj(policy),
 			dbs_data->cdata->attr_group);
 	if (rc) {
 		mutex_unlock(&dbs_data->mutex);
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 5dae7e6..6e1abd2 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -107,6 +107,11 @@ struct cpufreq_policy {
 	unsigned int		policy; /* see above */
 	struct cpufreq_governor	*governor; /* see below */
 	void			*governor_data;
+	/* This should be set by init() of platforms having multiple
+	 * clock-domains, i.e. supporting multiple policies. With this sysfs
+	 * directories of governor would be created in cpu/cpu<num>/cpufreq/
+	 * directory */
+	bool			have_multiple_policies;
 
 	struct work_struct	update; /* if update_policy() needs to be
 					 * called, but you're in IRQ context */
@@ -134,6 +139,15 @@ static inline bool policy_is_shared(struct cpufreq_policy *policy)
 	return cpumask_weight(policy->cpus) > 1;
 }
 
+static inline struct kobject *
+get_governor_parent_kobj(struct cpufreq_policy *policy)
+{
+	if (policy->have_multiple_policies)
+		return &policy->kobj;
+	else
+		return cpufreq_global_kobject;
+}
+
 /******************** cpufreq transition notifiers *******************/
 
 #define CPUFREQ_PRECHANGE	(0)
--------------x------------------x-------------------
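As a usage sketch, a platform cpufreq driver that wants per-policy governor directories would only need to set the new flag from its ->init() callback. The driver name below is hypothetical, and everything except the flag is elided:

/* Hypothetical platform driver init(): only the new flag is shown. */
static int foo_cpufreq_init(struct cpufreq_policy *policy)
{
	/* ... usual init: frequency table, policy->cpus, latency ... */

	/*
	 * Opt in to per-policy governor tunables, so ondemand/conservative
	 * directories are created under cpu/cpu<num>/cpufreq/ for this
	 * policy instead of the global cpu/cpufreq/ location.
	 */
	policy->have_multiple_policies = true;

	return 0;
}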
Plus the following patch, though i am still not in favor of it. @Rafael: Please share your opinion on this one too. :)
--------------x------------------x-------------------

commit 1c7e9871fce7388136eda1c86ca75a520e4d3b9d
Author: Viresh Kumar viresh.kumar@linaro.org
Date:   Tue Feb 5 21:41:40 2013 +0530

cpufreq: Add Kconfig option to enable/disable have_multiple_policies

have_multiple_policies is required by platforms having multiple clock-domains for cpus, i.e. supporting multiple policies for cpus. This patch adds a Kconfig option for enabling execution of this code.

Requested-by: Borislav Petkov bp@alien8.de
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
---
 drivers/cpufreq/Kconfig | 3 +++
 include/linux/cpufreq.h | 4 ++++
 2 files changed, 7 insertions(+)
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index cbcb21e..e6e6939 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -23,6 +23,9 @@ config CPU_FREQ_TABLE
 config CPU_FREQ_GOV_COMMON
 	bool
 
+config CPU_FREQ_HAVE_MULTIPLE_POLICIES
+	bool
+
 config CPU_FREQ_STAT
 	tristate "CPU frequency translation statistics"
 	select CPU_FREQ_TABLE
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 6e1abd2..a092fcb 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -107,11 +107,13 @@ struct cpufreq_policy {
 	unsigned int		policy; /* see above */
 	struct cpufreq_governor	*governor; /* see below */
 	void			*governor_data;
+#ifdef CONFIG_CPU_FREQ_HAVE_MULTIPLE_POLICIES
 	/* This should be set by init() of platforms having multiple
 	 * clock-domains, i.e. supporting multiple policies. With this sysfs
 	 * directories of governor would be created in cpu/cpu<num>/cpufreq/
 	 * directory */
 	bool			have_multiple_policies;
+#endif
 
 	struct work_struct	update; /* if update_policy() needs to be
 					 * called, but you're in IRQ context */
@@ -142,9 +144,11 @@ static inline bool policy_is_shared(struct cpufreq_policy *policy)
 static inline struct kobject *
 get_governor_parent_kobj(struct cpufreq_policy *policy)
 {
+#ifdef CONFIG_CPU_FREQ_HAVE_MULTIPLE_POLICIES
 	if (policy->have_multiple_policies)
 		return &policy->kobj;
 	else
+#endif
 		return cpufreq_global_kobject;
 }
--------------x------------------x-------------------
On 5 February 2013 21:51, Viresh Kumar viresh.kumar@linaro.org wrote:
commit 15b5548c9ccfb8088270f7574710d9d67edfe33b Author: Viresh Kumar viresh.kumar@linaro.org Date: Tue Feb 5 21:29:05 2013 +0530
cpufreq: Make governors directory sysfs location based on have_multiple_policies
Until now, the directory for a governor's tunables was created in cpu/cpufreq/<gov-name>. With the introduction of the following patch: "cpufreq: governor: Implement per policy instances of governors", this directory would instead be created in cpu/cpu<num>/cpufreq/<gov-name>. This might break userspace on existing platforms. Let's make this change only for platforms which need support for multiple policies, and thus the above-mentioned patch.
From now on, such platforms would be required to do the following from their init() routines:

policy->have_multiple_policies = true;

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
 drivers/cpufreq/cpufreq_governor.c |  2 +-
 include/linux/cpufreq.h            | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)
Hi Rafael,
Because this patch was quite big (317 insertions(+), 238 deletions(-)), I was planning a detailed self-review to catch any mistakes, and luckily I found one for the above patch :)
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 41ee86f..fe037c0 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -342,7 +342,8 @@ int cpufreq_governor_dbs(struct cpufreq_policy *policy,
 		mutex_lock(&dbs_data->mutex);
 		mutex_destroy(&cpu_cdbs->timer_mutex);

-		sysfs_remove_group(&policy->kobj, dbs_data->cdata->attr_group);
+		sysfs_remove_group(get_governor_parent_kobj(policy),
+				dbs_data->cdata->attr_group);

 		if (dbs_data->cdata->governor == GOV_CONSERVATIVE)
 			cpufreq_unregister_notifier(cs_ops->notifier_block,
 					CPUFREQ_TRANSITION_NOTIFIER);
I have pushed the complete patchset here:
http://git.linaro.org/gitweb?p=people/vireshk/linux.git%3Ba=shortlog%3Bh=ref...
On Wed, Feb 6, 2013 at 3:28 PM, Viresh Kumar viresh.kumar@linaro.org wrote:
I have pushed the complete patchset here:
http://git.linaro.org/gitweb?p=people/vireshk/linux.git%3Ba=shortlog%3Bh=ref...
Viresh, perhaps you should ask Stephen Rothwell to pull in your tree to get some more testing before Rafael pulls it in for 3.10?
On 6 February 2013 15:38, Amit Kucheria amit.kucheria@linaro.org wrote:
Viresh, perhaps you should ask Stephen Rothwell to pull in your tree to get some more testing before Rafael pulls it in for 3.10?
It has been made clear by Rafael that these patches won't make it into 3.9 (though I wanted them to :) ), so once the merge window is over, Rafael might pull them in and they would then reach Stephen's linux-next too...
I am not sure if sending a cpufreq pull request directly to Stephen is preferred. @Rafael: ??
On Wednesday, February 06, 2013 03:45:58 PM Viresh Kumar wrote:
It has been made clear by Rafael that these patches won't make it into 3.9 (though I wanted them to :) ), so once the merge window is over, Rafael might pull them in and they would then reach Stephen's linux-next too...
I am not sure if sending a cpufreq pull request directly to Stephen is preferred. @Rafael: ??
You may do that, if you want, but that's slightly confusing.
Also, the policy is that material which is not going to be included in v3.9 shouldn't be in linux-next before v3.9-rc1.
Moreover, for build testing it is sufficient to put it into a branch somewhere at git.kernel.org (as you have already noticed :-)).
Thanks,
Rafael