Hi Rafael,
Here is V2, with the patches updated as you guys suggested.
These are pushed here: git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git cpufreq/governor-kobject
If possible, the first four patches are for 4.5; the others you can keep for 4.6.
V1->V2:
- Improved changelogs, thanks Rafael.
- Added new dbs_data->mutex to avoid concurrent updates to tunables.
- Moved kobj_type to common_dbs_data.
- Updated macros to static inline routines.
- s/show/governor_show
- s/store/governor_store
- Improved comments.
@Juri: More testing requested :)
Viresh Kumar (7):
  cpufreq: governor: Treat min_sampling_rate as a governor specific tunable
  cpufreq: governor: New sysfs show/store callbacks for governor tunables
  cpufreq: governor: Drop unused macros for creating governor tunable attributes
  Revert "cpufreq: Drop rwsem lock around CPUFREQ_GOV_POLICY_EXIT"
  cpufreq: Merge cpufreq_offline_prepare/finish routines
  cpufreq: Call __cpufreq_governor() with policy->rwsem held
  cpufreq: Remove cpufreq_governor_lock
 drivers/cpufreq/cpufreq.c              |  93 +++++++++++++---------------
 drivers/cpufreq/cpufreq_conservative.c |  79 +++++++++----------------
 drivers/cpufreq/cpufreq_governor.c     | 105 +++++++++++++++++++++++++--------
 drivers/cpufreq/cpufreq_governor.h     | 104 ++++++++------------------------
 drivers/cpufreq/cpufreq_ondemand.c     |  79 +++++++++----------------
 include/linux/cpufreq.h                |   4 --
 6 files changed, 203 insertions(+), 261 deletions(-)
The min_sampling_rate governor tunable is a field in struct dbs_data, so it has to be handled in a special way separate from the rest of governor tunables. In particular, that requires a special macro to be present for creating its show/store sysfs attribute callbacks.
However, there is no real need for the data structures and code in question to be arranged this way and if min_sampling_rate is moved to data structures holding the other governor tunables, the sysfs attribute creation macros that work with those tunables will also work with min_sampling_rate and the special macro for it won't be necessary any more. That will make it easier to modify the governor code going forward, so do it.
[ Rafael: Wrote changelog ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq_conservative.c | 14 ++++++-------
 drivers/cpufreq/cpufreq_governor.c     | 36 ++++++++++++++++++++--------------
 drivers/cpufreq/cpufreq_governor.h     | 18 ++---------------
 drivers/cpufreq/cpufreq_ondemand.c     | 14 ++++++-------
 4 files changed, 37 insertions(+), 45 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 606ad74abe6e..57750367bd26 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -185,7 +185,7 @@ static ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
 	if (ret != 1)
 		return -EINVAL;

-	cs_tuners->sampling_rate = max(input, dbs_data->min_sampling_rate);
+	cs_tuners->sampling_rate = max(input, cs_tuners->min_sampling_rate);
 	return count;
 }

@@ -281,7 +281,7 @@ show_store_one(cs, up_threshold);
 show_store_one(cs, down_threshold);
 show_store_one(cs, ignore_nice_load);
 show_store_one(cs, freq_step);
-declare_show_sampling_rate_min(cs);
+show_one(cs, min_sampling_rate);

 gov_sys_pol_attr_rw(sampling_rate);
 gov_sys_pol_attr_rw(sampling_down_factor);
@@ -289,10 +289,10 @@ gov_sys_pol_attr_rw(up_threshold);
 gov_sys_pol_attr_rw(down_threshold);
 gov_sys_pol_attr_rw(ignore_nice_load);
 gov_sys_pol_attr_rw(freq_step);
-gov_sys_pol_attr_ro(sampling_rate_min);
+gov_sys_pol_attr_ro(min_sampling_rate);

 static struct attribute *dbs_attributes_gov_sys[] = {
-	&sampling_rate_min_gov_sys.attr,
+	&min_sampling_rate_gov_sys.attr,
 	&sampling_rate_gov_sys.attr,
 	&sampling_down_factor_gov_sys.attr,
 	&up_threshold_gov_sys.attr,
@@ -308,7 +308,7 @@ static struct attribute_group cs_attr_group_gov_sys = {
 };

 static struct attribute *dbs_attributes_gov_pol[] = {
-	&sampling_rate_min_gov_pol.attr,
+	&min_sampling_rate_gov_pol.attr,
 	&sampling_rate_gov_pol.attr,
 	&sampling_down_factor_gov_pol.attr,
 	&up_threshold_gov_pol.attr,
@@ -340,10 +340,10 @@ static int cs_init(struct dbs_data *dbs_data, bool notify)
 	tuners->sampling_down_factor = DEF_SAMPLING_DOWN_FACTOR;
 	tuners->ignore_nice_load = 0;
 	tuners->freq_step = DEF_FREQUENCY_STEP;
+	tuners->min_sampling_rate = MIN_SAMPLING_RATE_RATIO *
+				    jiffies_to_usecs(10);

 	dbs_data->tuners = tuners;
-	dbs_data->min_sampling_rate = MIN_SAMPLING_RATE_RATIO *
-				      jiffies_to_usecs(10);

 	if (notify)
 		cpufreq_register_notifier(&cs_cpufreq_notifier_block,
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index e0d111024d48..9a7edc91ad57 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -286,16 +286,32 @@ static void dbs_timer_handler(unsigned long data)
 	queue_work(system_wq, &shared->work);
 }
-static void set_sampling_rate(struct dbs_data *dbs_data,
-			      unsigned int sampling_rate)
+static void set_sampling_rate(struct cpufreq_policy *policy,
+			      struct dbs_data *dbs_data)
 {
+	unsigned int *sampling_rate;
+	unsigned int *min_sampling_rate;
+	unsigned int latency;
+
+	/* policy latency is in ns. Convert it to us first */
+	latency = policy->cpuinfo.transition_latency / 1000;
+	if (latency == 0)
+		latency = 1;
+
 	if (dbs_data->cdata->governor == GOV_CONSERVATIVE) {
 		struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;

-		cs_tuners->sampling_rate = sampling_rate;
+		sampling_rate = &cs_tuners->sampling_rate;
+		min_sampling_rate = &cs_tuners->min_sampling_rate;
 	} else {
 		struct od_dbs_tuners *od_tuners = dbs_data->tuners;

-		od_tuners->sampling_rate = sampling_rate;
+		sampling_rate = &od_tuners->sampling_rate;
+		min_sampling_rate = &od_tuners->min_sampling_rate;
 	}
+
+	/* Bring kernel and HW constraints together */
+	*min_sampling_rate = max(*min_sampling_rate,
+				 MIN_LATENCY_MULTIPLIER * latency);
+	*sampling_rate = max(*min_sampling_rate, latency * LATENCY_MULTIPLIER);
 }

 static int alloc_common_dbs_info(struct cpufreq_policy *policy,
@@ -338,7 +354,6 @@ static int cpufreq_governor_init(struct cpufreq_policy *policy,
 				 struct dbs_data *dbs_data,
 				 struct common_dbs_data *cdata)
 {
-	unsigned int latency;
 	int ret;

 	/* State should be equivalent to EXIT */
@@ -373,16 +388,7 @@ static int cpufreq_governor_init(struct cpufreq_policy *policy,
 	if (ret)
 		goto free_common_dbs_info;

-	/* policy latency is in ns. Convert it to us first */
-	latency = policy->cpuinfo.transition_latency / 1000;
-	if (latency == 0)
-		latency = 1;
-
-	/* Bring kernel and HW constraints together */
-	dbs_data->min_sampling_rate = max(dbs_data->min_sampling_rate,
-					  MIN_LATENCY_MULTIPLIER * latency);
-	set_sampling_rate(dbs_data, max(dbs_data->min_sampling_rate,
-					latency * LATENCY_MULTIPLIER));
+	set_sampling_rate(policy, dbs_data);
 	if (!have_governor_per_policy())
 		cdata->gdbs_data = dbs_data;
diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h
index 91e767a058a7..ad44a8546a3a 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -183,6 +183,7 @@ struct od_dbs_tuners {
 	unsigned int up_threshold;
 	unsigned int powersave_bias;
 	unsigned int io_is_busy;
+	unsigned int min_sampling_rate;
 };

 struct cs_dbs_tuners {
@@ -192,6 +193,7 @@ struct cs_dbs_tuners {
 	unsigned int up_threshold;
 	unsigned int down_threshold;
 	unsigned int freq_step;
+	unsigned int min_sampling_rate;
 };

 /* Common Governor data across policies */
@@ -230,7 +232,6 @@ struct common_dbs_data {
 /* Governor Per policy data */
 struct dbs_data {
 	struct common_dbs_data *cdata;
-	unsigned int min_sampling_rate;
 	int usage_count;
 	void *tuners;
 };
@@ -254,21 +255,6 @@ static inline int delay_for_sampling_rate(unsigned int sampling_rate)
 	return delay;
 }

-#define declare_show_sampling_rate_min(_gov)				\
-static ssize_t show_sampling_rate_min_gov_sys			\
-(struct kobject *kobj, struct attribute *attr, char *buf)		\
-{									\
-	struct dbs_data *dbs_data = _gov##_dbs_cdata.gdbs_data;	\
-	return sprintf(buf, "%u\n", dbs_data->min_sampling_rate);	\
-}									\
-									\
-static ssize_t show_sampling_rate_min_gov_pol			\
-(struct cpufreq_policy *policy, char *buf)				\
-{									\
-	struct dbs_data *dbs_data = policy->governor_data;		\
-	return sprintf(buf, "%u\n", dbs_data->min_sampling_rate);	\
-}
-
 extern struct mutex cpufreq_governor_lock;
 void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay);
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index eae51070c034..b31f64745232 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -250,7 +250,7 @@ static void update_sampling_rate(struct dbs_data *dbs_data,
 	int cpu;

 	od_tuners->sampling_rate = new_rate = max(new_rate,
-			dbs_data->min_sampling_rate);
+			od_tuners->min_sampling_rate);

 	/*
 	 * Lock governor so that governor start/stop can't execute in parallel.
@@ -442,7 +442,7 @@ show_store_one(od, up_threshold);
 show_store_one(od, sampling_down_factor);
 show_store_one(od, ignore_nice_load);
 show_store_one(od, powersave_bias);
-declare_show_sampling_rate_min(od);
+show_one(od, min_sampling_rate);

 gov_sys_pol_attr_rw(sampling_rate);
 gov_sys_pol_attr_rw(io_is_busy);
@@ -450,10 +450,10 @@ gov_sys_pol_attr_rw(up_threshold);
 gov_sys_pol_attr_rw(sampling_down_factor);
 gov_sys_pol_attr_rw(ignore_nice_load);
 gov_sys_pol_attr_rw(powersave_bias);
-gov_sys_pol_attr_ro(sampling_rate_min);
+gov_sys_pol_attr_ro(min_sampling_rate);

 static struct attribute *dbs_attributes_gov_sys[] = {
-	&sampling_rate_min_gov_sys.attr,
+	&min_sampling_rate_gov_sys.attr,
 	&sampling_rate_gov_sys.attr,
 	&up_threshold_gov_sys.attr,
 	&sampling_down_factor_gov_sys.attr,
@@ -469,7 +469,7 @@ static struct attribute_group od_attr_group_gov_sys = {
 };

 static struct attribute *dbs_attributes_gov_pol[] = {
-	&sampling_rate_min_gov_pol.attr,
+	&min_sampling_rate_gov_pol.attr,
 	&sampling_rate_gov_pol.attr,
 	&up_threshold_gov_pol.attr,
 	&sampling_down_factor_gov_pol.attr,
@@ -509,12 +509,12 @@ static int od_init(struct dbs_data *dbs_data, bool notify)
 		 * not depending on HZ, but fixed (very low). The deferred
 		 * timer might skip some samples if idle/sleeping as needed.
 		 */
-		dbs_data->min_sampling_rate = MICRO_FREQUENCY_MIN_SAMPLE_RATE;
+		tuners->min_sampling_rate = MICRO_FREQUENCY_MIN_SAMPLE_RATE;
 	} else {
 		tuners->up_threshold = DEF_FREQUENCY_UP_THRESHOLD;

 		/* For correct statistics, we need 10 ticks for each measure */
-		dbs_data->min_sampling_rate = MIN_SAMPLING_RATE_RATIO *
+		tuners->min_sampling_rate = MIN_SAMPLING_RATE_RATIO *
 					jiffies_to_usecs(10);
 	}
On Wednesday, February 03, 2016 07:32:17 PM Viresh Kumar wrote:
> The min_sampling_rate governor tunable is a field in struct dbs_data, so it has to be handled in a special way separate from the rest of governor tunables. In particular, that requires a special macro to be present for creating its show/store sysfs attribute callbacks.
>
> However, there is no real need for the data structures and code in question to be arranged this way and if min_sampling_rate is moved to data structures holding the other governor tunables, the sysfs attribute creation macros that work with those tunables will also work with min_sampling_rate and the special macro for it won't be necessary any more. That will make it easier to modify the governor code going forward, so do it.
>
> [ Rafael: Wrote changelog ]
> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
I'm having some second thoughts about the utility of this patch to be honest.
I actually would like to move some tunables in the opposite direction. That is, from struct od_dbs_tuners and struct cs_dbs_tuners to struct dbs_data. The tuners field in that will then become something like gov_tunables (in analogy with gov_ops in struct common_dbs_data) and it will point to governor-specific tunables.
The reason why I'd like to do that is to make it easier to get rid of the super-ugly governor == GOV_CONSERVATIVE etc tests in the common code.
Also I think that governor-specific tunables should be defined in the .c file for that governor rather than in the common header.
We will need two sets of macros for their sysfs attributes then, but that's not a big deal IMO.
Thanks, Rafael
On 05-02-16, 03:31, Rafael J. Wysocki wrote:
> I'm having some second thoughts about the utility of this patch to be honest.
>
> I actually would like to move some tunables in the opposite direction. That is, from struct od_dbs_tuners and struct cs_dbs_tuners to struct dbs_data. The tuners field in that will then become something like gov_tunables (in analogy with gov_ops in struct common_dbs_data) and it will point to governor-specific tunables.
>
> The reason why I'd like to do that is to make it easier to get rid of the super-ugly governor == GOV_CONSERVATIVE etc tests in the common code.
>
> Also I think that governor-specific tunables should be defined in the .c file for that governor rather than in the common header.
>
> We will need two sets of macros for their sysfs attributes then, but that's not a big deal IMO.
I agree with that, no issues from my side.
But this patch was actually required to kill the ugly macros. So, if we are planning to take this series as is, then maybe we can keep it for now and fix everything together with your patches :)
The ondemand and conservative governors use the global-attr or freq-attr structures to represent sysfs attributes corresponding to their tunables (which of them is actually used depends on whether or not different policy objects can use the same governor with different tunables at the same time and, consequently, on where those attributes are located in sysfs).
Unfortunately, in the freq-attr case, the standard cpufreq show/store sysfs attribute callbacks are applied to the governor tunable attributes and they always acquire the policy->rwsem lock before carrying out the operation. That may lead to an ABBA deadlock if governor tunable attributes are removed under policy->rwsem while one of them is being accessed concurrently (if sysfs attributes removal wins the race, it will wait for the access to complete with policy->rwsem held while the attribute callback will block on policy->rwsem indefinitely).
We attempted to address this issue by dropping policy->rwsem around governor tunable attributes removal (that is, around invocations of the ->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT) in cpufreq_set_policy(), but that opened up race conditions that had not been possible with policy->rwsem held all the time. Therefore policy->rwsem cannot be dropped in cpufreq_set_policy() at any point, but the deadlock situation described above must be avoided too.
To that end, use the observation that in principle governor tunables may be represented by the same data type regardless of whether the governor is system-wide or per-policy and introduce a new structure, struct governor_attr, for representing them and new corresponding macros for creating show/store sysfs callbacks for them. Also make their parent kobject use a new kobject type whose default show/store callbacks are not related to the standard core cpufreq ones in any way (and they don't acquire policy->rwsem in particular).
[ Rafael: Wrote changelog ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq_conservative.c | 73 ++++++++++++----------------------
 drivers/cpufreq/cpufreq_governor.c     | 69 +++++++++++++++++++++++++++-----
 drivers/cpufreq/cpufreq_governor.h     | 34 ++++++++++++++--
 drivers/cpufreq/cpufreq_ondemand.c     | 73 ++++++++++++----------------------
 4 files changed, 142 insertions(+), 107 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 57750367bd26..c749fb4fe5d2 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -275,54 +275,33 @@ static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
 	return count;
 }

-show_store_one(cs, sampling_rate);
-show_store_one(cs, sampling_down_factor);
-show_store_one(cs, up_threshold);
-show_store_one(cs, down_threshold);
-show_store_one(cs, ignore_nice_load);
-show_store_one(cs, freq_step);
-show_one(cs, min_sampling_rate);
-
-gov_sys_pol_attr_rw(sampling_rate);
-gov_sys_pol_attr_rw(sampling_down_factor);
-gov_sys_pol_attr_rw(up_threshold);
-gov_sys_pol_attr_rw(down_threshold);
-gov_sys_pol_attr_rw(ignore_nice_load);
-gov_sys_pol_attr_rw(freq_step);
-gov_sys_pol_attr_ro(min_sampling_rate);
-
-static struct attribute *dbs_attributes_gov_sys[] = {
-	&min_sampling_rate_gov_sys.attr,
-	&sampling_rate_gov_sys.attr,
-	&sampling_down_factor_gov_sys.attr,
-	&up_threshold_gov_sys.attr,
-	&down_threshold_gov_sys.attr,
-	&ignore_nice_load_gov_sys.attr,
-	&freq_step_gov_sys.attr,
+gov_show_one(cs, sampling_rate);
+gov_show_one(cs, sampling_down_factor);
+gov_show_one(cs, up_threshold);
+gov_show_one(cs, down_threshold);
+gov_show_one(cs, ignore_nice_load);
+gov_show_one(cs, freq_step);
+gov_show_one(cs, min_sampling_rate);
+
+gov_attr_rw(sampling_rate);
+gov_attr_rw(sampling_down_factor);
+gov_attr_rw(up_threshold);
+gov_attr_rw(down_threshold);
+gov_attr_rw(ignore_nice_load);
+gov_attr_rw(freq_step);
+gov_attr_ro(min_sampling_rate);
+
+static struct attribute *cs_attributes[] = {
+	&min_sampling_rate.attr,
+	&sampling_rate.attr,
+	&sampling_down_factor.attr,
+	&up_threshold.attr,
+	&down_threshold.attr,
+	&ignore_nice_load.attr,
+	&freq_step.attr,
 	NULL
 };

-static struct attribute_group cs_attr_group_gov_sys = {
-	.attrs = dbs_attributes_gov_sys,
-	.name = "conservative",
-};
-
-static struct attribute *dbs_attributes_gov_pol[] = {
-	&min_sampling_rate_gov_pol.attr,
-	&sampling_rate_gov_pol.attr,
-	&sampling_down_factor_gov_pol.attr,
-	&up_threshold_gov_pol.attr,
-	&down_threshold_gov_pol.attr,
-	&ignore_nice_load_gov_pol.attr,
-	&freq_step_gov_pol.attr,
-	NULL
-};
-
-static struct attribute_group cs_attr_group_gov_pol = {
-	.attrs = dbs_attributes_gov_pol,
-	.name = "conservative",
-};
-
 /************************** sysfs end ************************/

 static int cs_init(struct dbs_data *dbs_data, bool notify)
@@ -365,8 +344,8 @@ define_get_cpu_dbs_routines(cs_cpu_dbs_info);

 static struct common_dbs_data cs_dbs_cdata = {
 	.governor = GOV_CONSERVATIVE,
-	.attr_group_gov_sys = &cs_attr_group_gov_sys,
-	.attr_group_gov_pol = &cs_attr_group_gov_pol,
+	.kobj_name = "conservative",
+	.kobj_type = { .sysfs_ops = &governor_sysfs_ops, .default_attrs = cs_attributes },
 	.get_cpu_cdbs = get_cpu_cdbs,
 	.get_cpu_dbs_info_s = get_cpu_dbs_info_s,
 	.gov_dbs_timer = cs_dbs_timer,
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 9a7edc91ad57..e7f79d2477fa 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -22,14 +22,58 @@
#include "cpufreq_governor.h"
-static struct attribute_group *get_sysfs_attr(struct dbs_data *dbs_data)
+static inline struct dbs_data *to_dbs_data(struct kobject *kobj)
 {
-	if (have_governor_per_policy())
-		return dbs_data->cdata->attr_group_gov_pol;
-	else
-		return dbs_data->cdata->attr_group_gov_sys;
+	return container_of(kobj, struct dbs_data, kobj);
 }

+static inline struct governor_attr *to_gov_attr(struct attribute *attr)
+{
+	return container_of(attr, struct governor_attr, attr);
+}
+
+static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
+			     char *buf)
+{
+	struct dbs_data *dbs_data = to_dbs_data(kobj);
+	struct governor_attr *gattr = to_gov_attr(attr);
+	int ret = -EIO;
+
+	if (gattr->show)
+		ret = gattr->show(dbs_data, buf);
+
+	return ret;
+}
+
+static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
+			      const char *buf, size_t count)
+{
+	struct dbs_data *dbs_data = to_dbs_data(kobj);
+	struct governor_attr *gattr = to_gov_attr(attr);
+	int ret = -EIO;
+
+	mutex_lock(&dbs_data->mutex);
+
+	if (gattr->store)
+		ret = gattr->store(dbs_data, buf, count);
+
+	mutex_unlock(&dbs_data->mutex);
+
+	return ret;
+}
+
+/*
+ * Sysfs Ops for accessing governor attributes.
+ *
+ * All show/store invocations for governor specific sysfs attributes, will first
+ * call the below show/store callbacks and the attribute specific callback will
+ * be called from within it.
+ */
+const struct sysfs_ops governor_sysfs_ops = {
+	.show = governor_show,
+	.store = governor_store,
+};
+
 void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
 {
 	struct cpu_dbs_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
@@ -383,6 +427,7 @@ static int cpufreq_governor_init(struct cpufreq_policy *policy,
 	dbs_data->cdata = cdata;
 	dbs_data->usage_count = 1;
+	mutex_init(&dbs_data->mutex);

 	ret = cdata->init(dbs_data, !policy->governor->initialized);
 	if (ret)
@@ -395,10 +440,14 @@ static int cpufreq_governor_init(struct cpufreq_policy *policy,
policy->governor_data = dbs_data;
-	ret = sysfs_create_group(get_governor_parent_kobj(policy),
-				 get_sysfs_attr(dbs_data));
-	if (ret)
+	ret = kobject_init_and_add(&dbs_data->kobj, &cdata->kobj_type,
+				   get_governor_parent_kobj(policy),
+				   cdata->kobj_name);
+	if (ret) {
+		pr_err("cpufreq: Governor initialization failed (dbs_data kobject initialization error %d)\n",
+		       ret);
 		goto reset_gdbs_data;
+	}
return 0;
@@ -426,8 +475,7 @@ static int cpufreq_governor_exit(struct cpufreq_policy *policy,
 		return -EBUSY;

 	if (!--dbs_data->usage_count) {
-		sysfs_remove_group(get_governor_parent_kobj(policy),
-				   get_sysfs_attr(dbs_data));
+		kobject_put(&dbs_data->kobj);
policy->governor_data = NULL;
@@ -435,6 +483,7 @@ static int cpufreq_governor_exit(struct cpufreq_policy *policy,
 			cdata->gdbs_data = NULL;

 		cdata->exit(dbs_data, policy->governor->initialized == 1);
+		mutex_destroy(&dbs_data->mutex);
 		kfree(dbs_data);
 	} else {
 		policy->governor_data = NULL;
diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h
index ad44a8546a3a..45a6ac1ecfb0 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -108,6 +108,31 @@ static ssize_t store_##file_name##_gov_pol			\
 show_one(_gov, file_name);					\
 store_one(_gov, file_name)

+/* Governor's specific attributes */
+struct dbs_data;
+struct governor_attr {
+	struct attribute attr;
+	ssize_t (*show)(struct dbs_data *dbs_data, char *buf);
+	ssize_t (*store)(struct dbs_data *dbs_data, const char *buf,
+			 size_t count);
+};
+
+#define gov_show_one(_gov, file_name)					\
+static ssize_t show_##file_name						\
+(struct dbs_data *dbs_data, char *buf)					\
+{									\
+	struct _gov##_dbs_tuners *tuners = dbs_data->tuners;		\
+	return sprintf(buf, "%u\n", tuners->file_name);			\
+}
+
+#define gov_attr_ro(_name)						\
+static struct governor_attr _name =					\
+__ATTR(_name, 0444, show_##_name, NULL)
+
+#define gov_attr_rw(_name)						\
+static struct governor_attr _name =					\
+__ATTR(_name, 0644, show_##_name, store_##_name)
+
 /* create helper routines */
 #define define_get_cpu_dbs_routines(_dbs_info)				\
 static struct cpu_dbs_info *get_cpu_cdbs(int cpu)			\
@@ -197,14 +222,13 @@ struct cs_dbs_tuners {
 };
 /* Common Governor data across policies */
-struct dbs_data;
 struct common_dbs_data {
 	/* Common across governors */
 #define GOV_ONDEMAND		0
 #define GOV_CONSERVATIVE	1
 	int governor;
-	struct attribute_group *attr_group_gov_sys; /* one governor - system */
-	struct attribute_group *attr_group_gov_pol; /* one governor - policy */
+	const char *kobj_name;
+	struct kobj_type kobj_type;

 	/*
 	 * Common data for platforms that don't set
@@ -234,6 +258,9 @@ struct dbs_data {
 	struct common_dbs_data *cdata;
 	int usage_count;
 	void *tuners;
+	struct kobject kobj;
+	/* Protect concurrent updates to governor tunables from sysfs */
+	struct mutex mutex;
 };

 /* Governor specific ops, will be passed to dbs_data->gov_ops */
@@ -256,6 +283,7 @@ static inline int delay_for_sampling_rate(unsigned int sampling_rate)
 }

 extern struct mutex cpufreq_governor_lock;
+extern const struct sysfs_ops governor_sysfs_ops;

 void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay);
 void gov_cancel_work(struct cpu_common_dbs_info *shared);
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index b31f64745232..82ed490f7de0 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -436,54 +436,33 @@ static ssize_t store_powersave_bias(struct dbs_data *dbs_data, const char *buf,
 	return count;
 }

-show_store_one(od, sampling_rate);
-show_store_one(od, io_is_busy);
-show_store_one(od, up_threshold);
-show_store_one(od, sampling_down_factor);
-show_store_one(od, ignore_nice_load);
-show_store_one(od, powersave_bias);
-show_one(od, min_sampling_rate);
-
-gov_sys_pol_attr_rw(sampling_rate);
-gov_sys_pol_attr_rw(io_is_busy);
-gov_sys_pol_attr_rw(up_threshold);
-gov_sys_pol_attr_rw(sampling_down_factor);
-gov_sys_pol_attr_rw(ignore_nice_load);
-gov_sys_pol_attr_rw(powersave_bias);
-gov_sys_pol_attr_ro(min_sampling_rate);
-
-static struct attribute *dbs_attributes_gov_sys[] = {
-	&min_sampling_rate_gov_sys.attr,
-	&sampling_rate_gov_sys.attr,
-	&up_threshold_gov_sys.attr,
-	&sampling_down_factor_gov_sys.attr,
-	&ignore_nice_load_gov_sys.attr,
-	&powersave_bias_gov_sys.attr,
-	&io_is_busy_gov_sys.attr,
+gov_show_one(od, sampling_rate);
+gov_show_one(od, io_is_busy);
+gov_show_one(od, up_threshold);
+gov_show_one(od, sampling_down_factor);
+gov_show_one(od, ignore_nice_load);
+gov_show_one(od, powersave_bias);
+gov_show_one(od, min_sampling_rate);
+
+gov_attr_rw(sampling_rate);
+gov_attr_rw(io_is_busy);
+gov_attr_rw(up_threshold);
+gov_attr_rw(sampling_down_factor);
+gov_attr_rw(ignore_nice_load);
+gov_attr_rw(powersave_bias);
+gov_attr_ro(min_sampling_rate);
+
+static struct attribute *od_attributes[] = {
+	&min_sampling_rate.attr,
+	&sampling_rate.attr,
+	&up_threshold.attr,
+	&sampling_down_factor.attr,
+	&ignore_nice_load.attr,
+	&powersave_bias.attr,
+	&io_is_busy.attr,
 	NULL
 };

-static struct attribute_group od_attr_group_gov_sys = {
-	.attrs = dbs_attributes_gov_sys,
-	.name = "ondemand",
-};
-
-static struct attribute *dbs_attributes_gov_pol[] = {
-	&min_sampling_rate_gov_pol.attr,
-	&sampling_rate_gov_pol.attr,
-	&up_threshold_gov_pol.attr,
-	&sampling_down_factor_gov_pol.attr,
-	&ignore_nice_load_gov_pol.attr,
-	&powersave_bias_gov_pol.attr,
-	&io_is_busy_gov_pol.attr,
-	NULL
-};
-
-static struct attribute_group od_attr_group_gov_pol = {
-	.attrs = dbs_attributes_gov_pol,
-	.name = "ondemand",
-};
-
 /************************** sysfs end ************************/

 static int od_init(struct dbs_data *dbs_data, bool notify)
@@ -542,8 +521,8 @@ static struct od_ops od_ops = {

 static struct common_dbs_data od_dbs_cdata = {
 	.governor = GOV_ONDEMAND,
-	.attr_group_gov_sys = &od_attr_group_gov_sys,
-	.attr_group_gov_pol = &od_attr_group_gov_pol,
+	.kobj_name = "ondemand",
+	.kobj_type = { .sysfs_ops = &governor_sysfs_ops, .default_attrs = od_attributes },
 	.get_cpu_cdbs = get_cpu_cdbs,
 	.get_cpu_dbs_info_s = get_cpu_dbs_info_s,
 	.gov_dbs_timer = od_dbs_timer,
On 03-02-16, 19:32, Viresh Kumar wrote:
The build bot reported a failure when the governors are built as modules; the minor fix is:
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index e7f79d2477fa..f76a83a99ca4 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -73,6 +73,7 @@ const struct sysfs_ops governor_sysfs_ops = {
 	.show = governor_show,
 	.store = governor_store,
 };
+EXPORT_SYMBOL_GPL(governor_sysfs_ops);

 void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
 {
Full patch pasted below.
-------------------------8<-------------------------
From: Viresh Kumar <viresh.kumar@linaro.org>
Date: Tue, 2 Feb 2016 12:35:01 +0530
Subject: [PATCH] cpufreq: governor: New sysfs show/store callbacks for
 governor tunables
The ondemand and conservative governors use the global-attr or freq-attr structures to represent sysfs attributes corresponding to their tunables (which of them is actually used depends on whether or not different policy objects can use the same governor with different tunables at the same time and, consequently, on where those attributes are located in sysfs).
Unfortunately, in the freq-attr case, the standard cpufreq show/store sysfs attribute callbacks are applied to the governor tunable attributes and they always acquire the policy->rwsem lock before carrying out the operation. That may lead to an ABBA deadlock if governor tunable attributes are removed under policy->rwsem while one of them is being accessed concurrently (if sysfs attributes removal wins the race, it will wait for the access to complete with policy->rwsem held while the attribute callback will block on policy->rwsem indefinitely).
We attempted to address this issue by dropping policy->rwsem around governor tunable attributes removal (that is, around invocations of the ->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT) in cpufreq_set_policy(), but that opened up race conditions that had not been possible with policy->rwsem held all the time. Therefore policy->rwsem cannot be dropped in cpufreq_set_policy() at any point, but the deadlock situation described above must be avoided too.
To that end, use the observation that in principle governor tunables may be represented by the same data type regardless of whether the governor is system-wide or per-policy and introduce a new structure, struct governor_attr, for representing them and new corresponding macros for creating show/store sysfs callbacks for them. Also make their parent kobject use a new kobject type whose default show/store callbacks are not related to the standard core cpufreq ones in any way (and they don't acquire policy->rwsem in particular).
[ Rafael: Wrote changelog ]
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq_conservative.c | 73 ++++++++++++----------------------
 drivers/cpufreq/cpufreq_governor.c     | 70 +++++++++++++++++++++++++++-----
 drivers/cpufreq/cpufreq_governor.h     | 34 ++++++++++++++--
 drivers/cpufreq/cpufreq_ondemand.c     | 73 ++++++++++++----------------------
 4 files changed, 143 insertions(+), 107 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_conservative.c b/drivers/cpufreq/cpufreq_conservative.c
index 57750367bd26..c749fb4fe5d2 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -275,54 +275,33 @@ static ssize_t store_freq_step(struct dbs_data *dbs_data, const char *buf,
 	return count;
 }

-show_store_one(cs, sampling_rate);
-show_store_one(cs, sampling_down_factor);
-show_store_one(cs, up_threshold);
-show_store_one(cs, down_threshold);
-show_store_one(cs, ignore_nice_load);
-show_store_one(cs, freq_step);
-show_one(cs, min_sampling_rate);
-
-gov_sys_pol_attr_rw(sampling_rate);
-gov_sys_pol_attr_rw(sampling_down_factor);
-gov_sys_pol_attr_rw(up_threshold);
-gov_sys_pol_attr_rw(down_threshold);
-gov_sys_pol_attr_rw(ignore_nice_load);
-gov_sys_pol_attr_rw(freq_step);
-gov_sys_pol_attr_ro(min_sampling_rate);
-
-static struct attribute *dbs_attributes_gov_sys[] = {
-	&min_sampling_rate_gov_sys.attr,
-	&sampling_rate_gov_sys.attr,
-	&sampling_down_factor_gov_sys.attr,
-	&up_threshold_gov_sys.attr,
-	&down_threshold_gov_sys.attr,
-	&ignore_nice_load_gov_sys.attr,
-	&freq_step_gov_sys.attr,
+gov_show_one(cs, sampling_rate);
+gov_show_one(cs, sampling_down_factor);
+gov_show_one(cs, up_threshold);
+gov_show_one(cs, down_threshold);
+gov_show_one(cs, ignore_nice_load);
+gov_show_one(cs, freq_step);
+gov_show_one(cs, min_sampling_rate);
+
+gov_attr_rw(sampling_rate);
+gov_attr_rw(sampling_down_factor);
+gov_attr_rw(up_threshold);
+gov_attr_rw(down_threshold);
+gov_attr_rw(ignore_nice_load);
+gov_attr_rw(freq_step);
+gov_attr_ro(min_sampling_rate);
+
+static struct attribute *cs_attributes[] = {
+	&min_sampling_rate.attr,
+	&sampling_rate.attr,
+	&sampling_down_factor.attr,
+	&up_threshold.attr,
+	&down_threshold.attr,
+	&ignore_nice_load.attr,
+	&freq_step.attr,
 	NULL
 };

-static struct attribute_group cs_attr_group_gov_sys = {
-	.attrs = dbs_attributes_gov_sys,
-	.name = "conservative",
-};
-
-static struct attribute *dbs_attributes_gov_pol[] = {
-	&min_sampling_rate_gov_pol.attr,
-	&sampling_rate_gov_pol.attr,
-	&sampling_down_factor_gov_pol.attr,
-	&up_threshold_gov_pol.attr,
-	&down_threshold_gov_pol.attr,
-	&ignore_nice_load_gov_pol.attr,
-	&freq_step_gov_pol.attr,
-	NULL
-};
-
-static struct attribute_group cs_attr_group_gov_pol = {
-	.attrs = dbs_attributes_gov_pol,
-	.name = "conservative",
-};
-
 /************************** sysfs end ************************/

 static int cs_init(struct dbs_data *dbs_data, bool notify)
@@ -365,8 +344,8 @@ define_get_cpu_dbs_routines(cs_cpu_dbs_info);

 static struct common_dbs_data cs_dbs_cdata = {
 	.governor = GOV_CONSERVATIVE,
-	.attr_group_gov_sys = &cs_attr_group_gov_sys,
-	.attr_group_gov_pol = &cs_attr_group_gov_pol,
+	.kobj_name = "conservative",
+	.kobj_type = { .sysfs_ops = &governor_sysfs_ops, .default_attrs = cs_attributes },
 	.get_cpu_cdbs = get_cpu_cdbs,
 	.get_cpu_dbs_info_s = get_cpu_dbs_info_s,
 	.gov_dbs_timer = cs_dbs_timer,
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 9a7edc91ad57..f76a83a99ca4 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -22,14 +22,59 @@

 #include "cpufreq_governor.h"

-static struct attribute_group *get_sysfs_attr(struct dbs_data *dbs_data)
+static inline struct dbs_data *to_dbs_data(struct kobject *kobj)
 {
-	if (have_governor_per_policy())
-		return dbs_data->cdata->attr_group_gov_pol;
-	else
-		return dbs_data->cdata->attr_group_gov_sys;
+	return container_of(kobj, struct dbs_data, kobj);
 }

+static inline struct governor_attr *to_gov_attr(struct attribute *attr)
+{
+	return container_of(attr, struct governor_attr, attr);
+}
+
+static ssize_t governor_show(struct kobject *kobj, struct attribute *attr,
+			     char *buf)
+{
+	struct dbs_data *dbs_data = to_dbs_data(kobj);
+	struct governor_attr *gattr = to_gov_attr(attr);
+	int ret = -EIO;
+
+	if (gattr->show)
+		ret = gattr->show(dbs_data, buf);
+
+	return ret;
+}
+
+static ssize_t governor_store(struct kobject *kobj, struct attribute *attr,
+			      const char *buf, size_t count)
+{
+	struct dbs_data *dbs_data = to_dbs_data(kobj);
+	struct governor_attr *gattr = to_gov_attr(attr);
+	int ret = -EIO;
+
+	mutex_lock(&dbs_data->mutex);
+
+	if (gattr->store)
+		ret = gattr->store(dbs_data, buf, count);
+
+	mutex_unlock(&dbs_data->mutex);
+
+	return ret;
+}
+
+/*
+ * Sysfs Ops for accessing governor attributes.
+ *
+ * All show/store invocations for governor specific sysfs attributes, will first
+ * call the below show/store callbacks and the attribute specific callback will
+ * be called from within it.
+ */
+const struct sysfs_ops governor_sysfs_ops = {
+	.show = governor_show,
+	.store = governor_store,
+};
+EXPORT_SYMBOL_GPL(governor_sysfs_ops);
+
 void dbs_check_cpu(struct dbs_data *dbs_data, int cpu)
 {
 	struct cpu_dbs_info *cdbs = dbs_data->cdata->get_cpu_cdbs(cpu);
@@ -383,6 +428,7 @@ static int cpufreq_governor_init(struct cpufreq_policy *policy,
dbs_data->cdata = cdata; dbs_data->usage_count = 1; + mutex_init(&dbs_data->mutex);
ret = cdata->init(dbs_data, !policy->governor->initialized); if (ret) @@ -395,10 +441,14 @@ static int cpufreq_governor_init(struct cpufreq_policy *policy,
policy->governor_data = dbs_data;
- ret = sysfs_create_group(get_governor_parent_kobj(policy), - get_sysfs_attr(dbs_data)); - if (ret) + ret = kobject_init_and_add(&dbs_data->kobj, &cdata->kobj_type, + get_governor_parent_kobj(policy), + cdata->kobj_name); + if (ret) { + pr_err("cpufreq: Governor initialization failed (dbs_data kobject initialization error %d)\n", + ret); goto reset_gdbs_data; + }
return 0;
@@ -426,8 +476,7 @@ static int cpufreq_governor_exit(struct cpufreq_policy *policy, return -EBUSY;
if (!--dbs_data->usage_count) { - sysfs_remove_group(get_governor_parent_kobj(policy), - get_sysfs_attr(dbs_data)); + kobject_put(&dbs_data->kobj);
policy->governor_data = NULL;
@@ -435,6 +484,7 @@ static int cpufreq_governor_exit(struct cpufreq_policy *policy, cdata->gdbs_data = NULL;
cdata->exit(dbs_data, policy->governor->initialized == 1); + mutex_destroy(&dbs_data->mutex); kfree(dbs_data); } else { policy->governor_data = NULL; diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h index ad44a8546a3a..45a6ac1ecfb0 100644 --- a/drivers/cpufreq/cpufreq_governor.h +++ b/drivers/cpufreq/cpufreq_governor.h @@ -108,6 +108,31 @@ static ssize_t store_##file_name##_gov_pol \ show_one(_gov, file_name); \ store_one(_gov, file_name)
+/* Governor's specific attributes */ +struct dbs_data; +struct governor_attr { + struct attribute attr; + ssize_t (*show)(struct dbs_data *dbs_data, char *buf); + ssize_t (*store)(struct dbs_data *dbs_data, const char *buf, + size_t count); +}; + +#define gov_show_one(_gov, file_name) \ +static ssize_t show_##file_name \ +(struct dbs_data *dbs_data, char *buf) \ +{ \ + struct _gov##_dbs_tuners *tuners = dbs_data->tuners; \ + return sprintf(buf, "%u\n", tuners->file_name); \ +} + +#define gov_attr_ro(_name) \ +static struct governor_attr _name = \ +__ATTR(_name, 0444, show_##_name, NULL) + +#define gov_attr_rw(_name) \ +static struct governor_attr _name = \ +__ATTR(_name, 0644, show_##_name, store_##_name) + /* create helper routines */ #define define_get_cpu_dbs_routines(_dbs_info) \ static struct cpu_dbs_info *get_cpu_cdbs(int cpu) \ @@ -197,14 +222,13 @@ struct cs_dbs_tuners { };
/* Common Governor data across policies */ -struct dbs_data; struct common_dbs_data { /* Common across governors */ #define GOV_ONDEMAND 0 #define GOV_CONSERVATIVE 1 int governor; - struct attribute_group *attr_group_gov_sys; /* one governor - system */ - struct attribute_group *attr_group_gov_pol; /* one governor - policy */ + const char *kobj_name; + struct kobj_type kobj_type;
/* * Common data for platforms that don't set @@ -234,6 +258,9 @@ struct dbs_data { struct common_dbs_data *cdata; int usage_count; void *tuners; + struct kobject kobj; + /* Protect concurrent updates to governor tunables from sysfs */ + struct mutex mutex; };
/* Governor specific ops, will be passed to dbs_data->gov_ops */ @@ -256,6 +283,7 @@ static inline int delay_for_sampling_rate(unsigned int sampling_rate) }
extern struct mutex cpufreq_governor_lock; +extern const struct sysfs_ops governor_sysfs_ops;
void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay); void gov_cancel_work(struct cpu_common_dbs_info *shared); diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c index b31f64745232..82ed490f7de0 100644 --- a/drivers/cpufreq/cpufreq_ondemand.c +++ b/drivers/cpufreq/cpufreq_ondemand.c @@ -436,54 +436,33 @@ static ssize_t store_powersave_bias(struct dbs_data *dbs_data, const char *buf, return count; }
-show_store_one(od, sampling_rate); -show_store_one(od, io_is_busy); -show_store_one(od, up_threshold); -show_store_one(od, sampling_down_factor); -show_store_one(od, ignore_nice_load); -show_store_one(od, powersave_bias); -show_one(od, min_sampling_rate); - -gov_sys_pol_attr_rw(sampling_rate); -gov_sys_pol_attr_rw(io_is_busy); -gov_sys_pol_attr_rw(up_threshold); -gov_sys_pol_attr_rw(sampling_down_factor); -gov_sys_pol_attr_rw(ignore_nice_load); -gov_sys_pol_attr_rw(powersave_bias); -gov_sys_pol_attr_ro(min_sampling_rate); - -static struct attribute *dbs_attributes_gov_sys[] = { - &min_sampling_rate_gov_sys.attr, - &sampling_rate_gov_sys.attr, - &up_threshold_gov_sys.attr, - &sampling_down_factor_gov_sys.attr, - &ignore_nice_load_gov_sys.attr, - &powersave_bias_gov_sys.attr, - &io_is_busy_gov_sys.attr, +gov_show_one(od, sampling_rate); +gov_show_one(od, io_is_busy); +gov_show_one(od, up_threshold); +gov_show_one(od, sampling_down_factor); +gov_show_one(od, ignore_nice_load); +gov_show_one(od, powersave_bias); +gov_show_one(od, min_sampling_rate); + +gov_attr_rw(sampling_rate); +gov_attr_rw(io_is_busy); +gov_attr_rw(up_threshold); +gov_attr_rw(sampling_down_factor); +gov_attr_rw(ignore_nice_load); +gov_attr_rw(powersave_bias); +gov_attr_ro(min_sampling_rate); + +static struct attribute *od_attributes[] = { + &min_sampling_rate.attr, + &sampling_rate.attr, + &up_threshold.attr, + &sampling_down_factor.attr, + &ignore_nice_load.attr, + &powersave_bias.attr, + &io_is_busy.attr, NULL };
-static struct attribute_group od_attr_group_gov_sys = { - .attrs = dbs_attributes_gov_sys, - .name = "ondemand", -}; - -static struct attribute *dbs_attributes_gov_pol[] = { - &min_sampling_rate_gov_pol.attr, - &sampling_rate_gov_pol.attr, - &up_threshold_gov_pol.attr, - &sampling_down_factor_gov_pol.attr, - &ignore_nice_load_gov_pol.attr, - &powersave_bias_gov_pol.attr, - &io_is_busy_gov_pol.attr, - NULL -}; - -static struct attribute_group od_attr_group_gov_pol = { - .attrs = dbs_attributes_gov_pol, - .name = "ondemand", -}; - /************************** sysfs end ************************/
static int od_init(struct dbs_data *dbs_data, bool notify) @@ -542,8 +521,8 @@ static struct od_ops od_ops = {
static struct common_dbs_data od_dbs_cdata = { .governor = GOV_ONDEMAND, - .attr_group_gov_sys = &od_attr_group_gov_sys, - .attr_group_gov_pol = &od_attr_group_gov_pol, + .kobj_name = "ondemand", + .kobj_type = { .sysfs_ops = &governor_sysfs_ops, .default_attrs = od_attributes }, .get_cpu_cdbs = get_cpu_cdbs, .get_cpu_dbs_info_s = get_cpu_dbs_info_s, .gov_dbs_timer = od_dbs_timer,
The previous commit introduced a new set of macros for creating sysfs attributes that represent governor tunables and the old macros used for this purpose are not needed any more, so drop them.
[ Rafael: Written changelog ] Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- drivers/cpufreq/cpufreq_governor.h | 68 -------------------------------------- 1 file changed, 68 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h index 45a6ac1ecfb0..ed328a39c4ac 100644 --- a/drivers/cpufreq/cpufreq_governor.h +++ b/drivers/cpufreq/cpufreq_governor.h @@ -40,74 +40,6 @@ /* Ondemand Sampling types */ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
-/* - * Macro for creating governors sysfs routines - * - * - gov_sys: One governor instance per whole system - * - gov_pol: One governor instance per policy - */ - -/* Create attributes */ -#define gov_sys_attr_ro(_name) \ -static struct global_attr _name##_gov_sys = \ -__ATTR(_name, 0444, show_##_name##_gov_sys, NULL) - -#define gov_sys_attr_rw(_name) \ -static struct global_attr _name##_gov_sys = \ -__ATTR(_name, 0644, show_##_name##_gov_sys, store_##_name##_gov_sys) - -#define gov_pol_attr_ro(_name) \ -static struct freq_attr _name##_gov_pol = \ -__ATTR(_name, 0444, show_##_name##_gov_pol, NULL) - -#define gov_pol_attr_rw(_name) \ -static struct freq_attr _name##_gov_pol = \ -__ATTR(_name, 0644, show_##_name##_gov_pol, store_##_name##_gov_pol) - -#define gov_sys_pol_attr_rw(_name) \ - gov_sys_attr_rw(_name); \ - gov_pol_attr_rw(_name) - -#define gov_sys_pol_attr_ro(_name) \ - gov_sys_attr_ro(_name); \ - gov_pol_attr_ro(_name) - -/* Create show/store routines */ -#define show_one(_gov, file_name) \ -static ssize_t show_##file_name##_gov_sys \ -(struct kobject *kobj, struct attribute *attr, char *buf) \ -{ \ - struct _gov##_dbs_tuners *tuners = _gov##_dbs_cdata.gdbs_data->tuners; \ - return sprintf(buf, "%u\n", tuners->file_name); \ -} \ - \ -static ssize_t show_##file_name##_gov_pol \ -(struct cpufreq_policy *policy, char *buf) \ -{ \ - struct dbs_data *dbs_data = policy->governor_data; \ - struct _gov##_dbs_tuners *tuners = dbs_data->tuners; \ - return sprintf(buf, "%u\n", tuners->file_name); \ -} - -#define store_one(_gov, file_name) \ -static ssize_t store_##file_name##_gov_sys \ -(struct kobject *kobj, struct attribute *attr, const char *buf, size_t count) \ -{ \ - struct dbs_data *dbs_data = _gov##_dbs_cdata.gdbs_data; \ - return store_##file_name(dbs_data, buf, count); \ -} \ - \ -static ssize_t store_##file_name##_gov_pol \ -(struct cpufreq_policy *policy, const char *buf, size_t count) \ -{ \ - struct dbs_data *dbs_data = policy->governor_data; \ - 
return store_##file_name(dbs_data, buf, count); \ -} - -#define show_store_one(_gov, file_name) \ -show_one(_gov, file_name); \ -store_one(_gov, file_name) - /* Governor's specific attributes */ struct dbs_data; struct governor_attr {
Earlier, when struct freq_attr was used to represent governor attributes, the standard cpufreq show/store sysfs attribute callbacks were applied to the governor tunable attributes, and those callbacks always acquire policy->rwsem before carrying out the operation. That could result in an ABBA deadlock if governor tunable attributes were removed under policy->rwsem while one of them was being accessed concurrently (if sysfs attribute removal wins the race, it will wait for the access to complete with policy->rwsem held while the attribute callback blocks on policy->rwsem indefinitely).
We attempted to address this issue by dropping policy->rwsem around governor tunable attributes removal (that is, around invocations of the ->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT) in cpufreq_set_policy(), but that opened up race conditions that had not been possible with policy->rwsem held all the time.
The previous commit, "cpufreq: governor: New sysfs show/store callbacks for governor tunables", fixed the original ABBA deadlock by adding new governor specific show/store callbacks.
We no longer have to drop policy->rwsem around invocations of the CPUFREQ_GOV_POLICY_EXIT governor event, so the original fix can be reverted now.
Fixes: 955ef4833574 ("cpufreq: Drop rwsem lock around CPUFREQ_GOV_POLICY_EXIT") Reported-by: Juri Lelli juri.lelli@arm.com Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- drivers/cpufreq/cpufreq.c | 5 ----- include/linux/cpufreq.h | 4 ---- 2 files changed, 9 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index e979ec78b695..5f7e24567e0e 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -2155,10 +2155,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy, return ret; }
- up_write(&policy->rwsem); ret = __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT); - down_write(&policy->rwsem); - if (ret) { pr_err("%s: Failed to Exit Governor: %s (%d)\n", __func__, old_gov->name, ret); @@ -2174,9 +2171,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy, if (!ret) goto out;
- up_write(&policy->rwsem); __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT); - down_write(&policy->rwsem); }
/* new governor failed, so re-start old one */ diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h index 88a4215125bc..79b87cebaa9c 100644 --- a/include/linux/cpufreq.h +++ b/include/linux/cpufreq.h @@ -100,10 +100,6 @@ struct cpufreq_policy { * - Any routine that will write to the policy structure and/or may take away * the policy altogether (eg. CPU hotplug), will hold this lock in write * mode before doing so. - * - * Additional rules: - * - Lock should not be held across - * __cpufreq_governor(data, CPUFREQ_GOV_POLICY_EXIT); */ struct rw_semaphore rwsem;
The offline routine was separated into two halves earlier by commit 1aee40ac9c86 ("cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock").
The reason cited was a race between accessing the policy's sysfs files and the policy kobject's cleanup.
That race is no longer possible, as we don't remove the policy and its kobject completely on hotplug anymore, but do that from the ->remove() callback of the subsys framework.
These two routines can be merged back now.
This is a preparatory step for the next patch, which will hold the policy->rwsem lock across the __cpufreq_governor() STOP/EXIT sequence.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- drivers/cpufreq/cpufreq.c | 36 ++++++++++-------------------------- 1 file changed, 10 insertions(+), 26 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 5f7e24567e0e..dc43294e7b31 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -1309,9 +1309,10 @@ static int cpufreq_add_dev(struct device *dev, struct subsys_interface *sif) return ret; }
-static void cpufreq_offline_prepare(unsigned int cpu) +static void cpufreq_offline(unsigned int cpu) { struct cpufreq_policy *policy; + int ret;
pr_debug("%s: unregistering CPU %u\n", __func__, cpu);
@@ -1322,7 +1323,7 @@ static void cpufreq_offline_prepare(unsigned int cpu) }
if (has_target()) { - int ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP); + ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP); if (ret) pr_err("%s: Failed to stop governor\n", __func__); } @@ -1345,34 +1346,23 @@ static void cpufreq_offline_prepare(unsigned int cpu) /* Start governor again for active policy */ if (!policy_is_inactive(policy)) { if (has_target()) { - int ret = __cpufreq_governor(policy, CPUFREQ_GOV_START); + ret = __cpufreq_governor(policy, CPUFREQ_GOV_START); if (!ret) ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
if (ret) pr_err("%s: Failed to start governor\n", __func__); } - } else if (cpufreq_driver->stop_cpu) { - cpufreq_driver->stop_cpu(policy); - } -}
-static void cpufreq_offline_finish(unsigned int cpu) -{ - struct cpufreq_policy *policy = per_cpu(cpufreq_cpu_data, cpu); - - if (!policy) { - pr_debug("%s: No cpu_data found\n", __func__); return; }
- /* Only proceed for inactive policies */ - if (!policy_is_inactive(policy)) - return; + if (cpufreq_driver->stop_cpu) + cpufreq_driver->stop_cpu(policy);
/* If cpu is last user of policy, free policy */ if (has_target()) { - int ret = __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT); + ret = __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT); if (ret) pr_err("%s: Failed to exit governor\n", __func__); } @@ -1401,10 +1391,8 @@ static void cpufreq_remove_dev(struct device *dev, struct subsys_interface *sif) if (!policy) return;
- if (cpu_online(cpu)) { - cpufreq_offline_prepare(cpu); - cpufreq_offline_finish(cpu); - } + if (cpu_online(cpu)) + cpufreq_offline(cpu);
cpumask_clear_cpu(cpu, policy->real_cpus); remove_cpu_dev_symlink(policy, cpu); @@ -2255,11 +2243,7 @@ static int cpufreq_cpu_callback(struct notifier_block *nfb, break;
case CPU_DOWN_PREPARE: - cpufreq_offline_prepare(cpu); - break; - - case CPU_POST_DEAD: - cpufreq_offline_finish(cpu); + cpufreq_offline(cpu); break;
case CPU_DOWN_FAILED:
On 02/03/2016 06:02 AM, Viresh Kumar wrote:
The offline routine was separated into two halves earlier by commit 1aee40ac9c86 ("cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock").
And the reasons cited were, race issues between accessing policy's sysfs files and policy kobject's cleanup.
That race isn't valid anymore, as we don't remove the policy & its kobject completely on hotplugs, but do that from ->remove() callback of subsys framework.
These two routines can be merged back now.
This is a preparatory step for the next patch, that will enforce policy->rwsem lock around __cpufreq_governor() routines STOP/EXIT sequence.
Is this stale text? Seems like this is now done in the *previous* patch?
-Saravana
On 03-02-16, 12:21, Saravana Kannan wrote:
On 02/03/2016 06:02 AM, Viresh Kumar wrote:
The offline routine was separated into two halves earlier by commit 1aee40ac9c86 ("cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock").
And the reasons cited were, race issues between accessing policy's sysfs files and policy kobject's cleanup.
That race isn't valid anymore, as we don't remove the policy & its kobject completely on hotplugs, but do that from ->remove() callback of subsys framework.
These two routines can be merged back now.
This is a preparatory step for the next patch, that will enforce policy->rwsem lock around __cpufreq_governor() routines STOP/EXIT sequence.
Is this stale text? Seems like this is now done in the *previous* patch?
No, the previous patch has fixed a single location only. The next patch tried to fix others as well.
The policy->rwsem lock is supposed to be held around calls to __cpufreq_governor(), but this isn't followed consistently by the core code: some paths hold it, whereas others don't.
Enforcing it will also enable us to remove cpufreq_governor_lock, which is used today because we can't otherwise guarantee that __cpufreq_governor() isn't executed in parallel.
We should also ensure that the lock is held across state changes to the governors.
For example, while adding a CPU to the policy on cpu-online path, we need to stop the governor, change policy->cpus, start the governor and then refresh its limits. The complete sequence must be guaranteed to execute without any concurrent races. And that can be achieved using policy->rwsem around these use cases.
Also note that cpufreq_driver->stop_cpu() and ->exit() can now get called with policy->rwsem held. That shouldn't have any side effects, though.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- drivers/cpufreq/cpufreq.c | 45 ++++++++++++++++++++++++++++++--------------- 1 file changed, 30 insertions(+), 15 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index dc43294e7b31..4fc3889ca7c9 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -996,30 +996,29 @@ static int cpufreq_add_policy_cpu(struct cpufreq_policy *policy, unsigned int cp if (cpumask_test_cpu(cpu, policy->cpus)) return 0;
+ down_write(&policy->rwsem); if (has_target()) { ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP); if (ret) { pr_err("%s: Failed to stop governor\n", __func__); - return ret; + goto unlock; } }
- down_write(&policy->rwsem); cpumask_set_cpu(cpu, policy->cpus); - up_write(&policy->rwsem);
if (has_target()) { ret = __cpufreq_governor(policy, CPUFREQ_GOV_START); if (!ret) ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
- if (ret) { + if (ret) pr_err("%s: Failed to start governor\n", __func__); - return ret; - } }
- return 0; +unlock: + up_write(&policy->rwsem); + return ret; }
static struct cpufreq_policy *cpufreq_policy_alloc(unsigned int cpu) @@ -1322,13 +1321,13 @@ static void cpufreq_offline(unsigned int cpu) return; }
+ down_write(&policy->rwsem); if (has_target()) { ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP); if (ret) pr_err("%s: Failed to stop governor\n", __func__); }
- down_write(&policy->rwsem); cpumask_clear_cpu(cpu, policy->cpus);
if (policy_is_inactive(policy)) { @@ -1341,7 +1340,6 @@ static void cpufreq_offline(unsigned int cpu) /* Nominate new CPU */ policy->cpu = cpumask_any(policy->cpus); } - up_write(&policy->rwsem);
/* Start governor again for active policy */ if (!policy_is_inactive(policy)) { @@ -1376,6 +1374,7 @@ static void cpufreq_offline(unsigned int cpu) cpufreq_driver->exit(policy); policy->freq_table = NULL; } + up_write(&policy->rwsem); }
/** @@ -1572,6 +1571,7 @@ EXPORT_SYMBOL(cpufreq_generic_suspend); void cpufreq_suspend(void) { struct cpufreq_policy *policy; + int ret;
if (!cpufreq_driver) return; @@ -1582,7 +1582,11 @@ void cpufreq_suspend(void) pr_debug("%s: Suspending Governors\n", __func__);
for_each_active_policy(policy) { - if (__cpufreq_governor(policy, CPUFREQ_GOV_STOP)) + down_write(&policy->rwsem); + ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP); + up_write(&policy->rwsem); + + if (ret) pr_err("%s: Failed to stop governor for policy: %p\n", __func__, policy); else if (cpufreq_driver->suspend @@ -1604,6 +1608,7 @@ void cpufreq_suspend(void) void cpufreq_resume(void) { struct cpufreq_policy *policy; + int ret;
if (!cpufreq_driver) return; @@ -1616,13 +1621,20 @@ void cpufreq_resume(void) pr_debug("%s: Resuming Governors\n", __func__);
for_each_active_policy(policy) { - if (cpufreq_driver->resume && cpufreq_driver->resume(policy)) + if (cpufreq_driver->resume && cpufreq_driver->resume(policy)) { pr_err("%s: Failed to resume driver: %p\n", __func__, policy); - else if (__cpufreq_governor(policy, CPUFREQ_GOV_START) - || __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS)) - pr_err("%s: Failed to start governor for policy: %p\n", - __func__, policy); + } else { + down_write(&policy->rwsem); + ret = __cpufreq_governor(policy, CPUFREQ_GOV_START); + if (!ret) + __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS); + up_write(&policy->rwsem); + + if (ret) + pr_err("%s: Failed to start governor for policy: %p\n", + __func__, policy); + } }
/* @@ -2276,8 +2288,11 @@ static int cpufreq_boost_set_sw(int state) __func__); break; } + + down_write(&policy->rwsem); policy->user_policy.max = policy->max; __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS); + up_write(&policy->rwsem); } }
In some cases we used to drop policy->rwsem just before calling __cpufreq_governor(), so it was possible for __cpufreq_governor() to run concurrently from separate threads.
In order to guarantee valid state transitions for governors, 'governor_enabled' had to be protected by some locking, and cpufreq_governor_lock was created for that.
But now, __cpufreq_governor() is always called with policy->rwsem held, so 'governor_enabled' is protected against races even without cpufreq_governor_lock.
Get rid of the extra lock now.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org --- drivers/cpufreq/cpufreq.c | 7 ------- 1 file changed, 7 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 4fc3889ca7c9..7bc8a5ed97e5 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -102,7 +102,6 @@ static LIST_HEAD(cpufreq_governor_list); static struct cpufreq_driver *cpufreq_driver; static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data); static DEFINE_RWLOCK(cpufreq_driver_lock); -DEFINE_MUTEX(cpufreq_governor_lock);
/* Flag to suspend/resume CPUFreq governors */ static bool cpufreq_suspended; @@ -1963,11 +1962,9 @@ static int __cpufreq_governor(struct cpufreq_policy *policy,
pr_debug("%s: for CPU %u, event %u\n", __func__, policy->cpu, event);
- mutex_lock(&cpufreq_governor_lock); if ((policy->governor_enabled && event == CPUFREQ_GOV_START) || (!policy->governor_enabled && (event == CPUFREQ_GOV_LIMITS || event == CPUFREQ_GOV_STOP))) { - mutex_unlock(&cpufreq_governor_lock); return -EBUSY; }
@@ -1976,8 +1973,6 @@ static int __cpufreq_governor(struct cpufreq_policy *policy, else if (event == CPUFREQ_GOV_START) policy->governor_enabled = true;
- mutex_unlock(&cpufreq_governor_lock); - ret = policy->governor->governor(policy, event);
if (!ret) { @@ -1987,12 +1982,10 @@ static int __cpufreq_governor(struct cpufreq_policy *policy, policy->governor->initialized--; } else { /* Restore original values */ - mutex_lock(&cpufreq_governor_lock); if (event == CPUFREQ_GOV_STOP) policy->governor_enabled = true; else if (event == CPUFREQ_GOV_START) policy->governor_enabled = false; - mutex_unlock(&cpufreq_governor_lock); }
if (((event == CPUFREQ_GOV_POLICY_INIT) && ret) ||
On 03-02-16, 19:32, Viresh Kumar wrote:
We used to drop policy->rwsem just before calling __cpufreq_governor() in some cases earlier and so it was possible that __cpufreq_governor() runs concurrently via separate threads.
In order to guarantee valid state transitions for governors, 'governor_enabled' was required to be protected using some locking and we created cpufreq_governor_lock for that.
But now, __cpufreq_governor() is always called from within policy->rwsem held and so 'governor_enabled' is protected against races even without cpufreq_governor_lock.
Get rid of the extra lock now.
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
drivers/cpufreq/cpufreq.c | 7 ------- 1 file changed, 7 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c index 4fc3889ca7c9..7bc8a5ed97e5 100644 --- a/drivers/cpufreq/cpufreq.c +++ b/drivers/cpufreq/cpufreq.c @@ -102,7 +102,6 @@ static LIST_HEAD(cpufreq_governor_list); static struct cpufreq_driver *cpufreq_driver; static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data); static DEFINE_RWLOCK(cpufreq_driver_lock); -DEFINE_MUTEX(cpufreq_governor_lock); /* Flag to suspend/resume CPUFreq governors */ static bool cpufreq_suspended; @@ -1963,11 +1962,9 @@ static int __cpufreq_governor(struct cpufreq_policy *policy, pr_debug("%s: for CPU %u, event %u\n", __func__, policy->cpu, event);
- mutex_lock(&cpufreq_governor_lock); if ((policy->governor_enabled && event == CPUFREQ_GOV_START) || (!policy->governor_enabled && (event == CPUFREQ_GOV_LIMITS || event == CPUFREQ_GOV_STOP))) {
- mutex_unlock(&cpufreq_governor_lock); return -EBUSY; }
@@ -1976,8 +1973,6 @@ static int __cpufreq_governor(struct cpufreq_policy *policy, else if (event == CPUFREQ_GOV_START) policy->governor_enabled = true;
- mutex_unlock(&cpufreq_governor_lock); -
ret = policy->governor->governor(policy, event);
if (!ret) { @@ -1987,12 +1982,10 @@ static int __cpufreq_governor(struct cpufreq_policy *policy, policy->governor->initialized--; } else { /* Restore original values */
- mutex_lock(&cpufreq_governor_lock); if (event == CPUFREQ_GOV_STOP) policy->governor_enabled = true; else if (event == CPUFREQ_GOV_START) policy->governor_enabled = false;
- mutex_unlock(&cpufreq_governor_lock); }
if (((event == CPUFREQ_GOV_POLICY_INIT) && ret) ||
+ minor cleanup:
diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h index ed328a39c4ac..7bed63e14e7d 100644 --- a/drivers/cpufreq/cpufreq_governor.h +++ b/drivers/cpufreq/cpufreq_governor.h @@ -214,7 +214,6 @@ static inline int delay_for_sampling_rate(unsigned int sampling_rate) return delay; }
-extern struct mutex cpufreq_governor_lock; extern const struct sysfs_ops governor_sysfs_ops;
void gov_add_timers(struct cpufreq_policy *policy, unsigned int delay);
Hi Viresh,
On 03/02/16 19:32, Viresh Kumar wrote:
Hi Rafael,
Here is the V2 with updated patches as suggested by you guys.
These are pushed here: git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git cpufreq/governor-kobject
The first four patches are for 4.5, if possible and others you can keep for 4.6.
V1->V2:
- Improved changelogs, thanks Rafael.
- Added new dbs_data->mutex to avoid concurrent updates to tunables.
- Moved kobj_type to common_dbs_data.
- Updated macros to static inline routines
- s/show/governor_show
- s/store/governor_store
- Improved comments
@Juri: More testing requested :)
Ouch, I've just got this executing -f basic on Juno. :( It happens with the hotplug_1_by_1 test.
[ 1086.531252] IRQ1 no longer affine to CPU1
[ 1086.531495] CPU1: shutdown
[ 1086.538199] psci: CPU1 killed.
[ 1086.583396]
[ 1086.584881] ======================================================
[ 1086.590999] [ INFO: possible circular locking dependency detected ]
[ 1086.597205] 4.5.0-rc2+ #37 Not tainted
[ 1086.600914] -------------------------------------------------------
[ 1086.607118] runme.sh/1052 is trying to acquire lock:
[ 1086.612031]  (sb_writers#7){.+.+.+}, at: [<ffffffc000249500>] __sb_start_write+0xcc/0xe0
[ 1086.620090]
[ 1086.620090] but task is already holding lock:
[ 1086.625865]  (&policy->rwsem){+++++.}, at: [<ffffffc0005c8ee4>] cpufreq_offline+0x7c/0x278
[ 1086.634081]
[ 1086.634081] which lock already depends on the new lock.
[ 1086.634081]
[ 1086.642180]
[ 1086.642180] the existing dependency chain (in reverse order) is:
[ 1086.649589] -> #1 (&policy->rwsem){+++++.}:
[ 1086.653929]        [<ffffffc00011d9a4>] check_prev_add+0x670/0x754
[ 1086.660060]        [<ffffffc00011e1ac>] validate_chain.isra.36+0x724/0xa0c
[ 1086.666876]        [<ffffffc00011f904>] __lock_acquire+0x4e4/0xba0
[ 1086.673001]        [<ffffffc000120b58>] lock_release+0x244/0x570
[ 1086.678955]        [<ffffffc0007351d0>] __mutex_unlock_slowpath+0xa0/0x18c
[ 1086.685771]        [<ffffffc0007352dc>] mutex_unlock+0x20/0x2c
[ 1086.691553]        [<ffffffc0002ccd24>] kernfs_fop_write+0xb0/0x194
[ 1086.697768]        [<ffffffc00024478c>] __vfs_write+0x48/0x104
[ 1086.703550]        [<ffffffc0002457a4>] vfs_write+0x98/0x198
[ 1086.709161]        [<ffffffc0002465e4>] SyS_write+0x54/0xb0
[ 1086.714684]        [<ffffffc000085d30>] el0_svc_naked+0x24/0x28
[ 1086.720555] -> #0 (sb_writers#7){.+.+.+}:
[ 1086.724730]        [<ffffffc00011c574>] print_circular_bug+0x80/0x2e4
[ 1086.731116]        [<ffffffc00011d470>] check_prev_add+0x13c/0x754
[ 1086.737243]        [<ffffffc00011e1ac>] validate_chain.isra.36+0x724/0xa0c
[ 1086.744059]        [<ffffffc00011f904>] __lock_acquire+0x4e4/0xba0
[ 1086.750184]        [<ffffffc0001207f4>] lock_acquire+0xe4/0x204
[ 1086.756052]        [<ffffffc000118da0>] percpu_down_read+0x50/0xe4
[ 1086.762180]        [<ffffffc000249500>] __sb_start_write+0xcc/0xe0
[ 1086.768306]        [<ffffffc00026ae90>] mnt_want_write+0x28/0x54
[ 1086.774263]        [<ffffffc0002555f8>] do_last+0x660/0xcb8
[ 1086.779788]        [<ffffffc000255cdc>] path_openat+0x8c/0x2b0
[ 1086.785570]        [<ffffffc000256fbc>] do_filp_open+0x78/0xf0
[ 1086.791353]        [<ffffffc000244058>] do_sys_open+0x150/0x214
[ 1086.797222]        [<ffffffc0002441a0>] SyS_openat+0x3c/0x48
[ 1086.802831]        [<ffffffc000085d30>] el0_svc_naked+0x24/0x28
[ 1086.808700]
[ 1086.808700] other info that might help us debug this:
[ 1086.808700]
[ 1086.816627]  Possible unsafe locking scenario:
[ 1086.816627]
[ 1086.822488]        CPU0                    CPU1
[ 1086.826971]        ----                    ----
[ 1086.831453]   lock(&policy->rwsem);
[ 1086.834918]                                lock(sb_writers#7);
[ 1086.840713]                                lock(&policy->rwsem);
[ 1086.846671]   lock(sb_writers#7);
[ 1086.849972]
[ 1086.849972]  *** DEADLOCK ***
[ 1086.849972]
[ 1086.855836] 1 lock held by runme.sh/1052:
[ 1086.859802]  #0:  (&policy->rwsem){+++++.}, at: [<ffffffc0005c8ee4>] cpufreq_offline+0x7c/0x278
[ 1086.868453]
[ 1086.868453] stack backtrace:
[ 1086.872769] CPU: 5 PID: 1052 Comm: runme.sh Not tainted 4.5.0-rc2+ #37
[ 1086.879229] Hardware name: ARM Juno development board (r2) (DT)
[ 1086.885089] Call trace:
[ 1086.887511] [<ffffffc00008a788>] dump_backtrace+0x0/0x1f4
[ 1086.892858] [<ffffffc00008a99c>] show_stack+0x20/0x28
[ 1086.897861] [<ffffffc00041a380>] dump_stack+0x84/0xc0
[ 1086.902863] [<ffffffc00011c6c8>] print_circular_bug+0x1d4/0x2e4
[ 1086.908725] [<ffffffc00011d470>] check_prev_add+0x13c/0x754
[ 1086.914244] [<ffffffc00011e1ac>] validate_chain.isra.36+0x724/0xa0c
[ 1086.920448] [<ffffffc00011f904>] __lock_acquire+0x4e4/0xba0
[ 1086.925965] [<ffffffc0001207f4>] lock_acquire+0xe4/0x204
[ 1086.931224] [<ffffffc000118da0>] percpu_down_read+0x50/0xe4
[ 1086.936742] [<ffffffc000249500>] __sb_start_write+0xcc/0xe0
[ 1086.942260] [<ffffffc00026ae90>] mnt_want_write+0x28/0x54
[ 1086.947605] [<ffffffc0002555f8>] do_last+0x660/0xcb8
[ 1086.952520] [<ffffffc000255cdc>] path_openat+0x8c/0x2b0
[ 1086.957693] [<ffffffc000256fbc>] do_filp_open+0x78/0xf0
[ 1086.962865] [<ffffffc000244058>] do_sys_open+0x150/0x214
[ 1086.968123] [<ffffffc0002441a0>] SyS_openat+0x3c/0x48
[ 1086.973124] [<ffffffc000085d30>] el0_svc_naked+0x24/0x28
[ 1087.019315] Detected PIPT I-cache on CPU1
[ 1087.019373] CPU1: Booted secondary processor [410fd080]
Best,
- Juri
On 03-02-16, 15:54, Juri Lelli wrote:
Ouch, I've just got this executing -f basic on Juno. :( It happens with the hotplug_1_by_1 test.
[...]
Urg..
I failed to understand it for now though. Please test only the first 4 patches and leave the bottom three. AFAICT, this is caused by the 6th patch.
The first 4 are important for 4.5 and must be tested soonish.
On 03/02/16 21:40, Viresh Kumar wrote:
On 03-02-16, 15:54, Juri Lelli wrote:
Ouch, I've just got this executing -f basic on Juno. :( It happens with the hotplug_1_by_1 test.
[...]
Urg..
I failed to understand it for now though. Please test only the first 4 patches and leave the bottom three. AFAICT, this is caused by the 6th patch.
The first 4 are important for 4.5 and must be tested soonish.
First 4 look ok from a testing viewpoint.
Best,
- Juri
On Wed, Feb 3, 2016 at 6:20 PM, Juri Lelli juri.lelli@arm.com wrote:
On 03/02/16 21:40, Viresh Kumar wrote:
On 03-02-16, 15:54, Juri Lelli wrote:
Ouch, I've just got this executing -f basic on Juno. :( It happens with the hotplug_1_by_1 test.
[...]
Urg..
I failed to understand it for now though. Please test only the first 4 patches and leave the bottom three. AFAICT, this is caused by the 6th patch.
The first 4 are important for 4.5 and must be tested soonish.
First 4 look ok from a testing viewpoint.
Good, thanks for the confirmation!
I'm going to apply them and they will go to Linus next week.
Thanks, Rafael
Hi,
On 02/03/2016 10:50 PM, Rafael J. Wysocki wrote:
On Wed, Feb 3, 2016 at 6:20 PM, Juri Lelli juri.lelli@arm.com wrote:
On 03/02/16 21:40, Viresh Kumar wrote:
On 03-02-16, 15:54, Juri Lelli wrote:
Ouch, I've just got this executing -f basic on Juno. :( It happens with the hotplug_1_by_1 test.
[...]
Urg..
I failed to understand it for now though. Please test only the first 4 patches and leave the bottom three. AFAICT, this is caused by the 6th patch.
The first 4 are important for 4.5 and must be tested soonish.
First 4 look ok from a testing viewpoint.
Good, thanks for the confirmation!
I'm going to apply them and they will go to Linus next week.
Thanks, Rafael
Sorry for the delayed report. But I see the below backtrace on Power8 box. It has 4 chips with 128 cpus.
I see the below trace with the first four patches on running tests from Viresh's testcase. './runme.sh -f basic' hit this trace at 'shuffle_governors_for_all_cpus' test.
[ 906.762045] ======================================================
[ 906.762114] [ INFO: possible circular locking dependency detected ]
[ 906.762172] 4.5.0-rc2-sgb+ #96 Not tainted
[ 906.762207] -------------------------------------------------------
[ 906.762263] runme.sh/2840 is trying to acquire lock:
[ 906.762309]  (s_active#91){++++.+}, at: [<c000000000407db8>] kernfs_remove+0x48/0x70
[ 906.762419] but task is already holding lock:
[ 906.762476]  (od_dbs_cdata.mutex){+.+.+.}, at: [<c000000000ad7594>] cpufreq_governor_dbs+0x64/0x7e0
[ 906.762592] which lock already depends on the new lock.

[ 906.762659] the existing dependency chain (in reverse order) is:
[ 906.762727] -> #2 (od_dbs_cdata.mutex){+.+.+.}:
[ 906.762807]        [<c000000000d485b0>] mutex_lock_nested+0x90/0x590
[ 906.762877]        [<c000000000ad57f8>] update_sampling_rate+0x88/0x1c0
[ 906.762946]        [<c000000000ad5990>] store_sampling_rate+0x60/0xa0
[ 906.763013]        [<c000000000ad6af0>] governor_store+0x80/0xc0
[ 906.763070]        [<c00000000040a8a4>] sysfs_kf_write+0x94/0xc0
[ 906.763128]        [<c0000000004094a8>] kernfs_fop_write+0x188/0x1f0
[ 906.763196]        [<c000000000347b8c>] __vfs_write+0x6c/0x180
[ 906.763254]        [<c0000000003490a0>] vfs_write+0xc0/0x200
[ 906.763311]        [<c00000000034a3cc>] SyS_write+0x6c/0x110
[ 906.763369]        [<c00000000000926c>] system_call+0x38/0xd0
[ 906.763427] -> #1 (&dbs_data->mutex){+.+...}:
[ 906.763495]        [<c000000000d485b0>] mutex_lock_nested+0x90/0x590
[ 906.763563]        [<c000000000ad6ac0>] governor_store+0x50/0xc0
[ 906.763620]        [<c00000000040a8a4>] sysfs_kf_write+0x94/0xc0
[ 906.763677]        [<c0000000004094a8>] kernfs_fop_write+0x188/0x1f0
[ 906.763745]        [<c000000000347b8c>] __vfs_write+0x6c/0x180
[ 906.763801]        [<c0000000003490a0>] vfs_write+0xc0/0x200
[ 906.763859]        [<c00000000034a3cc>] SyS_write+0x6c/0x110
[ 906.763916]        [<c00000000000926c>] system_call+0x38/0xd0
[ 906.763973] -> #0 (s_active#91){++++.+}:
[ 906.764052]        [<c00000000015f318>] lock_acquire+0xd8/0x1a0
[ 906.764111]        [<c0000000004065f4>] __kernfs_remove+0x344/0x410
[ 906.764179]        [<c000000000407db8>] kernfs_remove+0x48/0x70
[ 906.764236]        [<c00000000040b868>] sysfs_remove_dir+0x78/0xd0
[ 906.764304]        [<c0000000005eccec>] kobject_del+0x2c/0x80
[ 906.764362]        [<c0000000005ec9e8>] kobject_release+0xa8/0x250
[ 906.764430]        [<c000000000ad7c28>] cpufreq_governor_dbs+0x6f8/0x7e0
[ 906.764497]        [<c000000000ad4bdc>] od_cpufreq_governor_dbs+0x3c/0x60
[ 906.764567]        [<c000000000acf830>] __cpufreq_governor+0x1d0/0x390
[ 906.764634]        [<c000000000ad0750>] cpufreq_set_policy+0x3b0/0x450
[ 906.764703]        [<c000000000ad12cc>] store_scaling_governor+0x8c/0xf0
[ 906.764771]        [<c000000000aced34>] store+0xb4/0x110
[ 906.764828]        [<c00000000040a8a4>] sysfs_kf_write+0x94/0xc0
[ 906.764885]        [<c0000000004094a8>] kernfs_fop_write+0x188/0x1f0
[ 906.764952]        [<c000000000347b8c>] __vfs_write+0x6c/0x180
[ 906.765048]        [<c0000000003490a0>] vfs_write+0xc0/0x200
[ 906.765160]        [<c00000000034a3cc>] SyS_write+0x6c/0x110
[ 906.765272]        [<c00000000000926c>] system_call+0x38/0xd0
[ 906.765384] other info that might help us debug this:

[ 906.765522] Chain exists of:
  s_active#91 --> &dbs_data->mutex --> od_dbs_cdata.mutex

[ 906.765768]  Possible unsafe locking scenario:

[ 906.765880]        CPU0                    CPU1
[ 906.765969]        ----                    ----
[ 906.766058]   lock(od_dbs_cdata.mutex);
[ 906.766170]                                lock(&dbs_data->mutex);
[ 906.766304]                                lock(od_dbs_cdata.mutex);
[ 906.766461]   lock(s_active#91);
[ 906.766572]  *** DEADLOCK ***

[ 906.766686] 6 locks held by runme.sh/2840:
[ 906.766756]  #0:  (sb_writers#6){.+.+.+}, at: [<c00000000034cf10>] __sb_start_write+0x120/0x150
[ 906.767002]  #1:  (&of->mutex){+.+.+.}, at: [<c00000000040939c>] kernfs_fop_write+0x7c/0x1f0
[ 906.767225]  #2:  (s_active#82){.+.+.+}, at: [<c0000000004093a8>] kernfs_fop_write+0x88/0x1f0
[ 906.767471]  #3:  (cpu_hotplug.lock){++++++}, at: [<c0000000000e06d8>] get_online_cpus+0x48/0xc0
[ 906.767676]  #4:  (&policy->rwsem){+++++.}, at: [<c000000000aced04>] store+0x84/0x110
[ 906.767878]  #5:  (od_dbs_cdata.mutex){+.+.+.}, at: [<c000000000ad7594>] cpufreq_governor_dbs+0x64/0x7e0
[ 906.768124] stack backtrace:
[ 906.768215] CPU: 0 PID: 2840 Comm: runme.sh Not tainted 4.5.0-rc2-sgb+ #96
[ 906.768329] Call Trace:
[ 906.768375] [c000007fe3126ec0] [c000000000d56530] dump_stack+0x90/0xbc (unreliable)
[ 906.768536] [c000007fe3126ef0] [c00000000015884c] print_circular_bug+0x28c/0x3e0
[ 906.768696] [c000007fe3126f90] [c00000000015ed88] __lock_acquire+0x2278/0x22d0
[ 906.768853] [c000007fe3127120] [c00000000015f318] lock_acquire+0xd8/0x1a0
[ 906.768987] [c000007fe31271e0] [c0000000004065f4] __kernfs_remove+0x344/0x410
[ 906.769121] [c000007fe31272e0] [c000000000407db8] kernfs_remove+0x48/0x70
[ 906.769256] [c000007fe3127310] [c00000000040b868] sysfs_remove_dir+0x78/0xd0
[ 906.769394] [c000007fe3127350] [c0000000005eccec] kobject_del+0x2c/0x80
[ 906.769528] [c000007fe3127380] [c0000000005ec9e8] kobject_release+0xa8/0x250
[ 906.769607] [c000007fe3127410] [c000000000ad7c28] cpufreq_governor_dbs+0x6f8/0x7e0
[ 906.769687] [c000007fe31274c0] [c000000000ad4bdc] od_cpufreq_governor_dbs+0x3c/0x60
[ 906.769766] [c000007fe3127500] [c000000000acf830] __cpufreq_governor+0x1d0/0x390
[ 906.769845] [c000007fe3127580] [c000000000ad0750] cpufreq_set_policy+0x3b0/0x450
[ 906.769924] [c000007fe3127610] [c000000000ad12cc] store_scaling_governor+0x8c/0xf0
[ 906.770003] [c000007fe3127c10] [c000000000aced34] store+0xb4/0x110
[ 906.770071] [c000007fe3127c60] [c00000000040a8a4] sysfs_kf_write+0x94/0xc0
[ 906.770139] [c000007fe3127ca0] [c0000000004094a8] kernfs_fop_write+0x188/0x1f0
[ 906.770221] [c000007fe3127cf0] [c000000000347b8c] __vfs_write+0x6c/0x180
[ 906.770290] [c000007fe3127d90] [c0000000003490a0] vfs_write+0xc0/0x200
[ 906.770358] [c000007fe3127de0] [c00000000034a3cc] SyS_write+0x6c/0x110
[ 906.770426] [c000007fe3127e30] [c00000000000926c] system_call+0x38/0xd0
Thanks and Regards, Shilpa
On Thu, Feb 4, 2016 at 12:31 AM, Shilpa Bhat shilpabhatppc@gmail.com wrote:
Hi,
On 02/03/2016 10:50 PM, Rafael J. Wysocki wrote:
On Wed, Feb 3, 2016 at 6:20 PM, Juri Lelli juri.lelli@arm.com wrote:
On 03/02/16 21:40, Viresh Kumar wrote:
On 03-02-16, 15:54, Juri Lelli wrote:
Ouch, I've just got this executing -f basic on Juno. :( It happens with the hotplug_1_by_1 test.
[...]
Urg..
I failed to understand it for now though. Please test only the first 4 patches and leave the bottom three. AFAICT, this is caused by the 6th patch.
The first 4 are important for 4.5 and must be tested soonish.
First 4 look ok from a testing viewpoint.
Good, thanks for the confirmation!
I'm going to apply them and they will go to Linus next week.
Thanks, Rafael
Sorry for the delayed report. But I see the below backtrace on Power8 box. It has 4 chips with 128 cpus.
Thanks for the report.
I see the below trace with the first four patches on running tests from Viresh's testcase. './runme.sh -f basic' hit this trace at 'shuffle_governors_for_all_cpus' test.
[...]
[ 906.765522] Chain exists of:
  s_active#91 --> &dbs_data->mutex --> od_dbs_cdata.mutex

[ 906.765768]  Possible unsafe locking scenario:

[ 906.765880]        CPU0                    CPU1
[ 906.765969]        ----                    ----
[ 906.766058]   lock(od_dbs_cdata.mutex);
[ 906.766170]                                lock(&dbs_data->mutex);
[ 906.766304]                                lock(od_dbs_cdata.mutex);
[ 906.766461]   lock(s_active#91);
[ 906.766572]  *** DEADLOCK ***
This is exactly right. We've avoided one deadlock only to trip into another one.
This happens because update_sampling_rate() acquires od_dbs_cdata.mutex which is held around cpufreq_governor_exit() by cpufreq_governor_dbs().
Worse yet, a deadlock can still happen without (the new) dbs_data->mutex, just between s_active and od_dbs_cdata.mutex if update_sampling_rate() runs in parallel with cpufreq_governor_dbs()->cpufreq_governor_exit() and the latter wins the race.
It looks like we need to drop the governor mutex before putting the kobject in cpufreq_governor_exit().
Thanks, Rafael
On 04-02-16, 00:50, Rafael J. Wysocki wrote:
On Thu, Feb 4, 2016 at 12:31 AM, Shilpa Bhat shilpabhatppc@gmail.com wrote:
Sorry for the delayed report. But I see the below backtrace on Power8 box. It has 4 chips with 128 cpus.
Honestly, I wasn't expecting you to test this stuff, but I really appreciate you doing that.
Thanks a lot ..
[ 906.765768] Possible unsafe locking scenario:
[ 906.765880] CPU0 CPU1 [ 906.765969] ---- ----
This race scenario is perhaps incomplete and difficult to understand without the lines below:

Governor's EXIT                  Update sampling rate from sysfs

                                 lock(s_active#91);
lock(od_dbs_cdata.mutex);
                                 lock(&dbs_data->mutex);
                                 lock(od_dbs_cdata.mutex);
lock(s_active#91);

*** DEADLOCK ***
This is exactly right. We've avoided one deadlock only to trip into another one.
As we discussed on IRC, we haven't introduced this deadlock with the current series. But this is what Juri has reported some days back, while he tested linus/master on TC2.
This happens because update_sampling_rate() acquires od_dbs_cdata.mutex which is held around cpufreq_governor_exit() by cpufreq_governor_dbs().
Worse yet, a deadlock can still happen without (the new) dbs_data->mutex, just between s_active and od_dbs_cdata.mutex if update_sampling_rate() runs in parallel with cpufreq_governor_dbs()->cpufreq_governor_exit() and the latter wins the race.
It looks like we need to drop the governor mutex before putting the kobject in cpufreq_governor_exit().
That wouldn't be trivial to implement as we discussed.
Okay, here is a proposal for the current series and the series you have posted, Rafael:
- Firstly, I would like to clarify that I don't have any issues with rebasing on top of your series, it should be easy enough.
- One thing is for sure: nothing from these 3 series is getting merged in 4.5, as we aren't fixing the real issue Shilpa/Juri have reported.
- I think the first 4 patches here are just fine and don't need any updates. They actually do the right thing and make the code so much cleaner.
- So, can we apply the first 4 patches (which you have already applied to bleeding-edge) now and do all work on top of that?
Again, I can rebase if you merge your patches first, no issues at all :)
On 04-02-16, 00:50, Rafael J. Wysocki wrote:
This is exactly right. We've avoided one deadlock only to trip into another one.
This happens because update_sampling_rate() acquires od_dbs_cdata.mutex which is held around cpufreq_governor_exit() by cpufreq_governor_dbs().
Worse yet, a deadlock can still happen without (the new) dbs_data->mutex, just between s_active and od_dbs_cdata.mutex if update_sampling_rate() runs in parallel with cpufreq_governor_dbs()->cpufreq_governor_exit() and the latter wins the race.
It looks like we need to drop the governor mutex before putting the kobject in cpufreq_governor_exit().
I have tried to explore all possible ways of fixing this, and every other way looked to be racy in some way.
Does anyone else have a better idea (untested):
-------------------------8<-------------------------
Subject: [PATCH] cpufreq: ondemand: Shoot update_sampling_rate with a separate work
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq_governor.h |  2 ++
 drivers/cpufreq/cpufreq_ondemand.c | 39 +++++++++++++++++++++++++++++---------
 2 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_governor.h b/drivers/cpufreq/cpufreq_governor.h
index 7bed63e14e7d..97e604356b20 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -141,6 +141,8 @@ struct od_dbs_tuners {
 	unsigned int powersave_bias;
 	unsigned int io_is_busy;
 	unsigned int min_sampling_rate;
+	struct work_struct work;
+	struct dbs_data *dbs_data;
 };
 
 struct cs_dbs_tuners {
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 82ed490f7de0..93ad7a226aee 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -242,20 +242,27 @@ static struct common_dbs_data od_dbs_cdata;
  * reducing the sampling rate, we need to make the new value effective
  * immediately.
  */
-static void update_sampling_rate(struct dbs_data *dbs_data,
-		unsigned int new_rate)
+static void update_sampling_rate(struct work_struct *work)
 {
-	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
+	struct od_dbs_tuners *od_tuners = container_of(work, struct
+						       od_dbs_tuners, work);
+	unsigned int new_rate = od_tuners->sampling_rate;
+	struct dbs_data *dbs_data = od_tuners->dbs_data;
 	struct cpumask cpumask;
 	int cpu;
 
-	od_tuners->sampling_rate = new_rate = max(new_rate,
-			od_tuners->min_sampling_rate);
-
 	/*
 	 * Lock governor so that governor start/stop can't execute in parallel.
+	 *
+	 * We can't do a regular mutex_lock() here, as that may deadlock against
+	 * another thread performing CPUFREQ_GOV_POLICY_EXIT event on the
+	 * governor, which might have already taken od_dbs_cdata.mutex and is
+	 * waiting for this work to finish.
 	 */
-	mutex_lock(&od_dbs_cdata.mutex);
+	if (!mutex_trylock(&od_dbs_cdata.mutex)) {
+		queue_work(system_wq, &od_tuners->work);
+		return;
+	}
 
 	cpumask_copy(&cpumask, cpu_online_mask);
 
@@ -311,13 +318,22 @@ static void update_sampling_rate(struct dbs_data *dbs_data,
 static ssize_t store_sampling_rate(struct dbs_data *dbs_data, const char *buf,
 		size_t count)
 {
+	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
 	unsigned int input;
 	int ret;
 	ret = sscanf(buf, "%u", &input);
 	if (ret != 1)
 		return -EINVAL;
 
-	update_sampling_rate(dbs_data, input);
+	od_tuners->sampling_rate = max(input, od_tuners->min_sampling_rate);
+
+	/*
+	 * update_sampling_rate() requires to hold od_dbs_cdata.mutex, but we
+	 * can't take that from this thread, otherwise it results in ABBA
+	 * lockdep between s_active and od_dbs_cdata.mutex locks.
+	 */
+	queue_work(system_wq, &od_tuners->work);
+
 	return count;
 }
 
@@ -501,6 +517,8 @@ static int od_init(struct dbs_data *dbs_data, bool notify)
 	tuners->ignore_nice_load = 0;
 	tuners->powersave_bias = default_powersave_bias;
 	tuners->io_is_busy = should_io_be_busy();
+	INIT_WORK(&tuners->work, update_sampling_rate);
+	tuners->dbs_data = dbs_data;
 
 	dbs_data->tuners = tuners;
 	return 0;
@@ -508,7 +526,10 @@ static int od_init(struct dbs_data *dbs_data, bool notify)
 
 static void od_exit(struct dbs_data *dbs_data, bool notify)
 {
-	kfree(dbs_data->tuners);
+	struct od_dbs_tuners *tuners = dbs_data->tuners;
+
+	cancel_work_sync(&tuners->work);
+	kfree(tuners);
 }
 
 define_get_cpu_dbs_routines(od_cpu_dbs_info);
On 02/04/2016 03:09 AM, Viresh Kumar wrote:
On 04-02-16, 00:50, Rafael J. Wysocki wrote:
This is exactly right. We've avoided one deadlock only to trip into another one.
This happens because update_sampling_rate() acquires od_dbs_cdata.mutex which is held around cpufreq_governor_exit() by cpufreq_governor_dbs().
Worse yet, a deadlock can still happen without (the new) dbs_data->mutex, just between s_active and od_dbs_cdata.mutex if update_sampling_rate() runs in parallel with cpufreq_governor_dbs()->cpufreq_governor_exit() and the latter wins the race.
It looks like we need to drop the governor mutex before putting the kobject in cpufreq_governor_exit().
I have tried to explore all possible ways of fixing this, and every other way looked to be racy in some way.
Does anyone else have a better idea (untested):
-------------------------8<-------------------------
Subject: [PATCH] cpufreq: ondemand: Shoot update_sampling_rate with a separate work
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
[...]
No no no no! Let's not open up this can of worms of queuing up the work to handle a write to a sysfs file. It *MIGHT* work for this specific tunable (I haven't bothered to analyze), but this makes it impossible to return a useful/proper error value.
-Saravana
On 02/04/2016 09:43 AM, Saravana Kannan wrote:
On 02/04/2016 03:09 AM, Viresh Kumar wrote:
On 04-02-16, 00:50, Rafael J. Wysocki wrote:
This is exactly right. We've avoided one deadlock only to trip into another one.
This happens because update_sampling_rate() acquires od_dbs_cdata.mutex which is held around cpufreq_governor_exit() by cpufreq_governor_dbs().
Worse yet, a deadlock can still happen without (the new) dbs_data->mutex, just between s_active and od_dbs_cdata.mutex if update_sampling_rate() runs in parallel with cpufreq_governor_dbs()->cpufreq_governor_exit() and the latter wins the race.
It looks like we need to drop the governor mutex before putting the kobject in cpufreq_governor_exit().
I have tried to explore all possible ways of fixing this, and every other way looked to be racy in some way.
Does anyone else have a better idea (untested):
-------------------------8<-------------------------
Subject: [PATCH] cpufreq: ondemand: Shoot update_sampling_rate with a separate work
Signed-off-by: Viresh Kumar viresh.kumar@linaro.org
[...]
No no no no! Let's not open up this can of worms of queuing up the work to handle a write to a sysfs file. It *MIGHT* work for this specific tunable (I haven't bothered to analyze), but this makes it impossible to return a useful/proper error value.
Sent too soon. Not only that, but it can also cause the writes to the sysfs files to get processed in a different order and I don't know what other issues/races THAT will open up.
-Saravana
On Thu, Feb 4, 2016 at 6:44 PM, Saravana Kannan skannan@codeaurora.org wrote:
On 02/04/2016 09:43 AM, Saravana Kannan wrote:
On 02/04/2016 03:09 AM, Viresh Kumar wrote:
On 04-02-16, 00:50, Rafael J. Wysocki wrote:
This is exactly right. We've avoided one deadlock only to trip into another one.
This happens because update_sampling_rate() acquires od_dbs_cdata.mutex which is held around cpufreq_governor_exit() by cpufreq_governor_dbs().
Worse yet, a deadlock can still happen without (the new) dbs_data->mutex, just between s_active and od_dbs_cdata.mutex if update_sampling_rate() runs in parallel with cpufreq_governor_dbs()->cpufreq_governor_exit() and the latter wins the race.
It looks like we need to drop the governor mutex before putting the kobject in cpufreq_governor_exit().
[cut]
No no no no! Let's not open up this can of worms of queuing up the work to handle a write to a sysfs file. It *MIGHT* work for this specific tunable (I haven't bothered to analyze), but this makes it impossible to return a useful/proper error value.
Sent too soon. Not only that, but it can also cause the writes to the sysfs files to get processed in a different order and I don't know what other issues/races THAT will open up.
Well, I don't like this either.
I actually do have an idea about how to fix these deadlocks, but it is on top of my cleanup series.
I'll write more about it later today.
Thanks, Rafael
On 04-02-16, 19:18, Rafael J. Wysocki wrote:
On Thu, Feb 4, 2016 at 6:44 PM, Saravana Kannan skannan@codeaurora.org wrote:
On 02/04/2016 09:43 AM, Saravana Kannan wrote:
No no no no! Let's not open up this can of worms of queuing up the work to handle a write to a sysfs file. It *MIGHT* work for this specific tunable (I haven't bothered to analyze), but this makes it impossible to return a useful/proper error value.
Sent too soon. Not only that, but it can also cause the writes to the sysfs files to get processed in a different order and I don't know what other issues/races THAT will open up.
Well, I don't like this either.
I expected responses along these lines, so no surprises for me :)
There are a few things I would like to point out here, though: - There wouldn't be any race in updating the tunable, as that is done directly from store_sampling_rate(). It updates the *real* value we wanted to change.
- What's offloaded to the work handler is something very specific to the ondemand governor and its sampling rate. The same is not done for the conservative governor, though I don't know why.
- After updating the sampling rate, we check whether we need to reschedule the timers/workqueue to a different time for better efficiency. I don't think there can be a race there, and it can safely be done from a work item.
I actually do have an idea about how to fix these deadlocks, but it is on top of my cleanup series.
I would like to hear it, if you can share it now, to save you some time. I tried so many different ways of fixing this yesterday and found issues everywhere :(
On Thursday, February 04, 2016 07:18:32 PM Rafael J. Wysocki wrote:
On Thu, Feb 4, 2016 at 6:44 PM, Saravana Kannan skannan@codeaurora.org wrote:
On 02/04/2016 09:43 AM, Saravana Kannan wrote:
On 02/04/2016 03:09 AM, Viresh Kumar wrote:
On 04-02-16, 00:50, Rafael J. Wysocki wrote:
This is exactly right. We've avoided one deadlock only to trip into another one.
This happens because update_sampling_rate() acquires od_dbs_cdata.mutex which is held around cpufreq_governor_exit() by cpufreq_governor_dbs().
Worse yet, a deadlock can still happen without (the new) dbs_data->mutex, just between s_active and od_dbs_cdata.mutex if update_sampling_rate() runs in parallel with cpufreq_governor_dbs()->cpufreq_governor_exit() and the latter wins the race.
It looks like we need to drop the governor mutex before putting the kobject in cpufreq_governor_exit().
[cut]
No no no no! Let's not open up this can of worms of queuing up the work to handle a write to a sysfs file. It *MIGHT* work for this specific tunable (I haven't bothered to analyze), but this makes it impossible to return a useful/proper error value.
Sent too soon. Not only that, but it can also cause the writes to the sysfs files to get processed in a different order and I don't know what other issues/races THAT will open up.
Well, I don't like this too.
I actually do have an idea about how to fix these deadlocks, but it is on top of my cleanup series.
I'll write more about it later today.
Having actually posted that series again after cleaning it up I can say what I'm thinking about hopefully without confusing anyone too much. So please bear in mind that I'm going to refer to this series below:
http://marc.info/?l=linux-pm&m=145463901630950&w=4
Also this is more of a brain dump rather than actual design description, so there may be holes etc in it. Please let me know if you can see any.
The problem at hand is that policy->rwsem needs to be held around *all* operations in cpufreq_set_policy(). In particular, it cannot be dropped around invocations of __cpufreq_governor() with the event arg equal to _EXIT as that leads to interesting races.
Unfortunately, we know that holding policy->rwsem in those places leads to a deadlock with governor sysfs attributes removal in cpufreq_governor_exit().
Viresh attempted to fix this by avoiding acquiring policy->rwsem for governor attribute accesses (as holding it is not necessary for them in principle). That was a nice try, but it turned out to be insufficient because of another deadlock scenario uncovered by it. Namely, since the ondemand governor's update_sampling_rate() acquires the governor mutex (called dbs_data_mutex after my patches mentioned above), it may deadlock with exactly the same piece of code in cpufreq_governor_exit() in almost exactly the same way.
To avoid that other deadlock, we'd either need to drop dbs_data_mutex from update_sampling_rate(), or drop it for the removal of the governor sysfs attributes in cpufreq_governor_exit(). I don't think the former is an option at least at this point, so it looks like we pretty much have to do the latter.
With that in mind, I'd start with the changes made by Viresh (maybe without the first patch which really isn't essential here). That is, introduce a separate kobject type for the governor attributes kobject and register that in cpufreq_governor_init(). The show/store callbacks for that kobject type won't acquire policy->rwsem so the first deadlock will be avoided.
But in addition to that, I'd drop dbs_data_mutex before the removal of governor sysfs attributes. That actually happens in two places, in cpufreq_governor_exit() and in the error path of cpufreq_governor_init().
To that end, I'd move the locking from cpufreq_governor_dbs() to the functions called by it. That should be readily doable and they can do all of the necessary checks themselves. cpufreq_governor_dbs() would become a pure mux then, but that's not such a big deal.
With that, cpufreq_governor_exit() may just drop the lock before it does the final kobject_put(). The danger here is that the sysfs show/store callbacks of the governor attributes kobject may see invalid dbs_data for a while, after the lock has been dropped and before the kobject is deleted. That may be addressed by checking, for example, the presence of the dbs_data's "tuners" pointer in those callbacks. If it is NULL, they can simply return -EAGAIN or similar.
Now, that means, though, that they need to acquire the same lock as cpufreq_governor_exit(), or they may see things go away while they are running. The simplest approach here would be to take dbs_data_mutex in them too, although that's a bit of a sledgehammer. It might be better to have a per-policy lock in struct policy_dbs_info for that, for example, but then the governor attribute sysfs callbacks would need to get that object instead of dbs_data.
On the flip side, it might be possible to migrate update_sampling_rate() to that lock too. And maybe we can get rid of dbs_data_mutex even, who knows?
Thanks, Rafael
On 05-02-16, 04:54, Rafael J. Wysocki wrote:
Having actually posted that series again after cleaning it up I can say what I'm thinking about hopefully without confusing anyone too much. So please bear in mind that I'm going to refer to this series below:
http://marc.info/?l=linux-pm&m=145463901630950&w=4
Also this is more of a brain dump rather than actual design description, so there may be holes etc in it. Please let me know if you can see any.
The problem at hand is that policy->rwsem needs to be held around *all* operations in cpufreq_set_policy(). In particular, it cannot be dropped around invocations of __cpufreq_governor() with the event arg equal to _EXIT as that leads to interesting races.
Unfortunately, we know that holding policy->rwsem in those places leads to a deadlock with governor sysfs attributes removal in cpufreq_governor_exit().
Viresh attempted to fix this by avoiding acquiring policy->rwsem for governor attribute accesses (as holding it is not necessary for them in principle). That was a nice try, but it turned out to be insufficient because of another deadlock scenario uncovered by it.
Not really.
The other deadlock wasn't uncovered by it; it's just that Shilpa tested right after my patches and reported the issue there. Later yesterday, she hit exactly the same issue on pm/linux-next as well (i.e. without my patches). And of course, Juri also reported the same issue on linux-next a few days back.
Namely, since the ondemand governor's update_sampling_rate() acquires the governor mutex (called dbs_data_mutex after my patches mentioned above), it may deadlock with exactly the same piece of code in cpufreq_governor_exit() in almost exactly the same way.
Right.
To avoid that other deadlock, we'd either need to drop dbs_data_mutex from update_sampling_rate(),
And my so-called 'ugly' 8th patch tried to do just that :)
But as I also mentioned in my reply to your update-util patchset, it's somewhat possible.
or drop it for the removal of the governor sysfs attributes in cpufreq_governor_exit(). I don't think the former is an option at least at this point, so it looks like we pretty much have to do the latter.
With that in mind, I'd start with the changes made by Viresh (maybe without the first patch which really isn't essential here).
That was just to clean up the macro mess a bit, nothing more. Beyond that, I think the first 7 patches can be picked as they are, without any changes. Of course, they will need to be rebased over your 13 patches, if those are going in first :)
That is, introduce a separate kobject type for the governor attributes kobject and register that in cpufreq_governor_init(). The show/store callbacks for that kobject type won't acquire policy->rwsem so the first deadlock will be avoided.
But in addition to that, I'd drop dbs_data_mutex before the removal of governor sysfs attributes. That actually happens in two places, in cpufreq_governor_exit() and in the error path of cpufreq_governor_init().
To that end, I'd move the locking from cpufreq_governor_dbs() to the functions called by it. That should be readily doable and they can do all of the necessary checks themselves. cpufreq_governor_dbs() would become a pure mux then, but that's not such a big deal.
With that, cpufreq_governor_exit() may just drop the lock before it does the final kobject_put(). The danger here is that the sysfs show/store callbacks of the governor attributes kobject may see invalid dbs_data for a while, after the lock has been dropped and before the kobject is deleted. That may be addressed by checking, for example, the presence of the dbs_data's "tuners" pointer in those callbacks. If it is NULL, they can simply return -EAGAIN or similar.
So you mean something like this (consider only !governor_per_policy case with ondemand governor for now):
exit()
{
	lock-dbs_data_mutex;
	...
	dbs_data->tuners = NULL;	//so that sysfs files can return early
	dbs_governor->gdbs_data = NULL;	//for the !governor_per_policy case
	unlock-dbs_data_mutex;

	/*
	 * Problem: Who is stopping us from setting ondemand as the governor
	 * for another policy, which can then try to create a kobject that
	 * will try to create a sysfs directory at the same path?
	 *
	 * Though another field in dbs_governor can be used to fix this,
	 * I think, which needs to block the other INIT operation.
	 */
	kobject_put(dbs_data->kobj);	//This should wait for all sysfs operations to end.

	kfree(dbs_data);
}
And the sysfs show/store operations need to take dbs_data_mutex for their entire duration.
??
On Friday, February 05, 2016 03:19:25 PM Viresh Kumar wrote:
On 05-02-16, 04:54, Rafael J. Wysocki wrote:
Having actually posted that series again after cleaning it up I can say what I'm thinking about hopefully without confusing anyone too much. So please bear in mind that I'm going to refer to this series below:
http://marc.info/?l=linux-pm&m=145463901630950&w=4
Also this is more of a brain dump rather than actual design description, so there may be holes etc in it. Please let me know if you can see any.
The problem at hand is that policy->rwsem needs to be held around *all* operations in cpufreq_set_policy(). In particular, it cannot be dropped around invocations of __cpufreq_governor() with the event arg equal to _EXIT as that leads to interesting races.
Unfortunately, we know that holding policy->rwsem in those places leads to a deadlock with governor sysfs attributes removal in cpufreq_governor_exit().
Viresh attempted to fix this by avoiding acquiring policy->rwsem for governor attribute accesses (as holding it is not necessary for them in principle). That was a nice try, but it turned out to be insufficient because of another deadlock scenario uncovered by it.
Not really.
The other deadlock wasn't uncovered by it; it's just that Shilpa tested right after my patches and reported the issue there. Later yesterday, she hit exactly the same issue on pm/linux-next as well (i.e. without my patches). And of course, Juri also reported the same issue on linux-next a few days back.
OK, fair enough.
Namely, since the ondemand governor's update_sampling_rate() acquires the governor mutex (called dbs_data_mutex after my patches mentioned above), it may deadlock with exactly the same piece of code in cpufreq_governor_exit() in almost exactly the same way.
Right.
To avoid that other deadlock, we'd either need to drop dbs_data_mutex from update_sampling_rate(),
And my so-called 'ugly' 8th patch tried to do just that :)
But as I also mentioned in my reply to your update-util patchset, it's somewhat possible.
Yes, it should be possible and not even too difficult.
or drop it for the removal of the governor sysfs attributes in cpufreq_governor_exit(). I don't think the former is an option at least at this point, so it looks like we pretty much have to do the latter.
With that in mind, I'd start with the changes made by Viresh (maybe without the first patch which really isn't essential here).
That was just to clean up the macro mess a bit, nothing more. Beyond that, I think the first 7 patches can be picked as they are, without any changes. Of course, they will need to be rebased over your 13 patches, if those are going in first :)
Yes, please rebase.
Also please skip the first one that was moving min_sampling_rate around, at least for now.
As I said, we may be moving other attributes in the opposite direction, so two sets of macros may be necessary anyway.
That is, introduce a separate kobject type for the governor attributes kobject and register that in cpufreq_governor_init(). The show/store callbacks for that kobject type won't acquire policy->rwsem so the first deadlock will be avoided.
But in addition to that, I'd drop dbs_data_mutex before the removal of governor sysfs attributes. That actually happens in two places, in cpufreq_governor_exit() and in the error path of cpufreq_governor_init().
To that end, I'd move the locking from cpufreq_governor_dbs() to the functions called by it. That should be readily doable and they can do all of the necessary checks themselves. cpufreq_governor_dbs() would become a pure mux then, but that's not such a big deal.
With that, cpufreq_governor_exit() may just drop the lock before it does the final kobject_put(). The danger here is that the sysfs show/store callbacks of the governor attributes kobject may see invalid dbs_data for a while, after the lock has been dropped and before the kobject is deleted. That may be addressed by checking, for example, the presence of the dbs_data's "tuners" pointer in those callbacks. If it is NULL, they can simply return -EAGAIN or similar.
So you mean something like this (consider only !governor_per_policy case with ondemand governor for now):
exit()
{
	lock-dbs_data_mutex;
	...
	dbs_data->tuners = NULL;	//so that sysfs files can return early
	dbs_governor->gdbs_data = NULL;	//for the !governor_per_policy case
	unlock-dbs_data_mutex;

	/*
	 * Problem: Who is stopping us from setting ondemand as the governor
	 * for another policy, which can then try to create a kobject that
	 * will try to create a sysfs directory at the same path?
	 *
	 * Though another field in dbs_governor can be used to fix this,
	 * I think, which needs to block the other INIT operation.
	 */
	kobject_put(dbs_data->kobj);	//This should wait for all sysfs operations to end.

	kfree(dbs_data);
}
And the sysfs show/store operations need to take dbs_data_mutex for their entire duration.
??
Yes, roughly.
But it shouldn't be necessary after all, because dropping the mutex from update_sampling_rate() looks easier than I thought previously.
Thanks, Rafael
On 02/04/2016 07:54 PM, Rafael J. Wysocki wrote:
On Thursday, February 04, 2016 07:18:32 PM Rafael J. Wysocki wrote:
On Thu, Feb 4, 2016 at 6:44 PM, Saravana Kannan skannan@codeaurora.org wrote:
On 02/04/2016 09:43 AM, Saravana Kannan wrote:
On 02/04/2016 03:09 AM, Viresh Kumar wrote:
On 04-02-16, 00:50, Rafael J. Wysocki wrote:
This is exactly right. We've avoided one deadlock only to trip into another one.
This happens because update_sampling_rate() acquires od_dbs_cdata.mutex which is held around cpufreq_governor_exit() by cpufreq_governor_dbs().
Worse yet, a deadlock can still happen without (the new) dbs_data->mutex, just between s_active and od_dbs_cdata.mutex if update_sampling_rate() runs in parallel with cpufreq_governor_dbs()->cpufreq_governor_exit() and the latter wins the race.
It looks like we need to drop the governor mutex before putting the kobject in cpufreq_governor_exit().
[cut]
No no no no! Let's not open up this can of worms of queuing up the work to handle a write to a sysfs file. It *MIGHT* work for this specific tunable (I haven't bothered to analyze), but this makes it impossible to return a useful/proper error value.
Sent too soon. Not only that, but it can also cause the writes to the sysfs files to get processed in a different order and I don't know what other issues/races THAT will open up.
Well, I don't like this either.
I actually do have an idea about how to fix these deadlocks, but it is on top of my cleanup series.
I'll write more about it later today.
Having actually posted that series again after cleaning it up I can say what I'm thinking about hopefully without confusing anyone too much. So please bear in mind that I'm going to refer to this series below:
http://marc.info/?l=linux-pm&m=145463901630950&w=4
Also this is more of a brain dump rather than actual design description, so there may be holes etc in it. Please let me know if you can see any.
The problem at hand is that policy->rwsem needs to be held around *all* operations in cpufreq_set_policy(). In particular, it cannot be dropped around invocations of __cpufreq_governor() with the event arg equal to _EXIT as that leads to interesting races.
Unfortunately, we know that holding policy->rwsem in those places leads to a deadlock with governor sysfs attributes removal in cpufreq_governor_exit().
Viresh attempted to fix this by avoiding acquiring policy->rwsem for governor attribute accesses (as holding it is not necessary for them in principle). That was a nice try, but it turned out to be insufficient because of another deadlock scenario uncovered by it. Namely, since the ondemand governor's update_sampling_rate() acquires the governor mutex (called dbs_data_mutex after my patches mentioned above), it may deadlock with exactly the same piece of code in cpufreq_governor_exit() in almost exactly the same way.
To avoid that other deadlock, we'd either need to drop dbs_data_mutex from update_sampling_rate(), or drop it for the removal of the governor sysfs attributes in cpufreq_governor_exit(). I don't think the former is an option at least at this point, so it looks like we pretty much have to do the latter.
With that in mind, I'd start with the changes made by Viresh (maybe without the first patch which really isn't essential here). That is, introduce a separate kobject type for the governor attributes kobject and register that in cpufreq_governor_init(). The show/store callbacks for that kobject type won't acquire policy->rwsem so the first deadlock will be avoided.
But in addition to that, I'd drop dbs_data_mutex before the removal of governor sysfs attributes. That actually happens in two places, in cpufreq_governor_exit() and in the error path of cpufreq_governor_init().
To that end, I'd move the locking from cpufreq_governor_dbs() to the functions called by it. That should be readily doable and they can do all of the necessary checks themselves. cpufreq_governor_dbs() would become a pure mux then, but that's not such a big deal.
With that, cpufreq_governor_exit() may just drop the lock before it does the final kobject_put(). The danger here is that the sysfs show/store callbacks of the governor attributes kobject may see invalid dbs_data for a while, after the lock has been dropped and before the kobject is deleted. That may be addressed by checking, for example, the presence of the dbs_data's "tuners" pointer in those callbacks. If it is NULL, they can simply return -EAGAIN or similar.
Now, that means, though, that they need to acquire the same lock as cpufreq_governor_exit(), or they may see things go away while they are running. The simplest approach here would be to take dbs_data_mutex in them too, although that's a bit of a sledgehammer. It might be better to have a per-policy lock in struct policy_dbs_info for that, for example, but then the governor attribute sysfs callbacks would need to get that object instead of dbs_data.
On the flip side, it might be possible to migrate update_sampling_rate() to that lock too. And maybe we can get rid of dbs_data_mutex even, who knows?
I'm glad you've analyzed it this far. So, the rest of my comments will be easier to understand.
I'm going to go back to my point of NOT doing the sysfs add/remove inside the governor at all (that includes cpufreq_governor.c) and doing it in cpufreq.c instead. That suggestion was confusing to explain/understand before, when we were taking the policy rwsem inside the show/store ops for the governor attributes. Now that that has been removed, my suggestion would be even easier/cleaner to implement/understand, and you wouldn't have to worry about ANY races in the governor.
I'll just talk about the have_governor_per_policy() case. It can be easily extended to the global case.
In cpufreq_governor.c:

cpufreq_governor_init(...)
{
	...
	/* NOT kobject_init_and_add */
	kobject_init(...);
	/* New field */
	policy->gov_kobj = &dbs_data->kobj;
	...
}

In cpufreq.c:

__cpufreq_governor(...)
{
	if (event == POLICY_EXIT)
		kobject_put(policy->gov_kobj);

	ret = policy->governor->governor(policy, event);

	if (event == POLICY_INIT)
		kobject_add(policy->gov_kobj, policy->kobj,
			    policy->governor->name);
}
This guarantees that there can be no races where governor-specific data structures go away while being accessed from sysfs, because the first thing we do once we decide to "kill" a governor is to remove the sysfs files and cut off accesses to governor data (flushing out all ongoing accesses), and only THEN ask the governor to exit.
Thoughts?
Thanks, Saravana
On Friday, February 05, 2016 06:22:35 PM Saravana Kannan wrote:
On 02/04/2016 07:54 PM, Rafael J. Wysocki wrote:
On Thursday, February 04, 2016 07:18:32 PM Rafael J. Wysocki wrote:
On Thu, Feb 4, 2016 at 6:44 PM, Saravana Kannan skannan@codeaurora.org wrote:
On 02/04/2016 09:43 AM, Saravana Kannan wrote:
On 02/04/2016 03:09 AM, Viresh Kumar wrote:
On 04-02-16, 00:50, Rafael J. Wysocki wrote:

This is exactly right. We've avoided one deadlock only to trip into another one.

This happens because update_sampling_rate() acquires od_dbs_cdata.mutex which is held around cpufreq_governor_exit() by cpufreq_governor_dbs().

Worse yet, a deadlock can still happen without (the new) dbs_data->mutex, just between s_active and od_dbs_cdata.mutex if update_sampling_rate() runs in parallel with cpufreq_governor_dbs()->cpufreq_governor_exit() and the latter wins the race.

It looks like we need to drop the governor mutex before putting the kobject in cpufreq_governor_exit().
[cut]
No no no no! Let's not open up this can of worms of queuing up the work to handle a write to a sysfs file. It *MIGHT* work for this specific tunable (I haven't bothered to analyze), but this makes it impossible to return a useful/proper error value.
Sent too soon. Not only that, but it can also cause the writes to the sysfs files to get processed in a different order and I don't know what other issues/races THAT will open up.
Well, I don't like this either.
I actually do have an idea about how to fix these deadlocks, but it is on top of my cleanup series.
I'll write more about it later today.
Having actually posted that series again after cleaning it up I can say what I'm thinking about hopefully without confusing anyone too much. So please bear in mind that I'm going to refer to this series below:
http://marc.info/?l=linux-pm&m=145463901630950&w=4
Also this is more of a brain dump rather than actual design description, so there may be holes etc in it. Please let me know if you can see any.
The problem at hand is that policy->rwsem needs to be held around *all* operations in cpufreq_set_policy(). In particular, it cannot be dropped around invocations of __cpufreq_governor() with the event arg equal to _EXIT as that leads to interesting races.
Unfortunately, we know that holding policy->rwsem in those places leads to a deadlock with governor sysfs attributes removal in cpufreq_governor_exit().
Viresh attempted to fix this by avoiding acquiring policy->rwsem for governor attribute accesses (as holding it is not necessary for them in principle). That was a nice try, but it turned out to be insufficient because of another deadlock scenario uncovered by it. Namely, since the ondemand governor's update_sampling_rate() acquires the governor mutex (called dbs_data_mutex after my patches mentioned above), it may deadlock with exactly the same piece of code in cpufreq_governor_exit() in almost exactly the same way.
To avoid that other deadlock, we'd either need to drop dbs_data_mutex from update_sampling_rate(), or drop it for the removal of the governor sysfs attributes in cpufreq_governor_exit(). I don't think the former is an option at least at this point, so it looks like we pretty much have to do the latter.
With that in mind, I'd start with the changes made by Viresh (maybe without the first patch which really isn't essential here). That is, introduce a separate kobject type for the governor attributes kobject and register that in cpufreq_governor_init(). The show/store callbacks for that kobject type won't acquire policy->rwsem so the first deadlock will be avoided.
But in addition to that, I'd drop dbs_data_mutex before the removal of governor sysfs attributes. That actually happens in two places, in cpufreq_governor_exit() and in the error path of cpufreq_governor_init().
To that end, I'd move the locking from cpufreq_governor_dbs() to the functions called by it. That should be readily doable and they can do all of the necessary checks themselves. cpufreq_governor_dbs() would become a pure mux then, but that's not such a big deal.
With that, cpufreq_governor_exit() may just drop the lock before it does the final kobject_put(). The danger here is that the sysfs show/store callbacks of the governor attributes kobject may see invalid dbs_data for a while, after the lock has been dropped and before the kobject is deleted. That may be addressed by checking, for example, the presence of the dbs_data's "tuners" pointer in those callbacks. If it is NULL, they can simply return -EAGAIN or similar.
Now, that means, though, that they need to acquire the same lock as cpufreq_governor_exit(), or they may see things go away while they are running. The simplest approach here would be to take dbs_data_mutex in them too, although that's a bit of a sledgehammer. It might be better to have a per-policy lock in struct policy_dbs_info for that, for example, but then the governor attribute sysfs callbacks would need to get that object instead of dbs_data.
On the flip side, it might be possible to migrate update_sampling_rate() to that lock too. And maybe we can get rid of dbs_data_mutex even, who knows?
I'm glad you've analyzed it this far. So, the rest of my comments will be easier to understand.
I'm going to go back to my point of NOT doing the sysfs add/remove inside the governor at all (that includes cpufreq_governor.c) and doing it in cpufreq.c instead. That suggestion was confusing to explain/understand before, when we were taking the policy rwsem inside the show/store ops for the governor attributes. Now that that has been removed, my suggestion would be even easier/cleaner to implement/understand, and you wouldn't have to worry about ANY races in the governor.
I'll just talk about the have_governor_per_policy() case. It can be easily extended to the global case.
In cpufreq_governor.c:

cpufreq_governor_init(...)
{
	...
	/* NOT kobject_init_and_add */
	kobject_init(...);
	/* New field */
	policy->gov_kobj = &dbs_data->kobj;
	...
}

In cpufreq.c:

__cpufreq_governor(...)
{
	if (event == POLICY_EXIT)
		kobject_put(policy->gov_kobj);

	ret = policy->governor->governor(policy, event);

	if (event == POLICY_INIT)
		kobject_add(policy->gov_kobj, policy->kobj,
			    policy->governor->name);
}
This guarantees that there can be no race where the governor-specific data structures go away while being accessed from sysfs, because the first thing we do once we decide to "kill" a governor is to remove the sysfs files (which also flushes out all ongoing accesses) and THEN ask the governor to exit.
Thoughts?
The core would then have to rely on the governor code to populate the gov_kobj field correctly which doesn't look really straightforward to me. It is better if each code layer arranges the data structures it is going to use by itself.
Besides, ondemand and conservative are the only governors that use the governor kobject at all, so I'm not sure if that really belongs to the core.
Thanks, Rafael
On 02/07/2016 06:28 PM, Rafael J. Wysocki wrote:
On Friday, February 05, 2016 06:22:35 PM Saravana Kannan wrote:
On 02/04/2016 07:54 PM, Rafael J. Wysocki wrote:
On Thursday, February 04, 2016 07:18:32 PM Rafael J. Wysocki wrote:
On Thu, Feb 4, 2016 at 6:44 PM, Saravana Kannan skannan@codeaurora.org wrote:
On 02/04/2016 09:43 AM, Saravana Kannan wrote:
On 02/04/2016 03:09 AM, Viresh Kumar wrote:
> On 04-02-16, 00:50, Rafael J. Wysocki wrote:
>> This is exactly right. We've avoided one deadlock only to trip into another one.
>>
>> This happens because update_sampling_rate() acquires od_dbs_cdata.mutex which is held around cpufreq_governor_exit() by cpufreq_governor_dbs().
>>
>> Worse yet, a deadlock can still happen without (the new) dbs_data->mutex, just between s_active and od_dbs_cdata.mutex if update_sampling_rate() runs in parallel with cpufreq_governor_dbs()->cpufreq_governor_exit() and the latter wins the race.
>>
>> It looks like we need to drop the governor mutex before putting the kobject in cpufreq_governor_exit().
[cut]
No no no no! Let's not open up this can of worms of queuing up the work to handle a write to a sysfs file. It *MIGHT* work for this specific tunable (I haven't bothered to analyze), but this makes it impossible to return a useful/proper error value.
Sent too soon. Not only that, but it can also cause the writes to the sysfs files to get processed in a different order and I don't know what other issues/races THAT will open up.
Well, I don't like this too.
I actually do have an idea about how to fix these deadlocks, but it is on top of my cleanup series.
I'll write more about it later today.
Having actually posted that series again after cleaning it up, I can say what I'm thinking about, hopefully without confusing anyone too much. So please bear in mind that I'm going to refer to this series below:
http://marc.info/?l=linux-pm&m=145463901630950&w=4
Also this is more of a brain dump rather than actual design description, so there may be holes etc in it. Please let me know if you can see any.
The problem at hand is that policy->rwsem needs to be held around *all* operations in cpufreq_set_policy(). In particular, it cannot be dropped around invocations of __cpufreq_governor() with the event arg equal to _EXIT as that leads to interesting races.
Unfortunately, we know that holding policy->rwsem in those places leads to a deadlock with governor sysfs attributes removal in cpufreq_governor_exit().
Viresh attempted to fix this by not acquiring policy->rwsem for governor attribute accesses (holding it is not necessary for them in principle). That was a nice try, but it turned out to be insufficient because of another deadlock scenario uncovered by it. Namely, since the ondemand governor's update_sampling_rate() acquires the governor mutex (called dbs_data_mutex after my patches mentioned above), it may deadlock with exactly the same piece of code in cpufreq_governor_exit() in almost exactly the same way.
To avoid that other deadlock, we'd either need to drop dbs_data_mutex from update_sampling_rate(), or drop it for the removal of the governor sysfs attributes in cpufreq_governor_exit(). I don't think the former is an option at least at this point, so it looks like we pretty much have to do the latter.
With that in mind, I'd start with the changes made by Viresh (maybe without the first patch which really isn't essential here). That is, introduce a separate kobject type for the governor attributes kobject and register that in cpufreq_governor_init(). The show/store callbacks for that kobject type won't acquire policy->rwsem so the first deadlock will be avoided.
But in addition to that, I'd drop dbs_data_mutex before the removal of governor sysfs attributes. That actually happens in two places, in cpufreq_governor_exit() and in the error path of cpufreq_governor_init().
To that end, I'd move the locking from cpufreq_governor_dbs() to the functions called by it. That should be readily doable and they can do all of the necessary checks themselves. cpufreq_governor_dbs() would become a pure mux then, but that's not such a big deal.
The core would then have to rely on the governor code to populate the gov_kobj field correctly which doesn't look really straightforward to me. It is better if each code layer arranges the data structures it is going to use by itself.
The core depends a lot on the drivers and governors filling in some fields correctly; this isn't any worse than that. It just seems way more logical to me to remove the interface for changing governor attributes (the sysfs files) before we start "exiting" a governor. But it looks like there's a v3 of Viresh's series that people seem to agree fixes the race in a different way -- I haven't had time to look at it. So I'm not going to keep pushing my point about removing the sysfs files at the core level. I'll jump back to it if we later find another race with the v3 series :)
Besides, ondemand and conservative are the only governors that use the governor kobject at all, so I'm not sure if that really belongs to the core.
Technically, the userspace governor should be using a kobject and sysfs attributes for set_speed, but for whatever reason (legacy/historical, I assume) we let the core add/remove the sysfs file for an op that's supported only by the userspace governor.
-Saravana
On 03-02-16, 21:40, Viresh Kumar wrote:
On 03-02-16, 15:54, Juri Lelli wrote:
Ouch, I've just got this while executing '-f basic' on Juno. :( It happens with the hotplug_1_by_1 test.
[ 1086.531252] IRQ1 no longer affine to CPU1
[ 1086.531495] CPU1: shutdown
[ 1086.538199] psci: CPU1 killed.
[ 1086.583396]
[ 1086.584881] ======================================================
[ 1086.590999] [ INFO: possible circular locking dependency detected ]
[ 1086.597205] 4.5.0-rc2+ #37 Not tainted
[ 1086.600914] -------------------------------------------------------
[ 1086.607118] runme.sh/1052 is trying to acquire lock:
[ 1086.612031]  (sb_writers#7){.+.+.+}, at: [<ffffffc000249500>] __sb_start_write+0xcc/0xe0
[ 1086.620090]
[ 1086.620090] but task is already holding lock:
[ 1086.625865]  (&policy->rwsem){+++++.}, at: [<ffffffc0005c8ee4>] cpufreq_offline+0x7c/0x278
[ 1086.634081]
[ 1086.634081] which lock already depends on the new lock.
[ 1086.634081]
[ 1086.642180]
[ 1086.642180] the existing dependency chain (in reverse order) is:
[ 1086.649589] -> #1 (&policy->rwsem){+++++.}:
[ 1086.653929]        [<ffffffc00011d9a4>] check_prev_add+0x670/0x754
[ 1086.660060]        [<ffffffc00011e1ac>] validate_chain.isra.36+0x724/0xa0c
[ 1086.666876]        [<ffffffc00011f904>] __lock_acquire+0x4e4/0xba0
[ 1086.673001]        [<ffffffc000120b58>] lock_release+0x244/0x570
[ 1086.678955]        [<ffffffc0007351d0>] __mutex_unlock_slowpath+0xa0/0x18c
[ 1086.685771]        [<ffffffc0007352dc>] mutex_unlock+0x20/0x2c
[ 1086.691553]        [<ffffffc0002ccd24>] kernfs_fop_write+0xb0/0x194
[ 1086.697768]        [<ffffffc00024478c>] __vfs_write+0x48/0x104
[ 1086.703550]        [<ffffffc0002457a4>] vfs_write+0x98/0x198
[ 1086.709161]        [<ffffffc0002465e4>] SyS_write+0x54/0xb0
[ 1086.714684]        [<ffffffc000085d30>] el0_svc_naked+0x24/0x28
[ 1086.720555] -> #0 (sb_writers#7){.+.+.+}:
[ 1086.724730]        [<ffffffc00011c574>] print_circular_bug+0x80/0x2e4
[ 1086.731116]        [<ffffffc00011d470>] check_prev_add+0x13c/0x754
[ 1086.737243]        [<ffffffc00011e1ac>] validate_chain.isra.36+0x724/0xa0c
[ 1086.744059]        [<ffffffc00011f904>] __lock_acquire+0x4e4/0xba0
[ 1086.750184]        [<ffffffc0001207f4>] lock_acquire+0xe4/0x204
[ 1086.756052]        [<ffffffc000118da0>] percpu_down_read+0x50/0xe4
[ 1086.762180]        [<ffffffc000249500>] __sb_start_write+0xcc/0xe0
[ 1086.768306]        [<ffffffc00026ae90>] mnt_want_write+0x28/0x54
[ 1086.774263]        [<ffffffc0002555f8>] do_last+0x660/0xcb8
[ 1086.779788]        [<ffffffc000255cdc>] path_openat+0x8c/0x2b0
[ 1086.785570]        [<ffffffc000256fbc>] do_filp_open+0x78/0xf0
[ 1086.791353]        [<ffffffc000244058>] do_sys_open+0x150/0x214
[ 1086.797222]        [<ffffffc0002441a0>] SyS_openat+0x3c/0x48
[ 1086.802831]        [<ffffffc000085d30>] el0_svc_naked+0x24/0x28
[ 1086.808700]
[ 1086.808700] other info that might help us debug this:
[ 1086.808700]
[ 1086.816627]  Possible unsafe locking scenario:
[ 1086.816627]
[ 1086.822488]        CPU0                    CPU1
[ 1086.826971]        ----                    ----
[ 1086.831453]   lock(&policy->rwsem);
[ 1086.834918]                                lock(sb_writers#7);
[ 1086.840713]                                lock(&policy->rwsem);
[ 1086.846671]   lock(sb_writers#7);
[ 1086.849972]
[ 1086.849972]  *** DEADLOCK ***
[ 1086.849972]
[ 1086.855836] 1 lock held by runme.sh/1052:
[ 1086.859802]  #0:  (&policy->rwsem){+++++.}, at: [<ffffffc0005c8ee4>] cpufreq_offline+0x7c/0x278
[ 1086.868453]
[ 1086.868453] stack backtrace:
[ 1086.872769] CPU: 5 PID: 1052 Comm: runme.sh Not tainted 4.5.0-rc2+ #37
[ 1086.879229] Hardware name: ARM Juno development board (r2) (DT)
[ 1086.885089] Call trace:
[ 1086.887511] [<ffffffc00008a788>] dump_backtrace+0x0/0x1f4
[ 1086.892858] [<ffffffc00008a99c>] show_stack+0x20/0x28
[ 1086.897861] [<ffffffc00041a380>] dump_stack+0x84/0xc0
[ 1086.902863] [<ffffffc00011c6c8>] print_circular_bug+0x1d4/0x2e4
[ 1086.908725] [<ffffffc00011d470>] check_prev_add+0x13c/0x754
[ 1086.914244] [<ffffffc00011e1ac>] validate_chain.isra.36+0x724/0xa0c
[ 1086.920448] [<ffffffc00011f904>] __lock_acquire+0x4e4/0xba0
[ 1086.925965] [<ffffffc0001207f4>] lock_acquire+0xe4/0x204
[ 1086.931224] [<ffffffc000118da0>] percpu_down_read+0x50/0xe4
[ 1086.936742] [<ffffffc000249500>] __sb_start_write+0xcc/0xe0
[ 1086.942260] [<ffffffc00026ae90>] mnt_want_write+0x28/0x54
[ 1086.947605] [<ffffffc0002555f8>] do_last+0x660/0xcb8
[ 1086.952520] [<ffffffc000255cdc>] path_openat+0x8c/0x2b0
[ 1086.957693] [<ffffffc000256fbc>] do_filp_open+0x78/0xf0
[ 1086.962865] [<ffffffc000244058>] do_sys_open+0x150/0x214
[ 1086.968123] [<ffffffc0002441a0>] SyS_openat+0x3c/0x48
[ 1086.973124] [<ffffffc000085d30>] el0_svc_naked+0x24/0x28
[ 1087.019315] Detected PIPT I-cache on CPU1
[ 1087.019373] CPU1: Booted secondary processor [410fd080]
Urg..
Urg square :(
I've failed to understand it so far, though. Please test only the first 4 patches and leave out the bottom three. AFAICT, this is caused by the 6th patch.
I have been failing to understand this from the code for some time now, but something just caught my eye: the 6th patch needs this fixup:
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 7bc8a5ed97e5..ac3348ecde7b 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1351,7 +1351,7 @@ static void cpufreq_offline(unsigned int cpu)
 				pr_err("%s: Failed to start governor\n", __func__);
 		}
 
-		return;
+		goto unlock;
 	}
 
 	if (cpufreq_driver->stop_cpu)
@@ -1373,6 +1373,8 @@ static void cpufreq_offline(unsigned int cpu)
 		cpufreq_driver->exit(policy);
 		policy->freq_table = NULL;
 	}
+
+unlock:
 	up_write(&policy->rwsem);
 }
I tried the basic tests using './runme' and they aren't reporting the same lockdep now. And yes, your lockdep occurred on my exynos board as well :)
I have re-pushed my patches again to the same branch. All 7 look fine to me now :)
On 04-02-16, 11:54, Viresh Kumar wrote:
FWIW, Juri has reported on IRC that the above diff fixed the lockdep he reported yesterday and all the 7 patches are working fine on his test machine, Juno.
Hi,
FWIW, Juri has reported on IRC that the above diff fixed the lockdep he reported yesterday and all the 7 patches are working fine on his test machine, Juno.
I could see the previous lockdep warnings on pm/next and on top of patch [4]. With patches [5-7] applied I see the below lockdep trace when running './runme' on a Power8 box.
[ 710.332841] ======================================================
[ 710.332911] [ INFO: possible circular locking dependency detected ]
[ 710.332969] 4.5.0-rc2-sgb+ #104 Not tainted
[ 710.333004] -------------------------------------------------------
[ 710.333060] runme.sh/2476 is trying to acquire lock:
[ 710.333107]  (s_active#91){++++.+}, at: [<c000000000407db8>] kernfs_remove+0x48/0x70
[ 710.333215] but task is already holding lock:
[ 710.333272]  (od_dbs_cdata.mutex){+.+.+.}, at: [<c000000000ad7434>] cpufreq_governor_dbs+0x64/0x7e0
[ 710.333388] which lock already depends on the new lock.

[ 710.333456] the existing dependency chain (in reverse order) is:
[ 710.333523] -> #2 (od_dbs_cdata.mutex){+.+.+.}:
[ 710.333604]        [<c000000000d48450>] mutex_lock_nested+0x90/0x590
[ 710.333673]        [<c000000000ad5698>] update_sampling_rate+0x88/0x1c0
[ 710.333741]        [<c000000000ad5830>] store_sampling_rate+0x60/0xa0
[ 710.333809]        [<c000000000ad6990>] governor_store+0x80/0xc0
[ 710.333865]        [<c00000000040a8a4>] sysfs_kf_write+0x94/0xc0
[ 710.333923]        [<c0000000004094a8>] kernfs_fop_write+0x188/0x1f0
[ 710.333991]        [<c000000000347b8c>] __vfs_write+0x6c/0x180
[ 710.334049]        [<c0000000003490a0>] vfs_write+0xc0/0x200
[ 710.334107]        [<c00000000034a3cc>] SyS_write+0x6c/0x110
[ 710.334163]        [<c00000000000926c>] system_call+0x38/0xd0
[ 710.334222] -> #1 (&dbs_data->mutex){+.+...}:
[ 710.334290]        [<c000000000d48450>] mutex_lock_nested+0x90/0x590
[ 710.334358]        [<c000000000ad6960>] governor_store+0x50/0xc0
[ 710.334415]        [<c00000000040a8a4>] sysfs_kf_write+0x94/0xc0
[ 710.334471]        [<c0000000004094a8>] kernfs_fop_write+0x188/0x1f0
[ 710.334539]        [<c000000000347b8c>] __vfs_write+0x6c/0x180
[ 710.334596]        [<c0000000003490a0>] vfs_write+0xc0/0x200
[ 710.334653]        [<c00000000034a3cc>] SyS_write+0x6c/0x110
[ 710.334710]        [<c00000000000926c>] system_call+0x38/0xd0
[ 710.334767] -> #0 (s_active#91){++++.+}:
[ 710.334847]        [<c00000000015f318>] lock_acquire+0xd8/0x1a0
[ 710.334905]        [<c0000000004065f4>] __kernfs_remove+0x344/0x410
[ 710.334973]        [<c000000000407db8>] kernfs_remove+0x48/0x70
[ 710.335030]        [<c00000000040b868>] sysfs_remove_dir+0x78/0xd0
[ 710.335098]        [<c0000000005eccec>] kobject_del+0x2c/0x80
[ 710.335156]        [<c0000000005ec9e8>] kobject_release+0xa8/0x250
[ 710.335224]        [<c000000000ad7ac8>] cpufreq_governor_dbs+0x6f8/0x7e0
[ 710.335292]        [<c000000000ad4a7c>] od_cpufreq_governor_dbs+0x3c/0x60
[ 710.335361]        [<c000000000acf7c4>] __cpufreq_governor+0x164/0x300
[ 710.335429]        [<c000000000ad0600>] cpufreq_set_policy+0x3b0/0x450
[ 710.335497]        [<c000000000ad117c>] store_scaling_governor+0x8c/0xf0
[ 710.335565]        [<c000000000aced34>] store+0xb4/0x110
[ 710.335622]        [<c00000000040a8a4>] sysfs_kf_write+0x94/0xc0
[ 710.335679]        [<c0000000004094a8>] kernfs_fop_write+0x188/0x1f0
[ 710.335747]        [<c000000000347b8c>] __vfs_write+0x6c/0x180
[ 710.335803]        [<c0000000003490a0>] vfs_write+0xc0/0x200
[ 710.335861]        [<c00000000034a3cc>] SyS_write+0x6c/0x110
[ 710.335918]        [<c00000000000926c>] system_call+0x38/0xd0
[ 710.335993] other info that might help us debug this:

[ 710.336130] Chain exists of:
  s_active#91 --> &dbs_data->mutex --> od_dbs_cdata.mutex

[ 710.336376]  Possible unsafe locking scenario:

[ 710.336488]        CPU0                    CPU1
[ 710.336577]        ----                    ----
[ 710.336666]   lock(od_dbs_cdata.mutex);
[ 710.336778]                                lock(&dbs_data->mutex);
[ 710.336911]                                lock(od_dbs_cdata.mutex);
[ 710.337064]   lock(s_active#91);
[ 710.337176]  *** DEADLOCK ***

[ 710.337289] 6 locks held by runme.sh/2476:
[ 710.337355]  #0:  (sb_writers#6){.+.+.+}, at: [<c00000000034cf10>] __sb_start_write+0x120/0x150
[ 710.337600]  #1:  (&of->mutex){+.+.+.}, at: [<c00000000040939c>] kernfs_fop_write+0x7c/0x1f0
[ 710.337824]  #2:  (s_active#82){.+.+.+}, at: [<c0000000004093a8>] kernfs_fop_write+0x88/0x1f0
[ 710.338070]  #3:  (cpu_hotplug.lock){++++++}, at: [<c0000000000e06d8>] get_online_cpus+0x48/0xc0
[ 710.338276]  #4:  (&policy->rwsem){+++++.}, at: [<c000000000aced04>] store+0x84/0x110
[ 710.338476]  #5:  (od_dbs_cdata.mutex){+.+.+.}, at: [<c000000000ad7434>] cpufreq_governor_dbs+0x64/0x7e0
[ 710.338722] stack backtrace:
[ 710.338813] CPU: 0 PID: 2476 Comm: runme.sh Not tainted 4.5.0-rc2-sgb+ #104
[ 710.338929] Call Trace:
[ 710.338978] [c000005fd40eaec0] [c000000000d563d0] dump_stack+0x90/0xbc (unreliable)
[ 710.339138] [c000005fd40eaef0] [c00000000015884c] print_circular_bug+0x28c/0x3e0
[ 710.339295] [c000005fd40eaf90] [c00000000015ed88] __lock_acquire+0x2278/0x22d0
[ 710.339455] [c000005fd40eb120] [c00000000015f318] lock_acquire+0xd8/0x1a0
[ 710.339589] [c000005fd40eb1e0] [c0000000004065f4] __kernfs_remove+0x344/0x410
[ 710.339724] [c000005fd40eb2e0] [c000000000407db8] kernfs_remove+0x48/0x70
[ 710.339859] [c000005fd40eb310] [c00000000040b868] sysfs_remove_dir+0x78/0xd0
[ 710.339993] [c000005fd40eb350] [c0000000005eccec] kobject_del+0x2c/0x80
[ 710.340128] [c000005fd40eb380] [c0000000005ec9e8] kobject_release+0xa8/0x250
[ 710.340265] [c000005fd40eb410] [c000000000ad7ac8] cpufreq_governor_dbs+0x6f8/0x7e0
[ 710.340423] [c000005fd40eb4c0] [c000000000ad4a7c] od_cpufreq_governor_dbs+0x3c/0x60
[ 710.340561] [c000005fd40eb500] [c000000000acf7c4] __cpufreq_governor+0x164/0x300
[ 710.340639] [c000005fd40eb580] [c000000000ad0600] cpufreq_set_policy+0x3b0/0x450
[ 710.340719] [c000005fd40eb610] [c000000000ad117c] store_scaling_governor+0x8c/0xf0
[ 710.340797] [c000005fd40ebc10] [c000000000aced34] store+0xb4/0x110
[ 710.340866] [c000005fd40ebc60] [c00000000040a8a4] sysfs_kf_write+0x94/0xc0
[ 710.340934] [c000005fd40ebca0] [c0000000004094a8] kernfs_fop_write+0x188/0x1f0
[ 710.341013] [c000005fd40ebcf0] [c000000000347b8c] __vfs_write+0x6c/0x180
[ 710.341081] [c000005fd40ebd90] [c0000000003490a0] vfs_write+0xc0/0x200
[ 710.341151] [c000005fd40ebde0] [c00000000034a3cc] SyS_write+0x6c/0x110
[ 710.341219] [c000005fd40ebe30] [c00000000000926c] system_call+0x38/0xd0
Thanks and Regards, Shilpa
On 05-02-16, 02:20, Shilpasri G Bhat wrote:
I could see the previous lockdep warnings on pm/next and on top of patch[4]. On Patch[5-7] I see the below lockdep trace on running './runme' on a Power8 box.
[ 710.336130] Chain exists of: s_active#91 --> &dbs_data->mutex --> od_dbs_cdata.mutex
[ 710.336376] Possible unsafe locking scenario:
[ 710.336488]        CPU0                    CPU1
[ 710.336577]        ----                    ----
[ 710.336666]   lock(od_dbs_cdata.mutex);
[ 710.336778]                                lock(&dbs_data->mutex);
[ 710.336911]                                lock(od_dbs_cdata.mutex);
[ 710.337064]   lock(s_active#91);
[ 710.337176]
This is the same lockdep, just that we have added another mutex (dbs_data->mutex) to it.
Have you checked whether all the lockdep warnings go away with the 8th patch that I provided yesterday?