Hi Rafael,
Here is the third version, based on the review comments you gave. I have tried to address most of them, and the series looks much better now. Thanks for your comments.
As I am traveling this week, I don't have access to hardware to retest the series. I am quite sure it will work just fine, though, since there weren't many changes from how the final code looked earlier. In any case, I pushed this to the build bot a while back and it will let us know of any obvious issues.
V2->V3:
- A few patches were merged already and have been dropped.
- The patches are reordered a bit to make the series more sensible.
- gov_queue_work() is no longer modified to take a mask of CPUs, as you suggested earlier.
- Some minor commit-log updates.
V1->V2:
- Dropped 2/10 from V1 as it wasn't required.
- 3/10 saw some changes because of the patch dropped above.
- 7/10 changed a bit: we now check for pending work items by looking at shared->policy rather than calling delayed_work_pending(). What we really want to know is whether the governor is operational, and the new check is enough for that.
Viresh Kumar (5):
  cpufreq: ondemand: Drop unnecessary locks from update_sampling_rate()
  cpufreq: ondemand: update sampling rate immediately
  cpufreq: ondemand: queue work for policy->cpus together
  cpufreq: governor: Quit work-handlers early if governor is stopped
  cpufreq: Get rid of ->governor_enabled and its lock
 drivers/cpufreq/cpufreq.c          | 24 ------------------
 drivers/cpufreq/cpufreq_governor.c | 33 +++++++++++++++++--------
 drivers/cpufreq/cpufreq_ondemand.c | 50 ++++++++++++--------------------------
 include/linux/cpufreq.h            |  1 -
 4 files changed, 39 insertions(+), 69 deletions(-)
'timer_mutex' is required to sync work-handlers of policy->cpus. update_sampling_rate() is just canceling the works and queuing them again. This isn't protecting anything at all in update_sampling_rate() and is not gonna be of any use.
Even if a work-handler is already running for a CPU, cancel_delayed_work_sync() will wait for it to finish.
Drop these unnecessary locks.
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq_ondemand.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 1fa9088c84a8..03ac6ce54042 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -267,27 +267,19 @@ static void update_sampling_rate(struct dbs_data *dbs_data,
 		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
 		cpufreq_cpu_put(policy);
 
-		mutex_lock(&dbs_info->cdbs.shared->timer_mutex);
-
-		if (!delayed_work_pending(&dbs_info->cdbs.dwork)) {
-			mutex_unlock(&dbs_info->cdbs.shared->timer_mutex);
+		if (!delayed_work_pending(&dbs_info->cdbs.dwork))
 			continue;
-		}
 
 		next_sampling = jiffies + usecs_to_jiffies(new_rate);
 		appointed_at = dbs_info->cdbs.dwork.timer.expires;
 
 		if (time_before(next_sampling, appointed_at)) {
-
-			mutex_unlock(&dbs_info->cdbs.shared->timer_mutex);
 			cancel_delayed_work_sync(&dbs_info->cdbs.dwork);
-			mutex_lock(&dbs_info->cdbs.shared->timer_mutex);
 
 			gov_queue_work(dbs_data, policy,
 				       usecs_to_jiffies(new_rate), true);
 
 		}
-		mutex_unlock(&dbs_info->cdbs.shared->timer_mutex);
 	}
 }
On Tuesday, October 13, 2015 01:39:01 PM Viresh Kumar wrote:
'timer_mutex' is required to sync work-handlers of policy->cpus. update_sampling_rate() is just canceling the works and queuing them again. This isn't protecting anything at all in update_sampling_rate() and is not gonna be of any use.
Even if a work-handler is already running for a CPU, cancel_delayed_work_sync() will wait for it to finish.
Drop these unnecessary locks.
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
I'm queuing this up for 4.4, although I think that the changelog is not right.
While at it, what are the race conditions the lock is protecting against?
Thanks, Rafael
On 28-10-15, 05:05, Rafael J. Wysocki wrote:
On Tuesday, October 13, 2015 01:39:01 PM Viresh Kumar wrote:
'timer_mutex' is required to sync work-handlers of policy->cpus. update_sampling_rate() is just canceling the works and queuing them again. This isn't protecting anything at all in update_sampling_rate() and is not gonna be of any use.
Even if a work-handler is already running for a CPU, cancel_delayed_work_sync() will wait for it to finish.
Drop these unnecessary locks.
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
I'm queuing this up for 4.4, although I think that the changelog is not right.
While at it, what are the race conditions the lock is protecting against?
In cases where a single policy controls multiple CPUs, a timer is queued for every CPU present in policy->cpus. When the timer handler runs (possibly on several of those CPUs at the same time), we evaluate the load for all policy->cpus and update the frequency accordingly.
The lock prevents multiple CPUs from doing this work at the same time, as it only needs to be done by a single CPU. Once any CPU's handler has completed, it updates the last-update time and drops the mutex. At that point the other blocked handlers (if any) check the last-update time and return early.
And there are enough subtle things that can go wrong if multiple CPUs do the load evaluation and frequency update at the same time, apart from it being a waste of effort.
So I still think the commit log isn't that bad. The timer_mutex lock isn't required in other parts of the governor; it is only there to synchronize the work-handlers of the CPUs belonging to the same policy.
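Roughly, the serialization pattern described above looks like this (a minimal sketch with placeholder names such as policy_work_handler(), shared->last_sample and evaluate_load_and_set_freq(); the real governor implements the early-return check via need_load_eval()):

static void policy_work_handler(struct shared_dbs_info *shared,
				unsigned long sample_delay_jiffies)
{
	mutex_lock(&shared->timer_mutex);

	/* Another policy->cpus handler already sampled this period: bail out. */
	if (time_before(jiffies, shared->last_sample + sample_delay_jiffies)) {
		mutex_unlock(&shared->timer_mutex);
		return;
	}

	/* Only one CPU per policy gets here in each sampling period. */
	evaluate_load_and_set_freq(shared);	/* placeholder helper */
	shared->last_sample = jiffies;

	mutex_unlock(&shared->timer_mutex);
}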
On Wednesday, October 28, 2015 10:14:51 AM Viresh Kumar wrote:
On 28-10-15, 05:05, Rafael J. Wysocki wrote:
On Tuesday, October 13, 2015 01:39:01 PM Viresh Kumar wrote:
'timer_mutex' is required to sync work-handlers of policy->cpus. update_sampling_rate() is just canceling the works and queuing them again. This isn't protecting anything at all in update_sampling_rate() and is not gonna be of any use.
Even if a work-handler is already running for a CPU, cancel_delayed_work_sync() will wait for it to finish.
Drop these unnecessary locks.
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
I'm queuing this up for 4.4, although I think that the changelog is not right.
While at it, what are the race conditions the lock is protecting against?
In cases where a single policy controls multiple CPUs, a timer is queued for every CPU present in policy->cpus. When the timer handler runs (possibly on several of those CPUs at the same time), we evaluate the load for all policy->cpus and update the frequency accordingly.
That would be in dbs_timer(), right?
The lock prevents multiple CPUs from doing this work at the same time, as it only needs to be done by a single CPU. Once any CPU's handler has completed, it updates the last-update time and drops the mutex. At that point the other blocked handlers (if any) check the last-update time and return early.
Well, that would mean we only needed to hold the lock around the need_load_eval() evaluation in dbs_timer() if I'm not mistaken.
We also should acquire it around updates of the sampling rate, which essentially is set_sampling_rate().
Is there any reason to acquire it in cpufreq_governor_limits(), then, for example?
And there are enough subtle things that can go wrong if multiple CPUs do the load evaluation and frequency update at the same time, apart from it being a waste of effort.
So I still think the commit log isn't that bad. The timer_mutex lock isn't required in other parts of the governor; it is only there to synchronize the work-handlers of the CPUs belonging to the same policy.
I agree that it doesn't serve any purpose in the piece of code you're removing it from (which is why I agree with the patch), but the changelog is incomplete and confusing.
Thanks, Rafael
On 28-10-15, 06:54, Rafael J. Wysocki wrote:
On Wednesday, October 28, 2015 10:14:51 AM Viresh Kumar wrote:
In cases where a single policy controls multiple CPUs, a timer is queued for every CPU present in policy->cpus. When the timer handler runs (possibly on several of those CPUs at the same time), we evaluate the load for all policy->cpus and update the frequency accordingly.
That would be in dbs_timer(), right?
Yeah, and we already do stuff from within the mutex there.
The lock prevents multiple CPUs from doing this work at the same time, as it only needs to be done by a single CPU. Once any CPU's handler has completed, it updates the last-update time and drops the mutex. At that point the other blocked handlers (if any) check the last-update time and return early.
Well, that would mean we only needed to hold the lock around the need_load_eval() evaluation in dbs_timer() if I'm not mistaken.
Actually yes, but the fourth patch of this series uses timer_mutex to fix a long-standing problem (which was previously worked around by hacking the code), and so we need to take the lock for the entire dbs_timer() routine.
We also should acquire it around updates of the sampling rate, which essentially is set_sampling_rate().
Why? In the worst case we may schedule the next timer with the old sampling rate. But do we really care enough about that race to add locking here as well?
Is there any reason to acquire it in cpufreq_governor_limits(), then, for example?
Yeah, we are calling dbs_check_cpu(dbs_data, cpu) from that path, which will reevaluate the load.
On Wednesday, October 28, 2015 12:13:17 PM Viresh Kumar wrote:
On 28-10-15, 06:54, Rafael J. Wysocki wrote:
On Wednesday, October 28, 2015 10:14:51 AM Viresh Kumar wrote:
In cases where a single policy controls multiple CPUs, a timer is queued for every CPU present in policy->cpus. When the timer handler runs (possibly on several of those CPUs at the same time), we evaluate the load for all policy->cpus and update the frequency accordingly.
That would be in dbs_timer(), right?
Yeah, and we already do stuff from within the mutex there.
The lock prevents multiple CPUs from doing this work at the same time, as it only needs to be done by a single CPU. Once any CPU's handler has completed, it updates the last-update time and drops the mutex. At that point the other blocked handlers (if any) check the last-update time and return early.
Well, that would mean we only needed to hold the lock around the need_load_eval() evaluation in dbs_timer() if I'm not mistaken.
Actually yes, but the fourth patch of this series uses timer_mutex to fix a long-standing problem (which was previously worked around by hacking the code), and so we need to take the lock for the entire dbs_timer() routine.
I don't actually think that that patch is correct and even if it is, we'll only need to do that *after* that patch, so at least it would be fair to say a word about it in the changelog, wouldn't it?
We also should acquire it around updates of the sampling rate, which essentially is set_sampling_rate().
Why? In the worst case we may schedule the next timer with the old sampling rate. But do we really care enough about that race to add locking here as well?
OK
That works because we actually hold the mutex around the whole function, as otherwise we'd have seen races between delayed work items on different CPUs sharing the policy.
Is there any reason to acquire it in cpufreq_governor_limits(), then, for example?
Yeah, we are calling dbs_check_cpu(dbs_data, cpu) from that path, which will reevaluate the load.
Which means that we should take the lock around dbs_check_cpu() everywhere in a consistent way. Which in turn means that the lock actually does more than you said.
My point is basically that we seem to have a vague idea about what the lock is used for, while we need to know exactly why we need it.
Thanks, Rafael
On 28-10-15, 08:46, Rafael J. Wysocki wrote:
On Wednesday, October 28, 2015 12:13:17 PM Viresh Kumar wrote:
Actually yes, but the fourth patch of this series uses timer_mutex to fix a long-standing problem (which was previously worked around by hacking the code), and so we need to take the lock for the entire dbs_timer() routine.
Well, there is another reason why the lock is taken around the complete dbs_timer() routine. There are two parts to that routine:
- Checking whether load evaluation is required, and updating the last-update time.
- The load evaluation and frequency change itself.
Holding the lock around the first part makes sure that the timer handlers of other CPUs don't do the load evaluation in parallel, and that they don't do it before the sampling period has elapsed.
Holding it around the second part makes sure that only one thread does the load evaluation and frequency update at a time; the other possible caller is cpufreq_governor_limits(), so the same lock is taken across that part as well.
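In code, that two-part structure is roughly the following (a simplified sketch of dbs_timer(), not the verbatim source; declarations trimmed):

static void dbs_timer(struct work_struct *work)
{
	...
	mutex_lock(&shared->timer_mutex);

	/* Part 1: has the sampling period elapsed for this policy? */
	if (!need_load_eval(shared, sampling_rate))
		modify_all = false;	/* skip the actual load evaluation */

	/* Part 2: load evaluation + frequency update, one thread at a time. */
	delay = dbs_data->cdata->gov_dbs_timer(cdbs, dbs_data, modify_all);
	gov_queue_work(dbs_data, policy, delay, modify_all);

	mutex_unlock(&shared->timer_mutex);
}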
I don't actually think that that patch is correct and even if it is, we'll only need to do that *after* that patch, so at least it would be fair to say a word about it in the changelog, wouldn't it?
Hmm, if you agree with the above reasoning, then we may not need to update the changelog; otherwise I will mention this in the changelog of this patch.
Yeah, we are calling dbs_check_cpu(dbs_data, cpu) from that path, which will reevaluate the load.
Which means that we should take the lock around dbs_check_cpu() everywhere in a consistent way.
We already do this from everywhere.
Which in turn means that the lock actually does more than you said.
What I described towards the top is probably a better answer to the earlier query.
My point is basically that we seem to have a vague idea about what the lock is used for, while we need to know exactly why we need it.
I am totally with you on this; we have surely screwed up on locking in cpufreq for a long time, and we should know exactly why we want to change it now.
We currently update the sampling rate immediately for already-queued works, but only if the new expiry is earlier than the old one.
But what about the case where the user doesn't want frequent events and wants to increase the sampling period immediately? Shouldn't we cancel the works (and hence their interrupts) on all policy->cpus, which might otherwise fire very shortly?
This patch removes that special case and simplifies the code by always updating the expiry immediately.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq_ondemand.c | 25 ++++---------------------
 1 file changed, 4 insertions(+), 21 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 03ac6ce54042..bf0511a9735c 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -231,17 +231,8 @@ static unsigned int od_dbs_timer(struct cpu_dbs_info *cdbs,
 static struct common_dbs_data od_dbs_cdata;
 
 /**
- * update_sampling_rate - update sampling rate effective immediately if needed.
+ * update_sampling_rate - update sampling rate immediately.
  * @new_rate: new sampling rate
- *
- * If new rate is smaller than the old, simply updating
- * dbs_tuners_int.sampling_rate might not be appropriate. For example, if the
- * original sampling_rate was 1 second and the requested new sampling rate is 10
- * ms because the user needs immediate reaction from ondemand governor, but not
- * sure if higher frequency will be required or not, then, the governor may
- * change the sampling rate too late; up to 1 second later. Thus, if we are
- * reducing the sampling rate, we need to make the new value effective
- * immediately.
  */
 static void update_sampling_rate(struct dbs_data *dbs_data,
 		unsigned int new_rate)
@@ -255,7 +246,6 @@ static void update_sampling_rate(struct dbs_data *dbs_data,
 	for_each_online_cpu(cpu) {
 		struct cpufreq_policy *policy;
 		struct od_cpu_dbs_info_s *dbs_info;
-		unsigned long next_sampling, appointed_at;
 
 		policy = cpufreq_cpu_get(cpu);
 		if (!policy)
@@ -270,16 +260,9 @@ static void update_sampling_rate(struct dbs_data *dbs_data,
 		if (!delayed_work_pending(&dbs_info->cdbs.dwork))
 			continue;
 
-		next_sampling = jiffies + usecs_to_jiffies(new_rate);
-		appointed_at = dbs_info->cdbs.dwork.timer.expires;
-
-		if (time_before(next_sampling, appointed_at)) {
-			cancel_delayed_work_sync(&dbs_info->cdbs.dwork);
-
-			gov_queue_work(dbs_data, policy,
-				       usecs_to_jiffies(new_rate), true);
-
-		}
+		cancel_delayed_work_sync(&dbs_info->cdbs.dwork);
+		gov_queue_work(dbs_data, policy, usecs_to_jiffies(new_rate),
+			       true);
 	}
 }
On Tuesday, October 13, 2015 01:39:02 PM Viresh Kumar wrote:
We currently update the sampling rate immediately for already-queued works, but only if the new expiry is earlier than the old one.
But what about the case where the user doesn't want frequent events and wants to increase the sampling period immediately? Shouldn't we cancel the works (and hence their interrupts) on all policy->cpus, which might otherwise fire very shortly?
This patch removes that special case and simplifies the code by always updating the expiry immediately.
The changelog is a complete disaster. :-/
Your argument seems to be that it should be OK to do the cancel_delayed_work_sync()/gov_queue_work() combo in all cases, because even if the new rate is greater than the old one, the user may actually want it to take effect immediately and it shouldn't hurt to skip the next sample anyway in that case.
Is this really the case, though? What if the old rate is 1s, the new one is 2s, and the timer is just about to expire? Won't the canceling effectively move the next sample 3s away from the previous one, which may not be desirable?
The current code just allows the timer to expire, unless that would prevent the new rate from taking effect for too long, which seems perfectly reasonable to me.
All that seems to be racy with respect to the delayed work execution, but that's a different problem.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>

 drivers/cpufreq/cpufreq_ondemand.c | 25 ++++---------------------
 1 file changed, 4 insertions(+), 21 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index 03ac6ce54042..bf0511a9735c 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -231,17 +231,8 @@ static unsigned int od_dbs_timer(struct cpu_dbs_info *cdbs,
 static struct common_dbs_data od_dbs_cdata;
 
 /**
- * update_sampling_rate - update sampling rate effective immediately if needed.
+ * update_sampling_rate - update sampling rate immediately.
  * @new_rate: new sampling rate
- *
- * If new rate is smaller than the old, simply updating
- * dbs_tuners_int.sampling_rate might not be appropriate. For example, if the
- * original sampling_rate was 1 second and the requested new sampling rate is 10
- * ms because the user needs immediate reaction from ondemand governor, but not
- * sure if higher frequency will be required or not, then, the governor may
- * change the sampling rate too late; up to 1 second later. Thus, if we are
- * reducing the sampling rate, we need to make the new value effective
- * immediately.
  */
 static void update_sampling_rate(struct dbs_data *dbs_data,
 		unsigned int new_rate)
@@ -255,7 +246,6 @@ static void update_sampling_rate(struct dbs_data *dbs_data,
 	for_each_online_cpu(cpu) {
 		struct cpufreq_policy *policy;
 		struct od_cpu_dbs_info_s *dbs_info;
-		unsigned long next_sampling, appointed_at;
 
 		policy = cpufreq_cpu_get(cpu);
 		if (!policy)
@@ -270,16 +260,9 @@ static void update_sampling_rate(struct dbs_data *dbs_data,
 		if (!delayed_work_pending(&dbs_info->cdbs.dwork))
 			continue;
 
-		next_sampling = jiffies + usecs_to_jiffies(new_rate);
-		appointed_at = dbs_info->cdbs.dwork.timer.expires;
-
-		if (time_before(next_sampling, appointed_at)) {
-			cancel_delayed_work_sync(&dbs_info->cdbs.dwork);
-
-			gov_queue_work(dbs_data, policy,
-				       usecs_to_jiffies(new_rate), true);
-
-		}
+		cancel_delayed_work_sync(&dbs_info->cdbs.dwork);
+		gov_queue_work(dbs_data, policy, usecs_to_jiffies(new_rate),
+			       true);
 	}
 }
Thanks, Rafael
On 28-10-15, 07:28, Rafael J. Wysocki wrote:
Your argument seems to be that it should be OK to do the cancel_delayed_work_sync()/gov_queue_work() combo in all cases, because even if the new rate is greater than the old one, the user may actually want it to take effect immediately and it shouldn't hurt to skip the next sample anyway in that case.
Is this really the case, though? What if the old rate is 1s, the new one is 2s, and the timer is just about to expire? Won't the canceling effectively move the next sample 3s away from the previous one, which may not be desirable?
The current code just allows the timer to expire, unless that would prevent the new rate from taking effect for too long, which seems perfectly reasonable to me.
Okay, what about this case: the old rate is 1s, the new rate is 5s, and we have just serviced the timer. With the current code we will evaluate again after 1 second instead of 5. Is that desirable?
I didn't want to keep special code for such corner cases. And how many times are we going to update the sampling rate anyway?
But if we want to do something special, then we could schedule the work with the following delay:
delay = shared->time_stamp + new_sampling_rate.
shared->time_stamp is the last time we evaluated the load.
With this, we will fire at exactly the requested time, relative to the last time we evaluated the load.
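For illustration, that could look something like this (sketch only; it assumes a jiffies-based last_eval_jiffies stamp for simplicity, while the governor actually keeps shared->time_stamp):

	/* Re-queue so the next sample lands new_rate after the last evaluation. */
	unsigned long next = last_eval_jiffies + usecs_to_jiffies(new_rate);
	unsigned long delay = time_after(next, jiffies) ? next - jiffies : 0;

	gov_queue_work(dbs_data, policy, delay, true);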
On Wednesday, October 28, 2015 03:01:09 PM Viresh Kumar wrote:
On 28-10-15, 07:28, Rafael J. Wysocki wrote:
Your argument seems to be that it should be OK to do the cancel_delayed_work_sync()/gov_queue_work() combo in all cases, because even if the new rate is greater than the old one, the user may actually want it to take effect immediately and it shouldn't hurt to skip the next sample anyway in that case.
Is this really the case, though? What if the old rate is 1s, the new one is 2s, and the timer is just about to expire? Won't the canceling effectively move the next sample 3s away from the previous one, which may not be desirable?
The current code just allows the timer to expire, unless that would prevent the new rate from taking effect for too long, which seems perfectly reasonable to me.
Okay, what about this case: the old rate is 1s, the new rate is 5s, and we have just serviced the timer. With the current code we will evaluate again after 1 second instead of 5. Is that desirable?
That is OK.
The change is not guaranteed to happen instantaneously and the old rate may stay in effect for a longer while.
The case in which that may be annoying (but arguably not incorrect) is when the new rate is much less than the old one, but that is currently optimized for.
I didn't want to keep special code for such corner cases. And how many times are we going to update the sampling rate anyway?
But if we want to do something special, then we could schedule the work with the following delay:
delay = shared->time_stamp + new_sampling_rate.
shared->time_stamp is the last time we evaluated the load.
With this, we will fire at exactly the requested time, relative to the last time we evaluated the load.
Is the current code really problematic?
Thanks, Rafael
On 28-10-15, 16:31, Rafael J. Wysocki wrote:
Is the current code really problematic?
It's not problematic; I just didn't like having special-case code here.
Also, it's a blocker for the next patch, which tries to schedule work on all the policy->cpus together.
On Wednesday, October 28, 2015 08:58:11 PM Viresh Kumar wrote:
On 28-10-15, 16:31, Rafael J. Wysocki wrote:
Is the current code really problematic?
It's not problematic; I just didn't like having special-case code here.
Also, it's a blocker for the next patch, which tries to schedule work on all the policy->cpus together.
Well, the second statement above sort of contradicts the first one. :-)
I guess the answer is "it is problematic, because I can't do the other optimization then".
To that I'd really suggest trying to rework the code to use timer functions directly in the first place.
Thanks, Rafael
On 28-10-15, 17:13, Rafael J. Wysocki wrote:
Well, the second statement above sort of contradicts the first one. :-)
I guess the answer is "it is problematic, because I can't do the other optimization then".
Hehe, right.
To that I'd really suggest trying to rework the code to use timer functions directly in the first place.
I will, but this problem will be present there as well, because at that point we will be talking about per-CPU timers instead of delayed works.
Currently update_sampling_rate() iterates over each online CPU and cancels/queues work on it. That is very inefficient for the case where a single policy manages multiple CPUs, as they can be processed together.
Also drop the unnecessary cancel_delayed_work_sync() as we are doing a mod_delayed_work_on() in gov_queue_work(), which will take care of pending works for us.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq_ondemand.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_ondemand.c b/drivers/cpufreq/cpufreq_ondemand.c
index bf0511a9735c..4677c7e9d534 100644
--- a/drivers/cpufreq/cpufreq_ondemand.c
+++ b/drivers/cpufreq/cpufreq_ondemand.c
@@ -238,29 +238,36 @@ static void update_sampling_rate(struct dbs_data *dbs_data,
 		unsigned int new_rate)
 {
 	struct od_dbs_tuners *od_tuners = dbs_data->tuners;
+	struct cpufreq_policy *policy;
+	struct od_cpu_dbs_info_s *dbs_info;
+	struct cpumask cpumask;
 	int cpu;
 
+	cpumask_copy(&cpumask, cpu_online_mask);
+
 	od_tuners->sampling_rate = new_rate = max(new_rate,
 			dbs_data->min_sampling_rate);
 
-	for_each_online_cpu(cpu) {
-		struct cpufreq_policy *policy;
-		struct od_cpu_dbs_info_s *dbs_info;
-
+	for_each_cpu(cpu, &cpumask) {
 		policy = cpufreq_cpu_get(cpu);
 		if (!policy)
 			continue;
+
+		/* clear all CPUs of this policy */
+		cpumask_andnot(&cpumask, &cpumask, policy->cpus);
+
 		if (policy->governor != &cpufreq_gov_ondemand) {
 			cpufreq_cpu_put(policy);
 			continue;
 		}
+
 		dbs_info = &per_cpu(od_cpu_dbs_info, cpu);
 		cpufreq_cpu_put(policy);
 
-		if (!delayed_work_pending(&dbs_info->cdbs.dwork))
+		/* Make sure the work is not canceled on policy->cpus */
+		if (!dbs_info->cdbs.shared->policy)
 			continue;
 
-		cancel_delayed_work_sync(&dbs_info->cdbs.dwork);
 		gov_queue_work(dbs_data, policy, usecs_to_jiffies(new_rate),
 			       true);
 	}
On Tuesday, October 13, 2015 01:39:03 PM Viresh Kumar wrote:
Currently update_sampling_rate() iterates over each online CPU and cancels/queues work on it. That is very inefficient for the case where a single policy manages multiple CPUs, as they can be processed together.
In the case of one policy object shared between multiple CPUs, I'm wondering why we don't use a single delayed work function for all of them in the first place. That would address the problem at the source instead of dealing with the symptoms.
Also drop the unnecessary cancel_delayed_work_sync() as we are doing a mod_delayed_work_on() in gov_queue_work(), which will take care of pending works for us.
I'd prefer a separate patch for that if poss.
Thanks, Rafael
On 28-10-15, 07:38, Rafael J. Wysocki wrote:
On Tuesday, October 13, 2015 01:39:03 PM Viresh Kumar wrote:
Currently update_sampling_rate() iterates over each online CPU and cancels/queues work on it. That is very inefficient for the case where a single policy manages multiple CPUs, as they can be processed together.
In the case of one policy object shared between multiple CPUs, I'm wondering why we don't use a single delayed work function for all of them in the first place. That would address the problem at the source instead of dealing with the symptoms.
That's what we had long ago. The problem is that the timers queued for cpufreq are deferrable, so if the CPU on which the timer is queued goes idle, the governor halts, even though other CPUs in the policy->cpus group may still be running.
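(For reference, the deferrable behaviour mentioned above comes from how the per-CPU work items are initialized; a small sketch, reusing the dwork/dbs_timer names seen in the diffs in this thread:)

	/* A deferrable work's timer does not wake an idle CPU to run it. */
	INIT_DEFERRABLE_WORK(&cdbs->dwork, dbs_timer);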
Also drop the unnecessary cancel_delayed_work_sync() as we are doing a mod_delayed_work_on() in gov_queue_work(), which will take care of pending works for us.
I'd prefer a separate patch for that if poss.
okay.
On Wednesday, October 28, 2015 12:16:35 PM Viresh Kumar wrote:
On 28-10-15, 07:38, Rafael J. Wysocki wrote:
On Tuesday, October 13, 2015 01:39:03 PM Viresh Kumar wrote:
Currently update_sampling_rate() iterates over each online CPU and cancels/queues work on it. That is very inefficient for the case where a single policy manages multiple CPUs, as they can be processed together.
In the case of one policy object shared between multiple CPUs, I'm wondering why we don't use a single delayed work function for all of them in the first place. That would address the problem at the source instead of dealing with the symptoms.
That's what we had long ago. The problem is that the timers queued for cpufreq are deferrable, so if the CPU on which the timer is queued goes idle, the governor halts, even though other CPUs in the policy->cpus group may still be running.
It looks like we shouldn't be using delayed works for this, really.
We should be using timer functions and normal work items. Schedule the timer function on all CPUs sharing the policy and then queue up the work item from the first one that executes the timer. Then make the timer function bail out immediately until the work has completed and re-schedule the timers from the work item.
Thanks, Rafael
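For concreteness, the proposed timers-plus-single-work-item scheme could look roughly like this (an illustrative sketch with invented names such as struct policy_dbs, work_in_progress, evaluate_load_and_set_freq() and gov_rearm_policy_timers(), not existing kernel symbols; timer callbacks took an unsigned long argument in kernels of this era):

static void dbs_timer_fn(unsigned long data)
{
	struct policy_dbs *pd = (struct policy_dbs *)data;

	/* First CPU whose timer fires queues the work; the rest bail out. */
	if (!atomic_cmpxchg(&pd->work_in_progress, 0, 1))
		queue_work(system_wq, &pd->work);
}

static void dbs_work_handler(struct work_struct *work)
{
	struct policy_dbs *pd = container_of(work, struct policy_dbs, work);

	evaluate_load_and_set_freq(pd);		/* placeholder helper */

	/* Allow the next sample, then re-arm the per-CPU timers. */
	atomic_set(&pd->work_in_progress, 0);
	gov_rearm_policy_timers(pd);	/* placeholder: mod_timer() on each policy->cpus */
}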
On 28-10-15, 08:33, Rafael J. Wysocki wrote:
It looks like we shouldn't be using delayed works for this, really.
We should be using timer functions and normal work items. Schedule the timer function on all CPUs sharing the policy and then queue up the work item from the first one that executes the timer. Then make the timer function bail out immediately until the work has completed and re-schedule the timers from the work item.
Okay, I will try to get some code out for that then.
cpufreq_governor_lock is abused by being used outside of the cpufreq core, i.e. in the cpufreq governors. But we didn't have a better solution to the problem (described later) at that point, and the following was the only acceptable solution:
6f1e4efd882e ("cpufreq: Fix timer/workqueue corruption by protecting reading governor_enabled")
The cpufreq governor core is fixed against possible races now and things are in much better shape.
The original problem:
When a CPU is hot unplugged, we cancel delayed works for all policy->cpus via gov_cancel_work(). If the work is already running on any CPU, the workqueue code will wait for the work to finish, to prevent the work items from re-queuing themselves.
This works most of the time, except for the case where the work handler determines that it should adjust the delay for all the other CPUs that the policy is managing. When this happens, the canceling CPU cancels its own work, but the still-running handler can queue up works on the other policy->cpus again.
For example, consider CPUs 0-4 in a policy, for which gov_cancel_work() has been called. The workqueue core has canceled the works for CPUs 0-3 and is waiting for the handler on CPU 4 to finish. At that point, the handler on CPU 4 can restart the works on CPUs 0-3, so CPUs 0-3 end up running works that the governor core believes are canceled.
To fix this in a different (non-hacky) way, set shared->policy to NULL before trying to cancel the work. It is updated under timer_mutex, which prevents the work-handlers from proceeding. Once a work-handler finds that we are already trying to stop the governor, it will exit early, and that prevents works from being queued again as well.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq_governor.c | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 750626d8fb03..931424ca96d9 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -171,10 +171,6 @@ void gov_queue_work(struct dbs_data *dbs_data, struct cpufreq_policy *policy,
 {
 	int i;
 
-	mutex_lock(&cpufreq_governor_lock);
-	if (!policy->governor_enabled)
-		goto out_unlock;
-
 	if (!all_cpus) {
 		/*
 		 * Use raw_smp_processor_id() to avoid preemptible warnings.
@@ -188,9 +184,6 @@ void gov_queue_work(struct dbs_data *dbs_data, struct cpufreq_policy *policy,
 		for_each_cpu(i, policy->cpus)
 			__gov_queue_work(i, dbs_data, delay);
 	}
-
-out_unlock:
-	mutex_unlock(&cpufreq_governor_lock);
 }
 EXPORT_SYMBOL_GPL(gov_queue_work);
 
@@ -229,13 +222,24 @@ static void dbs_timer(struct work_struct *work)
 	struct cpu_dbs_info *cdbs = container_of(work, struct cpu_dbs_info,
 						 dwork.work);
 	struct cpu_common_dbs_info *shared = cdbs->shared;
-	struct cpufreq_policy *policy = shared->policy;
-	struct dbs_data *dbs_data = policy->governor_data;
+	struct cpufreq_policy *policy;
+	struct dbs_data *dbs_data;
 	unsigned int sampling_rate, delay;
 	bool modify_all = true;
 
 	mutex_lock(&shared->timer_mutex);
 
+	policy = shared->policy;
+
+	/*
+	 * Governor might already be disabled and there is no point continuing
+	 * with the work-handler.
+	 */
+	if (!policy)
+		goto unlock;
+
+	dbs_data = policy->governor_data;
+
 	if (dbs_data->cdata->governor == GOV_CONSERVATIVE) {
 		struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
@@ -252,6 +256,7 @@ static void dbs_timer(struct work_struct *work)
 	delay = dbs_data->cdata->gov_dbs_timer(cdbs, dbs_data, modify_all);
 	gov_queue_work(dbs_data, policy, delay, modify_all);
 
+unlock:
 	mutex_unlock(&shared->timer_mutex);
 }
 
@@ -488,9 +493,17 @@ static int cpufreq_governor_stop(struct cpufreq_policy *policy,
 	if (!shared || !shared->policy)
 		return -EBUSY;
 
+	/*
+	 * Work-handler must see this updated, as it should not proceed any
+	 * further after governor is disabled. And so timer_mutex is taken while
+	 * updating this value.
+	 */
+	mutex_lock(&shared->timer_mutex);
+	shared->policy = NULL;
+	mutex_unlock(&shared->timer_mutex);
+
 	gov_cancel_work(dbs_data, policy);
 
-	shared->policy = NULL;
 	mutex_destroy(&shared->timer_mutex);
 	return 0;
 }
On Tuesday, October 13, 2015 01:39:04 PM Viresh Kumar wrote:
cpufreq_governor_lock is abused by being used outside of the cpufreq core, i.e. in the cpufreq governors. But we didn't have a better solution to the problem (described later) at that point, and the following was the only acceptable solution:
6f1e4efd882e ("cpufreq: Fix timer/workqueue corruption by protecting reading governor_enabled")
The cpufreq governor core is fixed against possible races now and things are in much better shape.
The original problem:
When a CPU is hot unplugged, we cancel delayed works for all policy->cpus via gov_cancel_work(). If the work is already running on any CPU, the workqueue code will wait for the work to finish, to prevent the work items from re-queuing themselves.
This works most of the time, except for the case where the work handler determines that it should adjust the delay for all the other CPUs that the policy is managing. When this happens, the canceling CPU cancels its own work, but the still-running handler can queue up works on the other policy->cpus again.
For example, consider CPUs 0-4 in a policy, for which gov_cancel_work() has been called. The workqueue core has canceled the works for CPUs 0-3 and is waiting for the handler on CPU 4 to finish. At that point, the handler on CPU 4 can restart the works on CPUs 0-3, so CPUs 0-3 end up running works that the governor core believes are canceled.
To fix this in a different (non-hacky) way, set shared->policy to NULL before trying to cancel the work. It is updated under timer_mutex, which prevents the work-handlers from proceeding. Once a work-handler finds that we are already trying to stop the governor, it will exit early, and that prevents works from being queued again as well.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
I have a hard time figuring out what the patch is supposed to achieve from the above.
Do we eventually want to get rid of cpufreq_governor_lock and that's why we're doing this?
 drivers/cpufreq/cpufreq_governor.c | 33 +++++++++++++++++++++++----------
 1 file changed, 23 insertions(+), 10 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 750626d8fb03..931424ca96d9 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -171,10 +171,6 @@ void gov_queue_work(struct dbs_data *dbs_data, struct cpufreq_policy *policy,
 {
 	int i;
 
-	mutex_lock(&cpufreq_governor_lock);
-	if (!policy->governor_enabled)
-		goto out_unlock;
-
 	if (!all_cpus) {
 		/*
 		 * Use raw_smp_processor_id() to avoid preemptible warnings.
@@ -188,9 +184,6 @@ void gov_queue_work(struct dbs_data *dbs_data, struct cpufreq_policy *policy,
 		for_each_cpu(i, policy->cpus)
 			__gov_queue_work(i, dbs_data, delay);
 	}
-
-out_unlock:
-	mutex_unlock(&cpufreq_governor_lock);
 }
 EXPORT_SYMBOL_GPL(gov_queue_work);
 
@@ -229,13 +222,24 @@ static void dbs_timer(struct work_struct *work)
 	struct cpu_dbs_info *cdbs = container_of(work, struct cpu_dbs_info,
 						 dwork.work);
 	struct cpu_common_dbs_info *shared = cdbs->shared;
-	struct cpufreq_policy *policy = shared->policy;
-	struct dbs_data *dbs_data = policy->governor_data;
+	struct cpufreq_policy *policy;
+	struct dbs_data *dbs_data;
 	unsigned int sampling_rate, delay;
 	bool modify_all = true;
 
 	mutex_lock(&shared->timer_mutex);
 
+	policy = shared->policy;
+
+	/*
+	 * Governor might already be disabled and there is no point continuing
+	 * with the work-handler.
+	 */
+	if (!policy)
+		goto unlock;
+
+	dbs_data = policy->governor_data;
+
 	if (dbs_data->cdata->governor == GOV_CONSERVATIVE) {
 		struct cs_dbs_tuners *cs_tuners = dbs_data->tuners;
 
@@ -252,6 +256,7 @@ static void dbs_timer(struct work_struct *work)
 	delay = dbs_data->cdata->gov_dbs_timer(cdbs, dbs_data, modify_all);
 	gov_queue_work(dbs_data, policy, delay, modify_all);
 
+unlock:
 	mutex_unlock(&shared->timer_mutex);
 }
 
@@ -488,9 +493,17 @@ static int cpufreq_governor_stop(struct cpufreq_policy *policy,
 	if (!shared || !shared->policy)
 		return -EBUSY;
 
+	/*
+	 * Work-handler must see this updated, as it should not proceed any
+	 * further after governor is disabled. And so timer_mutex is taken while
+	 * updating this value.
+	 */
+	mutex_lock(&shared->timer_mutex);
+	shared->policy = NULL;
+	mutex_unlock(&shared->timer_mutex);

So this assumes that dbs_timer() will acquire the mutex and see that shared->policy is now NULL, so it will bail out immediately, but ->

+
 	gov_cancel_work(dbs_data, policy);
 
-	shared->policy = NULL;
 	mutex_destroy(&shared->timer_mutex);

-> the mutex is destroyed here, so what's the guarantee that the mutex will still be around when dbs_timer() runs?

 	return 0;
 }
Thanks, Rafael
On 28-10-15, 08:10, Rafael J. Wysocki wrote:
I have a hard time figuring out what the patch is supposed to achieve from the above.
We had a problem earlier, where even after stopping the governor for a policy, the work was still queued for some of its CPUs.
We failed to understand the real problem then, and so abused the wider cpufreq_governor_lock.
I understand the problem much better now, and have a straightforward and precise solution for it.
Do we eventually want to get rid of cpufreq_governor_lock and that's why we're doing this?
That's another benefit we get out of this change.
+	mutex_lock(&shared->timer_mutex);
+	shared->policy = NULL;
+	mutex_unlock(&shared->timer_mutex);
Right.
So this assumes that dbs_timer() will acquire the mutex and see that shared->policy is now NULL, so it will bail out immediately, but ->
 	gov_cancel_work(dbs_data, policy);
 
-	shared->policy = NULL;
 	mutex_destroy(&shared->timer_mutex);
-> the mutex is destroyed here, so what's the guarantee that the mutex will still be around when dbs_timer() runs?
You really got me worried for a few minutes :)
The earlier update of shared->policy to NULL makes sure that no work-handler can start any real work. After the unlock, the work handlers may start executing, but they will return early.
We also have gov_cancel_work(), which waits until all the currently running handlers have finished executing and no work is queued.
And so we are sure that there are no users of the mutex when it is destroyed.
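For reference, the shape of gov_cancel_work() that this argument relies on is roughly the following (a sketch, not the verbatim source):

static void gov_cancel_work(struct dbs_data *dbs_data,
			    struct cpufreq_policy *policy)
{
	struct cpu_dbs_info *cdbs;
	int i;

	/* Waits for any running handler on each policy CPU to complete. */
	for_each_cpu(i, policy->cpus) {
		cdbs = dbs_data->cdata->get_cpu_cdbs(i);
		cancel_delayed_work_sync(&cdbs->dwork);
	}
}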
On Wednesday, October 28, 2015 01:55:59 PM Viresh Kumar wrote:
On 28-10-15, 08:10, Rafael J. Wysocki wrote:
I have a hard time figuring out what the patch is supposed to achieve from the above.
We had a problem earlier, where even after stopping the governor for a policy, the work was still queued for some of its CPUs.
We failed to understand the real problem then, and so abused the wider cpufreq_governor_lock.
I understand the problem much better now, and have a straightforward and precise solution for it.
Do we eventually want to get rid of cpufreq_governor_lock and that's why we're doing this?
That's another benefit we get out of this change.
+	mutex_lock(&shared->timer_mutex);
+	shared->policy = NULL;
+	mutex_unlock(&shared->timer_mutex);
Right.
So this assumes that dbs_timer() will acquire the mutex and see that shared->policy is now NULL, so it will bail out immediately, but ->
 	gov_cancel_work(dbs_data, policy);
 
-	shared->policy = NULL;
 	mutex_destroy(&shared->timer_mutex);
-> the mutex is destroyed here, so what's the guarantee that the mutex will still be around when dbs_timer() runs?
You really got me worried for a few minutes :)
The earlier update of shared->policy to NULL makes sure that no work-handler can start any real work. After the unlock, the work handlers may start executing, but they will return early.
That's not sufficient, because it doesn't guarantee that the lock will be dropped before we destroy it.
We also have gov_cancel_work(), which waits until all the currently running handlers have finished executing and no work is queued.
And so we are sure that there are no users of the mutex when it is destroyed.
OK, so the gov_cancel_work() provides the guarantee.
So this is a changelog matching your patch:
"gov_queue_work() acquires cpufreq_governor_lock to allow cpufreq_governor_stop() to drain delayed work items possibly scheduled on CPUs that share the policy with a CPU being taken offline.
However, the same goal may be achieved in a more straightforward way if the policy pointer in the struct cpu_dbs_info matching the policy CPU is reset upfront by cpufreq_governor_stop() under the timer_mutex belonging to it and checked against NULL, under the same lock, at the beginning of dbs_timer().
In that case every instance of dbs_timer() run for a struct cpu_dbs_info sharing the policy pointer in question after cpufreq_governor_stop() has started will notice that that pointer is NULL and bail out immediately without queuing up any new work items. In turn, gov_cancel_work() called by cpufreq_governor_stop() before destroying timer_mutex will wait for all of the delayed work items currently running on the CPUs sharing the policy to drop the mutex, so it may be destroyed safely.
Make cpufreq_governor_stop() and dbs_timer() work as described and modify gov_queue_work() so it does not acquire cpufreq_governor_lock any more."
Thanks, Rafael
On 28-10-15, 16:12, Rafael J. Wysocki wrote:
So this is a changelog matching your patch:
"gov_queue_work() acquires cpufreq_governor_lock to allow cpufreq_governor_stop() to drain delayed work items possibly scheduled on CPUs that share the policy with a CPU being taken offline.
However, the same goal may be achieved in a more straightforward way if the policy pointer in the struct cpu_dbs_info matching the policy CPU is reset upfront by cpufreq_governor_stop() under the timer_mutex belonging to it and checked against NULL, under the same lock, at the beginning of dbs_timer().
In that case every instance of dbs_timer() run for a struct cpu_dbs_info sharing the policy pointer in question after cpufreq_governor_stop() has started will notice that that pointer is NULL and bail out immediately without queuing up any new work items. In turn, gov_cancel_work() called by cpufreq_governor_stop() before destroying timer_mutex will wait for all of the delayed work items currently running on the CPUs sharing the policy to drop the mutex, so it may be destroyed safely.
Make cpufreq_governor_stop() and dbs_timer() work as described and modify gov_queue_work() so it does not acquire cpufreq_governor_lock any more."
Looks far better, thanks :)
Invalid state transitions are now caught by the governor core, and there is no need to replicate that check in the cpufreq core.
Stop verifying the same thing in the cpufreq core. That gets rid of policy->governor_enabled and cpufreq_governor_lock.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
---
 drivers/cpufreq/cpufreq.c | 24 ------------------------
 include/linux/cpufreq.h  |  1 -
 2 files changed, 25 deletions(-)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 25c4c15103a0..0c89b09672e4 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -102,7 +102,6 @@ static LIST_HEAD(cpufreq_governor_list);
 static struct cpufreq_driver *cpufreq_driver;
 static DEFINE_PER_CPU(struct cpufreq_policy *, cpufreq_cpu_data);
 static DEFINE_RWLOCK(cpufreq_driver_lock);
-DEFINE_MUTEX(cpufreq_governor_lock);
 
 /* Flag to suspend/resume CPUFreq governors */
 static bool cpufreq_suspended;
@@ -2035,21 +2034,6 @@ static int __cpufreq_governor(struct cpufreq_policy *policy,
 	pr_debug("%s: for CPU %u, event %u\n", __func__, policy->cpu, event);
 
-	mutex_lock(&cpufreq_governor_lock);
-	if ((policy->governor_enabled && event == CPUFREQ_GOV_START)
-	    || (!policy->governor_enabled
-	    && (event == CPUFREQ_GOV_LIMITS || event == CPUFREQ_GOV_STOP))) {
-		mutex_unlock(&cpufreq_governor_lock);
-		return -EBUSY;
-	}
-
-	if (event == CPUFREQ_GOV_STOP)
-		policy->governor_enabled = false;
-	else if (event == CPUFREQ_GOV_START)
-		policy->governor_enabled = true;
-
-	mutex_unlock(&cpufreq_governor_lock);
-
 	ret = policy->governor->governor(policy, event);
 
 	if (!ret) {
@@ -2057,14 +2041,6 @@ static int __cpufreq_governor(struct cpufreq_policy *policy,
 			policy->governor->initialized++;
 		else if (event == CPUFREQ_GOV_POLICY_EXIT)
 			policy->governor->initialized--;
-	} else {
-		/* Restore original values */
-		mutex_lock(&cpufreq_governor_lock);
-		if (event == CPUFREQ_GOV_STOP)
-			policy->governor_enabled = true;
-		else if (event == CPUFREQ_GOV_START)
-			policy->governor_enabled = false;
-		mutex_unlock(&cpufreq_governor_lock);
 	}
 
 	if (((event == CPUFREQ_GOV_POLICY_INIT) && ret) ||
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index dca22de98d94..d211da0763b7 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -80,7 +80,6 @@ struct cpufreq_policy {
 	unsigned int		policy;		/* see above */
 	struct cpufreq_governor	*governor;	/* see below */
 	void			*governor_data;
-	bool			governor_enabled; /* governor start/stop flag */
 	char			last_governor[CPUFREQ_NAME_LEN]; /* last governor used */
 
 	struct work_struct	update;		/* if update_policy() needs to be