In cpufreq_sched_set_cap we currently have this:
/*
 * We only change frequency if this cpu's capacity request represents a
 * new max. If another cpu has requested a capacity greater than the
 * previous max then we rely on that cpu to hit this code path and make
 * the change. IOW, the cpu with the new max capacity is responsible
 * for setting the new capacity/frequency.
 *
 * If this cpu is not the new maximum then bail
 */
if (capacity_max > capacity)
	goto out;
But this can lead to situations like (2 CPU cluster, CPUs start with cap request of 0):

1. CPU0 gets heavily loaded, requests cap = 1024 (fmax)
2. CPU1 gets lightly loaded, requests cap = 10
3. CPU0's load goes away, requests cap = 0
4. CPU1's load of 10 persists for a long time
In step #3 we could've set the cluster capacity to 10/1024 but did not because the CPU we were working with at the time (CPU0) was not the CPU driving the new cluster maximum capacity request. As a result we run unnecessarily at fmax for a long time.
Any reason to not set the OPP associated with the new max capacity request immediately, regardless of what CPU is driving it?
thanks, Steve
On Wed, Nov 11, 2015 at 05:28:50PM -0800, Steve Muckle wrote:
In cpufreq_sched_set_cap we currently have this:
/*
 * We only change frequency if this cpu's capacity request represents a
 * new max. If another cpu has requested a capacity greater than the
 * previous max then we rely on that cpu to hit this code path and make
 * the change. IOW, the cpu with the new max capacity is responsible
 * for setting the new capacity/frequency.
 *
 * If this cpu is not the new maximum then bail
 */
if (capacity_max > capacity)
	goto out;
But this can lead to situations like (2 CPU cluster, CPUs start with cap request of 0):
1. CPU0 gets heavily loaded, requests cap = 1024 (fmax)
2. CPU1 gets lightly loaded, requests cap = 10
3. CPU0's load goes away, requests cap = 0
4. CPU1's load of 10 persists for a long time
In step #3 we could've set the cluster capacity to 10/1024 but did not because the CPU we were working with at the time (CPU0) was not the CPU driving the new cluster maximum capacity request. As a result we run unnecessarily at fmax for a long time.
Any reason to not set the OPP associated with the new max capacity request immediately, regardless of what CPU is driving it?
Good finding; this code is an optimization for increasing the OPP, but maybe it also introduces a bug when downgrading the OPP?
Thanks, Leo Yan
Hi Steve,
On 11/11/15, Steve Muckle wrote:
In cpufreq_sched_set_cap we currently have this:
/*
 * We only change frequency if this cpu's capacity request represents a
 * new max. If another cpu has requested a capacity greater than the
 * previous max then we rely on that cpu to hit this code path and make
 * the change. IOW, the cpu with the new max capacity is responsible
 * for setting the new capacity/frequency.
 *
 * If this cpu is not the new maximum then bail
 */
if (capacity_max > capacity)
	goto out;
But this can lead to situations like (2 CPU cluster, CPUs start with cap request of 0):
1. CPU0 gets heavily loaded, requests cap = 1024 (fmax)
2. CPU1 gets lightly loaded, requests cap = 10
3. CPU0's load goes away, requests cap = 0
4. CPU1's load of 10 persists for a long time
In step #3 we could've set the cluster capacity to 10/1024 but did not because the CPU we were working with at the time (CPU0) was not the CPU driving the new cluster maximum capacity request. As a result we run unnecessarily at fmax for a long time.
CPU1's tasks sleep, since utilization is low. Don't we re-evaluate the request as soon as someone goes to sleep/wakes up?
Any reason to not set the OPP associated with the new max capacity request immediately, regardless of what CPU is driving it?
OTOH, what you say makes sense too, it should be safe to go to 10/1024 as soon as CPU0 becomes idle.
Thanks,
- Juri
On 11/12/2015 01:42 AM, Juri Lelli wrote:
CPU1's tasks sleep, since utilization is low. Don't we re-evaluate the request as soon as someone goes to sleep/wakes up?
We used to, but there are now optimizations that cause the re-evaluation to get skipped if the wakeup/sleep doesn't result in a change in the sched class capacity request. I've run into this during limited testing:
- Big task runs then goes away on CPU X.
- Capacity vote of CPU X is reset to 0 but OPP is not changed, we remain at fmax.
- Tiny tasks wake and sleep on other CPUs in the same cluster - they are so small that the capacity request stays at 0 for those CPUs and we don't bother going into cpufreq_sched to re-evaluate the cluster.
I think we should go ahead with the ideas that were raised in the meeting earlier today to try and fix this as it will prevent meaningful profiling.
- Ensure idle CPUs have their utilization decayed promptly/regularly via load balance or some other periodic mechanism.
- Leave capacity vote in place when a CPU goes idle. As it is decayed, reduce its capacity vote and the cluster OPP as necessary.
- If platforms wish to be more aggressive about dropping the frequency during idle, this can be done in the idle loop/driver where a fully informed decision for that instance can be made.
I think we'll still have a major issue in that PELT's decay is too slow but that's nothing new and needs to be addressed anyway.
Thoughts? If no one has concerns or wants to pick this up then I'll start working on it.
On Thu, Nov 12, 2015 at 04:17:37PM -0800, Steve Muckle wrote:
On 11/12/2015 01:42 AM, Juri Lelli wrote:
CPU1's tasks sleep, since utilization is low. Don't we re-evaluate the request as soon as someone goes to sleep/wakes up?
We used to, but there are now optimizations that cause the re-evaluation to get skipped if the wakeup/sleep doesn't result in a change in the sched class capacity request.
I've run into this during limited testing:
- Big task runs then goes away on CPU X.
- Capacity vote of CPU X is reset to 0 but OPP is not changed, we remain
at fmax.
- Tiny tasks wake and sleep on other CPUs in the same cluster - they are
so small that the capacity request stays at 0 for those CPUs and we don't bother going into cpufreq_sched to re-evaluate the cluster.
You mean that a 'vote' for a 0 capacity no longer triggers an aggregation at scheduling class level? AFAIR this was the main (only) mechanism that allowed reducing the OPP as soon as the CPU running the highest load goes to sleep.
I think we should go ahead with the ideas that were raised in the meeting earlier today to try and fix this as it will prevent meaningful profiling.
- Ensure idle CPUs have their utilization decayed promptly/regularly via
load balance or some other periodic mechanism.
- Leave capacity vote in place when a CPU goes idle. As it is decayed,
reduce its capacity vote and the cluster OPP as necessary.
- If platforms wish to be more aggressive about dropping the frequency
during idle, this can be done in the idle loop/driver where a fully informed decision for that instance can be made.
I think we'll still have a major issue in that PELT's decay is too slow but that's nothing new and needs to be addressed anyway.
Thoughts? If no one has concerns or wants to pick this up then I'll start working on it.
On 11/13/2015 01:37 AM, Patrick Bellasi wrote:
- Big task runs then goes away on CPU X.
- Capacity vote of CPU X is reset to 0 but OPP is not changed, we remain
at fmax.
- Tiny tasks wake and sleep on other CPUs in the same cluster - they are
so small that the capacity request stays at 0 for those CPUs and we don't bother going into cpufreq_sched to re-evaluate the cluster.
You mean that a 'vote' for a 0 capacity no longer triggers an aggregation at scheduling class level? AFAIR this was the main (only) mechanism that allowed reducing the OPP as soon as the CPU running the highest load goes to sleep.
Well, even in the EASv5 posting when a CPU went idle, it set its capacity vote to 0 but did so passively via the special cpufreq_sched_reset_cap() API. This call did not initiate an immediate re-evaluation/aggregation for the cluster or set a new OPP. It was up to the next event in the cluster to cause that to happen when it set a capacity vote with the normal API, cpufreq_sched_set_capacity().
Because of a desire to speed up the fast paths of wakeup/sleep as much as possible, we now avoid going into cpufreq_sched if the capacity being requested does not represent a change for that CPU/sched_class. That's making things worse than they already were w.r.t. getting stuck at elevated OPPs.