In cpufreq_sched_set_cap we currently have this:
/*
 * We only change frequency if this cpu's capacity request represents a
 * new max. If another cpu has requested a capacity greater than the
 * previous max then we rely on that cpu to hit this code path and make
 * the change. IOW, the cpu with the new max capacity is responsible
 * for setting the new capacity/frequency.
 *
 * If this cpu is not the new maximum then bail
 */
if (capacity_max > capacity)
	goto out;
But this can lead to situations like (2 CPU cluster, CPUs start with cap request of 0):

1. CPU0 gets heavily loaded, requests cap = 1024 (fmax)
2. CPU1 gets lightly loaded, requests cap = 10
3. CPU0's load goes away, requests cap = 0
4. CPU1's load of 10 persists for a long time
In step #3 we could've set the cluster capacity to 10/1024 but did not because the CPU we were working with at the time (CPU0) was not the CPU driving the new cluster maximum capacity request. As a result we run unnecessarily at fmax for a long time.
Any reason to not set the OPP associated with the new max capacity request immediately, regardless of what CPU is driving it?
thanks, Steve
On Wed, Nov 11, 2015 at 05:28:50PM -0800, Steve Muckle wrote:
In cpufreq_sched_set_cap we currently have this:
/*
 * We only change frequency if this cpu's capacity request represents a
 * new max. If another cpu has requested a capacity greater than the
 * previous max then we rely on that cpu to hit this code path and make
 * the change. IOW, the cpu with the new max capacity is responsible
 * for setting the new capacity/frequency.
 *
 * If this cpu is not the new maximum then bail
 */
if (capacity_max > capacity)
	goto out;
But this can lead to situations like (2 CPU cluster, CPUs start with cap request of 0):
1. CPU0 gets heavily loaded, requests cap = 1024 (fmax)
2. CPU1 gets lightly loaded, requests cap = 10
3. CPU0's load goes away, requests cap = 0
4. CPU1's load of 10 persists for a long time
In step #3 we could've set the cluster capacity to 10/1024 but did not because the CPU we were working with at the time (CPU0) was not the CPU driving the new cluster maximum capacity request. As a result we run unnecessarily at fmax for a long time.
Any reason to not set the OPP associated with the new max capacity request immediately, regardless of what CPU is driving it?
Good finding; this code is an optimization for increasing the OPP, but maybe it also introduces a bug when downgrading the OPP?
Thanks, Leo Yan
Hi Steve,
On 11/11/15, Steve Muckle wrote:
In cpufreq_sched_set_cap we currently have this:
/*
 * We only change frequency if this cpu's capacity request represents a
 * new max. If another cpu has requested a capacity greater than the
 * previous max then we rely on that cpu to hit this code path and make
 * the change. IOW, the cpu with the new max capacity is responsible
 * for setting the new capacity/frequency.
 *
 * If this cpu is not the new maximum then bail
 */
if (capacity_max > capacity)
	goto out;
But this can lead to situations like (2 CPU cluster, CPUs start with cap request of 0):
1. CPU0 gets heavily loaded, requests cap = 1024 (fmax)
2. CPU1 gets lightly loaded, requests cap = 10
3. CPU0's load goes away, requests cap = 0
4. CPU1's load of 10 persists for a long time
In step #3 we could've set the cluster capacity to 10/1024 but did not because the CPU we were working with at the time (CPU0) was not the CPU driving the new cluster maximum capacity request. As a result we run unnecessarily at fmax for a long time.
CPU1's tasks sleep, since utilization is low. Don't we re-evaluate the request as soon as someone goes to sleep/wakes up?
Any reason to not set the OPP associated with the new max capacity request immediately, regardless of what CPU is driving it?
OTOH, what you say makes sense too, it should be safe to go to 10/1024 as soon as CPU0 becomes idle.
Thanks,
- Juri
On 11/12/2015 01:42 AM, Juri Lelli wrote:
CPU1's tasks sleep, since utilization is low. Don't we re-evaluate the request as soon as someone goes to sleep/wakes up?
We used to, but there are now optimizations that cause the re-evaluation to get skipped if the wakeup/sleep doesn't result in a change in the sched class capacity request. I've run into this during limited testing:
- Big task runs then goes away on CPU X.
- Capacity vote of CPU X is reset to 0 but OPP is not changed, we remain at fmax.
- Tiny tasks wake and sleep on other CPUs in the same cluster - they are so small that the capacity request stays at 0 for those CPUs and we don't bother going into cpufreq_sched to re-evaluate the cluster.
I think we should go ahead with the ideas that were raised in the meeting earlier today to try and fix this as it will prevent meaningful profiling.
- Ensure idle CPUs have their utilization decayed promptly/regularly via load balance or some other periodic mechanism.
- Leave capacity vote in place when a CPU goes idle. As it is decayed, reduce its capacity vote and the cluster OPP as necessary.
- If platforms wish to be more aggressive about dropping the frequency during idle, this can be done in the idle loop/driver where a fully informed decision for that instance can be made.
I think we'll still have a major issue in that PELT's decay is too slow but that's nothing new and needs to be addressed anyway.
Thoughts? If no one has concerns or wants to pick this up then I'll start working on it.
On Thu, Nov 12, 2015 at 04:17:37PM -0800, Steve Muckle wrote:
On 11/12/2015 01:42 AM, Juri Lelli wrote:
CPU1's tasks sleep, since utilization is low. Don't we re-evaluate the request as soon as someone goes to sleep/wakes up?
We used to, but there are now optimizations that cause the re-evaluation to get skipped if the wakeup/sleep doesn't result in a change in the sched class capacity request.
I've run into this during limited testing:
- Big task runs then goes away on CPU X.
- Capacity vote of CPU X is reset to 0 but OPP is not changed, we remain
at fmax.
- Tiny tasks wake and sleep on other CPUs in the same cluster - they are
so small that the capacity request stays at 0 for those CPUs and we don't bother going into cpufreq_sched to re-evaluate the cluster.
You mean that a 'vote' for a 0 capacity no longer triggers an aggregation at scheduling class level? AFAIR this was the main (only) mechanism that allowed reducing the OPP as soon as the CPU running the highest load goes to sleep.
I think we should go ahead with the ideas that were raised in the meeting earlier today to try and fix this as it will prevent meaningful profiling.
- Ensure idle CPUs have their utilization decayed promptly/regularly via
load balance or some other periodic mechanism.
- Leave capacity vote in place when a CPU goes idle. As it is decayed,
reduce its capacity vote and the cluster OPP as necessary.
- If platforms wish to be more aggressive about dropping the frequency
during idle, this can be done in the idle loop/driver where a fully informed decision for that instance can be made.
I think we'll still have a major issue in that PELT's decay is too slow but that's nothing new and needs to be addressed anyway.
Thoughts? If no one has concerns or wants to pick this up then I'll start working on it.
On 11/13/2015 01:37 AM, Patrick Bellasi wrote:
- Big task runs then goes away on CPU X.
- Capacity vote of CPU X is reset to 0 but OPP is not changed, we remain
at fmax.
- Tiny tasks wake and sleep on other CPUs in the same cluster - they are
so small that the capacity request stays at 0 for those CPUs and we don't bother going into cpufreq_sched to re-evaluate the cluster.
You mean that a 'vote' for a 0 capacity no longer triggers an aggregation at scheduling class level? AFAIR this was the main (only) mechanism that allowed reducing the OPP as soon as the CPU running the highest load goes to sleep.
Well, even in the EASv5 posting when a CPU went idle, it set its capacity vote to 0 but did so passively via the special cpufreq_sched_reset_cap() API. This call did not initiate an immediate re-evaluation/aggregation for the cluster or set a new OPP. It was up to the next event in the cluster to cause that to happen when it set a capacity vote with the normal API, cpufreq_sched_set_capacity().
Because of a desire to speed up the fast paths of wakeup/sleep as much as possible, we now avoid going into cpufreq_sched if the capacity being requested does not represent a change for that CPU/sched_class. That's making things worse than they already were w.r.t. getting stuck at elevated OPPs.