(adding eas-dev)
On 10/09/2015 01:41 AM, Patrick Bellasi wrote:
The users might not be happy though
so we could favor performance for RT tasks to avoid breaking legacy software and ask users that care about energy to migrate to deadline where we actually know the performance constraints of the tasks.
Given that at the moment RT tasks are treated no differently than CFS tasks w.r.t. cpu frequency I'd expect that we could get away without any sort of perf bias for RT bandwidth, which I think would be cost prohibitive for power.
Are you specifically considering instantaneous power? Because from a power standpoint I cannot see any difference, for example, w.r.t. having a batch CFS task.
From an energy standpoint instead, don't you think that a "race-to-idle" policy could be better, at least for RT-BATCH tasks?
Sorry I should have said energy rather than power...
From my experience race to idle has never panned out as an
energy-efficient strategy, presumably due to the nonlinear increase in power cost as performance increases. Because of this I think a policy of increasing the OPP when RT tasks are runnable will cause a net increase in energy consumption, which need not be incurred since RT tasks do not receive this preferential OPP treatment today.
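(To put rough numbers on that nonlinearity, here is a standalone illustrative sketch; the OPPs and the C*f*V^2 dynamic-power model below are invented for the example, not taken from any platform:)

/*
 * Finishing a fixed amount of work fast at a high OPP vs. slowly at a
 * low OPP. Because voltage rises with frequency, energy per cycle
 * (~ C*V^2) is higher at the high OPP, so racing to idle costs more
 * energy even though it finishes sooner. Idle/leakage power is ignored.
 */
#include <stdio.h>

struct opp { double freq_hz; double volt; };

static double energy_joules(const struct opp *o, double cycles)
{
	const double cap = 1e-9;	/* effective switched capacitance (invented) */
	double power = cap * o->freq_hz * o->volt * o->volt;

	return power * (cycles / o->freq_hz);
}

int main(void)
{
	struct opp low  = { 500e6, 0.9 };	/* hypothetical low OPP */
	struct opp high = { 1.5e9, 1.2 };	/* hypothetical high OPP */

	printf("low OPP : %.2f J\n", energy_joules(&low, 1e9));		/* ~0.81 J */
	printf("high OPP: %.2f J\n", energy_joules(&high, 1e9));	/* ~1.44 J */
	return 0;
}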
On Mon, Oct 12, 2015 at 11:07:21AM -0700, Steve Muckle wrote:
(adding eas-dev)
On 10/09/2015 01:41 AM, Patrick Bellasi wrote:
The users might not be happy though
so we could favor performance for RT tasks to avoid breaking legacy software and ask users that care about energy to migrate to deadline where we actually know the performance constraints of the tasks.
Given that at the moment RT tasks are treated no differently than CFS tasks w.r.t. cpu frequency I'd expect that we could get away without any sort of perf bias for RT bandwidth, which I think would be cost prohibitive for power.
Are you specifically considering instantaneous power? Because from a power standpoint I cannot see any difference, for example, w.r.t. having a batch CFS task.
From an energy standpoint instead, don't you think that a "race-to-idle" policy could be better, at least for RT-BATCH tasks?
Sorry I should have said energy rather than power...
From my experience race to idle has never panned out as an energy-efficient strategy, presumably due to the nonlinear increase in power cost as performance increases.
I agree with you that "race-to-idle" is not (always) a good energy-efficient strategy. However, is the _main_ goal of sched-DVFS to be energy-efficient?
In this case, what should we do for platforms where the lower OPPs are less energy-efficient than some higher OPPs? We just discovered from some discussions at Connect that there are many platforms adopting that strategy, for various reasons.
IMHO one of the "main" goals of sched-DVFS is to contribute to providing (as much as possible) deterministic behaviors. We have the chance to refactor CPUFreq to better integrate with the scheduler, and thus we should try to exploit this opportunity to improve the overall determinism of the solution.
From this viewpoint, I think it's not so far away from reality that, if you schedule a task as FIFO or BATCH real-time, you care most about latencies, or you _should_ care about latencies. Specifically the time to completion of a task. If this is true, race-to-idle is the only "deterministic" way to achieve such a goal.
Because of this I think a policy of increasing the OPP when RT tasks are runnable will cause a net increase in energy consumption,
I would argue that this is hard to define in general. We actually do not know whether running at a lower OPP could be more or less energy efficient. It depends on many other (possibly external) factors, e.g. OPP curve definitions, interaction with I/O devices... What is quite certain, instead, is that we will increase power consumption.
But again, is the goal of sched-DVFS to be energy-efficient? I think this responsibility would be better assigned to other players, i.e. the scheduling classes.
which need not be incurred since RT tasks do not receive this preferential OPP treatment today.
Do they not receive such preferred treatment just because CPUFreq has always been completely decoupled from scheduler-specific information? The fact that we use just the "average CPU idle time" to select the OPP once in a while is, in my view, the reason why FIFO/BATCH don't get a specific treatment.
I think that on this specific point we should get the RT guys involved and ask them whether a race-to-idle strategy could better match their expectations.
Maybe I'm wrong, but I have the impression that once you schedule a task as FIFO/BATCH, sometimes you also need to "hack" into CPUFreq to ensure a minimum OPP that matches your tasks' demands in terms of time-to-completion.
Here the problem is that, with the frameworks we have right now, people need to use/combine features of different frameworks to achieve their goals. This sounds to me like something which could be improved, provided that we start by splitting responsibilities and letting users know which tool should be used to achieve a specific goal.
Specifically, if you care about responsiveness and energy-efficiency, you should use DEADLINE instead of FIFO/RR. Whereas if you go for the latter classes, then you should be aware that you get a race-to-idle behavior, whatever this means from an energy/power standpoint.
Cheers Patrick
On 10/14/2015 01:58 AM, Patrick Bellasi wrote:
From my experience race to idle has never panned out as an energy-efficient strategy, presumably due to the nonlinear increase in power cost as performance increases.
I agree with you that "race-to-idle" is not (always) a good energy-efficient strategy. However, is the _main_ goal of sched-DVFS to be energy-efficient?
I'd say the primary goal of sched-dvfs is to manage CPU frequency to offer the required performance for a platform at the best possible consumption of energy.
In this case, what should we do for platforms where the lower OPPs are less energy-efficient than some higher OPPs? We just discovered from some discussions at Connect that there are many platforms adopting that strategy, for various reasons.
If a lower OPP is less energy efficient than a higher one, I'd expect it to be removed from the devicetree configuration of available frequencies for the governor to choose from.
IMHO one of the "main" goals of sched-DVFS is to contribute to providing (as much as possible) deterministic behaviors. We have the chance to refactor CPUFreq to better integrate with the scheduler, and thus we should try to exploit this opportunity to improve the overall determinism of the solution.
From this viewpoint, I think it's not so far away from reality that, if you schedule a task as FIFO or BATCH real-time, you care most about
Sorry I'm not sure what you mean by BATCH real-time - did you mean SCHED_RR? I'm just aware of two real-time policies, FIFO and RR. AFAICS BATCH is very similar to regular CFS.
latencies, or you _should_ care about latencies. Specifically the time to completion of a task. If this is true, race-to-idle is the only "deterministic" way to achieve such a goal.
I don't believe determinism is part of the semantics of the RT class today. RT just offers the capability for strict prioritization of work.
Given that getting EAS/sched-dvfs accepted is such a herculean task I think any semantic changes should be avoided at least until the foundation is upstream and being used. Especially if they may have a significant impact on energy or performance.
Because of this I think a policy of increasing the OPP when RT tasks are runnable will cause a net increase in energy consumption,
I would argue that this is hard to define in general. We actually do not know whether running at a lower OPP could be more or less energy efficient. It depends on many other (possibly external) factors, e.g. OPP curve definitions, interaction with I/O devices... What is quite certain, instead, is that we will increase power consumption.
Agreed it's hard to define or know for sure but in general for the purposes of energy, I think it's fair to say that usually you should run at the lowest OPP which meets the performance requirements of the usecase. This assumes that OPPs which consume more or equal power to others while providing less performance have been removed. The typical device configuration out there today supports this conclusion IMO (usage of ondemand/interactive governor).
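(As a sketch of that policy, with hypothetical types and names, and assuming the table has already been pruned of dominated entries and sorted by ascending frequency:)

/* Pick the lowest OPP whose compute capacity covers the demand;
 * saturate at the highest OPP if nothing is sufficient. */
struct opp { unsigned int freq_khz; unsigned long capacity; };

static const struct opp *lowest_adequate_opp(const struct opp *opps,
					     int nr, unsigned long demand)
{
	int i;

	for (i = 0; i < nr; i++)
		if (opps[i].capacity >= demand)
			return &opps[i];

	return &opps[nr - 1];
}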
But again, is the goal of sched-DVFS to be energy-efficient?
Partly yes, as energy-efficient as possible while satisfying the demand for performance.
I think this responsibility would be better assigned to other players, i.e. the scheduling classes.
I'd agree in as much as if a workload wants a strict determinism guarantee it should migrate to SCHED_DEADLINE.
which need not be incurred since RT tasks do not receive this preferential OPP treatment today.
Do they not receive such preferred treatment just because CPUFreq has always been completely decoupled from scheduler-specific information? The fact that we use just the "average CPU idle time" to select the OPP once in a while is, in my view, the reason why FIFO/BATCH don't get a specific treatment.
I think that on this specific point we should get the RT guys involved and ask them whether a race-to-idle strategy could better match their expectations.
It can be debated whether the limitations of CPUfreq established the semantics of the RT class or vice-versa, but either way having RT affect the OPP in this way would be a major semantic/policy change that will almost certainly have significant repercussions in power profiling.
I agree that a broader discussion would be good before going further. Beyond just the RT guys though, I think a community-wide discussion on lkml and linux-pm would be appropriate.
Maybe I'm wrong, but I have the impression that once you schedule a task as FIFO/BATCH, sometimes you also need to "hack" into CPUFreq to ensure a minimum OPP that matches your tasks' demands in terms of time-to-completion.
I've not seen this specific issue. The boosting I've seen is typically associated with CFS tasks. RT tasks on the platforms I've worked with are usually small enough that they can be satisfied regardless of the OPP.
Here the problem is that, with the frameworks we have right now, people need to use/combine features of different frameworks to achieve their goals. This sounds to me like something which could be improved, provided that we start by splitting responsibilities and letting users know which tool should be used to achieve a specific goal.
Specifically, if you care about responsiveness and energy-efficiency, you should use DEADLINE instead of FIFO/RR. Whereas if you go for the latter classes, then you should be aware that you get a race-to-idle behavior, whatever this means from an energy/power standpoint.
If there's broad consensus that this semantic/policy change is what folks want then I'm all for it, but I'd expect pushback.
On 14 October 2015 at 21:58, Steve Muckle steve.muckle@linaro.org wrote:
On 10/14/2015 01:58 AM, Patrick Bellasi wrote:
From my experience race to idle has never panned out as an energy-efficient strategy, presumably due to the nonlinear increase in power cost as performance increases.
I agree with you that "race-to-idle" is not (always) a good energy-efficient strategy. However, is the _main_ goal of sched-DVFS to be energy-efficient?
I'd say the primary goal of sched-dvfs is to manage CPU frequency to offer the required performance for a platform at the best possible consumption of energy.
In this case, what should we do for platforms where the lower OPPs are less energy-efficient than some higher OPPs? We just discovered from some discussions at Connect that there are many platforms adopting that strategy, for various reasons.
If a lower OPP is less energy efficient than a higher one, I'd expect it to be removed from the devicetree configuration of available frequencies for the governor to choose from.
I agree on that point too, and I think this has also been discussed on LKML. Having a low OPP that is less power efficient than a higher one doesn't make any sense from both a power and a performance PoV.
IMHO one of the "main" goals of sched-DVFS is to contribute to providing (as much as possible) deterministic behaviors. We have the chance to refactor CPUFreq to better integrate with the scheduler, and thus we should try to exploit this opportunity to improve the overall determinism of the solution.
From this viewpoint, I think it's not so far away from reality that, if you schedule a task as FIFO or BATCH real-time, you care most about
Sorry I'm not sure what you mean by BATCH real-time - did you mean SCHED_RR? I'm just aware of two real-time policies, FIFO and RR. AFAICS BATCH is very similar to regular CFS.
latencies, or you _should_ care about latencies. Specifically the time to completion of a task. If this is true, race-to-idle is the only "deterministic" way to achieve such a goal.
I don't believe determinism is part of the semantics of the RT class today. RT just offers the capability for strict prioritization of work.
Given that getting EAS/sched-dvfs accepted is such a herculean task I think any semantic changes should be avoided at least until the foundation is upstream and being used. Especially if they may have a significant impact on energy or performance.
Because of this I think a policy of increasing the OPP when RT tasks are runnable will cause a net increase in energy consumption,
I would argue that this is hard to define in general. We actually do not know whether running at a lower OPP could be more or less energy efficient. It depends on many other (possibly external) factors, e.g. OPP curve definitions, interaction with I/O devices... What is quite certain, instead, is that we will increase power consumption.
Agreed it's hard to define or know for sure but in general for the purposes of energy, I think it's fair to say that usually you should run at the lowest OPP which meets the performance requirements of the usecase. This assumes that OPPs which consume more or equal power to others while providing less performance have been removed. The typical device configuration out there today supports this conclusion IMO (usage of ondemand/interactive governor).
But again, is the goal of sched-DVFS to be energy-efficient?
Partly yes, as energy-efficient as possible while satisfying the demand for performance.
I think this responsibility would be better assigned to other players, i.e. the scheduling classes.
I'd agree in as much as if a workload wants a strict determinism guarantee it should migrate to SCHED_DEADLINE.
which need not be incurred since RT tasks do not receive this preferential OPP treatment today.
Do they not receive such preferred treatment just because CPUFreq has always been completely decoupled from scheduler-specific information? The fact that we use just the "average CPU idle time" to select the OPP once in a while is, in my view, the reason why FIFO/BATCH don't get a specific treatment.
I think that on this specific point we should get the RT guys involved and ask them whether a race-to-idle strategy could better match their expectations.
If you look at the current implementation, we don't ask for the max freq or a specific freq as soon as an rt task is involved, but we use the cpufreq governor policy as for any other task. So we should keep the same behavior with sched-dvfs as a 1st step: the rt sched-class will provide its requirement according to the current RT task load. Then we will see about some improvements, but that falls back into policy, and schedTune could probably help in this area so we can "boost" the RT class.
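(A minimal sketch of that first step plus the possible boost, with entirely hypothetical names; this is not the schedTune interface, just the idea of inflating the utilization-driven request by a margin:)

/* Map a (possibly boosted) utilization onto a frequency request.
 * boost_pct is a 0..100 margin that a mechanism like schedTune could
 * later apply to the RT contribution; boost_pct = 0 keeps today's
 * purely utilization-driven behavior. */
static unsigned int freq_for_util(unsigned long util, unsigned long max_cap,
				  unsigned int max_freq_khz,
				  unsigned int boost_pct)
{
	util += util * boost_pct / 100;
	if (util > max_cap)
		util = max_cap;

	return (unsigned int)((unsigned long long)max_freq_khz * util / max_cap);
}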
It can be debated whether the limitations of CPUfreq established the semantics of the RT class or vice-versa, but either way having RT affect the OPP in this way would be a major semantic/policy change that will almost certainly have significant repercussions in power profiling.
I agree that a broader discussion would be good before going further. Beyond just the RT guys though, I think a community-wide discussion on lkml and linux-pm would be appropriate.
Maybe I'm wrong, but I have the impression that once you schedule a task as FIFO/BATCH, sometimes you also need to "hack" into CPUFreq to ensure a minimum OPP that matches your tasks' demands in terms of time-to-completion.
I've not seen this specific issue. The boosting I've seen is typically associated with CFS tasks. RT tasks on the platforms I've worked with are usually small enough that they can be satisfied regardless of the OPP.
Here the problem is that, with the frameworks we have right now, people need to use/combine features of different frameworks to achieve their goals. This sounds to me like something which could be improved, provided that we start by splitting responsibilities and letting users know which tool should be used to achieve a specific goal.
Specifically, if you care about responsiveness and energy-efficiency, you should use DEADLINE instead of FIFO/RR. Whereas if you go for the latter classes, then you should be aware that you get a race-to-idle behavior, whatever this means from an energy/power standpoint.
If there's broad consensus that this semantic/policy change is what folks want then I'm all for it, but I'd expect pushback.
On Thu, Oct 15, 2015 at 09:23:58AM +0200, Vincent Guittot wrote:
On 14 October 2015 at 21:58, Steve Muckle steve.muckle@linaro.org wrote:
On 10/14/2015 01:58 AM, Patrick Bellasi wrote:
From my experience race to idle has never panned out as an energy-efficient strategy, presumably due to the nonlinear increase in power cost as performance increases.
I agree with you that "race-to-idle" is not (always) a good energy-efficient strategy. However, is the _main_ goal of sched-DVFS to be energy-efficient?
I'd say the primary goal of sched-dvfs is to manage CPU frequency to offer the required performance for a platform at the best possible consumption of energy.
In this case, what should we do for platforms where the lower OPPs are less energy-efficient than some higher OPPs? We just discovered from some discussions at Connect that there are many platforms adopting that strategy, for various reasons.
If a lower OPP is less energy efficient than a higher one, I'd expect it to be removed from the devicetree configuration of available frequencies for the governor to choose from.
I agree on that point too, and I think this has also been discussed on LKML. Having a low OPP that is less power efficient than a higher one doesn't make any sense from both a power and a performance PoV.
Actually, I think it could make sense from a power standpoint.
For example, if you are under a thermal constraint but still want to make progress with your workload, there are only two possibilities: a) throttle at an energy-efficient OPP; b) switch to a low-power OPP, even if it is less energy-efficient, i.e. reduce the frequency without lowering the voltage.
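(Rough arithmetic showing why (b) lowers power but not energy; invented numbers, with dynamic power modeled as C*f*V^2 plus a fixed static term:)

#include <stdio.h>

int main(void)
{
	double cap = 1e-9, volt = 1.1, cycles = 1e9, p_static = 0.2;
	double freqs[] = { 1.0e9, 0.5e9 };	/* halve f, keep V */
	int i;

	for (i = 0; i < 2; i++) {
		double t = cycles / freqs[i];
		double p = cap * freqs[i] * volt * volt + p_static;

		/* f=1.0e9: ~1.41 W, ~1.41 J; f=0.5e9: ~0.81 W, ~1.61 J
		 * -> lower power (good for thermal), higher energy. */
		printf("f=%.1e Hz: power=%.2f W, energy=%.2f J\n",
		       freqs[i], p, p * t);
	}
	return 0;
}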
I'm personally more for the first solution, and a suitable use of bandwidth control could be enough to provide it.
However, the point here is that once I suggested to some partners at Connect to just get rid of the energy-inefficient lower OPPs, I got the impression that this is not always possible.
What I was not able to completely understand is whether there were some strong technical arguments on the HW/use-case side, or it was just a question of "education" on what makes/doesn't make sense for a feasible and effective power-management strategy.
IMHO one of the "main" goals of sched-DVFS is to contribute to providing (as much as possible) deterministic behaviors. We have the chance to refactor CPUFreq to better integrate with the scheduler, and thus we should try to exploit this opportunity to improve the overall determinism of the solution.
From this viewpoint, I think it's not so far away from reality that, if you schedule a task as FIFO or BATCH real-time, you care most about
Sorry I'm not sure what you mean by BATCH real-time - did you mean SCHED_RR? I'm just aware of two real-time policies, FIFO and RR. AFAICS BATCH is very similar to regular CFS.
latencies, or you _should_ care about latencies. Specifically the time to completion of a task. If this is true, race-to-idle is the only "deterministic" way to achieve such a goal.
I don't believe determinism is part of the semantics of the RT class today. RT just offers the capability for strict prioritization of work.
Given that getting EAS/sched-dvfs accepted is such a herculean task I think any semantic changes should be avoided at least until the foundation is upstream and being used. Especially if they may have a significant impact on energy or performance.
Because of this I think a policy of increasing the OPP when RT tasks are runnable will cause a net increase in energy consumption,
I would argue that this is hard to define in general. We actually do not know whether running at a lower OPP could be more or less energy efficient. It depends on many other (possibly external) factors, e.g. OPP curve definitions, interaction with I/O devices... What is quite certain, instead, is that we will increase power consumption.
Agreed it's hard to define or know for sure but in general for the purposes of energy, I think it's fair to say that usually you should run at the lowest OPP which meets the performance requirements of the usecase. This assumes that OPPs which consume more or equal power to others while providing less performance have been removed. The typical device configuration out there today supports this conclusion IMO (usage of ondemand/interactive governor).
But again, is the goal of sched-DVFS to be energy-efficient?
Partly yes, as energy-efficient as possible while satisfying the demand for performance.
I think this responsibility would be better assigned to other players, i.e. the scheduling classes.
I'd agree in as much as if a workload wants a strict determinism guarantee it should migrate to SCHED_DEADLINE.
which need not be incurred since RT tasks do not receive this preferential OPP treatment today.
Do they not receive such preferred treatment just because CPUFreq has always been completely decoupled from scheduler-specific information? The fact that we use just the "average CPU idle time" to select the OPP once in a while is, in my view, the reason why FIFO/BATCH don't get a specific treatment.
I think that on this specific point we should get the RT guys involved and ask them whether a race-to-idle strategy could better match their expectations.
If you look at the current implementation, we don't ask for the max freq or a specific freq as soon as an rt task is involved, but we use the cpufreq governor policy as for any other task.
Right, that's what happens right now.
So we should keep the same behavior with sched-dvfs as a 1st step: the rt sched-class will provide its requirement according to the current RT task load. Then we will see about some improvements, but that falls back into policy, and schedTune could probably help in this area so we can "boost" the RT class.
Ok, I agree on keeping it simple at the beginning, and maybe it's also right as a long-term goal. The only points I wanted to raise with this discussion are: a) perhaps some assumptions on OPP curves may not always hold; b) maybe the semantics of the FIFO/RR classes could be improved.
For the first point we need to better understand from platform providers what technical points there are (if any) against the usage of an ideal OPP curve where all the OPPs are equally energy-efficient, or at least have a monotonically decreasing energy efficiency.
For the second point we should better understand from final users of FIFO/RR (e.g. system integrators) whether there are use-cases where DEADLINE is not usable and yet running at the lowest OPP is not always enough.
It can be debated whether the limitations of CPUfreq established the semantics of the RT class or vice-versa, but either way having RT affect the OPP in this way would be a major semantic/policy change that will almost certainly have significant repercussions in power profiling.
I agree that a broader discussion would be good before going further. Beyond just the RT guys though, I think a community-wide discussion on lkml and linux-pm would be appropriate.
Maybe I'm wrong, but I have the impression that once you schedule a task as FIFO/BATCH, sometimes you also need to "hack" into CPUFreq to ensure a minimum OPP that matches your tasks' demands in terms of time-to-completion.
I've not seen this specific issue. The boosting I've seen is typically associated with CFS tasks. RT tasks on the platforms I've worked with are usually small enough that they can be satisfied regardless of the OPP.
Here the problem is that, with the frameworks we have right now, people need to use/combine features of different frameworks to achieve their goals. This sounds to me like something which could be improved, provided that we start by splitting responsibilities and letting users know which tool should be used to achieve a specific goal.
Specifically, if you care about responsiveness and energy-efficiency, you should use DEADLINE instead of FIFO/RR. Whereas if you go for the latter classes, then you should be aware that you get a race-to-idle behavior, whatever this means from an energy/power standpoint.
If there's broad consensus that this semantic/policy change is what folks want then I'm all for it, but I'd expect pushback.
Cheers Patrick
On Thu, Oct 15, 2015 at 10:24:27AM +0100, Patrick Bellasi wrote:
On Thu, Oct 15, 2015 at 09:23:58AM +0200, Vincent Guittot wrote:
On 14 October 2015 at 21:58, Steve Muckle steve.muckle@linaro.org wrote:
On 10/14/2015 01:58 AM, Patrick Bellasi wrote:
From my experience race to idle has never panned out as an energy-efficient strategy, presumably due to the nonlinear increase in power cost as performance increases.
I agree with you that "race-to-idle" is not (always) a good energy-efficient strategy. However, is the _main_ goal of sched-DVFS to be energy-efficient?
I'd say the primary goal of sched-dvfs is to manage CPU frequency to offer the required performance for a platform at the best possible consumption of energy.
In this case, what should we do for platforms where the lower OPPs are less energy-efficient than some higher OPPs? We just discovered from some discussions at Connect that there are many platforms adopting that strategy, for various reasons.
If a lower OPP is less energy efficient than a higher one, I'd expect it to be removed from the devicetree configuration of available frequencies for the governor to choose from.
I agree on that point too, and I think this has also been discussed on LKML. Having a low OPP that is less power efficient than a higher one doesn't make any sense from both a power and a performance PoV.
Actually, I think it could make sense from a power standpoint.
For example, if you are under a thermal constraint but still want to make progress with your workload, there are only two possibilities: a) throttle at an energy-efficient OPP; b) switch to a low-power OPP, even if it is less energy-efficient, i.e. reduce the frequency without lowering the voltage.
I have heard the same argument several times from thermal management people. Despite their lower energy efficiency they still rely on their lower power to stay within the thermal budget.
I'm personally more for the first solution, and a suitable use of bandwidth control could be enough to provide it.
Agreed, using inefficient OPPs is not really desirable, but it is the only option available at the moment. I think that ideally we should have hardware-implemented idle-injection (throttling) instead and let the thermal framework specify the duty cycle. Throttling through software could work, but it isn't feasible to do tricks like aligning the throttling across all CPUs in a cluster to enter a deeper idle-state and make the throttling even more efficient.
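(For illustration, a tiny sketch of what such idle injection amounts to; the structure and names are hypothetical, with the thermal framework choosing the duty cycle:)

/* Derive the run/forced-idle intervals of a throttling loop from a
 * duty cycle (percent of each window the CPU may run). */
struct idle_injection {
	unsigned int window_us;
	unsigned int run_us;
	unsigned int idle_us;
};

static void idle_injection_setup(struct idle_injection *ii,
				 unsigned int window_us,
				 unsigned int duty_pct)
{
	ii->window_us = window_us;
	ii->run_us = window_us * duty_pct / 100;
	ii->idle_us = window_us - ii->run_us;
}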
Back to reality... We do have the choice to avoid those inefficient states in the OPP table. We could hide them from sched-DVFS/cpufreq governors when the system isn't thermally constrained and only enable them when strictly needed. I think somebody (Steve?) suggested recently that we could do something like reordering the OPP list by decreasing energy efficiency and always starting the search for an appropriate OPP from the beginning. That way we should never pick an inefficient one unless thermal constraints force us to. I think that could be a good start.
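(A sketch of that search, under the assumption that the table is pre-sorted by descending energy efficiency at build time and that thermal mitigation clears an "enabled" flag; all names hypothetical:)

#include <stdbool.h>

struct opp {
	unsigned int freq_khz;
	unsigned long capacity;
	bool enabled;		/* cleared under thermal mitigation */
};

/* opps[] sorted by descending efficiency: the first enabled entry that
 * covers the demand wins, so an inefficient OPP is only picked when
 * thermal constraints have disabled everything better. */
static const struct opp *pick_opp(const struct opp *opps, int nr,
				  unsigned long demand)
{
	const struct opp *fallback = NULL;
	int i;

	for (i = 0; i < nr; i++) {
		if (!opps[i].enabled)
			continue;
		if (opps[i].capacity >= demand)
			return &opps[i];
		if (!fallback || opps[i].capacity > fallback->capacity)
			fallback = &opps[i];
	}
	return fallback;
}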
However, the point here is that once I suggested to some partners at Connect to just get rid of the energy-inefficient lower OPPs, I got the impression that this is not always possible.
What I was not able to completely understand is whether there were some strong technical arguments on the HW/use-case side, or it was just a question of "education" on what makes/doesn't make sense for a feasible and effective power-management strategy.
I guess that there could be some SoC-specific constraints that force them to keep those OPPs.
IMHO one of the "main" goals of sched-DVFS is to contribute to providing (as much as possible) deterministic behaviors. We have the chance to refactor CPUFreq to better integrate with the scheduler, and thus we should try to exploit this opportunity to improve the overall determinism of the solution.
From this viewpoint, I think it's not so far away from reality that, if you schedule a task as FIFO or BATCH real-time, you care most about
Sorry I'm not sure what you mean by BATCH real-time - did you mean SCHED_RR? I'm just aware of two real-time policies, FIFO and RR. AFAICS BATCH is very similar to regular CFS.
latencies, or you _should_ care about latencies. Specifically the time to completion of a task. If this is true, race-to-idle is the only "deterministic" way to achieve such a goal.
I don't believe determinism is part of the semantics of the RT class today. RT just offers the capability for strict prioritization of work.
Given that getting EAS/sched-dvfs accepted is such a herculean task I think any semantic changes should be avoided at least until the foundation is upstream and being used. Especially if they may have a significant impact on energy or performance.
Because of this I think a policy of increasing the OPP when RT tasks are runnable will cause a net increase in energy consumption,
I would argue that this is hard to define in general. We actually do not know whether running at a lower OPP could be more or less energy efficient. It depends on many other (possibly external) factors, e.g. OPP curve definitions, interaction with I/O devices... What is quite certain, instead, is that we will increase power consumption.
Agreed it's hard to define or know for sure but in general for the purposes of energy, I think it's fair to say that usually you should run at the lowest OPP which meets the performance requirements of the usecase. This assumes that OPPs which consume more or equal power to others while providing less performance have been removed. The typical device configuration out there today supports this conclusion IMO (usage of ondemand/interactive governor).
But again, is the goal of sched-DVFS to be energy-efficient?
Partly yes, as energy-efficient as possible while satisfying the demand for performance.
I think this responsibility would be better assigned to other players, i.e. the scheduling classes.
I'd agree in as much as if a workload wants a strict determinism guarantee it should migrate to SCHED_DEADLINE.
which need not be incurred since RT tasks do not receive this preferential OPP treatment today.
Do they not receive such preferred treatment just because CPUFreq has always been completely decoupled from scheduler-specific information? The fact that we use just the "average CPU idle time" to select the OPP once in a while is, in my view, the reason why FIFO/BATCH don't get a specific treatment.
I think that on this specific point we should get the RT guys involved and ask them whether a race-to-idle strategy could better match their expectations.
I think I agree with Steve about the RT semantics. It doesn't provide any performance/completion-time guarantees today, and I don't think we should build them in now. As said below, the current state of mainline doesn't have any special treatment for RT tasks. So why not leave it as it is today (DVFS guided by current utilization) and ask people to move to SCHED_DEADLINE if they want specific performance/latency guarantees?
If you look at the current implementation, we don't ask for the max freq or a specific freq as soon as an rt task is involved, but we use the cpufreq governor policy as for any other task.
Right, that's what happens right now.
So we should keep the same behavior with sched-dvfs as a 1st step: the rt sched-class will provide its requirement according to the current RT task load. Then we will see about some improvements, but that falls back into policy, and schedTune could probably help in this area so we can "boost" the RT class.
Ok, I agree on keeping it simple at the beginning, and maybe it's also right as a long-term goal. The only points I wanted to raise with this discussion are: a) perhaps some assumptions on OPP curves may not always hold; b) maybe the semantics of the FIFO/RR classes could be improved.
For the first point we need to better understand from platform providers what technical points there are (if any) against the usage of an ideal OPP curve where all the OPPs are equally energy-efficient, or at least have a monotonically decreasing energy efficiency.
As said above, it could be worth doing a brief investigation to see if something simple like ordering the OPPs by efficiency is feasible and solves the problem.
For the second point we should better understand from final users of FIFO/RR (e.g. system integrators) whether there are use-cases where DEADLINE is not usable and yet running at the lowest OPP is not always enough.
I'm not convinced that you want to go to the max OPP by default for RT tasks either. In the past Android has used RT (not sure if it is still the case) for tiny tasks related to audio, and I don't think we want to go to the max OPP every time one of those runs. If there is indeed a need, we could consider a hack like having some intermediate OPP as the minimum OPP for RT tasks.
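(Sketch of that hack, hypothetical names: keep the utilization-driven request, but apply an intermediate floor while RT tasks are runnable:)

#include <stdbool.h>

/* Clamp a frequency request to an intermediate floor whenever RT
 * tasks are runnable, rather than jumping straight to fmax. */
static unsigned int apply_rt_floor(unsigned int req_khz,
				   unsigned int rt_min_khz,
				   bool rt_runnable)
{
	if (rt_runnable && req_khz < rt_min_khz)
		return rt_min_khz;
	return req_khz;
}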
It can be debated whether the limitations of CPUfreq established the semantics of the RT class or vice-versa, but either way having RT affect the OPP in this way would be a major semantic/policy change that will almost certainly have significant repercussions in power profiling.
I agree that a broader discussion would be good before going further. Beyond just the RT guys though, I think a community-wide discussion on lkml and linux-pm would be appropriate.
Maybe I'm wrong, but I have the impression that once you schedule a task as FIFO/BATCH, sometimes you also need to "hack" into CPUFreq to ensure a minimum OPP that matches your tasks' demands in terms of time-to-completion.
I've not seen this specific issue. The boosting I've seen is typically associated with CFS tasks. RT tasks on the platforms I've worked with are usually small enough that they can be satisfied regardless of the OPP.
Here the problem is that, with the frameworks we have right now, people need to use/combine features of different frameworks to achieve their goals. This sounds to me like something which could be improved, provided that we start by splitting responsibilities and letting users know which tool should be used to achieve a specific goal.
Specifically, if you care about responsiveness and energy-efficiency, you should use DEADLINE instead of FIFO/RR. Whereas if you go for the latter classes, then you should be aware that you get a race-to-idle behavior, whatever this means from an energy/power standpoint.
If there's broad consensus that this semantic/policy change is what folks want then I'm all for it, but I'd expect pushback.
+1
On 15 October 2015 at 18:10, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Thu, Oct 15, 2015 at 10:24:27AM +0100, Patrick Bellasi wrote:
On Thu, Oct 15, 2015 at 09:23:58AM +0200, Vincent Guittot wrote:
On 14 October 2015 at 21:58, Steve Muckle steve.muckle@linaro.org wrote:
On 10/14/2015 01:58 AM, Patrick Bellasi wrote:
From my experience race to idle has never panned out as an energy-efficient strategy, presumably due to the nonlinear increase in power cost as performance increases.
I agree with you that "race-to-idle" is not (always) a good energy-efficient strategy. However, is the _main_ goal of sched-DVFS to be energy-efficient?
I'd say the primary goal of sched-dvfs is to manage CPU frequency to offer the required performance for a platform at the best possible consumption of energy.
In this case, what should we do for platforms where the lower OPPs are less energy-efficient than some higher OPPs? We just discovered from some discussions at Connect that there are many platforms adopting that strategy, for various reasons.
If a lower OPP is less energy efficient than a higher one, I'd expect it to be removed from the devicetree configuration of available frequencies for the governor to choose from.
I agree on that point too, and I think this has also been discussed on LKML. Having a low OPP that is less power efficient than a higher one doesn't make any sense from both a power and a performance PoV.
Actually, I think it could make sense from a power standpoint.
For example, if you are under a thermal constraint but still want to make progress with your workload, there are only two possibilities: a) throttle at an energy-efficient OPP; b) switch to a low-power OPP, even if it is less energy-efficient, i.e. reduce the frequency without lowering the voltage.
I have heard the same argument several times from thermal management people. Despite their lower energy efficiency they still rely on their lower power to stay within the thermal budget.
I'm personally more for the first solution, and a suitable use of bandwidth control could be enough to provide it.
Agreed, using inefficient OPPs is not really desirable, but it is the only option available at the moment. I think that ideally we should have hardware-implemented idle-injection (throttling) instead and let the thermal framework specify the duty cycle. Throttling through software could work, but it isn't feasible to do tricks like aligning the throttling across all CPUs in a cluster to enter a deeper idle-state and make the throttling even more efficient.
Back to reality... We do have the choice to avoid those inefficient states in the OPP table. We could hide them from sched-DVFS/cpufreq governors when the system isn't thermally constrained and only enable them when strictly needed. I think somebody (Steve?) suggested recently that we could do something like reordering the OPP list by decreasing energy efficiency and always starting the search for an appropriate OPP from the beginning. That way we should never pick an inefficient one unless thermal constraints force us to. I think that could be a good start.
Looks like we are aligned on the behavior :-) Not sure what is meant by reordering the OPP list, but I'm not sure that we really need to order them. We just need to ensure that, at each moment, an enabled OPP is more power efficient than any OPP with a higher freq. By default, we enable the OPPs with max capacity wrt efficiency. Then we can let the thermal manager enable/disable some OPPs because of thermal mitigation, but it must ensure that the previous rule always holds.
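(That rule could be checked roughly like this; hypothetical table layout, with efficiency compared as capacity-per-power via cross-multiplication to stay in integer arithmetic:)

#include <stdbool.h>

struct opp {
	unsigned int freq_khz;
	unsigned long cap;	/* compute capacity at this OPP */
	unsigned long power;	/* busy power at this OPP */
	bool enabled;
};

/* Walk the table in ascending frequency order and reject it if any
 * enabled OPP is *more* efficient (cap/power) than an enabled OPP
 * below it. */
static bool opp_table_obeys_rule(const struct opp *o, int nr)
{
	const struct opp *prev = NULL;
	int i;

	for (i = 0; i < nr; i++) {
		if (!o[i].enabled)
			continue;
		if (prev && (unsigned long long)o[i].cap * prev->power >
			    (unsigned long long)prev->cap * o[i].power)
			return false;
		prev = &o[i];
	}
	return true;
}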
However, the point here is that once I suggested to some partners at Connect to just get rid of the energy-inefficient lower OPPs, I got the impression that this is not always possible.
What I was not able to completely understand is whether there were some strong technical arguments on the HW/use-case side, or it was just a question of "education" on what makes/doesn't make sense for a feasible and effective power-management strategy.
I guess that there could be some SoC-specific constraints that force them to keep those OPPs.
IMHO one of the "main" goals of sched-DVFS is to contribute to providing (as much as possible) deterministic behaviors. We have the chance to refactor CPUFreq to better integrate with the scheduler, and thus we should try to exploit this opportunity to improve the overall determinism of the solution.
From this viewpoint, I think it's not so far away from reality that, if you schedule a task as FIFO or BATCH real-time, you care most about
Sorry I'm not sure what you mean by BATCH real-time - did you mean SCHED_RR? I'm just aware of two real-time policies, FIFO and RR. AFAICS BATCH is very similar to regular CFS.
latencies, or you _should_ care about latencies. Specifically the time to completion of a task. If this is true, race-to-idle is the only "deterministic" way to achieve such a goal.
I don't believe determinism is part of the semantics of the RT class today. RT just offers the capability for strict prioritization of work.
Given that getting EAS/sched-dvfs accepted is such a herculean task I think any semantic changes should be avoided at least until the foundation is upstream and being used. Especially if they may have a significant impact on energy or performance.
Because of this I think a policy of increasing the OPP when RT tasks are runnable will cause a net increase in energy consumption,
I would argue that this is hard to define in general. We actually do not know whether running at a lower OPP could be more or less energy efficient. It depends on many other (possibly external) factors, e.g. OPP curve definitions, interaction with I/O devices... What is quite certain, instead, is that we will increase power consumption.
Agreed it's hard to define or know for sure but in general for the purposes of energy, I think it's fair to say that usually you should run at the lowest OPP which meets the performance requirements of the usecase. This assumes that OPPs which consume more or equal power to others while providing less performance have been removed. The typical device configuration out there today supports this conclusion IMO (usage of ondemand/interactive governor).
But again, is the goal of sched-DVFS to be energy-efficient?
Partly yes, as energy-efficient as possible while satisfying the demand for performance.
I think this responsibility would be better assigned to other players, i.e. the scheduling classes.
I'd agree in as much as if a workload wants a strict determinism guarantee it should migrate to SCHED_DEADLINE.
which need not be incurred since RT tasks do not receive this preferential OPP treatment today.
Do they not receive such preferred treatment just because CPUFreq has always been completely decoupled from scheduler-specific information? The fact that we use just the "average CPU idle time" to select the OPP once in a while is, in my view, the reason why FIFO/BATCH don't get a specific treatment.
I think that on this specific point we should get the RT guys involved and ask them whether a race-to-idle strategy could better match their expectations.
I think I agree with Steve about the RT semantics. It doesn't provide any performance/completion-time guarantees today, and I don't think we should build them in now. As said below, the current state of mainline doesn't have any special treatment for RT tasks. So why not leave it as it is today (DVFS guided by current utilization) and ask people to move to SCHED_DEADLINE if they want specific performance/latency guarantees?
If you look at the current implementation, we don't ask for the max freq or a specific freq as soon as an rt task is involved, but we use the cpufreq governor policy as for any other task.
Right, that's what happens right now.
So we should keep the same behavior with sched-dvfs as a 1st step: the rt sched-class will provide its requirement according to the current RT task load. Then we will see about some improvements, but that falls back into policy, and schedTune could probably help in this area so we can "boost" the RT class.
Ok, I agree on keeping it simple at the beginning, and maybe it's also right as a long-term goal. The only points I wanted to raise with this discussion are: a) perhaps some assumptions on OPP curves may not always hold; b) maybe the semantics of the FIFO/RR classes could be improved.
For the first point we need to better understand from platform providers what technical points there are (if any) against the usage of an ideal OPP curve where all the OPPs are equally energy-efficient, or at least have a monotonically decreasing energy efficiency.
As said above, it could be worth doing a brief investigation to see if something simple like ordering the OPPs by efficiency is feasible and solves the problem.
For the second point we should better understand from final users of FIFO/RR (e.g. system integrators) whether there are use-cases where DEADLINE is not usable and yet running at the lowest OPP is not always enough.
I'm not convinced that you want to go to the max OPP by default for RT tasks either. In the past Android has used RT (not sure if it is still the case) for tiny tasks related to audio, and I don't think we want to go to the max OPP every time one of those runs. If there is indeed a need, we could consider a hack like having some intermediate OPP as the minimum OPP for RT tasks.
It can be debated whether the limitations of CPUfreq established the semantics of the RT class or vice-versa, but either way having RT affect the OPP in this way would be a major semantic/policy change that will almost certainly have significant repercussions in power profiling.
I agree that a broader discussion would be good before going further. Beyond just the RT guys though, I think a community-wide discussion on lkml and linux-pm would be appropriate.
Maybe I'm wrong, but I have the impression that once you schedule a task as FIFO/BATCH, sometimes you also need to "hack" into CPUFreq to ensure a minimum OPP that matches your tasks' demands in terms of time-to-completion.
I've not seen this specific issue. The boosting I've seen is typically associated with CFS tasks. RT tasks on the platforms I've worked with are usually small enough that they can be satisfied regardless of the OPP.
Here the problem is that, with the frameworks we have right now, people need to use/combine features of different frameworks to achieve their goals. This sounds to me like something which could be improved, provided that we start by splitting responsibilities and letting users know which tool should be used to achieve a specific goal.
Specifically, if you care about responsiveness and energy-efficiency, you should use DEADLINE instead of FIFO/RR. Whereas if you go for the latter classes, then you should be aware that you get a race-to-idle behavior, whatever this means from an energy/power standpoint.
If there's broad consensus that this semantic/policy change is what folks want then I'm all for it, but I'd expect pushback.
+1
On Thu, Oct 15, 2015 at 06:24:16PM +0200, Vincent Guittot wrote:
On 15 October 2015 at 18:10, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Thu, Oct 15, 2015 at 10:24:27AM +0100, Patrick Bellasi wrote:
On Thu, Oct 15, 2015 at 09:23:58AM +0200, Vincent Guittot wrote:
On 14 October 2015 at 21:58, Steve Muckle steve.muckle@linaro.org wrote:
On 10/14/2015 01:58 AM, Patrick Bellasi wrote:
From my experience race to idle has never panned out as an energy-efficient strategy, presumably due to the nonlinear increase in power cost as performance increases.
I agree with you that "race-to-idle" is not (always) a good energy-efficient strategy. However, is the _main_ goal of sched-DVFS to be energy-efficient?
I'd say the primary goal of sched-dvfs is to manage CPU frequency to offer the required performance for a platform at the best possible consumption of energy.
In this case, what should we do for platforms where the lower OPPs are less energy-efficient than some higher OPPs? We just discovered from some discussions at Connect that there are many platforms adopting that strategy, for various reasons.
If a lower OPP is less energy efficient than a higher one, I'd expect it to be removed from the devicetree configuration of available frequencies for the governor to choose from.
I agree on that point too, and I think this has also been discussed on LKML. Having a low OPP that is less power efficient than a higher one doesn't make any sense from both a power and a performance PoV.
Actually, I think it could make sense from a power standpoint.
For example, if you are under a thermal constraint but still want to make progress with your workload, there are only two possibilities: a) throttle at an energy-efficient OPP; b) switch to a low-power OPP, even if it is less energy-efficient, i.e. reduce the frequency without lowering the voltage.
I have heard the same argument several times from thermal management people. Despite their lower energy efficiency they still rely on their lower power to stay within the thermal budget.
I'm personally more for the first solution, and a suitable use of bandwidth control could be enough to provide it.
Agreed, using inefficient OPPs is not really desirable, but it is the only option available at the moment. I think that ideally we should have hardware-implemented idle-injection (throttling) instead and let the thermal framework specify the duty cycle. Throttling through software could work, but it isn't feasible to do tricks like aligning the throttling across all CPUs in a cluster to enter a deeper idle-state and make the throttling even more efficient.
Back to reality... We do have the choice to avoid those inefficient states in the OPP table. We could hide them from sched-DVFS/cpufreq governors when the system isn't thermally constrained and only enable them when strictly needed. I think somebody (Steve?) suggested recently that we could do something like reordering the OPP list by decreasing energy efficiency and always starting the search for an appropriate OPP from the beginning. That way we should never pick an inefficient one unless thermal constraints force us to. I think that could be a good start.
Looks like we are aligned on the behavior :-) Not sure what is meant by reordering the OPP list, but I'm not sure that we really need to order them. We just need to ensure that, at each moment, an enabled OPP is more power efficient than any OPP with a higher freq. By default, we enable the OPPs with max capacity wrt efficiency. Then we can let the thermal manager enable/disable some OPPs because of thermal mitigation, but it must ensure that the previous rule always holds.
I think we are saying the same thing :-) We want to make sure we don't pick inefficient OPPs unless we are forced to by thermal constraints. The rest is just implementation details.
It looks like we can't get rid of those OPPs anytime soon, so we had better not assume they aren't there, and deal with them instead.
On Thu, Oct 15, 2015 at 06:24:16PM +0200, Vincent Guittot wrote:
On 15 October 2015 at 18:10, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Thu, Oct 15, 2015 at 10:24:27AM +0100, Patrick Bellasi wrote:
On Thu, Oct 15, 2015 at 09:23:58AM +0200, Vincent Guittot wrote:
On 14 October 2015 at 21:58, Steve Muckle steve.muckle@linaro.org wrote:
On 10/14/2015 01:58 AM, Patrick Bellasi wrote:
From my experience race to idle has never panned out as an energy-efficient strategy, presumably due to the nonlinear increase in power cost as performance increases.
I agree with you that "race-to-idle" is not (always) a good energy-efficient strategy. However, is the _main_ goal of sched-DVFS to be energy-efficient?
I'd say the primary goal of sched-dvfs is to manage CPU frequency to offer the required performance for a platform at the best possible consumption of energy.
In this case, what should we do for platforms where the lower OPPs are less energy-efficient than some higher OPPs? We just discovered from some discussions at Connect that there are many platforms adopting that strategy, for various reasons.
If a lower OPP is less energy efficient than a higher one, I'd expect it to be removed from the devicetree configuration of available frequencies for the governor to choose from.
I agree on that point too, and I think this has also been discussed on LKML. Having a low OPP that is less power efficient than a higher one doesn't make any sense from both a power and a performance PoV.
Actually, I think it could make sense from a power standpoint.
For example, if you are under a thermal constraint but still want to make progress with your workload, there are only two possibilities: a) throttle at an energy-efficient OPP; b) switch to a low-power OPP, even if it is less energy-efficient, i.e. reduce the frequency without lowering the voltage.
I have heard the same argument several times from thermal management people. Despite their lower energy efficiency they still rely on their lower power to stay within the thermal budget.
I'm personally more for the first solution, and a suitable use of bandwidth control could be enough to provide it.
Agreed, using inefficient OPPs is not really desirable, but it is the only option available at the moment. I think that ideally we should have hardware-implemented idle-injection (throttling) instead and let the thermal framework specify the duty cycle. Throttling through software could work, but it isn't feasible to do tricks like aligning the throttling across all CPUs in a cluster to enter a deeper idle-state and make the throttling even more efficient.
Back to reality... We do have the choice to avoid those inefficient states in the OPP table. We could hide them from sched-DVFS/cpufreq governors when the system isn't thermally constrained and only enable them when strictly needed. I think somebody (Steve?) suggested recently that we could do something like reordering the OPP list by decreasing energy efficiency and always starting the search for an appropriate OPP from the beginning. That way we should never pick an inefficient one unless thermal constraints force us to. I think that could be a good start.
Looks like we are aligned on the behavior :-)
Seems so.
Not sure what is meant by reordering the OPP list, but I'm not sure
Something like a "pre-processing" of the OPP list when the EM tables are built, to ensure that, for example, the topmost entry is always the most energy-efficient? This would ensure that entries are less and less energy-efficient, as well as reduce lookup latencies on the scheduler critical paths.
that we really need to order them. We just need to ensure that, at each moment, an enabled OPP is more power efficient than any OPP with a higher freq.
That's a good strategy, which however requires updating the current definition of the EM tables to support a concept of enabled/disabled entries. Maybe a flag per capacity entry?
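(Something along these lines, purely as a sketch; this is not the actual EM table layout from the EAS patches:)

#include <stdbool.h>

/* A capacity/OPP entry extended with the enabled/disabled concept;
 * thermal mitigation would toggle the flag, and table build-time
 * checks would enforce the efficiency ordering discussed above. */
struct capacity_state {
	unsigned long cap;	/* compute capacity at this OPP */
	unsigned long power;	/* busy power at this OPP */
	bool enabled;		/* eligible for selection right now */
};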
By default, we enable the OPPs with max capacity wrt efficiency. Then we can let the thermal manager enable/disable some OPPs because of thermal mitigation, but it must ensure that the previous rule always holds.
Agree.
However, the point here is that once I suggested to some partners at Connect to just get rid of the energy-inefficient lower OPPs, I got the impression that this is not always possible.
What I was not able to completely understand is whether there were some strong technical arguments on the HW/use-case side, or it was just a question of "education" on what makes/doesn't make sense for a feasible and effective power-management strategy.
I guess that there could be some SoC-specific constraints that force them to keep those OPPs.
IMHO one of the "main" goals of sched-DVFS is to contribute to providing (as much as possible) deterministic behaviors. We have the chance to refactor CPUFreq to better integrate with the scheduler, and thus we should try to exploit this opportunity to improve the overall determinism of the solution.
From this viewpoint, I think it's not so far away from reality that, if you schedule a task as FIFO or BATCH real-time, you care most about
Sorry I'm not sure what you mean by BATCH real-time - did you mean SCHED_RR? I'm just aware of two real-time policies, FIFO and RR. AFAICS BATCH is very similar to regular CFS.
latencies, or you _should_ care about latencies. Specifically the time to completion of a task. If this is true, race-to-idle is the only "deterministic" way to achieve such a goal.
I don't believe determinism is part of the semantics of the RT class today. RT just offers the capability for strict prioritization of work.
Given that getting EAS/sched-dvfs accepted is such a herculean task I think any semantic changes should be avoided at least until the foundation is upstream and being used. Especially if they may have a significant impact on energy or performance.
Because of this I think a policy of increasing the OPP when RT tasks are runnable will cause a net increase in energy consumption,
I would argue that this is hard to define in general. We actually do not know if running at a lower OPP could be more/less energy efficient. It depends from many other (possibly external) factors, e.g. OPP curves definition, interaction with I/O devices... Quite sure instead we will increase power consumption.
Agreed it's hard to define or know for sure but in general for the purposes of energy, I think it's fair to say that usually you should run at the lowest OPP which meets the performance requirements of the usecase. This assumes that OPPs which consume more or equal power to others while providing less performance have been removed. The typical device configuration out there today supports this conclusion IMO (usage of ondemand/interactive governor).
But again, is the goal of sched-DVFS to be energy-efficient?
Partly yes, as energy-efficient as possible while satisfying the demand for performance.
I think that this responsibility should be better assigned to other players, i.e. scheduling classes.
I'd agree in as much as if a workload wants a strict determinism guarantee it should migrate to SCHED_DEADLINE.
which need not be incurred since RT tasks do not receive this preferential OPP treatment today.
Do they not receive such preferential treatment just because CPUFreq has always been completely decoupled from scheduler-specific information? If we just use the "average CPU idle time" to select the OPP once in a while, that, according to me, is the reason why FIFO/BATCH don't get specific treatment.
I think that on this specific point we should get the RT guys involved and ask them whether a race-to-idle strategy could better match their expectations.
I think I agree with Steve about the RT semantics. It doesn't provide any performance/completion time guarantees today, and I don't think we should build that in now. As said below, the current state of mainline doesn't have any special treatment for RT tasks. So why not leave it as it is today (DVFS guided by current utilization) and ask people to move to SCHED_DEADLINE if they want specific performance/latency guarantees?
If you look at the current implementation, we don't ask for the max freq or a specific freq as soon as an RT task is involved; we use the cpufreq governor policy as for any other task.
Right, that's what happens right now.
So we should keep the same behavior with sched-dvfs as a 1st step: the RT sched-class will provide its requirement according to the current RT task load. Then we will look at improvements, but that falls back into policy, and schedTune could probably help in this area so we can "boost" the RT class.
Ok, I agree on keeping it simple at the beginning, and maybe it's also right as a long-term goal. The only points I wanted to raise with this discussion are: a) perhaps some assumptions on OPP curves may not always hold; b) maybe the semantics of the FIFO/RR classes could be improved.
For the first point, we need to better understand from platform providers what technical points (if any) argue against the usage of an ideal OPP curve where all the OPPs are equally energy-efficient, or at least have monotonically decreasing energy efficiency.
As said above, it could be worth doing a brief investigation to see if something simple like ordering the OPPs by efficiency is feasible and solves the problem.
For the second point, we should better understand from the final users of FIFO/RR (e.g. system integrators) whether there are use-cases where DEADLINE is not usable but running at the lowest OPP is not always enough.
I'm not convinced that you want to go to the max OPP by default for RT tasks either. In the past Android has used RT (not sure if it is still the case) for tiny tasks related to audio, and I don't think we want to go to the max OPP every time one of those runs. If there is indeed a need, we could consider a hack like having some intermediate OPP as the minimum OPP for RT tasks.
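A minimal sketch of such a hack, purely illustrative (the floor value and the helper are hypothetical, not an existing interface): clamp the governor's frequency choice to an intermediate floor whenever RT tasks are runnable, instead of jumping to the max OPP.

/*
 * Hypothetical helper: enforce an intermediate OPP as the minimum
 * while RT tasks are runnable. The floor would be a platform tunable.
 */
#define RT_FLOOR_KHZ	600000

static unsigned long apply_rt_floor(unsigned long governor_khz,
				    unsigned int nr_rt_running)
{
	if (nr_rt_running && governor_khz < RT_FLOOR_KHZ)
		return RT_FLOOR_KHZ;
	return governor_khz;
}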
It can be debated whether the limitations of CPUfreq established the semantics of the RT class or vice-versa, but either way having RT affect the OPP in this way would be a major semantic/policy change that will almost certainly have significant repercussions in power profiling.
I agree that a broader discussion would be good before going further. Beyond just the RT guys, though, I think a community-wide discussion on lkml and linux-pm would be appropriate.
Maybe I'm wrong, but I have the impression that once you schedule a task as FIFO/BATCH, sometimes you also need to "hack" into CPUFreq to ensure a minimum OPP which allows you to match your task's demands in terms of time-to-completion.
I've not seen this specific issue. The boosting I've seen is typically associated with CFS tasks. RT tasks on the platforms I've worked with are usually small enough that they can be satisfied regardless of the OPP.
Here the problem is that with the frameworks we have right now people need to use/combine features of different frameworks to achieve their goals. This sounds to me like something which could be improved, provided that we start by splitting responsibilities and letting users know which tool should be used to achieve a specific goal.
Specifically, if you care about responsiveness and energy-efficiency, you should use DEADLINE instead of FIFO/RR. While if you go for the latter class, then you should be aware that you get a race-to-idle behavior, whatever this means from an energy/power standpoint.
If there's broad consensus that this semantic/policy change is what folks want then I'm all for it, but I'd expect pushback.
+1
On Thu, Oct 15, 2015 at 10:20 PM, Patrick Bellasi patrick.bellasi@arm.com wrote:
Something like a "pre-processing" of the list of OPPs when the EM tables are built, to ensure that, for example, the topmost entry is always the most energy-efficient? This would ensure that entries become progressively less energy-efficient, as well as reduce lookup latencies on the scheduler's critical paths.
that we really need to order them. We just need to ensure that, at any moment, an enabled OPP is more power efficient than any OPP with a higher frequency.
That's a good strategy; however, it requires updating the current definition of the EM tables to support a concept of enabled/disabled entries. Maybe a flag for each capacity entry?
The OPP library (drivers/base/power/opp.c) has an API to enable/disable OPPs at runtime. Whether you want to depend on this library or just copy some ideas is up to you.
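For reference, a minimal sketch of how that API might be used from a thermal-mitigation path; the device pointer, frequency and calling context below are placeholders, not taken from real code:

#include <linux/pm_opp.h>

/*
 * Sketch: hide an energy-inefficient low OPP from the governors while
 * the system is not thermally constrained, and re-enable it when
 * thermal mitigation needs its lower power. 'cpu_dev' and the
 * frequency (in Hz) are placeholders.
 */
static void update_inefficient_opp(struct device *cpu_dev,
				   bool thermally_limited)
{
	unsigned long freq_hz = 400000000;	/* the inefficient OPP */

	if (thermally_limited)
		dev_pm_opp_enable(cpu_dev, freq_hz);
	else
		dev_pm_opp_disable(cpu_dev, freq_hz);
}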
Regards, Amit
On 15 October 2015 at 18:50, Patrick Bellasi patrick.bellasi@arm.com wrote:
On Thu, Oct 15, 2015 at 06:24:16PM +0200, Vincent Guittot wrote:
On 15 October 2015 at 18:10, Morten Rasmussen morten.rasmussen@arm.com wrote:
On Thu, Oct 15, 2015 at 10:24:27AM +0100, Patrick Bellasi wrote:
On Thu, Oct 15, 2015 at 09:23:58AM +0200, Vincent Guittot wrote:
On 14 October 2015 at 21:58, Steve Muckle steve.muckle@linaro.org wrote:
On 10/14/2015 01:58 AM, Patrick Bellasi wrote:
From my experience race to idle has never panned out as an energy-efficient strategy, presumably due to the nonlinear increase in power cost as performance increases.
I agree with you that "race-to-idle" is not (always) a good energy-efficient strategy. However, is the _main_ goal of sched-DVFS to be energy-efficient?
I'd say the primary goal of sched-dvfs is to manage CPU frequency so as to offer the required performance for a platform at the lowest possible energy consumption.
In this case, what should we do for platforms where the lower OPPs are less energy-efficient than some higher OPPs? We just discovered from some discussions at Connect that there are many platforms adopting that strategy for various reasons.
If a lower OPP is less energy efficient than a higher one, I'd expect it to be removed from the devicetree configuration of available frequencies for the governor to choose from.
I agree on that point too, and I think this has also been discussed on LKML. Having a low OPP that is less power efficient than a higher one doesn't make any sense from both a power and a performance PoV.
Actually, I think it could make sense from a power standpoint.
For example, if you are under a thermal constraint but still want to make progress with your workload, there are only two possibilities: a) throttle an energy-efficient OPP b) switch to a low-power OPP, even if it is less energy-efficient, i.e. reduce the F without lowering the V
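To put rough numbers on option b), a toy calculation with made-up figures using the usual dynamic-power model P = C * V^2 * f: halving F at constant V halves power, which is what the thermal budget needs, but the dynamic energy for a fixed amount of work stays the same, and leakage over the doubled runtime makes the OPP less energy-efficient overall.

#include <stdio.h>

/* Toy dynamic-power model: P = C * V^2 * f (illustrative numbers). */
static double power_w(double c, double volt, double freq_hz)
{
	return c * volt * volt * freq_hz;
}

int main(void)
{
	double c = 1e-9;	/* effective switched capacitance */
	double work = 1e9;	/* cycles needed by the workload */

	/* a) efficient OPP: 1.0 GHz at 1.0 V */
	double p_hi = power_w(c, 1.0, 1e9);
	double e_hi = p_hi * (work / 1e9);	/* energy = P * time */

	/* b) same voltage, half the frequency: half the power... */
	double p_lo = power_w(c, 1.0, 0.5e9);
	double e_lo = p_lo * (work / 0.5e9);	/* ...same dynamic energy */

	/* prints 1.00 W vs 0.50 W and 1.00 J vs 1.00 J; add leakage
	 * over the doubled runtime and option b) is less efficient */
	printf("P: %.2f W vs %.2f W, E: %.2f J vs %.2f J\n",
	       p_hi, p_lo, e_hi, e_lo);
	return 0;
}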
I have heard the same argument several times from thermal management people. Despite their lower energy efficiency they still rely on their lower power to stay within the thermal budget.
I'm personally more in favor of the first solution, and a suitable usage of bandwidth control could be enough to provide it.
Agreed, using inefficient OPPs is not really desirable, but it is the only option available at the moment. I think that ideally we should have hardware-implemented idle injection (throttling) instead and let the thermal framework specify the duty cycle. Throttling through software could work, but it isn't feasible to do tricks like aligning the throttling across all CPUs in a cluster to enter a deeper idle state and make the throttling even more efficient.
Back to reality... We do have the choice to avoid those inefficient states in the OPP table. We could hide them from sched-DVFS/cpufreq governors when the system isn't thermally constrained and only enable them when strictly needed. I think somebody (Steve?) suggested recently that we could reorder the list of OPPs by decreasing energy efficiency and always start the search for an appropriate OPP from the beginning. That way we should never pick an inefficient one unless thermal constraints force us to. I think that could be a good start.
Looks like we are aligned with the behavior :-)
Seems so.
Not sure what is meant by reordering the OPP list. But I'm not sure
Something like a "pre-processing" of the list of OPPs when the EM tables are built, to ensure that, for example, the topmost entry is always the most energy-efficient? This would ensure that entries become progressively less energy-efficient, as well as reduce lookup latencies on the scheduler's critical paths.
that we really need to order them. We just need to ensure that, at any moment, an enabled OPP is more power efficient than any OPP with a higher frequency.
That's a good strategy; however, it requires updating the current definition of the EM tables to support a concept of enabled/disabled entries. Maybe a flag for each capacity entry?
You can be notified of any changes to the OPP table. Then it's up to you to decide how to update the energy model. If we consider that we will not have too many disabled OPPs simultaneously, a flag should probably be enough. It could even be included in the capacity field, as an example, to minimize the number of fields and to have atomic access to the value, but that becomes an implementation choice.
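A sketch of what folding the flag into the capacity field could look like; the encoding (top bit of the word marking the entry disabled) is hypothetical, chosen so a single atomic load yields both pieces of information:

#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical encoding: bit 31 of the capacity word marks the OPP
 * as disabled; one atomic load returns both flag and capacity. */
#define OPP_DISABLED_BIT	(1u << 31)
#define OPP_CAPACITY_MASK	(OPP_DISABLED_BIT - 1)

static bool opp_enabled(const _Atomic unsigned int *cap_word)
{
	return !(atomic_load(cap_word) & OPP_DISABLED_BIT);
}

static unsigned int opp_capacity(const _Atomic unsigned int *cap_word)
{
	return atomic_load(cap_word) & OPP_CAPACITY_MASK;
}

static void opp_set_disabled(_Atomic unsigned int *cap_word, bool off)
{
	if (off)
		atomic_fetch_or(cap_word, OPP_DISABLED_BIT);
	else
		atomic_fetch_and(cap_word, OPP_CAPACITY_MASK);
}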
By default, we enable the OPPs with max capacity w.r.t. efficiency. Then we can let the thermal manager enable/disable some OPPs for thermal mitigation, but it must ensure that the previous rule always holds.
Agree.
However, the point here is that when I suggested to some partners at Connect to just get rid of the energy-inefficient lower OPPs, I got the impression that this is not always possible.
What I was not able to completely understand is whether there were some strong technical arguments on the HW/use-case side or it was just a question of "education" about what does/doesn't make sense for a feasible and effective power-management strategy.
I guess that there could be some SoC-specific constraints that force them to keep those OPPs.
IMHO one of the "main" goals of sched-DVFS is to contribute to providing (as much as possible) deterministic behaviors. We have the chance to refactor CPUFreq to better integrate with the scheduler, and thus we should try to exploit this opportunity to improve the overall determinism of the solution.
From this viewpoint, I think it's not so far from reality that, if you schedule a task as FIFO or BATCH real-time, you care most about
Sorry I'm not sure what you mean by BATCH real-time - did you mean SCHED_RR? I'm just aware of two real-time policies, FIFO and RR. AFAICS BATCH is very similar to regular CFS.
latencies, or you _should_ care about latencies. Specifically the time to completion of a task. If this is true the race-to-idle is the only "deterministic" way to achieve such a goal.
I don't believe determinism is part of the semantics of the RT class today. RT just offers the capability for strict prioritization of work.
Given that getting EAS/sched-dvfs accepted is such a herculean task I think any semantic changes should be avoided at least until the foundation is upstream and being used. Especially if they may have a significant impact on energy or performance.
Because of this I think a policy of increasing the OPP when RT tasks are runnable will cause a net increase in energy consumption,
I would argue that this is hard to define in general. We actually do not know if running at a lower OPP could be more/less energy efficient. It depends on many other (possibly external) factors, e.g. OPP curve definitions, interaction with I/O devices... I'm quite sure, instead, that we will increase power consumption.
Agreed it's hard to define or know for sure but in general for the purposes of energy, I think it's fair to say that usually you should run at the lowest OPP which meets the performance requirements of the usecase. This assumes that OPPs which consume more or equal power to others while providing less performance have been removed. The typical device configuration out there today supports this conclusion IMO (usage of ondemand/interactive governor).
But again, is the goal of sched-DVFS to be energy-efficient?
Partly yes, as energy-efficient as possible while satisfying the demand for performance.
I think that this responsibility should be better assigned to other players, i.e. scheduling classes.
I'd agree in as much as if a workload wants a strict determinism guarantee it should migrate to SCHED_DEADLINE.
which need not be incurred since RT tasks do not receive this preferential OPP treatment today.
Do they not receive such preferential treatment just because CPUFreq has always been completely decoupled from scheduler-specific information? If we just use the "average CPU idle time" to select the OPP once in a while, that, according to me, is the reason why FIFO/BATCH don't get specific treatment.
I think that on this specific point we should get the RT guys involved and ask them whether a race-to-idle strategy could better match their expectations.
I think I agree with Steve about the RT semantics. It doesn't provide any performance/completion time guarantees today, and I don't think we should build that in now. As said below, the current state of mainline doesn't have any special treatment for RT tasks. So why not leave it as it is today (DVFS guided by current utilization) and ask people to move to SCHED_DEADLINE if they want specific performance/latency guarantees?
If you look at the current implementation, we don't ask for the max freq or a specific freq as soon as an RT task is involved; we use the cpufreq governor policy as for any other task.
Right, that's what happens right now.
So we should keep the same behavior with sched-dvfs as a 1st step: the RT sched-class will provide its requirement according to the current RT task load. Then we will look at improvements, but that falls back into policy, and schedTune could probably help in this area so we can "boost" the RT class.
Ok, I agree on keeping it simple at the beginning, and maybe it's also right as a long-term goal. The only points I wanted to raise with this discussion are: a) perhaps some assumptions on OPP curves may not always hold; b) maybe the semantics of the FIFO/RR classes could be improved.
For the first point, we need to better understand from platform providers what technical points (if any) argue against the usage of an ideal OPP curve where all the OPPs are equally energy-efficient, or at least have monotonically decreasing energy efficiency.
As said above, it could be worth doing a brief investigation to see if something simple like ordering the OPPs by efficiency is feasible and solves the problem.
For the second point, we should better understand from the final users of FIFO/RR (e.g. system integrators) whether there are use-cases where DEADLINE is not usable but running at the lowest OPP is not always enough.
I'm not convinced that you want to go to the max OPP by default for RT tasks either. In the past Android has used RT (not sure if it is still the case) for tiny tasks related to audio, and I don't think we want to go to the max OPP every time one of those runs. If there is indeed a need, we could consider a hack like having some intermediate OPP as the minimum OPP for RT tasks.
It can be debated whether the limitations of CPUfreq established the semantics of the RT class or vice-versa, but either way having RT affect the OPP in this way would be a major semantic/policy change that will almost certainly have significant repercussions in power profiling.
I agree that a broader discussion would be good before going further. Beyond just the RT guys, though, I think a community-wide discussion on lkml and linux-pm would be appropriate.
Maybe I'm wrong, but I have the impression that once you schedule a task as FIFO/BATCH, sometimes you also need to "hack" into CPUFreq to ensure a minimum OPP which allows you to match your task's demands in terms of time-to-completion.
I've not seen this specific issue. The boosting I've seen is typically associated with CFS tasks. RT tasks on the platforms I've worked with are usually small enough that they can be satisfied regardless of the OPP.
Here the problem is that with the frameworks we have right now people need to use/combine features of different frameworks to achieve their goals. This sounds to me like something which could be improved, provided that we start by splitting responsibilities and letting users know which tool should be used to achieve a specific goal.
Specifically, if you care about responsiveness and energy-efficiency, you should use DEADLINE instead of FIFO/RR. While if you go for the latter class, then you should be aware that you get a race-to-idle behavior, whatever this means from an energy/power standpoint.
If there's broad consensus that this semantic/policy change is what folks want then I'm all for it, but I'd expect pushback.
+1
-- #include <best/regards.h>
Patrick Bellasi
Quoting Morten Rasmussen (2015-10-15 09:10:44)
On Thu, Oct 15, 2015 at 10:24:27AM +0100, Patrick Bellasi wrote:
Actually, I think it could make sense from a power standpoint.
For example, if you are under a thermal constraint but still want to make progress with your workload, there are only two possibilities: a) throttle an energy-efficient OPP b) switch to a low-power OPP, even if it is less energy-efficient, i.e. reduce the F without lowering the V
I have heard the same argument several times from thermal management people. Despite their lower energy efficiency they still rely on their lower power to stay within the thermal budget.
I'm personally more in favor of the first solution, and a suitable usage of bandwidth control could be enough to provide it.
Agreed, using inefficient OPPs is not really desirable, but it is the only option available at the moment. I think that ideally we should have hardware-implemented idle injection (throttling) instead and let the thermal framework specify the duty cycle. Throttling through software could work, but it isn't feasible to do tricks like aligning the throttling across all CPUs in a cluster to enter a deeper idle state and make the throttling even more efficient.
You mean something like this?
git://git.kernel.org/pub/scm/linux/kernel/git/mturquette/linux.git idleforce
Regards, Mike
On 16 October 2015 at 12:20, Michael Turquette mturquette@baylibre.com wrote:
Quoting Morten Rasmussen (2015-10-15 09:10:44)
On Thu, Oct 15, 2015 at 10:24:27AM +0100, Patrick Bellasi wrote:
Actually, I think it could make sense from a power standpoint.
For example, if you are under a thermal constraint but still want to make progress with your workload, there are only two possibilities: a) throttle an energy-efficient OPP b) switch to a low-power OPP, even if it is less energy-efficient, i.e. reduce the F without lowering the V
I have heard the same argument several times from thermal management people. Despite their lower energy efficiency they still rely on their lower power to stay within the thermal budget.
I'm personally more in favor of the first solution, and a suitable usage of bandwidth control could be enough to provide it.
Agreed, using inefficient OPPs is not really desirable, but it is the only option available at the moment. I think that ideally we should have hardware-implemented idle injection (throttling) instead and let the thermal framework specify the duty cycle. Throttling through software could work, but it isn't feasible to do tricks like aligning the throttling across all CPUs in a cluster to enter a deeper idle state and make the throttling even more efficient.
You mean something like this?
git://git.kernel.org/pub/scm/linux/kernel/git/mturquette/linux.git idleforce
I think Morten was referring to the Intel powerclamp / idle-injection driver
Regards, Vincent
Regards, Mike
Quoting Vincent Guittot (2015-10-16 03:23:29)
On 16 October 2015 at 12:20, Michael Turquette mturquette@baylibre.com wrote:
Quoting Morten Rasmussen (2015-10-15 09:10:44)
On Thu, Oct 15, 2015 at 10:24:27AM +0100, Patrick Bellasi wrote:
Actually, I think it could make sense from a power standpoint.
For example, if you are under a thermal constraint but still want to make progress with your workload, there are only two possibilities: a) throttle an energy-efficient OPP b) switch to a low-power OPP, even if it is less energy-efficient, i.e. reduce the F without lowering the V
I have heard the same argument several times from thermal management people. Despite their lower energy efficiency they still rely on their lower power to stay within the thermal budget.
I'm personally more in favor of the first solution, and a suitable usage of bandwidth control could be enough to provide it.
Agreed, using inefficient OPPs is not really desirable, but it is the only option available at the moment. I think that ideally we should have hardware-implemented idle injection (throttling) instead and let the thermal framework specify the duty cycle. Throttling through software could work, but it isn't feasible to do tricks like aligning the throttling across all CPUs in a cluster to enter a deeper idle state and make the throttling even more efficient.
You mean something like this?
git://git.kernel.org/pub/scm/linux/kernel/git/mturquette/linux.git idleforce
I think Morten was referring to the Intel powerclamp / idle-injection driver
Yes, this is recent work based on top of that. It includes coordinating all CPUs in a cluster into a power-down state. Not posted to any upstream list yet, but we implemented it for a customer.
It's just a data point that I am "injecting" into the discussion.
Regards, Mike
Regards, Vincent
Regards, Mike
On 15 October 2015 at 11:24, Patrick Bellasi patrick.bellasi@arm.com wrote:
On Thu, Oct 15, 2015 at 09:23:58AM +0200, Vincent Guittot wrote:
On 14 October 2015 at 21:58, Steve Muckle steve.muckle@linaro.org wrote:
On 10/14/2015 01:58 AM, Patrick Bellasi wrote:
From my experience race to idle has never panned out as an energy-efficient strategy, presumably due to the nonlinear increase in power cost as performance increases.
I agree with you that "race-to-idle" is not (always) a good energy-efficient strategy. However, is the _main_ goal of sched-DVFS to be energy-efficient?
I'd say the primary goal of sched-dvfs is to manage CPU frequency so as to offer the required performance for a platform at the lowest possible energy consumption.
In this case, what should we do for platforms where the lower OPPs are less energy-efficient than some higher OPPs? We just discovered from some discussions at Connect that there are many platforms adopting that strategy for various reasons.
If a lower OPP is less energy efficient than a higher one, I'd expect it to be removed from the devicetree configuration of available frequencies for the governor to choose from.
I agree on that point too, and I think this has also been discussed on LKML. Having a low OPP that is less power efficient than a higher one doesn't make any sense from both a power and a performance PoV.
Actually, I think it could make sense from a power standpoint.
For example, if you are under a thermal constraint but still want to make progress with your workload, there are only two possibilities: a) throttle an energy-efficient OPP b) switch to a low-power OPP, even if it is less energy-efficient, i.e. reduce the F without lowering the V
Thermal mitigation is another story. In a normal situation, I don't see any benefit in enabling a lower OPP that is less efficient than a higher OPP. But if the latter is no longer usable because of thermal mitigation, nothing prevents us from enabling the less efficient OPP. What I mean is that enabling a low OPP while a more efficient higher one is available should be prohibited. But if the system has to disable that efficient OPP because of a power mitigation constraint, nothing prevents us from enabling the less efficient OPP that will temporarily become the most efficient one.
I'm personally more in favor of the first solution, and a suitable usage of bandwidth control could be enough to provide it.
However, the point here is that when I suggested to some partners at Connect to just get rid of the energy-inefficient lower OPPs, I got the impression that this is not always possible.
What I was not able to completely understand is whether there were some strong technical arguments on the HW/use-case side or it was just a question of "education" about what does/doesn't make sense for a feasible and effective power-management strategy.
IMHO one of the "main" goals of sched-DVFS is to contribute to providing (as much as possible) deterministic behaviors. We have the chance to refactor CPUFreq to better integrate with the scheduler, and thus we should try to exploit this opportunity to improve the overall determinism of the solution.
From this viewpoint, I think it's not so far from reality that, if you schedule a task as FIFO or BATCH real-time, you care most about
Sorry I'm not sure what you mean by BATCH real-time - did you mean SCHED_RR? I'm just aware of two real-time policies, FIFO and RR. AFAICS BATCH is very similar to regular CFS.
latencies, or you _should_ care about latencies. Specifically the time to completion of a task. If this is true the race-to-idle is the only "deterministic" way to achieve such a goal.
I don't believe determinism is part of the semantics of the RT class today. RT just offers the capability for strict prioritization of work.
Given that getting EAS/sched-dvfs accepted is such a herculean task I think any semantic changes should be avoided at least until the foundation is upstream and being used. Especially if they may have a significant impact on energy or performance.
Because of this I think a policy of increasing the OPP when RT tasks are runnable will cause a net increase in energy consumption,
I would argue that this is hard to define in general. We actually do not know if running at a lower OPP could be more/less energy efficient. It depends on many other (possibly external) factors, e.g. OPP curve definitions, interaction with I/O devices... I'm quite sure, instead, that we will increase power consumption.
Agreed it's hard to define or know for sure but in general for the purposes of energy, I think it's fair to say that usually you should run at the lowest OPP which meets the performance requirements of the usecase. This assumes that OPPs which consume more or equal power to others while providing less performance have been removed. The typical device configuration out there today supports this conclusion IMO (usage of ondemand/interactive governor).
But again, is the goal of sched-DVFS to be energy-efficient?
Partly yes, as energy-efficient as possible while satisfying the demand for performance.
I think that this responsibility should be better assigned to other players, i.e. scheduling classes.
I'd agree in as much as if a workload wants a strict determinism guarantee it should migrate to SCHED_DEADLINE.
which need not be incurred since RT tasks do not receive this preferential OPP treatment today.
Do they not receive such preferential treatment just because CPUFreq has always been completely decoupled from scheduler-specific information? If we just use the "average CPU idle time" to select the OPP once in a while, that, according to me, is the reason why FIFO/BATCH don't get specific treatment.
I think that on this specific point we should get the RT guys involved and ask them whether a race-to-idle strategy could better match their expectations.
If you look at the current implementation, we don't ask for the max freq or a specific freq as soon as an RT task is involved; we use the cpufreq governor policy as for any other task.
Right, that's what happens right now.
So we should keep the same behavior with sched-dvfs as a 1st step: the RT sched-class will provide its requirement according to the current RT task load. Then we will look at improvements, but that falls back into policy, and schedTune could probably help in this area so we can "boost" the RT class.
Ok, I agree on keeping it simple at the beginning, and maybe it's also right as a long-term goal. The only points I wanted to raise with this discussion are: a) perhaps some assumptions on OPP curves may not always hold; b) maybe the semantics of the FIFO/RR classes could be improved.
For the first point, we need to better understand from platform providers what technical points (if any) argue against the usage of an ideal OPP curve where all the OPPs are equally energy-efficient, or at least have monotonically decreasing energy efficiency.
For the second point, we should better understand from the final users of FIFO/RR (e.g. system integrators) whether there are use-cases where DEADLINE is not usable but running at the lowest OPP is not always enough.
It can be debated whether the limitations of CPUfreq established the semantics of the RT class or vice-versa, but either way having RT affect the OPP in this way would be a major semantic/policy change that will almost certainly have significant repercussions in power profiling.
I agree that a broader discussion would be good before going further. Beyond just the RT guys, though, I think a community-wide discussion on lkml and linux-pm would be appropriate.
Maybe I'm wrong, but I have the impression that once you schedule a task as FIFO/BATCH, sometimes you also need to "hack" into CPUFreq to ensure a minimum OPP which allows you to match your task's demands in terms of time-to-completion.
I've not seen this specific issue. The boosting I've seen is typically associated with CFS tasks. RT tasks on the platforms I've worked with are usually small enough that they can be satisfied regardless of the OPP.
Here the problem is that with the frameworks we have right now people need to use/combine features of different frameworks to achieve their goals. This sounds to me like something which could be improved, provided that we start by splitting responsibilities and letting users know which tool should be used to achieve a specific goal.
Specifically, if you care about responsiveness and energy-efficiency, you should use DEADLINE instead of FIFO/RR. While if you go for the latter class, then you should be aware that you get a race-to-idle behavior, whatever this means from an energy/power standpoint.
If there's broad consensus that this semantic/policy change is what folks want then I'm all for it, but I'd expect pushback.
Cheers Patrick
-- #include <best/regards.h>
Patrick Bellasi
On Wed, Oct 14, 2015 at 12:58:31PM -0700, Steve Muckle wrote:
On 10/14/2015 01:58 AM, Patrick Bellasi wrote:
From my experience race to idle has never panned out as an energy-efficient strategy, presumably due to the nonlinear increase in power cost as performance increases.
I agree with you that "race-to-idle" is not (always) a good energy-efficient strategy. However, is the _main_ goal of sched-DVFS to be energy-efficient?
I'd say the primary goal of sched-dvfs is to manage CPU frequency so as to offer the required performance for a platform at the lowest possible energy consumption.
In this case, what should we do for platforms where the lower OPPs are less energy-efficient than some higher OPPs? We just discovered from some discussions at Connect that there are many platforms adopting that strategy for various reasons.
If a lower OPP is less energy efficient than a higher one, I'd expect it to be removed from the devicetree configuration of available frequencies for the governor to choose from.
I personally have the same view, but talking with some partners at Connect I had the impression that this is not always possible.
IMHO one of the "main" goals of sched-DVFS is to contribute to providing (as much as possible) deterministic behaviors. We have the chance to refactor CPUFreq to better integrate with the scheduler, and thus we should try to exploit this opportunity to improve the overall determinism of the solution.
From this viewpoint, I think it's not so far from reality that, if you schedule a task as FIFO or BATCH real-time, you care most about
Sorry I'm not sure what you mean by BATCH real-time - did you mean SCHED_RR? I'm just aware of two real-time policies, FIFO and RR. AFAICS BATCH is very similar to regular CFS.
You're right, actually I meant FIFO and RR.
latencies, or you _should_ care about latencies. Specifically the time to completion of a task. If this is true the race-to-idle is the only "deterministic" way to achieve such a goal.
I don't believe determinism is part of the semantics of the RT class today. RT just offers the capability for strict prioritization of work.
Right, "determinism" is not the proper term. My focus is on "time-to-completion" (TtC); what I meant is that the shortest TtC can be achieved just by running at the highest OPP.
However, I agree with you that these semantics are not clearly stated today for FIFO/RR tasks. I'm just wondering if it could make sense.
Given that getting EAS/sched-dvfs accepted is such a herculean task I think any semantic changes should be avoided at least until the foundation is upstream and being used. Especially if they may have a significant impact on energy or performance.
Ok, I agree on those tactics. However, I still think that a long-term strategy should be to better define the role of each scheduling class from a performance/power standpoint too.
Because of this I think a policy of increasing the OPP when RT tasks are runnable will cause a net increase in energy consumption,
I would argue that this is hard to define in general. We actually do not know if running at a lower OPP could be more/less energy efficient. It depends on many other (possibly external) factors, e.g. OPP curve definitions, interaction with I/O devices... I'm quite sure, instead, that we will increase power consumption.
Agreed it's hard to define or know for sure but in general for the purposes of energy, I think it's fair to say that usually you should run at the lowest OPP which meets the performance requirements of the usecase. This assumes that OPPs which consume more or equal power to others while providing less performance have been removed. The typical device configuration out there today supports this conclusion IMO (usage of ondemand/interactive governor).
Maybe I'm wrong but, as already said, at Connect I had the impression that this assumption is not always true, for reasons that sometimes they cannot or don't want to explain.
But again, is the goal of sched-DVFS to be energy-efficient?
Partly yes, as energy-efficient as possible while satisfying the demand for performance.
I think that this responsibility should be better assigned to other players, i.e. scheduling classes.
I'd agree in as much as if a workload wants a strict determinism guarantee it should migrate to SCHED_DEADLINE.
which need not be incurred since RT tasks do not receive this preferential OPP treatment today.
Do they not receive such preferential treatment just because CPUFreq has always been completely decoupled from scheduler-specific information? If we just use the "average CPU idle time" to select the OPP once in a while, that, according to me, is the reason why FIFO/BATCH don't get specific treatment.
I think that on this specific point we should get the RT guys involved and ask them whether a race-to-idle strategy could better match their expectations.
It can be debated whether the limitations of CPUfreq established the semantics of the RT class or vice-versa, but either way having RT affect the OPP in this way would be a major semantic/policy change that will almost certainly have significant repercussions in power profiling.
I agree that a broader discussion would be good before going further. Beyond just the RT guys, though, I think a community-wide discussion on lkml and linux-pm would be appropriate.
Right, any specific proposal?
Maybe I'm wrong, but I have the impression that once you schedule a task as FIFO/BATCH, sometimes you also need to "hack" into CPUFreq to ensure a minimum OPP which allows you to match your task's demands in terms of time-to-completion.
I've not seen this specific issue. The boosting I've seen is typically associated with CFS tasks. RT tasks on the platforms I've worked with are usually small enough that they can be satisfied regardless of the OPP.
Ok, that's an interesting point. If this is the general use-case for FIFO/RR tasks, i.e. they can always be "completed fast enough" by running at the lowest OPP, then asking sched-DVFS for the cumulative RT load should always work.
Here the problem is that with the frameworks we have right now people need to use/combine features of different frameworks to achieve their goals. This sounds to me like something which could be improved, provided that we start by splitting responsibilities and letting users know which tool should be used to achieve a specific goal.
Specifically, if you care about responsiveness and energy-efficiency, you should use DEADLINE instead of FIFO/RR. While if you go for the latter class, then you should be aware that you get a race-to-idle behavior, whatever this means from an energy/power standpoint.
If there's broad consensus that this semantic/policy change is what folks want then I'm all for it, but I'd expect pushback.
Ok, so let's have this discussion on the list, perhaps once we have the first initial rework which integrates FIFO/RR without changing the current semantics.
Cheers Patrick