Hi,
A number of patch sets related to power-efficient scheduling have been posted over the last couple of months. Most of them do not have much data to back them up, so I decided to do some testing.
Common for all of the patch sets that I have tested, except one, is that they attempt to pack tasks on as few cpus as possible to allow the remaining cpus to enter deeper sleep states - a strategy that should make sense on most platforms that support per-cpu power gating and multi-socket machines.
Kernel: 3.9
Patch sets:
rlb-v4: sched: use runnable load based balance (Alex Shi)
        https://lkml.org/lkml/2013/4/27/13
pas-v7: sched: power aware scheduling (Alex Shi)
        https://lkml.org/lkml/2013/4/3/732
pst-v3: sched: packing small tasks (Vincent Guittot)
        https://lkml.org/lkml/2013/3/22/183
pst-v4: sched: packing small tasks (Vincent Guittot)
        https://lkml.org/lkml/2013/4/25/396
Configuration:
pas-v7: Set to "powersaving" mode.
pst-v4: Set to "Full" packing mode.
Platform: ARM TC2 (test-chip), 2xCortex-A15 + 3xCortex-A7. Cortex-A15s disabled.
Measurement technique: Time spent non-idle (not in idle state) for each cpu based on cpuidle ftrace events. TC2 does not have per-core power-gating, so packing inside the A7 cluster does not lead to any significant power savings. Note that any product grade hardware (TC2 is a test-chip) will very likely have per-core power-gating, so in those cases packing will have an appreciable effect on power savings. Measuring non-idle time rather than power should give a more clear idea about the effect of the patch sets given that the idle back-end is highly implementation specific.
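For reference, a minimal sketch of how the per-cpu non-idle percentage can be derived from the cpu_idle trace events. The trace line layout ("... <timestamp>: cpu_idle: state=S cpu_id=C", with state=4294967295 marking idle exit) is the standard ftrace output; the program itself is an illustration, not the tool actually used for the numbers below.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define MAX_CPUS 8

        int main(int argc, char **argv)
        {
                double busy[MAX_CPUS] = { 0 }, busy_start[MAX_CPUS] = { 0 };
                int in_busy[MAX_CPUS] = { 0 };
                double first = -1.0, last = 0.0, ts;
                unsigned int state, cpu;
                char line[512], *ev, *p;
                FILE *f = fopen(argc > 1 ? argv[1] : "trace.txt", "r");

                if (!f)
                        return 1;

                while (fgets(line, sizeof(line), f)) {
                        ev = strstr(line, "cpu_idle: state=");
                        if (!ev || ev - line < 2)
                                continue;
                        if (sscanf(ev, "cpu_idle: state=%u cpu_id=%u",
                                   &state, &cpu) != 2 || cpu >= MAX_CPUS)
                                continue;
                        /* the timestamp sits just before ": cpu_idle:" */
                        for (p = ev - 2; p > line && *p != ' '; p--)
                                ;
                        ts = strtod(p, NULL);
                        if (first < 0.0)
                                first = ts;
                        last = ts;
                        if (state == 4294967295u) {       /* idle exit -> busy */
                                busy_start[cpu] = ts;
                                in_busy[cpu] = 1;
                        } else if (in_busy[cpu]) {        /* idle enter */
                                busy[cpu] += ts - busy_start[cpu];
                                in_busy[cpu] = 0;
                        }
                }
                fclose(f);

                if (last > first)
                        for (cpu = 0; cpu < MAX_CPUS; cpu++)
                                printf("cpu %u: %6.2f%% non-idle\n", cpu,
                                       100.0 * busy[cpu] / (last - first));
                return 0;
        }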
Benchmarks:
audio playback (Android): 30s mp3 file playback on Android.
bbench+audio (Android): Web page rendering while doing mp3 playback.
andebench_native (Android): Android benchmark running in native mode.
cyclictest: Short periodic tasks.
Results: Two runs for each patch set.
audio playback (Android)

SMP        non-idle %
           cpu 0    cpu 1    cpu 2
3.9_1      11.96     2.86     2.48
3.9_2      12.64     2.81     1.88
rlb-v4_1   12.61     2.44     1.90
rlb-v4_2   12.45     2.44     1.90
pas-v7_1   16.17     0.03     0.24
pas-v7_2   16.08     0.28     0.07
pst-v3_1   15.18     2.76     1.70
pst-v3_2   15.13     0.80     0.38
pst-v4_1   16.14     0.05     0.00
pst-v4_2   16.34     0.06     0.00
bbench+audio (Android)

SMP        non-idle %                   render time
           cpu 0    cpu 1    cpu 2
3.9_1      25.00    20.73    21.22      812
3.9_2      24.29    19.78    22.34      795
rlb-v4_1   23.84    19.36    22.74      782
rlb-v4_2   24.07    19.36    22.74      797
pas-v7_1   28.29    17.86    16.01      869
pas-v7_2   28.62    18.54    15.05      908
pst-v3_1   29.14    20.59    21.72      830
pst-v3_2   27.69    18.81    20.06      830
pst-v4_1   42.20    13.63     2.29      880
pst-v4_2   41.56    14.40     2.17      935
andebench_native (8 threads) (Android)

SMP        non-idle %                   Score
           cpu 0    cpu 1    cpu 2
3.9_1      99.22    98.88    99.61      4139
3.9_2      99.56    99.31    99.46      4148
rlb-v4_1   99.49    99.61    99.53      4153
rlb-v4_2   99.56    99.61    99.53      4149
pas-v7_1   99.53    99.59    99.29      4149
pas-v7_2   99.42    99.63    99.48      4150
pst-v3_1   97.89    99.33    99.42      4097
pst-v3_2   99.16    99.62    99.42      4097
pst-v4_1   99.34    99.01    99.59      4146
pst-v4_2   99.49    99.52    99.20      4146
cyclictest

SMP        non-idle %
           cpu 0    cpu 1    cpu 2
3.9_1       9.13     8.88     8.41
3.9_2      10.27     8.02     6.30
rlb-v4_1    8.88     8.09     8.11
rlb-v4_2    8.49     8.09     8.11
pas-v7_1   10.20     0.02    11.50
pas-v7_2    7.86    14.31     0.02
pst-v3_1   20.44     8.68     7.97
pst-v3_2   20.41     0.78     1.00
pst-v4_1   21.32     0.21     0.05
pst-v4_2   21.56     0.21     0.04
Overall, pas-v7 seems to do a fairly good job at packing. The idle time distribution seems to be somewhere between pst-v3 and the more aggressive pst-v4 for all the benchmarks. pst-v4 manages to keep two cpus nearly idle (<0.25% non-idle) for both cyclictest and audio, which is better than both pst-v3 and pas-v7. pas-v7 fails to pack cyclictest. Packing does come at a cost, which can be seen for bbench+audio, where pst-v3 and rlb-v4 get better render times than pas-v7 and pst-v4, which do more aggressive packing. rlb-v4 does not pack; it is only included for reference.
From a packing perspective pst-v4 seems to do the best job for the workloads that I have tested on ARM TC2. The less aggressive packing in pst-v3 may be a better choice in terms of performance.
I'm well aware that these tests are heavily focused on mobile workloads. I would therefore encourage people to share their test results for their workloads on their platforms to complete the picture. Comments are also welcome.
Thanks, Morten
On 05/30/2013 09:47 PM, Morten Rasmussen wrote:
Hi,
A number of patch sets related to power-efficient scheduling have been posted over the last couple of months. Most of them do not have much data to back them up, so I decided to do some testing.
Common for all of the patch sets that I have tested, except one, is that they attempt to pack tasks on as few cpus as possible to allow the remaining cpus to enter deeper sleep states - a strategy that should make sense on most platforms that support per-cpu power gating and multi-socket machines.
Kernel: 3.9
Patch sets: rlb-v4: sched: use runnable load based balance (Alex Shi) https://lkml.org/lkml/2013/4/27/13
Thanks for the valuable comparison!
The target of the runnable load balance patches is performance; they still try to disperse tasks across as many CPUs as possible. :) The latest v7 version removes the 6th patch (the wake_affine change) from v4, fixes a slept-time double-counting issue, and removes blocked_load_avg from the tg load. http://comments.gmane.org/gmane.linux.kernel/1498988 Enjoy!
pas-v7: sched: power aware scheduling (Alex Shi) https://lkml.org/lkml/2013/4/3/732
We still have some internal discussion on this patch set before updating it. Sorry for the late response on this patch set!
On 05/31/2013 09:17 AM, Alex Shi wrote:
Kernel: 3.9
Patch sets: rlb-v4: sched: use runnable load based balance (Alex Shi) https://lkml.org/lkml/2013/4/27/13
Thanks for the valuable comparison!
The target of the runnable load balance patches is performance; they still try to disperse tasks across as many CPUs as possible. :) The latest v7 version removes the 6th patch (the wake_affine change) from v4, fixes a slept-time double-counting issue, and removes blocked_load_avg from the tg load. http://comments.gmane.org/gmane.linux.kernel/1498988
Even though the rlb patch set targets performance, maybe the power benefit is due to better balancing?

Anyway, I would appreciate it if you could test the latest v7 version. :) https://github.com/alexshi/power-scheduling.git runnablelb
* Morten Rasmussen morten.rasmussen@arm.com wrote:
Hi,
A number of patch sets related to power-efficient scheduling have been posted over the last couple of months. Most of them do not have much data to back them up, so I decided to do some testing.
Thanks, numbers are always welcome!
Measurement technique: Time spent non-idle (not in idle state) for each cpu based on cpuidle ftrace events. TC2 does not have per-core power-gating, so packing inside the A7 cluster does not lead to any significant power savings. Note that any product grade hardware (TC2 is a test-chip) will very likely have per-core power-gating, so in those cases packing will have an appreciable effect on power savings. Measuring non-idle time rather than power should give a more clear idea about the effect of the patch sets given that the idle back-end is highly implementation specific.
Note that I still disagree with the whole design notion of having an "idle back-end" (and a 'cpufreq back end') separate from scheduler power saving policy, and none of the patch-sets offered so far solve this fundamental design problem.
PeterZ and me tried to point out the design requirements previously, but it still does not appear to be clear enough to people, so let me spell it out again, in a hopefully clearer fashion.
The scheduler has valuable power saving information available:
- when a CPU is busy: about how long the current task expects to run
- when a CPU is idle: how long the current CPU expects _not_ to run
- topology: it knows how the CPUs and caches interrelate and already optimizes based on that
- various high level and low level load averages and other metrics about the recent past that show how busy a particular CPU is, how busy the whole system is, and what the runtime properties of individual tasks are (how often they sleep, etc.)
so the scheduler is in an _ideal_ position to do a judgement call about the near future and estimate how deep an idle state a CPU core should enter into and what frequency it should run at.
The scheduler is also at a high enough level to host a "I want maximum performance, power does not matter to me" user policy override switch and similar user policy details.
No ifs and whens about that.
Today the power saving landscape is fragmented and sad: we just randomly interface scheduler task packing changes with some idle policy (and cpufreq policy), which might or might not combine correctly.
Even when the numbers improve, it's an entirely random, essentially unmaintainable property: because there's no clear split (possible) between 'scheduler policy' and 'idle policy'. This is why we removed the old, broken power saving scheduler code a year ago: to make room for something _better_.
So if we want to add back scheduler power saving then what should happen is genuinely better code:
To create a new low level idle driver mechanism that the scheduler can use, and to integrate proper power saving / idle policy into the scheduler.
In that power saving framework the already existing scheduler topology information should be extended with deep idle parameters:
- enumeration of idle states
- how long it takes to enter+exit a particular idle state
- [ perhaps information about how destructive to CPU caches that particular idle state is. ]
- new driver entry point that allows the scheduler to enter any of the enumerated idle states. Platform code will not change this state; all policy decisions and the idle state are decided at the power saving policy level.
All of this combines into a 'cost to enter and exit an idle state' estimation plus a way to enter idle states. It should be presented to the scheduler in a platform independent fashion, but without policy embedded: a low level platform driver interface in essence.
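As a rough sketch of the shape such an interface could take (none of these structures or functions exist in the kernel; the names are invented for illustration, and the states[] array is assumed to be ordered from shallowest to deepest):

        struct sched_idle_state {
                unsigned int    entry_latency_us;       /* cost to get in */
                unsigned int    exit_latency_us;        /* cost to get back out */
                unsigned int    min_residency_us;       /* break-even residency */
                bool            flushes_caches;         /* how destructive it is */
        };

        struct sched_idle_driver {                      /* mechanism, no policy */
                int                             nr_states;
                struct sched_idle_state         *states;
                int (*enter)(int cpu, int state_idx);   /* blocks until wakeup */
        };

        /*
         * Policy stays in the scheduler: given its own prediction of how long
         * this cpu will stay idle, pick the deepest state that still pays off.
         */
        static int sched_pick_idle_state(struct sched_idle_driver *drv,
                                         u64 predicted_idle_us)
        {
                int i, best = 0;

                for (i = 1; i < drv->nr_states; i++) {
                        struct sched_idle_state *s = &drv->states[i];

                        if (predicted_idle_us >= s->min_residency_us +
                                                 s->entry_latency_us +
                                                 s->exit_latency_us)
                                best = i;
                }
                return best;
        }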
Thomas Gleixner's recent work to generalize platform idle routines will further help the implementation of this. (that code is upstream already)
_All_ policy, all metrics, all averaging should happen at the scheduler power saving level, in a single place, and then the scheduler should directly drive the new low level idle state driver mechanism.
'scheduler power saving' and 'idle policy' are one and the same principle and they should be handled in a single place to offer the best power saving results.
Note that any RFC patch-set that offers an implementation for this could be structured in a gradual fashion: only implementing it for a limited CPU range initially. The new framework can then be extended to more and more CPUs and architectures, incorporating more complicated power saving features gradually. (The old, existing idle policy code would remain untouched and available - it would simply not be used when the new policy is activated.)
I.e. I'm not asking for a 'rewrite the world' kind of impossible task - I'm providing an actionable path to get improved power saving upstream, but it has to use a _sane design_.
This is a "line in the sand", a 'must have' design property for any scheduler power saving patches to be acceptable - and I'm NAK-ing incomplete approaches that don't solve the root design cause of our power saving troubles...
Thanks,
Ingo
enumeration of idle states
how long it takes to enter+exit a particular idle state
[ perhaps information about how destructive to CPU caches that particular idle state is. ]
new driver entry point that allows the scheduler to enter any of the enumerated idle states. Platform code will not change this state, all policy decisions and the idle state is decided at the power saving policy level.
All of this combines into a 'cost to enter and exit an idle state' estimation plus a way to enter idle states. It should be presented to the scheduler in a platform independent fashion, but without policy embedded: a low level platform driver interface in essence.
you're missing an aspect. Deeper idle states on one core allow (on Intel and AMD at least) the other cores to go faster. So it's not as simple as "if I want more performance, go less deep". By going less deep you also reduce overall performance of the system... as well as increase the power usage.
This aspect really really cannot be ignored, it's quite significant today, and going forward is only going to get more and more significant.
* Arjan van de Ven arjan@linux.intel.com wrote:
enumeration of idle states
how long it takes to enter+exit a particular idle state
[ perhaps information about how destructive to CPU caches that particular idle state is. ]
new driver entry point that allows the scheduler to enter any of the enumerated idle states. Platform code will not change this state, all policy decisions and the idle state is decided at the power saving policy level.
All of this combines into a 'cost to enter and exit an idle state' estimation plus a way to enter idle states. It should be presented to the scheduler in a platform independent fashion, but without policy embedded: a low level platform driver interface in essence.
you're missing an aspect.
Deeper idle states on one core, allow (on Intel and AMD at least) the other cores to go faster. So it's not so simple as "if I want more performance, go less deep". By going less deep you also reduce overall performance of the system... as well as increase the power usage.
This aspect really really cannot be ignored, it's quite significant today, and going forward is only going to get more and more significant.
I'm not missing turbo mode, just wanted to keep the above discussion simple. For turbo mode the "go for performance" constraints are simply different, more global. We have similar concerns in the scheduler already - for example system-global scheduling decisions for NUMA balancing.
Turbo mode in fact shows _why_ it's important to decide this on a higher, unified level to achieve best results: as the constraints and interdependencies become more complex it's not a simple CPU-local CPU-resource utilization decision anymore, but a system-wide one, where broad kinds of scheduling information are needed to make a good guess.
Thanks,
Ingo
On Fri, May 31, 2013 at 11:52:04AM +0100, Ingo Molnar wrote:
- Morten Rasmussen morten.rasmussen@arm.com wrote:
Hi,
A number of patch sets related to power-efficient scheduling have been posted over the last couple of months. Most of them do not have much data to back them up, so I decided to do some testing.
Thanks, numbers are always welcome!
Measurement technique: Time spent non-idle (not in idle state) for each cpu based on cpuidle ftrace events. TC2 does not have per-core power-gating, so packing inside the A7 cluster does not lead to any significant power savings. Note that any product grade hardware (TC2 is a test-chip) will very likely have per-core power-gating, so in those cases packing will have an appreciable effect on power savings. Measuring non-idle time rather than power should give a more clear idea about the effect of the patch sets given that the idle back-end is highly implementation specific.
Note that I still disagree with the whole design notion of having an "idle back-end" (and a 'cpufreq back end') separate from scheduler power saving policy, and none of the patch-sets offered so far solve this fundamental design problem.
PeterZ and me tried to point out the design requirements previously, but it still does not appear to be clear enough to people, so let me spell it out again, in a hopefully clearer fashion.
The scheduler has valuable power saving information available:
when a CPU is busy: about how long the current task expects to run
when a CPU is idle: how long the current CPU expects _not_ to run
topology: it knows how the CPUs and caches interrelate and already optimizes based on that
various high level and low level load averages and other metrics about the recent past that show how busy a particular CPU is, how busy the whole system is, and what the runtime properties of individual tasks is (how often it sleeps, etc.)
so the scheduler is in an _ideal_ position to do a judgement call about the near future and estimate how deep an idle state a CPU core should enter into and what frequency it should run at.
The scheduler is also at a high enough level to host a "I want maximum performance, power does not matter to me" user policy override switch and similar user policy details.
No ifs and whens about that.
Today the power saving landscape is fragmented and sad: we just randomly interface scheduler task packing changes with some idle policy (and cpufreq policy), which might or might not combine correctly.
Even when the numbers improve, it's an entirely random, essentially unmaintainable property: because there's no clear split (possible) between 'scheduler policy' and 'idle policy'. This is why we removed the old, broken power saving scheduler code a year ago: to make room for something _better_.
So if we want to add back scheduler power saving then what should happen is genuinely better code:
To create a new low level idle driver mechanism the scheduler could use and integrate proper power saving / idle policy into the scheduler.
In that power saving framework the already existing scheduler topology information should be extended with deep idle parameters:
enumeration of idle states
how long it takes to enter+exit a particular idle state
[ perhaps information about how destructive to CPU caches that particular idle state is. ]
new driver entry point that allows the scheduler to enter any of the enumerated idle states. Platform code will not change this state, all policy decisions and the idle state is decided at the power saving policy level.
All of this combines into a 'cost to enter and exit an idle state' estimation plus a way to enter idle states. It should be presented to the scheduler in a platform independent fashion, but without policy embedded: a low level platform driver interface in essence.
Thomas Gleixner's recent work to generalize platform idle routines will further help the implementation of this. (that code is upstream already)
_All_ policy, all metrics, all averaging should happen at the scheduler power saving level, in a single place, and then the scheduler should directly drive the new low level idle state driver mechanism.
'scheduler power saving' and 'idle policy' are one and the same principle and they should be handled in a single place to offer the best power saving results.
Note that any RFC patch-set that offers an implementation for this could be structured in a gradual fashion: only implementing it for a limited CPU range initially. The new framework can then be extended to more and more CPUs and architectures, incorporating more complicated power saving features gradually. (The old, existing idle policy code would remain untouched and available - it would simply not be used when the new policy is activated.)
I.e. I'm not asking for a 'rewrite the world' kind of impossible task - I'm providing an actionable path to get improved power saving upstream, but it has to use a _sane design_.
This is a "line in the sand", a 'must have' design property for any scheduler power saving patches to be acceptable - and I'm NAK-ing incomplete approaches that don't solve the root design cause of our power saving troubles...
Thanks for sharing your view.
I agree with the idea of having a high level user switch to change power/performance policy trade-offs for the system. Not only for scheduling. I also share your view that the scheduler is in the ideal place to drive the frequency scaling and idle policies.
However, I think that an integrated solution with one unified policy implemented in the scheduler would take a significant rewrite of the scheduler and the power management frameworks even if we start with just a few SoCs.
To reach an integrated solution that does better than the current approach there is a range of things that need to be considered:
- Define a power-efficient scheduling policy. Depending on the power gating support on the particular system, packing tasks may improve power-efficiency, while spreading the tasks may be better on other systems.
- Define how the user policy switch works. In previous discussions it was proposed to have a high level switch that allows specification of what the system should strive to achieve - power saving or performance. In those discussions, what power meant wasn't exactly defined.
- Find a generic way to represent the power topology, which includes power domains, voltage domains and frequency domains. Also, more importantly, how we can derive the optimal power/performance policy for the specific platform. There may be dependencies between idle and frequency states, as is the case for the frequency boost mode that Arjan mentions in his reply.
- The fact that not all platforms expose all idle states to the OS and that closed firmware may do whatever it likes behind the scenes. There are various reasons to do this. Not all of them are bad.
- Define a scheduler driven frequency scaling policy that at least matches the 'performance' of the current cpufreq policies and has potential for further improvements.
- Match the power savings of the current cpuidle governors which are based on arcane heuristics developed over years to predict things like the occurrence of the next interrupt.
- Thermal aspects add more complexity to the power/performance policy. Depending on the platform, overheating may be handled by frequency capping or restricting the number of active cpus.
- Asymmetric/heterogeneous multi-processors need to be dealt with.
This is not a complete list. My point is that moving all policy to the scheduler will significantly increase the complexity of the scheduler. It is my impression that the general opinion is that the scheduler is already too complicated. Correct me if I'm wrong.
While the proposed task packing patches are not complete solutions, they address the first item on the above list and can be seen as a step towards the goal.
Should I read your recommendation as you prefer a complete and potentially huge patch set over incremental patch sets?
It would be good to have even a high level agreement on the path forward, where the expectation first and foremost is to take advantage of the scheduler's ideal position to drive power management while simplifying the power management code.
Thanks, Morten
Hi Morten,
I have one point to make below.
On 06/04/2013 08:33 PM, Morten Rasmussen wrote:
Thanks for sharing your view.
I agree with idea of having a high level user switch to change power/performance policy trade-offs for the system. Not only for scheduling. I also share your view that the scheduler is in the ideal place to drive the frequency scaling and idle policies.
However, I think that an integrated solution with one unified policy implemented in the scheduler would take a significant rewrite of the scheduler and the power management frameworks even if we start with just a few SoCs.
To reach an integrated solution that does better than the current approach there is a range of things that need to be considered:
Define a power-efficient scheduling policy. Depending on the power gating support on the particular system packing tasks may improve power-efficiency while spreading the tasks may be better for others.
Define how the user policy switch works. In previous discussions it was proposed to have a high level switch that allows specification of what the system should strive to achieve - power saving or performance. In those discussions, what power meant wasn't exactly defined.
Find a generic way to represent the power topology which includes power domains, voltage domains and frequency domains. Also, more importantly how we can derive the optimal power/performance policy for the specific platform. There may be dependencies between idle and frequency states like it is the case for frequency boost mode like Arjan mentions in his reply.
The fact that not all platforms expose all idle states to the OS and that closed firmware may do whatever it likes behind the scenes. There are various reasons to do this. Not all of them are bad.
Define a scheduler driven frequency scaling policy that at least matches the 'performance' of the current cpufreq policies and has potential for further improvements.
Match the power savings of the current cpuidle governors which are based on arcane heuristics developed over years to predict things like the occurrence of the next interrupt.
Thermal aspects add more complexity to the power/performance policy. Depending on the platform, overheating may be handled by frequency capping or restricting the number of active cpus.
Asymmetric/heterogeneous multi-processors need to be dealt with.
This is not a complete list. My point is that moving all policy to the scheduler will significantly increase the complexity of the scheduler. It is my impression that the general opinion is that the scheduler is already too complicated. Correct me if I'm wrong.
I don't think this is the idea. As you have rightly pointed out above, the current cpuidle and cpufrequency governors are based on heuristics that have been developed over years. So in my opinion, we must not strive to duplicate this effort in the scheduler; rather, we must strive to improve the co-operation between the scheduler and these governors.
As I have mentioned in the reply to Ingo's mail, we do not have a two way co-operation between the cpuidle/cpufrequency subsystems and the scheduler. When the scheduler decides not to schedule tasks on certain CPUs for a long time, the cpuidle governor, for instance, puts them into a deep idle state, since it looks at the load average of the CPUs, among other things, before doing so.
So here we notice that cpuidle is *listening* to scheduler decisions. However when the scheduler decides to schedule newer/woken up tasks, it looks for the *idlest* cpu to run them on, without considering which idle state that CPU is in. The result is waking up a deep idle state CPU, rather than a shallow one, thus hindering power savings. IOW, the scheduler is *not listening* to the decisions taken by the cpuidle governor.
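A sketch of the missing piece being described here: a wakeup-time cpu selection that breaks load ties by the depth of the idle state a cpu is in. idle_state_depth_of() is a made-up hook that would have to expose the cpuidle governor's last decision to the scheduler (0 meaning running or shallowest, higher meaning deeper); the real idlest-cpu search in kernel/sched/fair.c compares load only.

        extern int idle_state_depth_of(int cpu);        /* hypothetical hook */

        static int find_idlest_cpu_power_aware(const struct cpumask *mask)
        {
                unsigned long load, min_load = ULONG_MAX;
                int cpu, depth, shallowest = INT_MAX, best = -1;

                for_each_cpu(cpu, mask) {
                        load = cpu_rq(cpu)->load.weight;
                        depth = idle_state_depth_of(cpu);

                        /* prefer lower load; among equals, prefer shallower idle */
                        if (load < min_load ||
                            (load == min_load && depth < shallowest)) {
                                min_load = load;
                                shallowest = depth;
                                best = cpu;
                        }
                }
                return best;
        }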
If we observe the basis and the principle of scheduling today, the scheduler makes its decisions based on the scheduling domain hierarchy and, more importantly, the *load* on the CPUs. It does not consider idleness, frequency or thermal aspects, among the other things that you and Ingo have pointed out. I think here is where we need to step in. We need the scheduler to be *well aware* of its ecosystem, *not necessarily decide this ecosystem*.
As Amit Kucheria has pointed out, currently without this two way co-operation, we might see scheduler fighting with these subsystems. We could as one of the steps to power savings in scheduler, try and eliminate that.
While the proposed task packing patches are not complete solutions, they address the first item on the above list and can be seen as a step towards the goal.
Should I read your recommendation as you prefer a complete and potentially huge patch set over incremental patch sets?
It would be good to have even a high level agreement on the path forward where the expectation first and foremost is to take advantage of the schedulers ideal position to drive the power management while simplifying the power management code.
Thanks, Morten
Regards Preeti U Murthy
* Morten Rasmussen morten.rasmussen@arm.com wrote:
Thanks for sharing your view.
I agree with idea of having a high level user switch to change power/performance policy trade-offs for the system. Not only for scheduling. I also share your view that the scheduler is in the ideal place to drive the frequency scaling and idle policies.
However, I think that an integrated solution with one unified policy implemented in the scheduler would take a significant rewrite of the scheduler and the power management frameworks even if we start with just a few SoCs.
To reach an integrated solution that does better than the current approach there is a range of things that need to be considered:
Define a power-efficient scheduling policy. Depending on the power gating support on the particular system packing tasks may improve power-efficiency while spreading the tasks may be better for others.
Define how the user policy switch works. In previous discussions it was proposed to have a high level switch that allows specification of what the system should strive to achieve - power saving or performance. In those discussions, what power meant wasn't exactly defined.
Find a generic way to represent the power topology which includes power domains, voltage domains and frequency domains. Also, more importantly how we can derive the optimal power/performance policy for the specific platform. There may be dependencies between idle and frequency states like it is the case for frequency boost mode like Arjan mentions in his reply.
The fact that not all platforms expose all idle states to the OS and that closed firmware may do whatever it likes behind the scenes. There are various reasons to do this. Not all of them are bad.
Define a scheduler driven frequency scaling policy that at least matches the 'performance' of the current cpufreq policies and has potential for further improvements.
Match the power savings of the current cpuidle governors which are based on arcane heuristics developed over years to predict things like the occurrence of the next interrupt.
Thermal aspects add more complexity to the power/performance policy. Depending on the platform, overheating may be handled by frequency capping or restricting the number of active cpus.
Asymmetric/heterogeneous multi-processors need to be dealt with.
This is not a complete list. My point is that moving all policy to the scheduler will significantly increase the complexity of the scheduler. It is my impression that the general opinion is that the scheduler is already too complicated. Correct me if I'm wrong.
The thing we care about is the net complexity of the kernel. Moving related kernel code next to each other will in the _worst case_ result in exactly the same complexity as we had before.
But even just a small number of unifications will decrease complexity and give us a chance to implement a more workable, more maintainable, more correct power saving policy.
The scheduler maintainers have no problem with going this way - we've asked for such a design and approach for years.
While the proposed task packing patches are not complete solutions, they address the first item on the above list and can be seen as a step towards the goal.
Should I read your recommendation as you prefer a complete and potentially huge patch set over incremental patch sets?
I like incremental and see no reason why this couldn't be made incremental, by adding the new facility for a smallish, manageable number of supported configurations - then extending it gradually as it proves itself.
It would be good to have even a high level agreement on the path forward where the expectation first and foremost is to take advantage of the schedulers ideal position to drive the power management while simplifying the power management code.
I'd suggest to try a set of patches that implements this for the hw configuration you are most interested in - then measure and see where we stand.
It should be a non-disruptive approach: i.e. a new CONFIG_SCHED_POWER .config switch, which, if turned off, makes the new code go away, and it also won't do anything on platforms that don't (yet) support the driver model where the scheduler determines idle and performance states.
On CONFIG_SCHED_POWER=y kernels the new policy activates if there's low level support present.
There's no other mode of operation: either the new scheduling policy is fully there, or it's totally inactive.
This makes it entirely non-disruptive and non-regressive, while still providing a road towards goodness.
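In code terms, the all-or-nothing switch could look roughly like this. CONFIG_SCHED_POWER, sched_power_driver and sched_power_enter_idle() are hypothetical names (sched_power_driver reuses the sched_idle_driver idea sketched earlier in the thread); cpuidle_idle_call() is the existing legacy path.

        #ifdef CONFIG_SCHED_POWER
        extern struct sched_idle_driver *sched_power_driver;   /* set by platform code */
        extern void sched_power_enter_idle(int cpu);           /* unified policy */

        static inline bool sched_power_active(void)
        {
                return sched_power_driver != NULL;
        }
        #else
        static inline bool sched_power_active(void)
        {
                return false;
        }
        #endif

        static void enter_idle(int cpu)
        {
                if (sched_power_active())
                        sched_power_enter_idle(cpu);    /* scheduler picks the state */
                else
                        cpuidle_idle_call();            /* existing cpuidle governors */
        }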
Thanks,
Ingo
On Fri, May 31, 2013 at 4:22 PM, Ingo Molnar mingo@kernel.org wrote:
- Morten Rasmussen morten.rasmussen@arm.com wrote:
Hi,
A number of patch sets related to power-efficient scheduling have been posted over the last couple of months. Most of them do not have much data to back them up, so I decided to do some testing.
Thanks, numbers are always welcome!
Measurement technique: Time spent non-idle (not in idle state) for each cpu based on cpuidle ftrace events. TC2 does not have per-core power-gating, so packing inside the A7 cluster does not lead to any significant power savings. Note that any product grade hardware (TC2 is a test-chip) will very likely have per-core power-gating, so in those cases packing will have an appreciable effect on power savings. Measuring non-idle time rather than power should give a more clear idea about the effect of the patch sets given that the idle back-end is highly implementation specific.
Note that I still disagree with the whole design notion of having an "idle back-end" (and a 'cpufreq back end') separate from scheduler power saving policy, and none of the patch-sets offered so far solve this fundamental design problem.
I don't think you'll see any argument on this one.
PeterZ and me tried to point out the design requirements previously, but it still does not appear to be clear enough to people, so let me spell it out again, in a hopefully clearer fashion.
It hasn't been spelled out in as many words before, so thank you!
The scheduler has valuable power saving information available:
when a CPU is busy: about how long the current task expects to run
when a CPU is idle: how long the current CPU expects _not_ to run
topology: it knows how the CPUs and caches interrelate and already optimizes based on that
various high level and low level load averages and other metrics about the recent past that show how busy a particular CPU is, how busy the whole system is, and what the runtime properties of individual tasks is (how often it sleeps, etc.)
so the scheduler is in an _ideal_ position to do a judgement call about the near future and estimate how deep an idle state a CPU core should enter into and what frequency it should run at.
The scheduler is also at a high enough level to host a "I want maximum performance, power does not matter to me" user policy override switch and similar user policy details.
No ifs and whens about that.
Today the power saving landscape is fragmented and sad: we just randomly interface scheduler task packing changes with some idle policy (and cpufreq policy), which might or might not combine correctly.
Even when the numbers improve, it's an entirely random, essentially unmaintainable property: because there's no clear split (possible) between 'scheduler policy' and 'idle policy'. This is why we removed the old, broken power saving scheduler code a year ago: to make room for something _better_.
So if we want to add back scheduler power saving then what should happen is genuinely better code:
My understanding (and that of several of my colleagues) in discussions with some of the folks on cc was that we wanted the following things to happen in somewhat this order:
1. Replacement for task packing bits of sched_mc (Vincent's packing
   small task patchset)
2. General scalability improvements and low-hanging fruit e.g. Thomas'
   hotplug/kthread rework, un-pinned workqueues (queued for 3.11 by
   Tejun), migrating running timers (RFC patches being discussed),
   Adaptive NO_HZ, etc.
3. Scheduler-driven CPU states (DVFS and idle)
   a. More CPU topology information in scheduler (to replace
      related_cpus, affected_cpus, couple C-states and other such
      constructs)
   b. Intermediate step to replace cpufreq/cpuidle governors with a
      'sched governor' that uses scheduler statistics instead of
      heuristics in the governors today.
   c. Thermal input into scheduling decisions
   d. Co-existing sched-driven and legacy cpufreq/cpuidle policies
   e. Switch over newer HW to default to sched-driven policy
Morten has already gone into great detail about some of the things we need to address before the scheduler can drive power management.
What you've outlined in this email more or less reverses the order we had in mind. And that is fine as long as we're all agreeing that it is the way forward. More below.
To create a new low level idle driver mechanism the scheduler could use and integrate proper power saving / idle policy into the scheduler.
In that power saving framework the already existing scheduler topology information should be extended with deep idle parameters:
enumeration of idle states
how long it takes to enter+exit a particular idle state
[ perhaps information about how destructive to CPU caches that particular idle state is. ]
new driver entry point that allows the scheduler to enter any of the enumerated idle states. Platform code will not change this state, all policy decisions and the idle state is decided at the power saving policy level.
All of this combines into a 'cost to enter and exit an idle state' estimation plus a way to enter idle states. It should be presented to the scheduler in a platform independent fashion, but without policy embedded: a low level platform driver interface in essence.
Thomas Gleixner's recent work to generalize platform idle routines will further help the implementation of this. (that code is upstream already)
_All_ policy, all metrics, all averaging should happen at the scheduler power saving level, in a single place, and then the scheduler should directly drive the new low level idle state driver mechanism.
'scheduler power saving' and 'idle policy' are one and the same principle and they should be handled in a single place to offer the best power saving results.
Note that any RFC patch-set that offers an implementation for this could be structured in a gradual fashion: only implementing it for a limited CPU range initially. The new framework can then be extended to more and more CPUs and architectures, incorporating more complicated power saving features gradually. (The old, existing idle policy code would remain untouched and available - it would simply not be used when the new policy is activated.)
I.e. I'm not asking for a 'rewrite the world' kind of impossible task - I'm providing an actionable path to get improved power saving upstream, but it has to use a _sane design_.
Someone will have to rewrite the world at some point. IMHO, you're just asking for the schedule to be brought forward. :)
Doing steps 1. and 2. has brought us to an acceptable power/performance threshold. Sure, we still have separate cpuidle, cpufreq and thermal subsystems that sometimes fight each other, but it is mostly a well-understood problem with known workarounds. Step 3 feels like good hygiene at this point, but one that we intend to help with.
This is a "line in the sand", a 'must have' design property for any scheduler power saving patches to be acceptable - and I'm NAK-ing incomplete approaches that don't solve the root design cause of our power saving troubles...
From what I've read in your proposal, you want step 3 done first. Am I correct in that assumption? I really want to nail down the requirements and perhaps a sequence of steps that you might have in mind.
Can we also expect more timely feedback/flames on this topic going forward?
Regards, Amit
Hi,
On 05/31/2013 04:22 PM, Ingo Molnar wrote:
PeterZ and me tried to point out the design requirements previously, but it still does not appear to be clear enough to people, so let me spell it out again, in a hopefully clearer fashion.
The scheduler has valuable power saving information available:
when a CPU is busy: about how long the current task expects to run
when a CPU is idle: how long the current CPU expects _not_ to run
topology: it knows how the CPUs and caches interrelate and already optimizes based on that
various high level and low level load averages and other metrics about the recent past that show how busy a particular CPU is, how busy the whole system is, and what the runtime properties of individual tasks is (how often it sleeps, etc.)
so the scheduler is in an _ideal_ position to do a judgement call about the near future and estimate how deep an idle state a CPU core should enter into and what frequency it should run at.
I don't think the problem lies in the fact that the scheduler is not making these decisions about which idle state the CPU should enter or which frequency the CPU should run at.
IIUC, I think the problem lies in the part where although the *cpuidle and cpufrequency governors are co-operating with the scheduler, the scheduler is not doing the same.*
Let me elaborate with respect to the cpuidle subsystem. When the scheduler chooses the CPUs to run tasks on, it leaves certain other CPUs idle. The cpuidle governor then evaluates, among other things, the load average of the CPUs before deciding to put them into an appropriate idle state. With PJT's metric, an idle CPU's load average degrades over time, and the cpuidle governor will perhaps decide to put such CPUs into deep idle states.
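As a rough illustration of how that degradation behaves: per-entity load tracking decays past activity geometrically, halving its contribution roughly every 32 ms, so a cpu the load balancer stops using only approaches zero load after tens of milliseconds, and that slowly decaying signal is what the governor ends up reacting to. The numbers below are illustrative only.

        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
                int ms;

                /* contribution of past load, starting from a full 1024 */
                for (ms = 0; ms <= 128; ms += 32)
                        printf("idle for %3d ms -> load contribution ~ %6.1f\n",
                               ms, 1024.0 * pow(0.5, ms / 32.0));
                return 0;
        }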
But the problem surfaces when the scheduler gets to choose a CPU to run new/woken up tasks on. It chooses the *idlest_cpu* to run the task on without considering how deep an idle state that CPU is in, if it is in an idle state at all. It would end up waking a deep sleeping CPU, which will *hinder power savings*.
I think here is where we need to focus. Currently, there is no *two way co-operation between the scheduler and the cpuidle/cpufrequency* subsystems, which makes no sense. In the above case, for instance, the scheduler prompts the cpuidle governor to put a CPU into an idle state and then comes back to hamper that move.
The scheduler is also at a high enough level to host a "I want maximum performance, power does not matter to me" user policy override switch and similar user policy details.
No ifs and whens about that.
Today the power saving landscape is fragmented and sad: we just randomly interface scheduler task packing changes with some idle policy (and cpufreq policy), which might or might not combine correctly.
I would repeat here that today we interface cpuidle/cpufrequency policies with the scheduler, but not the other way around. They do their bit when a cpu is busy/idle. However, the scheduler does not see that somebody else is taking instructions from it and comes back to give different instructions!
Therefore I think among other things, this is one fundamental issue that we need to resolve in the steps towards better power savings through scheduler.
Regards Preeti U Murthy
Hi Preeti,
On 7 June 2013 07:03, Preeti U Murthy preeti@linux.vnet.ibm.com wrote:
On 05/31/2013 04:22 PM, Ingo Molnar wrote:
PeterZ and me tried to point out the design requirements previously, but it still does not appear to be clear enough to people, so let me spell it out again, in a hopefully clearer fashion.
The scheduler has valuable power saving information available:
when a CPU is busy: about how long the current task expects to run
when a CPU is idle: how long the current CPU expects _not_ to run
topology: it knows how the CPUs and caches interrelate and already optimizes based on that
various high level and low level load averages and other metrics about the recent past that show how busy a particular CPU is, how busy the whole system is, and what the runtime properties of individual tasks is (how often it sleeps, etc.)
so the scheduler is in an _ideal_ position to do a judgement call about the near future and estimate how deep an idle state a CPU core should enter into and what frequency it should run at.
I don't think the problem lies in the fact that scheduler is not making these decisions about which idle state the CPU should enter or which frequency the CPU should run at.
IIUC, I think the problem lies in the part where although the *cpuidle and cpufrequency governors are co-operating with the scheduler, the scheduler is not doing the same.*
I think you are missing Ingo's point. It's not about the scheduler complying with decisions made by various governors in the kernel (which may or may not have enough information) but rather the scheduler being in a better position for making such decisions.
Take the cpuidle example, it uses the load average of the CPUs, however this load average is currently controlled by the scheduler (load balance). Rather than using a load average that degrades over time and gradually putting the CPU into deeper sleep states, the scheduler could predict more accurately that a run-queue won't have any work over the next x ms and ask for a deeper sleep state from the beginning.
Of course, you could export more scheduler information to cpuidle, various hooks (task wakeup etc.) but then we have another framework, cpufreq. It also decides the CPU parameters (frequency) based on the load controlled by the scheduler. Can cpufreq decide whether it's better to keep the CPU at higher frequency so that it gets to idle quicker and therefore deeper sleep states? I don't think it has enough information because there are at least three deciding factors (cpufreq, cpuidle and scheduler's load balancing) which are not unified.
Some tasks could be known to the scheduler to require significant CPU cycles when woken up. The scheduler can make the decision to either boost the frequency of the non-idle CPU and place the task there or simply wake up the idle CPU. There are all sorts of power implications here, like whether it's better to keep two CPUs at half speed or one at full speed and the other idle. Such parameters could be provided by per-platform hooks.
I would repeat here that today we interface cpuidle/cpufrequency policies with scheduler but not the other way around. They do their bit when a cpu is busy/idle. However scheduler does not see that somebody else is taking instructions from it and comes back to give different instructions!
The key here is that cpuidle/cpufreq make their primary decision based on something controlled by the scheduler: the CPU load (via run-queue balancing). You would then like the scheduler to take such decisions back into account. It just looks like a closed loop, possibly 'unstable'.
So I think we either (a) come up with 'clearer' separation of responsibilities between scheduler and cpufreq/cpuidle or (b) come up with a unified load-balancing/cpufreq/cpuidle implementation as per Ingo's request. The latter is harder but, with a good design, has potentially a lot more benefits.
A possible implementation for (a) is to let the scheduler focus on performance load-balancing but control the balance ratio from a cpufreq governor (via things like arch_scale_freq_power() or something new). CPUfreq would not be concerned just with individual CPU load/frequency but also making a decision on how tasks are balanced between CPUs based on the overall load (e.g. four CPUs are enough for the current load, I can shut the other four off by telling the scheduler not to use them).
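A rough sketch of how option (a) could be wired up using the existing arch_scale_freq_power() weak hook; the per-cpu quota variable and the governor-side function are invented for illustration, while the hook itself and SCHED_POWER_SCALE exist in 3.9.

        /*
         * A governor publishes, per cpu, how much capacity the scheduler
         * should see; reporting a tiny value for a cpu makes the load
         * balancer drain it without the scheduler knowing about power.
         */
        static DEFINE_PER_CPU(unsigned long, cpu_power_quota) = SCHED_POWER_SCALE;

        unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
        {
                return per_cpu(cpu_power_quota, cpu);
        }

        /* governor side: "four cpus are enough for the current load" */
        static void governor_set_active_cpus(const struct cpumask *active)
        {
                int cpu;

                for_each_possible_cpu(cpu)
                        per_cpu(cpu_power_quota, cpu) =
                                cpumask_test_cpu(cpu, active) ? SCHED_POWER_SCALE : 1;
        }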
As for Ingo's preferred solution (b), a proposal forward could be to factor the load balancing out of kernel/sched/fair.c and provide an abstract interface (like load_class?) for easier extending or different policies (e.g. small task packing). You may for example implement a power saving load policy where idle_balance() does not pull tasks from other CPUs but rather invokes cpuidle with a prediction about how long it's going to be idle for. A load class could also give hints to cpufreq about the actual load needed, using normalised values, and the cpufreq driver could set the best frequency to match such load. Another hook for task wake-up could place it on the appropriate run-queue (either for power or performance). And so on.
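To make the load_class idea slightly more concrete, a hypothetical interface might look like the following; nothing like this exists, the hooks simply mirror the policies described above, and a performance class and a power-saving class would provide different implementations.

        struct load_class {
                const char      *name;

                /* pick a runqueue for a waking task (power: prefer busy cpus) */
                int  (*select_task_rq)(struct task_struct *p, int prev_cpu, int flags);

                /* periodic / idle balancing (power: may refuse to pull work) */
                void (*rebalance)(int cpu, enum cpu_idle_type idle);

                /* hint to cpuidle: predicted idle residency for this cpu, in us */
                u64  (*predicted_idle_us)(int cpu);

                /* hint to cpufreq: normalised capacity this cpu should provide */
                unsigned int (*requested_capacity)(int cpu);
        };

        /* selected by the high level performance/power switch */
        extern const struct load_class *cur_load_class;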
I don't say the above is the right solution, just a proposal. I think an initial prototype for Ingo's approach could make a good topic for the KS.
Best regards.
-- Catalin
Hi Catalin,
On 06/07/2013 08:21 PM, Catalin Marinas wrote:
I think you are missing Ingo's point. It's not about the scheduler complying with decisions made by various governors in the kernel (which may or may not have enough information) but rather the scheduler being in a better position for making such decisions.
My mail pointed out that I disagree with this design ("the scheduler being in a better position for making such decisions"). I think it should be a 2 way co-operation. I have elaborated below.
Take the cpuidle example, it uses the load average of the CPUs, however this load average is currently controlled by the scheduler (load balance). Rather than using a load average that degrades over time and gradually putting the CPU into deeper sleep states, the scheduler could predict more accurately that a run-queue won't have any work over the next x ms and ask for a deeper sleep state from the beginning.
How will the scheduler know that there will not be work in the near future? How will the scheduler ask for a deeper sleep state?
My answer to the above two questions are, the scheduler cannot know how much work will come up. All it knows is the current load of the runqueues and the nature of the task (thanks to the PJT's metric). It can then match the task load to the cpu capacity and schedule the tasks on the appropriate cpus.
As a consequence, it leaves certain cpus idle. The load of these cpus degrades. It is via this load that the scheduler asks for a deeper sleep state. Right here we have the scheduler talking to the cpuidle governor.
I don't see what the problem is with the cpuidle governor waiting for the load to degrade before putting that cpu to sleep. In my opinion, putting a cpu to deeper sleep states should happen gradually. This means time will tell the governors what kinds of workloads are running on the system. If the cpu is idle for long, it probably means that the system is less loaded and it makes sense to put the cpus to deeper sleep states. Of course there could be sporadic bursts or quieting down of tasks, but these are corner cases.
Of course, you could export more scheduler information to cpuidle, various hooks (task wakeup etc.) but then we have another framework, cpufreq. It also decides the CPU parameters (frequency) based on the load controlled by the scheduler. Can cpufreq decide whether it's better to keep the CPU at higher frequency so that it gets to idle quicker and therefore deeper sleep states? I don't think it has enough information because there are at least three deciding factors (cpufreq, cpuidle and scheduler's load balancing) which are not unified.
Why not? When the cpu load is high, cpu frequency governor knows it has to boost the frequency of that CPU. The task gets over quickly, the CPU goes idle. Then the cpuidle governor kicks in to put the CPU to deeper sleep state gradually.
Meanwhile the scheduler should ensure that the tasks are retained on that CPU whose frequency is boosted, and should not load balance it, so that they can get over quickly. This I think is what is missing. Again this comes down to the scheduler taking feedback from the CPU frequency governors, which is not currently happening.
Some tasks could be known to the scheduler to require significant CPU cycles when waken up. The scheduler can make the decision to either boost the frequency of the non-idle CPU and place the task there or simply wake up the idle CPU. There are all sorts of power implications here like whether it's better to keep two CPUs at half speed or one at full speed and the other idle. Such parameters could be provided by per-platform hooks.
This is what the cpuidle and cpufreq drivers are for. They are meant to collect such parameters. It is just that the scheduler should be made aware of them.
I would repeat here that today we interface cpuidle/cpufrequency policies with scheduler but not the other way around. They do their bit when a cpu is busy/idle. However scheduler does not see that somebody else is taking instructions from it and comes back to give different instructions!
The key here is that cpuidle/cpufreq make their primary decision based on something controlled by the scheduler: the CPU load (via run-queue balancing). You would then like the scheduler take such decision back into account. It just looks like a closed loop, possibly 'unstable' .
Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a closed loop? Here too the scheduler should be made well aware of the decisions it took in the past right?
So I think we either (a) come up with 'clearer' separation of responsibilities between scheduler and cpufreq/cpuidle
I agree with this. This is what I have been emphasizing: if we feel that the cpufreq/cpuidle subsystems are suboptimal in terms of the information that they use to make their decisions, let us improve them. But this will not yield us any improvement if the scheduler does not have enough information. And IMHO, the next fundamental information that the scheduler needs should come from cpufreq and cpuidle.
Then we should move on to supplying the scheduler with information from the power domain topology, thermal factors and user policies. This does not need a re-write of the scheduler; it needs a good interface between the scheduler and the rest of the ecosystem. This ecosystem includes the cpuidle and cpufreq subsystems, and they are already in place. Let's use them.
or (b) come up with a unified load-balancing/cpufreq/cpuidle implementation as per Ingo's request. The latter is harder but, with a good design, has potentially a lot more benefits.
A possible implementation for (a) is to let the scheduler focus on performance load-balancing but control the balance ratio from a cpufreq governor (via things like arch_scale_freq_power() or something new). CPUfreq would not be concerned just with individual CPU load/frequency but also making a decision on how tasks are balanced between CPUs based on the overall load (e.g. four CPUs are enough for the current load, I can shut the other four off by telling the scheduler not to use them).
As for Ingo's preferred solution (b), a proposal forward could be to factor the load balancing out of kernel/sched/fair.c and provide an abstract interface (like load_class?) for easier extending or different policies (e.g. small task packing).
Let me elaborate on the patches that have been posted so far on the power awareness of the scheduler. When we say *power aware scheduler* what exactly do we want it to do?
In my opinion, we want it to *avoid touching idle cpus*, so as to keep them in that state longer and *keep more power domains idle*, so as to yield power savings with them turned off. The patches released so far are striving to do the latter. Correct me if I am wrong at this. Also feel free to point out any other expectation from the power aware scheduler if I am missing any.
If I have got Ingo's point right, the issues with them are that they are not taking a holistic approach to meet the said goal. Keeping more power domains idle (by packing tasks) would sound much better if the scheduler has taken all aspects of doing such a thing into account, like
1. How idle are the cpus on the domain that it is packing onto?
2. Can they go to turbo mode? Because if they do, then we can't pack tasks; we would need certain cpus in that domain idle.
3. Are the domains in which we pack tasks power gated?
4. Will there be a significant performance drop by packing? Meaning, do the tasks share cpu resources? If they do there will be severe contention.
The approach I suggest therefore would be to get the scheduler well in sync with this ecosystem; then the patches posted so far will achieve their goals more easily and with very few regressions, because they are well-informed decisions.
Regards Preeti U Murthy
Best regards.
-- Catalin
On Fri, 7 Jun 2013, Preeti U Murthy wrote:
Hi Catalin,
On 06/07/2013 08:21 PM, Catalin Marinas wrote:
<SNIP>
Take the cpuidle example, it uses the load average of the CPUs, however this load average is currently controlled by the scheduler (load balance). Rather than using a load average that degrades over time and gradually putting the CPU into deeper sleep states, the scheduler could predict more accurately that a run-queue won't have any work over the next x ms and ask for a deeper sleep state from the beginning.
How will the scheduler know that there will not be work in the near future? How will the scheduler ask for a deeper sleep state?
My answer to the above two questions are, the scheduler cannot know how much work will come up. All it knows is the current load of the runqueues and the nature of the task (thanks to the PJT's metric). It can then match the task load to the cpu capacity and schedule the tasks on the appropriate cpus.
how will the cpuidle governor know what will come up in the future?
the scheduler knows more than the current load on the runqueues, it tracks some information about the past behavior of the process that it uses for its decisions. This is information that cpuidle doesn't have.
<SNIP>
I don't see what the problem is with the cpuidle governor waiting for the load to degrade before putting that cpu to sleep. In my opinion, putting a cpu to deeper sleep states should happen gradually.
remember that it takes power and time to wake up a cpu to put it in a deeper sleep state.
Of course, you could export more scheduler information to cpuidle, various hooks (task wakeup etc.) but then we have another framework, cpufreq. It also decides the CPU parameters (frequency) based on the load controlled by the scheduler. Can cpufreq decide whether it's better to keep the CPU at higher frequency so that it gets to idle quicker and therefore deeper sleep states? I don't think it has enough information because there are at least three deciding factors (cpufreq, cpuidle and scheduler's load balancing) which are not unified.
Why not? When the cpu load is high, cpu frequency governor knows it has to boost the frequency of that CPU. The task gets over quickly, the CPU goes idle. Then the cpuidle governor kicks in to put the CPU to deeper sleep state gradually.
Meanwhile the scheduler should ensure that the tasks are retained on that CPU,whose frequency is boosted and should not load balance it, so that they can get over quickly. This I think is what is missing. Again this comes down to the scheduler taking feedback from the CPU frequency governors which is not currently happening.
how should the scheduler know that the cpufreq governor decided to boost the speed of one CPU to handle an important process as opposed to handling multiple smaller processes?
the communication between the two is starting to sound really messy
David Lang
Hi David,
On 06/07/2013 11:06 PM, David Lang wrote:
On Fri, 7 Jun 2013, Preeti U Murthy wrote:
Hi Catalin,
On 06/07/2013 08:21 PM, Catalin Marinas wrote:
<SNIP>
Take the cpuidle example, it uses the load average of the CPUs, however this load average is currently controlled by the scheduler (load balance). Rather than using a load average that degrades over time and gradually putting the CPU into deeper sleep states, the scheduler could predict more accurately that a run-queue won't have any work over the next x ms and ask for a deeper sleep state from the beginning.
How will the scheduler know that there will not be work in the near future? How will the scheduler ask for a deeper sleep state?
My answer to the above two questions are, the scheduler cannot know how much work will come up. All it knows is the current load of the runqueues and the nature of the task (thanks to the PJT's metric). It can then match the task load to the cpu capacity and schedule the tasks on the appropriate cpus.
how will the cpuidle governor know what will come up in the future?
the scheduler knows more than the current load on the runqueues, it tracks some information about the past behavior of the process that it uses for its decisions. This is information that cpuidle doesn't have.
This is incorrect. The scheduler knows the possible future load on a cpu due to past behavior, that's right, and so does cpuidle today. It queries the load average for predicted idle time and compares this with the exit latencies of the idle states.
<SNIP>
I don't see what the problem is with the cpuidle governor waiting for the load to degrade before putting that cpu to sleep. In my opinion, putting a cpu to deeper sleep states should happen gradually.
remember that it takes power and time to wake up a cpu to put it in a deeper sleep state.
Correct. I apologise for saying that it does it gradually; that is not entirely right. The cpuidle governor can decide on the state the cpu is best put into directly, without going through the shallow idle states. It also takes care to rectify any incorrect prediction, so there is no suboptimal exit-enter-exit-enter behavior.
Of course, you could export more scheduler information to cpuidle, various hooks (task wakeup etc.) but then we have another framework, cpufreq. It also decides the CPU parameters (frequency) based on the load controlled by the scheduler. Can cpufreq decide whether it's better to keep the CPU at higher frequency so that it gets to idle quicker and therefore deeper sleep states? I don't think it has enough information because there are at least three deciding factors (cpufreq, cpuidle and scheduler's load balancing) which are not unified.
Why not? When the cpu load is high, cpu frequency governor knows it has to boost the frequency of that CPU. The task gets over quickly, the CPU goes idle. Then the cpuidle governor kicks in to put the CPU to deeper sleep state gradually.
Meanwhile the scheduler should ensure that the tasks are retained on that CPU,whose frequency is boosted and should not load balance it, so that they can get over quickly. This I think is what is missing. Again this comes down to the scheduler taking feedback from the CPU frequency governors which is not currently happening.
how should the scheduler know that the cpufreq governor decided to boost the speed of one CPU to handle an important process as opposed to handling multiple smaller processes?
This has been elaborated in my response to Rafael's mail. The scheduler decides to call the cpu frequency governor when it sees fit, and the governor then boosts the frequency of that cpu. cpu_power will now match the task load, so the scheduler will not move the task away from that cpu since the load does not exceed the cpu capacity. That is how the scheduler knows.
the communication between the two is starting to sound really messy
Not really. More is elaborated in responses to Catalin and Rafael's mails.
Regards Preeti U Murthy
David Lang
On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
On 06/07/2013 08:21 PM, Catalin Marinas wrote:
I think you are missing Ingo's point. It's not about the scheduler complying with decisions made by various governors in the kernel (which may or may not have enough information) but rather the scheduler being in a better position for making such decisions.
My mail pointed out that I disagree with this design ("the scheduler being in a better position for making such decisions"). I think it should be a 2 way co-operation. I have elaborated below.
Take the cpuidle example, it uses the load average of the CPUs, however this load average is currently controlled by the scheduler (load balance). Rather than using a load average that degrades over time and gradually putting the CPU into deeper sleep states, the scheduler could predict more accurately that a run-queue won't have any work over the next x ms and ask for a deeper sleep state from the beginning.
How will the scheduler know that there will not be work in the near future? How will the scheduler ask for a deeper sleep state?
My answer to the above two questions are, the scheduler cannot know how much work will come up. All it knows is the current load of the runqueues and the nature of the task (thanks to the PJT's metric). It can then match the task load to the cpu capacity and schedule the tasks on the appropriate cpus.
The scheduler can decide to load a single CPU or cluster and let the others idle. If the total CPU load can fit into a smaller number of CPUs it could as well tell cpuidle to go into deeper state from the beginning as it moved all the tasks elsewhere.
Regarding future work, neither cpuidle nor the scheduler know this but the scheduler would make a better prediction, for example by tracking task periodicity.
As a consequence, it leaves certain cpus idle. The load of these cpus degrade. It is via this load that the scheduler asks for a deeper sleep state. Right here we have scheduler talking to the cpuidle governor.
So we agree that the scheduler _tells_ the cpuidle governor when to go idle (but not how deep). IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the cpuidle does not get enough information from the scheduler (arguably this could be fixed) and (2) the scheduler does not have any information about the idle states (power gating etc.) to make any informed decision on which/when CPUs should go idle.
As you said, it is a non-optimal one-way communication, but the solution is not a feedback loop from cpuidle into the scheduler. It's like the scheduler managed by chance to get the CPU into a deeper sleep state and now you'd like the scheduler to get feedback from cpuidle and not disturb that CPU anymore. That's the closed loop I disagree with. Could the scheduler not make this informed decision before - it has this total load, let's get this CPU into a deeper sleep state?
I don't see what the problem is with the cpuidle governor waiting for the load to degrade before putting that cpu to sleep. In my opinion, putting a cpu to deeper sleep states should happen gradually. This means time will tell the governors what kinds of workloads are running on the system. If the cpu is idle for long, it probably means that the system is less loaded and it makes sense to put the cpus to deeper sleep states. Of course there could be sporadic bursts or quieting down of tasks, but these are corner cases.
There's nothing wrong with degrading given the information that cpuidle currently has. It's a heuristic that has worked ok so far and may continue to do so. But see my comments above on why the scheduler could make more informed decisions.
We may not move all the power gating information to the scheduler but maybe find a way to abstract this by giving more hints via the CPU and cache topology. The cpuidle framework (it may not be much left of a governor) would then take hints about estimated idle time and invoke the low-level driver about the right C state.
Of course, you could export more scheduler information to cpuidle, various hooks (task wakeup etc.) but then we have another framework, cpufreq. It also decides the CPU parameters (frequency) based on the load controlled by the scheduler. Can cpufreq decide whether it's better to keep the CPU at higher frequency so that it gets to idle quicker and therefore deeper sleep states? I don't think it has enough information because there are at least three deciding factors (cpufreq, cpuidle and scheduler's load balancing) which are not unified.
Why not? When the cpu load is high, cpu frequency governor knows it has to boost the frequency of that CPU. The task gets over quickly, the CPU goes idle. Then the cpuidle governor kicks in to put the CPU to deeper sleep state gradually.
The cpufreq governor boosts the frequency enough to cover the load, which means reducing the idle time. It does not know whether it is better to boost the frequency twice as high so that it gets to idle quicker. You can change the governor's policy but does it have any information from cpuidle?
Meanwhile the scheduler should ensure that the tasks are retained on that CPU,whose frequency is boosted and should not load balance it, so that they can get over quickly. This I think is what is missing. Again this comes down to the scheduler taking feedback from the CPU frequency governors which is not currently happening.
Same loop again. The cpu load goes high because (a) there is more work, possibly triggered by external events, and (b) the scheduler decided to balance the CPUs in a certain way. As for cpuidle above, the scheduler has direct influence on the cpufreq decisions. How would the scheduler know which CPU not to balance against? Are CPUs in a cluster synchronous? Is it better to let the other CPU idle or more efficient to run this cluster at half-speed?
Let's say there is an increase in the load, does the scheduler wait until cpufreq figures this out or tries to take the other CPUs out of idle? Who's making this decision? That's currently a potentially unstable loop.
I would repeat here that today we interface cpuidle/cpufrequency policies with scheduler but not the other way around. They do their bit when a cpu is busy/idle. However scheduler does not see that somebody else is taking instructions from it and comes back to give different instructions!
The key here is that cpuidle/cpufreq make their primary decision based on something controlled by the scheduler: the CPU load (via run-queue balancing). You would then like the scheduler take such decision back into account. It just looks like a closed loop, possibly 'unstable' .
Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a closed loop? Here too the scheduler should be made well aware of the decisions it took in the past right?
It's more like:
scheduler -> cpuidle/cpufreq -> hardware operating point
    ^                                   |
    +-----------------------------------+
You can argue that you can make an adaptive loop that works fine but there are so many parameters that I don't see how it would work. The patches so far don't seem to address this. Small task packing, while useful, is just a heuristic at the scheduler level.
With a combined decision maker, you aim to reduce this separate decision process and feedback loop. Probably impossible to eliminate the loop completely because of hardware latencies, PLLs, CPU frequency not always the main factor, but you can make the loop more tolerant to instabilities.
So I think we either (a) come up with 'clearer' separation of responsibilities between scheduler and cpufreq/cpuidle
I agree with this. This is what I have been emphasizing, if we feel that the cpufrequency/ cpuidle subsystems are suboptimal in terms of the information that they use to make their decisions, let us improve them. But this will not yield us any improvement if the scheduler does not have enough information. And IMHO, the next fundamental information that the scheduler needs should come from cpufreq and cpuidle.
What kind of information? Your suggestion that the scheduler should avoid loading a CPU because it went idle is wrong IMHO. It went idle because the scheduler decided this in the first instance.
Then we should move onto supplying scheduler information from the power domain topology, thermal factors, user policies.
I agree with this but at this point you get the scheduler to make more informed decisions about task placement. It can then give more precise hints to cpufreq/cpuidle like the predicted load and those frameworks could become dumber in time, just complying with the requested performance level (trying to break the loop above).
or (b) come up with a unified load-balancing/cpufreq/cpuidle implementation as per Ingo's request. The latter is harder but, with a good design, has potentially a lot more benefits.
A possible implementation for (a) is to let the scheduler focus on performance load-balancing but control the balance ratio from a cpufreq governor (via things like arch_scale_freq_power() or something new). CPUfreq would not be concerned just with individual CPU load/frequency but also making a decision on how tasks are balanced between CPUs based on the overall load (e.g. four CPUs are enough for the current load, I can shut the other four off by telling the scheduler not to use them).
As for Ingo's preferred solution (b), a proposal forward could be to factor the load balancing out of kernel/sched/fair.c and provide an abstract interface (like load_class?) for easier extending or different policies (e.g. small task packing).
Let me elaborate on the patches that have been posted so far on the power awareness of the scheduler. When we say *power aware scheduler* what exactly do we want it to do?
In my opinion, we want it to *avoid touching idle cpus*, so as to keep them in that state longer and *keep more power domains idle*, so as to yield power savings with them turned off. The patches released so far are striving to do the latter. Correct me if I am wrong at this.
Don't get me wrong, task packing to keep more power domains idle is probably in the right direction but it may not address all issues. You realised this is not enough since you are now asking for the scheduler to take feedback from cpuidle. As I pointed out above, you try to create a loop which may or may not work, especially given the wide variety of hardware parameters.
Also feel free to point out any other expectation from the power aware scheduler if I am missing any.
If the patches so far are enough and solved all the problems, you are not missing any. Otherwise, please see my view above.
Please define clearly what the scheduler, cpufreq, cpuidle should be doing and what communication should happen between them.
If I have got Ingo's point right, the issues with them are that they are not taking a holistic approach to meet the said goal.
Probably because scheduler changes, cpufreq and cpuidle are all trying to address the same thing but independent of each other and possibly conflicting.
Keeping more power domains idle (by packing tasks) would sound much better if the scheduler has taken all aspects of doing such a thing into account, like
1. How idle are the cpus on the domain that it is packing onto?
2. Can they go to turbo mode? Because if they do, then we can't pack tasks; we would need certain cpus in that domain idle.
3. Are the domains in which we pack tasks power gated?
4. Will there be a significant performance drop by packing? Meaning, do the tasks share cpu resources? If they do there will be severe contention.
So by this you add a lot more information about the power configuration into the scheduler, getting it to make more informed decisions about task scheduling. You may eventually reach a point where cpuidle governor doesn't have much to do (which may be a good thing) and reach Ingo's goal.
That's why I suggested maybe starting to take the load balancing out of fair.c and make it easily extensible (my opinion, the scheduler guys may disagree). Then make it more aware of topology, power configuration so that it makes the right task placement decision. You then get it to tell cpufreq about the expected performance requirements (frequency decided by cpufreq) and cpuidle about how long it could be idle for (you detect a periodic task every 1ms, or you don't have any at all because they were migrated, the right C state being decided by the governor).
Regards.
On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
On 06/07/2013 08:21 PM, Catalin Marinas wrote:
I think you are missing Ingo's point. It's not about the scheduler complying with decisions made by various governors in the kernel (which may or may not have enough information) but rather the scheduler being in a better position for making such decisions.
My mail pointed out that I disagree with this design ("the scheduler being in a better position for making such decisions"). I think it should be a 2 way co-operation. I have elaborated below.
I agree with that.
Take the cpuidle example, it uses the load average of the CPUs, however this load average is currently controlled by the scheduler (load balance). Rather than using a load average that degrades over time and gradually putting the CPU into deeper sleep states, the scheduler could predict more accurately that a run-queue won't have any work over the next x ms and ask for a deeper sleep state from the beginning.
How will the scheduler know that there will not be work in the near future? How will the scheduler ask for a deeper sleep state?
My answer to the above two questions are, the scheduler cannot know how much work will come up. All it knows is the current load of the runqueues and the nature of the task (thanks to the PJT's metric). It can then match the task load to the cpu capacity and schedule the tasks on the appropriate cpus.
The scheduler can decide to load a single CPU or cluster and let the others idle. If the total CPU load can fit into a smaller number of CPUs it could as well tell cpuidle to go into deeper state from the beginning as it moved all the tasks elsewhere.
So why can't it do that today? What's the problem?
Regarding future work, neither cpuidle nor the scheduler know this but the scheduler would make a better prediction, for example by tracking task periodicity.
Well, basically, two pieces of information are needed to make target idle state selections: (1) when the CPU (core or package) is going to be used next time and (2) how much latency for going back to the non-idle state can be tolerated. While the scheduler knows (1) to some extent (arguably, it generally cannot predict when hardware interrupts are going to occur), I'm not really sure about (2).
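Just to illustrate why both pieces of information matter, a toy model of how they would drive the selection; the state table and all numbers are made up, and real drivers express this through exit latency and target residency:

#include <stdio.h>

/* Made-up idle state table; real values come from the cpuidle driver. */
struct idle_state {
	const char *name;
	unsigned int exit_latency_us;     /* cost of waking back up */
	unsigned int target_residency_us; /* minimum idle time to be worth it */
};

static const struct idle_state states[] = {
	{ "C1",   2,    5 },
	{ "C2",  50,  150 },
	{ "C3", 500, 2000 },
};

/*
 * Pick the deepest state whose wakeup cost we can tolerate (input 2)
 * and whose residency fits the time until the CPU is needed again (input 1).
 */
static int pick_state(unsigned int next_use_us, unsigned int latency_tol_us)
{
	int i, best = 0;

	for (i = 0; i < 3; i++)
		if (states[i].exit_latency_us <= latency_tol_us &&
		    states[i].target_residency_us <= next_use_us)
			best = i;
	return best;
}

int main(void)
{
	/* "this CPU is not needed for 3 ms, up to 100 us wakeup latency is fine" */
	printf("chosen state: %s\n", states[pick_state(3000, 100)].name);
	return 0;
}

Without (2), the selection above cannot distinguish "3 ms idle and latency does not matter" from "3 ms idle but an interrupt handler must run almost immediately".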
As a consequence, it leaves certain cpus idle. The load of these cpus degrade. It is via this load that the scheduler asks for a deeper sleep state. Right here we have scheduler talking to the cpuidle governor.
So we agree that the scheduler _tells_ the cpuidle governor when to go idle (but not how deep).
It does indicate to cpuidle how deep it can go, however, by providing it with the information about when the CPU is going to be used next time (from the scheduler's perspective).
IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the cpuidle does not get enough information from the scheduler (arguably this could be fixed)
OK, so what information is missing in your opinion?
and (2) the scheduler does not have any information about the idle states (power gating etc.) to make any informed decision on which/when CPUs should go idle.
That's correct, which is a drawback. However, on some systems it may never have that information (because hardware coordinates idle states in a way that is opaque to the OS - e.g. by autopromoting deeper states when idle for a sufficiently long time) and on some systems that information may change over time (i.e. the availability of specific idle states may depend on factors that aren't constant).
If you attempted to take all of the possible complications related to hardware designs in that area into the scheduler, you'd end up with a completely unmaintainable piece of code.
As you said, it is a non-optimal one-way communication but the solution is not feedback loop from cpuidle into scheduler. It's like the scheduler managed by chance to get the CPU into a deeper sleep state and now you'd like the scheduler to get feedback form cpuidle and not disturb that CPU anymore. That's the closed loop I disagree with. Could the scheduler not make this informed decision before - it has this total load, let's get this CPU into deeper sleep state?
No, it couldn't in general, for the above reasons.
I don't see what the problem is with the cpuidle governor waiting for the load to degrade before putting that cpu to sleep. In my opinion, putting a cpu to deeper sleep states should happen gradually.
If we know in advance that the CPU can be put into idle state Cn, there is no reason to put it into anything shallower than that.
On the other hand, if the CPU is in Cn already and there is a possibility to put it into a deeper low-power state (which we didn't know about before), it may make sense to promote it into that state (if that's safe) or even wake it up and idle it again.
This means time will tell the governors what kinds of workloads are running on the system. If the cpu is idle for long, it probably means that the system is less loaded and it makes sense to put the cpus to deeper sleep states. Of course there could be sporadic bursts or quieting down of tasks, but these are corner cases.
It's nothing wrong with degrading given the information that cpuidle currently has. It's a heuristics that worked ok so far and may continue to do so. But see my comments above on why the scheduler could make more informed decisions.
We may not move all the power gating information to the scheduler but maybe find a way to abstract this by giving more hints via the CPU and cache topology. The cpuidle framework (it may not be much left of a governor) would then take hints about estimated idle time and invoke the low-level driver about the right C state.
Overall, it looks like it'd be better to split the governor "layer" between the scheduler and the idle driver with a well defined interface between them. That interface needs to be general enough to be independent of the underlying hardware.
We need to determine what kinds of information should be passed both ways and how to represent it.
Of course, you could export more scheduler information to cpuidle, various hooks (task wakeup etc.) but then we have another framework, cpufreq. It also decides the CPU parameters (frequency) based on the load controlled by the scheduler. Can cpufreq decide whether it's better to keep the CPU at higher frequency so that it gets to idle quicker and therefore deeper sleep states? I don't think it has enough information because there are at least three deciding factors (cpufreq, cpuidle and scheduler's load balancing) which are not unified.
Why not? When the cpu load is high, cpu frequency governor knows it has to boost the frequency of that CPU. The task gets over quickly, the CPU goes idle. Then the cpuidle governor kicks in to put the CPU to deeper sleep state gradually.
The cpufreq governor boosts the frequency enough to cover the load, which means reducing the idle time. It does not know whether it is better to boost the frequency twice as high so that it gets to idle quicker. You can change the governor's policy but does it have any information from cpuidle?
Well, it may get that information directly from the hardware. Actually, intel_pstate does that, but intel_pstate is the governor and the scaling driver combined.
Meanwhile the scheduler should ensure that the tasks are retained on that CPU,whose frequency is boosted and should not load balance it, so that they can get over quickly. This I think is what is missing. Again this comes down to the scheduler taking feedback from the CPU frequency governors which is not currently happening.
Same loop again. The cpu load goes high because (a) there is more work, possibly triggered by external events, and (b) the scheduler decided to balance the CPUs in a certain way. As for cpuidle above, the scheduler has direct influence on the cpufreq decisions. How would the scheduler know which CPU not to balance against? Are CPUs in a cluster synchronous? Is it better do let other CPU idle or more efficient to run this cluster at half-speed?
Let's say there is an increase in the load, does the scheduler wait until cpufreq figures this out or tries to take the other CPUs out of idle? Who's making this decision? That's currently a potentially unstable loop.
Yes, it is and I don't think we currently have good answers here.
The results of many measurements seem to indicate that it generally is better to do the work as quickly as possible and then go idle again, but there are costs associated with going back and forth from idle to non-idle etc.
The main problem with cpufreq that I personally have is that the governors carry out their own sampling with pretty much arbitrary resolution that may lead to suboptimal decisions. It would be much better if the scheduler indicated when to *consider* the changing of CPU performance parameters (that may not be frequency alone and not even frequency at all in general), more or less the same way it tells cpuidle about idle CPUs, but I'm not sure if it should decide what performance points to run at.
I would repeat here that today we interface cpuidle/cpufrequency policies with scheduler but not the other way around. They do their bit when a cpu is busy/idle. However scheduler does not see that somebody else is taking instructions from it and comes back to give different instructions!
The key here is that cpuidle/cpufreq make their primary decision based on something controlled by the scheduler: the CPU load (via run-queue balancing). You would then like the scheduler take such decision back into account. It just looks like a closed loop, possibly 'unstable' .
Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a closed loop? Here too the scheduler should be made well aware of the decisions it took in the past right?
It's more like:
scheduler -> cpuidle/cpufreq -> hardware operating point
    ^                                   |
    +-----------------------------------+
You can argue that you can make an adaptive loop that works fine but there are so many parameters that I don't see how it would work. The patches so far don't seem to address this. Small task packing, while useful, it's some heuristics just at the scheduler level.
I agree.
With a combined decision maker, you aim to reduce this separate decision process and feedback loop. Probably impossible to eliminate the loop completely because of hardware latencies, PLLs, CPU frequency not always the main factor, but you can make the loop more tolerant to instabilities.
Well, in theory. :-)
Another question to ask is whether or not the structure of our software reflects the underlying problem. I mean, on the one hand there is the scheduler that needs to optimally assign work items to computational units (hyperthreads, CPU cores, packages) and on the other hand there's hardware with different capabilities (idle states, performance points etc.). Arguably, the scheduler internals cannot cover all of the differences between all of the existing types of hardware Linux can run on, so there needs to be a layer of code providing an interface between the scheduler and the hardware. But that layer of code needs to be just *one*, so why do we have *two* different frameworks (cpuidle and cpufreq) that talk to the same hardware and kind of to the scheduler, but not to each other?
To me, the reason is history, and more precisely the fact that cpufreq had been there first, then came cpuidle and only then people started to realize that some scheduler tweaks may allow us to save energy without sacrificing too much performance. However, it looks like there's time to go back and see how we can integrate all that. And there's more, because we may need to take power budgets and thermal management into account as well (i.e. we may not be allowed to use full performance of the processors all the time because of some additional limitations) and the CPUs may be members of power domains, so what we can do with them may depend on the states of other devices.
So I think we either (a) come up with 'clearer' separation of responsibilities between scheduler and cpufreq/cpuidle
I agree with this. This is what I have been emphasizing, if we feel that the cpufrequency/ cpuidle subsystems are suboptimal in terms of the information that they use to make their decisions, let us improve them. But this will not yield us any improvement if the scheduler does not have enough information. And IMHO, the next fundamental information that the scheduler needs should come from cpufreq and cpuidle.
What kind of information? Your suggestion that the scheduler should avoid loading a CPU because it went idle is wrong IMHO. It went idle because the scheduler decided this in first instance.
Then we should move onto supplying scheduler information from the power domain topology, thermal factors, user policies.
I agree with this but at this point you get the scheduler to make more informed decisions about task placement. It can then give more precise hints to cpufreq/cpuidle like the predicted load and those frameworks could become dumber in time, just complying with the requested performance level (trying to break the loop above).
Well, there's nothing like "predicted load". At best, we may be able to make more or less educated guesses about it, so in my opinion it is better to use the information about what happened in the past for making decisions regarding the current settings and re-adjust them over time as we get more information.
So how much decision making regarding the idle state to put the given CPU into should be there in the scheduler? I believe the only information coming out of the scheduler regarding that should be "OK, this CPU is now idle and I'll need it in X nanoseconds from now" plus possibly a hint about the wakeup latency tolerance (but those hints may come from other places too). That said the decision *which* CPU should become idle at the moment very well may require some information about what options are available from the layer below (for example, "putting core X into idle for Y of time will save us Z energy" or something like that).
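Put as an interface, that would be little more than a single notification from the scheduler. A sketch is below; the structure and function names are hypothetical, not anything that exists today, and only show the shape of the information flow:

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical hint passed down when a CPU goes idle. Neither the struct
 * nor the function exists in the kernel; they only illustrate what the
 * scheduler could tell the layer below.
 */
struct idle_hint {
	int cpu;
	uint64_t next_use_ns;    /* "I'll need this CPU in X nanoseconds" */
	uint64_t latency_tol_ns; /* acceptable wakeup latency, if known */
};

/* The layer below would map the hint to an idle state; here we just log it. */
static void cpuidle_enter_hinted(const struct idle_hint *hint)
{
	printf("cpu%d idle: next use in %llu ns, latency tolerance %llu ns\n",
	       hint->cpu,
	       (unsigned long long)hint->next_use_ns,
	       (unsigned long long)hint->latency_tol_ns);
}

int main(void)
{
	struct idle_hint hint = {
		.cpu = 2,
		.next_use_ns = 4000000,   /* 4 ms */
		.latency_tol_ns = 100000, /* 100 us */
	};

	cpuidle_enter_hinted(&hint);
	return 0;
}

Everything about the actual C-state choice, autopromotion and so on would stay below that boundary.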
And what about performance scaling? Quite frankly, in my opinion that requires some more investigation, because there still are some open questions in that area. To start with we can just continue using the current heuristics, but perhaps with the scheduler calling the scaling "governor" when it sees fit instead of that "governor" running kind of in parallel with it.
or (b) come up with a unified load-balancing/cpufreq/cpuidle implementation as per Ingo's request. The latter is harder but, with a good design, has potentially a lot more benefits.
A possible implementation for (a) is to let the scheduler focus on performance load-balancing but control the balance ratio from a cpufreq governor (via things like arch_scale_freq_power() or something new). CPUfreq would not be concerned just with individual CPU load/frequency but also making a decision on how tasks are balanced between CPUs based on the overall load (e.g. four CPUs are enough for the current load, I can shut the other four off by telling the scheduler not to use them).
As for Ingo's preferred solution (b), a proposal forward could be to factor the load balancing out of kernel/sched/fair.c and provide an abstract interface (like load_class?) for easier extending or different policies (e.g. small task packing).
Let me elaborate on the patches that have been posted so far on the power awareness of the scheduler. When we say *power aware scheduler* what exactly do we want it to do?
In my opinion, we want it to *avoid touching idle cpus*, so as to keep them in that state longer and *keep more power domains idle*, so as to yield power savings with them turned off. The patches released so far are striving to do the latter. Correct me if I am wrong at this.
Don't take me wrong, task packing to keep more power domains idle is probably in the right direction but it may not address all issues. You realised this is not enough since you are now asking for the scheduler to take feedback from cpuidle. As I pointed out above, you try to create a loop which may or may not work, especially given the wide variety of hardware parameters.
Also feel free to point out any other expectation from the power aware scheduler if I am missing any.
If the patches so far are enough and solved all the problems, you are not missing any. Otherwise, please see my view above.
Please define clearly what the scheduler, cpufreq, cpuidle should be doing and what communication should happen between them.
If I have got Ingo's point right, the issues with them are that they are not taking a holistic approach to meet the said goal.
Probably because scheduler changes, cpufreq and cpuidle are all trying to address the same thing but independent of each other and possibly conflicting.
Keeping more power domains idle (by packing tasks) would sound much better if the scheduler has taken all aspects of doing such a thing into account, like
1. How idle are the cpus on the domain that it is packing onto?
2. Can they go to turbo mode? Because if they do, then we can't pack tasks; we would need certain cpus in that domain idle.
3. Are the domains in which we pack tasks power gated?
4. Will there be a significant performance drop by packing? Meaning, do the tasks share cpu resources? If they do there will be severe contention.
So by this you add a lot more information about the power configuration into the scheduler, getting it to make more informed decisions about task scheduling. You may eventually reach a point where cpuidle governor doesn't have much to do (which may be a good thing) and reach Ingo's goal.
That's why I suggested maybe starting to take the load balancing out of fair.c and make it easily extensible (my opinion, the scheduler guys may disagree). Then make it more aware of topology, power configuration so that it makes the right task placement decision. You then get it to tell cpufreq about the expected performance requirements (frequency decided by cpufreq) and cpuidle about how long it could be idle for (you detect a periodic task every 1ms, or you don't have any at all because they were migrated, the right C state being decided by the governor).
There is another angle to look at that as I said somewhere above.
What if we could integrate cpuidle with cpufreq so that there is one code layer representing what the hardware can do to the scheduler? What benefits can we get from that, if any?
Rafael
Hi Rafael,
On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote:
On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
On 06/07/2013 08:21 PM, Catalin Marinas wrote:
I think you are missing Ingo's point. It's not about the scheduler complying with decisions made by various governors in the kernel (which may or may not have enough information) but rather the scheduler being in a better position for making such decisions.
My mail pointed out that I disagree with this design ("the scheduler being in a better position for making such decisions"). I think it should be a 2 way co-operation. I have elaborated below.
I agree with that.
Take the cpuidle example, it uses the load average of the CPUs, however this load average is currently controlled by the scheduler (load balance). Rather than using a load average that degrades over time and gradually putting the CPU into deeper sleep states, the scheduler could predict more accurately that a run-queue won't have any work over the next x ms and ask for a deeper sleep state from the beginning.
How will the scheduler know that there will not be work in the near future? How will the scheduler ask for a deeper sleep state?
My answer to the above two questions are, the scheduler cannot know how much work will come up. All it knows is the current load of the runqueues and the nature of the task (thanks to the PJT's metric). It can then match the task load to the cpu capacity and schedule the tasks on the appropriate cpus.
The scheduler can decide to load a single CPU or cluster and let the others idle. If the total CPU load can fit into a smaller number of CPUs it could as well tell cpuidle to go into deeper state from the beginning as it moved all the tasks elsewhere.
So why can't it do that today? What's the problem?
The reason that the scheduler does not do it today is the prefer_sibling logic. The tasks within a core get distributed across cores if there is more than one, since the cpu power of a core is not high enough to handle more than one task.
However at a socket/MC level (cluster at a low level), there can be as many tasks as there are cores, because the socket has enough CPU capacity to handle them. But the prefer_sibling logic moves tasks across socket/MC level domains even when load <= domain_capacity.
I think the reason the prefer_sibling logic was introduced is that the scheduler looks at spreading tasks across all the resources it has. It believes that keeping tasks within a cluster/socket level domain would mean the tasks are throttled by having access to only the cluster/socket level resources, which is why it spreads.
The prefer_sibling logic is nothing but a flag set at domain level to communicate to the scheduler that load should be spread across the groups of this domain. In the above example across sockets/clusters.
But I think it is time we take another look at the prefer_sibling logic and decide on its worthiness.
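To illustrate the effect described above with a toy model (the numbers and names are invented; in the kernel this behaviour comes from the SD_PREFER_SIBLING flag on the sched domain):

#include <stdio.h>

/* Toy model of the spread-vs-pack decision at one domain level. */
struct sched_group_toy {
	unsigned int nr_running; /* tasks currently running in the group */
	unsigned int capacity;   /* tasks the group can handle at full speed */
};

/*
 * With prefer_sibling we spread to an idle sibling group as soon as there
 * is more than one task, even though the busy group could absorb them all.
 * Without it we only spread when the group is genuinely over capacity.
 */
static int spread_to_sibling(const struct sched_group_toy *g, int prefer_sibling)
{
	if (prefer_sibling)
		return g->nr_running > 1;
	return g->nr_running > g->capacity;
}

int main(void)
{
	/* two tasks on a socket that could comfortably run four */
	struct sched_group_toy socket0 = { .nr_running = 2, .capacity = 4 };

	printf("prefer_sibling=1 -> spread: %d\n", spread_to_sibling(&socket0, 1));
	printf("prefer_sibling=0 -> spread: %d\n", spread_to_sibling(&socket0, 0));
	return 0;
}

With the flag, the second socket is woken up even though the first one had spare capacity, which is exactly the behaviour a packing policy would want to avoid.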
Regarding future work, neither cpuidle nor the scheduler know this but the scheduler would make a better prediction, for example by tracking task periodicity.
Well, basically, two pieces of information are needed to make target idle state selections: (1) when the CPU (core or package) is going to be used next time and (2) how much latency for going back to the non-idle state can be tolerated. While the scheduler knows (1) to some extent (arguably, it generally cannot predict when hardware interrupts are going to occur), I'm not really sure about (2).
As a consequence, it leaves certain cpus idle. The load of these cpus degrade. It is via this load that the scheduler asks for a deeper sleep state. Right here we have scheduler talking to the cpuidle governor.
So we agree that the scheduler _tells_ the cpuidle governor when to go idle (but not how deep).
It does indicate to cpuidle how deep it can go, however, by providing it with the information about when the CPU is going to be used next time (from the scheduler's perspective).
IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the cpuidle does not get enough information from the scheduler (arguably this could be fixed)
OK, so what information is missing in your opinion?
and (2) the scheduler does not have any information about the idle states (power gating etc.) to make any informed decision on which/when CPUs should go idle.
That's correct, which is a drawback. However, on some systems it may never have that information (because hardware coordinates idle states in a way that is opaque to the OS - e.g. by autopromoting deeper states when idle for sufficiently long time) and on some systems that information may change over time (i.e. the availablility of specific idle states may depend on factors that aren't constant).
If you attempted to take all of the possible complications related to hardware designs in that area in the scheduler, you'd end up with completely unmaintainable piece of code.
As you said, it is a non-optimal one-way communication but the solution is not feedback loop from cpuidle into scheduler. It's like the scheduler managed by chance to get the CPU into a deeper sleep state and now you'd like the scheduler to get feedback form cpuidle and not disturb that CPU anymore. That's the closed loop I disagree with. Could the scheduler not make this informed decision before - it has this total load, let's get this CPU into deeper sleep state?
No, it couldn't in general, for the above reasons.
I don't see what the problem is with the cpuidle governor waiting for the load to degrade before putting that cpu to sleep. In my opinion, putting a cpu to deeper sleep states should happen gradually.
If we know in advance that the CPU can be put into idle state Cn, there is no reason to put it into anything shallower than that.
On the other hand, if the CPU is in Cn already and there is a possibility to put it into a deeper low-power state (which we didn't know about before), it may make sense to promote it into that state (if that's safe) or even wake it up and idle it again.
Yes, sorry, I said it wrongly in the previous mail. Today the cpuidle governor is capable of putting a CPU into idle state Cn directly, by looking at various factors like the current load, the next timer, the history of interrupts and the exit latency of states; at the end of this evaluation it puts the CPU into idle state Cn.
It also cares to check whether its decision was right. This is with respect to your statement "if there is a possibility to put it into a deeper low power state". Before putting the cpu into an idle state, it queues a timer just after the predicted wake-up time. If the wake-up prediction turns out to be wrong, this timer fires to wake up the cpu, and the cpu is then put into a deeper sleep state.
This means time will tell the governors what kinds of workloads are running on the system. If the cpu is idle for long, it probably means that the system is less loaded and it makes sense to put the cpus to deeper sleep states. Of course there could be sporadic bursts or quieting down of tasks, but these are corner cases.
It's nothing wrong with degrading given the information that cpuidle currently has. It's a heuristics that worked ok so far and may continue to do so. But see my comments above on why the scheduler could make more informed decisions.
We may not move all the power gating information to the scheduler but maybe find a way to abstract this by giving more hints via the CPU and cache topology. The cpuidle framework (it may not be much left of a governor) would then take hints about estimated idle time and invoke the low-level driver about the right C state.
Overall, it looks like it'd be better to split the governor "layer" between the scheduler and the idle driver with a well defined interface between them. That interface needs to be general enough to be independent of the underlying hardware.
We need to determine what kinds of information should be passed both ways and how to represent it.
I agree with this design decision.
Of course, you could export more scheduler information to cpuidle, various hooks (task wakeup etc.) but then we have another framework, cpufreq. It also decides the CPU parameters (frequency) based on the load controlled by the scheduler. Can cpufreq decide whether it's better to keep the CPU at higher frequency so that it gets to idle quicker and therefore deeper sleep states? I don't think it has enough information because there are at least three deciding factors (cpufreq, cpuidle and scheduler's load balancing) which are not unified.
Why not? When the cpu load is high, cpu frequency governor knows it has to boost the frequency of that CPU. The task gets over quickly, the CPU goes idle. Then the cpuidle governor kicks in to put the CPU to deeper sleep state gradually.
The cpufreq governor boosts the frequency enough to cover the load, which means reducing the idle time. It does not know whether it is better to boost the frequency twice as high so that it gets to idle quicker. You can change the governor's policy but does it have any information from cpuidle?
Well, it may get that information directly from the hardware. Actually, intel_pstate does that, but intel_pstate is the governor and the scaling driver combined.
To add to this, cpufreq currently functions in the fashion shown below. I am talking about the ondemand governor, since it is most relevant to our discussion.
----stepped up frequency------
--------threshold-------------
----stepped down freq level1--
----stepped down freq level2--
----stepped down freq level3--
If the cpu idle time is below a threshold, it boosts the frequency one level up straight away and does not vary it any further. If the cpu idle time is above the threshold, the frequency is stepped down by 5% of the current frequency at every sampling period, provided the cpu behavior is constant.
I think we can improve this implementation by better interaction with cpuidle and scheduler.
When it is stepping up the frequency, it should do so in steps that are a *function of the current cpu load* (or, equally, of the idle time).
When it is stepping down the frequency, it should interact with cpuidle. It should get from cpuidle information about the idle state that the cpu is in. The reason is that the cpufreq governor is aware only of the idle time of the cpu, not of the idle state it is in. If it gets to know that the cpu is in a deep idle state, it could step the frequency down to level n straight away, just like cpuidle does to put cpus into state Cn.
Or, as an alternative, just like stepping up, make the stepping down also a function of the idle time, perhaps fn(|threshold - idle_time|), as sketched below.
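A rough model of what such a step rule could look like; all numbers are arbitrary and only show the frequency step becoming proportional to |threshold - idle_time| instead of being fixed:

#include <stdio.h>

#define FREQ_MIN_KHZ    200000
#define FREQ_MAX_KHZ   1000000
#define IDLE_THRESHOLD      20 /* percent idle over the sampling period */

/* Step the frequency by an amount proportional to |threshold - idle_time|. */
static unsigned int next_freq(unsigned int cur_khz, unsigned int idle_pct)
{
	unsigned int dist = idle_pct > IDLE_THRESHOLD ?
			    idle_pct - IDLE_THRESHOLD : IDLE_THRESHOLD - idle_pct;
	unsigned int step = (FREQ_MAX_KHZ - FREQ_MIN_KHZ) * dist / 100;

	if (idle_pct < IDLE_THRESHOLD)      /* busy: step up */
		cur_khz = cur_khz + step > FREQ_MAX_KHZ ?
			  FREQ_MAX_KHZ : cur_khz + step;
	else                                /* idle: step down */
		cur_khz = step > cur_khz - FREQ_MIN_KHZ ?
			  FREQ_MIN_KHZ : cur_khz - step;

	return cur_khz;
}

int main(void)
{
	printf("5%% idle at 600 MHz  -> %u kHz\n", next_freq(600000, 5));
	printf("80%% idle at 600 MHz -> %u kHz\n", next_freq(600000, 80));
	return 0;
}

The nearly-busy cpu gets a large step up, while the mostly idle one drops most of the way down in a single sampling period instead of 5% at a time.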
Also, one more point to note is that if cpuidle puts cpus into idle states that clock-gate them, then there is no need for a cpufreq governor for that cpu. cpufreq can check with cpuidle on this front before it queries a cpu.
Meanwhile the scheduler should ensure that the tasks are retained on that CPU,whose frequency is boosted and should not load balance it, so that they can get over quickly. This I think is what is missing. Again this comes down to the scheduler taking feedback from the CPU frequency governors which is not currently happening.
Same loop again. The cpu load goes high because (a) there is more work, possibly triggered by external events, and (b) the scheduler decided to balance the CPUs in a certain way. As for cpuidle above, the scheduler has direct influence on the cpufreq decisions. How would the scheduler know which CPU not to balance against? Are CPUs in a cluster synchronous? Is it better to let the other CPU idle or more efficient to run this cluster at half-speed?
Let's say there is an increase in the load: does the scheduler wait until cpufreq figures this out, or does it try to take the other CPUs out of idle? Who's making this decision? That's currently a potentially unstable loop.
Yes, it is and I don't think we currently have good answers here.
My answer to the above question is that the scheduler does not wait until cpufreq figures it out. All that the scheduler cares about today is load balancing: spread the load and hope it finishes soon. There is a possibility today that even before the cpufreq governor can boost the frequency of a cpu, the scheduler has already spread the load.
As for the second question, it will wake up idle cpus if it must in order to load balance.
That is a good question: "does the scheduler wait until cpufreq figures it out?" Currently the answer is no, it does not communicate with cpufreq at all (except through cpu power, but that is the good part of the story, so I will not get there now). But maybe we should change this. I think we can do so in the following way.
When can the scheduler talk to cpufreq? It can do so under the following circumstances:
1. Load is too high across the system, all cpus are loaded, no chance of load balancing. Therefore ask the cpufreq governor to step up the frequency to improve performance.
2. The scheduler finds out that if it has to load balance, it has to do so on cpus which are in a deep idle state (currently this logic is not present, but it is worth getting in). It then decides to increase the frequency of the already loaded cpus to improve performance. It calls the cpufreq governor.
3. The scheduler finds out that if it has to load balance, it has to do so on a different power domain which is currently idle (shallow/deep). It thinks better of it and calls the cpufreq governor to boost the frequency of the cpus in the current domain.
While 2 and 3 depend on the scheduler having knowledge about idle states and power domains, which it currently does not have, 1 can be achieved with the current code. The scheduler keeps track of failed load-balancing attempts with lb_failed. If it finds that load balancing from a busy group failed (lb_failed > 0), it can ask the cpufreq governor to step up the cpu frequency of this busy cpu group, via gov_check_cpu() in the cpufreq governor code.
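As a rough illustration of point 1 against the current code, the two helpers below are hypothetical and only sketch where such a call could be made (the governor-side entry point mentioned above, gov_check_cpu(), is where it would end up); this is not a real implementation:

    /* Both helpers are hypothetical; sketch of point 1 only. */
    extern unsigned int lb_failed_count(int cpu);    /* would read the lb_failed statistic */
    extern void cpufreq_reevaluate_cpu(int cpu);     /* would end up in gov_check_cpu() */

    /* Called after a failed attempt to pull load away from a busy group. */
    static void maybe_boost_busy_cpu(int cpu)
    {
            if (lb_failed_count(cpu) > 0)
                    cpufreq_reevaluate_cpu(cpu);
    }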
The results of many measurements seem to indicate that it generally is better to do the work as quickly as possible and then go idle again, but there are costs associated with going back and forth from idle to non-idle etc.
I think we can even out the cost/benefit of race-to-idle by choosing when to do it wisely. For example, if points 2 and 3 above hold (idle cpus are in deep sleep states, or we would need to load balance onto a different power domain), then step up the frequency of the currently working cpus and reap the benefit.
The main problem with cpufreq that I personally have is that the governors carry out their own sampling with pretty much arbitrary resolution that may lead to suboptimal decisions. It would be much better if the scheduler indicated when to *consider* the changing of CPU performance parameters (that may not be frequency alone and not even frequency at all in general), more or less the same way it tells cpuidle about idle CPUs, but I'm not sure if it should decide what performance points to run at.
Very true. See points 1, 2 and 3 above, where I list when the scheduler can call cpufreq. An idea of how the cpufreq governor can decide on the scaling frequency is also stated above.
I would repeat here that today cpuidle/cpufreq policies take their input from the scheduler, but not the other way around. They do their bit when a cpu is busy/idle. However, the scheduler does not see that somebody else is taking instructions from it and then coming back to give it different instructions!
The key here is that cpuidle/cpufreq make their primary decisions based on something controlled by the scheduler: the CPU load (via run-queue balancing). You would then like the scheduler to take such decisions back into account. It just looks like a closed loop, possibly 'unstable'.
Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a closed loop? Here too the scheduler should be made well aware of the decisions it took in the past right?
It's more like:
scheduler -> cpuidle/cpufreq -> hardware operating point
    ^                                         |
    +-----------------------------------------+
You can argue that you can make an adaptive loop that works fine, but there are so many parameters that I don't see how it would work. The patches so far don't seem to address this. Small task packing, while useful, is just a heuristic at the scheduler level.
I agree.
With a combined decision maker, you aim to reduce this separate decision process and feedback loop. Probably impossible to eliminate the loop completely because of hardware latencies, PLLs, CPU frequency not always the main factor, but you can make the loop more tolerant to instabilities.
Well, in theory. :-)
Another question to ask is whether or not the structure of our software reflects the underlying problem. I mean, on the one hand there is the scheduler that needs to optimally assign work items to computational units (hyperthreads, CPU cores, packages) and on the other hand there's hardware with different capabilities (idle states, performance points etc.). Arguably, the scheduler internals cannot cover all of the differences between all of the existing types of hardware Linux can run on, so there needs to be a layer of code providing an interface between the scheduler and the hardware. But that layer of code needs to be just *one*, so why do we have *two* different frameworks (cpuidle and cpufreq) that talk to the same hardware and kind of to the scheduler, but not to each other?
To me, the reason is history, and more precisely the fact that cpufreq had been there first, then came cpuidle, and only then people started to realize that some scheduler tweaks may allow us to save energy without sacrificing too much performance. However, it looks like it's time to go back and see how we can integrate all that. And there's more, because we may need to take power budgets and thermal management into account as well (i.e. we may not be allowed to use the full performance of the processors all the time because of some additional limitations) and the CPUs may be members of power domains, so what we can do with them may depend on the states of other devices.
So I think we either (a) come up with 'clearer' separation of responsibilities between scheduler and cpufreq/cpuidle
I agree with this. This is what I have been emphasizing: if we feel that the cpufreq/cpuidle subsystems are suboptimal in terms of the information they use to make their decisions, let us improve them. But this will not yield us any improvement if the scheduler does not have enough information either. And IMHO, the next piece of fundamental information that the scheduler needs should come from cpufreq and cpuidle.
What kind of information? Your suggestion that the scheduler should avoid loading a CPU because it went idle is wrong IMHO. It went idle because the scheduler decided this in the first instance.
Then we should move on to supplying the scheduler with information from the power domain topology, thermal factors and user policies.
I agree with this but at this point you get the scheduler to make more informed decisions about task placement. It can then give more precise hints to cpufreq/cpuidle like the predicted load and those frameworks could become dumber in time, just complying with the requested performance level (trying to break the loop above).
Well, there's nothing like "predicted load". At best, we may be able to make more or less educated guesses about it, so in my opinion it is better to use the information about what happened in the past for making decisions regarding the current settings and re-adjust them over time as we get more information.
I agree with this as well. The scheduler can at best supply information regarding the historic load and hope that it defines the future as well. Apart from this, I don't know what other information the scheduler can supply the cpuidle governor with.
So how much decision making regarding the idle state to put the given CPU into should be there in the scheduler? I believe the only information coming out of the scheduler regarding that should be "OK, this CPU is now idle and I'll need it in X nanoseconds from now" plus possibly a hint about the wakeup latency tolerance (but those hints may come from other places too). That said the decision *which* CPU should become idle at the moment very well may require some information about what options are available from the layer below (for example, "putting core X into idle for Y of time will save us Z energy" or something like that).
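A minimal sketch of the kind of hint described above, purely as an illustration (these names and structures are hypothetical, not existing kernel interfaces):

    #include <stdint.h>

    /* Hypothetical hint the scheduler could hand over when a CPU goes idle. */
    struct sched_idle_hint {
            uint64_t next_use_ns;       /* "I'll need this CPU in X ns from now" */
            uint64_t latency_limit_ns;  /* tolerated wakeup latency */
    };

    /*
     * The layer below would pick the deepest state whose exit latency fits
     * the hint; states are assumed to be ordered from shallow to deep.
     */
    static int pick_idle_state(const struct sched_idle_hint *hint,
                               const uint64_t *exit_latency_ns, int nr_states)
    {
            int i, best = 0;

            for (i = 0; i < nr_states; i++) {
                    if (exit_latency_ns[i] <= hint->latency_limit_ns &&
                        exit_latency_ns[i] <= hint->next_use_ns)
                            best = i;
            }
            return best;
    }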
Agree. Except that the information should be: "OK, this CPU is now idle and it has not done much work in the recent past; it is a 10% loaded CPU."
This can be said today using PJT's metric. It is then up to the cpuidle governor to decide which idle state to go to. That's what happens today too.
And what about performance scaling? Quite frankly, in my opinion that requires some more investigation, because there still are some open questions in that area. To start with we can just continue using the current heuristics, but perhaps with the scheduler calling the scaling "governor" when it sees fit instead of that "governor" running kind of in parallel with it.
Exactly. How this can be done is elaborated above. This is one of the key things we need today, IMHO.
or (b) come up with a unified load-balancing/cpufreq/cpuidle implementation as per Ingo's request. The latter is harder but, with a good design, has potentially a lot more benefits.
A possible implementation for (a) is to let the scheduler focus on performance load-balancing but control the balance ratio from a cpufreq governor (via things like arch_scale_freq_power() or something new). CPUfreq would not be concerned just with individual CPU load/frequency but also making a decision on how tasks are balanced between CPUs based on the overall load (e.g. four CPUs are enough for the current load, I can shut the other four off by telling the scheduler not to use them).
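A crude sketch of that idea: a cpufreq-side policy decides how many CPUs the current load needs and tells the scheduler not to use the rest. The setter below is hypothetical; arch_scale_freq_power() mentioned above is one existing channel such a value could travel through:

    /*
     * Sketch only. 'needed' CPUs are judged enough for the current total
     * load; the contribution of the remaining CPUs is zeroed so the
     * balancer stops placing tasks there. set_cpu_scale_power() is a
     * made-up name; 1024 is the conventional full-capacity value.
     */
    extern void set_cpu_scale_power(int cpu, unsigned long power);  /* hypothetical */

    static void restrict_balancing(int nr_cpus, int needed)
    {
            int cpu;

            for (cpu = 0; cpu < nr_cpus; cpu++)
                    set_cpu_scale_power(cpu, cpu < needed ? 1024 : 0);
    }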
As for Ingo's preferred solution (b), a proposal forward could be to factor the load balancing out of kernel/sched/fair.c and provide an abstract interface (like load_class?) for easier extending or different policies (e.g. small task packing).
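And a sketch of what a load_class-style interface might look like, entirely hypothetical, only to make the suggestion above concrete:

    /* Forward declarations standing in for the real scheduler types. */
    struct rq;
    struct task_struct;

    /*
     * Hypothetical 'load_class': the load-balancing policy factored out of
     * fair.c behind a small set of hooks. None of these names exist today.
     */
    struct load_class {
            const char *name;
            /* Periodic/idle-time rebalancing under this policy. */
            void (*rebalance)(struct rq *this_rq, int idle);
            /* Pick a CPU for a waking task under this policy. */
            int  (*select_task_rq)(struct task_struct *p, int prev_cpu, int flags);
    };

    /*
     * A default 'spread for performance' class and a 'small task packing'
     * class could then be swapped without touching the core scheduler.
     */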
Let me elaborate on the patches that have been posted so far on the power awareness of the scheduler. When we say *power aware scheduler* what exactly do we want it to do?
In my opinion, we want it to *avoid touching idle cpus*, so as to keep them in that state longer and *keep more power domains idle*, so as to yield power savings with them turned off. The patches released so far are striving to do the latter. Correct me if I am wrong at this.
Don't take me wrong, task packing to keep more power domains idle is probably in the right direction but it may not address all issues. You realised this is not enough since you are now asking for the scheduler to take feedback from cpuidle. As I pointed out above, you try to create a loop which may or may not work, especially given the wide variety of hardware parameters.
Also feel free to point out any other expectation from the power aware scheduler if I am missing any.
If the patches so far are enough and solved all the problems, you are not missing any. Otherwise, please see my view above.
Please define clearly what the scheduler, cpufreq, cpuidle should be doing and what communication should happen between them.
If I have got Ingo's point right, the issues with them are that they are not taking a holistic approach to meet the said goal.
Probably because scheduler changes, cpufreq and cpuidle are all trying to address the same thing, but independently of each other and possibly in conflicting ways.
Keeping more power domains idle (by packing tasks) would sound much better if the scheduler took all aspects of doing such a thing into account, like the points below (see the sketch after this list):
1. How idle are the cpus in the domain that it is packing into?
2. Can they go to turbo mode? Because if they can, then we can't pack tasks; we would need certain cpus in that domain to stay idle.
3. Are the domains in which we pack tasks power gated?
4. Will there be a significant performance drop from packing? Meaning, do the tasks share cpu resources? If they do, there will be severe contention.
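A toy sketch of a packing decision that runs through the four checks above; every helper here is hypothetical and only names the consideration it stands for:

    #include <stdbool.h>

    /* Hypothetical predicates, one per consideration above. */
    extern bool target_domain_mostly_idle(int domain);         /* 1 */
    extern bool target_domain_needs_turbo(int domain);         /* 2 */
    extern bool packing_lets_a_domain_power_gate(int domain);  /* 3 */
    extern bool packed_tasks_would_contend(int domain);        /* 4 */

    static bool worth_packing_into(int domain)
    {
            if (!target_domain_mostly_idle(domain))
                    return false;
            if (target_domain_needs_turbo(domain))
                    return false;   /* packing would defeat turbo */
            if (!packing_lets_a_domain_power_gate(domain))
                    return false;   /* no power gating, little to gain */
            if (packed_tasks_would_contend(domain))
                    return false;   /* shared resources, severe contention */
            return true;
    }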
So by this you add a lot more information about the power configuration into the scheduler, getting it to make more informed decisions about task scheduling. You may eventually reach a point where cpuidle governor doesn't have much to do (which may be a good thing) and reach Ingo's goal.
That's why I suggested maybe starting to take the load balancing out of fair.c and make it easily extensible (my opinion, the scheduler guys may disagree). Then make it more aware of topology, power configuration so that it makes the right task placement decision. You then get it to tell cpufreq about the expected performance requirements (frequency decided by cpufreq) and cpuidle about how long it could be idle for (you detect a periodic task every 1ms, or you don't have any at all because they were migrated, the right C state being decided by the governor).
There is another angle to look at that as I said somewhere above.
What if we could integrate cpuidle with cpufreq so that there is one code layer representing what the hardware can do to the scheduler? What benefits can we get from that, if any?
We could debate on this point. I am a bit confused about this. As I see it, there is no problem with keeping them separately. One, because of code readability; it is easy to understand what are the different parameters that the performance of CPU depends on, without needing to dig through the code. Two, because cpu frequency kicks in during runtime primarily and cpuidle during idle time of the cpu.
But this would also mean creating well-defined interfaces between them. Integrating cpufreq and cpuidle seems the better argument to make, due to their common functionality at the higher level of talking to hardware and tuning the performance parameters of the cpu. But I disagree that the scheduler should be put into this common framework as well, since it has functionalities which are totally disjoint from what subsystems such as cpuidle and cpufreq are intended to do.
Rafael
Regards Preeti U Murthy
Hi Preeti,
(trimming lots of text, hopefully to make it easier to follow)
On Sun, Jun 09, 2013 at 04:42:18AM +0100, Preeti U Murthy wrote:
On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote:
On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
Meanwhile the scheduler should ensure that the tasks are retained on that CPU, whose frequency is boosted, and should not load balance them away, so that they can finish quickly. This, I think, is what is missing. Again this comes down to the scheduler taking feedback from the CPU frequency governors, which is not currently happening.
Same loop again. The cpu load goes high because (a) there is more work, possibly triggered by external events, and (b) the scheduler decided to balance the CPUs in a certain way. As for cpuidle above, the scheduler has direct influence on the cpufreq decisions. How would the scheduler know which CPU not to balance against? Are CPUs in a cluster synchronous? Is it better to let the other CPU idle or more efficient to run this cluster at half-speed?
Let's say there is an increase in the load: does the scheduler wait until cpufreq figures this out, or does it try to take the other CPUs out of idle? Who's making this decision? That's currently a potentially unstable loop.
Yes, it is and I don't think we currently have good answers here.
My answer to the above question is that the scheduler does not wait until cpufreq figures it out. All that the scheduler cares about today is load balancing: spread the load and hope it finishes soon. There is a possibility today that even before the cpufreq governor can boost the frequency of a cpu, the scheduler has already spread the load.
As for the second question, it will wake up idle cpus if it must in order to load balance.
That's exactly my point. Such behaviour can become unstable (it probably won't oscillate but it affects the power or performance).
That is a good question: "does the scheduler wait until cpufreq figures it out?" Currently the answer is no, it does not communicate with cpufreq at all (except through cpu power, but that is the good part of the story, so I will not get there now). But maybe we should change this. I think we can do so in the following way.
When can the scheduler talk to cpufreq? It can do so under the following circumstances:
1. Load is too high across the system, all cpus are loaded, no chance of load balancing. Therefore ask the cpufreq governor to step up the frequency to improve performance.
Too high or too low loads across the whole system are relatively simple scenarios: for the former boost the frequency (cpufreq can do this on its own, the scheduler has nowhere to balance anyway), for the latter pack small tasks (or other heuristics).
But the bigger issue is where some CPUs are idle while others are running at a smaller frequency. With the current implementation it is even hard to get into this asymmetric state (some cluster loaded while the other in deep sleep) unless the load is low and you apply some small task packing patch.
2. The scheduler finds out that if it has to load balance, it has to do so on cpus which are in a deep idle state (currently this logic is not present, but it is worth getting in). It then decides to increase the frequency of the already loaded cpus to improve performance. It calls the cpufreq governor.
So you say that the scheduler decides to increase the frequency of the already loaded cpus to improve performance. Doesn't this mean that the scheduler takes on some of the responsibilities of cpufreq? You now add logic about boosting CPU frequency to the scheduler.
What's even more problematic is that cpufreq has policies decided by the user (or pre-configured OS policies) but the scheduler is not aware of them. Let's say the user wants a more conservative cpufreq policy, how long should the scheduler wait for cpufreq to boost the frequency before waking idle CPUs?
There are many questions like the above. I'm not looking for specific answers but rather trying to get a clearer, higher-level view of the responsibilities of the three main factors contributing to power/performance: load balancing (scheduler), cpufreq and cpuidle.
3. The scheduler finds out that if it has to load balance, it has to do so on a different power domain which is currently idle (shallow/deep). It thinks better of it and calls the cpufreq governor to boost the frequency of the cpus in the current domain.
As for 2, the scheduler would be making power decisions. Then why not make a unified implementation? Or remove such decisions from the scheduler.
The results of many measurements seem to indicate that it generally is better to do the work as quickly as possible and then go idle again, but there are costs associated with going back and forth from idle to non-idle etc.
I think we can even out the cost/benefit of race-to-idle by choosing when to do it wisely. For example, if points 2 and 3 above hold (idle cpus are in deep sleep states, or we would need to load balance onto a different power domain), then step up the frequency of the currently working cpus and reap the benefit.
And such decision would be made by ...? I guess the scheduler again.
And what about performance scaling? Quite frankly, in my opinion that requires some more investigation, because there still are some open questions in that area. To start with we can just continue using the current heuristics, but perhaps with the scheduler calling the scaling "governor" when it sees fit instead of that "governor" running kind of in parallel with it.
Exactly. How this can be done is elaborated above. This is one of the key things we need today, IMHO.
The scheduler asking the cpufreq governor of what it needs is a too simplistic view IMHO. What if the governor is conservative? How much does the scheduler wait until the feedback loop reacts (CPU frequency raised increasing the idle time so that the scheduler eventually measures a smaller load)?
The scheduler could get more direct feedback from cpufreq like "I'll get to this frequency in x ms" or not at all but then the scheduler needs to make another power-related decision on whether to wait (be conservative) or wake up an idle CPU. Do you want to add various power policies at the scheduler level just to match the cpufreq ones?
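If cpufreq did give such feedback, it might look something like the sketch below; the structure and names are purely hypothetical, and the point is only that the scheduler would then still have to make a power-policy decision of its own:

    #include <stdbool.h>
    #include <stdint.h>

    /* Purely hypothetical feedback from cpufreq to the scheduler. */
    struct cpufreq_feedback {
            unsigned int target_khz;   /* frequency the governor is ramping to */
            uint64_t     eta_ns;       /* "I'll get there in roughly this long" */
    };

    /*
     * Whether waiting for the ramp-up beats waking an idle CPU is itself a
     * power/performance policy decision - exactly the concern raised above.
     */
    static bool worth_waiting(const struct cpufreq_feedback *fb,
                              uint64_t idle_wakeup_cost_ns)
    {
            return fb->eta_ns < idle_wakeup_cost_ns;
    }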
That's why I suggested maybe starting to take the load balancing out of fair.c and make it easily extensible (my opinion, the scheduler guys may disagree). Then make it more aware of topology, power configuration so that it makes the right task placement decision. You then get it to tell cpufreq about the expected performance requirements (frequency decided by cpufreq) and cpuidle about how long it could be idle for (you detect a periodic task every 1ms, or you don't have any at all because they were migrated, the right C state being decided by the governor).
There is another angle to look at that as I said somewhere above.
What if we could integrate cpuidle with cpufreq so that there is one code layer representing what the hardware can do to the scheduler? What benefits can we get from that, if any?
We could debate on this point. I am a bit confused about this. As I see it, there is no problem with keeping them separately. One, because of code readability; it is easy to understand what are the different parameters that the performance of CPU depends on, without needing to dig through the code. Two, because cpu frequency kicks in during runtime primarily and cpuidle during idle time of the cpu.
But this would also mean creating well-defined interfaces between them. Integrating cpufreq and cpuidle seems the better argument to make, due to their common functionality at the higher level of talking to hardware and tuning the performance parameters of the cpu. But I disagree that the scheduler should be put into this common framework as well, since it has functionalities which are totally disjoint from what subsystems such as cpuidle and cpufreq are intended to do.
It's not about the whole scheduler but rather the load balancing, task placement. You can try to create well defined interfaces between them but first of all let's define clearly what responsibilities each of the three frameworks have.
As I said in my first email on this subject, we could:
a) let the scheduler focus on performance only but control (restrict) the load balancing from cpufreq. For example via cpu_power, a value of 0 meaning don't balance against it. Cpufreq changes the frequency based on the load and may allow the scheduler to use idle CPUs. Such approach requires closer collaboration between cpufreq and cpuidle (possibly even merging them) and cpufreq needs to become even more aware of CPU topology.
or:
b) Merge the load balancer and cpufreq together (could leave cpuidle out initially) with a new design.
Any other proposals are welcome. So far they were either tweaks in various places (small task packing) or are relatively vague (like we need two-way communication between cpuidle and scheduler).
Best regards.
On 06/09/2013 05:42 AM, Preeti U Murthy wrote:
Hi Rafael,
On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote:
On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
On 06/07/2013 08:21 PM, Catalin Marinas wrote:
I think you are missing Ingo's point. It's not about the scheduler complying with decisions made by various governors in the kernel (which may or may not have enough information) but rather the scheduler being in a better position for making such decisions.
My mail pointed out that I disagree with this design ("the scheduler being in a better position for making such decisions"). I think it should be a 2 way co-operation. I have elaborated below.
I agree with that.
Take the cpuidle example, it uses the load average of the CPUs, however this load average is currently controlled by the scheduler (load balance). Rather than using a load average that degrades over time and gradually putting the CPU into deeper sleep states, the scheduler could predict more accurately that a run-queue won't have any work over the next x ms and ask for a deeper sleep state from the beginning.
How will the scheduler know that there will not be work in the near future? How will the scheduler ask for a deeper sleep state?
My answer to the above two questions are, the scheduler cannot know how much work will come up. All it knows is the current load of the runqueues and the nature of the task (thanks to the PJT's metric). It can then match the task load to the cpu capacity and schedule the tasks on the appropriate cpus.
The scheduler can decide to load a single CPU or cluster and let the others idle. If the total CPU load can fit into a smaller number of CPUs it could as well tell cpuidle to go into deeper state from the beginning as it moved all the tasks elsewhere.
So why can't it do that today? What's the problem?
The reason the scheduler does not do it today is the prefer_sibling logic. Tasks within a core get distributed across cores if there is more than one of them, since the cpu power of a core is not high enough to handle more than one task.
However at a socket level/ MC level (cluster at a low level), there can be as many tasks as there are cores because the socket has enough CPU capacity to handle them. But the prefer_sibling logic moves tasks across socket/MC level domains even when load<=domain_capacity.
I think the reason why the prefer_sibling logic was introduced, is that scheduler looks at spreading tasks across all the resources it has. It believes keeping tasks within a cluster/socket level domain would mean tasks are being throttled by having access to only the cluster/socket level resources. Which is why it spreads.
The prefer_sibling logic is nothing but a flag set at domain level to communicate to the scheduler that load should be spread across the groups of this domain. In the above example across sockets/clusters.
But I think it is time we take another look at the prefer_sibling logic and decide on its worthiness.
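As a toy illustration of the effect described above (this is not the fair.c code; the names are made up and it only mimics how a prefer_sibling-style flag forces spreading by capping how much load a group is allowed to hold):

    #include <stdbool.h>

    /*
     * When the flag is set and the group is not the local one, treat the
     * group as if it had room for only one task, so load spreads out
     * across the groups of the domain.
     */
    static unsigned long effective_group_capacity(unsigned long nr_task_capacity,
                                                  bool prefer_sibling,
                                                  bool local_group)
    {
            if (prefer_sibling && !local_group && nr_task_capacity > 1)
                    return 1;
            return nr_task_capacity;
    }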
Regarding future work, neither cpuidle nor the scheduler know this but the scheduler would make a better prediction, for example by tracking task periodicity.
Well, basically, two pieces of information are needed to make target idle state selections: (1) when the CPU (core or package) is going to be used next time and (2) how much latency for going back to the non-idle state can be tolerated. While the scheduler knows (1) to some extent (arguably, it generally cannot predict when hardware interrupts are going to occur), I'm not really sure about (2).
As a consequence, it leaves certain cpus idle. The load of these cpus degrade. It is via this load that the scheduler asks for a deeper sleep state. Right here we have scheduler talking to the cpuidle governor.
So we agree that the scheduler _tells_ the cpuidle governor when to go idle (but not how deep).
It does indicate to cpuidle how deep it can go, however, by providing it with the information about when the CPU is going to be used next time (from the scheduler's perspective).
IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the cpuidle does not get enough information from the scheduler (arguably this could be fixed)
OK, so what information is missing in your opinion?
and (2) the scheduler does not have any information about the idle states (power gating etc.) to make any informed decision on which/when CPUs should go idle.
That's correct, which is a drawback. However, on some systems it may never have that information (because hardware coordinates idle states in a way that is opaque to the OS - e.g. by autopromoting deeper states when idle for a sufficiently long time) and on some systems that information may change over time (i.e. the availability of specific idle states may depend on factors that aren't constant).
If you attempted to cover all of the possible complications related to hardware designs in that area in the scheduler, you'd end up with a completely unmaintainable piece of code.
As you said, it is a non-optimal one-way communication but the solution is not feedback loop from cpuidle into scheduler. It's like the scheduler managed by chance to get the CPU into a deeper sleep state and now you'd like the scheduler to get feedback form cpuidle and not disturb that CPU anymore. That's the closed loop I disagree with. Could the scheduler not make this informed decision before - it has this total load, let's get this CPU into deeper sleep state?
No, it couldn't in general, for the above reasons.
I don't see what the problem is with the cpuidle governor waiting for the load to degrade before putting that cpu to sleep. In my opinion, putting a cpu to deeper sleep states should happen gradually.
If we know in advance that the CPU can be put into idle state Cn, there is no reason to put it into anything shallower than that.
On the other hand, if the CPU is in Cn already and there is a possibility to put it into a deeper low-power state (which we didn't know about before), it may make sense to promote it into that state (if that's safe) or even wake it up and idle it again.
Yes, sorry I said it wrong in the previous mail. Today the cpuidle governor is capable of putting a CPU in idle state Cn directly, by looking at various factors like the current load, next timer, history of interrupts, exit latency of states. At the end of this evaluation it puts it into idle state Cn.
Also, it checks whether its decision was right. This is with respect to your statement "if there is a possibility to put it into a deeper low power state". It queues a timer, set to fire just after its predicted wake-up time, before putting the cpu into the idle state. If the wake-up prediction turns out to be wrong, this timer fires to wake up the cpu, and the cpu is then put into a deeper sleep state.
Some SoCs can have a cluster of cpus sharing some resources, e.g. cache, so they must enter the same state at the same moment. Besides the synchronization mechanisms, that adds a dependency on the next event. For example, the u8500 board has a couple of cpus. In order to make them enter retention, both must enter the same state, but not necessarily at the same moment. The first cpu will wait in WFI and the second one will initiate the retention mode when entering this state. Unfortunately, some time may have passed while the second cpu entered this state, and the next event for the first cpu could be too close, thus violating the criteria the governor used when it chose this state for the second cpu.
Also, the latencies can change with the frequencies, so there is a dependency on cpufreq: the lower the frequency, the higher the latency. If the scheduler takes the decision to go to a specific state assuming the exit latency is a given duration, and the frequency then decreases, this exit latency could increase as well and make the system less responsive.
I don't know how the latency figures were computed (e.g. worst case, taken at the lowest frequency or not), but we have just one set of values. That should happen with the current code.
Another point is the timer that allows detecting a bad decision and going to a deeper idle state. With the cluster dependency described above, we may wake up a particular cpu, which turns on the cluster and makes the entire cluster wake up in order to enter a deeper state, and this could fail because the other cpu may not fulfil the constraint at that moment.
This means time will tell the governors what kinds of workloads are running on the system. If the cpu is idle for long, it probably means that the system is less loaded and it makes sense to put the cpus to deeper sleep states. Of course there could be sporadic bursts or quieting down of tasks, but these are corner cases.
It's nothing wrong with degrading given the information that cpuidle currently has. It's a heuristics that worked ok so far and may continue to do so. But see my comments above on why the scheduler could make more informed decisions.
We may not move all the power gating information to the scheduler but maybe find a way to abstract this by giving more hints via the CPU and cache topology. The cpuidle framework (it may not be much left of a governor) would then take hints about estimated idle time and invoke the low-level driver about the right C state.
On Mon, 10 Jun 2013, Daniel Lezcano wrote:
Some SoCs can have a cluster of cpus sharing some resources, e.g. cache, so they must enter the same state at the same moment. Besides the synchronization mechanisms, that adds a dependency on the next event. For example, the u8500 board has a couple of cpus. In order to make them enter retention, both must enter the same state, but not necessarily at the same moment. The first cpu will wait in WFI and the second one will initiate the retention mode when entering this state. Unfortunately, some time may have passed while the second cpu entered this state, and the next event for the first cpu could be too close, thus violating the criteria the governor used when it chose this state for the second cpu.
Also, the latencies can change with the frequencies, so there is a dependency on cpufreq: the lower the frequency, the higher the latency. If the scheduler takes the decision to go to a specific state assuming the exit latency is a given duration, and the frequency then decreases, this exit latency could increase as well and make the system less responsive.
I don't know how the latency figures were computed (e.g. worst case, taken at the lowest frequency or not), but we have just one set of values. That should happen with the current code.
Another point is the timer that allows detecting a bad decision and going to a deeper idle state. With the cluster dependency described above, we may wake up a particular cpu, which turns on the cluster and makes the entire cluster wake up in order to enter a deeper state, and this could fail because the other cpu may not fulfil the constraint at that moment.
Nobody is saying that this sort of thing should be in the fastpath of the scheduler.
But if the scheduler has a table that tells it the possible states, and the cost to get from the current state to each of these states (and to get back and/or wake up to full power), then the scheduler can make the decision on what to do, invoke a routine to make the change (and in the meantime, not be fighting the change by trying to schedule processes on a core that's about to be powered off), and then when the change happens, the scheduler will have a new version of the table of possible states and costs
This isn't in the fastpath, it's in the rebalancing logic.
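A sketch of the kind of table being described, purely hypothetical and only to make the idea concrete:

    #include <stdint.h>

    /* Hypothetical per-core (or per-cluster) table consulted from the rebalancing path. */
    struct power_state_entry {
            uint64_t enter_cost_ns;   /* cost to get into this state */
            uint64_t exit_cost_ns;    /* cost to get back to full power */
            uint64_t power_mw;        /* rough power draw while in the state */
    };

    struct power_state_table {
            int nr_states;
            struct power_state_entry state[8];   /* arbitrary bound for the sketch */
    };

    /*
     * The rebalancer would weigh "keep using this core" against "move its
     * work and let it enter state i"; platform code would refresh the table
     * after each change, as described above.
     */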
David Lang
On 6/11/2013 5:27 PM, David Lang wrote:
Nobody is saying that this sort of thing should be in the fastpath of the scheduler.
But if the scheduler has a table that tells it the possible states, and the cost to get from the current state to each of these states (and to get back and/or wake up to full power), then the scheduler can make the decision on what to do, invoke a routine to make the change (and in the meantime, not be fighting the change by trying to schedule processes on a core that's about to be powered off), and then when the change happens, the scheduler will have a new version of the table of possible states and costs
This isn't in the fastpath, it's in the rebalancing logic.
the reality is much more complex unfortunately. C and P states hang together tightly, and even C state on one core impacts other cores' performance, just like P state selection on one core impacts other cores.
(at least for x86, we should really stop talking as if the OS picks the "frequency", that's just not the case anymore)
On Wed, Jun 12, 2013 at 7:18 AM, Arjan van de Ven arjan@linux.intel.com wrote:
On 6/11/2013 5:27 PM, David Lang wrote:
Nobody is saying that this sort of thing should be in the fastpath of the scheduler.
But if the scheduler has a table that tells it the possible states, and the cost to get from the current state to each of these states (and to get back and/or wake up to full power), then the scheduler can make the decision on what to do, invoke a routine to make the change (and in the meantime, not be fighting the change by trying to schedule processes on a core that's about to be powered off), and then when the change happens, the scheduler will have a new version of the table of possible states and costs
This isn't in the fastpath, it's in the rebalancing logic.
the reality is much more complex unfortunately. C and P states hang together tightly, and even C state on one core impacts other cores' performance, just like P state selection on one core impacts other cores.
(at least for x86, we should really stop talking as if the OS picks the "frequency", that's just not the case anymore)
This is true of ARM platforms too. As Daniel pointed out in an earlier email, the operating point (frequency, voltage) has a bearing on the c-state latency too.
An additional complexity is thermal constraints. E.g. on a quad-core Cortex-A15 processor capable of, say, 1.5GHz, you won't be able to run all 4 cores at that speed for very long without exceeding the thermal envelope. These overdrive frequencies (turbo in x86-speak) impact the rest of the system by either constraining the frequency of other cores or requiring aggressive thermal management.
Do we really want to track these details in the scheduler or just let the scheduler provide notifications to the existing subsystems (cpufreq, cpuidle, thermal, etc.) with some sort of feedback going back to the scheduler to influence future decisions?
Feedback to the scheduler could be something like the following (pardon the names):
1. ISOLATE_CORE: Don't schedule anything on this core - cpuidle might use this to synchronise cores for a cluster shutdown, thermal framework could use this as idle injection to reduce temperature
2. CAP_CAPACITY: Don't expect cpufreq to raise the frequency on this core - cpufreq might use this to cap overall energy since overdrive operating points are very expensive, thermal might use this to slow down rate of increase of die temperature
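As an illustration only (the flag names and helpers below are invented here, not an existing interface), the two hints could be encoded as per-cpu flags that the balancer checks:

enum sched_pm_hint {
    SCHED_HINT_NONE         = 0,
    SCHED_HINT_ISOLATE_CORE = 1 << 0, /* don't schedule anything here */
    SCHED_HINT_CAP_CAPACITY = 1 << 1, /* don't expect the frequency to rise */
};

struct cpu_pm_hints {
    unsigned int flags; /* SCHED_HINT_* bits set by cpuidle/cpufreq/thermal */
};

/* The balancer would skip a cpu whose ISOLATE_CORE hint is set ... */
static inline int cpu_usable_for_balance(const struct cpu_pm_hints *h)
{
    return !(h->flags & SCHED_HINT_ISOLATE_CORE);
}

/* ... and would stop assuming extra frequency headroom when CAP_CAPACITY is set. */
static inline unsigned long cpu_usable_capacity(const struct cpu_pm_hints *h,
                                                unsigned long max_cap,
                                                unsigned long cur_cap)
{
    return (h->flags & SCHED_HINT_CAP_CAPACITY) ? cur_cap : max_cap;
}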
Regards, Amit
On Wed, 12 Jun 2013, Amit Kucheria wrote:
On Wed, Jun 12, 2013 at 7:18 AM, Arjan van de Ven arjan@linux.intel.com wrote:
On 6/11/2013 5:27 PM, David Lang wrote:
Nobody is saying that this sort of thing should be in the fastpath of the scheduler.
But if the scheduler has a table that tells it the possible states, and the cost to get from the current state to each of these states (and to get back and/or wake up to full power), then the scheduler can make the decision on what to do, invoke a routine to make the change (and in the meantime, not be fighting the change by trying to schedule processes on a core that's about to be powered off), and then when the change happens, the scheduler will have a new version of the table of possible states and costs
This isn't in the fastpath, it's in the rebalancing logic.
the reality is much more complex unfortunately. C and P states hang together tightly, and even C state on one core impacts other cores' performance, just like P state selection on one core impacts other cores.
(at least for x86, we should really stop talking as if the OS picks the "frequency", that's just not the case anymore)
This is true of ARM platforms too. As Daniel pointed out in an earlier email, the operating point (frequency, voltage) has a bearing on the c-state latency too.
An additional complexity is thermal constraints. E.g. on a quad-core Cortex-A15 processor capable of, say, 1.5GHz, you won't be able to run all 4 cores at that speed for very long w/o exceeding the thermal envelope. These overdrive frequencies (turbo in x86-speak) impact the rest of the system by either constraining the frequency of other cores or requiring aggressive thermal management.
Do we really want to track these details in the scheduler or just let the scheduler provide notifications to the existing subsystems (cpufreq, cpuidle, thermal, etc.) with some sort of feedback going back to the scheduler to influence future decisions?
Feedback to the scheduler could be something like the following (pardon the names):
1. ISOLATE_CORE: Don't schedule anything on this core - cpuidle might use this to synchronise cores for a cluster shutdown, thermal framework could use this as idle injection to reduce temperature
2. CAP_CAPACITY: Don't expect cpufreq to raise the frequency on this core - cpufreq might use this to cap overall energy since overdrive operating points are very expensive, thermal might use this to slow down rate of increase of die temperature
How much data are you going to have to move back and forth between the different systems?
do you really only want the all-or-nothing "use this core as much as possible" vs "don't use this core at all"? or do you need the ability to indicate how much to use a particular core (something that is needed anyway for asymmetrical cores I think)
If there is too much information that needs to be moved back and forth between these 'subsystems' for the 'right' thing to happen, then it would seem like it makes more sense to combine them.
Even combined, there are parts that are still pretty modular (like the details of shifting from one state to another, and the different high level strategies to follow for different modes of operation), but having access to all the information rather than only bits and pieces of the information at lower granularity would seem like an improvement.
David Lang
Hi Arjan,
On Wed, Jun 12, 2013 at 02:48:58AM +0100, Arjan van de Ven wrote:
On 6/11/2013 5:27 PM, David Lang wrote:
Nobody is saying that this sort of thing should be in the fastpath of the scheduler.
But if the scheduler has a table that tells it the possible states, and the cost to get from the current state to each of these states (and to get back and/or wake up to full power), then the scheduler can make the decision on what to do, invoke a routine to make the change (and in the meantime, not be fighting the change by trying to schedule processes on a core that's about to be powered off), and then when the change happens, the scheduler will have a new version of the table of possible states and costs
This isn't in the fastpath, it's in the rebalancing logic.
the reality is much more complex unfortunately. C and P states hang together tightly, and even C state on one core impacts other cores' performance, just like P state selection on one core impacts other cores.
(at least for x86, we should really stop talking as if the OS picks the "frequency", that's just not the case anymore)
I agree, the reality is very complex. But we should go back and analyse what problem we are trying to solve, what each framework is trying to address.
When viewed separately from the scheduler, cpufreq and cpuidle governors do the right thing. But they both base their action on the CPU load (balance) decided by the scheduler and it's the latter that we are trying to adjust (and we are still debating what the right approach is).
Since such information seems too complex to be moved into the scheduler, why don't we get cpufreq in charge of restricting the load balancing to certain CPUs? It already tracks the load/idle time to (gradually) change the P state. Depending on the governor/policy, it could decide that (for example) 4 CPUs running at a higher power P state are enough, telling the scheduler to ignore the other CPUs. It won't pick a frequency, but (as it currently does) adjust it to keep a minimal idle state on those CPUs. If that's no longer possible (high load), it can remove the restriction and let the scheduler use the other idle CPUs (cpufreq could even do a direct load_balance() call). This is a governor decision and the user is in control of what governors are used.
Cpuidle I think for now can stay the same, gradually entering deeper sleep states. It could be later unified with cpufreq if there are any benefits. In deciding the load balancing restrictions, maybe cpufreq should be aware of C-state latencies.
Cpufreq would need to get more knowledge of the power topology and thermal management. It would still be the framework restricting the P state or changing the load balancing restrictions to let CPUs cool down. More hooks could be added if needed for better responsiveness (like entering idle or task wake-up).
With the above, the scheduler will just focus on performance (given the restrictions imposed by cpufreq) and it only needs to be aware of the CPU topology from a performance perspective (caches, hyperthreading) together with the cpu_power parameter for the weighted load.
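A very rough sketch of such a governor-side decision (the structure, the 80% figure and the mask handling are purely illustrative, not an existing cpufreq interface):

#define MAX_CPUS 8

struct gov_state {
    unsigned int  load_pct[MAX_CPUS]; /* recent load per cpu, 0..100 */
    unsigned int  nr_cpus;
    unsigned long allowed_mask;       /* cpus the scheduler may balance over */
};

static void gov_update_allowed_cpus(struct gov_state *g)
{
    unsigned int total = 0, needed, i;

    for (i = 0; i < g->nr_cpus; i++)
        total += g->load_pct[i];

    /* Enough cpus to keep each below ~80% at the higher P state. */
    needed = (total + 79) / 80;
    if (needed < 1)
        needed = 1;
    if (needed > g->nr_cpus)
        needed = g->nr_cpus;

    g->allowed_mask = 0;
    for (i = 0; i < needed; i++)
        g->allowed_mask |= 1UL << i;

    /*
     * On overload the mask grows back to all cpus, and the governor
     * could also trigger an immediate load_balance() as suggested above.
     */
}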
This isn't in the fastpath, it's in the rebalancing logic.
the reality is much more complex unfortunately. C and P states hang together tightly, and even C state on one core impacts other cores' performance, just like P state selection on one core impacts other cores.
(at least for x86, we should really stop talking as if the OS picks the "frequency", that's just not the case anymore)
I agree, the reality is very complex. But we should go back and analyse what problem we are trying to solve, what each framework is trying to address.
When viewed separately from the scheduler, cpufreq and cpuidle governors do the right thing. But they both base their action on the CPU load (balance) decided by the scheduler and it's the latter that we are trying to adjust (and we are still debating what the right approach is).
Since such information seems too complex to be moved into the scheduler, why don't we get cpufreq in charge of restricting the load balancing to certain CPUs? It already tracks the load/idle time to (gradually) change the P state. Depending on the governor/policy, it could decide that (for
(btw in case you missed it, for Intel HW we no longer use cpufreq anymore)
Cpuidle I think for now can stay the same, gradually entering deeper sleep states. It could be later unified with cpufreq if there are any benefits. In deciding the load balancing restrictions, maybe cpufreq should be aware of C-state latencies.
on the Intel side, we're likely to merge the Intel idle driver and P state driver in the near future fwiw. We'll keep using cpuidle framework (since it doesn't do all that much other than provide a nice hook for the idle loop), but we likely will make a hw specific selection logic there.
I do agree the scheduler needs to get integrated a bit better, in that it has some better knowledge, and to be honest, we likely need to switch from giving tasks credit for "time consumed" to giving them credit for something like "cycles consumed" or "instructions executed" or a mix thereof, so that a task that runs on a slower CPU (for either policy choice reasons or due to hardware capabilities) gets charged less than when it runs fast.
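A toy illustration of that accounting change; the scale constant and the helper below are made up here, and this is not how the scheduler currently charges runtime:

#include <stdint.h>

#define FULL_SPEED_SCALE 1024 /* cpu running flat out == 1024 */

/*
 * delta_ns:     wall-clock runtime just consumed by the task
 * cur_capacity: current relative speed of this cpu (frequency and/or
 *               micro-architecture), FULL_SPEED_SCALE when running flat out
 */
static inline uint64_t scale_exec_delta(uint64_t delta_ns,
                                        unsigned long cur_capacity)
{
    return (delta_ns * cur_capacity) / FULL_SPEED_SCALE;
}

/*
 * A task that ran 10ms on a half-speed cpu would be charged roughly 5ms
 * worth of runtime, so it is not penalised for the policy or hardware choice.
 */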
On Wed, Jun 12, 2013 at 04:24:52PM +0100, Arjan van de Ven wrote:
This isn't in the fastpath, it's in the rebalancing logic.
the reality is much more complex unfortunately. C and P states hang together tightly, and even C state on one core impacts other cores' performance, just like P state selection on one core impacts other cores.
(at least for x86, we should really stop talking as if the OS picks the "frequency", that's just not the case anymore)
I agree, the reality is very complex. But we should go back and analyse what problem we are trying to solve, what each framework is trying to address.
When viewed separately from the scheduler, cpufreq and cpuidle governors do the right thing. But they both base their action on the CPU load (balance) decided by the scheduler and it's the latter that we are trying to adjust (and we are still debating what the right approach is).
Since such information seems too complex to be moved into the scheduler, why don't we get cpufreq in charge of restricting the load balancing to certain CPUs? It already tracks the load/idle time to (gradually) change the P state. Depending on the governor/policy, it could decide that (for
(btw in case you missed it, for Intel HW we no longer use cpufreq anymore)
Do you mean the intel_pstate.c code? It indeed doesn't use much of cpufreq, just setpolicy, and it's on its own afterwards. Separating this from the framework probably has real benefits for the Intel processors but it would make a unified scheduler/cpufreq/cpuidle solution harder (just a remark, I don't say it's good or bad, there are many opinions against the unified solution; ARM could do the same for configurations like big.LITTLE).
But such a driver could still interact with the scheduler to control its load balancing. At a quick look (I'm not familiar with this driver), it tracks the per-CPU load and increases or decreases the P-state (similar to a cpufreq governor). It could as well track the total load and (depending on hardware configuration) get some CPUs into a lower performance P-state (or even C-state) and tell the scheduler to avoid them.
One way to control load-balancing ratio is via something like arch_scale_freq_power(). We could tweak the scheduler further so that something like cpu_power==0 means do not schedule anything there.
So my proposal is to move the load-balancing hints (load ratio, avoiding CPUs etc.) outside the scheduler into drivers like intel_pstate.c or cpufreq governors. We then focus on getting the best performance out of the scheduler (like quicker migration) but it would not be concerned with the power consumption.
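A sketch of that idea, assuming the scheduler were indeed tweaked to treat cpu_power == 0 as "keep away"; the exclusion mask and how it gets filled are hypothetical, while arch_scale_freq_power() is the existing weak hook the scheduler calls when updating cpu_power:

#define SCHED_POWER_SCALE 1024

struct sched_domain; /* opaque here */

/* Set by the platform driver / governor; purely illustrative. */
static unsigned long pm_excluded_mask;

unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
{
    if (pm_excluded_mask & (1UL << cpu))
        return 0;                 /* "do not schedule anything here" */

    return SCHED_POWER_SCALE;     /* normal, full capacity */
}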
I do agree the scheduler needs to get integrated a bit better, in that it has some better knowledge, and to be honest, we likely need to switch from giving tasks credit for "time consumed" to giving them credit for something like "cycles consumed" or "instructions executed" or a mix thereof. So that a task that runs on a slower CPU (for either policy choice reasons or due to hardware capabilities), it gets charged less than when it runs fast.
I agree, this would be useful in optimising the scheduler so that it makes the right task placement/migration decisions (but as I said above, make the power aspect transparent to the scheduler).
On 06/12/2013 02:27 AM, David Lang wrote:
On Mon, 10 Jun 2013, Daniel Lezcano wrote:
Some SoCs can have a cluster of cpus sharing some resources, e.g. cache, so they must enter the same state at the same moment. Besides the synchronization mechanisms, that adds a dependency on the next event. For example, the u8500 board has a couple of cpus. In order to make them enter retention, both must enter the same state, but not necessarily at the same moment. The first cpu will wait in WFI and the second one will initiate the retention mode when entering this state. Unfortunately, some time could have passed while the second cpu entered this state and the next event for the first cpu could be too close, thus violating the criteria of the governor when it chose this state for the second cpu.
Also the latencies could change with the frequencies, so there is a dependency on cpufreq: the lower the frequency, the higher the latency. If the scheduler takes the decision to go to a specific state assuming the exit latency is a given duration, and the frequency then decreases, this exit latency could increase as well and leave the system less responsive.
I don't know how the latency computations were made (e.g. worst case, taken at the lowest frequency or not), but we have just one set of values, so this can already happen with the current code.
Another point is the timer that allows us to detect a bad decision and go to a deeper idle state. With the cluster dependency described above, we may wake up a particular cpu, which turns on the cluster and makes the entire cluster wake up in order to enter a deeper state, which could fail because the other cpu may not fulfill the constraint at that moment.
Nobody is saying that this sort of thing should be in the fastpath of the scheduler.
But if the scheduler has a table that tells it the possible states, and the cost to get from the current state to each of these states (and to get back and/or wake up to full power), then the scheduler can make the decision on what to do, invoke a routine to make the change (and in the meantime, not be fighting the change by trying to schedule processes on a core that's about to be powered off), and then when the change happens, the scheduler will have a new version of the table of possible states and costs
This isn't in the fastpath, it's in the rebalancing logic.
As Arjan mentioned, it is not as simple as this.
We want the scheduler to take some decisions with the knowledge of idle latencies. In other words move the governor logic into the scheduler.
The scheduler can take the decision and the backend driver provides the interface to go to the idle state.
But unfortunately each platform behaves in different ways, and describing such behaviors will help to find the correct design. I am not raising a lot of issues but just trying to enumerate the constraints we have.
What is the correct decision when a lot of pm blocks are tied together and the
In the example given by Arjan, the frequencies could be per cluster, hence decreasing the frequency for one core will decrease the frequency of the other core. So if the scheduler takes the decision to put one core into a specific idle state, based on the target residency and the exit latency measured when the frequency is at max (the other core is doing something), and the frequency then decreases, the exit latency may increase and the idle cpu will take more time to exit the idle state than expected, thus adding latency to the system.
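A back-of-the-envelope illustration of how much this can skew the numbers, assuming (purely for illustration) that the exit latency stretches linearly as the cluster frequency drops:

#include <stdio.h>

int main(void)
{
    unsigned int exit_lat_us = 100;     /* from the table, measured at f_max */
    unsigned int f_max_khz = 1200000;
    unsigned int f_cur_khz = 400000;    /* the other core slowed the cluster */

    unsigned int effective_us = (unsigned long long)exit_lat_us
                                * f_max_khz / f_cur_khz;

    printf("advertised exit latency: %u us, effective: %u us\n",
           exit_lat_us, effective_us);
    return 0;
}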
What would be the correct decision in this case? Wake up the idle cpu when the frequency changes to re-evaluate the idle state? Provide idle latencies for the min freq only? Or is it acceptable to have such latency added when the frequency decreases?
Also, an interesting question is how we get these latencies.
They are all written in the C-state tables but we don't know the accuracy of these values. Were they measured at max or min frequency?
Were they measured with a driver powering down the peripherals or without?
For embedded systems, we may have different implementations and maybe different latencies. Would it make sense to pass these values through a device tree and let the SoC vendor specify the right values? (IMHO, only the SoC vendor can do a correct measurement with an oscilloscope.)
I know there are a lot of questions :)
On Wed, 12 Jun 2013, Daniel Lezcano wrote:
On Mon, 10 Jun 2013, Daniel Lezcano wrote:
Some SoCs can have a cluster of cpus sharing some resources, e.g. cache, so they must enter the same state at the same moment. Besides the synchronization mechanisms, that adds a dependency on the next event. For example, the u8500 board has a couple of cpus. In order to make them enter retention, both must enter the same state, but not necessarily at the same moment. The first cpu will wait in WFI and the second one will initiate the retention mode when entering this state. Unfortunately, some time could have passed while the second cpu entered this state and the next event for the first cpu could be too close, thus violating the criteria of the governor when it chose this state for the second cpu.
Also the latencies could change with the frequencies, so there is a dependency on cpufreq: the lower the frequency, the higher the latency. If the scheduler takes the decision to go to a specific state assuming the exit latency is a given duration, and the frequency then decreases, this exit latency could increase as well and leave the system less responsive.
I don't know how the latency computations were made (e.g. worst case, taken at the lowest frequency or not), but we have just one set of values, so this can already happen with the current code.
Another point is the timer that allows us to detect a bad decision and go to a deeper idle state. With the cluster dependency described above, we may wake up a particular cpu, which turns on the cluster and makes the entire cluster wake up in order to enter a deeper state, which could fail because the other cpu may not fulfill the constraint at that moment.
Nobody is saying that this sort of thing should be in the fastpath of the scheduler.
But if the scheduler has a table that tells it the possible states, and the cost to get from the current state to each of these states (and to get back and/or wake up to full power), then the scheduler can make the decision on what to do, invoke a routine to make the change (and in the meantime, not be fighting the change by trying to schedule processes on a core that's about to be powered off), and then when the change happens, the scheduler will have a new version of the table of possible states and costs
This isn't in the fastpath, it's in the rebalancing logic.
As Arjan mentioned, it is not as simple as this.
We want the scheduler to take some decisions with the knowledge of idle latencies. In other words move the governor logic into the scheduler.
The scheduler can take the decision and the backend driver provides the interface to go to the idle state.
But unfortunately each platform behaves in different ways, and describing such behaviors will help to find the correct design. I am not raising a lot of issues but just trying to enumerate the constraints we have.
What is the correct decision when a lot of pm blocks are tied together and the
In the example given by Arjan, the frequencies could be per cluster, hence decreasing the frequency for one core will decrease the frequency of the other core. So if the scheduler takes the decision to put one core into a specific idle state, based on the target residency and the exit latency measured when the frequency is at max (the other core is doing something), and the frequency then decreases, the exit latency may increase and the idle cpu will take more time to exit the idle state than expected, thus adding latency to the system.
What would be the correct decision in this case ? Wake up the idle cpu when the frequency change to re-evaluate an idle state ? Provide idle latencies for the min freq only ? Or is it acceptable to have such latency added when the frequency decrease ?
Also, an interesting question is how do we get these latencies ?
They are all written in the c-state tables but we don't know the accuracy of these values ? Were they measured with freq max / min ?
Were they measured with a driver powering down the peripherals or without ?
For embedded systems, we may have different implementations and maybe different latencies. Would it make sense to pass these values through a device tree and let the SoC vendor specify the right values? (IMHO, only the SoC vendor can do a correct measurement with an oscilloscope.)
I know there are a lot of questions :)
well, I have two immediate reactions.
First, use the values provided by the vendor; if they are wrong, performance is not optimal and people will pick a different vendor (so they have an incentive to be right :-)
Second, "measure them" :-)
have the device tree enumerate the modes of operation, but then at bootup, run through a series of tests to bounce between the different modes and measure how long it takes to move back and forth. If the system can't measure the difference against its clocks, then the user isn't going to see the difference either, so there's no need to be as accurate as a lab bench with a scope. What matters is how much work can end up getting done for the user, not the number of nanoseconds between voltage changes (the latter will affect the former, but it's the former that you really care about)
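The calibration loop could be as simple as the sketch below; the platform_* calls are stand-ins for whatever the real driver provides and are stubbed out here so the pattern compiles:

#include <stdio.h>
#include <time.h>

/* Stubs standing in for the real platform driver entry/exit calls. */
static void platform_enter_state(int state) { (void)state; }
static void platform_exit_state(int state)  { (void)state; }

static long long measure_round_trip_ns(int state, int iterations)
{
    struct timespec t0, t1;
    long long total = 0;
    int i;

    for (i = 0; i < iterations; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        platform_enter_state(state);
        platform_exit_state(state);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total += (t1.tv_sec - t0.tv_sec) * 1000000000LL
                 + (t1.tv_nsec - t0.tv_nsec);
    }
    return total / iterations;  /* an average is good enough here */
}

int main(void)
{
    int state;

    for (state = 0; state < 3; state++)
        printf("state %d: ~%lld ns round trip\n",
               state, measure_round_trip_ns(state, 100));
    return 0;
}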
remember, perfect is the enemy of good enough. you don't have to have a perfect mapping of every possible change, you just need to be close enough to make reasonable decisions. You can't really predict the future anyway, so you are making a guess at what the load on the system is going to be in the future. Sometimes you will guess wrong no matter how accurate your latency measurements are. You have to accept that, and once you accept that, the severity of being wrong in some corner cases becomes less significant.
David Lang
On Sunday, June 09, 2013 09:12:18 AM Preeti U Murthy wrote:
Hi Rafael,
Hi Preeti,
On 06/08/2013 07:32 PM, Rafael J. Wysocki wrote:
On Saturday, June 08, 2013 12:28:04 PM Catalin Marinas wrote:
On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
[...]
The scheduler can decide to load a single CPU or cluster and let the others idle. If the total CPU load can fit into a smaller number of CPUs it could as well tell cpuidle to go into deeper state from the beginning as it moved all the tasks elsewhere.
So why can't it do that today? What's the problem?
The reason that scheduler does not do it today is due to the prefer_sibling logic. The tasks within a core get distributed across cores if they are more than 1, since the cpu power of a core is not high enough to handle more than one task.
However at a socket level/ MC level (cluster at a low level), there can be as many tasks as there are cores because the socket has enough CPU capacity to handle them. But the prefer_sibling logic moves tasks across socket/MC level domains even when load<=domain_capacity.
I think the reason why the prefer_sibling logic was introduced, is that scheduler looks at spreading tasks across all the resources it has. It believes keeping tasks within a cluster/socket level domain would mean tasks are being throttled by having access to only the cluster/socket level resources. Which is why it spreads.
The prefer_sibling logic is nothing but a flag set at domain level to communicate to the scheduler that load should be spread across the groups of this domain. In the above example across sockets/clusters.
But I think it is time we take another look at the prefer_sibling logic and decide on its worthiness.
Well, it does look like something that would be good to reconsider.
Some results indicate that for a given CPU package (cluster/socket) there is a threshold number of tasks such that it is beneficial to pack tasks into that package as long as the total number of tasks running on it does not exceed that number. It may be 1 (which is the value used currently with prefer_sibling set if I understood what you said correctly), but it very well may be 2 or more (depending on the hardware characteristics).
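Expressed as a sketch (the threshold value and the names are invented purely for illustration), such a rule is tiny:

struct package {
    unsigned int nr_running;     /* tasks currently on this package */
    unsigned int pack_threshold; /* 1 today with prefer_sibling, maybe 2+ */
};

/* Return 1 if a new task should still be packed onto this package. */
static inline int package_can_pack(const struct package *pkg)
{
    return pkg->nr_running < pkg->pack_threshold;
}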
[...]
If we know in advance that the CPU can be put into idle state Cn, there is no reason to put it into anything shallower than that.
On the other hand, if the CPU is in Cn already and there is a possibility to put it into a deeper low-power state (which we didn't know about before), it may make sense to promote it into that state (if that's safe) or even wake it up and idle it again.
Yes, sorry I said it wrong in the previous mail. Today the cpuidle governor is capable of putting a CPU in idle state Cn directly, by looking at various factors like the current load, next timer, history of interrupts, exit latency of states. At the end of this evaluation it puts it into idle state Cn.
Also it checks whether its decision was right. This is with respect to your statement "if there is a possibility to put it into deeper low power state". It queues a timer just after its predicted wake-up time before putting the cpu into the idle state. If the wake-up prediction turns out to be wrong, this timer fires to wake up the cpu, and the cpu is then put into a deeper sleep state.
So I don't think we need to modify that behavior. :-)
This means time will tell the governors what kinds of workloads are running on the system. If the cpu is idle for long, it probably means that the system is less loaded and it makes sense to put the cpus to deeper sleep states. Of course there could be sporadic bursts or quieting down of tasks, but these are corner cases.
It's nothing wrong with degrading given the information that cpuidle currently has. It's a heuristics that worked ok so far and may continue to do so. But see my comments above on why the scheduler could make more informed decisions.
We may not move all the power gating information to the scheduler but maybe find a way to abstract this by giving more hints via the CPU and cache topology. The cpuidle framework (it may not be much left of a governor) would then take hints about estimated idle time and invoke the low-level driver about the right C state.
Overall, it looks like it'd be better to split the governor "layer" between the scheduler and the idle driver with a well defined interface between them. That interface needs to be general enough to be independent of the underlying hardware.
We need to determine what kinds of information should be passed both ways and how to represent it.
I agree with this design decision.
OK, so let's try to take one step more and think about what part should belong to the scheduler and what part should be taken care of by the "idle" driver.
Do you have any specific view on that?
Of course, you could export more scheduler information to cpuidle, various hooks (task wakeup etc.) but then we have another framework, cpufreq. It also decides the CPU parameters (frequency) based on the load controlled by the scheduler. Can cpufreq decide whether it's better to keep the CPU at higher frequency so that it gets to idle quicker and therefore deeper sleep states? I don't think it has enough information because there are at least three deciding factors (cpufreq, cpuidle and scheduler's load balancing) which are not unified.
Why not? When the cpu load is high, cpu frequency governor knows it has to boost the frequency of that CPU. The task gets over quickly, the CPU goes idle. Then the cpuidle governor kicks in to put the CPU to deeper sleep state gradually.
The cpufreq governor boosts the frequency enough to cover the load, which means reducing the idle time. It does not know whether it is better to boost the frequency twice as high so that it gets to idle quicker. You can change the governor's policy but does it have any information from cpuidle?
Well, it may get that information directly from the hardware. Actually, intel_pstate does that, but intel_pstate is the governor and the scaling driver combined.
To add to this, cpufreq currently functions in the below fashion. I am talking of the on demand governor, since it is more relevant to our discussion.
----stepped up frequency------
----threshold--------
-----stepped down freq level1---
-----stepped down freq level2---
---stepped down freq level3----
If the cpu idle time is below a threshold , it boosts the frequency to
Did you mean "above the threshold"?
one level above straight away and does not vary it any further. If the cpu idle time is below a threshold there is a step down in frequency levels by 5% of the current frequency at every sampling period, provided the cpu behavior is constant.
I think we can improve this implementation by better interaction with cpuidle and scheduler.
When it is stepping up the frequency, it should do so in steps, with the frequency being a *function of the current cpu load* (a function of the idle time would also do).
When it is stepping down the frequency, it should interact with cpuidle. It should get from cpuidle information regarding the idle state that the cpu is in. The reason is that the cpu frequency governor is aware only of the idle time of the cpu, not the idle state it is in. If it gets to know that the cpu is in a deep idle state, it could step down frequency levels to level n straight away, just like cpuidle does to put cpus into state Cn.
Or, as an alternative, just like stepping up, make the stepping down also a function of idle time. Perhaps fn(|threshold - idle_time|).
Also, one more point to note is that if cpuidle puts cpus into idle states that clock-gate them, then there is no need for the cpu frequency governor on that cpu. cpufreq can check with cpuidle on this front before it queries a cpu.
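A toy version of the step-up/step-down policy sketched above; the threshold, the proportional steps and the deep-idle shortcut are all illustrative, not ondemand's actual behaviour:

#include <stdio.h>

static unsigned int next_freq_khz(unsigned int cur_khz,
                                  unsigned int min_khz, unsigned int max_khz,
                                  unsigned int idle_pct,   /* 0..100 */
                                  int cpu_in_deep_idle_state)
{
    const unsigned int threshold = 20; /* idle% below this means "busy" */

    if (cpu_in_deep_idle_state)
        return min_khz;                /* drop straight to the floor */

    if (idle_pct < threshold) {
        /* step up proportionally to the load (100 - idle_pct) */
        unsigned int load = 100 - idle_pct;
        unsigned int f = cur_khz + (max_khz - cur_khz) * load / 100;
        return f > max_khz ? max_khz : f;
    }

    /* step down proportionally to |threshold - idle_pct| */
    unsigned int f = cur_khz - (cur_khz - min_khz) * (idle_pct - threshold) / 100;
    return f < min_khz ? min_khz : f;
}

int main(void)
{
    printf("%u\n", next_freq_khz(800000, 200000, 1200000,  5, 0)); /* busy */
    printf("%u\n", next_freq_khz(800000, 200000, 1200000, 70, 0)); /* idle */
    return 0;
}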
cpufreq ondemand (or intel_pstate for that matter) doesn't touch idle CPUs, because it uses deferrable timers. It basically only handles CPUs that aren't idle at the moment.
However, it doesn't exactly know when the given CPU stopped being idle, because its sampling is not generally synchronized with the scheduler's operations. That, among other things, is why I'm thinking that it might be better if the scheduler told cpufreq (or intel_pstate) when to try to adjust frequencies so that it doesn't need to sample by itself.
[...]
Let's say there is an increase in the load, does the scheduler wait until cpufreq figures this out or tries to take the other CPUs out of idle? Who's making this decision? That's currently a potentially unstable loop.
Yes, it is and I don't think we currently have good answers here.
My answer to the above question is scheduler does not wait until cpufreq figures it out. All that the scheduler cares about today is load balancing. Spread the load and hope it finishes soon. There is a possibility today that even before cpu frequency governor can boost the frequency of cpu, the scheduler can spread the load.
That is a valid observation, but I wanted to say that we didn't really understand how those things should be arranged.
As for the second question, it will wake up idle cpus if it must, in order to load balance.
"Does the scheduler wait until cpufreq figures it out?" is a good question. Currently the answer is no, it does not communicate with cpu frequency at all (except through cpu power, but that is the good part of the story, so I will not get there now). But maybe we should change this. I think we can do so in the following way.
When can a scheduler talk to cpu frequency? It can do so under the below circumstances:
1. Load is too high across the system, all cpus are loaded, no chance of load balancing. Therefore ask the cpu frequency governor to step up the frequency to improve performance.
2. The scheduler finds out that if it has to load balance, it has to do so on cpus which are in a deep idle state (currently this logic is not present, but it is worth getting in). It then decides to increase the frequency of the already loaded cpus to improve performance. It calls the cpu freq governor.
3. The scheduler finds out that if it has to load balance, it has to do so on a different power domain which is currently idle (shallow/deep). It thinks the better of it and calls the cpu frequency governor to boost the frequency of the cpus in the current domain.
While 2 and 3 depend on the scheduler having knowledge about idle states and power domains, which it currently does not have, 1 can be achieved with the current code. The scheduler keeps track of failed load balancing efforts with lb_failed. If it finds that load balancing from a busy group failed (lb_failed > 0), it can call the cpu freq governor to step up the cpu frequency of this busy cpu group, with gov_check_cpu() in the cpu frequency governor code.
Well, if the model is that the scheduler tells cpufreq when to modify frequencies, then it'll need to do that on a regular basis, like every time a task is scheduled or similar.
The results of many measurements seem to indicate that it generally is better to do the work as quickly as possible and then go idle again, but there are costs associated with going back and forth from idle to non-idle etc.
I think we can even out the cost benefit of race to idle, by choosing to do it wisely. Like for example if points 2 and 3 above are true (idle cpus are in deep sleep states or need to ld balance on a different power domain), then step up the frequency of the current working cpus and reap its benefit.
The main problem with cpufreq that I personally have is that the governors carry out their own sampling with pretty much arbitrary resolution that may lead to suboptimal decisions. It would be much better if the scheduler indicated when to *consider* the changing of CPU performance parameters (that may not be frequency alone and not even frequency at all in general), more or less the same way it tells cpuidle about idle CPUs, but I'm not sure if it should decide what performance points to run at.
Very true. See the points 1,2 and 3 above where I list out when scheduler can call cpu frequency.
Well, as I said above, I think that'd need to be done more frequently.
Also an idea about how cpu frequency governor can decide on the scaling frequency is stated above.
Actually, intel_pstate uses a PID controller for making those decisions and I think this may be just the right thing to do.
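For reference, the shape of such a controller is just the textbook PID form; the gains and the setpoint below are arbitrary illustrations, not intel_pstate's actual tuning:

#include <stdio.h>

struct pid {
    double setpoint; /* target, e.g. desired busy percentage */
    double kp, ki, kd;
    double integral;
    double last_err;
};

static double pid_update(struct pid *p, double measured, double dt)
{
    double err = p->setpoint - measured;
    double deriv = (err - p->last_err) / dt;

    p->integral += err * dt;
    p->last_err = err;

    /* output is a correction applied to the current P-state request */
    return p->kp * err + p->ki * p->integral + p->kd * deriv;
}

int main(void)
{
    struct pid p = { .setpoint = 97.0, .kp = 0.2, .ki = 0.05, .kd = 0.0 };
    double busy = 80.0; /* measured busy % over the last sample */

    printf("adjustment: %.2f\n", pid_update(&p, busy, 0.01));
    return 0;
}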
[...]
Well, there's nothing like "predicted load". At best, we may be able to make more or less educated guesses about it, so in my opinion it is better to use the information about what happened in the past for making decisions regarding the current settings and re-adjust them over time as we get more information.
Agree with this as well. The scheduler can at best supply information regarding the historic load and hope that it is what defines the future as well. Apart from this I don't know what other information the scheduler can supply the cpuidle governor with.
So how much decision making regarding the idle state to put the given CPU into should be there in the scheduler? I believe the only information coming out of the scheduler regarding that should be "OK, this CPU is now idle and I'll need it in X nanoseconds from now" plus possibly a hint about the wakeup latency tolerance (but those hints may come from other places too). That said the decision *which* CPU should become idle at the moment very well may require some information about what options are available from the layer below (for example, "putting core X into idle for Y of time will save us Z energy" or something like that).
Agree. Except that the information should be "Ok , this CPU is now idle and it has not done much work in the recent past,it is a 10% loaded CPU".
And what would that be useful for to the "idle" layer? What matters is the "I'll need it in X nanoseconds from now" part.
Yes, the load part would be interesting to the "frequency" layer.
This can be said today using PJT's metric. It is now for the cpuidle governor to decide the idle state to go to. That's what happens today too.
And what about performance scaling? Quite frankly, in my opinion that requires some more investigation, because there still are some open questions in that area. To start with we can just continue using the current heuristics, but perhaps with the scheduler calling the scaling "governor" when it sees fit instead of that "governor" running kind of in parallel with it.
Exactly. How this can be done is elaborated above. This is one of the key things we need today,IMHO.
[...]
There is another angle to look at that as I said somewhere above.
What if we could integrate cpuidle with cpufreq so that there is one code layer representing what the hardware can do to the scheduler? What benefits can we get from that, if any?
We could debate on this point. I am a bit confused about this. As I see it, there is no problem with keeping them separately. One, because of code readability; it is easy to understand what are the different parameters that the performance of CPU depends on, without needing to dig through the code. Two, because cpu frequency kicks in during runtime primarily and cpuidle during idle time of the cpu.
That's a very useful observation. Indeed, there's the "idle" part that needs to be invoked when the CPU goes idle (and it should decide what idle state to put that CPU into), and there's the "scaling" part that needs to be invoked when the CPU has work to do (and it should decide what performance point to put that CPU into). The question is, though, if it's better to have two separate frameworks for those things (which is what we have today) or to make them two parts of the same framework (like two callbacks one of which will be executed for CPUs that have just become idle and the other will be invoked for CPUs that have just got work to do).
But this would also mean creating well defined interfaces between them. Integrating cpufreq and cpuidle seems like a better argument to make due to their common functionality at a higher level of talking to hardware and tuning the performance parameters of cpu. But I disagree that scheduler should be put into this common framework as well as it has functionalities which are totally disjoint from what subsystems such as cpuidle and cpufreq are intended to do.
That's correct. The role of the scheduler, in my opinion, may be to call the "idle" and "scaling" functions at the right time and to give them information needed to make optimal choices.
Thanks, Rafael
Hi,
On 06/11/2013 06:20 AM, Rafael J. Wysocki wrote:
OK, so let's try to take one step more and think about what part should belong to the scheduler and what part should be taken care of by the "idle" driver.
Do you have any specific view on that?
I gave it some thought and went through Ingo's mail once again. I have some view points which I have stated at the end of this mail.
Of course, you could export more scheduler information to cpuidle, various hooks (task wakeup etc.) but then we have another framework, cpufreq. It also decides the CPU parameters (frequency) based on the load controlled by the scheduler. Can cpufreq decide whether it's better to keep the CPU at higher frequency so that it gets to idle quicker and therefore deeper sleep states? I don't think it has enough information because there are at least three deciding factors (cpufreq, cpuidle and scheduler's load balancing) which are not unified.
Why not? When the cpu load is high, cpu frequency governor knows it has to boost the frequency of that CPU. The task gets over quickly, the CPU goes idle. Then the cpuidle governor kicks in to put the CPU to deeper sleep state gradually.
The cpufreq governor boosts the frequency enough to cover the load, which means reducing the idle time. It does not know whether it is better to boost the frequency twice as high so that it gets to idle quicker. You can change the governor's policy but does it have any information from cpuidle?
Well, it may get that information directly from the hardware. Actually, intel_pstate does that, but intel_pstate is the governor and the scaling driver combined.
To add to this, cpufreq currently functions in the below fashion. I am talking of the on demand governor, since it is more relevant to our discussion.
----stepped up frequency------
----threshold--------
-----stepped down freq level1---
-----stepped down freq level2---
---stepped down freq level3----
If the cpu idle time is below a threshold , it boosts the frequency to
Did you mean "above the threshold"?
No, I meant "below". I am referring to the cpu *idle* time.
Also an idea about how cpu frequency governor can decide on the scaling frequency is stated above.
Actually, intel_pstate uses a PID controller for making those decisions and I think this may be just the right thing to do.
But don't you think we need to include the current cpu load during this decision making as well? I mean a fn(idle_time) logic in cpu frequency governor, which is currently absent. Today, it just checks if idle_time < threshold, and sets one specific frequency. Of course the PID could then make the decision about the frequencies which can be candidates for scaling up, but cpu freq governor could decide which among these to pick based on fn(idle_time) .
[...]
Well, there's nothing like "predicted load". At best, we may be able to make more or less educated guesses about it, so in my opinion it is better to use the information about what happened in the past for making decisions regarding the current settings and re-adjust them over time as we get more information.
Agree with this as well. scheduler can at best supply information regarding the historic load and hope that it is what defines the future as well. Apart from this I dont know what other information scheduler can supply cpuidle governor with.
So how much decision making regarding the idle state to put the given CPU into should be there in the scheduler? I believe the only information coming out of the scheduler regarding that should be "OK, this CPU is now idle and I'll need it in X nanoseconds from now" plus possibly a hint about the wakeup latency tolerance (but those hints may come from other places too). That said the decision *which* CPU should become idle at the moment very well may require some information about what options are available from the layer below (for example, "putting core X into idle for Y of time will save us Z energy" or something like that).
Agree. Except that the information should be "Ok , this CPU is now idle and it has not done much work in the recent past,it is a 10% loaded CPU".
And what would that be useful for to the "idle" layer? What matters is the "I'll need it in X nanoseconds from now" part.
Yes, the load part would be interesting to the "frequency" layer.
What if we could integrate cpuidle with cpufreq so that there is one code layer representing what the hardware can do to the scheduler? What benefits can we get from that, if any?
We could debate on this point. I am a bit confused about this. As I see it, there is no problem with keeping them separately. One, because of code readability; it is easy to understand what are the different parameters that the performance of CPU depends on, without needing to dig through the code. Two, because cpu frequency kicks in during runtime primarily and cpuidle during idle time of the cpu.
That's a very useful observation. Indeed, there's the "idle" part that needs to be invoked when the CPU goes idle (and it should decide what idle state to put that CPU into), and there's the "scaling" part that needs to be invoked when the CPU has work to do (and it should decide what performance point to put that CPU into). The question is, though, if it's better to have two separate frameworks for those things (which is what we have today) or to make them two parts of the same framework (like two callbacks one of which will be executed for CPUs that have just become idle and the other will be invoked for CPUs that have just got work to do).
But this would also mean creating well defined interfaces between them. Integrating cpufreq and cpuidle seems like a better argument to make due to their common functionality at a higher level of talking to hardware and tuning the performance parameters of cpu. But I disagree that scheduler should be put into this common framework as well as it has functionalities which are totally disjoint from what subsystems such as cpuidle and cpufreq are intended to do.
That's correct. The role of the scheduler, in my opinion, may be to call the "idle" and "scaling" functions at the right time and to give them information needed to make optimal choices.
Having looked at the points being brought about in this discussion and the mail that Ingo sent out regarding his view points, I have a few points to make.
Daniel Lezcano made a valid point when he stated that we need to *move the cpufreq and cpuidle governor logic into the scheduler while retaining their driver functionality in those subsystems.*
It is true that I was strongly against moving the governor logic into the scheduler, thinking it would be simpler to enhance the communication interface between the scheduler and the governors. But having given this some thought, I think that would mean greater scope for loopholes.
Catalin pointed this out well with an example in one of his mails: say the scheduler ends up telling the cpu frequency governor when to boost/lower the frequency; the scheduler is not aware of the user policies that go into deciding whether the cpu frequency governor actually does what the scheduler is asking it to do.
It is only the cpu frequency governor that is aware of these user policies, not the scheduler. So how long should the scheduler wait for the cpu frequency governor to boost the frequency? What if the user has selected a powersave mode and the cpu frequency cannot rise any further? That would mean the cpu frequency governor telling the scheduler that it can't do what the scheduler is asking it to do. This decision of the scheduler is then a waste of time, since it gets rejected by the cpu frequency governor and nothing comes of it.
Very clearly the scheduler not being aware of the user policy is a big drawback; had it known the user policies beforehand it would not even have considered boosting the cpu frequency of the cpu in question.
This point that Ingo made is something we need to look hard at: "Today the power saving landscape is fragmented." The scheduler today does not know what in the world is the end result of its decisions. cpuidle and cpu frequency could take decisions that are totally counter-intuitive to the scheduler's. Improving the communication between them would surely mean we export more and more information back and forth, whose end result would probably be to merge the governors and the scheduler. If this vision that "they will eventually get so close that we will end up merging them" is agreed upon, then it might be best to merge them right away without wasting effort on adding logic that tries to communicate between them or even trying to separate the functionalities between scheduler and governors.
I don't think removing certain scheduler functionalities and putting it instead into governors is the right thing to do. Scheduler's functions are tightly coupled with one another. Breaking one will in my opinion break a lot of things.
There have been points brought out strongly about how the scheduler should have a global view of cores so that it knows the effect on a socket when it decides what to do with a core, for instance. This could be the next step in its enhancement. Taking up one of the examples that Daniel brought out: "Putting one of the cpus into an idle state could lower the frequency of the socket, thus hampering the exit latency of this idle state." (Not the exact words, but this is the point.)
Notice how, if the scheduler were to be able to understand the above statement, it first needs to be aware of the cpu frequency and idle state details. *Therefore as a first step we need better knowledge in the scheduler before it makes global decisions*.
Also note that a scheduler cannot, under the above circumstances, talk back and forth to the governors to begin to learn about idle states and frequencies at that point. This simply does not make sense. (True, at this point I am heavily contradicting my previous arguments :P. I felt that the existing communication was good enough and all that was needed was a few more additions, but that does not seem to be the case.)
Arjan also pointed out how a task running on a slower core should be charged less than when it runs on a faster core. Right here is a use case for the scheduler to be aware of the cpu frequency of a core, since today it is the one which charges a task, but it is not aware of what cpu frequency the task is running at. (It is aware of the cpu frequency of a core through cpu power stats, but it uses it only for load balancing today and not when it charges a task for its run time.)
My suggestion at this point is :
1. Begin to move the cpuidle and cpufreq *governor* logic into the scheduler little by little.
2. Scheduler is already aware of the topology details, maybe enhance that as the next step.
At this point, we would have a scheduler well aware of the effect of its load balancing decisions to some extent.
3. Add the logic for the scheduler to get a global view of the cpufreq and idle.
4. Then get system user policies (powersave/performance) to alter scheduler behavior accordingly.
At this point, if we bring in today's patchsets (power aware scheduling and packing tasks), they could deliver their intended benefits in most cases, as against today's sporadic behaviour, because the scheduler would be aware of the whole picture and would do what these patches command only if it is right all the way down to idle states and cpu frequencies, and not just at the load balancing level.
I would appreciate all of yours feedback on the above. I think at this point we are in a position to judge what would be the next move in this direction and make that move soon.
Regards Preeti U Murthy
Hi Catalin,
On 06/08/2013 04:58 PM, Catalin Marinas wrote:
On Fri, Jun 07, 2013 at 07:08:47PM +0100, Preeti U Murthy wrote:
On 06/07/2013 08:21 PM, Catalin Marinas wrote:
I think you are missing Ingo's point. It's not about the scheduler complying with decisions made by various governors in the kernel (which may or may not have enough information) but rather the scheduler being in a better position for making such decisions.
My mail pointed out that I disagree with this design ("the scheduler being in a better position for making such decisions"). I think it should be a 2 way co-operation. I have elaborated below.
Take the cpuidle example, it uses the load average of the CPUs, however this load average is currently controlled by the scheduler (load balance). Rather than using a load average that degrades over time and gradually putting the CPU into deeper sleep states, the scheduler could predict more accurately that a run-queue won't have any work over the next x ms and ask for a deeper sleep state from the beginning.
How will the scheduler know that there will not be work in the near future? How will the scheduler ask for a deeper sleep state?
My answer to the above two questions are, the scheduler cannot know how much work will come up. All it knows is the current load of the runqueues and the nature of the task (thanks to the PJT's metric). It can then match the task load to the cpu capacity and schedule the tasks on the appropriate cpus.
The scheduler can decide to load a single CPU or cluster and let the others idle. If the total CPU load can fit into a smaller number of CPUs it could as well tell cpuidle to go into deeper state from the beginning as it moved all the tasks elsewhere.
This currently does not happen. I have elaborated in the response to Rafael's mail. Sorry I should have put you on the 'To' list, missed that. Do take a look at that mail since many of the replies to your current mail are in it.
What do you mean "from the beginning"? As soon as those cpus go idle, cpuidle will kick in anyway. If you are saying that scheduler should tell cpuidle that "this cpu can go into deep sleep state x, since I am not going to use it for the next y seconds", that is not possible.
Firstly, because the scheduler can't "predict" this 'y' parameter. Secondly, because the hardware could change the idle state availability or details dynamically, as Rafael pointed out, and hence this 'x' is best not told by the scheduler, but queried by the cpuidle governor itself.
Regarding future work, neither cpuidle nor the scheduler know this but the scheduler would make a better prediction, for example by tracking task periodicity.
The prediction that you mention is something the scheduler already exports to cpuidle. load_avg does precisely that: it tracks history and predicts the future based on it. load_avg, tracked periodically by the scheduler, is already visible to the cpuidle governor.
As a consequence, it leaves certain cpus idle. The load of these cpus degrades. It is via this load that the scheduler asks for a deeper sleep state. Right here we have the scheduler talking to the cpuidle governor.
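For illustration, the history tracking behind load_avg (per-entity load tracking) weights each past ~1ms period geometrically, with a half-life of about 32 periods, so a cpu that has just gone quiet sees its tracked load decay over tens of milliseconds; a rough standalone sketch (constants approximate):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double y = pow(0.5, 1.0 / 32.0); /* decay factor: half-life of 32 periods */
    double sum = 0.0;
    int n;

    /* 100 periods of history, of which only the most recent 10 were busy */
    for (n = 0; n < 100; n++) {
        int busy = (n < 10);         /* n = 0 is the newest period */
        sum += busy * pow(y, n);
    }
    printf("decayed load contribution: %.3f\n", sum);
    return 0;
}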
So we agree that the scheduler _tells_ the cpuidle governor when to go idle (but not how deep). IOW, the scheduler drives the cpuidle decisions. Two problems: (1) the cpuidle does not get enough information from the scheduler (arguably this could be fixed) and (2) the scheduler does not have any information about the idle states (power gating etc.) to make any informed decision on which/when CPUs should go idle.
As you said, it is a non-optimal one-way communication but the solution is not feedback loop from cpuidle into scheduler. It's like the scheduler managed by chance to get the CPU into a deeper sleep state and now you'd like the scheduler to get feedback form cpuidle and not disturb that CPU anymore. That's the closed loop I disagree with. Could the scheduler not make this informed decision before - it has this total load, let's get this CPU into deeper sleep state?
Let's say the scheduler does make an informed decision up front: let's get this cpu into an idle state. Then what? Say the load begins to increase on the system. The scheduler has to wake up cpus. Which cpus are best to wake up? Who tells the scheduler this? One, the power gating information, which is yet to be exported to the scheduler, can tell the scheduler this to an extent. As far as I can see, the next one to guide the scheduler here is cpuidle, isn't it?
I don't see what the problem is with the cpuidle governor waiting for the load to degrade before putting that cpu to sleep. In my opinion, putting a cpu to deeper sleep states should happen gradually. This means time will tell the governors what kinds of workloads are running on the system. If the cpu is idle for long, it probably means that the system is less loaded and it makes sense to put the cpus to deeper sleep states. Of course there could be sporadic bursts or quieting down of tasks, but these are corner cases.
It's nothing wrong with degrading given the information that cpuidle currently has. It's a heuristics that worked ok so far and may continue to do so. But see my comments above on why the scheduler could make more informed decisions.
scheduler can certainly make more informed decisions like:
1. Don't wake up idle cpus
2. Don't wake up cpus in a different power domain
3. Do not move tasks away from cpus in turbo mode.
These are a few. See how all of them require scheduler to talk to cpufreq and cpuidle to find out? Can you list how scheduler can make informed decision without getting information from them?
For this you may say that this is why we need to get all the decision making into the scheduler. But I disagree, because integrating cpuidle and cpufreq governing seems fine: at a high level their functionality is the same, that being querying the hardware and deciding what is best for the cpus. But that's not the case with the scheduler. Its primary aim is to make sure there are enough resources for the tasks, that it is able to see the topology of cpus and load balance bottom up, do fair scheduling within a cpu and so on. Why would you want to add more complexity to it?
We may not move all the power gating information to the scheduler but maybe find a way to abstract this by giving more hints via the CPU and cache topology.
Correct. Power gating and topology information would best live in the scheduler, primarily because this information is available nowhere else, and secondly because the scheduling domain and group topology was created specifically for the scheduler.
The cpuidle framework (there may not be much left of a governor) would then take hints about the estimated idle time and invoke the low-level driver to enter the right C state.
This happens today.
Of course, you could export more scheduler information to cpuidle, various hooks (task wakeup etc.) but then we have another framework, cpufreq. It also decides the CPU parameters (frequency) based on the load controlled by the scheduler. Can cpufreq decide whether it's better to keep the CPU at higher frequency so that it gets to idle quicker and therefore deeper sleep states? I don't think it has enough information because there are at least three deciding factors (cpufreq, cpuidle and scheduler's load balancing) which are not unified.
Why not? When the cpu load is high, cpu frequency governor knows it has to boost the frequency of that CPU. The task gets over quickly, the CPU goes idle. Then the cpuidle governor kicks in to put the CPU to deeper sleep state gradually.
The cpufreq governor boosts the frequency enough to cover the load, which means reducing the idle time. It does not know whether it is better to boost the frequency twice as high so that it gets to idle quicker. You can change the governor's policy but does it have any information from cpuidle?
This I have elaborated in the response to Rafael's mail.
Meanwhile the scheduler should ensure that the tasks are retained on that CPU whose frequency is boosted, and should not load balance them away, so that they can get done quickly. This I think is what is missing. Again this comes down to the scheduler taking feedback from the CPU frequency governors, which is not currently happening.
Same loop again. The cpu load goes high because (a) there is more work, possibly triggered by external events, and (b) the scheduler decided to balance the CPUs in a certain way. As for cpuidle above, the scheduler has direct influence on the cpufreq decisions. How would the scheduler know which CPU not to balance against? Are CPUs in a cluster synchronous? Is it better to let the other CPU idle, or more efficient to run this cluster at half speed?
Let's say there is an increase in the load, does the scheduler wait until cpufreq figures this out or tries to take the other CPUs out of idle? Who's making this decision? That's currently a potentially unstable loop.
The answers to the above as I see it are in my response to Rafael's mail. I don't intend to duplicate the replies, hence I would be glad if you could read through that mail and give your feedback on the same.
I would repeat here that today we interface cpuidle/cpufrequency policies with the scheduler, but not the other way around. They do their bit when a cpu is busy/idle. However, the scheduler does not see that somebody else is taking instructions from it and coming back to give different instructions!
The key here is that cpuidle/cpufreq make their primary decision based on something controlled by the scheduler: the CPU load (via run-queue balancing). You would then like the scheduler to take such decisions back into account. It just looks like a closed, possibly 'unstable', loop.
Why? Why would you call a scheduler->cpuidle->cpufrequency interaction a closed loop and not the new_scheduler = scheduler+cpuidle+cpufrequency a closed loop? Here too the scheduler should be made well aware of the decisions it took in the past right?
It's more like:
  scheduler -> cpuidle/cpufreq -> hardware operating point
      ^                                                |
      +------------------------------------------------+
You can argue that you can make an adaptive loop that works fine, but there are so many parameters that I don't see how it would work. The patches so far don't seem to address this. Small task packing, while useful, is just heuristics at the scheduler level.
Correct. That is the issue with them and we need to rectify that.
With a combined decision maker, you aim to reduce this separate decision process and feedback loop. It is probably impossible to eliminate the loop completely because of hardware latencies, PLLs, and CPU frequency not always being the main factor, but you can make the loop more tolerant to instabilities.
I don't see how we can break the above loop that you have drawn, and I don't think it is a good idea to merge the scheduler and cpuidle/cpufreq into one, for the reasons mentioned above.
So I think we either (a) come up with 'clearer' separation of responsibilities between scheduler and cpufreq/cpuidle
I agree with this. This is what I have been emphasizing: if we feel that the cpufreq/cpuidle subsystems are suboptimal in terms of the information that they use to make their decisions, let us improve them. But this will not yield any improvement if the scheduler does not have enough information. And IMHO, the next fundamental information that the scheduler needs should come from cpufreq and cpuidle.
What kind of information? Your suggestion that the scheduler should avoid loading a CPU because it went idle is wrong IMHO. It went idle because the scheduler decided this in first instance.
With regard to cpuidle: which idle state a CPU is in. With regard to cpufreq: when to call it. The former is detailed above and the latter in my response to Rafael's mail.
Then we should move onto supplying scheduler information from the power domain topology, thermal factors, user policies.
I agree with this but at this point you get the scheduler to make more informed decisions about task placement. It can then give more precise hints to cpufreq/cpuidle like the predicted load and those frameworks could become dumber in time, just complying with the requested performance level (trying to break the loop above).
or (b) come up with a unified load-balancing/cpufreq/cpuidle implementation as per Ingo's request. The latter is harder but, with a good design, has potentially a lot more benefits.
A possible implementation for (a) is to let the scheduler focus on performance load-balancing but control the balance ratio from a cpufreq governor (via things like arch_scale_freq_power() or something new). CPUfreq would not be concerned just with individual CPU load/frequency but also making a decision on how tasks are balanced between CPUs based on the overall load (e.g. four CPUs are enough for the current load, I can shut the other four off by telling the scheduler not to use them).
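To illustrate option (a), a sketch of the idea only (not an actual patch): the existing arch_scale_freq_power() hook could return a capacity that a cpufreq-level governor writes; the 'allowed_capacity' per-cpu variable below is invented for this example and does not exist today.

#include <linux/percpu.h>
#include <linux/sched.h>

/*
 * Governor-controlled capacity seen by the load balancer: write
 * SCHED_POWER_SCALE to use a cpu fully, or a tiny value to tell the
 * balancer to avoid it.
 */
DEFINE_PER_CPU(unsigned long, allowed_capacity) = SCHED_POWER_SCALE;

unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
{
        return per_cpu(allowed_capacity, cpu);
}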
As for Ingo's preferred solution (b), a proposal forward could be to factor the load balancing out of kernel/sched/fair.c and provide an abstract interface (like load_class?) for easier extending or different policies (e.g. small task packing).
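For (b), the 'load_class' idea could look roughly like the ops table below. All names are invented here purely to show the shape of such an interface; nothing like this exists in mainline.

#include <linux/sched.h>

/*
 * Hypothetical interface for factoring the balancing policy out of
 * kernel/sched/fair.c, so that alternative policies (e.g. small-task
 * packing) can be plugged in. Illustration only.
 */
struct load_class {
        const char *name;
        /* pick a target cpu for a waking task */
        int  (*select_task_rq)(struct task_struct *p, int prev_cpu, int flags);
        /* periodic rebalance of this cpu's domains */
        void (*rebalance)(int cpu, enum cpu_idle_type idle);
};

extern const struct load_class performance_load_class; /* today's behaviour */
extern const struct load_class packing_load_class;     /* small-task packing */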
Let me elaborate on the patches that have been posted so far on the power awareness of the scheduler. When we say *power aware scheduler* what exactly do we want it to do?
In my opinion, we want it to *avoid touching idle cpus*, so as to keep them in that state longer and *keep more power domains idle*, so as to yield power savings with them turned off. The patches released so far are striving to do the latter. Correct me if I am wrong at this.
Don't get me wrong, task packing to keep more power domains idle is probably a step in the right direction, but it may not address all issues. You realised this is not enough since you are now asking for the scheduler to take feedback from cpuidle. As I pointed out above, you try to create a loop which may or may not work, especially given the wide variety of hardware parameters.
Also feel free to point out any other expectation from the power aware scheduler if I am missing any.
If the patches so far are enough and solved all the problems, you are not missing any. Otherwise, please see my view above.
Please define clearly what the scheduler, cpufreq, cpuidle should be doing and what communication should happen between them.
This I have to an extent elaborated in this mail and in the response to Rafael's.
If I have got Ingo's point right, the issues with them are that they are not taking a holistic approach to meet the said goal.
Probably because scheduler changes, cpufreq and cpuidle are all trying to address the same thing but independent of each other and possibly conflicting.
Keeping more power domains idle (by packing tasks) would sound much better if the scheduler has taken all aspects of doing such a thing into account, like
1. How idle are the cpus in the domain that we are packing onto?
2. Can they go to turbo mode? Because if they do, then we can't pack tasks; we would need certain cpus in that domain idle.
3. Are the domains in which we pack tasks power gated?
4. Will there be a significant performance drop from packing? Meaning, do the tasks share cpu resources? If they do, there will be severe contention.
So by this you add a lot more information about the power configuration into the scheduler, getting it to make more informed decisions about task scheduling. You may eventually reach a point where cpuidle governor doesn't have much to do (which may be a good thing) and reach Ingo's goal.
That's why I suggested maybe starting to take the load balancing out of fair.c and make it easily extensible (my opinion, the scheduler guys may disagree). Then make it more aware of topology, power configuration so that it makes the right task placement decision. You then get it to tell cpufreq about the expected performance requirements (frequency decided by cpufreq) and cpuidle about how long it could be idle for (you detect a periodic task every 1ms, or you don't have any at all because they were migrated, the right C state being decided by the governor).
All the above questions have been addressed above.
Regards
Preeti U Murthy
On 6/6/2013 11:03 PM, Preeti U Murthy wrote:
Hi,
On 05/31/2013 04:22 PM, Ingo Molnar wrote:
PeterZ and me tried to point out the design requirements previously, but it still does not appear to be clear enough to people, so let me spell it out again, in a hopefully clearer fashion.
The scheduler has valuable power saving information available:
when a CPU is busy: about how long the current task expects to run
when a CPU is idle: how long the current CPU expects _not_ to run
topology: it knows how the CPUs and caches interrelate and already optimizes based on that
and I will argue we do too much of this already; various caches (and tlbs) get flushed (on x86 at least) much much more than you'd think.
so the scheduler is in an _ideal_ position to do a judgement call about the near future
this part I will buy
and estimate how deep an idle state a CPU core should enter into and what frequency it should run at.
this part I cannot buy. First of all, we really need to stop thinking about choosing frequency (at least for x86). that concept basically died for x86 6 years ago.
Second, the interactions between these two, and the "what does it mean if I choose something" question, are highly hardware specific and complex nowadays, and going forward will be increasingly so. If anything, we've been moving AWAY from centralized infrastructure there, going towards CPU specific drivers/policies. And hardware rules are very different between platforms here. On Intel, asking for different performance is just an MSR write, and going idle is usually just one instruction. On some ARM, this might involve long, complex interaction calculations, or even *blocking* operations manipulating VRs and PLLs directly... depending on the platform and the states you want to pick. (Hence the CPUFREQ design of requiring changes to be done in a kernel thread.)
Now, I would like the scheduler to give some notifications at certain events (like migrations, starting realtime tasks)...but a few atomic notifier chains will do for that.
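Mechanically, such a hint could be as small as the sketch below; this is only an illustration using the standard notifier API, and the chain name, event and hook point are made up.

#include <linux/notifier.h>

/*
 * Sketch of a "task placed on a new cpu" hint: an atomic notifier chain
 * that a cpufreq/p-state driver can subscribe to and use to bump the
 * destination cpu's performance request.
 */
static ATOMIC_NOTIFIER_HEAD(task_placement_chain);

int register_task_placement_notifier(struct notifier_block *nb)
{
        return atomic_notifier_chain_register(&task_placement_chain, nb);
}

/* hypothetical scheduler hook, called when a task lands on dest_cpu */
void notify_task_placement(int dest_cpu)
{
        atomic_notifier_call_chain(&task_placement_chain, dest_cpu, NULL);
}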
The policies will be very hardware specific, and thus will live outside the scheduler, no matter which way you put it. Now, the scheduler can and should participate more in terms of sharing information in both directions... that I think we can all agree on.
Hi,
On Fri, May 31, 2013 at 11:52:04AM +0100, Ingo Molnar wrote:
- Morten Rasmussen morten.rasmussen@arm.com wrote:
Hi,
A number of patch sets related to power-efficient scheduling have been posted over the last couple of months. Most of them do not have much data to back them up, so I decided to do some testing.
Thanks, numbers are always welcome!
Measurement technique: Time spent non-idle (not in idle state) for each cpu based on cpuidle ftrace events. TC2 does not have per-core power-gating, so packing inside the A7 cluster does not lead to any significant power savings. Note that any product grade hardware (TC2 is a test-chip) will very likely have per-core power-gating, so in those cases packing will have an appreciable effect on power savings. Measuring non-idle time rather than power should give a more clear idea about the effect of the patch sets given that the idle back-end is highly implementation specific.
Note that I still disagree with the whole design notion of having an "idle back-end" (and a 'cpufreq back end') separate from scheduler power saving policy, and none of the patch-sets offered so far solve this fundamental design problem.
PeterZ and me tried to point out the design requirements previously, but it still does not appear to be clear enough to people, so let me spell it out again, in a hopefully clearer fashion.
The scheduler has valuable power saving information available:
when a CPU is busy: about how long the current task expects to run
when a CPU is idle: how long the current CPU expects _not_ to run
topology: it knows how the CPUs and caches interrelate and already optimizes based on that
various high level and low level load averages and other metrics about the recent past that show how busy a particular CPU is, how busy the whole system is, and what the runtime properties of individual tasks is (how often it sleeps, etc.)
so the scheduler is in an _ideal_ position to do a judgement call about the near future and estimate how deep an idle state a CPU core should enter into and what frequency it should run at.
The scheduler is also at a high enough level to host a "I want maximum performance, power does not matter to me" user policy override switch and similar user policy details.
No ifs and whens about that.
Today the power saving landscape is fragmented and sad: we just randomly interface scheduler task packing changes with some idle policy (and cpufreq policy), which might or might not combine correctly.
Even when the numbers improve, it's an entirely random, essentially unmaintainable property: because there's no clear split (possible) between 'scheduler policy' and 'idle policy'. This is why we removed the old, broken power saving scheduler code a year ago: to make room for something _better_.
So if we want to add back scheduler power saving then what should happen is genuinely better code:
To create a new low level idle driver mechanism the scheduler could use and integrate proper power saving / idle policy into the scheduler.
In that power saving framework the already existing scheduler topology information should be extended with deep idle parameters:
enumeration of idle states
how long it takes to enter+exit a particular idle state
[ perhaps information about how destructive to CPU caches that particular idle state is. ]
new driver entry point that allows the scheduler to enter any of the enumerated idle states. Platform code will not change this state, all policy decisions and the idle state is decided at the power saving policy level.
All of this combines into a 'cost to enter and exit an idle state' estimation plus a way to enter idle states. It should be presented to the scheduler in a platform independent fashion, but without policy embedded: a low level platform driver interface in essence.
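As a rough illustration of what such a platform-independent description could look like (struct and field names invented here; this is not an existing interface):

#include <linux/types.h>

/*
 * Sketch only: per-state cost data plus a policy-free entry point, so
 * that scheduler-level code picks the state and the platform merely
 * executes it.
 */
struct sched_idle_state {
        unsigned int exit_latency_us;     /* time to enter + exit the state */
        unsigned int target_residency_us; /* break-even idle time */
        bool         flushes_caches;      /* hint: destructive to CPU caches? */
};

struct sched_idle_driver {
        const struct sched_idle_state *states;
        int nr_states;
        int (*enter)(int cpu, int state_idx); /* low-level entry, no policy */
};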
Thomas Gleixner's recent work to generalize platform idle routines will further help the implementation of this. (that code is upstream already)
_All_ policy, all metrics, all averaging should happen at the scheduler power saving level, in a single place, and then the scheduler should directly drive the new low level idle state driver mechanism.
'scheduler power saving' and 'idle policy' are one and the same principle and they should be handled in a single place to offer the best power saving results.
Note that any RFC patch-set that offers an implementation for this could be structured in a gradual fashion: only implementing it for a limited CPU range initially. The new framework can then be extended to more and more CPUs and architectures, incorporating more complicated power saving features gradually. (The old, existing idle policy code would remain untouched and available - it would simply not be used when the new policy is activated.)
I.e. I'm not asking for a 'rewrite the world' kind of impossible task - I'm providing an actionable path to get improved power saving upstream, but it has to use a _sane design_.
This is a "line in the sand", a 'must have' design property for any scheduler power saving patches to be acceptable - and I'm NAK-ing incomplete approaches that don't solve the root design cause of our power saving troubles...
Looking at the discussion it seems that people have slightly different views, but most agree that the goal is an integrated scheduling, frequency, and idle policy like you pointed out from the beginning.
What is less clear is what such a design would look like. Catalin has suggested two different approaches: integrating cpufreq into the load balancing, or letting the scheduler focus on load balancing and extending cpufreq to also restrict the number of cpus available to the scheduler using cpu_power. The former approach would increase the scheduler complexity significantly, as I already highlighted in my first reply. The latter approach introduces a way to, at least initially, separate load balancing from capacity management, which I think is an interesting approach. Based on this idea I propose the following design:
              +-----------------+
              |                 |     +----------+
 current load | Power scheduler |<----+ cpufreq  |
   +--------->| sched/power.c   +---->| driver   |
   |          |                 |     +----------+
   |          +-------+-------+-+
   |                  |       |
   |    available     |       |
   |    capacity      |       |
   |  (e.g. cpu_power)|       |
   |                  v       |
 +-+-----------------------+  |
 |                         |  |
 |        Scheduler        |  |
 |       sched/fair.c      |  |
 |                         |  |
 +------------+------------+  |
              ^               v
 +------------+------+    +----------+
 |  task load metric |    | cpuidle  |
 |       arch/*      |    |  driver  |
 +-------------------+    +----------+
The intention is that the power scheduler will implement the (unified) power policy. It gets the current load of the system from the scheduler. Based on this information it will adjust the compute capacity available to the scheduler and drive frequency changes such that enough compute capacity is available to handle the current load. If the total load can be handled by a subset of cpus, it will reduce the capacity of the excess cpus to 0 (cpu_power=1). Likewise, if the load increases it will increase capacity of one or more idle cpus to allow the scheduler to spread the load. The power scheduler has knowledge about the power topology and will guide the scheduler to idle the most optimum cpus by reducing its capacity. Global idle decision will be handled by the power scheduler, so cpuidle can over time be reduced to become just a driver, once we have added C-state selection to the power scheduler.
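As a very rough sketch of what the capacity management part could do (illustration only, not code from any posted patch; power_sched_total_load() and power_sched_set_capacity() are invented helpers):

#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/sched.h>

/* invented helpers, would be implemented by the power scheduler */
extern unsigned long power_sched_total_load(void);
extern void power_sched_set_capacity(int cpu, unsigned long capacity);

/*
 * Periodic power scheduler tick: estimate how many cpus are needed for
 * the current load and cap the capacity of the rest so the load balancer
 * avoids them (cpu_power=1), releasing them again when the load grows.
 */
static void power_sched_tick(void)
{
        unsigned long load = power_sched_total_load();
        int needed = DIV_ROUND_UP(load, SCHED_POWER_SCALE);
        int cpu, active = 0;

        for_each_online_cpu(cpu) {
                if (active < needed) {
                        power_sched_set_capacity(cpu, SCHED_POWER_SCALE);
                        active++;
                } else {
                        power_sched_set_capacity(cpu, 1); /* keep it idle */
                }
        }
}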
The scheduler is left to focus on scheduling mechanics and finding the best possible load balance on the cpu capacities set by the power scheduler. It will share a detailed view of the current load with the power scheduler to enable it to make the right capacity adjustments. The scheduler will need some optimization to cope better with asymmetric compute capacities. We may want to reduce capacity of some cpu to increase their idle time while letting others take the majority of the load.
Frequency scaling has a problematic impact on PJT's load metric, which was pointed out a while ago by Chris Redpath https://lkml.org/lkml/2013/4/16/289. So I agree with Arjan's suggestion to change the load calculation basis to something which is frequency invariant. Use whatever counters are available on the specific platform.
I'm aware that the scheduler and power scheduler decisions may be inextricably linked so we may decide to merge them. However, I think it is worth trying to keep the power scheduling decisions out of the scheduler until we have proven it infeasible.
We are going to start working on this design and see where it takes us. We will post any results and suggested patches for folk to comment on. As a starting point we are planning to create a power scheduler (kernel/sched/power.c) similar to a cpufreq governor that does capacity management, and then evolve the solution from there.
Morten
Thanks,
Ingo
On Fri, Jun 14, 2013 at 05:05:22PM +0100, Morten Rasmussen wrote:
The intention is that the power scheduler will implement the (unified) power policy. It gets the current load of the system from the scheduler. Based on this information it will adjust the compute capacity available to the scheduler and drive frequency changes such that enough compute capacity is available to handle the current load. If the total load can be handled by a subset of cpus, it will reduce the capacity of the excess cpus to 0 (cpu_power=1). Likewise, if the load increases it will increase capacity of one or more idle cpus to allow the scheduler to spread the load. The power scheduler has knowledge about the power topology and will guide the scheduler to idle the most optimum cpus by reducing its capacity. Global idle decision will be handled by the power scheduler, so cpuidle can over time be reduced to become just a driver, once we have added C-state selection to the power scheduler.
The scheduler is left to focus on scheduling mechanics and finding the best possible load balance on the cpu capacities set by the power scheduler. It will share a detailed view of the current load with the power scheduler to enable it to make the right capacity adjustments. The scheduler will need some optimization to cope better with asymmetric compute capacities. We may want to reduce capacity of some cpu to increase their idle time while letting others take the majority of the load.
...
I'm aware that the scheduler and power scheduler decisions may be inextricably linked so we may decide to merge them. However, I think it is worth trying to keep the power scheduling decisions out of the scheduler until we have proven it infeasible.
Thanks for posting this, I agree with the proposal. I would like to emphasise that this is a rather "divide and conquer" approach to reaching a unified solution. Some of the steps involved (not necessarily in this order):
1. Introduction of a power scheduler (replacing cpufreq governor) aware of the overall load and CPU capacities. It requests CPU frequency changes from the low-level cpufreq driver and gives hints to the task scheduler about load asymmetry (via cpu_power).
2. More accurate task load tracking (an attempt here - https://lkml.org/lkml/2013/4/16/289 - but possibly better accuracy using CPU cycles or other arch-specific counters).
3. Load balancer improvements for asymmetric CPU performance levels (e.g. frequency scaling).
4. Power scheduler driving the CPU idle decisions (replacing the cpuidle governor).
5. Power scheduler increased awareness of the run-queues content (number of tasks, individual task loads) and load balancer behaviour, feeding extra hints back to the load balancer (e.g. only move tasks below/above certain load, trigger a load balance).
6. Performance vs power saving tuning (policies).
7. More specific optimisations based on the CPU topology (big.little, turbo boost, etc.)
?. Lots of other things based on testing and community reviews.
Step 5 above will further increase the coupling between load balancer and power scheduler and we could end up with a unified implementation. But before then it is simpler to reason in terms of (a) better load balancing in an asymmetric configuration and (b) CPU capacity needed for the overall load.
On Fri, 14 Jun 2013, Morten Rasmussen wrote:
Looking at the discussion it seems that people have slightly different views, but most agree that the goal is an integrated scheduling, frequency, and idle policy like you pointed out from the beginning.
What is less clear is what such a design would look like. Catalin has suggested two different approaches: integrating cpufreq into the load balancing, or letting the scheduler focus on load balancing and extending cpufreq to also restrict the number of cpus available to the scheduler using cpu_power. The former approach would increase the scheduler complexity significantly, as I already highlighted in my first reply. The latter approach introduces a way to, at least initially, separate load balancing from capacity management, which I think is an interesting approach. Based on this idea I propose the following design:
              +-----------------+
              |                 |     +----------+
 current load | Power scheduler |<----+ cpufreq  |
   +--------->| sched/power.c   +---->| driver   |
   |          |                 |     +----------+
   |          +-------+-------+-+
   |                  |       |
   |    available     |       |
   |    capacity      |       |
   |  (e.g. cpu_power)|       |
   |                  v       |
 +-+-----------------------+  |
 |                         |  |
 |        Scheduler        |  |
 |       sched/fair.c      |  |
 |                         |  |
 +------------+------------+  |
              ^               v
 +------------+------+    +----------+
 |  task load metric |    | cpuidle  |
 |       arch/*      |    |  driver  |
 +-------------------+    +----------+
The intention is that the power scheduler will implement the (unified) power policy. It gets the current load of the system from the scheduler. Based on this information it will adjust the compute capacity available to the scheduler and drive frequency changes such that enough compute capacity is available to handle the current load. If the total load can be handled by a subset of cpus, it will reduce the capacity of the excess cpus to 0 (cpu_power=1). Likewise, if the load increases it will increase capacity of one or more idle cpus to allow the scheduler to spread the load. The power scheduler has knowledge about the power topology and will guide the scheduler to idle the most optimum cpus by reducing its capacity. Global idle decision will be handled by the power scheduler, so cpuidle can over time be reduced to become just a driver, once we have added C-state selection to the power scheduler.
The scheduler is left to focus on scheduling mechanics and finding the best possible load balance on the cpu capacities set by the power scheduler. It will share a detailed view of the current load with the power scheduler to enable it to make the right capacity adjustments. The scheduler will need some optimization to cope better with asymmetric compute capacities. We may want to reduce capacity of some cpu to increase their idle time while letting others take the majority of the load.
Frequency scaling has a problematic impact on PJT's load metric, which was pointed out a while ago by Chris Redpath https://lkml.org/lkml/2013/4/16/289. So I agree with Arjan's suggestion to change the load calculation basis to something which is frequency invariant. Use whatever counters are available on the specific platform.
I'm aware that the scheduler and power scheduler decisions may be inextricably linked so we may decide to merge them. However, I think it is worth trying to keep the power scheduling decisions out of the scheduler until we have proven it infeasible.
We are going to start working on this design and see where it takes us. We will post any results and suggested patches for folk to comment on. As a starting point we are planning to create a power scheduler (kernel/sched/power.c) similar to a cpufreq governor that does capacity management, and then evolve the solution from there.
I don't think that you are passing nearly enough information around.
A fairly simple example
take a relatively modern 4-core system with turbo mode where speed controls affect two cores at a time (I don't know the details of the available CPUs to know if this is an exact fit to any existing system, but I think it's a reasonable fit)
If you are running with a loadavg of 2, should you power down 2 cores and run the other two in turbo mode, power down 2 cores and not increase the speed, or leave all 4 cores running as is?
Depending on the mix of processes, I could see any one of the three being the right answer.
If you have a process that's maxing out its cpu time on one core, going to turbo mode is the right thing as the other processes should fit on the other core and that process will use more CPU (theoretically getting done sooner)
If no process is close to maxing out the core, then if you are in power saving mode, you probably want to shut down two cores and run everything on the other two
If you only have two processes eating almost all your CPU time, going to two cores is probably the right thing to do.
If you have more processes, each eating a little bit of time, then continuing to run on all four cores uses more cache, and could let all of the tasks finish faster.
So, how is the Power Scheduler going to get this level of information?
It doesn't seem reasonable to either pass this much data around, or to try and give two independent tools access to the same raw data (since that data is so tied to the internal details of the scheduler). If we are talking about two parts of the same thing, then it's perfectly legitimate to have this sort of intimate knowledge of the internal data structures.
Also, if the power scheduler puts the cores at different speeds, how is the balancing scheduler supposed to know so that it can schedule appropriately? This is the bigLittle problem again.
It's this level of knowledge that both the power management and the scheduler need to know about what's going on in the guts of the other that make me say that they really are going to need to be merged.
The routines to change the core modes will be external, and will vary wildly between different systems, but the decision making logic should be unified.
David Lang
On Tue, Jun 18, 2013 at 02:37:21AM +0100, David Lang wrote:
On Fri, 14 Jun 2013, Morten Rasmussen wrote:
Looking at the discussion it seems that people have slightly different views, but most agree that the goal is an integrated scheduling, frequency, and idle policy like you pointed out from the beginning.
What is less clear is what such a design would look like. Catalin has suggested two different approaches: integrating cpufreq into the load balancing, or letting the scheduler focus on load balancing and extending cpufreq to also restrict the number of cpus available to the scheduler using cpu_power. The former approach would increase the scheduler complexity significantly, as I already highlighted in my first reply. The latter approach introduces a way to, at least initially, separate load balancing from capacity management, which I think is an interesting approach. Based on this idea I propose the following design:
              +-----------------+
              |                 |     +----------+
 current load | Power scheduler |<----+ cpufreq  |
   +--------->| sched/power.c   +---->| driver   |
   |          |                 |     +----------+
   |          +-------+-------+-+
   |                  |       |
   |    available     |       |
   |    capacity      |       |
   |  (e.g. cpu_power)|       |
   |                  v       |
 +-+-----------------------+  |
 |                         |  |
 |        Scheduler        |  |
 |       sched/fair.c      |  |
 |                         |  |
 +------------+------------+  |
              ^               v
 +------------+------+    +----------+
 |  task load metric |    | cpuidle  |
 |       arch/*      |    |  driver  |
 +-------------------+    +----------+
The intention is that the power scheduler will implement the (unified) power policy. It gets the current load of the system from the scheduler. Based on this information it will adjust the compute capacity available to the scheduler and drive frequency changes such that enough compute capacity is available to handle the current load. If the total load can be handled by a subset of cpus, it will reduce the capacity of the excess cpus to 0 (cpu_power=1). Likewise, if the load increases it will increase capacity of one or more idle cpus to allow the scheduler to spread the load. The power scheduler has knowledge about the power topology and will guide the scheduler to idle the most optimum cpus by reducing its capacity. Global idle decision will be handled by the power scheduler, so cpuidle can over time be reduced to become just a driver, once we have added C-state selection to the power scheduler.
The scheduler is left to focus on scheduling mechanics and finding the best possible load balance on the cpu capacities set by the power scheduler. It will share a detailed view of the current load with the power scheduler to enable it to make the right capacity adjustments. The scheduler will need some optimization to cope better with asymmetric compute capacities. We may want to reduce capacity of some cpu to increase their idle time while letting others take the majority of the load.
Frequency scaling has a problematic impact on PJT's load metric, which was pointed out a while ago by Chris Redpath https://lkml.org/lkml/2013/4/16/289. So I agree with Arjan's suggestion to change the load calculation basis to something which is frequency invariant. Use whatever counters are available on the specific platform.
I'm aware that the scheduler and power scheduler decisions may be inextricably linked so we may decide to merge them. However, I think it is worth trying to keep the power scheduling decisions out of the scheduler until we have proven it infeasible.
We are going to start working on this design and see where it takes us. We will post any results and suggested patches for folk to comment on. As a starting point we are planning to create a power scheduler (kernel/sched/power.c) similar to a cpufreq governor that does capacity management, and then evolve the solution from there.
I don't think that you are passing nearly enough information around.
A fairly simple example
take a relatively modern 4-core system with turbo mode where speed controls affect two cores at a time (I don't know the details of the available CPUs to know if this is an exact fit to any existing system, but I think it's a reasonable fit)
If you are running with a loadavg of 2, should you power down 2 cores and run the other two in turbo mode, power down 2 cores and not increase the speed, or leave all 4 cores running as is?
Depending on the mix of processes, I could see any one of the three being the right answer.
If you have a process that's maxing out its cpu time on one core, going to turbo mode is the right thing as the other processes should fit on the other core and that process will use more CPU (theoretically getting done sooner)
If no process is close to maxing out the core, then if you are in power saving mode, you probably want to shut down two cores and run everything on the other two
If you only have two processes eating almost all your CPU time, going to two cores is probably the right thing to do.
If you have more processes, each eating a little bit of time, then continuing to run on all four cores uses more cache, and could let all of the tasks finish faster.
So, how is the Power Scheduler going to get this level of information?
It doesn't seem reasonable to either pass this much data around, or to try and give two independent tools access to the same raw data (since that data is so tied to the internal details of the scheduler). If we are talking about two parts of the same thing, then it's perfectly legitimate to have this sort of intimate knowledge of the internal data structures.
I realize that my description is not very clear about this point. Total load is clearly not enough information for the power scheduler to take any reasonable decisions. By current load, I mean per-cpu load, number of tasks, and possibly more task statistics. Enough information to determine the best use of the system cpus.
As stated in my previous reply, this is not the ultimate design. I expect it to go through many design iterations. If it turns out that it doesn't make sense to have a separate power scheduler, then we should merge them. I just propose to divide the design into manageable components. A unified design covering the scheduler, two other policy frameworks, and new policies is too complex in my opinion.
The power scheduler may be viewed as an external extension to the periodic scheduler load balance. I don't see a major problem in accessing raw data in the scheduler. The power scheduler will live in sched/power.c. In a unified solution where you put everything into sched/fair.c you would still need access to the same raw data to make the right power scheduling decisions. By having the power scheduler separately we just attempt to minimize the entanglement.
Also, if the power scheduler puts the cores at different speeds, how is the balancing scheduler supposed to know so that it can schedule appropriately? This is the bigLittle problem again.
It's this level of knowledge that both the power management and the scheduler need to know about what's going on in the guts of the other that make me say that they really are going to need to be merged.
The scheduler will need to be tuned to make the "right" load balancing decisions based on the compute capacity made available by the power scheduler. That includes dealing with symmetric systems with different cpu frequencies and asymmetric systems, like bigLittle. Clearly, the power scheduler must be able to trust that the load balancer will do the right thing.
In an example scenario on bigLittle where you have a single task fully utilizing a single Little cpu, I would expect the power scheduler to detect this situation and enable a big cpu (increase its cpu_power). The tuned load balancer will then move the task to the cpu with the highest capacity.
So, the power scheduler should figure out the best setup for the current load, and the scheduler (load balancer) should take care of putting the right tasks on the right cpus according to the capacities (cpu_power) set by the power scheduler. For this to work the load balancer must adhere to a set of rules such that the power scheduler can reason about the load balancer behaviour, like in the above example. Moving big tasks to cpus with the highest capacity is one of these rules. More will probably be needed as we refine the design.
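As a toy illustration of that rule (not actual load balancer code; the helper name is made up, cpu_rq() is scheduler-internal, and real placement involves much more than this):

/*
 * Among the cpus the task may run on, prefer the one with the highest
 * capacity (cpu_power) currently granted by the power scheduler.
 */
static int pick_highest_capacity_cpu(struct task_struct *p)
{
        int cpu, best = task_cpu(p);
        unsigned long best_cap = 0;

        for_each_cpu(cpu, tsk_cpus_allowed(p)) {
                unsigned long cap = cpu_rq(cpu)->cpu_power;
                if (cap > best_cap) {
                        best_cap = cap;
                        best = cpu;
                }
        }
        return best;
}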
Morten
The routines to change the core modes will be external, and will vary wildly between different systems, but the decision making logic should be unified.
David Lang
On Tue, 18 Jun 2013, Morten Rasmussen wrote:
I don't think that you are passing nearly enough information around.
A fairly simple example
take a relatively modern 4-core system with turbo mode where speed controls affect two cores at a time (I don't know the details of the available CPUs to know if this is an exact fit to any existing system, but I think it's a reasonable fit)
If you are running with a loadavg of 2, should you power down 2 cores and run the other two in turbo mode, power down 2 cores and not increase the speed, or leave all 4 cores running as is?
Depending on the mix of processes, I could see any one of the three being the right answer.
If you have a process that's maxing out its cpu time on one core, going to turbo mode is the right thing as the other processes should fit on the other core and that process will use more CPU (theoretically getting done sooner)
If no process is close to maxing out the core, then if you are in power saving mode, you probably want to shut down two cores and run everything on the other two
If you only have two processes eating almost all your CPU time, going to two cores is probably the right thing to do.
If you have more processes, each eating a little bit of time, then continuing to run on all four cores uses more cache, and could let all of the tasks finish faster.
So, how is the Power Scheduler going to get this level of information?
It doesn't seem reasonable to either pass this much data around, or to try and give two independent tools access to the same raw data (since that data is so tied to the internal details of the scheduler). If we are talking about two parts of the same thing, then it's perfectly legitimate to have this sort of intimate knowledge of the internal data structures.
I realize that my description is not very clear about this point. Total load is clearly not enough information for the power scheduler to take any reasonable decisions. By current load, I mean per-cpu load, number of tasks, and possibly more task statistics. Enough information to determine the best use of the system cpus.
As stated in my previous reply, this is not the ultimate design. I expect it to go through many design iterations. If it turns out that it doesn't make sense to have a separate power scheduler, then we should merge them. I just propose to divide the design into manageable components. A unified design covering the scheduler, two other policy frameworks, and new policies is too complex in my opinion.
The power scheduler may be viewed as an external extension to the periodic scheduler load balance. I don't see a major problem in accessing raw data in the scheduler. The power scheduler will live in sched/power.c. In a unified solution where you put everything into sched/fair.c you would still need access to the same raw data to make the right power scheduling decisions. By having the power scheduler separately we just attempt to minimize the entanglement.
Why insist on this being treated as an external component that you have to pass messages to?
If you allow it to be combined, then it can lookup the info it needs rather than trying to define an API between the two that accounts for everything that you need to know (now and in the future)
This will mean that as the internals of one change it will affect the internals of the other, but it seems like this is far more likely to be successful.
If you have hundreds or thousands of processes, it's bad enough to look up the data directly, but trying to marshal the information to send it to a separate component seems counterproductive.
David Lang
On Tue, Jun 18, 2013 at 06:39:27PM +0100, David Lang wrote:
On Tue, 18 Jun 2013, Morten Rasmussen wrote:
I don't think that you are passing nearly enough information around.
A fairly simple example
take a relatively modern 4-core system with turbo mode where speed controls affect two cores at a time (I don't know the details of the available CPUs to know if this is an exact fit to any existing system, but I think it's a reasonable fit)
If you are running with a loadavg of 2, should you power down 2 cores and run the other two in turbo mode, power down 2 cores and not increase the speed, or leave all 4 cores running as is?
Depending on the mix of processes, I could see any one of the three being the right answer.
If you have a process that's maxing out its cpu time on one core, going to turbo mode is the right thing as the other processes should fit on the other core and that process will use more CPU (theoretically getting done sooner)
If no process is close to maxing out the core, then if you are in power saving mode, you probably want to shut down two cores and run everything on the other two
If you only have two processes eating almost all your CPU time, going to two cores is probably the right thing to do.
If you have more processes, each eating a little bit of time, then continuing to run on all four cores uses more cache, and could let all of the tasks finish faster.
So, how is the Power Scheduler going to get this level of information?
It doesn't seem reasonable to either pass this much data around, or to try and give two independent tools access to the same raw data (since that data is so tied to the internal details of the scheduler). If we are talking about two parts of the same thing, then it's perfectly legitimate to have this sort of intimate knowledge of the internal data structures.
I realize that my description is not very clear about this point. Total load is clearly not enough information for the power scheduler to take any reasonable decisions. By current load, I mean per-cpu load, number of tasks, and possibly more task statistics. Enough information to determine the best use of the system cpus.
As stated in my previous reply, this is not the ultimate design. I expect it to go through many design iterations. If it turns out that it doesn't make sense to have a separate power scheduler, then we should merge them. I just propose to divide the design into manageable components. A unified design covering the scheduler, two other policy frameworks, and new policies is too complex in my opinion.
The power scheduler may be viewed as an external extension to the periodic scheduler load balance. I don't see a major problem in accessing raw data in the scheduler. The power scheduler will live in sched/power.c. In a unified solution where you put everything into sched/fair.c you would still need access to the same raw data to make the right power scheduling decisions. By having the power scheduler separately we just attempt to minimize the entanglement.
Why insist on this being treated as an external component that you have to pass messages to?
If you allow it to be combined, then it can lookup the info it needs rather than trying to define an API between the two that accounts for everything that you need to know (now and in the future)
I don't see why you cannot read the internal scheduler data structures from the power scheduler (with appropriate attention to locking). The point of the proposed design is not to define interfaces, it is to divide the problem into manageable components.
Let me repeat again, if we while developing the solution find out that the separation doesn't make sense I have no problem merging them. I don't insist on the separation, my point is that we need to partition this very complex problem and let it evolve into a reasonable solution.
This will mean that as the internals of one change it will affect the internals of the other, but it seems like this is far more likely to be successful.
That is no different from having a merged design. If you change something in the scheduler you would have to consider all the power implications anyway. The power scheduler design would give you at least a vague separation and the possibility of not having a power scheduler at all.
If you have hundreds or thousands of processes, it's bad enough to look up the data directly, but trying to marshal the information to send it to a separate component seems counterproductive.
I don't see why that should be necessary.
Morten
On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
Looking at the discussion it seems that people have slightly different views, but most agree that the goal is an integrated scheduling, frequency, and idle policy like you pointed out from the beginning.
... except that such a solution does not really work for Intel hardware.
The OS does not get to really pick the CPU "frequency" (never mind that frequency is not what gets controlled), the hardware picks the frequency. The OS can do some level of requests (best to think of this as a percentage more than a frequency), but what you actually get is, more often than not, not what you asked for.
You can look in hindsight what kind of performance you got (from some basic counters in MSRs), and the scheduler can use that to account backwards to what some process got. But to predict what you will get in the future...... that's near impossible on any realistic system nowadays (and even more so in the future).
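For reference, the "hindsight" measurement on Intel is essentially the APERF/MPERF MSR pair; a rough sample of the mechanism is below (error handling and per-cpu state omitted, so treat it as a sketch only):

#include <linux/math64.h>
#include <asm/msr.h>

/*
 * Ratio of delivered to base performance since the last call, in percent:
 * APERF counts actual cycles, MPERF counts cycles at the base (TSC)
 * frequency, both only while the cpu is not halted.
 */
static u64 prev_aperf, prev_mperf;

static unsigned int delivered_perf_percent(void)
{
        u64 aperf, mperf, da, dm;

        rdmsrl(MSR_IA32_APERF, aperf);
        rdmsrl(MSR_IA32_MPERF, mperf);
        da = aperf - prev_aperf;
        dm = mperf - prev_mperf;
        prev_aperf = aperf;
        prev_mperf = mperf;

        return dm ? div64_u64(100 * da, dm) : 0;
}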
Treating "frequency" (well "performance) and idle separately is also a false thing to do (yes I know in 3.9/3.10 we still do that for Intel hw, but we're working on fixing that). They are by no means separate things. One guy's idle state is the other guys power budget (and thus performance)!.
On Tue, 18 Jun 2013, Arjan van de Ven wrote:
On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
Looking at the discussion it seems that people have slightly different views, but most agree that the goal is an integrated scheduling, frequency, and idle policy like you pointed out from the beginning.
... except that such a solution does not really work for Intel hardware.
The OS does not get to really pick the CPU "frequency" (never mind that frequency is not what gets controlled), the hardware picks the frequency. The OS can do some level of requests (best to think of this as a percentage more than a frequency), but what you actually get is, more often than not, not what you asked for.
so this sounds to me like the process for changing settings on this Intel hardware is a two phase process
something looks up what should be possible and says "switch to mode X"
after mode switch happens it then looks and finds "it's now in mode Y"
As long as there is some table to list the possible X modes to switch to, and some table to lookup the characteristics of the possible Y modes that you are in (and the list of modes you can change to may be different depending on what mode you are in), this doesn't seem to be a huge problem.
And if you can't tell what mode you are in, or what the expected performance characteristics are, then you can't possibly do any intelligent allocations.
If Intel is doing this for current CPUs, I expect that they will fix this before too much longer.
You can look in hindsight what kind of performance you got (from some basic counters in MSRs), and the scheduler can use that to account backwards to what some process got. But to predict what you will get in the future...... that's near impossible on any realistic system nowadays (and even more so in the future).
If you have no way of knowing how much processing power you should expect to have on each core in the near future, then you have no way of allocating processes appropriately between the cores.
It's bad enough trying to guess the needs of the processes, but if you also are reduced to guessing the capabilities of the cores, how can anything be made to work?
David Lang
On 6/18/2013 10:47 AM, David Lang wrote:
so this sounds to me like the process for changing settings on this Intel hardware is a two phase process
something looks up what should be possible and says "switch to mode X"
more a case of "I would like to request X" it's not a mandate, it's a polite request/suggestion
after mode switch happens it then looks and finds "it's now in mode Y"
you don't really know what you are in, you can only really know on average what you were in over some time in the past. As such, Y is not really discrete/enumeratable (well, since it's all fixed point math, it is, sure, in steps of 1 Hz)
the "current" thing is changing all the time on a very fine grained timescale, depending on what the other cores in the system are doing, what graphics is doing, what the temperature is etc etc.
And if you can't tell what mode you are in, or what the expected performance characteristics are, then you can't possibly do any intelligent allocations.
you can tell what you were in looking in the rear-view mirror. you have no idea what it'll be going forward.
If Intel is doing this for current CPUs, I expect that they will fix this before too much longer.
I'm pretty sure that won't happen, and I'm also pretty sure the other CPU vendors are either there today (AMD) or will be there in the next few years (ARM). It's the nature of how CPUs do power and thermal management and the physics behind that.
You can look in hindsight what kind of performance you got (from some basic counters in MSRs), and the scheduler can use that to account backwards to what some process got. But to predict what you will get in the future...... that's near impossible on any realistic system nowadays (and even more so in the future).
If you have no way of knowing how much processing power you should expect to have on each core in the near future, then you have no way of allocating processes appropriately between the cores.
It's bad enough trying to guess the needs of the processes, but if you also are reduced to guessing the capabilities of the cores, how can anything be made to work?
you can give some suggestions to the hardware. But how much you actually get can be off by 2x or more in either direction. And most of that will depend on what other cores/graphics in the system are doing (in terms of idle or their own requests and the amount of the total power budget they are consuming)
On 6/18/2013 10:47 AM, David Lang wrote:
It's bad enough trying to guess the needs of the processes, but if you also are reduced to guessing the capabilities of the cores, how can anything be made to work?
btw one way to look at this is to assume that (with some minimal hinting) the CPU driver will do the right thing and get you just about the best performance you can get (that is appropriate for the task at hand)... ... and don't do anything in the scheduler proactively.
Now for big.little and other temporary or permanent asymmetries, we may want to have a "max performance level" type indicator, and that's fair enough (and this can be dynamic, since for thermal reasons it can change over time, but on a somewhat slower timescale)
the hints I have in mind are not all that complex; we have the biggest issues today around task migration (the task migrates to a cold cpu... so a simple notifier chain on the new cpu as it is accepting a task and we can bump it up), real time tasks (again, simple notifier chain to get you to a predictably high performance level) and we're a long way better than we are today in terms of actual problems.
For all the talk of ondemand (as ARM still uses that today)... that guy puts you in either the lowest or highest frequency over 95% of the time. Other non-cpufreq solutions like on Intel are a bit more advanced (and will grow more so over time), but even there, in the grand scheme of things, the scheduler shouldn't have to care anymore with those two notifiers in place.
On Wed, Jun 19, 2013 at 04:39:39PM +0100, Arjan van de Ven wrote:
On 6/18/2013 10:47 AM, David Lang wrote:
It's bad enough trying to guess the needs of the processes, but if you also are reduced to guessing the capabilities of the cores, how can anything be made to work?
btw one way to look at this is to assume that (with some minimal hinting) the CPU driver will do the right thing and get you just about the best performance you can get (that is appropriate for the task at hand)... ... and don't do anything in the scheduler proactively.
If I understand correctly, you mean if your hardware/firmware is fully in control of the p-state selection and changes it fast enough to match the current load, the scheduler doesn't have to care? By fast enough I mean, faster than the scheduler would notice if a cpu was temporarily overloaded at a low p-state. In that case, you wouldn't need cpufreq/p-state hints, and the scheduler would only move tasks between cpus when cpus are fully loaded at their max p-state.
Now for big.little and other temporary or permanent asymmetries, we may want to have a "max performance level" type indicator, and that's fair enough (and this can be dynamic, since for thermal reasons it can change over time, but on a somewhat slower timescale)
the hints I have in mind are not all that complex; we have the biggest issues today around task migration (the task migrates to a cold cpu... so a simple notifier chain on the new cpu as it is accepting a task and we can bump it up), real time tasks (again, simple notifier chain to get you to a predictably high performance level) and we're a long way better than we are today in terms of actual problems.
For all the talk of ondemand (as ARM still uses that today)... that guy puts you in either the lowest or highest frequency over 95% of the time. Other non-cpufreq solutions like on Intel are a bit more advanced (and will grow more so over time), but even there, in the grand scheme of things, the scheduler shouldn't have to care anymore with those two notifiers in place.
You would need more than a few hints to implement more advanced capacity management like proposed for the power scheduler. I believe that Intel would benefit as well from guiding the scheduler to idle the right cpu to enable deeper idle states and/or enable turbo-boost for other cpus.
Morten
On 6/19/2013 10:00 AM, Morten Rasmussen wrote:
On Wed, Jun 19, 2013 at 04:39:39PM +0100, Arjan van de Ven wrote:
On 6/18/2013 10:47 AM, David Lang wrote:
It's bad enough trying to guess the needs of the processes, but if you also are reduced to guessing the capabilities of the cores, how can anything be made to work?
btw one way to look at this is to assume that (with some minimal hinting) the CPU driver will do the right thing and get you just about the best performance you can get (that is appropriate for the task at hand)... ... and don't do anything in the scheduler proactively.
If I understand correctly, you mean if your hardware/firmware is fully
hardware, firmware and the driver
in control of the p-state selection and changes it fast enough to match the current load, the scheduler doesn't have to care? By fast enough I mean, faster than the scheduler would notice if a cpu was temporarily overloaded at a low p-state. In that case, you wouldn't need cpufreq/p-state hints, and the scheduler would only move tasks between cpus when cpus are fully loaded at their max p-state.
with the migration hint, I'm pretty sure we'll be there today typically. we'll notice within 10 msec regardless, but the migration hint will take the edge off those 10 msec normally.
I would argue that the "at their max p-state" in your sentence needs to go away, since you don't know what p-state you actually ran at except in hindsight. And even then you don't know if you could have gone higher or not.
the hints I have in mind are not all that complex; we have the biggest issues today around task migration (the task migrates to a cold cpu... so a simple notifier chain on the new cpu as it is accepting a task and we can bump it up) and real time tasks (again, a simple notifier chain to get you to a predictably high performance level), and with those we'd be a long way better off than we are today in terms of actual problems.
For all the talk of ondemand (as ARM still uses that today)... that guy puts you in either the lowest or highest frequency over 95% of the time. Other non-cpufreq solutions like on Intel are a bit more advanced (and will grow more so over time), but even there, in the grand scheme of things, the scheduler shouldn't have to care anymore with those two notifiers in place.
You would need more than a few hints to implement more advanced capacity management like proposed for the power scheduler. I believe that Intel would benefit as well from guiding the scheduler to idle the right cpu to enable deeper idle states and/or enable turbo-boost for other cpus.
that's an interesting theory. I've yet to see any way to actually have that do something useful.
yes there is some value in grouping a lot of very short tasks together. not a lot of value, but at least some.
and there is some value in the grouping within a package (to a degree) thing.
(both are basically "statistically, sort left" as policy)
more fine-grained than that (especially anything tied to P-states)... not so much.
On Wed, Jun 19, 2013 at 06:08:29PM +0100, Arjan van de Ven wrote:
On 6/19/2013 10:00 AM, Morten Rasmussen wrote:
On Wed, Jun 19, 2013 at 04:39:39PM +0100, Arjan van de Ven wrote:
On 6/18/2013 10:47 AM, David Lang wrote:
It's bad enough trying to guess the needs of the processes, but if you also are reduced to guessing the capabilities of the cores, how can anything be made to work?
btw one way to look at this is to assume that (with some minimal hinting) the CPU driver will do the right thing and get you just about the best performance you can get (that is appropriate for the task at hand)... ... and don't do anything in the scheduler proactively.
If I understand correctly, you mean if your hardware/firmware is fully
hardware, firmware and the driver
in control of the p-state selection and changes it fast enough to match the current load, the scheduler doesn't have to care? By fast enough I mean, faster than the scheduler would notice if a cpu was temporarily overloaded at a low p-state. In that case, you wouldn't need cpufreq/p-state hints, and the scheduler would only move tasks between cpus when cpus are fully loaded at their max p-state.
with the migration hint, I'm pretty sure we'll be there today typically.
A hint when a task is moved to a new cpu is too late if the migration shouldn't have happened at all. If the scheduler knows that the cpu is able to switch to a higher p-state it can decide to wait for the p-state change instead of migrating the task and waking up another cpu.
we'll notice within 10 msec regardless, but the migration hint will take the edge off those 10 msec normally.
I'm not sure if 10 msec is fast enough for the scheduler to not notice. Real use-case studies will tell.
I would argue that the "at their max p-state" in your sentence needs to go away, since you don't know what p-state you actually ran at except in hindsight. And even then you don't know if you could have gone higher or not.
Yes. What I meant was that if your p-state selection is responsive enough the scheduler would only see the cpu as overloaded when it is in its highest available p-state. That may be determined dynamically by power, thermal, and other factors.
the hints I have in mind are not all that complex; we have the biggest issues today around task migration (the task migrates to a cold cpu... so a simple notifier chain on the new cpu as it is accepting a task and we can bump it up) and real time tasks (again, a simple notifier chain to get you to a predictably high performance level), and with those we'd be a long way better off than we are today in terms of actual problems.
For all the talk of ondemand (as ARM still uses that today)... that guy puts you in either the lowest or highest frequency over 95% of the time. Other non-cpufreq solutions like on Intel are a bit more advanced (and will grow more so over time), but even there, in the grand scheme of things, the scheduler shouldn't have to care anymore with those two notifiers in place.
You would need more than a few hints to implement more advanced capacity management like proposed for the power scheduler. I believe that Intel would benefit as well from guiding the scheduler to idle the right cpu to enable deeper idle states and/or enable turbo-boost for other cpus.
that's an interesting theory. I've yet to see any way to actually have that do something useful.
yes there is some value in grouping a lot of very short tasks together. not a lot of value, but at least some.
and there is some value in the grouping within a package (to a degree) thing.
(both are basically "statistically, sort left" as policy)
The proposed task packing patches have shown significant benefits for scenarios with many short tasks. This is a typical scenario on Android.
Morten
On 6/21/2013 1:50 AM, Morten Rasmussen wrote:
in control of the p-state selection and changes it fast enough to match the current load, the scheduler doesn't have to care? By fast enough I mean, faster than the scheduler would notice if a cpu was temporarily overloaded at a low p-state. In that case, you wouldn't need cpufreq/p-state hints, and the scheduler would only move tasks between cpus when cpus are fully loaded at their max p-state.
with the migration hint, I'm pretty sure we'll be there today typically.
A hint when a task is moved to a new cpu is too late if the migration shouldn't have happened at all. If the scheduler knows that the cpu is able to switch to a higher p-state it can decide to wait for the p-state change instead of migrating the task and waking up another cpu.
ok maybe I am missing something but at least on the hardware I am familiar with (Intel and somewhat AMD), the frequency (and voltage) when idle is ... 0 Hz... no matter what the OS chose for when the CPU is running. And part of the cost of coming out of idle is ramping up to something appropriate.
And such ramps are FAST. Changing P state is as a result generally quite fast as well... think "single digit microseconds" kind of fast. Much faster than waking a CPU up in the first place (by design.. since a wakeup of a CPU includes effectively a P state change)
I read your statement as "let's wait for the idle CPU to ramp its frequency up first", which doesn't really make sense to me...
On 6/21/2013 1:50 AM, Morten Rasmussen wrote:
with the migration hint, I'm pretty sure we'll be there today typically.
A hint when a task is moved to a new cpu is too late if the migration shouldn't have happened at all. If the scheduler knows that the cpu is able to switch to a higher p-state it can decide to wait for the p-state change instead of migrating the task and waking up another cpu.
oops sorry I misread your mail (lack of early coffee I suppose)
I can see your point of having a thing for "did we ask for all the performance we could ask for" prior to doing a load balance (although, for power efficiency, if you have two tasks that could run in parallel, it's usually better to run them in parallel... so likely we should balance anyway)
On 21 June 2013 16:38, Arjan van de Ven arjan@linux.intel.com wrote:
On 6/21/2013 1:50 AM, Morten Rasmussen wrote:
A hint when a task is moved to a new cpu is too late if the migration shouldn't have happened at all. If the scheduler knows that the cpu is able to switch to a higher p-state it can decide to wait for the p-state change instead of migrating the task and waking up another cpu.
oops sorry I misread your mail (lack of early coffee I suppose)
I can see your point of having a thing for "did we ask for all the performance we could ask for" prior to doing a load balance (although, for power efficiency, if you have two tasks that could run in parallel, it's usually better to run them in parallel... so likely we should balance anyway)
Not necessarily, especially if parallel running implies powering up a full cluster just for one CPU (it depends on the hardware but for example a cluster may not be able to go in deeper sleep states unless all the CPUs in that cluster are idle).
-- Catalin
On 6/21/2013 2:23 PM, Catalin Marinas wrote:
oops sorry I misread your mail (lack of early coffee I suppose)
I can see your point of having a thing for "did we ask for all the performance we could ask for" prior to doing a load balance (although, for power efficiency, if you have two tasks that could run in parallel, it's usually better to run them in parallel... so likely we should balance anyway)
Not necessarily, especially if parallel running implies powering up a full cluster just for one CPU (it depends on the hardware but for example a cluster may not be able to go in deeper sleep states unless all the CPUs in that cluster are idle).
I guess it depends on the system
the very first cpu needs to power on
- the core itself
- the "cluster" that you mention
- the memory controller
- the memory (out of self refresh)
while the second cpu needs
- the core itself
- maybe a second cluster
normally on Intel systems, the memory power delta is quite significant which then means the efficiency of the second core is huge compared to running things in sequence.
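A back-of-envelope illustration of why that delta matters, with made-up numbers rather than measurements: say the memory controller plus DRAM out of self-refresh cost 2 W whenever anything runs, and an active core costs 1 W. Two tasks of length T run one after the other: E_seq = 2T x (2 W + 1 W) = 6T J. Run in parallel on two cores: E_par = T x (2 W + 2 x 1 W) = 4T J. The larger the shared memory/uncore term is relative to the per-core term, the more running in parallel and racing the whole platform to idle wins; if the shared term is small, or the second core drags a whole extra cluster out of its deepest state as in Catalin's example, the comparison can flip.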
On Fri, 2013-06-21 at 14:34 -0700, Arjan van de Ven wrote:
On 6/21/2013 2:23 PM, Catalin Marinas wrote:
oops sorry I misread your mail (lack of early coffee I suppose)
I can see your point of having a thing for "did we ask for all the performance we could ask for" prior to doing a load balance (although, for power efficiency, if you have two tasks that could run in parallel, it's usually better to run them in parallel... so likely we should balance anyway)
Not necessarily, especially if parallel running implies powering up a full cluster just for one CPU (it depends on the hardware but for example a cluster may not be able to go in deeper sleep states unless all the CPUs in that cluster are idle).
I guess it depends on the system
Sort-of. We have something similar with threads on ppc. IE, the core can only really stop if all threads are. From a Linux perspective it's a matter of how we define the scope of that 'cluster' Catalin is talking about. I'm sure you do too.
Then there is the package, which adds MC etc...
the very first cpu needs to power on
- the core itself
- the "cluster" that you mention
- the memory controller
- the memory (out of self refresh)
while the second cpu needs
- the core itself
- maybe a second cluster
normally on Intel systems, the memory power delta is quite significant which then means the efficiency of the second core is huge compared to running things in sequence.
What's your typical latency for bringing an MC back (and memory out of self refresh)? IE. Basically bringing a package back up?
Cheers, Ben.
On Mon, Jun 24, 2013 at 12:32:00AM +0100, Benjamin Herrenschmidt wrote:
On Fri, 2013-06-21 at 14:34 -0700, Arjan van de Ven wrote:
On 6/21/2013 2:23 PM, Catalin Marinas wrote:
oops sorry I misread your mail (lack of early coffee I suppose)
I can see your point of having a thing for "did we ask for all the performance we could ask for" prior to doing a load balance (although, for power efficiency, if you have two tasks that could run in parallel, it's usually better to run them in parallel... so likely we should balance anyway)
Not necessarily, especially if parallel running implies powering up a full cluster just for one CPU (it depends on the hardware but for example a cluster may not be able to go in deeper sleep states unless all the CPUs in that cluster are idle).
I guess it depends on the system
Sort-of. We have something similar with threads on ppc. IE, the core can only really stop if all threads are. From a Linux perspective it's a matter of how we define the scope of that 'cluster' Catalin is talking about. I'm sure you do too.
Then there is the package, which adds MC etc...
I think we can say cluster == package so that we use some common terminology. On a big.little configuration (TC2), we have 3xA7 in one package and 2xA15 in the other. So to efficiently stop an entire package (cluster, multi-core etc.) we need to stop all the CPUs it has.
I guess it depends on the system
Sort-of. We have something similar with threads on ppc. IE, the core can only really stop if all threads are. From a Linux persepctive it's a matter of how we define the scope of that 'cluster' Catalin is talking about. I'm sure you do too.
Then there is the package, which adds MC etc...
the very first cpu needs to power on
- the core itself
- the "cluster" that you mention
- the memory controller
- the memory (out of self refresh)
while the second cpu needs
- the core itself
- maybe a second cluster
normally on Intel systems, the memory power delta is quite significant which then means the efficiency of the second core is huge compared to running things in sequence.
What's your typical latency for bringing an MC back (and memory out of self refresh)? IE. Basically bringing a package back up?
to bring the system back up if all cores in the whole system are idle and power gated, memory in SR etc... is typically < 250 usec (depends on the exact version of the cpu etc). But the moment even one core is running, that core will keep the system out of such deep state, and waking up a consecutive entity is much faster
to bring just a core out of power gating is more in the 40 to 50 usec range
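As an aside, consumers that cannot tolerate the deeper of those two wakeup costs can already pin the allowed exit latency through PM QoS, which keeps cpuidle out of the offending states. A minimal sketch against the 3.9-era interface; the 100 usec bound is an arbitrary example value:

/* sketch: cap cpu wakeup latency at 100 usec while this module is loaded,
 * so cpuidle avoids states with a larger exit latency */
#include <linux/module.h>
#include <linux/pm_qos.h>

static struct pm_qos_request latency_req;

static int __init example_init(void)
{
	pm_qos_add_request(&latency_req, PM_QOS_CPU_DMA_LATENCY, 100);
	return 0;
}

static void __exit example_exit(void)
{
	pm_qos_remove_request(&latency_req);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");

The cpuidle governors compare each state's exit latency against the aggregated constraint, so with a 100 usec cap the ~250 usec package state is simply not entered while the request is held.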
On Mon, 2013-06-24 at 08:26 -0700, Arjan van de Ven wrote:
to bring the system back up if all cores in the whole system are idle and power gated, memory in SR etc... is typically < 250 usec (depends on the exact version of the cpu etc). But the moment even one core is running, that core will keep the system out of such deep state, and waking up a consecutive entity is much faster
to bring just a core out of power gating is more in the 40 to 50 usec range
Out of curiosity, what happens to PCIe when you bring a package down like this ?
Cheers, Ben.
On 6/24/2013 2:59 PM, Benjamin Herrenschmidt wrote:
On Mon, 2013-06-24 at 08:26 -0700, Arjan van de Ven wrote:
to bring the system back up if all cores in the whole system are idle and power gated, memory in SR etc... is typically < 250 usec (depends on the exact version of the cpu etc). But the moment even one core is running, that core will keep the system out of such deep state, and waking up a consecutive entity is much faster
to bring just a core out of power gating is more in the 40 to 50 usec range
Out of curiosity, what happens to PCIe when you bring a package down like this ?
PCIe devices can communicate latency requirements (LTR) if they need something more aggressive than this; otherwise 250 usec afaik falls within what doesn't break (devices need to cope with arbitration/etc delays anyway), and with PCIe link power management there are delays regardless. Once a PCIe link gets powered back on, the memory controller/etc will also come back online.
On Tue, Jun 18, 2013 at 04:20:28PM +0100, Arjan van de Ven wrote:
On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
Looking at the discussion it seems that people have slightly different views, but most agree that the goal is an integrated scheduling, frequency, and idle policy like you pointed out from the beginning.
... except that such a solution does not really work for Intel hardware.
I think it can work (see below).
The OS does not get to really pick the CPU "frequency" (never mind that frequency is not what gets controlled), the hardware picks the frequency. The OS can do some level of requests (best to think of this as a percentage more than frequency) but what you actually get is, more often than not, not what you asked for.
Morten's proposal does not try to "pick" a frequency. The P-state change is still done gradually based on the load (so we still have an adaptive loop). The load (total or per-task) can be tracked in an arch-specific way (using aperf/mperf on x86).
The difference from what intel_pstate.c does now is that it has a view of the total load (across all CPUs) and the run-queue content. It can "guide" the load balancer into favouring one or two CPUs and ignoring the rest (using cpu_power).
If several CPUs have small aperf/mperf ratio, it can decide to use fewer CPUs at a higher aperf/mperf by telling the load balancer not to use them (cpu_power = 1). All of this is continuously re-adjusted to cope with changes in the load and hardware variations like turbo boost.
Similarly, if a CPU has aperf/mperf >= 1, it keeps increasing the P-state (depending on the policy). Once it gets to the highest level, depending on the number of threads in the run-queue (it doesn't make sense for only one), it can open up other CPUs and let the load balancer use them.
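For readers who want to see what these counters actually give you, here is a small user-space sketch (not scheduler code) that samples APERF/MPERF on one cpu over a second; it assumes an x86 machine with the msr module loaded and root access:

/* user-space illustration of the aperf/mperf measurement mentioned above */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

#define MSR_IA32_MPERF 0xE7
#define MSR_IA32_APERF 0xE8

static uint64_t rdmsr(int fd, uint32_t reg)
{
	uint64_t val;

	if (pread(fd, &val, sizeof(val), reg) != sizeof(val)) {
		perror("pread");
		exit(1);
	}
	return val;
}

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/dev/cpu/0/msr";
	int fd = open(path, O_RDONLY);
	uint64_t a0, m0, a1, m1;

	if (fd < 0) {
		perror(path);
		return 1;
	}

	a0 = rdmsr(fd, MSR_IA32_APERF);
	m0 = rdmsr(fd, MSR_IA32_MPERF);
	sleep(1);
	a1 = rdmsr(fd, MSR_IA32_APERF);
	m1 = rdmsr(fd, MSR_IA32_MPERF);

	/* both counters only tick while the cpu is in C0, so the delta
	 * ratio is the average delivered frequency relative to the base
	 * frequency over the busy part of the interval */
	printf("aperf/mperf over 1s: %.3f\n",
	       (double)(a1 - a0) / (double)(m1 - m0));

	close(fd);
	return 0;
}

This is the same signal an in-kernel implementation could track per cpu in an arch-specific way, as suggested above.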
You can look in hindsight what kind of performance you got (from some basic counters in MSRs), and the scheduler can use that to account backwards to what some process got. But to predict what you will get in the future...... that's near impossible on any realistic system nowadays (and even more so in the future).
We don't need absolute figures matching load to P-states but we'll continue with an adaptive system. What we have now is also an adaptive system but with independent decisions taken by the load balancer and the P-state driver. The load balancer can even get confused by the cpufreq decisions and move tasks around unnecessarily. With Morten's proposal we get the power scheduler to adjust the P-state while giving hints to the load balancer at the same time (it adjusts both, it doesn't try to re-adjust itself after the load balancer).
Treating "frequency" (well "performance) and idle separately is also a false thing to do (yes I know in 3.9/3.10 we still do that for Intel hw, but we're working on fixing that). They are by no means separate things. One guy's idle state is the other guys power budget (and thus performance)!.
I agree.
On Tue, Jun 18, 2013 at 04:20:28PM +0100, Arjan van de Ven wrote:
On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
Looking at the discussion it seems that people have slightly different views, but most agree that the goal is an integrated scheduling, frequency, and idle policy like you pointed out from the beginning.
... except that such a solution does not really work for Intel hardware.
The OS does not get to really pick the CPU "frequency" (never mind that frequency is not what gets controlled), the hardware picks the frequency. The OS can do some level of requests (best to think of this as a percentage more than frequency) but what you actually get is, more often than not, not what you asked for.
You can look in hindsight what kind of performance you got (from some basic counters in MSRs), and the scheduler can use that to account backwards to what some process got. But to predict what you will get in the future...... that's near impossible on any realistic system nowadays (and even more so in the future).
The proposed power scheduler doesn't have to drive p-state selection if it doesn't make sense for the particular platform. The aim of the power scheduler is integration of power policies in general.
Treating "frequency" (well "performance) and idle separately is also a false thing to do (yes I know in 3.9/3.10 we still do that for Intel hw, but we're working on fixing that). They are by no means separate things. One guy's idle state is the other guys power budget (and thus performance)!.
I agree.
Based on our discussions so far, where it has become clearer where Intel is heading, and Ingo's reply, I think we have three ways ahead with the power-aware scheduling work, each with their advantages and disadvantages:
1. We work on a generic power scheduler with appropriate abstractions that will work for all of us. Current and future Intel p-state policies will be implemented through the power scheduler.
Pros: We can arrive at a fairly standard solution with standard tunables. There will be one interface to the scheduler.
Cons: Finding a suitable platform abstraction for the power scheduler.
2. Like 1, but we introduce a CONFIG_SCHED_POWER as suggested by Ingo, that makes it all go away.
Pros: Intel can keep intel_pstate.c; others can use the power scheduler or their own driver.
Cons: Different platform specific drivers may need different interfaces to the scheduler. Harder to define cross-platform tunables.
3. We go for an independent platform specific power policy driver that may or may not use existing frameworks, like intel_pstate.c.
Pros: No need to find common platform abstraction. Power policy is implemented in arch/* and won't affect others.
Cons: Same as 2. Everybody would have to implement their own frequency, idle and thermal solutions. Potential duplication of functionality.
In my opinion we should aim for 1., but start out with a CONFIG_SCHED_POWER and see where we get to. Feedback from everybody is essential to arrive at a generic solution.
Morten
* Morten Rasmussen morten.rasmussen@arm.com wrote:
On Tue, Jun 18, 2013 at 04:20:28PM +0100, Arjan van de Ven wrote:
On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
Looking at the discussion it seems that people have slightly different views, but most agree that the goal is an integrated scheduling, frequency, and idle policy like you pointed out from the beginning.
... except that such a solution does not really work for Intel hardware.
The OS does not get to really pick the CPU "frequency" (never mind that frequency is not what gets controlled), the hardware picks the frequency. The OS can do some level of requests (best to think of this as a percentage more than frequency) but what you actually get is, more often than not, not what you asked for.
You can look in hindsight what kind of performance you got (from some basic counters in MSRs), and the scheduler can use that to account backwards to what some process got. But to predict what you will get in the future...... that's near impossible on any realistic system nowadays (and even more so in the future).
The proposed power scheduler doesn't have to drive p-state selection if it doesn't make sense for the particular platform. The aim of the power scheduler is integration of power policies in general.
Exactly.
Treating "frequency" (well "performance) and idle separately is also a false thing to do (yes I know in 3.9/3.10 we still do that for Intel hw, but we're working on fixing that). They are by no means separate things. One guy's idle state is the other guys power budget (and thus performance)!.
I agree.
Based on our discussions so far, where it has become clearer where Intel is heading, and Ingo's reply, I think we have three ways ahead with the power-aware scheduling work, each with their advantages and disadvantages:
- We work on a generic power scheduler with appropriate abstractions
that will work for all of us. Current and future Intel p-state policies will be implemented through the power scheduler.
Pros: We can arrive at a fairly standard solution with standard tunables. There will be one interface to the scheduler.
This is what we prefer really, made available under CONFIG_SCHED_POWER=y.
With !CONFIG_SCHED_POWER, or if low level facilities are not (yet) available, the kernel falls back to legacy (current) behavior.
Cons: Finding a suitable platform abstraction for the power scheduler.
Just do it incrementally. Start from the dumbest possible state: all CPUs are powered up fully, there's no idle state selection essentially. Then go for the biggest effect first and add the ability to idle in a lower power state (with new functions and a low level driver that implements this for the platform with no policy embedded into it - just p-state switching logic), and combine that with task packing.
Then do small, measured steps to integrate more and more facilities, the ability to turn off more and more hardware, etc. The more basic steps you can figure out to iterate this, the better.
Important: it's not a problem that the initial code won't outperform the current kernel's performance. It should outperform the _initial_ 'dumb' code in the first step. Then the next step should outperform the previous step, etc.
The quality of this iterative approach will eventually surpass the combined effect of currently available but non-integrated facilities.
Since this can be done without touching all the other existing facilities it's fundamentally non-intrusive.
An initial implementation should probably cover just two platforms, a modern ARM platform and Intel - those two are far enough from each other so that if a generic approach helps both we are reasonably certain that the generalization makes sense.
The new code could live under a new file in kernel/sched/power.c, to separate it out in a tidy fashion, and to make it easy to understand.
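As a very rough idea of what that first, dumb iteration could look like (purely illustrative: none of these hooks exist and all names are invented), kernel/sched/power.c might start out as nothing more than a periodic worker plus a mechanism-only driver interface:

/* kernel/sched/power.c -- hypothetical skeleton, sketch only */
#include <linux/init.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/cpumask.h>

/* what a platform would have to provide in the first iteration:
 * no policy, just mechanisms */
struct power_driver_ops {
	void (*set_performance)(int cpu, unsigned int level);	/* 0..100 */
	void (*enter_deepest_idle)(int cpu);
};

static struct power_driver_ops *pd_ops;
static struct delayed_work power_work;

/* step-one policy: dumbest possible -- everything runs at full performance */
static void power_sched_tick(struct work_struct *work)
{
	int cpu;

	for_each_online_cpu(cpu) {
		if (pd_ops && pd_ops->set_performance)
			pd_ops->set_performance(cpu, 100);

		/* later iterations: look at per-cpu load, shrink the set of
		 * cpus offered to the load balancer (task packing), and let
		 * the unused ones sit in their deepest idle state */
	}

	schedule_delayed_work(&power_work, msecs_to_jiffies(10));
}

static int __init power_sched_init(void)
{
	INIT_DELAYED_WORK(&power_work, power_sched_tick);
	schedule_delayed_work(&power_work, msecs_to_jiffies(10));
	return 0;
}
late_initcall(power_sched_init);

Each later step (packing hints to the load balancer via cpu_power, opening up or fencing off cpus based on the tracked load, coordinating with idle) would then replace the trivial loop body, which keeps every step measurable against the previous one as described above.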
- Like 1, but we introduce a CONFIG_SCHED_POWER as suggested by Ingo,
that makes it all go away.
That's not really what CONFIG_SCHED_POWER should do: its purpose is to allow a 'legacy power saving mode' that makes any new logic go away.
Pros: Intel can keep intel_pstate.c; others can use the power scheduler or their own driver.
Cons: Different platform specific drivers may need different interfaces to the scheduler. Harder to define cross-platform tunables.
- We go for an independent platform specific power policy driver that may
or may not use existing frameworks, like intel_pstate.c.
And that's a NAK from the scheduler maintainers.
Thanks,
Ingo