Hi,
On 05/31/2013 04:22 PM, Ingo Molnar wrote:
PeterZ and me tried to point out the design requirements previously, but it still does not appear to be clear enough to people, so let me spell it out again, in a hopefully clearer fashion.
The scheduler has valuable power saving information available:
 - when a CPU is busy: about how long the current task expects to run
 - when a CPU is idle: how long the current CPU expects _not_ to run
 - topology: it knows how the CPUs and caches interrelate and already optimizes based on that
 - various high level and low level load averages and other metrics about the recent past that show how busy a particular CPU is, how busy the whole system is, and what the runtime properties of individual tasks are (how often they sleep, etc.)
so the scheduler is in an _ideal_ position to do a judgement call about the near future and estimate how deep an idle state a CPU core should enter into and what frequency it should run at.
I don't think the problem is that the scheduler is not the one deciding which idle state a CPU should enter or which frequency it should run at.
IIUC, the problem is that although the *cpuidle and cpufreq governors are co-operating with the scheduler, the scheduler is not doing the same.*
Let me elaborate with respect to the cpuidle subsystem. When the scheduler chooses the CPUs to run tasks on, it leaves certain other CPUs idle. The cpuidle governor then evaluates, among other things, the load average of each CPU before deciding which idle state to put it into. With PJT's metric, an idle CPU's load average decays over time, so the cpuidle governor will perhaps decide to put such CPUs into deep idle states.
But the problem surfaces when the scheduler gets to choose a CPU to run new or newly woken tasks on. It chooses the *idlest_cpu* to run the task on without considering how deep an idle state that CPU is in, if it is in an idle state at all. It can end up waking a CPU from a deep idle state, which will *hinder power savings*.
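To make the point concrete, here is a minimal sketch in plain C (not kernel code). The `cpu_info` structure, its fields and both function names are hypothetical and exist only for illustration. The first function picks a CPU on load alone, as described above; the second shows one possible way the choice could also take the idle state into account, by preferring the idle CPU that is cheapest to wake.

```c
#include <stddef.h>

/* Hypothetical per-CPU snapshot; not an actual kernel structure. */
struct cpu_info {
    unsigned long load;            /* tracked load; lower means idler */
    int           is_idle;         /* non-zero if the CPU is currently idle */
    unsigned int  exit_latency_us; /* wakeup cost of its current idle state */
};

/* Load-only choice: ignores how deeply the chosen CPU is sleeping. */
int pick_idlest_cpu(const struct cpu_info *cpus, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (best < 0 || cpus[i].load < cpus[best].load)
            best = (int)i;
    }
    return best; /* -1 if there are no CPUs to choose from */
}

/*
 * Idle-state-aware choice (an assumption about what co-operation could
 * look like): among idle CPUs, prefer the one in the shallowest idle
 * state; if no CPU is idle, fall back to the least loaded one.
 */
int pick_cpu_idle_aware(const struct cpu_info *cpus, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (!cpus[i].is_idle)
            continue;
        if (best < 0 || cpus[i].exit_latency_us < cpus[best].exit_latency_us)
            best = (int)i;
    }
    return best >= 0 ? best : pick_idlest_cpu(cpus, n);
}
```

With one busy CPU, one CPU in a deep idle state with zero load, and one CPU in a shallow idle state with a small residual load, the load-only choice wakes the deep sleeper while the idle-aware choice leaves it alone.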
I think this is where we need to focus. Currently there is no *two-way co-operation between the scheduler and the cpuidle/cpufreq* subsystems, which makes no sense. In the above case, for instance, the scheduler prompts the cpuidle governor to put a CPU into an idle state and then comes back to undo that move.
The scheduler is also at a high enough level to host a "I want maximum performance, power does not matter to me" user policy override switch and similar user policy details.
No ifs and whens about that.
Today the power saving landscape is fragmented and sad: we just randomly interface scheduler task packing changes with some idle policy (and cpufreq policy), which might or might not combine correctly.
I would repeat here that today we interface the cpuidle/cpufreq policies with the scheduler, but not the other way around. They do their bit when a CPU is busy or idle. However, the scheduler does not see that somebody else is taking instructions from it, and comes back to give different instructions!
Therefore I think that, among other things, this is one fundamental issue that we need to resolve on the way towards better power savings through the scheduler.
Regards
Preeti U Murthy