On Tue, Jun 18, 2013 at 04:20:28PM +0100, Arjan van de Ven wrote:
On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
Looking at the discussion it seems that people have slightly different views, but most agree that the goal is an integrated scheduling, frequency, and idle policy like you pointed out from the beginning.
... except that such a solution does not really work for Intel hardware.
I think it can work (see below).
The OS does not really get to pick the CPU "frequency" (never mind that frequency is not what gets controlled); the hardware picks the frequency. The OS can make some level of request (best to think of this as a percentage rather than a frequency), but what you actually get is, more often than not, not what you asked for.
Morten's proposal does not try to "pick" a frequency. The P-state change is still done gradually based on the load (so we still have an adaptive loop). The load (total or per-task) can be tracked in an arch-specific way (using aperf/mperf on x86).
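As a rough illustration of the arch-specific tracking mentioned above: on x86 the IA32_APERF and IA32_MPERF MSRs both advance only while the CPU is in C0, APERF at the delivered rate and MPERF at a fixed reference rate, so the ratio of their deltas over an interval tells you in hindsight what you got relative to nominal. The sketch below is only that, a sketch. The struct, the function name and the 1024 fixed-point scale are made up for the example and are not taken from intel_pstate.c or from Morten's patches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the two counters for one CPU. */
struct perf_sample {
	uint64_t aperf;	/* advances at the delivered rate while in C0 */
	uint64_t mperf;	/* advances at the fixed reference rate while in C0 */
};

/*
 * aperf/mperf over the interval [prev, cur], scaled so that 1024
 * means a ratio of 1.0. Returns 0 if mperf did not advance, i.e.
 * the CPU was never in C0 during the interval.
 *
 * Unsigned subtraction keeps the deltas correct across a counter
 * wrap. The multiplication assumes the aperf delta stays below
 * 2^54, which holds for any realistic sampling interval.
 */
uint64_t perf_ratio_scaled(struct perf_sample prev, struct perf_sample cur)
{
	uint64_t da = cur.aperf - prev.aperf;
	uint64_t dm = cur.mperf - prev.mperf;

	if (dm == 0)
		return 0;
	return (da * 1024) / dm;
}
```

With deltas of 500 (aperf) and 1000 (mperf) this returns 512, i.e. the CPU delivered about half the reference rate while it was running.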
The difference from what intel_pstate.c does now is that the power scheduler has a view of the total load (across all CPUs) and of the run-queue content. It can "guide" the load balancer into favouring one or two CPUs and ignoring the rest (using cpu_power).
If several CPUs have a small aperf/mperf ratio, it can decide to use fewer CPUs at a higher aperf/mperf by telling the load balancer not to use some of them (cpu_power = 1). All of this is continuously re-adjusted to cope with changes in the load and hardware variations like turbo boost.
Similarly, if a CPU has aperf/mperf >= 1, it keeps increasing the P-state (depending on the policy). Once it gets to the highest level, depending on the number of threads in the run-queue (it doesn't make sense for only one), it can open up other CPUs and let the load balancer use them.
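To make the two cases above concrete, here is a minimal sketch of one pass of such an adaptive loop. It is not Morten's code. The struct, the function name, the "small ratio" threshold, the number of P-state levels and the default cpu_power of 1024 are all assumptions picked for the example, and a real policy would need hysteresis and topology awareness that are left out here.

```c
#include <assert.h>

#define RATIO_ONE		1024	/* aperf/mperf == 1.0 in fixed point (assumed scale) */
#define RATIO_LOW		512	/* hypothetical "small ratio" threshold */
#define CPU_POWER_DEFAULT	1024	/* assumed default capacity hint */
#define CPU_POWER_AVOID		1	/* "do not use" hint, as described above */
#define PSTATE_MAX		10	/* hypothetical highest request level */

/* Hypothetical per-CPU view held by the power scheduler. */
struct cpu_view {
	unsigned int ratio;		/* scaled aperf/mperf for the last interval */
	unsigned int nr_running;	/* threads in this CPU's run-queue */
	unsigned int pstate;		/* current request level, 0..PSTATE_MAX */
	unsigned int cpu_power;		/* hint handed to the load balancer */
};

/* One pass: adjust the P-state requests and the balancer hints together. */
void power_sched_step(struct cpu_view *cpus, int n)
{
	int i, low = 0, last_low = -1, want_more = 0;

	for (i = 0; i < n; i++) {
		struct cpu_view *c = &cpus[i];

		if (c->cpu_power == CPU_POWER_AVOID)
			continue;	/* closed to the balancer: nothing to adapt */

		if (c->ratio >= RATIO_ONE) {
			if (c->pstate < PSTATE_MAX)
				c->pstate++;	/* gradual step up */
			else if (c->nr_running > 1)
				want_more = 1;	/* at the top with queued threads */
		} else if (c->ratio < RATIO_LOW) {
			low++;
			last_low = i;
		}
	}

	if (want_more) {
		/* Open up one closed CPU and let the load balancer use it. */
		for (i = 0; i < n; i++) {
			if (cpus[i].cpu_power == CPU_POWER_AVOID) {
				cpus[i].cpu_power = CPU_POWER_DEFAULT;
				break;
			}
		}
	} else if (low >= 2) {
		/* Several lightly used CPUs: steer work away from one of them. */
		cpus[last_low].cpu_power = CPU_POWER_AVOID;
	}
}
```

Called periodically, this raises the request by one level per pass on a CPU whose ratio is at or above 1, closes at most one lightly used CPU per pass, and reopens at most one per pass. Changing one thing at a time keeps the loop adaptive instead of trying to predict what the hardware will deliver.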
You can look in hindsight at what kind of performance you got (from some basic counters in MSRs), and the scheduler can use that to account backwards for what some process got. But to predict what you will get in the future... that's near impossible on any realistic system nowadays (and even more so in the future).
We don't need absolute figures matching load to P-states; we'll continue with an adaptive system. What we have now is also an adaptive system, but with independent decisions taken by the load balancer and the P-state driver. The load balancer can even get confused by the cpufreq decisions and move tasks around unnecessarily. With Morten's proposal we get the power scheduler to adjust the P-state while giving hints to the load balancer at the same time (it adjusts both; it doesn't try to re-adjust itself after the load balancer).
Treating "frequency" (well, "performance") and idle separately is also a false thing to do (yes, I know in 3.9/3.10 we still do that for Intel hw, but we're working on fixing that). They are by no means separate things. One guy's idle state is the other guy's power budget (and thus performance)!
I agree.