Re: power-efficient scheduling design

18 Jun 2013


On Tue, Jun 18, 2013 at 04:20:28PM +0100, Arjan van de Ven wrote:
...
On 6/14/2013 9:05 AM, Morten Rasmussen wrote:
...
Looking at the discussion it seems that people have slightly different
views, but most agree that the goal is an integrated scheduling,
frequency, and idle policy like you pointed out from the beginning.
... except that such a solution does not really work for Intel hardware.
I think it can work (see below).
...
The OS does not get to really pick the CPU "frequency" (never mind that
frequency is not what gets controlled), the hardware picks the frequency.
The OS can do some level of requests (best to think of this as a percentage
more than frequency) but what you actually get is, more often than not,
not what you asked for.
Morten's proposal does not try to "pick" a frequency. The P-state change
is still done gradually based on the load (so we still have an adaptive
loop). The load (total or per-task) can be tracked in an arch-specific
way (using aperf/mperf on x86).
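
As an illustration of the aperf/mperf tracking mentioned above, here is a
minimal user-space sketch, not kernel code. It assumes two snapshots of the
APERF and MPERF counters taken on the same CPU; the names Snapshot and
perf_ratio are hypothetical. Both counters only advance while the CPU is not
idle, MPERF at a fixed reference rate and APERF at the actual rate, so the
ratio of their deltas says how fast the CPU ran relative to the reference
while it was busy.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1  # the counters are 64-bit and may wrap


@dataclass
class Snapshot:
    """One reading of the two counters on a single CPU (hypothetical type)."""
    aperf: int  # advances at the actual frequency while the CPU is not idle
    mperf: int  # advances at a fixed reference frequency while not idle


def perf_ratio(prev: Snapshot, cur: Snapshot) -> float:
    """Return delta(aperf) / delta(mperf) between two snapshots.

    A value below 1 means the CPU ran slower than the reference frequency
    while busy; a value above 1 means it ran faster (for example in turbo).
    """
    d_aperf = (cur.aperf - prev.aperf) & MASK64
    d_mperf = (cur.mperf - prev.mperf) & MASK64
    if d_mperf == 0:
        # Assumption: no reference ticks means the CPU was idle all window.
        return 0.0
    return d_aperf / d_mperf
```

For example, perf_ratio(Snapshot(1000, 1000), Snapshot(1500, 2000)) gives
0.5, i.e. the CPU ran at half the reference rate while it was busy.
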
The difference from what intel_pstate.c does now is that it has a view
of the total load (across all CPUs) and the run-queue content. It can
"guide" the load balancer into favouring one or two CPUs and ignoring
the rest (using cpu_power).
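
To show why cpu_power can act as such a hint, here is a deliberately
simplified model, assuming the load balancer aims for per-CPU load
proportional to cpu_power. The real balancer is far more involved; the
function balanced_share is hypothetical, and 1024 is used as the default
cpu_power (SCHED_POWER_SCALE).

```python
SCHED_POWER_SCALE = 1024  # default cpu_power of a fully usable CPU


def balanced_share(total_load: float, cpu_power: list[int]) -> list[float]:
    """Split total_load across CPUs in proportion to their cpu_power.

    Simplified model only: it ignores scheduling domains, task granularity
    and affinity, and just shows the proportional target.
    """
    total_power = sum(cpu_power)
    return [total_load * p / total_power for p in cpu_power]


# Two CPUs left at the default, two marked with cpu_power = 1.
shares = balanced_share(2050, [SCHED_POWER_SCALE, SCHED_POWER_SCALE, 1, 1])
```

In this model the two CPUs with cpu_power = 1 are each targeted for about
0.05% of the load, which is what "ignoring the rest" amounts to.
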
If several CPUs have a small aperf/mperf ratio, it can decide to use fewer
CPUs at a higher aperf/mperf by telling the load balancer not to use
them (cpu_power = 1). All of this is continuously re-adjusted to cope
with changes in the load and hardware variations like turbo boost.
Similarly, if a CPU has aperf/mperf >= 1, it keeps increasing the
P-state (depending on the policy). Once it gets to the highest level,
depending on the number of threads in the run-queue (doesn't make sense
for only one), it can open up other CPUs and let the load balancer use
them.
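
The adaptive loop described in the last two paragraphs could be sketched
roughly as follows. This is only an illustration under stated assumptions,
not Morten's patches: the Cpu type, power_sched_step, MAX_PSTATE and the
LOW_RATIO threshold are all hypothetical, and "pstate" is treated as a
performance level where a higher number means faster.

```python
from dataclasses import dataclass

DEFAULT_POWER = 1024  # cpu_power of a CPU the load balancer may use freely
PARKED_POWER = 1      # cpu_power = 1: tell the load balancer to avoid it
MAX_PSTATE = 4        # hypothetical highest performance level
LOW_RATIO = 0.3       # hypothetical threshold for a "small" aperf/mperf


@dataclass
class Cpu:
    ratio: float      # recent aperf/mperf ratio
    pstate: int       # requested performance level, higher is faster
    nr_running: int   # runnable threads in this CPU's run-queue
    cpu_power: int = DEFAULT_POWER


def power_sched_step(cpus: list[Cpu]) -> None:
    """One pass of the loop: adjust P-states and cpu_power hints together."""
    active = [c for c in cpus if c.cpu_power != PARKED_POWER]
    parked = [c for c in cpus if c.cpu_power == PARKED_POWER]
    opened = False

    for c in active:
        if c.ratio >= 1.0:
            if c.pstate < MAX_PSTATE:
                c.pstate += 1  # still room: ask for more performance
            elif c.nr_running > 1 and parked:
                # Already at the top with more than one runnable thread:
                # open up another CPU for the load balancer.
                parked.pop().cpu_power = DEFAULT_POWER
                opened = True

    # Several lightly loaded CPUs: park one per pass, so the rest run at a
    # higher aperf/mperf. Skipped if a CPU was just opened in this pass.
    lightly = [c for c in active if c.ratio < LOW_RATIO]
    if not opened and len(lightly) > 1:
        lightly[-1].cpu_power = PARKED_POWER
```

Calling power_sched_step() periodically gives the continuous re-adjustment:
each pass reads fresh ratios, so turbo or a change in load simply shows up
as different input on the next pass.
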
...
You can look in hindsight at what kind of performance you got (from some
basic counters in MSRs), and the scheduler can use that to account
backwards to what some process got. But to predict what you will get in
the future... that's near impossible on any realistic system nowadays
(and even more so in the future).
We don't need absolute figures matching load to P-states but we'll
continue with an adaptive system. What we have now is also an adaptive
system but with independent decisions taken by the load balancer and the
P-state driver. The load balancer can even get confused by the cpufreq
decisions and move tasks around unnecessarily. With Morten's proposal we
get the power scheduler to adjust the P-state while giving hints to the
load balancer at the same time (it adjusts both, it doesn't try to
re-adjust itself after the load balancer).
...
Treating "frequency" (well, "performance") and idle separately is also a
false thing to do (yes, I know in 3.9/3.10 we still do that for Intel hw,
but we're working on fixing that). They are by no means separate things.
One guy's idle state is the other guy's power budget (and thus performance)!
I agree.
-- 
Catalin
