On Wed, Jun 12, 2013 at 04:24:52PM +0100, Arjan van de Ven wrote:
This isn't in the fastpath, it's in the rebalancing logic.
the reality is much more complex unfortunately. C and P states hang together tightly, and even C state on one core impacts other cores' performance, just like P state selection on one core impacts other cores.
(at least for x86, we should really stop talking as if the OS picks the "frequency", that's just not the case anymore)
I agree, the reality is very complex. But we should go back and analyse what problem we are trying to solve, what each framework is trying to address.
When viewed separately from the scheduler, cpufreq and cpuidle governors do the right thing. But they both base their action on the CPU load (balance) decided by the scheduler and it's the latter that we are trying to adjust (and we are still debating what the right approach is).
Since such information seems too complex to be moved into the scheduler, why don't we get cpufreq in charge of restricting the load balancing to certain CPUs? It already tracks the load/idle time to (gradually) change the P state. Depending on the governor/policy, it could decide that (for
(btw in case you missed it, for Intel HW we no longer use cpufreq anymore)
Do you mean the intel_pstate.c code? It indeed doesn't use much of cpufreq, just setpolicy and it's on its own afterwards. Separating this from the framework probably has real benefits for the Intel processors but it would make a unified scheduler/cpufreq/cpuidle solution harder (just a remark, I don't say it's good or bad, there are many opinions against the unified solution; ARM could do the same for configurations like big.LITTLE).
But such a driver could still interact with the scheduler to control its load balancing. At a quick look (I'm not familiar with this driver), it tracks the per-CPU load and increases or decreases the P-state (similar to a cpufreq governor). It could as well track the total load and (depending on the hardware configuration) put some CPUs into a lower-performance P-state (or even C-state) and tell the scheduler to avoid them.
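As a rough illustration of the "track the total load and tell the scheduler to avoid some CPUs" idea, here is a minimal user-space sketch. All names (CPU_CAPACITY, cpus_needed(), update_avoid_mask()) are hypothetical and exist neither in intel_pstate.c nor in cpufreq; it also assumes identical CPUs and ignores topology and C-state exit latency.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed load that one CPU can absorb at full performance. */
#define CPU_CAPACITY 1024UL

/*
 * Hypothetical policy: keep just enough CPUs to carry the total
 * load, always at least one and never more than nr_cpus.
 */
size_t cpus_needed(unsigned long total_load, size_t nr_cpus)
{
	size_t needed;

	assert(nr_cpus > 0);

	/* Round up so that a partially loaded CPU still counts. */
	needed = (total_load + CPU_CAPACITY - 1) / CPU_CAPACITY;
	if (needed == 0)
		needed = 1;
	if (needed > nr_cpus)
		needed = nr_cpus;
	return needed;
}

/*
 * Fill avoid[i] with 1 for CPUs the scheduler should stay away
 * from and 0 for CPUs that remain available for load balancing.
 */
void update_avoid_mask(unsigned long total_load, int *avoid, size_t nr_cpus)
{
	size_t keep = cpus_needed(total_load, nr_cpus);
	size_t i;

	for (i = 0; i < nr_cpus; i++)
		avoid[i] = (i >= keep);
}
```

With a total load of 1500 on four CPUs, this keeps CPUs 0 and 1 and marks CPUs 2 and 3 as avoided. A real driver would pick the CPUs based on the hardware configuration rather than on their index.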
One way to control load-balancing ratio is via something like arch_scale_freq_power(). We could tweak the scheduler further so that something like cpu_power==0 means do not schedule anything there.
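To make the cpu_power==0 convention more concrete, here is a small sketch of how a balancer could honour it. This is not the scheduler's code: struct cpu_state and pick_target_cpu() are hypothetical, and the only thing taken from the proposal is that cpu_power acts as a per-CPU weight where 0 means "do not schedule anything there".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU view; this is not the kernel's struct rq. */
struct cpu_state {
	unsigned long load;      /* load currently placed on the CPU */
	unsigned long cpu_power; /* weight set by the driver, 0 = avoid */
};

/*
 * Return the index of the CPU with the lowest load relative to its
 * cpu_power, skipping CPUs whose cpu_power is 0. Returns -1 if
 * every CPU is marked as "do not schedule".
 */
int pick_target_cpu(const struct cpu_state *cpus, size_t nr_cpus)
{
	int best = -1;
	size_t i;

	assert(cpus != NULL);

	for (i = 0; i < nr_cpus; i++) {
		if (cpus[i].cpu_power == 0)
			continue;
		if (best < 0) {
			best = (int)i;
			continue;
		}
		/*
		 * load_i / power_i < load_best / power_best, written
		 * as a cross-multiplication to avoid the division.
		 */
		if (cpus[i].load * cpus[best].cpu_power <
		    cpus[best].load * cpus[i].cpu_power)
			best = (int)i;
	}
	return best;
}
```

An idle CPU with cpu_power 0 is never chosen, even though its load is the lowest, and a CPU with half the cpu_power of another ends up with roughly half the load.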
So my proposal is to move the load-balancing hints (load ratio, avoiding CPUs etc.) outside the scheduler into drivers like intel_pstate.c or cpufreq governors. We then focus on getting the best performance out of the scheduler (like quicker migration) but it would not be concerned with the power consumption.
I do agree the scheduler needs to get integrated a bit better, in that it has some better knowledge, and to be honest, we likely need to switch from giving tasks credit for "time consumed" to giving them credit for something like "cycles consumed" or "instructions executed" or a mix thereof, so that a task that runs on a slower CPU (for either policy choice reasons or due to hardware capabilities) gets charged less than when it runs fast.
I agree, this would be useful in optimising the scheduler so that it makes the right task placement/migration decisions (but as I said above, make the power aspect transparent to the scheduler).
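For reference, the "cycles consumed" accounting mentioned above could look roughly like the sketch below. It assumes that an effective speed for the elapsed interval is known (for example derived from hardware cycle counters), which is itself an assumption given that the OS does not pick the frequency directly. charge_for_runtime() is a hypothetical helper, not an existing scheduler function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale the wall-clock runtime by the speed the CPU actually ran
 * at, relative to its maximum. A task running at half speed for
 * delta_ns is charged delta_ns / 2.
 *
 * The multiplication is done in 64 bits; delta_ns is assumed to be
 * small enough (well below a few thousand seconds) that
 * delta_ns * cur_khz does not overflow.
 */
uint64_t charge_for_runtime(uint64_t delta_ns, uint32_t cur_khz,
			    uint32_t max_khz)
{
	assert(max_khz != 0);
	assert(cur_khz <= max_khz);

	return delta_ns * cur_khz / max_khz;
}
```

A task that runs for 1 ms on a CPU at 1 GHz out of a possible 2 GHz would be charged 0.5 ms, while the same task at full speed would be charged the whole 1 ms.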