On Fri, Jun 14, 2013 at 05:05:22PM +0100, Morten Rasmussen wrote:
The intention is that the power scheduler will implement the (unified) power policy. It gets the current load of the system from the scheduler. Based on this information it will adjust the compute capacity available to the scheduler and drive frequency changes such that enough compute capacity is available to handle the current load. If the total load can be handled by a subset of cpus, it will reduce the capacity of the excess cpus to 0 (cpu_power=1). Likewise, if the load increases it will increase the capacity of one or more idle cpus to allow the scheduler to spread the load. The power scheduler has knowledge about the power topology and will guide the scheduler to idle the optimal cpus by reducing their capacity. Global idle decisions will be handled by the power scheduler, so cpuidle can over time be reduced to become just a driver, once we have added C-state selection to the power scheduler.
The scheduler is left to focus on scheduling mechanics and finding the best possible load balance on the cpu capacities set by the power scheduler. It will share a detailed view of the current load with the power scheduler to enable it to make the right capacity adjustments. The scheduler will need some optimization to cope better with asymmetric compute capacities. We may want to reduce the capacity of some cpus to increase their idle time while letting others take the majority of the load.
...
I'm aware that the scheduler and power scheduler decisions may be inextricably linked so we may decide to merge them. However, I think it is worth trying to keep the power scheduling decisions out of the scheduler until we have proven it infeasible.
Thanks for posting this, I agree with the proposal. I would like to emphasise that this is rather a "divide and conquer" approach to reaching a unified solution. Some of the steps involved (not necessarily in this order):
1. Introduction of a power scheduler (replacing cpufreq governor) aware of the overall load and CPU capacities. It requests CPU frequency changes from the low-level cpufreq driver and gives hints to the task scheduler about load asymmetry (via cpu_power).
2. More accurate task load tracking (an attempt here - https://lkml.org/lkml/2013/4/16/289 - but possibly better accuracy using CPU cycles or other arch-specific counters).
3. Load balancer improvements for asymmetric CPU performance levels (e.g. frequency scaling).
4. Power scheduler driving the CPU idle decisions (replacing the cpuidle governor).
5. Power scheduler increased awareness of the run-queues content (number of tasks, individual task loads) and load balancer behaviour, feeding extra hints back to the load balancer (e.g. only move tasks below/above certain load, trigger a load balance).
6. Performance vs power saving tuning (policies).
7. More specific optimisations based on the CPU topology (big.little, turbo boost, etc.)
?. Lots of other things based on testing and community reviews.
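To make step 1 a bit more concrete, here is a rough sketch of how the frequency request could be derived from the load. Everything in it is an assumption for illustration only: the function name power_sched_pick_freq() is hypothetical, the frequency table is assumed to be sorted in ascending order, and compute capacity is assumed to scale linearly with frequency, which real hardware only approximates.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not an existing kernel interface.
 *
 * Return the lowest frequency whose (assumed linear) capacity covers
 * 'load'. 'freqs_khz' must be sorted ascending; 'max_capacity' is the
 * capacity of the CPU at its highest frequency, in the same units as
 * 'load'.
 */
static unsigned int power_sched_pick_freq(unsigned long load,
					  const unsigned int *freqs_khz,
					  size_t nr_freqs,
					  unsigned long max_capacity)
{
	unsigned int max_freq;
	size_t i;

	assert(nr_freqs > 0);
	max_freq = freqs_khz[nr_freqs - 1];

	for (i = 0; i < nr_freqs; i++) {
		/* Assumption: capacity scales linearly with frequency. */
		unsigned long cap = (unsigned long)
			((unsigned long long)max_capacity * freqs_khz[i] / max_freq);

		if (cap >= load)
			return freqs_khz[i];
	}

	/* The load exceeds what this CPU can provide: run flat out. */
	return max_freq;
}
```

The value returned here is what the power scheduler would pass down to the low-level cpufreq driver; how that request is actually delivered is left out of the sketch.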
Step 5 above will further increase the coupling between the load balancer and the power scheduler and we could end up with a unified implementation. But before then it is simpler to reason in terms of (a) better load balancing in an asymmetric configuration and (b) CPU capacity needed for the overall load.
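As an illustration of (b), a minimal sketch of the capacity decision might look like the following. It is not a proposed implementation: power_sched_set_capacity() and the two constants are hypothetical names, all CPUs are assumed to have the same full capacity of 1024, and CPUs are simply filled in index order, ignoring the power topology that a real power scheduler would take into account.

```c
#include <assert.h>

#define SKETCH_CAPACITY_FULL	1024UL	/* assumed full capacity of one CPU */
#define SKETCH_CAPACITY_OFF	1UL	/* cpu_power=1, i.e. "capacity 0" */

/*
 * Hypothetical helper, not an existing kernel interface.
 *
 * Give full capacity to as many CPUs as are needed to cover
 * 'total_load' and reduce the excess CPUs to cpu_power=1 so that the
 * load balancer moves work away from them. At least one CPU always
 * stays at full capacity. Returns the number of CPUs left at full
 * capacity.
 */
static int power_sched_set_capacity(unsigned long total_load,
				    unsigned long *cpu_power, int nr_cpus)
{
	unsigned long provided = 0;
	int active = 0;
	int cpu;

	assert(nr_cpus > 0);

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (active == 0 || provided < total_load) {
			cpu_power[cpu] = SKETCH_CAPACITY_FULL;
			provided += SKETCH_CAPACITY_FULL;
			active++;
		} else {
			cpu_power[cpu] = SKETCH_CAPACITY_OFF;
		}
	}

	return active;
}
```

With four CPUs and a total load of 1500, this leaves two CPUs at full capacity and sets the other two to cpu_power=1. The load balancer then only has to deal with (a), balancing over whatever asymmetric capacities it has been handed.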