On Mon, 18 Aug 2014, Preeti U Murthy wrote:
On 08/18/2014 09:24 PM, Nicolas Pitre wrote:
On Mon, 11 Aug 2014, Preeti U Murthy wrote:
The goal of the power aware scheduling design is to integrate all policy, metrics and averaging into the scheduler. Today the cpu power management is fragmented and hence inconsistent.
As a first step towards this integration, rid the cpuidle state management of the governors. Retain only the cpuidle driver in the cpuidle subsystem, which acts as an interface between the scheduler and the low-level platform-specific cpuidle drivers. For all decision making around selection of idle states, the cpuidle driver falls back to the scheduler.
The current algorithm for idle state selection is the same as the logic used by the menu governor. However, going forward, the heuristics will be tuned and improved upon with metrics better known to the scheduler.
I'd strongly suggest a different approach here. Instead of copying the menu governor code and tweaking it afterwards, it would be cleaner to literally start from scratch with a new governor. Said new governor would grow inside the scheduler with more design freedom instead of being strapped on the side.
By copying existing code, the chance for cruft to remain for a long time is close to 100%. We already have one copy of it, let's keep it working and start afresh instead.
By starting clean it is way easier to explain and justify additions to a new design than convincing ourselves about the removal of no longer needed pieces from a legacy design.
Ok. The reason I did it this way was that I did not find anything grossly wrong in the current cpuidle governor algorithm. Of course this can be improved but I did not see strong reasons to completely wipe it away. I see good scope to improve upon the existing algorithm with additional knowledge of *the idle states being mapped to scheduling domains*. This will in itself give us a better algorithm and does not mandate significant changes from the current algorithm. So I really don't see why we need to start from scratch.
Sure the current algorithm can be improved. But it has its limitations by design. And simply making it more topology aware wouldn't justify moving it into the scheduler.
What we're contemplating is something completely integrated with the scheduler where cpuidle and cpufreq (and eventually thermal management) together are part of the same "governor" to provide global decisions on all fronts.
Not only should the next wake-up event be predicted, but also the anticipated system load, etc. The scheduler may know that a given CPU is unlikely to be used for a while and could call for the deepest C-state right away without waiting for the current menu heuristic to converge.
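To make that last point concrete, here is a rough sketch (Python for brevity, not kernel code) of picking an idle state from a predicted idle duration and a wake-up latency limit. The state table, its numbers, the field names and the `select_idle_state` helper are all hypothetical. The only point is that a confident "this CPU will stay idle for a long time" prediction from the scheduler maps straight to the deepest state, with no convergence period.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleState:
    name: str
    exit_latency_us: int      # time needed to wake up from this state
    target_residency_us: int  # minimum idle time for the state to pay off


# Hypothetical table with made-up numbers, ordered shallowest to deepest.
STATES = [
    IdleState("C1", 2, 2),
    IdleState("C3", 50, 200),
    IdleState("C6", 200, 1000),
]


def select_idle_state(states, predicted_idle_us, latency_limit_us):
    """Return the deepest state that fits the prediction and the latency limit.

    Assumes residency and exit latency both grow with state depth, so the
    scan can stop at the first state that does not fit.
    """
    chosen = states[0]  # the shallowest state is always the fallback
    for state in states[1:]:
        if state.target_residency_us > predicted_idle_us:
            break  # would not stay idle long enough to pay off
        if state.exit_latency_us > latency_limit_us:
            break  # would wake up too slowly
        chosen = state
    return chosen
```

With this table, a predicted idle period of 5000 us and a 1000 us latency limit selects C6 on the first call, while a 100 us prediction stays in C1. Where the prediction comes from (scheduler load tracking, I/O latency tracking, timers) is deliberately left out of the sketch.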
There is also Daniel's I/O latency tracking that could replace the menu governor latency guessing, the latter based on heuristics that could be described as black magic.
And all this has to eventually be policed by a global performance/power concern that should weigh C-states, P-states and task placement together and select the best combination (Morten's work).
Therefore the current menu algorithm won't do it. It simply wasn't designed for that.
We'll have the opportunity to discuss this further tomorrow anyway.
The primary issue I found is that, with a power aware scheduler as the goal, we must ensure that a governor can no longer be registered with cpuidle to choose idle states. The reason is that there is just *one entity who will take this decision and there is no option about it*. This patch intends to bring the focus to this specific detail.
I think there is nothing wrong with having multiple governors being registered. We simply decide at runtime via sysfs which one has control over the low-level cpuidle drivers.
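For illustration, a minimal userspace sketch of that runtime selection, in Python. It assumes the cpuidle sysfs files `available_governors` and `current_governor` under `/sys/devices/system/cpu/cpuidle`, which as far as I know are only exposed when booting with `cpuidle_sysfs_switch` (otherwise only the read-only `current_governor_ro` is there). The helper names are hypothetical, and the `base` argument exists only so the code can be tried against a scratch directory.

```python
from pathlib import Path

# Assumed location of the cpuidle sysfs directory.
CPUIDLE_SYSFS = Path("/sys/devices/system/cpu/cpuidle")


def available_governors(base=CPUIDLE_SYSFS):
    """List the registered governors; empty if the file is not exposed."""
    path = Path(base) / "available_governors"
    return path.read_text().split() if path.exists() else []


def current_governor(base=CPUIDLE_SYSFS):
    """Name of the governor in control, from whichever file is present."""
    for name in ("current_governor", "current_governor_ro"):
        path = Path(base) / name
        if path.exists():
            return path.read_text().strip()
    return None


def set_governor(name, base=CPUIDLE_SYSFS):
    """Hand control to a registered governor (needs root on a real system)."""
    if name not in available_governors(base):
        raise ValueError(f"governor {name!r} is not registered")
    (Path(base) / "current_governor").write_text(name)
```

A scheduler-integrated governor would then simply be one more name in `available_governors`, selectable next to menu and ladder.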
Nicolas