Hi Daniel,
I've reviewed the draft doc you sent on cpuidle/cpufreq integration. Things have progressed a bit since the initial round of comments in the doc in April, and I thought it'd be good to open the discussion on eas-dev, so I'll try to summarize the main points of the doc here and comment on them. Apologies if I misstate anything; please correct me if necessary.
Some high level thoughts:
- I have yet to see a platform where race to idle is a win for power. Because of the exponential shape of the power/perf curves and other issues such as random wakeups interrupting deep sleep, it's always been better in my experience to run at as low an OPP as possible within the performance requirements of the workload. As a general policy at least.
- The validation tests mentioned in the doc seemed to be focused uniformly on performance but I think power measurements must be given equal consideration and mention in validating any of these changes.
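To make the race-to-idle point concrete, here is a rough sketch that compares the energy needed to finish a fixed amount of work inside a fixed window at each OPP. The OPP table and idle power are made-up numbers, chosen only so that power grows faster than frequency; real platform data would have to be substituted before drawing any conclusion.

```python
# Hypothetical OPP table: (frequency in MHz, active power in mW).
# These numbers are invented for illustration, not measured on any platform.
OPPS = [(500, 100), (1000, 300), (1500, 700), (2000, 1400)]
IDLE_POWER_MW = 10  # assumed power while idle for the rest of the window


def window_energy_mj(work_mcycles, window_s, freq_mhz, active_mw,
                     idle_mw=IDLE_POWER_MW):
    """Energy (mJ) to run work_mcycles at freq_mhz, then idle until window_s.

    Returns None if the work does not fit in the window at this frequency.
    """
    busy_s = work_mcycles / freq_mhz  # Mcycles / (Mcycles per second)
    if busy_s > window_s:
        return None
    idle_s = window_s - busy_s
    return active_mw * busy_s + idle_mw * idle_s  # mW * s == mJ


def energy_by_opp(work_mcycles, window_s):
    """Map each OPP frequency to its window energy (or None if too slow)."""
    return {f: window_energy_mj(work_mcycles, window_s, f, p)
            for f, p in OPPS}


def cheapest_opp(work_mcycles, window_s):
    """Frequency of the feasible OPP with the lowest window energy."""
    feasible = {f: e for f, e in energy_by_opp(work_mcycles, window_s).items()
                if e is not None}
    return min(feasible, key=feasible.get)
```

With these assumed numbers, 400 Mcycles of work in a 1 s window is cheapest at the lowest OPP that still meets the deadline. A flatter power curve or a much lower idle power would shift the result, which is why measured power data matters.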
My comments on each of the individual proposals:
1. Managing frequency during idle when blocked load is high.
The proposal in this section was to keep the frequency unchanged when entering idle while there is significant blocked load. Blocked load has since been included in the utilization metric which determines frequency. But the policy in sched-freq on what to do when the last task on a runqueue blocks (i.e. the CPU goes idle) is still evolving.
Currently the CPU's CFS capacity vote in the frequency domain is passively dropped when that CPU goes idle. It zeros out its capacity vote but does not trigger a recalculation of the frequency domain's new overall required capacity or set a new frequency. This means that the frequency will remain as-is until a different event occurs which forces re-evaluation of the CPU capacity votes in the frequency domain. I won't go into enumerating those events here; suffice it to say I don't think the current policy will work. It's possible for the CPU to stay at an elevated frequency for far too long, which would have an unacceptable power impact.
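The passive-drop behaviour can be illustrated with a toy model. This is not the sched-freq code: the class and method names are hypothetical, and taking the maximum vote across CPUs as the domain's required capacity is an assumption made for the sketch.

```python
class FreqDomain:
    """Toy model of a frequency domain holding one capacity vote per CPU."""

    def __init__(self, cpus):
        self.votes = {cpu: 0 for cpu in cpus}
        self.current_capacity = 0  # capacity the domain frequency was set for

    def reevaluate(self):
        # Assumed aggregation: the domain must satisfy its most demanding CPU.
        self.current_capacity = max(self.votes.values())
        return self.current_capacity

    def set_vote(self, cpu, capacity):
        """Active update: record the vote and recompute the domain capacity."""
        self.votes[cpu] = capacity
        return self.reevaluate()

    def idle_entry_passive(self, cpu):
        """Passive drop: zero the vote but do NOT recompute anything."""
        self.votes[cpu] = 0
```

For example, if CPU0 votes 800 and CPU1 votes 200, then CPU0 going idle through `idle_entry_passive` leaves `current_capacity` at 800 until some later event calls `reevaluate`, at which point it falls to 200.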
I'm also concerned about the way blocked load is included in the utilization metric and potentially keying off that for frequency during periods of idle. It's certainly a more power-hungry policy than what is in place today, plus there's really no way to tune it since it's part of the per-entity load tracking scheme. The interactive governor had a tunable (the slack timer) which controlled how long you could sit idle at a frequency greater than fmin.
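For a sense of the timescale involved, the sketch below estimates how long a blocked contribution lingers, assuming PELT's usual geometric decay with a half-life of 32 periods (roughly 1 ms each) and a CPU that stays idle with no new activity. The function names and the threshold are hypothetical; the point is only that the decay rate is fixed by the tracking scheme rather than being a knob.

```python
import math

# PELT decay factor per period: y**32 == 0.5, i.e. a 32-period half-life.
PELT_HALF_LIFE_PERIODS = 32
Y = 0.5 ** (1.0 / PELT_HALF_LIFE_PERIODS)


def decayed_util(util, periods):
    """Utilization left after `periods` idle periods with no new activity."""
    return util * Y ** periods


def periods_until_at_or_below(util, threshold):
    """Idle periods needed for `util` to decay to `threshold` or lower.

    `threshold` must be greater than zero, since geometric decay never
    reaches zero exactly.
    """
    if util <= threshold:
        return 0
    return math.ceil(math.log(threshold / util) / math.log(Y))
```

Under these assumptions a fully loaded signal needs on the order of two half-lives, so about 64 ms, to fall to a quarter of its value, which is a long time to sit idle above fmin.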
Schedtune aims to provide a per-task tunable which can boost/scale up the load value for that task calculated by PELT. So that would provide some mechanism to tweak this, although it affects the task's contribution at all times rather than only when it is blocked. It is also currently built only to inflate/scale up a task's demand rather than decrease it.
2. Make the idle task in charge of changing the frequency to minimum
Proposal is to have idle main loop set frequency to minimum, do idle, and then restore frequency coming out of idle.
My thinking is that when we enter idle, somewhere there has to be a decision made as to whether it is worth it to change frequency. The input for that decision could include
- the current frequency
- the energy data of the target (power consumption at each frequency at the C-state we will be entering)
- the expected duration of the idle period
- the latency of changing between the two frequencies in question, as well as the latency of C-state entry and exit
- performance and latency requirements for the currently blocked tasks
The idle task seems to me like a reasonable place for the decision logic, which could then call some not-yet-existent API into sched-freq to ask for the frequency change. Sched class capacity votes would be retained and reinstated on idle exit. The temporary idle frequency would respect any min or max freq constraints that had been previously registered in cpufreq.
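A very rough sketch of what that decision logic might look like follows. Every name and field here is hypothetical, as is the simple cost model (frequency switches and C-state entry/exit eat into the idle period, and the restore is paid on the wakeup path); it is meant to show how the inputs listed above could combine, not to propose an API.

```python
from dataclasses import dataclass


@dataclass
class IdleFreqInputs:
    """Hypothetical inputs to the idle-entry frequency decision."""
    idle_power_cur_mw: float       # power in target C-state at current freq
    idle_power_min_mw: float       # power in target C-state at minimum freq
    expected_idle_us: float        # predicted length of the idle period
    freq_switch_latency_us: float  # one-way frequency transition latency
    freq_switch_energy_uj: float   # energy cost of dropping and restoring
    cstate_entry_latency_us: float
    cstate_exit_latency_us: float
    wakeup_latency_limit_us: float  # tightest bound among blocked tasks


def should_drop_freq(x: IdleFreqInputs) -> bool:
    """Return True if lowering the frequency for this idle period pays off."""
    # QoS check: waking up costs C-state exit plus restoring the frequency.
    wake_cost_us = x.cstate_exit_latency_us + x.freq_switch_latency_us
    if wake_cost_us > x.wakeup_latency_limit_us:
        return False

    # Time lost to transitions is not spent idle at the lower frequency.
    overhead_us = (2 * x.freq_switch_latency_us
                   + x.cstate_entry_latency_us
                   + x.cstate_exit_latency_us)
    if x.expected_idle_us <= overhead_us:
        return False

    # mW * us == nJ, so divide by 1000 to get uJ.
    residency_us = x.expected_idle_us - overhead_us
    saved_uj = ((x.idle_power_cur_mw - x.idle_power_min_mw)
                * residency_us / 1000.0)
    return saved_uj > x.freq_switch_energy_uj
```

The early return on a short expected idle period is also where the case in #3 below (sleep duration shorter than the transition latency) would be caught.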
3. The expected sleep duration is less than the frequency transition latency
Agreed, this needs to be considered. A full implementation of the logic I mentioned in #2 would cover this one.
4. Align frequency with task priority
I'd agree with MikeT's comment in the doc that changing frequency according to the niceness of the running tasks would be a pretty big change in semantics that we should stay away from. Schedtune may offer a way to negatively bias the performance of a task, although currently it can only be used to inflate a task's performance demands.
5. Consider CPU frequency when evaluating wake-up latency
Agreed, this should be taken into account. Again, I think this would be part of the logic I mentioned in #2: deciding whether we can even change frequencies during idle at all (possibly ruled out due to QoS constraints), and then if it is possible, whether it is worth it from an energy standpoint.
6. Multiple freq governors as input to cpufreq core
I'd agree this seems like it'd have value but barring a major shift in priorities, it's going to be a while before this would get focus given the effort required just to get the basic sched-freq feature merged.
7. Increase freq of little cluster when freq of big cluster increases, to make migrations back to little faster.
This wouldn't be for everyone IMO due to the power impact. I'm not sure where exactly this policy would go, especially given the current crusade in the community against plugin governors and tunables. Perhaps in the platform-level cpufreq driver.
thanks,
Steve