To summarize the current problem with idle CPU capacity votes:
- When the last task on a CPU (say CPU X) sleeps and the CPU goes idle, we currently drop its capacity vote to zero. However, we do not immediately update the cluster frequency based on this information.
- Instead, the update waits until another CPU in the frequency domain has an event which forces re-evaluation of the capacity votes and the corresponding frequency. That could occur right away, lowering the frequency only to require raising it again immediately if CPU X is idle for a very short time. Or it could be a very long time before such an event occurs, which leaves the cluster at an unnecessarily high OPP and wastes energy.
I have a draft of a change which modifies the nohz idle balance path a bit to ensure that update_blocked_averages() is called for tickless idle CPUs at least every X ms. This alone won't solve the above problems, though. You still need to force re-evaluation of the capacity votes somewhere to update the cluster frequency. I was originally going to call into cpufreq_sched as idle CPU loads are decayed and update the frequency there, but folks didn't seem to like this during Thursday's call.
We could get rid of the clearing of the capacity vote when entering idle and use a passive update when decaying idle CPU utilizations (setting the capacity vote but not triggering a re-evaluation of cluster frequency). That would solve the problem of risking the cluster frequency dropping to fmin during a very short idle and having to be immediately ramped up again. It will not solve the issue of the cluster potentially getting stuck idle at fmax/high frequency for long periods of time and wasting energy though.
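To make the passive/active distinction concrete, here is a minimal standalone sketch. It is a toy model, not kernel code: struct freq_domain, domain_reevaluate() and set_vote() are hypothetical names, and it assumes the cluster simply follows the largest per-CPU vote.

```c
#include <stdbool.h>

#define NR_DOMAIN_CPUS 4	/* assumed size of the frequency domain */

/* Hypothetical model of one frequency domain. */
struct freq_domain {
	unsigned long vote[NR_DOMAIN_CPUS];	/* per-CPU capacity votes */
	unsigned long cur_capacity;		/* capacity the cluster runs at */
};

/* Re-evaluate the domain: assume it follows the largest vote. */
static void domain_reevaluate(struct freq_domain *d)
{
	unsigned long max = 0;

	for (int i = 0; i < NR_DOMAIN_CPUS; i++)
		if (d->vote[i] > max)
			max = d->vote[i];
	d->cur_capacity = max;
}

/*
 * Passive update (active == false): record the vote only.
 * Active update (active == true): record the vote and re-evaluate now.
 */
static void set_vote(struct freq_domain *d, int cpu, unsigned long cap,
		     bool active)
{
	d->vote[cpu] = cap;
	if (active)
		domain_reevaluate(d);
}
```

With votes {800, 0, 0, 0} and the cluster at 800, a passive set_vote(&d, 0, 400, false) leaves cur_capacity at 800. It only drops to 400 once some CPU in the domain does an active update, which is exactly the "stuck high" case if no such event comes along.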
There's been some discussion on this issue in the context of integration of cpuidle with cpufreq and the scheduler (see attached). Rather than force regular load decay updates via the load balancer and then figure out when to force frequency re-evaluation, I'm inclined to just remove the clearing of the capacity vote in dequeue_task_fair when going idle, and tackle this problem within cpuidle as part of an energy aware/platform aware decision (see #2 in the attachment). A possible policy in cpuidle might look like:
- If it's a short idle, don't bother removing capacity vote.
- If it's a long idle and the system doesn't burn extra power in idle at elevated frequency, passively remove the capacity vote. Frequency gets adjusted if another CPU has a freq-evaluating event, like today.
- If it's a long idle and the system burns extra power in idle, actively remove the capacity vote, immediately adjusting frequency if needed.
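
As a rough sketch, the three cases above could be expressed as a single decision function. Everything here is hypothetical: the enum, the function name, the threshold parameter and the platform flag are illustrative only and do not exist in cpuidle today.

```c
#include <stdbool.h>

/* Hypothetical outcomes of the idle-entry policy. */
enum vote_action {
	VOTE_KEEP,		/* short idle: leave the capacity vote in place */
	VOTE_REMOVE_PASSIVE,	/* clear the vote, wait for another CPU's event */
	VOTE_REMOVE_ACTIVE,	/* clear the vote and re-evaluate frequency now */
};

/*
 * predicted_idle_us:       predicted idle duration (assumed to come from
 *                          the idle governor)
 * long_idle_threshold_us:  assumed platform-specific cutoff between a
 *                          "short" and a "long" idle
 * idle_power_scales:       true if the platform burns extra power in idle
 *                          at elevated frequency
 */
static enum vote_action idle_vote_policy(unsigned long predicted_idle_us,
					 unsigned long long_idle_threshold_us,
					 bool idle_power_scales)
{
	if (predicted_idle_us < long_idle_threshold_us)
		return VOTE_KEEP;
	if (!idle_power_scales)
		return VOTE_REMOVE_PASSIVE;
	return VOTE_REMOVE_ACTIVE;
}
```

How the short/long threshold would be chosen is left open here; it would presumably depend on the cost of a frequency transition on the platform.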
A slack timer mechanism may still be desirable in cpuidle to guard against the prediction being wrong (you think it's a short idle and leave a high capacity vote in, but it ends up being a long idle).
Thanks if you've read this far! Also, I hope to migrate these discussions to lkml+linux-pm. Perhaps after the next sched-freq RFC posting which will surely spawn discussions there anyway and get everyone up to speed on our current status and issues, making it a good cutover point.