On Mon, Nov 11, 2013 at 04:39:45PM +0000, Arjan van de Ven wrote:
I think the scheduler simply wants to say: we expect to go idle for X ns, we want a guaranteed wakeup latency of Y ns -- go do your thing.
as long as Y normally is "large" or "infinity" that is ok ;-) (a smaller Y will increase power consumption and decrease system performance)
Cpuidle already takes a latency constraint into account via pm_qos. The scheduler could pass this information down to the hardware driver, or the cpuidle driver could query pm_qos directly (as the governors currently do).
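To make the "idle for X ns, wake within Y ns" request concrete, here is a rough userspace sketch of how a driver might resolve it. This is not the actual cpuidle code: struct idle_state, pick_idle_state() and the numbers in example_states are made up for illustration, and the table is assumed to be sorted from shallowest to deepest state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical idle-state description (not the kernel's struct cpuidle_state). */
struct idle_state {
    const char *name;
    uint64_t exit_latency_ns;     /* worst-case time to wake up from the state */
    uint64_t target_residency_ns; /* minimum idle time for the state to pay off */
};

/* Illustrative numbers only, sorted from shallowest to deepest. */
static const struct idle_state example_states[] = {
    { "C1",   2000,   2000 },
    { "C3",  80000, 200000 },
    { "C6", 200000, 800000 },
};

/*
 * Return the index of the deepest state whose exit latency fits within
 * max_latency_ns (the "Y") and whose target residency fits within
 * expected_idle_ns (the "X"). State 0 is the unconditional fallback,
 * even if it does not meet the constraints.
 */
static size_t pick_idle_state(const struct idle_state *states, size_t n,
                              uint64_t expected_idle_ns,
                              uint64_t max_latency_ns)
{
    size_t best = 0;

    assert(states != NULL && n > 0);

    for (size_t i = 1; i < n; i++) {
        if (states[i].exit_latency_ns > max_latency_ns)
            continue; /* would violate the latency bound */
        if (states[i].target_residency_ns > expected_idle_ns)
            continue; /* idle period too short to be worth it */
        best = i;
    }
    return best;
}
```

With these numbers, pick_idle_state(example_states, 3, 1000000, 100000) selects C3: C6 would pay off for a 1 ms idle period, but its 200 us exit latency breaks the 100 us bound. Passing UINT64_MAX as the latency bound corresponds to the "infinity" case.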
The scheduler may have its own requirements in terms of latency (e.g. some real-time thread) and we could extend the pm_qos API with per-thread information. But so far we don't have a way to pass such per-thread requirements from user space (unless we assume that any real-time thread has some fixed latency requirements). I suggest we ignore this per-thread part until we find an actual need.
I think you also raised the point that we do want some feedback as to the cost of waking up particular cores, to better decide which one to wake. That is indeed so.
Having a hardware driver give a preferred CPU ordering for wakes can indeed be useful. (I'm doubtful that changing the recommendation for each idle is going to pay off, but the proof is in the pudding; there are certainly long-term effects where this can help.)
The ordering is based on the actual C-state, so a simple approach is to wake up the CPU in the shallowest C-state. With asymmetric configurations (big.LITTLE) the same C-state has different costs on different CPUs, so a driver-provided ordering would come in handy there.
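As a rough illustration of the asymmetric case, the sketch below picks the idle CPU with the lowest wake-up cost rather than just the shallowest C-state index. Everything here is hypothetical: struct cpu_info, pick_wake_cpu(), the CPU_BUSY marker and the cost tables (arbitrary units) are assumptions for the example, not existing kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CPU_BUSY (-1) /* hypothetical marker: CPU is running, not idle */

/* Hypothetical per-CPU view as a hardware driver might report it. */
struct cpu_info {
    int idle_state;            /* current C-state index, or CPU_BUSY */
    const uint64_t *wake_cost; /* wake-up cost per C-state for this CPU type */
    size_t nr_states;          /* number of entries in wake_cost */
};

/* Made-up costs: the same C-state is more expensive on a big core. */
static const uint64_t little_wake_cost[] = { 5,  40, 300 };
static const uint64_t big_wake_cost[]    = { 20, 120, 900 };

static const struct cpu_info example_cpus[] = {
    { 1,        little_wake_cost, 3 }, /* CPU0: little, C-state 1 */
    { CPU_BUSY, little_wake_cost, 3 }, /* CPU1: little, busy */
    { 1,        big_wake_cost,    3 }, /* CPU2: big, C-state 1 */
    { 2,        big_wake_cost,    3 }, /* CPU3: big, C-state 2 */
};

/*
 * Return the index of the idle CPU that is cheapest to wake, or -1 if
 * no CPU is idle. Ties go to the lowest CPU index.
 */
static int pick_wake_cpu(const struct cpu_info *cpus, size_t n)
{
    int best = -1;
    uint64_t best_cost = UINT64_MAX;

    for (size_t i = 0; i < n; i++) {
        if (cpus[i].idle_state == CPU_BUSY)
            continue;
        assert(cpus[i].idle_state >= 0 &&
               (size_t)cpus[i].idle_state < cpus[i].nr_states);

        uint64_t cost = cpus[i].wake_cost[cpus[i].idle_state];
        if (cost < best_cost) {
            best_cost = cost;
            best = (int)i;
        }
    }
    return best;
}
```

In example_cpus, CPU0 and CPU2 sit in the same C-state, but the per-type cost table makes CPU0 the cheaper one to wake, which a plain "shallowest C-state" rule cannot tell apart.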
Even for symmetric configurations, the cost of moving a task to a CPU includes the wake-up cost plus the run-time cost, which depends on the P-state after wake-up. That part is much trickier, since we can't easily estimate the cost of a P-state and it may change once you place a task on the CPU.
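Purely to show the shape of such a combined cost, here is a toy model in which the P-state after wake-up is assumed to be known and fixed, which is exactly the part that is hard in practice. struct cpu_estimate, placement_cost_nj() and all numbers are hypothetical; the units are chosen so that cycles / MHz gives microseconds and mW * us gives nJ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU estimate; a real system would not know these exactly. */
struct cpu_estimate {
    uint64_t wake_energy_nj; /* energy to leave the current C-state */
    uint64_t freq_mhz;       /* assumed P-state frequency after wake-up */
    uint64_t power_mw;       /* assumed active power at that P-state */
};

/* Illustrative numbers only. */
static const struct cpu_estimate example_slow = {  500, 1000, 200 };
static const struct cpu_estimate example_fast = { 5000, 2000, 600 };

/*
 * Estimated energy (nJ) of placing work_cycles of work on the CPU:
 * wake-up cost plus run-time cost at the assumed P-state.
 */
static uint64_t placement_cost_nj(const struct cpu_estimate *cpu,
                                  uint64_t work_cycles)
{
    assert(cpu != NULL && cpu->freq_mhz > 0);

    uint64_t run_time_us = work_cycles / cpu->freq_mhz; /* cycles / MHz = us */

    return cpu->wake_energy_nj + cpu->power_mw * run_time_us; /* mW * us = nJ */
}
```

For 1,000,000 cycles of work this model gives 200,500 nJ on example_slow and 305,000 nJ on example_fast, so the run-time term dominates the wake-up term here. The model ignores that the P-state may change once the task is actually placed.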