On Mon, Nov 11, 2013 at 04:54:54PM +0000, Morten Rasmussen wrote:
On Mon, Nov 11, 2013 at 11:33:45AM +0000, Catalin Marinas wrote:
I would rather start by defining the main goal and working backwards to an algorithm. We may well find that task packing based on this patch set is sufficient, but we may also get packing-like behaviour as a side effect of a broader approach (better energy cost awareness). An important aspect, even in the mobile space, is keeping the performance as close as possible to the standard scheduler while saving a bit more power.
With the exception of big.LITTLE, where we want to outperform the standard scheduler while saving power.
Good point. Maybe we should start with a separate set of patches for improving the performance on asymmetric configurations like big.LITTLE while ignoring (deferring) the power aspect. Things like placing bigger threads on bigger CPUs and so on (you know better what's needed here ;).
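Just to illustrate the direction rather than propose an implementation (the threshold and the choice of candidate CPUs are made up for the example):

#define BIG_TASK_LOAD_THRESHOLD 600     /* made-up value; NICE_0_LOAD is 1024 */

/* Toy example: steer heavier tasks towards a big CPU. */
static int prefer_cpu(unsigned long task_load, int little_cpu, int big_cpu)
{
        /* the threshold is arbitrary, purely for illustration */
        if (task_load > BIG_TASK_LOAD_THRESHOLD)
                return big_cpu;

        return little_cpu;
}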
My understanding from the recent discussions is that the scheduler should decide directly on the C-state (or rather the deepest C-state possible, since we don't want to duplicate the backend logic for synchronising CPUs going up or down). This means that the scheduler needs to know about C-state target residency and wake-up latency (I think we can leave coupled C-states to the backend; there is some complex synchronisation which I wouldn't duplicate).
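To make this concrete, the kind of per-state information I have in mind looks roughly like the sketch below (illustrative only; the field names are made up, although cpuidle already tracks an exit latency and a target residency per state):

struct idle_state_info {
        unsigned int    exit_latency_us;        /* worst-case wake-up latency */
        unsigned int    target_residency_us;    /* break-even idle time */
};

/*
 * Pick the deepest state that fits the predicted idle time and the latency
 * limit. Assumes states[] is ordered from shallowest to deepest.
 */
static int pick_idle_state(const struct idle_state_info *states, int nr_states,
                           unsigned int predicted_idle_us,
                           unsigned int latency_limit_us)
{
        int i, best = 0;

        for (i = 1; i < nr_states; i++) {
                if (states[i].exit_latency_us <= latency_limit_us &&
                    states[i].target_residency_us <= predicted_idle_us)
                        best = i;
        }

        return best;
}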
It would be nice and simple to hide the complexity of the coupled C-states, but we would lose the ability to prefer waking up cpus in a cluster/package that already has non-idle cpus over cpus in a cluster/package that has entered the coupled C-state. If we just know the requested C-state of a cpu, we can't tell the difference as it is now.
I agree: we can't rely on the requested C-state but on the _actual_ state, and this means querying the hardware driver. Can we abstract this via some interface which provides the cost of waking up a CPU? This could take into account the state of the other CPUs in the cluster, so the scheduler is simply concerned with the wake-up costs.
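Something along these lines, purely for illustration (none of these names exist; they are invented for the example):

struct power_driver_ops {
        /*
         * Relative cost of waking @cpu right now, taking into account the
         * actual (possibly coupled) state of the rest of the cluster.
         */
        unsigned int (*wakeup_cost)(int cpu);
};

/* Scheduler side: pick the cheaper of two candidate CPUs to wake. */
static int cheaper_cpu_to_wake(const struct power_driver_ops *ops, int a, int b)
{
        return ops->wakeup_cost(a) <= ops->wakeup_cost(b) ? a : b;
}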
Alternatively (my preferred approach), we get the scheduler to predict and pass the expected residency and latency requirements down to a power driver, and read back the actual C-states for making task placement decisions. Some of the menu governor prediction logic could be turned into a library and used by the scheduler. Basically, what this tries to achieve is better scheduler awareness of the current C-states decided by a cpuidle/power driver based on the scheduler's constraints.
It might be easier to deal with the coupled C-states using this approach.
We already have drivers taking care of the coupled C-states, so it means passing the information back to the scheduler in some way (actual C-state or wake-up cost).
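For the sake of argument, the two-way interface could look roughly like this (kernel-style sketch, all names invented; this is not what cpuidle exposes today):

struct sched_idle_hint {
        u64     expected_residency_ns;  /* scheduler's idle prediction */
        u64     latency_limit_ns;       /* e.g. derived from PM QoS */
};

/* scheduler -> power driver: constraints for the next idle period on @cpu */
void power_driver_set_idle_hint(int cpu, const struct sched_idle_hint *hint);

/* power driver -> scheduler: C-state actually entered on @cpu (0 = running) */
int power_driver_get_cstate(int cpu);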
It would be nice if we could describe the wake-up costs statically while considering coupled C-states, but it needs more thinking.
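For example, a per-platform table along these lines might be enough, assuming the costs can be characterised statically (state names and numbers are made up):

enum cluster_state { CLUSTER_ACTIVE, CLUSTER_RETENTION, CLUSTER_OFF };

struct wakeup_cost {
        unsigned int    energy;         /* abstract energy units */
        unsigned int    latency_us;
};

/* Cost of waking a CPU, indexed by the coupled state the cluster is in. */
static const struct wakeup_cost example_cluster_costs[] = {
        [CLUSTER_ACTIVE]        = { .energy =  10, .latency_us =   5 },
        [CLUSTER_RETENTION]     = { .energy =  40, .latency_us =  50 },
        [CLUSTER_OFF]           = { .energy = 200, .latency_us = 500 },
};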