On 7/12/2013 11:49 PM, Peter Zijlstra wrote:
On Tue, Jul 09, 2013 at 04:55:29PM +0100, Morten Rasmussen wrote:
Hi,
This patch set is an initial prototype of the overall power-aware scheduler design proposal that I previously described at http://permalink.gmane.org/gmane.linux.kernel/1508480.
The patch set introduces a cpu capacity managing 'power scheduler' which lives alongside the existing (process) scheduler. Its role is to monitor the system load and decide which cpus should be available to the process scheduler.
Hmm...
This looks like a userspace hotplug daemon approach lifted to kernel space :/
How about, instead of layering over the load-balancer to constrain its behaviour, you change the behaviour to not need constraint? Fix it so it does the right thing, instead of limiting it.
I don't think it's _that_ hard to make the balancer do packing over spreading. The power balance code removed in 8e7fbcbc had things like that (although it was broken). And I'm sure I've seen patches over the years that did similar things. Didn't Vincent and Alex also do things like that?
a basic "sort left" (e.g. when needing to pick a cpu for a task that is short running, pick the lowest numbered idle one) will already have the effect of packing in practice. it's not perfect packing, but on a statistical level it'll be quite good.
(This all assumes relatively idle systems with spare capacity to play with, of course... but that's the domain where packing plays a role.)
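Just to make the "sort left" idea concrete, here is a toy sketch in plain C (not kernel code; the idle bitmask, NR_CPUS value and helper names are made up for the example, they stand in for whatever state the scheduler actually tracks):

/* Toy illustration of "sort left": for a short-running task, pick the
 * lowest-numbered idle CPU so that, statistically, work packs onto the
 * low-numbered CPUs and the high-numbered ones can stay idle.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/* toy idle state: bit n set means CPU n is idle */
static unsigned int idle_mask = 0xf4;	/* CPUs 2, 4, 5, 6, 7 idle */

static bool cpu_is_idle(int cpu)
{
	return idle_mask & (1u << cpu);
}

/* Return the lowest-numbered idle CPU, or -1 if none is idle. */
static int sort_left_pick_cpu(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_is_idle(cpu))
			return cpu;

	return -1;	/* no spare capacity: fall back to normal balancing */
}

int main(void)
{
	printf("short task placed on CPU %d\n", sort_left_pick_cpu());
	return 0;
}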
Arjan, from reading your emails you're mostly busy explaining what cannot be done. Please explain what _can_ be done and what Intel wants. From what I can see, you basically promote a max P-state, max concurrency, race-to-idle FTW.
Btw, one more thing I'd like to have is communication between the scheduler and the policy/hardware drivers about task migration. When a task migrates to another CPU, the statistics that the hardware/driver/policy code was keeping for that target CPU are really no longer valid in terms of forward-looking predictive power. A communication (API or notification or whatever form it takes) around this would be quite helpful. It could be as simple as setting a flag on the target cpu (in its rq), so that at the next power event (exiting idle, P-state evaluation, whatever) the policy code can flush-and-start-over.
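As a rough model of that flag idea (again just plain C for illustration, not an existing kernel interface; struct power_stats, note_task_migrated_to() and power_event() are invented names):

/* Toy model of the migration hint: when a task migrates to a CPU, set a
 * per-CPU "stats are stale" flag; the next power event on that CPU
 * (idle exit, P-state evaluation, ...) flushes its history and starts over.
 */
#include <stdbool.h>
#include <string.h>

#define NR_CPUS 8

struct power_stats {
	bool	stale;		/* set on incoming migration */
	double	avg_busy_pct;	/* example of history the policy keeps */
};

static struct power_stats per_cpu_stats[NR_CPUS];

/* Called (conceptually) from the migration path, for the target CPU. */
static void note_task_migrated_to(int cpu)
{
	per_cpu_stats[cpu].stale = true;
}

/* Called (conceptually) at the next power event on this CPU. */
static void power_event(int cpu)
{
	struct power_stats *ps = &per_cpu_stats[cpu];

	if (ps->stale) {
		/* history no longer predicts the future: flush and restart */
		memset(ps, 0, sizeof(*ps));
	}

	/* ... normal idle / P-state policy evaluation would go here ... */
}

int main(void)
{
	note_task_migrated_to(3);	/* task migrates onto CPU 3 */
	power_event(3);			/* CPU 3's next power event flushes */
	return 0;
}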
On thinking more about the short-running task thing: there is an optimization we currently don't do, mostly for hyperthreading (and HT is just one out of a set of cases with similar power behavior). If we know a task runs briefly AND is not performance critical, it's much much better to place it on a hyperthreading buddy of an already busy core than it is to place it on an empty core (or to delay it). Yes, an HT pair isn't the same performance as a full core, but in terms of power the second half of an HT pair is nearly free... so if there's a task that's not performance sensitive (and won't disturb the other task too much, e.g. runs briefly enough), it's better to pack onto a core than to spread. You can generalize this to a class of systems where adding work to a core (read: a group of cpus that share resources) is significantly cheaper than running it on a fully empty core.
(There is clearly a tradeoff: by sharing resources you also end up reducing performance/efficiency, and that has its own effect on power, so there is some kind of balance needed and a big enough gain to be worth the loss.)
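Roughly, that placement preference could look like the sketch below (plain C, purely illustrative; the topology, the idle tracking, and the names pick_cpu_for_short_task() and sibling_of() are all made up for the example):

/* Toy sketch of the HT-packing preference: for a short, non-critical
 * task, prefer an idle SMT sibling of an already-busy core (nearly free
 * in power terms) over waking up a completely idle core.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS	8

static bool cpu_idle[NR_CPUS] = {
	false, true,		/* core 0: CPU 0 busy, CPU 1 idle */
	true,  true,		/* core 1: both idle              */
	true,  true,		/* core 2: both idle              */
	false, false,		/* core 3: both busy              */
};

static int sibling_of(int cpu)
{
	return cpu ^ 1;		/* toy topology: adjacent CPUs pair up */
}

/* Pick a CPU for a short, non-performance-critical task. */
static int pick_cpu_for_short_task(void)
{
	int cpu;

	/* First choice: an idle sibling of a busy CPU (cheap extra work). */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_idle[cpu] && !cpu_idle[sibling_of(cpu)])
			return cpu;

	/* Otherwise fall back to any idle CPU (wakes up a whole core). */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu_idle[cpu])
			return cpu;

	return -1;	/* nothing idle: let the normal balancer decide */
}

int main(void)
{
	printf("short task placed on CPU %d\n", pick_cpu_for_short_task());
	return 0;
}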