On Sat, Jul 13, 2013 at 07:40:08AM -0700, Arjan van de Ven wrote:
On 7/12/2013 11:49 PM, Peter Zijlstra wrote:
Arjan; from reading your emails you're mostly busy explaining what cannot be done. Please explain what _can_ be done and what Intel wants. From what I can see you basically promote a max P state max concurrency race to idle FTW.
Since you can't say what the max P state is; and I think I understand the reasons for that, and the hardware might not even respect the P state you tell it to run at, does it even make sense to talk about Intel P states? When would you not program the max P state?
this is where it gets complicated ;-( whether race-to-idle wins depends on the type of code that is running: if things are memory bound it's outright not true, but for compute-bound code it often is.
So you didn't actually answer the question about when you'd program a less than max P state. Your recommended interface also glaringly lacks the arch_please_go_slower_now() function.
What's the point of having a 'go faster' button if you can't also go slower?
So you can program any P state; but the hardware is free to do as it pleases, just not slower than the lowest P state. So clearly the hardware is 'smart'.
Going by your interface there's also not much influence as to where the 'power' goes; can we for example force the GPU to clock lower in order to 'free' up power for cores?
If we can, we should very much include that in the entire discussion.
What I would like to see is
- Move the idle predictor logic into the scheduler, or at least a library (I'm not sure the scheduler can do better than the current code, but it might, and what menu does today is at least worth putting in some generic library)
Right, so the idea is that these days we have much better task runtime behaviour tracking than we used to have and this might help. I also realize the idle guestimator uses more than just task activity, interrupt activity is also very important.
This also makes it not a pure scheduling thing so I wouldn't be too bothered if it lived in kernel/cpu/idle.c instead of in the scheduler proper.
Not sure calling it a generic library would be wise; that has such an optional sound to it. The thing we want to avoid is people brewing their own etc..
Also, my interest in it is that the scheduler wants to use it; and when we go do power-aware scheduling I feel it should live very near the scheduler, if not in the scheduler proper, for the simple reason that part of being power aware is trying to stay idle as long as possible; the idle guestimator is the measure of that.
So in that sense they are closely related.
- An interface between scheduler and P state code in the form of (and don't take the names as actual function names ;-) void arch_please_go_fastest(void); /* or maybe int cpunr as argument, but that's harder to implement */
Here again, the only thing this allows is max P state race for idle. Why would Intel still pretend to have P states if they're so useless and mean so little?
int arch_can_you_go_faster(void); /* if the scheduler would like to know this instead of load balancing .. unsure */
You said Intel could not say if it were at the max P state; so how could it possibly answer this one?
unsigned long arch_instructions_executed(void); /* like tsc, but on instructions, so the scheduler can account actual work done */
To what purpose? People mostly still care about wall-time for things like response times and such. Also, it's not something most archs will be able to provide without sacrificing a PMU counter, if they even have such a thing. And not everybody is as 'fast' at reading PMU state as one would like.
the first one is for the scheduler to call when it sees a situation of "we care deeply about performance now" coming, for example near overload, or when a realtime (or otherwise high-priority) task gets scheduled. The second one I am dubious about, but maybe you have a use for it; some folks think there is value in deciding to ramp up the performance rather than load balancing. For load balancing to an idle cpu, I don't see that value (in terms of power efficiency), but I do see a case where your 2 cores happen to be busy (some sort of thundering herd effect) but imbalanced; in that case going faster rather than rebalancing... I can certainly see the point.
(reformatted to 80 col text)
The entire scheme seems to disregard everybody who doesn't have a 'smart' micro controller doing the P state management. Some people will have to actually control the cpufreq.
- an interface from the C state hardware driver to the scheduler to say "oh btw, the LLC got flushed, forget about past cache affinity". The C state driver can sometimes know this.. and Linux today tries to keep affinity anyway, while we could get more optimal by being allowed to balance more freely
This shouldn't be hard to implement at all.
- this is the most important one, but likely the hardest one: an interface from the scheduler that says "we are performance sensitive now": void arch_sched_performance_sensitive(int duration_ms);
I've put a duration as argument, rather than an "arch_no_longer_sensitive", to avoid the scheduler having to run some periodic timer/whatever to keep this alive; rather it is sort of a "lease" that the scheduler can renew as often as it wants, but which auto-expires eventually.
with this the hardware and/or hardware drivers can bias their decisions toward performance based on what is actually the driving force behind both P and C state decisions: performance sensitivity. (All the utilization estimation that menu, but also the P state drivers, try to do amounts to estimating how sensitive we are to performance; if we're not sensitive, consider sacrificing some performance for power. Even with race-to-halt, sometimes sacrificing a little performance gives a power benefit at the top of the range.)
Right, trouble is of course we have nothing to base this on. Our task model completely lacks any clue for this. And the problem with introducing something like that would also be that I suspect that within a few years every single task on the system would find itself 'important'.
IIRC you at one point said there was a time limit below which concurrency spread wasn't useful anymore?
there is a time below which waking up a core (not hyperthread pair, that is ALWAYS worth it since it's insanely cheap) is not worth it. Think in the order of "+/- 50 microseconds".
OK.
Also, most of what you say concerns single socket systems; what does Intel want for multi-socket systems?
for multisocket, rule number one is "don't screw up numa". for tasks where numa matters, that's the top priority.
OK, so again, make sure to get the work done as quickly as possible and go idle again.