On Sat, Jul 13, 2013 at 07:40:08AM -0700, Arjan van de Ven wrote:
On 7/12/2013 11:49 PM, Peter Zijlstra wrote:
Arjan; from reading your emails you're mostly busy explaining what cannot be done. Please explain what _can_ be done and what Intel wants. From what I can see you basically promote a max P state max concurrency race to idle FTW.
Since you can't say what the max P state is; and I think I understand the reasons for that, and the hardware might not even respect the P state you tell it to run at, does it even make sense to talk about Intel P states? When would you not program the max P state?
this is where it gets complicated ;-( whether race-to-idle wins depends on the type of code that is running: if things are memory bound it's outright not true, but for compute-bound code it often is.
So you didn't actually answer the question about when you'd program a less than max P state. Your recommended interface also glaringly lacks the arch_please_go_slower_now() function.
What's the point of having a 'go faster' button if you can't also go slower?
So you can program any P state; but the hardware is free to do as it pleases, just not slower than the lowest P state. So clearly the hardware is 'smart'.
Going by your interface there's also not much influence as to where the 'power' goes; can we for example force the GPU to clock lower in order to 'free' up power for cores?
If we can, we should very much include that in the entire discussion.
What I would like to see is
- Move the idle predictor logic into the scheduler, or at least a library (I'm not sure the scheduler can do better than the current code, but it might, and what menu does today is at least worth putting in some generic library)
Right, so the idea is that these days we have much better task runtime behaviour tracking than we used to have and this might help. I also realize the idle guestimator uses more than just task activity, interrupt activity is also very important.
This also makes it not a pure scheduling thing so I wouldn't be too bothered if it lived in kernel/cpu/idle.c instead of in the scheduler proper.
Not sure calling it a generic library would be wise; that has such an optional sound to it. The thing we want to avoid is people brewing their own etc..
Also, my interest in it is that the scheduler wants to use it; and when we go do power-aware scheduling I feel it should live very near the scheduler, if not in the scheduler proper, for the simple reason that part of being power aware is trying to stay idle as long as possible; the idle guestimator is the measure of that.
So in that sense they are closely related.
- An interface between scheduler and P state code in the form of (and don't take the names as actual function names ;-) void arch_please_go_fastest(void); /* or maybe int cpunr as argument, but that's harder to implement */
Here again, the only thing this allows is max P state race for idle. Why would Intel still pretend to have P states if they're so useless and mean so little?
int arch_can_you_go_faster(void); /* if the scheduler would like to know this instead of load balancing .. unsure */
You said Intel could not say if it were at the max P state; so how could it possibly answer this one?
unsigned long arch_instructions_executed(void); /* like tsc, but on instructions, so the scheduler can account actual work done */
To what purpose? People mostly still care about wall-time for things like response times and such. Also, it's not something most archs will be able to provide without sacrificing a PMU counter, if they even have such a thing. And not everybody is as 'fast' at reading PMU state as one would like.
the first one is for the scheduler to call when it sees a situation of "we care deeply about performance now" coming, for example near overload, or when a realtime (or otherwise high-priority) task gets scheduled. The second one I am dubious about, but maybe you have a use for it; some folks think there is value in deciding to ramp up the performance rather than load balancing. For load balancing to an idle cpu, I don't see that value (in terms of power efficiency), but I do see a case where your 2 cores happen to be busy (some sort of thundering herd effect) but imbalanced; in that case going faster rather than rebalancing... I can certainly see the point.
(reformatted to 80 col text)
The entire scheme seems to disregard everybody who doesn't have a 'smart' micro controller doing the P state management. Some people will have to actually control the cpufreq.
- an interface from the C state hardware driver to the scheduler to say "oh btw, the LLC got flushed, forget about past cache affinity". The C state driver can sometimes know this.. and Linux today tries to keep affinity anyway, while we could get more optimal by being allowed to balance more freely
This shouldn't be hard to implement at all.
- this is the most important one, but likely the hardest one: an interface from the scheduler that says "we are performance sensitive now": void arch_sched_performance_sensitive(int duration_ms);
I've put a duration as argument, rather than an "arch_no_longer_sensitive", to avoid the scheduler having to run some periodic timer/whatever to keep this alive; rather it is sort of a "lease" that the scheduler can renew as often as it wants, but which auto-expires eventually.
with this the hardware and/or hardware drivers can bias their decisions toward performance based on what is actually the driving force behind both P and C state decisions: performance sensitivity. (All the utilization estimation that menu, but also the P state drivers, try to do amounts to estimating how sensitive we are to performance; if we're not sensitive, consider sacrificing some performance for power. Even with race-to-halt, sometimes sacrificing a little performance gives a power benefit at the top of the range.)
Right, trouble is of course we have nothing to base this on. Our task model completely lacks any clue for this. And the problem with introducing something like that would also be that I suspect that within a few years every single task on the system would find itself 'important'.
IIRC you at one point said there was a time limit below which concurrency spread wasn't useful anymore?
there is a time below which waking up a core (not hyperthread pair, that is ALWAYS worth it since it's insanely cheap) is not worth it. Think in the order of "+/- 50 microseconds".
OK.
Also, most of what you say concerns single socket systems; what does Intel want for multi-socket systems?
for multisocket, rule number one is "don't screw up numa". for tasks where numa matters, that's the top priority.
OK, so again, make sure to get the work done as quickly as possible and go idle again.