But on x86 you still have a P-state hint for the CPU and the scheduler could at least hope for more CPU performance. We can make the power scheduler ask the power driver for an increase or decrease of performance (as Preeti suggested) and give it the current load as an argument rather than a precise performance/frequency level. The power driver would change the P-state accordingly and take the load into account (or ignore it; something like intel_pstate.c can do its own aperf/mperf tracking). But the power driver would inform the scheduler when it can't change the P-state any further, and the power scheduler can then decide to spread the load out to other CPUs.
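To make that request/reply flow concrete, here is a rough user-space sketch in C. None of these names exist in the kernel: power_driver_request_increase(), power_sched_needs_spread(), the reply enum and the four-level P-state range are all made up for illustration, and the "jump two steps above 80% load" rule is just one arbitrary way a driver might use the hint.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS    4   /* hypothetical CPU count */
#define MAX_PSTATE 3   /* hypothetical: highest performance level the driver has */

/* Hypothetical reply from the power driver to the power scheduler. */
enum perf_reply { PERF_CHANGED, PERF_AT_LIMIT };

/* Driver-private state; the scheduler never reads this directly. */
static int pstate[NR_CPUS];

/*
 * Driver side: the scheduler asks for more performance and passes the
 * current load (0..100) as a hint, not a frequency. The driver is free
 * to use or ignore the hint; in this sketch a heavy load moves two steps.
 */
static enum perf_reply power_driver_request_increase(int cpu, int load_pct)
{
    assert(cpu >= 0 && cpu < NR_CPUS);

    if (pstate[cpu] >= MAX_PSTATE)
        return PERF_AT_LIMIT;          /* cannot change the P-state further */

    int step = (load_pct > 80) ? 2 : 1;
    pstate[cpu] += step;
    if (pstate[cpu] > MAX_PSTATE)
        pstate[cpu] = MAX_PSTATE;      /* clamp to what the hardware offers */

    return PERF_CHANGED;
}

/*
 * Scheduler side: returns true when the driver reports it is at its
 * limit, i.e. the only option left is spreading load to other CPUs.
 */
static bool power_sched_needs_spread(int cpu, int load_pct)
{
    return power_driver_request_increase(cpu, load_pct) == PERF_AT_LIMIT;
}
```

The point of the sketch is only that the scheduler side sees "changed" or "at limit" and never a P-state number.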
I am completely fine with an interface that is something like
	void arch_please_go_faster(int cpunr);
	void arch_please_go_fastest(int cpunr);
	int arch_can_you_go_faster_than_now(int cpunr);
(maybe without the arguments, so that they only apply to the local CPU; that would surely make the implementation simpler)
with the understanding that these are instant requests (e.g. longer term policy will clobber requests eventually).
It makes total sense to me for the scheduler to indicate "I need performance NOW", either when it sees it's on the verge of needing to load balance, or when it is about to schedule a high priority (think realtime) task.
Part of the reason I like such an interface is that it is a higher level one: it's a clear and high level enough policy request that the hardware driver can translate it into a hardware specific thing.
An interface that would be "put it at THIS much" is not. It's too low level and makes assumptions about hardware things that change between generations/vendors that the scheduler really shouldn't know about.
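As a rough illustration of where the hardware knowledge would live, here is a user-space sketch of a driver sitting behind the three hooks. Only the three arch_* prototypes come from the proposal above; the level table, the MHz values, fake_hw_mhz[] and driver_apply_level() are invented for the example and do not correspond to any real driver.

```c
#include <assert.h>

#define NR_CPUS 2   /* hypothetical CPU count */

/*
 * Hypothetical driver-private table. Only the driver knows what each
 * level means; the MHz values are invented for this example.
 */
static const int level_mhz[] = { 800, 1600, 2400, 3000 };
#define NR_LEVELS ((int)(sizeof(level_mhz) / sizeof(level_mhz[0])))

static int cur_level[NR_CPUS];     /* current level per CPU, starts at 0 */
static int fake_hw_mhz[NR_CPUS];   /* stand-in for a hardware register */

/* Hardware specific part: translate a level into the "register" write. */
static void driver_apply_level(int cpunr, int level)
{
    cur_level[cpunr] = level;
    fake_hw_mhz[cpunr] = level_mhz[level];
}

/* Instant request: one step up, if there is any headroom left. */
void arch_please_go_faster(int cpunr)
{
    assert(cpunr >= 0 && cpunr < NR_CPUS);
    if (cur_level[cpunr] < NR_LEVELS - 1)
        driver_apply_level(cpunr, cur_level[cpunr] + 1);
}

/* Instant request: go straight to the highest level the driver has. */
void arch_please_go_fastest(int cpunr)
{
    assert(cpunr >= 0 && cpunr < NR_CPUS);
    driver_apply_level(cpunr, NR_LEVELS - 1);
}

/* Query: nonzero if a further "go faster" request could still help. */
int arch_can_you_go_faster_than_now(int cpunr)
{
    assert(cpunr >= 0 && cpunr < NR_CPUS);
    return cur_level[cpunr] < NR_LEVELS - 1;
}
```

A different generation or vendor would swap out the table and driver_apply_level() and leave the callers alone, which is the whole argument against a "put it at THIS much" interface.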