Re: sched: ARM: arch_scale_freq_power

11 Oct 2011


On 11 October 2011 12:27, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
...
On Tue, 2011-10-11 at 11:40 +0200, Vincent Guittot wrote:
...
On 11 October 2011 11:13, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
...
On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
...
I have several goals. The 1st one is that I need to put more load on
some cpus when I have packages with different cpu frequency.
That should be rather easy.
I agree. I was mainly wondering if I should use a [1-1024] or a
[1024-xxxx] range, and it seems that both can be used: SMT
uses <1024 and x86 turbo mode uses >1024.
Well, turbo mode would typically only boost a cpu 25% or so, and only
while idling other cores to keep under its thermal limit. So it's not
sufficient to actually affect the capacity calculation much, if at all.
OK
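
To illustrate why a 25% boost barely matters, here is a minimal sketch. It assumes that a group's capacity (the number of tasks it is expected to hold) is its cpu_power divided by the nominal scale of 1024 and rounded to the nearest integer. The names group_capacity() and div_round_closest() are made up for this example and are not kernel code.

```c
/* Nominal cpu_power of one full-speed cpu (the 1024 discussed above). */
#define POWER_SCALE 1024UL

/* Round-to-nearest unsigned division (helper for this sketch only). */
static unsigned long div_round_closest(unsigned long x, unsigned long d)
{
	return (x + d / 2) / d;
}

/*
 * Assumption: capacity is cpu_power / 1024, rounded to nearest.
 * 1024 (nominal)    -> 1 task
 * 1280 (+25% turbo) -> still 1 task
 * 1536 (+50%)       -> 2 tasks
 */
static unsigned long group_capacity(unsigned long group_power)
{
	return div_round_closest(group_power, POWER_SCALE);
}
```

Under this assumption a single cpu's cpu_power has to reach 1536 before its capacity changes from 1 to 2.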
...
Then, I have some use cases which have several running tasks but a low
cpu load. In this case, the small tasks are spread across several cpus by
load_balance, whereas they could easily be handled by one cpu
without a significant performance change.
That shouldn't be done using cpu_power, we have sched_smt_power_savings
and sched_mc_power_savings for stuff like that.
sched_mc_power_savings works fine when we have more than 2 cpus, but
it can't be applied on a dual core because it needs at least 2 sched_groups,
and the nr_running of these sched_groups must be higher than 0 but
smaller than group_capacity, which is 1 on a dual core system.
SD_POWERSAVINGS_BALANCE does /=2 to nr_running, effectively doubling the
capacity iirc. And I know some IBM dudes were toying with the idea of
playing tricks with the capacity numbers, but that never went anywhere.
Yes, but it's only a special case for 2 tasks on a dual core, and the
SD_WAKE_AFFINE flag and select_idle_sibling can override this decision.
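
The dual-core limitation described above can be sketched as follows. This is a simplification of the condition as stated in this thread (a group qualifies only if 0 < nr_running < group_capacity), not the actual load-balancer code, and group_can_pack() is a hypothetical name.

```c
#include <stdbool.h>

/*
 * Simplified form of the condition described above: a sched_group is
 * considered for power-saving balancing only if it is neither idle
 * nor already at capacity.
 */
static bool group_can_pack(unsigned long nr_running,
			   unsigned long group_capacity)
{
	return nr_running > 0 && nr_running < group_capacity;
}
```

With group_capacity == 1, as on a dual core where each group is a single cpu, no value of nr_running satisfies both bounds, so the check never passes.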
...
Although I would really like to kill all those different
sched_*_power_savings knobs and reduce it to one.
...
If the cpu_power is
higher than 1024, the cpu is no longer seen as out of capacity by
load_balance as soon as a short process is running, and the main result
is that the small tasks will stay on the same cpu. This configuration
is mainly useful for ARM dual-core systems when we want to power gate
one cpu. I use cyclictest to simulate such a use case.
Yeah, but that's wrong.
That's the only way I have found to gather small, unrelated tasks
on one cpu. Do you know of any better solution?
How do you know the task is 'small' ?
I want to use cpufreq to be notified that we have a large/small cpu
load. If we have several tasks but the cpu uses the lowest frequency,
it "should" mean that we have small tasks that are running (less than
20ms*95% of added duration) and we could gather them on one cpu (by
increasing the cpu_power on a dual core).
...
For that you would need to track a time-weighted effective load average
of the task and we don't have that.
Yes, that's why I use cpufreq until a better option, like a
time-weighted load average, is available.
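
For illustration only, a time-weighted load average of the kind mentioned above could look like the sketch below. Nothing here exists in the scheduler under discussion: struct task_load, task_load_update(), task_is_small(), the fixed sampling period and the 3/4 decay factor are all assumptions made for this example.

```c
#include <stdint.h>

/* 1024 means the task ran for 100% of the sampled period. */
#define LOAD_SCALE 1024u

/* Hypothetical per-task state: a decayed average in 0..LOAD_SCALE. */
struct task_load {
	uint32_t avg;
};

/*
 * Fold one sampling period into the average.  ran_us / period_us is
 * the fraction of the period the task actually ran.  Older periods
 * decay geometrically: new = 3/4 * old + 1/4 * sample.
 * period_us must be non-zero.
 */
static void task_load_update(struct task_load *tl,
			     uint32_t ran_us, uint32_t period_us)
{
	uint32_t sample =
		(uint32_t)(((uint64_t)ran_us * LOAD_SCALE) / period_us);

	tl->avg = (3 * tl->avg + sample) / 4;
}

/* A task counts as 'small' while its average stays below a threshold. */
static int task_is_small(const struct task_load *tl, uint32_t threshold)
{
	return tl->avg < threshold;
}
```

A task that runs for about 10% of each period settles well below a threshold of, say, LOAD_SCALE / 4, while a task that runs continuously climbs past it within a few periods.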
...
[ how bad is all this u64 math on ARM btw? and when will ARM finally
 agree all this 32bit nonsense is a waste of time and silicon? ]
But yeah, the whole nr_running vs capacity thing was traditionally to
deal with spreading single tasks around. And traditional power aware
scheduling was mostly about packing those on sockets (keeps other
sockets idle) instead of spreading them around sockets (optimizes
cache).
Now I wouldn't at all mind you ripping out all that
sched_*_power_savings crap and replacing it, I doubt it actually works
anyway. I haven't got many patches on the subject, and I know I don't
have the equipment to measure power usage.
Also, the few patches I got mostly made the sched_*_power_savings mess
bigger, which I refuse to do (what sysad wants to have a 27-state space
to configure his power aware scheduling). This has mostly made people go
away instead of fixing things up :-(
As to what the replacement would have to look like, dunno, it's not
something I've really thought much about, but maybe the time-weighted
stuff is the only sane approach, that combined with options on how to
spread tasks (core, socket, node, etc.).
I really think changing the load-balancer is the right way to go about
solving your power issue (hot-plugging a cpu really is an insane way to
idle a core) and I'm open to discussing what would work for you.
Great. My 1st goal was not to modify the load-balancer and sched_mc
(or as little as possible) and to study how I could tune the scheduler
parameters to get the best power consumption on ARM platforms. Now,
changing the load-balancer is probably a better solution.
...
All I really ask is to not cobble something together, the load-balancer
is a horridly complex thing already and the last thing it needs is more
special cases that don't interact properly.
