I am working on linking the cpu_power of ARM cores to their frequency by using arch_scale_freq_power. It's explained in the kernel that cpu_power is used to distribute load across cpus, and a cpu with more cpu_power will pick up more load. The default value is SCHED_POWER_SCALE, and I increase the value if I want a cpu to take more load than another one. Is there an advised range for cpu_power values, as well as some time-scale constraints for updating the cpu_power value? I'm also wondering why this scheduler feature is currently disabled by default?
Regards, Vincent
Adding Peter to the discussion..
On Thu, Oct 6, 2011 at 5:06 PM, Vincent Guittot vincent.guittot@linaro.org wrote:
I am working on linking the cpu_power of ARM cores to their frequency by using arch_scale_freq_power. It's explained in the kernel that cpu_power is used to distribute load across cpus, and a cpu with more cpu_power will pick up more load. The default value is SCHED_POWER_SCALE, and I increase the value if I want a cpu to take more load than another one. Is there an advised range for cpu_power values, as well as some time-scale constraints for updating the cpu_power value? I'm also wondering why this scheduler feature is currently disabled by default?
Regards, Vincent
In discussions with Vincent regarding this, I've wondered whether cpu_power wouldn't be better renamed to cpu_capacity since that is what it really seems to describe.
On Tue, 2011-10-11 at 12:46 +0530, Amit Kucheria wrote:
Adding Peter to the discussion..
Right, CCing the folks who actually wrote the code you're asking questions about always helps ;-)
On Thu, Oct 6, 2011 at 5:06 PM, Vincent Guittot vincent.guittot@linaro.org wrote:
I am working on linking the cpu_power of ARM cores to their frequency by using arch_scale_freq_power.
Why and how? In particular note that if you're using something like the on-demand cpufreq governor this isn't going to work.
It's explained in the kernel that cpu_power is used to distribute load across cpus, and a cpu with more cpu_power will pick up more load. The default value is SCHED_POWER_SCALE, and I increase the value if I want a cpu to take more load than another one. Is there an advised range for cpu_power values, as well as some time-scale constraints for updating the cpu_power value?
Basically 1024 is the unit and denotes the capacity of a full core at 'normal' speed.
Typically cpufreq would down-clock a core and thus you'd end up with a smaller number (linearly proportional to the freq ratio etc. although if you want to go really fancy you could determine the actual throughput/freq curves).
Things like x86 turbo mode would result in a >1024 value.
Things like SMT would typically result in <1024 and the SMT sum over the core >1024 (if you're lucky).
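As a concrete illustration of the frequency scaling described above, a minimal arch_scale_freq_power() might look like the sketch below. This is only a sketch: cur_khz() and max_khz() are assumed platform helpers standing in for whatever the platform really provides, not existing kernel APIs.

unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
{
	/* cur_khz()/max_khz(): assumed platform helpers, not real APIs */
	unsigned long cur = cur_khz(cpu);
	unsigned long max = max_khz(cpu);

	if (!max)
		return SCHED_POWER_SCALE;

	/* 1024 at full speed, linearly less when down-clocked */
	return (cur * SCHED_POWER_SCALE) / max;
}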
I'm also wondering why this scheduler feature is currently disabled by default?
Because the only implementation in existence (x86) is broken and I haven't gotten around to fixing it. Arguably we should disable that for the time being, see below.
In discussions with Vincent regarding this, I've wondered whether cpu_power wouldn't be better renamed to cpu_capacity since that is what it really seems to describe.
Possibly, but it's been cpu_power for ages and we use capacity to describe something else.
---
 arch/x86/kernel/cpu/sched.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/sched.c b/arch/x86/kernel/cpu/sched.c
index a640ae5..90ae68c 100644
--- a/arch/x86/kernel/cpu/sched.c
+++ b/arch/x86/kernel/cpu/sched.c
@@ -6,7 +6,14 @@
 #include <asm/cpufeature.h>
 #include <asm/processor.h>

-#ifdef CONFIG_SMP
+#if 0 /* def CONFIG_SMP */
+
+/*
+ * Currently broken, we need to filter out idle time because the aperf/mperf
+ * ratio measures actual throughput, not capacity. This means that if a logical
+ * cpu idles it will report less capacity and receive less work, which isn't
+ * what we want.
+ */
 static DEFINE_PER_CPU(struct aperfmperf, old_perf_sched);
On 11 October 2011 09:57, Peter Zijlstra a.p.zijlstra@chello.nl wrote:
On Tue, 2011-10-11 at 12:46 +0530, Amit Kucheria wrote:
Adding Peter to the discussion..
Right, CCing the folks who actually wrote the code you're asking questions about always helps ;-)
On Thu, Oct 6, 2011 at 5:06 PM, Vincent Guittot vincent.guittot@linaro.org wrote:
I am working on linking the cpu_power of ARM cores to their frequency by using arch_scale_freq_power.
Why and how? In particular note that if you're using something like the on-demand cpufreq governor this isn't going to work.
I have several goals. The first one is that I need to put more load on some cpus when I have packages with different cpu frequencies. I am also studying whether I can follow the real cpu frequency, but it seems not so easy. I have noticed that the cpu_power is updated periodically, except when we have a lot of newly_idle events. Then, I have some use cases which have several running tasks but a low cpu load. In this case, the small tasks are spread across several cpus by load_balance, whereas they could easily be handled by one cpu without significant performance impact. If the cpu_power is higher than 1024, the cpu is no longer seen as out of capacity by load_balance as soon as a short process is running, and the main result is that the small tasks will stay on the same cpu. This configuration is mainly useful for an ARM dual-core system when we want to power gate one cpu. I use cyclictest to simulate such a use case.
It's explained in the kernel that cpu_power is used to distribute load across cpus, and a cpu with more cpu_power will pick up more load. The default value is SCHED_POWER_SCALE, and I increase the value if I want a cpu to take more load than another one. Is there an advised range for cpu_power values, as well as some time-scale constraints for updating the cpu_power value?
Basically 1024 is the unit and denotes the capacity of a full core at 'normal' speed.
Typically cpufreq would down-clock a core and thus you'd end up with a smaller number (linearly proportional to the freq ratio etc. although if you want to go really fancy you could determine the actual throughput/freq curves).
Things like x86 turbo mode would result in a >1024 value.
Things like SMT would typically result in <1024 and the SMT sum over the core >1024 (if you're lucky).
I'm also wondering why this scheduler feature is currently disabled by default?
Because the only implementation in existence (x86) is broken and I haven't gotten around to fixing it. Arguably we should disable that for the time being, see below.
In discussions with Vincent regarding this, I've wondered whether cpu_power wouldn't be better renamed to cpu_capacity since that is what it really seems to describe.
Possibly, but it's been cpu_power for ages and we use capacity to describe something else.
 arch/x86/kernel/cpu/sched.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/sched.c b/arch/x86/kernel/cpu/sched.c
index a640ae5..90ae68c 100644
--- a/arch/x86/kernel/cpu/sched.c
+++ b/arch/x86/kernel/cpu/sched.c
@@ -6,7 +6,14 @@
 #include <asm/cpufeature.h>
 #include <asm/processor.h>

-#ifdef CONFIG_SMP
+#if 0 /* def CONFIG_SMP */
+
+/*
+ * Currently broken, we need to filter out idle time because the aperf/mperf
+ * ratio measures actual throughput, not capacity. This means that if a logical
+ * cpu idles it will report less capacity and receive less work, which isn't
+ * what we want.
+ */
 static DEFINE_PER_CPU(struct aperfmperf, old_perf_sched);
On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
I have several goals. The first one is that I need to put more load on some cpus when I have packages with different cpu frequencies.
That should be rather easy.
I am also studying whether I can follow the real cpu frequency, but it seems not so easy.
Why not?
I have noticed that the cpu_power is updated periodically, except when we have a lot of newly_idle events.
We can certainly fix that.
Then, I have some use cases which have several running tasks but a low cpu load. In this case, the small tasks are spread across several cpus by load_balance, whereas they could easily be handled by one cpu without significant performance impact.
That shouldn't be done using cpu_power, we have sched_smt_power_savings and sched_mc_power_savings for stuff like that.
Although I would really like to kill all those different sched_*_power_savings knobs and reduce it to one.
If the cpu_power is higher than 1024, the cpu is no longer seen as out of capacity by load_balance as soon as a short process is running, and the main result is that the small tasks will stay on the same cpu. This configuration is mainly useful for an ARM dual-core system when we want to power gate one cpu. I use cyclictest to simulate such a use case.
Yeah, but that's wrong.
On Tue, Oct 11, 2011 at 2:43 PM, Peter Zijlstra a.p.zijlstra@chello.nl wrote:
On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
I have several goals. The first one is that I need to put more load on some cpus when I have packages with different cpu frequencies.
That should be rather easy.
I am also studying whether I can follow the real cpu frequency, but it seems not so easy.
Why not?
I have noticed that the cpu_power is updated periodically, except when we have a lot of newly_idle events.
We can certainly fix that.
Then, I have some use cases which have several running tasks but a low cpu load. In this case, the small tasks are spread across several cpus by load_balance, whereas they could easily be handled by one cpu without significant performance impact.
That shouldn't be done using cpu_power, we have sched_smt_power_savings and sched_mc_power_savings for stuff like that.
AFAICT, sched_mc assumes all cores have the same capacity - which is certainly true of the x86 architecture. But in ARM you can see hybrid cores[1] designed using different fab technology, so that some cores can run at 'n' GHz and some at 'm' GHz. The idea being that when there isn't much to do (e.g. periodic keep-alives for messaging, email, etc.) you don't wake up the higher power-consuming cores.
From TFA[1]: "Sheeva was already capable of 1.2GHz, but the new design can go up to 1.5GHz. But only two of the 628's Sheeva cores run at the full 1.5GHz. The third one is down-clocked to 624MHz, an interesting design choice that saves on power but adds some extra utility. In a sense, the 628 could be called a 2.5-core design."
Are we mistaken in thinking that sched_mc cannot currently handle this use case? How would we 'tune' sched_mc to do this without playing with cpu_power?
Although I would really like to kill all those different sched_*_power_savings knobs and reduce it to one.
If the cpu_power is higher than 1024, the cpu is no longer seen as out of capacity by load_balance as soon as a short process is running, and the main result is that the small tasks will stay on the same cpu. This configuration is mainly useful for an ARM dual-core system when we want to power gate one cpu. I use cyclictest to simulate such a use case.
Yeah, but that's wrong.
What is wrong - the use case simulation using cyclictest? Can you suggest better tools?
Regards, /Amit
[1] http://arstechnica.com/gadgets/news/2010/09/marvells-tri-core-chip-has-near-...
On Tue, 2011-10-11 at 15:08 +0530, Amit Kucheria wrote:
That shouldn't be done using cpu_power, we have sched_smt_power_savings and sched_mc_power_savings for stuff like that.
AFAICT, sched_mc assumes all cores have the same capacity - which is certainly true of the x86 architecture. But in ARM you can see hybrid cores[1] designed using different fab technology, so that some cores can run at 'n' GHz and some at 'm' GHz. The idea being that when there isn't much to do (e.g. periodic keep-alives for messaging, email, etc.) you don't wake up the higher power-consuming cores.
From TFA[1]: "Sheeva was already capable of 1.2GHz, but the new design can go up to 1.5GHz. But only two of the 628's Sheeva cores run at the full 1.5GHz. The third one is down-clocked to 624MHz, an interesting design choice that saves on power but adds some extra utility. In a sense, the 628 could be called a 2.5-core design."
Cute :-)
Are we mistaken in thinking that sched_mc cannot currently handle this use case? How would we 'tune' sched_mc to do this without playing with cpu_power?
Yeah, sched_mc wants some TLC there.
Although I would really like to kill all those different sched_*_power_savings knobs and reduce it to one.
If the cpu_power is higher than 1024, the cpu is no longer seen as out of capacity by load_balance as soon as a short process is running, and the main result is that the small tasks will stay on the same cpu. This configuration is mainly useful for an ARM dual-core system when we want to power gate one cpu. I use cyclictest to simulate such a use case.
Yeah, but that's wrong.
What is wrong - the use case simulation using cyclictest? Can you suggest better tools?
Using cpu_power to do power saving load-balancing like that.
So ideally cpu_power is simply a factor in the weight balance decision such that:
cpu_weight_i    cpu_weight_j
------------ ~= ------------
cpu_power_i     cpu_power_j
This yields that under sufficient[*] load, e.g. 5 equal-weight tasks on your 2.5-core thingy, you'd get a 2:2:1 split.
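To make the arithmetic concrete, here is a toy user-space sketch (illustrative only, not kernel code) that distributes 5 equal-weight tasks over the 1024:1024:512 "two and a half core" package in proportion to cpu_power, reproducing the 2:2:1 split:

#include <stdio.h>

int main(void)
{
	unsigned long power[] = { 1024, 1024, 512 };	/* the "2.5 core" */
	unsigned long total = 1024 + 1024 + 512;	/* 2560 */
	int nr_tasks = 5;

	for (int i = 0; i < 3; i++)			/* prints 2.0 2.0 1.0 */
		printf("cpu%d: %.1f tasks\n", i,
		       (double)nr_tasks * power[i] / total);
	return 0;
}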
The decision on what to do on under-utilized systems should be separate from this.
Currently the load-balancer doesn't know about 'short' running processes at all; we just have nr_running and weight, and it doesn't know/care how long those tasks will be around.
Now for some of the cgroup crap we track a time-weighted weight average, and pjt was talking about pulling that up into the normal code to get rid of our multitude of different ways to calculate actual load. [**]
(/me pokes pjt with a sharp stick, where those patches at!?)
But that only gets you half-way there, you also need to compute an effective time-weighted load per task to go with that.. now while all that is quite feasible, the problem is overhead. We very much already are way too expensive and should be cutting back, not keep adding more and more accounting.
[*] Sufficient such that the weight problem is feasible, e.g. 3 equal tasks on 2 equal cores can never be statically balanced, 2 unequal tasks on 2 equal cores (or v.v.) can't ever be balanced.
[**] I suspect this might solve the over-balancing problem triggered by tasks woken from the tick that also does the load-balance pass. This load-balance pass will run in softirq context and thus preempt running all those just-woken tasks, thus giving the impression the CPU is very busy, while in fact most of those tasks will instantly go back to sleep after finding nothing to do.
On 11 October 2011 11:13, Peter Zijlstra a.p.zijlstra@chello.nl wrote:
On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
I have several goals. The first one is that I need to put more load on some cpus when I have packages with different cpu frequencies.
That should be rather easy.
I agree. I was mainly wondering if I should use a [1-1024] or a [1024-xxxx] range, and it seems that both can be used: SMT uses <1024 and x86 turbo mode uses >1024.
I am also studying whether I can follow the real cpu frequency, but it seems not so easy.
Why not?
In fact, the problem is not really to follow the frequency but to be sure that update_group_power is called often enough by load_balance. The newly_idle event was also one of the main problems.
I have noticed that the cpu_power is updated periodically, except when we have a lot of newly_idle events.
We can certainly fix that.
That's good news.
Then, I have some use cases which have several running tasks but a low cpu load. In this case, the small tasks are spread across several cpus by load_balance, whereas they could easily be handled by one cpu without significant performance impact.
That shouldn't be done using cpu_power, we have sched_smt_power_savings and sched_mc_power_savings for stuff like that.
sched_mc_power_savings works fine when we have more than 2 cpus, but it can't apply on a dual core because it needs at least 2 sched_groups, and the nr_running of these sched_groups must be higher than 0 but smaller than group_capacity, which is 1 on a dual-core system.
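For reference, the group capacity gating this decision was derived roughly as below (a paraphrase of the era's load-balance logic, not the exact source). With cpu_power == 1024, a single-cpu group rounds to capacity 1, so one running task already makes it look full; inflating cpu_power well above 1024 is what bumps the capacity to 2:

/* paraphrased sketch: capacity = group power in whole 1024-units */
static inline unsigned int group_capacity(unsigned long group_power)
{
	return DIV_ROUND_CLOSEST(group_power, SCHED_POWER_SCALE);
}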
Although I would really like to kill all those different sched_*_power_savings knobs and reduce it to one.
If the cpu_power is higher than 1024, the cpu is no longer seen as out of capacity by load_balance as soon as a short process is running, and the main result is that the small tasks will stay on the same cpu. This configuration is mainly useful for an ARM dual-core system when we want to power gate one cpu. I use cyclictest to simulate such a use case.
Yeah, but that's wrong.
That's the only way I have found to gather small tasks without any relationship onto one cpu. Do you know a better solution?
Regards, Vincent
On Tue, 2011-10-11 at 11:40 +0200, Vincent Guittot wrote:
On 11 October 2011 11:13, Peter Zijlstra a.p.zijlstra@chello.nl wrote:
On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
I have several goals. The 1st one is that I need to put more load on some cpus when I have packages with different cpu frequency.
That should be rather easy.
I agree. I was mainly wondering if I should use a [1-1024] or a [1024-xxxx] range, and it seems that both can be used: SMT uses <1024 and x86 turbo mode uses >1024.
Well, turbo mode would typically only boost a cpu 25% or so, and only while idling other cores to keep under its thermal limit. So it's not sufficient to actually affect the capacity calculation much, if at all.
Then, I have some use cases which have several running tasks but a low cpu load. In this case, the small tasks are spread across several cpus by load_balance, whereas they could easily be handled by one cpu without significant performance impact.
That shouldn't be done using cpu_power, we have sched_smt_power_savings and sched_mc_power_savings for stuff like that.
sched_mc_power_savings works fine when we have more than 2 cpus, but it can't apply on a dual core because it needs at least 2 sched_groups, and the nr_running of these sched_groups must be higher than 0 but smaller than group_capacity, which is 1 on a dual-core system.
SD_POWERSAVINGS_BALANCE does /=2 to nr_running, effectively doubling the capacity iirc. And I know some IBM dudes were toying with the idea of playing tricks with the capacity numbers, but that never went anywhere.
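In snippet form, the trick Peter recalls amounts to something like this (illustrative, reconstructed from his description rather than from the actual source):

/* SD_POWERSAVINGS_BALANCE: halve the apparent task count, which
 * effectively doubles the group's capacity for packing purposes */
if (sd->flags & SD_POWERSAVINGS_BALANCE)
	nr_running /= 2;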
Although I would really like to kill all those different sched_*_power_savings knobs and reduce it to one.
If the cpu_power is higher than 1024, the cpu is no longer seen as out of capacity by load_balance as soon as a short process is running, and the main result is that the small tasks will stay on the same cpu. This configuration is mainly useful for an ARM dual-core system when we want to power gate one cpu. I use cyclictest to simulate such a use case.
Yeah, but that's wrong.
That's the only way I have found to gather small tasks without any relationship onto one cpu. Do you know a better solution?
How do you know the task is 'small' ?
For that you would need to track a time-weighted effective load average of the task and we don't have that.
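Such a per-task time-weighted load average would, in essence, be a decaying accumulator along these lines (a hypothetical sketch: the function name, the 1024 fixed-point scale and the simple halving decay are all illustrative, not an existing kernel API):

/*
 * On each update, decay the previous average and blend in the
 * fraction of wall time the task actually spent running.
 */
static u64 task_load_avg_update(u64 avg, u64 ran_us, u64 wall_us)
{
	u64 contrib = 0;

	if (wall_us)
		contrib = div64_u64(ran_us * 1024, wall_us); /* 0..1024 */

	/* geometric decay: half old history plus half the new window */
	return (avg >> 1) + (contrib >> 1);
}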
[ how bad is all this u64 math on ARM btw? and when will ARM finally agree all this 32bit nonsense is a waste of time and silicon? ]
But yeah, the whole nr_running vs capacity thing was traditionally to deal with spreading single tasks around. And traditional power aware scheduling was mostly about packing those on sockets (keeps other sockets idle) instead of spreading them around sockets (optimizes cache).
Now I wouldn't at all mind you ripping out all that sched_*_power_savings crap and replacing it, I doubt it actually works anyway. I haven't got many patches on the subject, and I know I don't have the equipment to measure power usage.
Also, the few patches I got mostly made the sched_*_power_savings mess bigger, which I refuse to do (what sysadmin wants a 27-state space to configure his power-aware scheduling?). This has mostly made people go away instead of fixing things up :-(
As to what the replacement would have to look like, dunno, it's not something I've really thought much about, but maybe the time-weighted stuff is the only sane approach, that combined with options on how to spread tasks (core, socket, node, etc..).
I really think changing the load-balancer is the right way to go about solving your power issue (hot-plugging a cpu really is an insane way to idle a core) and I'm open to discussing what would work for you.
All I really ask is to not cobble something together, the load-balancer is a horridly complex thing already and the last thing it needs is more special cases that don't interact properly.
On 11 October 2011 12:27, Peter Zijlstra a.p.zijlstra@chello.nl wrote:
On Tue, 2011-10-11 at 11:40 +0200, Vincent Guittot wrote:
On 11 October 2011 11:13, Peter Zijlstra a.p.zijlstra@chello.nl wrote:
On Tue, 2011-10-11 at 10:51 +0200, Vincent Guittot wrote:
I have several goals. The first one is that I need to put more load on some cpus when I have packages with different cpu frequencies.
That should be rather easy.
I agree. I was mainly wondering if I should use a [1-1024] or a [1024-xxxx] range, and it seems that both can be used: SMT uses <1024 and x86 turbo mode uses >1024.
Well, turbo mode would typically only boost a cpu 25% or so, and only while idling other cores to keep under its thermal limit. So it's not sufficient to actually affect the capacity calculation much, if at all.
OK
Then, I have some use cases which have several running tasks but a low cpu load. In this case, the small tasks are spread across several cpus by load_balance, whereas they could easily be handled by one cpu without significant performance impact.
That shouldn't be done using cpu_power, we have sched_smt_power_savings and sched_mc_power_savings for stuff like that.
sched_mc_power_savings works fine when we have more than 2 cpus, but it can't apply on a dual core because it needs at least 2 sched_groups, and the nr_running of these sched_groups must be higher than 0 but smaller than group_capacity, which is 1 on a dual-core system.
SD_POWERSAVINGS_BALANCE does /=2 to nr_running, effectively doubling the capacity iirc. And I know some IBM dudes were toying with the idea of playing tricks with the capacity numbers, but that never went anywhere.
Yes, but it's only a special case for 2 tasks on a dual core, and the SD_WAKE_AFFINE flag and cpu_idle_sibling can override this decision.
Although I would really like to kill all those different sched_*_power_savings knobs and reduce it to one.
If the cpu_power is higher than 1024, the cpu is no longer seen as out of capacity by load_balance as soon as a short process is running, and the main result is that the small tasks will stay on the same cpu. This configuration is mainly useful for an ARM dual-core system when we want to power gate one cpu. I use cyclictest to simulate such a use case.
Yeah, but that's wrong.
That's the only way I have found to gather small tasks without any relationship onto one cpu. Do you know a better solution?
How do you know the task is 'small' ?
I want to use cpufreq to be notified that we have a large/small cpu load. If we have several tasks but the cpu uses the lowest frequency, it "should" mean that the running tasks are small (less than 20ms * 95% of added duration) and we could gather them on one cpu (by increasing the cpu_power on a dual core).
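Sketched out, that heuristic would look something like a cpufreq transition notifier (purely illustrative: min_freq_khz, several_small_tasks() and set_pack_cpu_power() are assumed helpers, not existing APIs):

static int small_task_pack_notifier(struct notifier_block *nb,
				    unsigned long event, void *data)
{
	struct cpufreq_freqs *freqs = data;

	if (event != CPUFREQ_POSTCHANGE)
		return NOTIFY_OK;

	/* lowest OPP + several runnable tasks => assume they are small */
	if (freqs->new <= min_freq_khz && several_small_tasks(freqs->cpu))
		set_pack_cpu_power(0, 2 * SCHED_POWER_SCALE); /* pack on cpu0 */
	else
		set_pack_cpu_power(0, SCHED_POWER_SCALE);     /* spread again */

	return NOTIFY_OK;
}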
For that you would need to track a time-weighted effective load average of the task and we don't have that.
Yes, that's why I use cpufreq until a better option, like a time-weighted load average, is available.
[ how bad is all this u64 math on ARM btw? and when will ARM finally agree all this 32bit nonsense is a waste of time and silicon? ]
But yeah, the whole nr_running vs capacity thing was traditionally to deal with spreading single tasks around. And traditional power aware scheduling was mostly about packing those on sockets (keeps other sockets idle) instead of spreading them around sockets (optimizes cache).
Now I wouldn't at all mind you ripping out all that sched_*_power_savings crap and replacing it, I doubt it actually works anyway. I haven't got many patches on the subject, and I know I don't have the equipment to measure power usage.
Also, the few patches I got mostly made the sched_*_power_savings mess bigger, which I refuse to do (what sysad wants to have a 27-state space to configure his power aware scheduling). This has mostly made people go away instead of fixing things up :-(
As to what the replacement would have to look like, dunno, its not something I've really thought much about, but maybe the time-weighted stuff is the only sane approach, that combined with options on how to spread tasks (core, socket, node, etc..).
I really think changing the load-balancer is the right way to go about solving your power issue (hot-plugging a cpu really is an insane way to idle a core) and I'm open to discussing what would work for you.
Great. My first goal was not to modify the load-balancer and sched_mc (or to modify them as little as possible) and to study how I could tune the scheduler parameters to get the best power consumption on an ARM platform. Now, changing the load-balancer is probably a better solution.
All I really ask is to not cobble something together, the load-balancer is a horridly complex thing already and the last thing it needs is more special cases that don't interact properly.
On Tue, 2011-10-11 at 18:03 +0200, Vincent Guittot wrote:
How do you know the task is 'small' ?
I want to use cpufreq to be notified that we have a large/small cpu load. If we have several tasks but the cpu uses the lowest frequency, it "should" mean that the running tasks are small (less than 20ms * 95% of added duration) and we could gather them on one cpu (by increasing the cpu_power on a dual core).
For that you would need to track a time-weighted effective load average of the task and we don't have that.
Yes, that's why I use cpufreq until a better option, like a time-weighted load average, is available.
Egads... so basically you're (ab)using the ondemand cpufreq stats to get a guesstimate of the time-weighted load of the cpu, and then (ab)using the scheduler cpufreq hook to pump its capacity numbers.
No cookies for you.