Hi all,
Below are some thoughts and questions after reviewing EAS's energy model; my goal is to understand the energy model from the user's perspective, so the questions below focus _ONLY_ on the model and do not dig into the implementation.
This email is relatively long, but I think formulas will help us get on the same page quickly. So I first list the energy model's formulas, then try to match them against TC2's power data and raise some questions. I look forward to your suggestions and comments.
* Basic Energy and Power Calculation Formulas
From Documentation/scheduler/sched-energy.txt, we know that energy can be calculated as:

  Energy [j] = Power [w] * Time [s]                             (F.1)

Now assume a piece of code with a fixed number of instructions to be executed on a CPU; the execution time depends on the CPU's pipeline and the CPU's frequency. So F.1 can be converted to F.2:

                           Code [instructions]
  Energy [j] = Power [w] * ------------------------------
                           (Inst Per Cycle) * Frequency

                           Code [instructions]
             = Power [w] * ------------------------------      (F.2)
                           MIPS(f)
                             `-> 'f' is the frequency of the OPP

Because MIPS(f) can be normalized to the CPU's capacity at the corresponding OPP, we can simply convert F.2 to F.3:

                           Code [instructions]
  Energy [j] = Power [w] * ------------------------------      (F.3)
                           CPU_Capacity(f)
If we break down Power [w], we can split it into two parts: static (leakage) power and dynamic (switching) power:

  Power [w] = Ps [w] + Pd [w]                                   (F.4)

Static power can be calculated with the formula below:

  Ps [w] = i * V [v]                                            (F.5)
             `-> 'i' is a coefficient determined by the silicon process
                 V [v] is the voltage of the OPP

Dynamic power can be calculated with the formula below:

  Pd [w] = b * V [v] * V [v] * frequency                        (F.6)
             `-> 'b' is a coefficient determined by the silicon process
                 V [v] is the voltage of the OPP

There are two special cases. If the island's clock is gated, then Pd [w] = 0, so:

  Power [w] = Ps [w]                                            (F.7)

If the island is powered off, then Ps [w] = 0 and Pd [w] = 0, so:

  Power [w] = 0                                                 (F.8)

So energy can be calculated as (combining F.3 and F.4):

                                   Code [instructions]
  Energy [j] = (Ps [w] + Pd [w]) * ----------------------       (F.9)
                                   CPU_Capacity(f)
* Formulas for duty cycle
We separate the logic (cluster or CPU) into two states: P-state and C-state. They have different power data, because once the logic enters a C-state it is clock-gated or powered off. So if we look at a relatively long time window, we need to calculate the CPU's utilization percentage (a fully running CPU has util = 100%). Let's take the ratio between "Code [instructions]" and "CPU_Capacity(f)" as the utilization, so the energy calculation can be depicted as:

            Code [instructions]
  Util(f) = --------------------------                          (F.10)
            CPU_Capacity(f)

  Energy [j] = Power_Pstate [w] * Util(f)
             + Power_Cstate [w] * (1 - Util(f))                 (F.11)

  Energy [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))
             + Sum(i=0..MAX_IDLE)(Power_Cstate [w](i) * Util_IDLE(i))   (F.12)

  where: Sum(i=0..MAX_OPP)Util_OPP(i) + Sum(i=0..MAX_IDLE)Util_IDLE(i) = 1
* Formulas for clusters

  Energy [j] = Energy_cluster [j]
             + Sum(i=0..MAX_CPU_PER_CLUSTER)Energy_cpu(i) [j]   (F.13)

  Energy_cluster [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))
                     + Sum(i=WFI, ClusterOff)(Power_Cstate [w](i) * Util_IDLE(i))   (F.14)

  Energy_cpu [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))
                 + Sum(i=WFI, CPUOff)(Power_Cstate [w](i) * Util_IDLE(i))   (F.15)
* Thoughts and Questions
- Let's summarize EAS's energy model as below:

  CPU::capacity_state::power : CPU's power [w] at a specific OPP
      Power(OPP) = Ps [w] + Pd [w]

  CPU::idle_state::power : CPU's power [w] in a specific idle state
      Power(IDLE_WFI)    = Ps [w]
      Power(IDLE_CPUOff) = 0

  CPU's IDLE_WFI means the CPU is clock-gated, so it has static power but no dynamic power.

  CLUSTER::capacity_state::power : Cluster's power [w] at a specific OPP
      Power(OPP) = Ps [w] + Pd [w]

  CLUSTER::idle_state::power : Cluster's power [w] in a specific idle state
      Power(IDLE_WFI)    = Ps [w] + Pd [w]
      Power(IDLE_CLSOff) = 0

  Cluster's IDLE_WFI is quite special: it means all CPUs in the cluster have been powered off, but the cluster's logic (L2$, SCU, etc.) is powered on with the clock enabled, so it includes cluster-level static and dynamic power.

  Do these formulas match the original design?
- TC2's data for cluster sleep:

  static struct idle_state idle_states_cluster_a7[] = {
          { .power = 25 }, /* WFI */
          { .power = 10 }, /* cluster-sleep-l */
  };

  static struct idle_state idle_states_cluster_a15[] = {
          { .power = 70 }, /* WFI */
          { .power = 25 }, /* cluster-sleep-b */
  };

  For cluster-level sleep, the clock is gated and the domain is powered off, so both the dynamic and static power should be zero, right?
- TC2's data for CPU idle states:

  static struct idle_state idle_states_core_a7[] = {
          { .power = 0 }, /* WFI */
  };

  static struct idle_state idle_states_core_a15[] = {
          { .power = 0 }, /* WFI */
  };

  A CPU has two idle states, 'WFI' and 'C2'. In the 'WFI' state the power should not be zero, because 'WFI' means internal clock gating, so according to F.7 there should be static power.

  BTW, for TC2 there is no corresponding idle state for 'C2', which is weird. Could you confirm whether it has been deliberately removed?
- TC2's data for P-states:

  static struct capacity_state cap_states_cluster_a7[] = {
          /* Cluster only power */
          { .cap = 150, .power = 2967, }, /* 350 MHz */
          [...]
  };

  static struct capacity_state cap_states_core_a7[] = {
          /* Power per cpu */
          { .cap = 150, .power = 187, }, /* 350 MHz */
          [...]
  };

  From previous experience, the CPU-level power is much higher than the cluster-level power. For example, for CA7, if only the cluster is powered on (all CPUs in the cluster powered off), the power delta is ~10mA@156MHz; if one CPU is powered on, the power delta is about 30mA@156MHz. I also checked the data for CA53 and it shows a similar result.

  This conflicts with TC2's power data: there the cluster-level power is quite high (almost 15 times the CPU level). This would mean we can hardly benefit from CPU-level low power states, because the cluster level contributes most of the power consumption. This does not make sense.
- From formula F.4, we can compose the power from static and dynamic power; IPA also uses static/dynamic power to describe its energy model. But EAS uses another way, providing power data for every OPP and idle state. That means on one platform we need to provide two kinds of power data.

  IMHO, the static/dynamic decomposition is simpler, because usually we use (mW/MHz) to describe the power efficiency of a specific CPU. (mW/MHz) is not very accurate once the voltage changes (see formula F.6; usually the voltage is increased at higher frequencies), but with mW/MHz we can calculate the dynamic power in a very simple way, just by multiplying it with the frequency.

  So we would only need to provide the parameters below:
    P-state: static leakage, power efficiency (mW/MHz), capacity (DMIPS/MHz);
    C-state: static leakage, power efficiency (mW/MHz).
What are your thoughts on unifying the energy models?
Thanks,
Leo Yan
Hi Leo,
Thanks for sharing this excellent write-up. I'm tempted to suggest that we add this to the documentation.
On Thu, Sep 17, 2015 at 04:02:09PM +0100, Leo Yan wrote:
> [...]
>
>   Energy_cluster [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))
>                      + Sum(i=WFI, ClusterOff)(Power_Cstate [w](i) * Util_IDLE(i))   (F.14)
A minor detail here is that a cluster and/or cpu may be idle (from a utilization point of view) but not actually in an idle state (from a hardware point of view). For example, all the cpus may be in WFI or cpu_power_down while the cluster still has power and clock going. You point that out towards the end as well. For this reason, the model has to consider this idle, but not really idle, state too. I called it 'active idle' in the past.
> [...]
>
>   CPU::idle_state::power : CPU's power [w] in a specific idle state
>       Power(IDLE_WFI)    = Ps [w]
>       Power(IDLE_CPUOff) = 0
>
>   CPU's IDLE_WFI means the CPU is clock-gated, so it has static power
>   but no dynamic power.
Agreed, but if we imagine a state between WFI and CPUOff which powers down part of the cpu core, but not everything (like CPUOff does), it would consume

  Power(IDLE_CPUalmostOff) = a * Ps [w]
    -> a = ratio of transistors powered down.

F.5 assumes that all transistors are affected, which holds as long as all transistors in the power domains that we provide separate model data for (cpu core and cluster) are equally affected by each idle-state.

F.6 makes a similar assumption about the toggling rate of all transistors scaling linearly with the frequency. I think that one is probably fine for the model precision that we are after, but I haven't verified it against actual measurements.
>   CLUSTER::capacity_state::power : Cluster's power [w] at a specific OPP
>       Power(OPP) = Ps [w] + Pd [w]
>
>   CLUSTER::idle_state::power : Cluster's power [w] in a specific idle state
>       Power(IDLE_WFI)    = Ps [w] + Pd [w]
>       Power(IDLE_CLSOff) = 0
>
>   Cluster's IDLE_WFI is quite special: it means all CPUs in the cluster
>   have been powered off, but the cluster's logic (L2$, SCU, etc.) is
>   powered on with the clock enabled, so it includes cluster-level static
>   and dynamic power.
Right, this is the 'active idle' state I mentioned earlier.
> Do these formulas match the original design?
Very much, yes. The only difference is that in the current design I don't distinguish between static and dynamic power, so if you substitute Ps [w] + Pd [w] = P [w] it is the same.
> TC2's data for cluster sleep:
>
>   static struct idle_state idle_states_cluster_a7[] = {
>           { .power = 25 }, /* WFI */
>           { .power = 10 }, /* cluster-sleep-l */
>   };
>
>   static struct idle_state idle_states_cluster_a15[] = {
>           { .power = 70 }, /* WFI */
>           { .power = 25 }, /* cluster-sleep-b */
>   };
>
>   For cluster-level sleep, the clock is gated and the domain is powered
>   off, so both the dynamic and static power should be zero, right?
In an ideal world, yes. These numbers come from actual measurements using the TC2 energy counters so this is down to practical issues. Something must still be leaking while the cluster is off which is included in the power domain monitored by the counters, or the energy counter circuits may not be 100% accurate. We didn't tweak the numbers to make them fit theory ;-)
> TC2's data for CPU idle states:
>
>   static struct idle_state idle_states_core_a7[] = {
>           { .power = 0 }, /* WFI */
>   };
>
>   static struct idle_state idle_states_core_a15[] = {
>           { .power = 0 }, /* WFI */
>   };
>
>   A CPU has two idle states, 'WFI' and 'C2'. In the 'WFI' state the
>   power should not be zero, because 'WFI' means internal clock gating,
>   so according to F.7 there should be static power. BTW, for TC2 there
>   is no corresponding idle state for 'C2', which is weird. Could you
>   confirm whether it has been deliberately removed?
I assume that by 'C2' you mean CPUOff. You seem to be assuming that all cpus have WFI and CPUOff. This is not the case. TC2 has no CPUOff state, so it wasn't removed, it was never there :-) It only has WFI (clock-gating each individual core) and CLSOff (power down the entire cluster). We need to be able to handle those systems too, as well as systems with more per-cpu idle-states.
The WFI power is zero for practical reasons. It is not possible to derive the per-core WFI power with the energy counters. We can put all cpus into WFI and measure the cluster energy, which would be the result of F.13, but we have no way of figuring out how to decompose it into cluster and cpu energy contributions. We have to account for all the energy somewhere, so instead of assuming some arbitrary split between cluster and cpu energy, we assume that it is all cluster energy. Hence, the WFI power is accounted for in the cluster 'active idle' power.
IOW, it isn't missing, it is just accounted for somewhere else as we didn't have a way to figure out the true split between cluster and core.
Talking about idle-state representation. The current idle-state tables are quite confusing. We only have per-cpu states listed in the per-cpu tables, and per-cluster in the per-cluster tables (+ active idle). This is why we have WFI for the core tables and 'active idle' (WFI) + CLSOff for the cluster tables for TC2. I'm planning on changing that so we have the full list of states in all tables, but with zeros or repeated power numbers for states that don't affect the associated power domain.
> TC2's data for P-states:
>
>   static struct capacity_state cap_states_cluster_a7[] = {
>           /* Cluster only power */
>           { .cap = 150, .power = 2967, }, /* 350 MHz */
>           [...]
>   };
>
>   static struct capacity_state cap_states_core_a7[] = {
>           /* Power per cpu */
>           { .cap = 150, .power = 187, }, /* 350 MHz */
>           [...]
>   };
>
>   From previous experience, the CPU-level power is much higher than the
>   cluster-level power. For example, for CA7, if only the cluster is
>   powered on (all CPUs in the cluster powered off), the power delta is
>   ~10mA@156MHz; if one CPU is powered on, the power delta is about
>   30mA@156MHz. I also checked the data for CA53 and it shows a similar
>   result. This conflicts with TC2's power data: there the cluster-level
>   power is quite high (almost 15 times the CPU level). This would mean
>   we can hardly benefit from CPU-level low power states, because the
>   cluster level contributes most of the power consumption. This does
>   not make sense.
As said above, TC2 doesn't have a CPUOff state, which makes it really crippled in terms of power management. As soon as the cluster is powered up, all cores are sitting in WFI leaking (Ps) with caches being kept coherent and everything. As said above, we had to account for the core WFI power in the cluster active power (OPP), so it ends up becoming quite high.
So the numbers do make sense for TC2; it is just not a very well-designed SoC from a power management point of view. It was a very early test chip not designed for power management experiments at all, but it has really good power measurement infrastructure (energy counters), and everything is upstream and has been for years. Your previous experience has most likely been with more representative platforms, so I expect numbers for other platforms to be in line with your experience. Juno, which is also a test chip, is closer to what you describe but still not really representative of product grade SoCs; we don't have anything better with upstream support, though.
> From formula F.4, we can compose the power from static and dynamic
> power; IPA also uses static/dynamic power to describe its energy model.
> But EAS uses another way, providing power data for every OPP and idle
> state. That means on one platform we need to provide two kinds of power
> data.
>
> [...]
>
> So we would only need to provide the parameters below:
>   P-state: static leakage, power efficiency (mW/MHz), capacity (DMIPS/MHz);
>   C-state: static leakage, power efficiency (mW/MHz).
>
> What are your thoughts on unifying the energy models?
We want to unify the power models if at all possible. The IPA people are looking into it. The difficulty is that we are looking for different things, so the models have to capture enough detail to be useful for both.
Are you proposing to derive the individual P-state numbers from global numbers or do you propose to have the three parameters for each P-state in tables like we currently have them?
If you want to derive them from global numbers, you would need to compensate for voltage scaling for both Ps and Pd, so you would need the voltage for each state. Otherwise your energy efficiency will _improve_ as you increase frequency.
It might work. I think the first step is to see if the derived curves would correlate well with real measurements. We would need a way to derive static leakage and power efficiency from measurements. I don't know if that can be easily done. Do you have any suggestions for that?
Deriving the table data using F.5 and F.6 would mean that we can only model systems that follow those formulas reasonably well. The current tables are pure measurement data with a little bit of extrapolation to find the cluster power, which should be a bit more flexible. I'm not sure if that really matters, though.
Thanks,
Morten
Hi Morten,
Thanks for the review; please see my comments and some further questions below.
On Fri, Sep 18, 2015 at 05:57:48PM +0100, Morten Rasmussen wrote:
> Thanks for sharing this excellent write-up. I'm tempted to suggest that
> we add this to the documentation.
Glad it's helpful; feel free to use it if you want.
> On Thu, Sep 17, 2015 at 04:02:09PM +0100, Leo Yan wrote:
> > [...]
> >
> >   Energy_cluster [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))
> >                      + Sum(i=WFI, ClusterOff)(Power_Cstate [w](i) * Util_IDLE(i))   (F.14)
> A minor detail here is that a cluster and/or cpu may be idle (from a
> utilization point of view) but not actually in an idle state (from a
> hardware point of view). For example, all the cpus may be in WFI or
> cpu_power_down while the cluster still has power and clock going. You
> point that out towards the end as well. For this reason, the model has
> to consider this idle, but not really idle, state too. I called it
> 'active idle' in the past.
OK, 'active idle' is quite clear now; I'd like to discuss it further in the comments below.
> > [...]
> >
> >   CPU's IDLE_WFI means the CPU is clock-gated, so it has static power
> >   but no dynamic power.
> Agreed, but if we imagine a state between WFI and CPUOff which powers
> down part of the cpu core, but not everything (like CPUOff does), it
> would consume
>
>   Power(IDLE_CPUalmostOff) = a * Ps [w]
>     -> a = ratio of transistors powered down.
Totally agree that a CPU may have other extra idle states; for a common solution, we should not impose limitations on the idle states.
> F.5 assumes that all transistors are affected, which holds as long as
> all transistors in the power domains that we provide separate model
> data for (cpu core and cluster) are equally affected by each
> idle-state.
For one specific power state, whether it's a P-state or a C-state, we actually need to define it with three factors: voltage domain, power domain, and clock domain. Once these factors are well defined for a state, we can easily apply F.5/F.6.

So for cases like "IDLE_CPUalmostOff" and "IDLE_CPUOff", there must be some difference between them; for example, they may have different power domains but the same clock domain and voltage domain. Then we naturally calculate different power results for them.
> F.6 makes a similar assumption about the toggling rate of all
> transistors scaling linearly with the frequency. I think that one is
> probably fine for the model precision that we are after, but I haven't
> verified it against actual measurements.
To clarify: F.5 and F.6 should _NOT_ assume all transistors are affected; it totally depends on the definitions of the three domains above, and then these two formulas can be used correctly.

Usually when there are errors, it's very likely that we haven't defined these three domains clearly and have therefore introduced incorrect concepts.
> > [...]
> >
> >   Cluster's IDLE_WFI is quite special: it means all CPUs in the
> >   cluster have been powered off, but the cluster's logic (L2$, SCU,
> >   etc.) is powered on with the clock enabled, so it includes
> >   cluster-level static and dynamic power.
> Right, this is the 'active idle' state I mentioned earlier.
> > Do these formulas match the original design?
> Very much, yes. The only difference is that in the current design I
> don't distinguish between static and dynamic power, so if you
> substitute Ps [w] + Pd [w] = P [w] it is the same.
Got it; it's fine to just use the summed power data.
> > TC2's data for cluster sleep:
> >
> > [...]
> >
> >   For cluster-level sleep, the clock is gated and the domain is
> >   powered off, so both the dynamic and static power should be zero,
> >   right?
> In an ideal world, yes. These numbers come from actual measurements
> using the TC2 energy counters, so this is down to practical issues.
> Something must still be leaking while the cluster is off which is
> included in the power domain monitored by the counters, or the energy
> counter circuits may not be 100% accurate. We didn't tweak the numbers
> to make them fit theory ;-)
Makes sense; a little inaccuracy is acceptable.
> > TC2's data for CPU idle states:
> >
> > [...]
> >
> >   BTW, for TC2 there is no corresponding idle state for 'C2', which
> >   is weird. Could you confirm whether it has been deliberately
> >   removed?
> I assume that by 'C2' you mean CPUOff. You seem to be assuming that
> all cpus have WFI and CPUOff. This is not the case. TC2 has no CPUOff
> state, so it wasn't removed, it was never there :-) It only has WFI
> (clock-gating each individual core) and CLSOff (power down the entire
> cluster). We need to be able to handle those systems too, as well as
> systems with more per-cpu idle-states.
Now I know why TC2 has such power data.
> The WFI power is zero for practical reasons. It is not possible to
> derive the per-core WFI power with the energy counters. We can put all
> cpus into WFI and measure the cluster energy, which would be the result
> of F.13, but we have no way of figuring out how to decompose it into
> cluster and cpu energy contributions. We have to account for all the
> energy somewhere, so instead of assuming some arbitrary split between
> cluster and cpu energy, we assume that it is all cluster energy. Hence,
> the WFI power is accounted for in the cluster 'active idle' power.
>
> IOW, it isn't missing, it is just accounted for somewhere else, as we
> didn't have a way to figure out the true split between cluster and
> core.
Yes, it's hard to extract the power data independently for the cluster level and the core level. The main reason is that it's hard to get the delta value for WFI if the SoC doesn't support CPU power-off.

Just curious, would the steps below be feasible for measuring the WFI state on TC2?
- First measure the power when the cluster is powered off;
- Then power on CPU0 only and place it into "WFI":
  Power_Delta0 = cluster-level power + one CPU's "WFI";
- Then power on CPU1 as well:
  Power_Delta1 = cluster-level power + two CPUs' "WFI";
- So finally: "WFI" power = Power_Delta1 - Power_Delta0.

The key point is step 2: when one core is powered on, will the other cores in the same cluster automatically be powered on as well?
> Talking about idle-state representation. The current idle-state tables
> are quite confusing. We only have per-cpu states listed in the per-cpu
> tables, and per-cluster states in the per-cluster tables (+ active
> idle). This is why we have WFI for the core tables and 'active idle'
> (WFI) + CLSOff for the cluster tables for TC2. I'm planning on changing
> that so we have the full list of states in all tables, but with zeros
> or repeated power numbers for states that don't affect the associated
> power domain.
Here I think we should create a clear principle for the energy model and apply it. If we go back and review the "WFI" state, its power domain/voltage domain/clock domain are all at the CPU level, not the cluster level. So the most reasonable calculation for the 'active idle' state should be depicted as below:

  Energy [j] = Energy_cluster [j]
             + Sum(i=0..MAX_CPU_PER_CLUSTER)Energy_cpu(i) [j]

  Energy_cluster [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))

  where: Sum(i=0..MAX_OPP)Util_OPP(i) = 1

  Energy_cpu [j] = Power(IDLE_WFI)

So for the 'active idle' state, all CPUs stay in the "WFI" state, but the cluster level actually always stays in a P-state, not a C-state. This is decided by the fact that the cluster-level power domain/clock domain is always ON during 'active idle'.

But currently EAS considers the cluster level to be in an idle state for 'active idle', right?

So let's dig further into this question; it is really about how we look at the idle states. We need to create an idle voting mechanism for the different scheduling domain levels, and there should be a mechanism to fall back to the lower-level scheduling domain for idle state selection if the upper scheduling domain is in a P-state.
Below is a example for voting:
0: Power On state 1: idle state 1 2: idle state 2 ...
Example 1:
SCHED_DOMAIN (CPU) SCHED_DOMAIN(MC) CPU0 0 1 CPU1 0 1 CPU2 0 1 CPU3 0 1
So all 4 CPUs vote 0 for cluster level, means to power on cluster and all 4 CPUs run into idle state 1; finally scheduler can easily know for SCHED_DOMAIN (CPU) (or cluster level) is not in idle state, so it can rollback to SCHED_DOMAIN(MC) (or cpu level) to find correct idle state.
Example 2:
SCHED_DOMAIN (CPU) SCHED_DOMAIN(MC) CPU0 1 1 CPU1 1 1 CPU2 1 1 CPU3 0 1
3 CPUs vote 1 to power off cluster and 1 CPU votes 0 to power on cluster, finally scheduler can easily know for SCHED_DOMAIN (CPU) (or cluster level) the minimum vote is 0, means cluster will be powered on, so it will rollback to SCHED_DOMAIN(MC) (or cpu level) to find correct idle state for CPU level.
Example 3:
SCHED_DOMAIN (CPU) SCHED_DOMAIN(MC) CPU0 1 2 CPU1 1 2 CPU2 1 2 CPU3 0 1
3 CPUs vote 1 to power off cluster and 1 CPU votes 0 to power on cluster, finally scheduler can easily know for SCHED_DOMAIN (CPU) (or cluster level) the minimum vote is 0, means cluster will be powered on, so it will rollback to SCHED_DOMAIN(MC) (or cpu level) to find correct idle state for CPU level, Example 3 wants to demonstrate there have two different idle states for CPU level, so scheduler need to know the CPU will rollback to which exactly idle state for individual CPU.
TC2's data for P-state:
static struct capacity_state cap_states_cluster_a7[] = {
/* Cluster only power */ { .cap = 150, .power = 2967, }, /* 350 MHz */ [...] };
static struct capacity_state cap_states_core_a7[] = {
/* Power per cpu */ { .cap = 150, .power = 187, }, /* 350 MHz */ [...] };
From previous experience, the CPU level's power leakage is very higher than cluster level's leakage. For example, for CA7, if only power on cluster (all CPUs in cluster are powered off), the power delta is ~10mA@156MHz; if power on one CPUs, the power delta is about 30mA@156MHz. I also checked the data for CA53, it has similar result. So this is confilict with TC2's power data, you can see the cluster level's power leakage is quite high (almost 15 times than CPU level). This means almostly we cannot get much benefit from CPU level's low power state, due cluster level will contribute most of power consumption. This is not make sense.
As said above, TC2 doesn't have a CPUOff state which makes it really crippled in terms of power management. As soon as the cluster is power up, all cores are sitting in WFI leaking (Ps) with caches being kept coherent and everything. As said above, we had to account for the core WFI power in the cluster active power (OPP) so it ends up becoming quite high.
So the numbers do make sense for TC2, it is just not a very well-designed SoC from a power management point of view. It was a very early test chip not designed for power management experiments at all, but it has really good power measurement infrastructure (energy counters) and everything is upstream and has been that for years. Your previous experience has most likely been with more representative platforms, so I expect numbers for other platforms to be in line with your experience. Juno, which is also a test chip, is closer to what you describe but still not really representative for product grade SoCs, but we don't have anything better with upstream support.
So P-state's Power data for TC2 is actually below combination :)
CLUSTER::capacity_state::power Power_CLUSTER(OPP) = Cluster (Ps [w] + Pd [w]) + CPU (Ps [w]) * 4 `-> include 4 CPU's static leakage
CPU::capacity_state::power Power_CPU(OPP) = CPU (Pd [w])
[...]
Thanks, Leo Yan
On Mon, Sep 21, 2015 at 06:58:30AM +0100, Leo Yan wrote:
Hi Morten,
Thanks for review, please see below comments and further more questions.
On Fri, Sep 18, 2015 at 05:57:48PM +0100, Morten Rasmussen wrote:
Thanks for sharing this excellent write-up. I'm tempted to suggest that we add this to the documention.
Glad it's helpful and free to use it if you want.
On Thu, Sep 17, 2015 at 04:02:09PM +0100, Leo Yan wrote:
Hi all,
Below are some thoughts and questions after reviewing EAS's energy model; my purpose is to get a clear picture of the energy model from the user's perspective, so the questions below will _ONLY_ focus on the model and not dig into the implementation.
This email is relatively long, but I think if we use formulas we can easily get on the same page; so I list the energy model's formulas first, then based on them I try to match TC2's power data and bring up some questions. Looking forward to your suggestions and comments.
* Basic Energy and Power Calculation Formulas
From the doc Documentation/scheduler/sched-energy.txt, we can see that energy can be calculated with:
Energy [j] = Power [w] * Time [s] (F.1)
So let's assume there is a piece of code with a fixed number of instructions to be executed on a CPU; the execution duration depends on the CPU's pipeline and the CPU's frequency. So we can convert F.1 to F.2:

                                 Code [instructions]
    Energy [j] = Power [w] * ------------------------------
                              (Inst Per Cycle) * Frequency

                                 Code [instructions]
               = Power [w] * ------------------------------      (F.2)
                                      MIPS(f)
                 `-> 'f' is a factor of frequency
Because MIPS(f) can be normalized as the CPU's capacity corresponding to the OPP, we can simply convert F.2 to F.3:

                                 Code [instructions]
    Energy [j] = Power [w] * ------------------------------      (F.3)
                                  CPU_Capacity(f)
If we break down Power [w], we can split it into two parts: static (leakage) power and dynamic (switching) power:
Power [w] = Ps [w] + Pd [w] (F.4)
Static power can be calculated with the formula below:

    Ps [w] = i * V [v]                                           (F.5)
    `-> 'i' is a coefficient determined by the silicon process
        V [v] is the voltage for the OPP

Dynamic power can be calculated with the formula below:

    Pd [w] = b * V [v] * V [v] * frequency                       (F.6)
    `-> 'b' is a coefficient determined by the silicon process
        V [v] is the voltage for the OPP
There are two special cases here. If the island's clock is gated, then Pd [w] = 0, so:

    Power [w] = Ps [w]                                           (F.7)

If the island is powered off, then Ps [w] = 0 and Pd [w] = 0, so:

    Power [w] = 0                                                (F.8)
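The F.4-F.8 power model can be sketched in C; the coefficients below are purely illustrative stand-ins for the process-dependent 'i' and 'b', not real silicon data:

```c
/* Illustrative process coefficients; real values come from silicon
 * characterization, these are made up for the sketch. */
#define COEFF_STATIC	2.0	/* 'i' in F.5 */
#define COEFF_DYNAMIC	3.0	/* 'b' in F.6 */

/* F.5: static (leakage) power scales with voltage. */
static double power_static(double volt)
{
	return COEFF_STATIC * volt;
}

/* F.6: dynamic power scales with V^2 * f. */
static double power_dynamic(double volt, double freq)
{
	return COEFF_DYNAMIC * volt * volt * freq;
}

/* F.4, with the F.7/F.8 special cases. */
static double power_total(double volt, double freq,
			  int clock_gated, int powered_off)
{
	if (powered_off)	/* F.8: no leakage, no switching */
		return 0.0;
	if (clock_gated)	/* F.7: Pd = 0, only leakage remains */
		return power_static(volt);
	return power_static(volt) + power_dynamic(volt, freq);
}
```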
So energy can be calculated as (combining F.3 and F.4):

                                       Code [instructions]
    Energy [j] = (Ps [w] + Pd [w]) * ----------------------      (F.9)
                                         CPU_Capacity(f)
* Formulas for Duty Cycle

We separate the logic (cluster or CPU) into two states: P-state and C-state. They have different power data, because after the logic enters a C-state it is clock-gated or powered off. So if we expand the time axis over a relatively long period, we need to calculate the CPU's utilization percentage (util = 100% when the CPU is fully running). Let's simplify the ratio between "Code [instructions]" and "CPU_Capacity(f)" as the utilization, so the energy calculation can be depicted as:
                  Code [instructions]
    Util(f) = --------------------------                         (F.10)
                   CPU_Capacity(f)
Energy [j] = Power_Pstate [w] * Util(f) + Power_Cstate [w] * (1 - Util(f)) (F.11)
Energy [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i)) +
             Sum(i=0..MAX_IDL)(Power_Cstate [w](i) * Util_IDL(i))       (F.12)

    Sum(i=0..MAX_OPP)Util_OPP(i) + Sum(i=0..MAX_IDL)Util_IDL(i) = 1
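F.11/F.12 can be sketched as a residency-weighted sum; `struct state_power` and the numbers in the test are hypothetical, only the bookkeeping mirrors the formulas:

```c
/* One P-state (OPP) or C-state entry: its power and the fraction of
 * time (residency) spent in it.  All residencies must sum to 1. */
struct state_power {
	double power;	/* [w] */
	double util;	/* time fraction, 0..1 */
};

/* F.12 per unit time: average power is the residency-weighted sum
 * over all P-states and C-states. */
static double avg_power(const struct state_power *pstates, int nr_p,
			const struct state_power *cstates, int nr_c)
{
	double power = 0.0;
	int i;

	for (i = 0; i < nr_p; i++)
		power += pstates[i].power * pstates[i].util;
	for (i = 0; i < nr_c; i++)
		power += cstates[i].power * cstates[i].util;
	return power;
}
```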
* Formulas for Clusters

Energy [j] = Energy_cluster [j] + Sum(i=0..MAX_CPU_PER_CLUSTER)Energy_cpu(i) [j]    (F.13)

Energy_cluster [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i)) +
                     Sum(i=WFI, ClusterOff)(Power_Cstate [w](i) * Util_IDL(i))      (F.14)
A minor detail here is that a cluster and/or cpu may be idle (from a utilization point of view) but not actually in an idle state (from a hardware point of view). For example, all the cpus may be in WFI or cpu_power_down while the cluster still has power and clock going. You point that out towards the end as well. For this reason, the model has to consider this idle, but not really idle, state too. I called it 'active idle' in the past.
OK, now 'active idle' is quite clear. I'd like to discuss it further in the comments below.
Energy_cpu [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i)) + Sum(i=WFI, CPUOff)(Power_Cstate [w](i) * Util_IDL(i))
* Thoughts and Questions
Let's summarize EAS's energy model as below:

CPU::capacity_state::power : CPU's power [w] for a specific OPP
    Power(OPP) = Ps [w] + Pd [w]

CPU::idle_state::power : CPU's power [w] for a specific idle state
    Power(IDLE_WFI) = Ps [w]
    Power(IDLE_CPUOff) = 0

CPU's IDLE_WFI means the CPU is clock-gated, so it has static leakage but no dynamic power.
Agreed, but if we imagine a state between WFI and CPUOff which powers down part of the cpu core, but not everything (unlike CPUOff), it would consume

    Power(IDLE_CPUalmostOff) = a * Ps [w]
    `-> 'a' = ratio of transistors powered down.
Totally agree that a CPU may have other extra idle states; for a common solution, we should not impose limitations on the idle states.
F.5 assumes that all transistors are affected, which holds as long as all transistors in the power domains that we provide separate model data for (cpu core and cluster) are equally affected by each idle-state.
For one specific power state, whether it's a P-state or a C-state, we actually need to define it with three factors: voltage domain, power domain, and clock domain. Once we have defined these factors well for a state, we can easily apply F.5/F.6.

So just like the cases of "IDLE_CPUalmostOff" and "IDLE_CPUOff", there must be some difference between them; for example, they may have different power domains but the same clock domain and voltage domain. So naturally we can calculate different power results for them.
Agreed, if we capture all power domains in the model, applying F.5/F.6 isn't a problem as all transistors in the domain will be affected by definition. However, it does mean potentially having power domains which are more fine-grained than just one cpu core. It doesn't map well to our current model representation using the sched_domain hierarchy. Also, while it makes a lot of sense from a theoretical point of view, I'm not sure if we should worry about intra-core power domains. I don't see how we would define them beyond just being some interpolated factor which would basically be 'a' in the above formula. 'a' should be sufficient for what we want to do as well.
F.6 makes a similar assumption about the toggling rate of all transistors scaling linearly with the frequency. I think that one is probably fine for the model precision that we are after, but I haven't verified it using actual measurements.
To clarify: F.5 and F.6 do _NOT_ assume all transistors are affected; it totally depends on the definitions of the three domains above, which then let us use these two formulas correctly.
I meant all transistors as in all transistors in the power domain with our current definition of power domains (not the more fine-grained one that you are proposing). I think we agree, it is just two different ways of expressing the model (ratio/toggling rate vs fine-grained power and clock domains).
Usually if there are errors, it's very likely because we cannot define these three domains clearly and thus introduce incorrect concepts.
Yes. I think it might be difficult to define those domains correctly.
CLUSTER::capacity_state::power : Cluster's power [w] for a specific OPP
    Power(OPP) = Ps [w] + Pd [w]

CLUSTER::idle_state::power : Cluster's power [w] for a specific idle state
    Power(IDLE_WFI) = Ps [w] + Pd [w]
    Power(IDLE_CLSOff) = 0

Cluster's IDLE_WFI is quite special: it means all CPUs in the cluster have been powered off, but the cluster's logic (L2$ and SCU, etc.) is powered on and clocked, so it includes the cluster level's static and dynamic power.
Right, this is the 'active idle' state I mentioned earlier.
Are these formulas matching the original design?
Very much, yes. The only difference is that in the current design I don't distinguish between static and dynamic power, so if you substitute Ps [w] + Pd [w] = P [w] it is the same.
Got it, it's fine to just use summed power data.
TC2's data for cluster's sleep:
static struct idle_state idle_states_cluster_a7[] = {
	{ .power = 25 }, /* WFI */
	{ .power = 10 }, /* cluster-sleep-l */
};
static struct idle_state idle_states_cluster_a15[] = {
	{ .power = 70 }, /* WFI */
	{ .power = 25 }, /* cluster-sleep-b */
};
For cluster-level sleep, the clock is gated and the domain is powered off, so the dynamic and static leakage should both be zero, right?
In an ideal world, yes. These numbers come from actual measurements using the TC2 energy counters so this is down to practical issues. Something must still be leaking while the cluster is off which is included in the power domain monitored by the counters, or the energy counter circuits may not be 100% accurate. We didn't tweak the numbers to make them fit theory ;-)
Makes sense; some small inaccuracy is acceptable.
TC2's data for CPU's idle state:
static struct idle_state idle_states_core_a7[] = {
	{ .power = 0 }, /* WFI */
};
static struct idle_state idle_states_core_a15[] = {
	{ .power = 0 }, /* WFI */
};
A CPU has two idle states: 'WFI' and 'C2'. For the 'WFI' state the power should not be zero, because 'WFI' means internal clock gating, so according to F.7 there should be static leakage.

BTW, for TC2 there is no idle state corresponding to 'C2', which is weird. Could you confirm it has been deliberately removed?
I assume that by 'C2' you mean CPUOff. You seem to be assuming that all cpus have WFI and CPUOff. This is not the case. TC2 has no CPUOff state, so it wasn't removed, it was never there :-) It only has WFI (clock-gating each individual core) and CLSOff (power down the entire cluster). We need to be able to handle those systems too, as well as systems with more per-cpu idle-states.
Now I know why TC2 has such power data.
:-)
The WFI power is zero for practical reasons. It is not possible to derive the per-core WFI power with the energy counters. We can put all cpus into WFI and measure the cluster energy, which would be the result of F.13, but we have no way of figuring out how to decompose it into cluster and cpu energy contributions. We have to account for all the energy somewhere, so instead of assuming some arbitrary split between cluster and cpu energy, we assume that it is all cluster energy. Hence, the WFI power is accounted for in the cluster 'active idle' power.
IOW, it isn't missing, it is just accounted for somewhere else as we didn't have a way to figure out the true split between cluster and core.
Yes, it's hard to extract power data independently for the cluster level and core level. The main reason is that it is hard to get the delta value for WFI if the SoC doesn't support CPU power-off.
It is actually a generic problem: we can't derive the per-core idle power for the deepest per-core state. If we had CPUOff and WFI, we could measure the WFI-CPUOff delta, which would give us a non-zero WFI power. But we run into the same problem with measuring CPUOff as we currently have with WFI on TC2.
Also, the WFI-CPUOff delta isn't the true per-core WFI power, it is the delta on top of the CPUOff power which we can't measure. So the whole table of per-core idle-state power is offset such that CPUOff = 0 (or whatever the deepest per-core state is). As with the TC2 case it doesn't mean that the power is unaccounted for, it is just accounted for elsewhere (in the cluster power).
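The offsetting described above can be sketched as follows, assuming a per-core idle table ordered from shallowest to deepest state (the numbers in the test are made up): the deepest state's unmeasurable power is subtracted from every entry and re-accounted at cluster level.

```c
/* Offset a per-core idle-state power table so the deepest state reads
 * 0, moving the remainder into the cluster-level power, as described
 * for TC2 above.  Table is ordered shallowest..deepest. */
static void offset_idle_table(double *state_power, int nr_states,
			      double *cluster_power)
{
	double deepest = state_power[nr_states - 1];
	int i;

	*cluster_power += deepest;	/* account for it elsewhere */
	for (i = 0; i < nr_states; i++)
		state_power[i] -= deepest;
}
```

The total energy stays accounted for; only the split between the per-core table and the cluster power changes.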
Just curious, is it feasible to measure the WFI state on TC2 with the steps below?

- Firstly measure the power data when the cluster is powered off;

- Then power on CPU0 only, and place CPU0 into "WFI":
    Power_Delta0 = cluster level power + one CPU's "WFI" power;

- Then power on CPU1 as well, place both CPUs into "WFI", and get:
    Power_Delta1 = cluster level power + two CPUs' "WFI" power;

- So finally we can get: "WFI" power = Power_Delta1 - Power_Delta0;

The key point is step 2: when we power on one core, will the other cores in the same cluster automatically be powered on as well?
Unfortunately yes. We only have one physical power domain which spans the entire cluster, so you can only power up everything in one go. If you try tricks like hotplugging cpus out, they are just parked in WFI by the driver/firmware even though they are removed from an OS perspective. It is a limitation in the hardware which we can't work around.
Talking about idle-state representation. The current idle-state tables are quite confusing. We only have per-cpu states listed in the per-cpu tables, and per-cluster in the per-cluster tables (+ active idle). This is why we have WFI for the core tables and 'active idle' (WFI) + CLSOff for the cluster tables for TC2. I'm planning on changing that so we have the full list of states in all tables, but with zeros or repeated power numbers for states that don't affect the associated power domain.
Here I think we should create a clear principle for the energy model and apply it. If we go back and review the "WFI" state, its power domain/voltage domain/clock domain are all at the CPU level, not the cluster level. So the most reasonable calculation for the 'active idle' state should be depicted as below:

Energy [j] = Energy_cluster [j] + Sum(i=0..MAX_CPU_PER_CLUSTER)Energy_cpu(i) [j]

Energy_cluster [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))

    Sum(i=0..MAX_OPP)Util_OPP(i) = 1

Energy_cpu(i) [j] = Power(IDLE_WFI)

So that means for the 'active idle' state all cpus stay in "WFI", but the cluster level actually always stays in a P-state, not a C-state. This follows from the cluster level's power/clock domains always being ON for 'active idle'.

But currently EAS considers 'active idle' a cluster-level idle state, right?
Yes, but it isn't easy to generalize based on the TC2 model due to the limitations of TC2. From a model point of view we want to know which state the cpu/cluster is in: running or idling. The C-states represent the hardware-supported idle-states (controlling clock and/or power). An idle cluster or core may idle in one of these states or sit idle with everything powered up and clocked. The latter is 'active idle'. A cluster may be active idle if all the cpus are idling in some per-cpu idle-state and the cpuidle governor has chosen to leave it powered up (possibly due to target residency constraints). The same could in theory be the case for a cpu core. It could be spinning in the idle loop if cpuidle didn't decide to enter a C-state. On ARM, WFI is practically free to enter, so we always enter a proper hardware idle-state whenever we are idle, even if it is only for a single clock cycle. Hence, we would never be active idling an ARM cpu, so WFI takes the role of active idle in this case. If WFI had a target_residency that would prevent cpus from entering it and leave them spinning, we would need an active idle state for the cpus as well.
In the model we treat active idle as an idle state despite the cpu/cluster being fully operational and running. The reason for this is that even though we are in some P-state, we aren't actually doing anything useful and the power consumption is likely to be very different from when we are busy. In the cluster active idle case, all the cpus are idling, which means nobody is accessing caches and memory hence the transistor toggling is very limited (though it might be affected by snooping traffic if another cluster is busy). If we used the busy P-state power, we would vastly over-estimate the active idle power for the cluster in most cases. In the cpu case (if we weren't guaranteed to enter WFI), we would be spinning in some simple loop that probably wouldn't exercise the entire cpu core and hopefully use a little less power (no cache access and expensive instructions).
Since we are technically running when active idling, one could argue that we should have an active idle power number for each P-state. For ARM that isn't an issue for per-core idling as we have WFI. For clusters we may want to consider it.
The short answer is: In active idle the cpu/cluster is in a P-state doing nothing. We can make WFI the active idle state per-core (cpu) on ARM as we are guaranteed to enter it when the cpu is idle.
So let's dig into this question further; actually it is closely related to how we look at the idle states. We need to create an idle voting mechanism for the different sched_domain levels, and there should be a mechanism to fall back to the lower sched_domain level for idle-state selection if the upper sched_domain is in a P-state.

Below is an example of the voting:

0: power-on state
1: idle state 1
2: idle state 2
...
Example 1:
        SCHED_DOMAIN (CPU)    SCHED_DOMAIN (MC)
CPU0            0                     1
CPU1            0                     1
CPU2            0                     1
CPU3            0                     1

So all 4 CPUs vote 0 at the cluster level, which means the cluster is powered on and all 4 CPUs go into idle state 1; the scheduler can then easily see that SCHED_DOMAIN (CPU) (the cluster level) is not in an idle state, so it falls back to SCHED_DOMAIN (MC) (the cpu level) to find the correct idle state.
Example 2:
        SCHED_DOMAIN (CPU)    SCHED_DOMAIN (MC)
CPU0            1                     1
CPU1            1                     1
CPU2            1                     1
CPU3            0                     1

3 CPUs vote 1 to power off the cluster and 1 CPU votes 0 to power on the cluster; the scheduler can then easily see that the minimum vote for SCHED_DOMAIN (CPU) (the cluster level) is 0, meaning the cluster will be powered on, so it falls back to SCHED_DOMAIN (MC) (the cpu level) to find the correct idle state for the CPU level.
Example 3:
        SCHED_DOMAIN (CPU)    SCHED_DOMAIN (MC)
CPU0            1                     2
CPU1            1                     2
CPU2            1                     2
CPU3            0                     1

As in Example 2, 3 CPUs vote 1 to power off the cluster and 1 CPU votes 0 to power it on, so the minimum vote for SCHED_DOMAIN (CPU) (the cluster level) is 0, the cluster stays powered on, and we fall back to SCHED_DOMAIN (MC) (the cpu level) for the correct idle state. Example 3 demonstrates that there can be two different idle states at the CPU level, so the scheduler needs to know exactly which idle state each individual CPU falls back to.
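The voting in the three examples boils down to a minimum over the per-cpu votes (0 = power on, higher = deeper idle). A sketch of the idea, not actual EAS code:

```c
#define NR_CPUS 4	/* cpus per cluster in the examples above */

/* The cluster can only enter the shallowest state any cpu asked for,
 * so the cluster-level result is the minimum vote.  A result of 0
 * means the cluster stays powered on and each cpu falls back to its
 * own MC-level vote for its idle state. */
static int cluster_vote(const int votes[NR_CPUS])
{
	int min = votes[0];
	int i;

	for (i = 1; i < NR_CPUS; i++)
		if (votes[i] < min)
			min = votes[i];
	return min;
}
```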
Yes, we basically have to redo the voting already done in cpuidle to figure out the actual idle-states. We do already have a function that tries to figure out the group idle state based on the idle-states requested by the cpus in the group. I'm not a big fan of it, as idle-states may change very frequently, so whatever we just computed might already be wrong a couple of cycles later. I think we should consider looking at the average idle-state instead and see if that makes sense.
TC2's data for P-state:
static struct capacity_state cap_states_cluster_a7[] = {
	/* Cluster only power */
	{ .cap = 150, .power = 2967, }, /* 350 MHz */
	[...]
};
static struct capacity_state cap_states_core_a7[] = {
	/* Power per cpu */
	{ .cap = 150, .power = 187, }, /* 350 MHz */
	[...]
};
From previous experience, the CPU level's power leakage is much higher than the cluster level's. For example, on CA7, if only the cluster is powered on (all CPUs in the cluster powered off), the power delta is ~10mA@156MHz; if one CPU is powered on, the power delta is about 30mA@156MHz. I also checked the data for CA53; it has similar results.

So this conflicts with TC2's power data, where the cluster level's power leakage is quite high (almost 15 times the CPU level's). It means we can hardly get much benefit from CPU-level low power states, since the cluster level contributes most of the power consumption. This does not make sense.
As said above, TC2 doesn't have a CPUOff state, which makes it really crippled in terms of power management. As soon as the cluster is powered up, all cores are sitting in WFI leaking (Ps) with caches being kept coherent and everything. As said above, we had to account for the core WFI power in the cluster active power (OPP), so it ends up becoming quite high.
So the numbers do make sense for TC2; it is just not a very well-designed SoC from a power management point of view. It was a very early test chip not designed for power management experiments at all, but it has really good power measurement infrastructure (energy counters), and everything is upstream and has been for years. Your previous experience has most likely been with more representative platforms, so I expect numbers for other platforms to be in line with your experience. Juno, which is also a test chip, is closer to what you describe but still not really representative of product-grade SoCs, but we don't have anything better with upstream support.
So the P-state power data for TC2 is actually the combination below :)

CLUSTER::capacity_state::power
    Power_CLUSTER(OPP) = Cluster (Ps [w] + Pd [w]) + CPU (Ps [w]) * 4
                         `-> includes 4 CPUs' static leakage
Yes :-)
CPU::capacity_state::power
    Power_CPU(OPP) = CPU (Pd [w])
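This bookkeeping can be checked numerically. The Ps/Pd split below is hypothetical (TC2's counters can't separate them); only the totals are taken from the TC2 tables (2967 and 187 at 350 MHz):

```c
/* Hypothetical decomposition of one TC2 A7 OPP (350 MHz).  The split
 * between static and dynamic parts is invented; only the sums match
 * the published cap_states tables. */
struct opp_power {
	double cluster_ps, cluster_pd;	/* cluster static/dynamic */
	double cpu_ps, cpu_pd;		/* per-core static/dynamic */
};

/* What lands in CLUSTER::capacity_state::power: cluster Ps + Pd plus
 * the four cores' static leakage. */
static double table_cluster_power(const struct opp_power *p)
{
	return p->cluster_ps + p->cluster_pd + 4.0 * p->cpu_ps;
}

/* What lands in CPU::capacity_state::power: the dynamic part only. */
static double table_cpu_power(const struct opp_power *p)
{
	return p->cpu_pd;
}
```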
Thanks, Morten
On Mon, Sep 21, 2015 at 05:31:37PM +0100, Morten Rasmussen wrote:
On Mon, Sep 21, 2015 at 06:58:30AM +0100, Leo Yan wrote:
On Fri, Sep 18, 2015 at 05:57:48PM +0100, Morten Rasmussen wrote:
[...]
Energy_cpu [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i)) + Sum(i=WFI, CPUOff)(Power_Cstate [w](i) * Util_IDL(i))
Thoughts and Questions
Let's summarize EAS's energy model as below:

CPU::capacity_state::power : CPU's power [w] for a specific OPP
    Power(OPP) = Ps [w] + Pd [w]

CPU::idle_state::power : CPU's power [w] for a specific idle state
    Power(IDLE_WFI) = Ps [w]
    Power(IDLE_CPUOff) = 0

CPU's IDLE_WFI means the CPU is clock-gated, so it has static leakage but no dynamic power.
Agreed, but if we imagine a state between WFI and CPUOff which powers down part of the cpu core, but not everything (unlike CPUOff), it would consume

    Power(IDLE_CPUalmostOff) = a * Ps [w]
    `-> 'a' = ratio of transistors powered down.
Totally agree that a CPU may have other extra idle states; for a common solution, we should not impose limitations on the idle states.
F.5 assumes that all transistors are affected, which holds as long as all transistors in the power domains that we provide separate model data for (cpu core and cluster) are equally affected by each idle-state.
For one specific power state, whether it's a P-state or a C-state, we actually need to define it with three factors: voltage domain, power domain, and clock domain. Once we have defined these factors well for a state, we can easily apply F.5/F.6.

So just like the cases of "IDLE_CPUalmostOff" and "IDLE_CPUOff", there must be some difference between them; for example, they may have different power domains but the same clock domain and voltage domain. So naturally we can calculate different power results for them.
Agreed, if we capture all power domains in the model, applying F.5/F.6 isn't a problem as all transistors in the domain will be affected by definition. However, it does mean potentially having power domains which are more fine-grained than just one cpu core. It doesn't map well to our current model representation using the sched_domain hierarchy. Also, while it makes a lot of sense from a theoretical point of view, I'm not sure if we should worry about intra-core power domains. I don't see how we would define them beyond just being some interpolated factor which would basically be 'a' in the above formula. 'a' should be sufficient for what we want to do as well.
Understood, it's hard to handle the 'a' issue, especially if we want to simplify the energy model parameters.
[...]
The WFI power is zero for practical reasons. It is not possible to derive the per-core WFI power with the energy counters. We can put all cpus into WFI and measure the cluster energy, which would be the result of F.13, but we have no way of figuring out how to decompose it into cluster and cpu energy contributions. We have to account for all the energy somewhere, so instead of assuming some arbitrary split between cluster and cpu energy, we assume that it is all cluster energy. Hence, the WFI power is accounted for in the cluster 'active idle' power.
IOW, it isn't missing, it is just accounted for somewhere else as we didn't have a way to figure out the true split between cluster and core.
Yes, it's hard to extract power data independently for the cluster level and core level. The main reason is that it is hard to get the delta value for WFI if the SoC doesn't support CPU power-off.
It is actually a generic problem: we can't derive the per-core idle power for the deepest per-core state. If we had CPUOff and WFI, we could measure the WFI-CPUOff delta, which would give us a non-zero WFI power. But we run into the same problem with measuring CPUOff as we currently have with WFI on TC2.
Also, the WFI-CPUOff delta isn't the true per-core WFI power, it is the delta on top of the CPUOff power which we can't measure. So the whole table of per-core idle-state power is offset such that CPUOff = 0 (or whatever the deepest per-core state is). As with the TC2 case it doesn't mean that the power is unaccounted for, it is just accounted for elsewhere (in the cluster power).
Yes, exactly.
Just curious, is it feasible to measure the WFI state on TC2 with the steps below?

- Firstly measure the power data when the cluster is powered off;

- Then power on CPU0 only, and place CPU0 into "WFI":
    Power_Delta0 = cluster level power + one CPU's "WFI" power;

- Then power on CPU1 as well, place both CPUs into "WFI", and get:
    Power_Delta1 = cluster level power + two CPUs' "WFI" power;

- So finally we can get: "WFI" power = Power_Delta1 - Power_Delta0;

The key point is step 2: when we power on one core, will the other cores in the same cluster automatically be powered on as well?
Unfortunately yes. We only have one physical power domain which spans the entire cluster, so you can only power up everything in one go. If you try tricks like hotplugging cpus out, they are just parked in WFI by the driver/firmware even though they are removed from an OS perspective. It is a limitation in the hardware which we can't work around.
OK.
Talking about idle-state representation. The current idle-state tables are quite confusing. We only have per-cpu states listed in the per-cpu tables, and per-cluster in the per-cluster tables (+ active idle). This is why we have WFI for the core tables and 'active idle' (WFI) + CLSOff for the cluster tables for TC2. I'm planning on changing that so we have the full list of states in all tables, but with zeros or repeated power numbers for states that don't affect the associated power domain.
Here I think we should create a clear principle for the energy model and apply it. If we go back and review the "WFI" state, its power domain/voltage domain/clock domain are all at the CPU level, not the cluster level. So the most reasonable calculation for the 'active idle' state should be depicted as below:

Energy [j] = Energy_cluster [j] + Sum(i=0..MAX_CPU_PER_CLUSTER)Energy_cpu(i) [j]

Energy_cluster [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))

    Sum(i=0..MAX_OPP)Util_OPP(i) = 1

Energy_cpu(i) [j] = Power(IDLE_WFI)

So that means for the 'active idle' state all cpus stay in "WFI", but the cluster level actually always stays in a P-state, not a C-state. This follows from the cluster level's power/clock domains always being ON for 'active idle'.

But currently EAS considers 'active idle' a cluster-level idle state, right?
Yes, but it isn't easy to generalize based on the TC2 model due to the limitations of TC2. From a model point of view we want to know which state the cpu/cluster is in: running or idling. The C-states represent the hardware-supported idle-states (controlling clock and/or power). An idle cluster or core may idle in one of these states or sit idle with everything powered up and clocked. The latter is 'active idle'. A cluster may be active idle if all the cpus are idling in some per-cpu idle-state and the cpuidle governor has chosen to leave it powered up (possibly due to target residency constraints). The same could in theory be the case for a cpu core. It could be spinning in the idle loop if cpuidle didn't decide to enter a C-state. On ARM, WFI is practically free to enter, so we always enter a proper hardware idle-state whenever we are idle, even if it is only for a single clock cycle. Hence, we would never be active idling an ARM cpu, so WFI takes the role of active idle in this case. If WFI had a target_residency that would prevent cpus from entering it and leave them spinning, we would need an active idle state for the cpus as well.
In the model we treat active idle as an idle state despite the cpu/cluster being fully operational and running. The reason for this is that even though we are in some P-state, we aren't actually doing anything useful and the power consumption is likely to be very different from when we are busy. In the cluster active idle case, all the cpus are idling, which means nobody is accessing caches and memory hence the transistor toggling is very limited (though it might be affected by snooping traffic if another cluster is busy). If we used the busy P-state power, we would vastly over-estimate the active idle power for the cluster in most cases. In the cpu case (if we weren't guaranteed to enter WFI), we would be spinning in some simple loop that probably wouldn't exercise the entire cpu core and hopefully use a little less power (no cache access and expensive instructions).
Since we are technically running when active idling, one could argue that we should have an active idle power number for each P-state. For ARM that isn't an issue for per-core idling as we have WFI. For clusters we may want to consider it.
The short answer is: In active idle the cpu/cluster is in a P-state doing nothing. We can make WFI the active idle state per-core (cpu) on ARM as we are guaranteed to enter it when the cpu is idle.
Agreed, but I have two concerns:

- If we take the cluster's 'active idle' as an idle state, that means Pd [w] is totally ignored for it; whatever frequency the cluster level is running at, the dynamic power will be ignored.

Below is some power data measured on CA7 in 'active idle':

  CPUFreq@156MHz:  11mA
  CPUFreq@312MHz:  28mA
  CPUFreq@624MHz:  36mA
  CPUFreq@800MHz:  45mA
  CPUFreq@1100MHz: 56mA

So in practice, if we use the lowest frequency's number for the cluster's 'active idle', there will be some deviation if the cluster is actually running at the highest frequency.
- There may be more than one kind of 'active idle' state for a cluster; for example, all cores in the cluster entering the 'WFI' state gives one corresponding 'active idle' state, and all cores entering the 'CPUOFF' state gives another. Should we handle these two kinds of 'active idle' state as the same one?

Furthermore, if one CPU only enters 'WFI' while the other CPUs in the cluster enter 'CPUOFF', how do we select the 'active idle' state?
If we change to treat the 'active idle' state as a cluster-level P-state, the above issues are easily dismissed.
Thanks, Leo Yan
On Tue, Sep 22, 2015 at 08:44:40PM +0100, Leo Yan wrote:
On Mon, Sep 21, 2015 at 05:31:37PM +0100, Morten Rasmussen wrote:
On Mon, Sep 21, 2015 at 06:58:30AM +0100, Leo Yan wrote:
On Fri, Sep 18, 2015 at 05:57:48PM +0100, Morten Rasmussen wrote:
Talking about idle-state representation. The current idle-state tables are quite confusing. We only have per-cpu states listed in the per-cpu tables, and per-cluster in the per-cluster tables (+ active idle). This is why we have WFI for the core tables and 'active idle' (WFI) + CLSOff for the cluster tables for TC2. I'm planning on changing that so we have the full list of states in all tables, but with zeros or repeated power numbers for states that don't affect the associated power domain.
Here I think we should create a clear principle for the energy model and apply it. If we go back and review the "WFI" state, its power domain/voltage domain/clock domain are all at the CPU level, not the cluster level. So the most reasonable calculation for the 'active idle' state should be depicted as below:
  Energy [j] = Energy_cluster [j] + Sum(i=0..MAX_CPU_PER_CLUSTER) Energy_cpu(i) [j]

  Energy_cluster [j] = Time [s] * Sum(i=0..MAX_OPP) (Power_Pstate(i) [w] * Util_OPP(i))
    `-> Util_OPP(i) is the fraction of time spent at OPP i,
        so Sum(i=0..MAX_OPP) Util_OPP(i) = 1

  Energy_cpu(i) [j] = Power(IDLE_WFI) [w] * Time [s]
So that means for the 'active idle' state, all cpus stay in the "WFI" state, but the cluster level actually always stays in a P-state, not a C-state. This follows from the cluster level's power domain/clock domain always being ON during 'active idle'.
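The decomposition above can be sketched as follows (a minimal illustration with hypothetical names, not actual EAS code):

```python
# Sketch of: Energy = Energy_cluster + Sum(Energy_cpu(i))
# Cluster energy is the utilization-weighted P-state power over time;
# each idling cpu contributes its WFI power over the same period.

def cluster_energy(pstate_power_w, opp_util, time_s):
    # opp_util[i] is the fraction of time spent at OPP i; fractions sum to 1.
    assert abs(sum(opp_util) - 1.0) < 1e-9
    avg_power = sum(p * u for p, u in zip(pstate_power_w, opp_util))
    return avg_power * time_s  # Energy [J] = Power [W] * Time [s] (F.1)

def total_energy(pstate_power_w, opp_util, wfi_power_w, n_cpus, time_s):
    e_cluster = cluster_energy(pstate_power_w, opp_util, time_s)
    e_cpus = n_cpus * wfi_power_w * time_s  # each cpu sits in WFI
    return e_cluster + e_cpus
```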
But now EAS considers the cluster level to be in an idle state for 'active idle', right?
Yes, but it isn't easy to generalize based on the TC2 model due to the limitations of TC2. From a model point of view we want to know which state the cpu/cluster is in: running or idling. The C-states represent the hardware-supported idle-states (controlling clock and/or power). An idle cluster or core may idle in one of these states or sit idle with everything powered up and clocked. The latter is 'active idle'. A cluster may be active idle if all the cpus are idling in some per-cpu idle-state and the cpuidle governor has chosen to leave it powered up (possibly due to target residency constraints). The same could in theory be the case for a cpu core. It could be spinning in the idle loop if cpuidle didn't decide to enter a C-state. On ARM, WFI is practically free to enter, so we always enter a proper hardware idle-state whenever we are idle, even if it is only for a single clock cycle. Hence, we would never be active idling an ARM cpu, so WFI takes the role of active idle in this case. If WFI had a target_residency that would prevent cpus from entering it and leave them spinning, we would need an active idle state for the cpus as well.
In the model we treat active idle as an idle state despite the cpu/cluster being fully operational and running. The reason for this is that even though we are in some P-state, we aren't actually doing anything useful and the power consumption is likely to be very different from when we are busy. In the cluster active idle case, all the cpus are idling, which means nobody is accessing caches and memory hence the transistor toggling is very limited (though it might be affected by snooping traffic if another cluster is busy). If we used the busy P-state power, we would vastly over-estimate the active idle power for the cluster in most cases. In the cpu case (if we weren't guaranteed to enter WFI), we would be spinning in some simple loop that probably wouldn't exercise the entire cpu core and hopefully use a little less power (no cache access and expensive instructions).
Since we are technically running when active idling, one could argue that we should have an active idle power number for each P-state. For ARM that isn't an issue for per-core idling as we have WFI. For clusters we may want to consider it.
I should add that the P-state influence does not go away entirely for cores when they enter WFI. Ps (F.5) is still there since WFI is only clock gating, so the voltage of the P-state still has an effect. It isn't voltage squared, so I'm not sure if it is really a problem.
The short answer is: In active idle the cpu/cluster is in a P-state doing nothing. We can make WFI the active idle state per-core (cpu) on ARM as we are guaranteed to enter it when the cpu is idle.
Agreed, but I have two concerns:

- If we take the cluster's 'active idle' as an idle state, that means Pd [w] is totally ignored for it; whatever frequency the cluster level is running at, the dynamic power will be ignored.
I wouldn't say we totally ignore Pd, we measure the total power P = Ps + Pd, but I agree with you that Pd depends on the P-state in which we are active idling. As I just added above, the same is also true for Ps (F.5). It is just worse for Pd (F.6) as it has voltage squared.
Below is some power data measured on CA7 in 'active idle':

  CPUFreq@156MHz:  11mA
  CPUFreq@312MHz:  28mA
  CPUFreq@624MHz:  36mA
  CPUFreq@800MHz:  45mA
  CPUFreq@1100MHz: 56mA

So in practice, if we use the lowest frequency's number for the cluster's 'active idle', there will be some deviation if the cluster is actually running at the highest frequency.
Yes, that is quite a difference, around 5x. The question is whether it actually affects the scheduling decisions if we include this in the model, or if we can get away with just picking something in the middle, like 36mA. If we pick 36mA, we would overestimate energy expense of idling the cluster in low-utilization scenarios, and under-estimate in high-utilization scenarios. I think it could give some strange results if active idling turns out to consume more energy than being busy for the lowest P-states. I can't come up with a scenario where it is a problem though. More thinking is needed I think.
If it turns out that we need to capture active idle more accurately in the model, we could extend the P-state table to have idle-power numbers for each state in addition to the busy power. We would need a special case in the idle energy calculation to use those numbers instead when we are in active idle and use the C-state data when we are in a true hardware idle state.
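A rough sketch of that extension (the table layout and names are invented for illustration; the active-idle currents loosely follow the CA7 numbers quoted earlier, the busy numbers are made up):

```python
# Hypothetical P-state table extended with per-P-state active-idle power,
# plus the special case in the idle energy calculation described above.

PSTATES = [  # (freq_mhz, busy_current_ma, active_idle_current_ma) - example values
    (156,  110, 11),
    (1100, 500, 56),
]

CSTATES = {"cluster-off": 0}  # true hardware idle-states keep their own table

def idle_power(active_idle, pstate_idx, cstate=None):
    # Special case: in active idle we are technically in a P-state, so use
    # that P-state's idle number instead of any C-state number.
    if active_idle:
        return PSTATES[pstate_idx][2]
    return CSTATES[cstate]
```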
There may be more than one kind of 'active idle' state for a cluster; for example, all cores in the cluster entering the 'WFI' state gives one corresponding 'active idle' state, and all cores entering the 'CPUOFF' state gives another. Should we handle these two kinds of 'active idle' state as the same one?

Furthermore, if one CPU only enters 'WFI' while the other CPUs in the cluster enter 'CPUOFF', how do we select the 'active idle' state?
Wouldn't it primarily affect the core energy consumption? I would associate the energy delta between all WFI and all CPUOff with the cores and not the cluster as I would have thought it was caused by powering off the cores. The cluster logic would be on and clocked in both cases and since the cores are idling they shouldn't cause any (different) Pd for the cluster in the two cases. Why would the selected core idle-state affect the cluster? Do you have an example?
If we change to treat the 'active idle' state as a cluster-level P-state, the above issues are easily dismissed.
Agreed, I think we should consider letting the active idle power depend on the actual P-state. Your numbers above definitely show it is something that needs further investigation. Thanks for sharing the numbers.
Thanks, Morten
On Wed, Sep 23, 2015 at 12:01:34PM +0100, Morten Rasmussen wrote:
On Tue, Sep 22, 2015 at 08:44:40PM +0100, Leo Yan wrote:
On Mon, Sep 21, 2015 at 05:31:37PM +0100, Morten Rasmussen wrote:
On Mon, Sep 21, 2015 at 06:58:30AM +0100, Leo Yan wrote:
On Fri, Sep 18, 2015 at 05:57:48PM +0100, Morten Rasmussen wrote:
[...]
There may be more than one kind of 'active idle' state for a cluster; for example, all cores in the cluster entering the 'WFI' state gives one corresponding 'active idle' state, and all cores entering the 'CPUOFF' state gives another. Should we handle these two kinds of 'active idle' state as the same one?

Furthermore, if one CPU only enters 'WFI' while the other CPUs in the cluster enter 'CPUOFF', how do we select the 'active idle' state?
Wouldn't it primarily affect the core energy consumption? I would associate the energy delta between all WFI and all CPUOff with the cores and not the cluster as I would have thought it was caused by powering off the cores. The cluster logic would be on and clocked in both cases and since the cores are idling they shouldn't cause any (different) Pd for the cluster in the two cases. Why would the selected core idle-state affect the cluster? Do you have an example?
Totally agree with this; the selected core idle-state will _NOT_ affect the cluster level at all.
"The cluster logic would be on and clocked in both cases and since the cores are idling they shouldn't cause any (different) Pd for the cluster in the two cases."
So during the 'active idle' period we need to directly use the cluster's (Ps + Pd) to calculate the cluster level's power.

Whether it is in the 'active idle' state or other running states, the cluster level is always active (there may be a small difference due to snooping), so we can calculate the cluster level's power in the same way.
Thanks, Leo Yan
On Fri, Sep 18, 2015 at 05:57:48PM +0100, Morten Rasmussen wrote:
On Thu, Sep 17, 2015 at 04:02:09PM +0100, Leo Yan wrote:
[...]
From formula F.4, we can split power into static leakage and dynamic power; IPA also uses static/dynamic leakage to depict its energy model. But EAS uses another way, which provides the power data for every OPP and idle state. So that means on one platform, we need to provide two kinds of power data.

IMHO, I think the static and dynamic representation is simpler, because usually we use (mW/MHz) to describe the power efficiency of a specific CPU, though (mW/MHz) cannot describe power consumption very accurately if the voltage has been changed (see formula F.6; usually the voltage is increased at higher frequencies). But if we use mW/MHz, we can calculate in a very simple way: we just multiply it by the frequency to get the dynamic power.
So we only need to provide the below parameters:
  P-state: static leakage, power efficiency (mW/MHz), capacity (DMIPS/MHz);
  C-state: static leakage, power efficiency (mW/MHz);
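For illustration, the reduced parameter set could look roughly like this (a sketch; the field names are invented, and the linear Ps + Pe*f power model only holds per F.5/F.6 while the voltage stays fixed):

```python
from dataclasses import dataclass

@dataclass
class PStateParams:
    static_leakage_mw: float            # Ps at this state's voltage (F.5)
    power_efficiency_mw_per_mhz: float  # Pe = b * V^2 (F.6)
    capacity_dmips_per_mhz: float       # performance, for mW-per-DMIPS comparison

@dataclass
class CStateParams:
    static_leakage_mw: float
    power_efficiency_mw_per_mhz: float

def pstate_power(p: PStateParams, freq_mhz: float) -> float:
    # F.4 with F.5/F.6 folded in: P = Ps + Pe * f (valid at a fixed voltage)
    return p.static_leakage_mw + p.power_efficiency_mw_per_mhz * freq_mhz
```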
What are your thoughts on unifying the energy model?
We want to unify the power models if at all possible. The IPA people are looking into it. The difficulty is that we are looking for different things, so the models have to capture enough detail to be useful for both.
Are you proposing to derive the individual P-state numbers from global numbers or do you propose to have the three parameters for each P-state in tables like we currently have them?
I'm referring to the first one: deriving the individual P-state numbers from global numbers.
If you want to derive them from global numbers, you would need to compensate for voltage scaling for both Ps and Pd, so you would need the voltage for each state. Otherwise your energy efficiency will _improve_ as you increase frequency.
Correct.
It might work. I think the first step is to see if the derived curves would correlate well with real measurements. We would need a way to derive static leakage and power efficiency from measurements. I don't know if that can be easily done. Do you have any suggestions for that?
Pd [w] = b * V [v] * V [v] * frequency (F.6)
From previous experience, if we fix the voltage for all OPPs then we get an almost linear ratio between Pd [w] and frequency, because the voltage in 'b * V [v] * V [v]' is fixed. The ratio will skew after the voltage is increased.
We can do power measurements in a simple environment (bare metal code or a simple generic Linux environment); below are some measurement methods:
1. First we need a stable baseline before power measurement; for example, first power off all other CPUs and only use one CPU for measurement. So we can first hotplug out all unused CPUs.
2. CPU(Ps [w]) = Power(CPU_WFI) - Power(CPU_OFF)
   CPU(Pd [w]) = Power(OPP) - Power(CPU_WFI)
   or
   CPU(Pd [w]) = Power(OPP') - Power(OPP)
For Pd [w], we need to run a benchmark (CoreMark) to keep the CPU at 100% utilization.
Then we can get "b * V [v] * V [v]" = Pd [w] / frequency, which is what we usually call the value of Pe (mW/MHz).
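The subtraction method above can be written out as follows (a sketch with illustrative numbers; per F.6, b = Pd / (V^2 * f)):

```python
# Derive the dynamic-power coefficient 'b' and Pe (mW/MHz) from the
# subtraction-based measurements described above. All numbers illustrative.

def dynamic_power(p_opp_mw, p_wfi_mw):
    # CPU(Pd) = Power(OPP) - Power(CPU_WFI), measured at 100% load
    return p_opp_mw - p_wfi_mw

def coeff_b(pd_mw, volt, freq_mhz):
    # F.6: Pd = b * V^2 * f  =>  b = Pd / (V^2 * f)
    return pd_mw / (volt * volt * freq_mhz)

def power_efficiency(pd_mw, freq_mhz):
    # Pe (mW/MHz) = Pd / f = b * V^2 at this OPP's voltage
    return pd_mw / freq_mhz
```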
I think Pe (mW/MHz) still cannot really reflect power efficiency; we also need to take into account the CPU's performance improvement (with a deeper pipeline) and its relationship with power consumption. So Pe (mW/MHz) / DMIPS (or capacity) easily lets us know, for one specific piece of code, which CPU will consume more power.
Deriving the table data using F.5 and F.6 would mean that we can only model systems that follow those formulas reasonably well. The current tables are pure measurement data with a little bit of extrapolation to find the cluster power, which should be a bit more flexible. I'm not sure if that really matter though.
Agreed, we can first use pure measurement data, and later check if we can use a global power efficiency number for some optimization (maybe we can simplify the energy model and improve scheduling performance).

I also have no confidence about which way is better :)
Thanks, Leo Yan
On Mon, Sep 21, 2015 at 02:17:37PM +0100, Leo Yan wrote:
On Fri, Sep 18, 2015 at 05:57:48PM +0100, Morten Rasmussen wrote:
On Thu, Sep 17, 2015 at 04:02:09PM +0100, Leo Yan wrote:
[...]
From formula F.4, we can split power into static leakage and dynamic power; IPA also uses static/dynamic leakage to depict its energy model. But EAS uses another way, which provides the power data for every OPP and idle state. So that means on one platform, we need to provide two kinds of power data.

IMHO, I think the static and dynamic representation is simpler, because usually we use (mW/MHz) to describe the power efficiency of a specific CPU, though (mW/MHz) cannot describe power consumption very accurately if the voltage has been changed (see formula F.6; usually the voltage is increased at higher frequencies). But if we use mW/MHz, we can calculate in a very simple way: we just multiply it by the frequency to get the dynamic power.
So we only need to provide the below parameters:
  P-state: static leakage, power efficiency (mW/MHz), capacity (DMIPS/MHz);
  C-state: static leakage, power efficiency (mW/MHz);
What are your thoughts on unifying the energy model?
We want to unify the power models if at all possible. The IPA people are looking into it. The difficulty is that we are looking for different things, so the models have to capture enough detail to be useful for both.
Are you proposing to derive the individual P-state numbers from global numbers or do you propose to have the three parameters for each P-state in tables like we currently have them?
I'm referring to the first one: deriving the individual P-state numbers from global numbers.
If you want to derive them from global numbers, you would need to compensate for voltage scaling for both Ps and Pd, so you would need the voltage for each state. Otherwise your energy efficiency will _improve_ as you increase frequency.
Correct.
It might work. I think the first step is to see if the derived curves would correlate well with real measurements. We would need a way to derive static leakage and power efficiency from measurements. I don't know if that can be easily done. Do you have any suggestions for that?
Pd [w] = b * V [v] * V [v] * frequency (F.6)
From previous experience, if we fix the voltage for all OPPs then we get an almost linear ratio between Pd [w] and frequency, because the voltage in 'b * V [v] * V [v]' is fixed. The ratio will skew after the voltage is increased.
Yes, fixing the voltage would be one way of getting more measurement points to derive 'b'. It does require setting up cpufreq to leave the voltage fixed though. We can't use an optimized cpufreq driver which scales the voltage.
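As a sketch of that approach: with the voltage fixed, F.6 says Pd is linear in f, so a simple least-squares fit of measured Pd against frequency yields the slope b * V^2 (the data points here are invented for illustration):

```python
# Fit Pd = slope * f + intercept from fixed-voltage measurements.
# The slope is b * V^2, i.e. Pe in mW/MHz at the fixed voltage; a large
# non-zero intercept would flag residual static power in the measurements.

def fit_slope(freqs_mhz, pd_mw):
    n = len(freqs_mhz)
    mf = sum(freqs_mhz) / n
    mp = sum(pd_mw) / n
    num = sum((f - mf) * (p - mp) for f, p in zip(freqs_mhz, pd_mw))
    den = sum((f - mf) ** 2 for f in freqs_mhz)
    return num / den
```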
We can do power measurements in a simple environment (bare metal code or a simple generic Linux environment); below are some measurement methods:
- First we need a stable baseline before power measurement; for example, first power off all other CPUs and only use one CPU for measurement. So we can first hotplug out all unused CPUs.
You may want to repeat the experiments with more than one cpu just to verify that the power consumption should be associated with the core and not the cluster.
As mentioned in my reply from yesterday, hotplug may not actually power down the cpu (it doesn't on TC2). It most likely will on most systems, but it is worth keeping in mind.
- CPU(Ps [w]) = Power(CPU_WFI) - Power(CPU_OFF)
  CPU(Pd [w]) = Power(OPP) - Power(CPU_WFI)
  or
  CPU(Pd [w]) = Power(OPP') - Power(OPP)
The last formula with fixed frequency and some additional computation to figure out the Pd, I assume.
For Pd [w], we need to run a benchmark (CoreMark) to keep the CPU at 100% utilization.
Then we can get "b * V [v] * V [v]" = Pd [w] / frequency, which is what we usually call the value of Pe (mW/MHz).
And when we have Pe, we can then compensate for the voltage scaling afterwards. Either directly as part of the energy calculatations or to generate tables similar to the existing ones with precomputed values.
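A sketch of that compensation step (hypothetical values): given Pe measured at a reference voltage, F.6 says Pd scales with V^2, so precomputed table entries could be generated as:

```python
# Generate a per-OPP dynamic-power table from a single Pe number measured
# at a reference voltage, compensating each OPP by (V / Vref)^2 per F.6.

def build_pd_table(opps, pe_ref_mw_per_mhz, v_ref):
    # opps: list of (freq_mhz, voltage) pairs for each P-state
    table = []
    for freq, volt in opps:
        pd = pe_ref_mw_per_mhz * freq * (volt / v_ref) ** 2
        table.append((freq, round(pd, 1)))
    return table
```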
I think Pe (mW/MHz) still cannot really reflect power efficiency; we also need to take into account the CPU's performance improvement (with a deeper pipeline) and its relationship with power consumption. So Pe (mW/MHz) / DMIPS (or capacity) easily lets us know, for one specific piece of code, which CPU will consume more power.
Right, Pe is just a value expressing the relation between frequency and dynamic power for a particular processor implementation at a specific voltage. You are right that energy efficiency is a comparison of real work (instructions executed) and energy cost (work/energy, or the inverse). IPC is different between processors.
It actually depends on the workload, but in the interest of keeping the model simple enough to be used for scheduling decisions I think we should stick to some average expression of the IPC (and compute capacity).
Deriving the table data using F.5 and F.6 would mean that we can only model systems that follow those formulas reasonably well. The current tables are pure measurement data with a little bit of extrapolation to find the cluster power, which should be a bit more flexible. I'm not sure if that really matter though.
Agreed, we can first use pure measurement data, and later check if we can use a global power efficiency number for some optimization (maybe we can simplify the energy model and improve scheduling performance).

I also have no confidence about which way is better :)
If it turns out that we can express the energy model using fewer input parameters and it works for real systems, I think it could make things easier for us in the long run. Fewer input parameters means fewer opportunities for people to do something wrong, and we can probably more easily do some quick checks of the values to see if they make sense.
Also it means less data to stick into DT or wherever it is going to live.
Thanks, Morten