Hi Leo,
Thanks for sharing this excellent write-up. I'm tempted to suggest that we add this to the documentation.
On Thu, Sep 17, 2015 at 04:02:09PM +0100, Leo Yan wrote:
Hi all,
Below are some thoughts and questions after reviewing EAS's energy model. My purpose is to get a clear picture of the energy model from the user's perspective, so the questions below will _ONLY_ focus on the model and not dig into the implementation.
This email is rather long, but I think formulas make it easier to get on the same page. So I list the energy model's formulas first, then based on them I try to match them against TC2's power data and raise some questions. I look forward to your suggestions and comments.
Basic Energy and Power Calculation Formulas
From the doc Documentation/scheduler/sched-energy.txt, we know that energy can be calculated with:
Energy [j] = Power [w] * Time [s] (F.1)
So let's assume there is a piece of code with a fixed number of instructions to be executed on the CPU; the execution duration depends on the CPU's pipeline and the CPU's frequency. So we can convert F.1 to F.2:

                          Code [instructions]
Energy [j] = Power [w] * ------------------------------
                          (Inst Per Cycle) * Frequency

                          Code [instructions]
           = Power [w] * ------------------------------    (F.2)
                          MIPS(f)
                               `-> 'f' is factor of frequency

Because MIPS(f) can be normalized as the CPU's capacity at the corresponding OPP, we can simply convert F.2 to F.3:

                          Code [instructions]
Energy [j] = Power [w] * ---------------------    (F.3)
                          CPU_Capacity(f)

If we break down Power [w], we can split it into two parts: static leakage and dynamic leakage:
Power [w] = Ps [w] + Pd [w] (F.4)
Static power leakage can be calculated with below formula:

  Ps [w] = i * V [v]                        (F.5)
           `-> 'i' is coefficient according to silicon's process
               V [v] is voltage according to OPP

Dynamic power leakage can be calculated with below formula:

  Pd [w] = b * V [v] * V [v] * frequency    (F.6)
           `-> 'b' is coefficient according to silicon's process
               V [v] is voltage according to OPP

There are two special cases. If the island's clock is gated, then Pd [w] = 0, so:

  Power [w] = Ps [w]    (F.7)

If the island is powered off, then Ps [w] = 0 and Pd [w] = 0, so:

  Power [w] = 0         (F.8)

So energy can be calculated as (from F.3 and F.4):

                                  Code [instructions]
Energy [j] = (Ps [w] + Pd [w]) * ---------------------    (F.9)
                                  CPU_Capacity(f)

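As a rough illustration of how F.5, F.6 and F.9 fit together, here is a minimal C sketch. The struct and function names (opp_params, static_power(), dynamic_power(), energy_for_code()) are made up for this example and are not part of the EAS code, and the units are whatever the coefficients 'i' and 'b' imply.

```c
#include <assert.h>

/* Hypothetical per-OPP parameters; field names follow F.5, F.6 and F.3. */
struct opp_params {
	double i_coeff;   /* 'i' in F.5, depends on the silicon process */
	double b_coeff;   /* 'b' in F.6, depends on the silicon process */
	double volt;      /* V [v] at this OPP */
	double freq;      /* frequency at this OPP */
	double capacity;  /* CPU_Capacity(f), i.e. normalized MIPS(f) */
};

/* F.5: Ps = i * V */
static double static_power(const struct opp_params *p)
{
	return p->i_coeff * p->volt;
}

/* F.6: Pd = b * V * V * frequency */
static double dynamic_power(const struct opp_params *p)
{
	return p->b_coeff * p->volt * p->volt * p->freq;
}

/* F.9: Energy = (Ps + Pd) * Code / CPU_Capacity(f) */
static double energy_for_code(const struct opp_params *p, double instructions)
{
	assert(p->capacity > 0.0);
	return (static_power(p) + dynamic_power(p)) * instructions / p->capacity;
}
```

The two special cases F.7 and F.8 would correspond to dropping the dynamic_power() term (clock gated) or both terms (powered off).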
Formulas for duty cycle
We separate the logic (cluster or CPU) into two kinds of states, P-states and C-states. They have different power data because once the logic enters a C-state it is clock gated or powered off. So if we look at the time axis over a relatively long period, we need to calculate the CPU's utilization percentage (for a CPU that is fully running, util = 100%). Let's simplify the ratio between "Code [instructions]" and "CPU_Capacity(f)" as the utilization, so the energy calculation can be depicted as:

            Code [instructions]
Util(f) = ---------------------    (F.10)
            CPU_Capacity(f)

Energy [j] = Power_Pstate [w] * Util(f) + Power_Cstate [w] * (1 - Util(f)) (F.11)

Energy [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))
           + Sum(i=0..MAX_IDL)(Power_Cstate [w](i) * Util_IDL(i))    (F.12)

Sum(i=0..MAX_OPP)Util_OPP(i) + Sum(i=0..MAX_IDL)Util_IDL(i) = 1

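To make F.12 and its constraint concrete, a small C sketch follows. The names (state_share, weighted_sum(), total_share(), duty_cycle_energy()) are hypothetical and only illustrate the formula; they are not taken from the EAS patches.

```c
#include <assert.h>
#include <stddef.h>

/* One P-state or C-state: its power and the share of time spent in it. */
struct state_share {
	double power;  /* Power_Pstate(i) or Power_Cstate(i) */
	double util;   /* Util_OPP(i) or Util_IDL(i), between 0 and 1 */
};

/* Sum of power * util over one table of states. */
static double weighted_sum(const struct state_share *s, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += s[i].power * s[i].util;
	return sum;
}

/* Sum of the time shares of one table; P-state and C-state shares must add up to 1. */
static double total_share(const struct state_share *s, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += s[i].util;
	return sum;
}

/* F.12: P-state contribution plus C-state contribution. */
static double duty_cycle_energy(const struct state_share *opp, size_t n_opp,
				const struct state_share *idle, size_t n_idle)
{
	assert(opp != NULL || n_opp == 0);
	assert(idle != NULL || n_idle == 0);
	return weighted_sum(opp, n_opp) + weighted_sum(idle, n_idle);
}
```

A caller would check that total_share() over both tables is (close to) 1 before trusting the result.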
Formulas for clusters

Energy [j] = Energy_cluster [j] + Sum(i=0..MAX_CPU_PER_CLUSTER)Energy_cpu(i) [j]    (F.13)

Energy_cluster [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))
                   + Sum(i=WFI, ClusterOff)(Power_Cstate [w](i) * Util_IDL(i))    (F.14)

A minor detail here is that a cluster and/or cpu may be idle (from a utilization point of view) but not actually in an idle state (from a hardware point of view). For example, all the cpus may be in WFI or cpu_power_down while the cluster still has power and its clock running. You point that out towards the end as well. For this reason, the model has to consider this idle, but not really idle, state too. I called it 'active idle' in the past.

Energy_cpu [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))
               + Sum(i=WFI, CPUOff)(Power_Cstate [w](i) * Util_IDL(i))    (F.15)

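The composition in F.13 can be sketched as below, assuming the per-domain energies from F.14 and F.15 have already been computed. total_energy() is a hypothetical helper for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/*
 * F.13: total energy = cluster energy + sum of the per-cpu energies.
 * energy_cluster is the result of F.14 (including any 'active idle' share),
 * energy_cpu[i] is the result of F.15 for cpu i.
 */
static double total_energy(double energy_cluster, const double *energy_cpu,
			   size_t nr_cpus)
{
	double sum = energy_cluster;
	size_t i;

	assert(energy_cpu != NULL || nr_cpus == 0);
	for (i = 0; i < nr_cpus; i++)
		sum += energy_cpu[i];
	return sum;
}
```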
Thoughts and Questions
Let's summarize EAS's energy model as below:

CPU::capacity_state::power : CPU's power [w] for specific OPP
  Power(OPP) = Ps [w] + Pd [w]

CPU::idle_state::power : CPU's power [w] for specific idle state
  Power(IDLE_WFI) = Ps [w]
  Power(IDLE_CPUOff) = 0

CPU's IDLE_WFI means the CPU is clock gated, so it has static leakage but no dynamic leakage.
Agreed, but if we imagine that we have a state between WFI and CPUOff which powers down a part of the cpu core, but not everything (as CPUOff does), it would consume
Power(IDLE_CPUalmostOff) = a * Ps [w] -> a = ratio of transistors still powered.
F.5 assumes that all transistors are affected, which holds as long as all transistors in the power domains that we provide separate model data for (cpu core and cluster) are equally affected by each idle-state.
F.6 makes a similar assumption about the toggling rate of all transistors scaling linearly with the frequency. I think that one is probably fine for the model precision that we are after, but I haven't verified using actual measurements.
CLUSTER::capacity_state::power : Cluster's power [w] for specific OPP
  Power(OPP) = Ps [w] + Pd [w]

CLUSTER::idle_state::power : Cluster's power [w] for specific idle state
  Power(IDLE_WFI) = Ps [w] + Pd [w]
  Power(IDLE_CLSOff) = 0

Cluster's IDLE_WFI is quite special: it means all CPUs in the cluster have been powered off, but the cluster's logic (L2$, SCU, etc.) is powered on and its clock is enabled, so it includes the cluster level's static power and dynamic power.
Right, this is the 'active idle' state I mentioned earlier.
Are these formulas matching the original design?
Very much, yes. The only difference is that in the current design I don't distinguish between static and dynamic power, so if you substitute Ps [w] + Pd [w] = P [w] it is the same.
TC2's data for cluster's sleep:

static struct idle_state idle_states_cluster_a7[] = {
	{ .power = 25 }, /* WFI */
	{ .power = 10 }, /* cluster-sleep-l */
};

static struct idle_state idle_states_cluster_a15[] = {
	{ .power = 70 }, /* WFI */
	{ .power = 25 }, /* cluster-sleep-b */
};

For the cluster level's sleep state, the clock is gated and the domain is powered off, so both the dynamic leakage and the static leakage should be zero, right?
In an ideal world, yes. These numbers come from actual measurements using the TC2 energy counters so this is down to practical issues. Something must still be leaking while the cluster is off which is included in the power domain monitored by the counters, or the energy counter circuits may not be 100% accurate. We didn't tweak the numbers to make them fit theory ;-)
TC2's data for CPU's idle state:

static struct idle_state idle_states_core_a7[] = {
	{ .power = 0 }, /* WFI */
};

static struct idle_state idle_states_core_a15[] = {
	{ .power = 0 }, /* WFI */
};

The CPU has two idle states, 'WFI' and 'C2'. For the 'WFI' state the power should not be zero, because 'WFI' means internal clock gating, so according to F.7 there should be static leakage. BTW, for TC2 there is no corresponding idle state for 'C2', which is weird. Could you confirm it has been deliberately removed?
I assume that by 'C2' you mean CPUOff. You seem to be assuming that all cpus have WFI and CPUOff. This is not the case. TC2 has no CPUOff state, so it wasn't removed, it was never there :-) It only has WFI (clock-gating each individual core) and CLSOff (power down the entire cluster). We need to be able to handle those systems too, as well as systems with more per-cpu idle-states.
The WFI power is zero for practical reasons. It is not possible to derive the per-core WFI power with the energy counters. We can put all cpus into WFI and measure the cluster energy, which would be the result of F.13, but we have no way of figuring out how to decompose it into cluster and cpu energy contributions. We have to account for all the energy somewhere, so instead of assuming some arbitrary split between cluster and cpu energy, we assume that it is all cluster energy. Hence, the WFI power is accounted for in the cluster 'active idle' power.
IOW, it isn't missing, it is just accounted for somewhere else as we didn't have a way to figure out the true split between cluster and core.
Talking about idle-state representation. The current idle-state tables are quite confusing. We only have per-cpu states listed in the per-cpu tables, and per-cluster in the per-cluster tables (+ active idle). This is why we have WFI for the core tables and 'active idle' (WFI) + CLSOff for the cluster tables for TC2. I'm planning on changing that so we have the full list of states in all tables, but with zeros or repeated power numbers for states that don't affect the associated power domain.
TC2's data for P-state:

static struct capacity_state cap_states_cluster_a7[] = {
	/* Cluster only power */
	{ .cap = 150, .power = 2967, }, /* 350 MHz */
	[...]
};

static struct capacity_state cap_states_core_a7[] = {
	/* Power per cpu */
	{ .cap = 150, .power = 187, }, /* 350 MHz */
	[...]
};

From previous experience, the CPU level's power leakage is much higher than the cluster level's leakage. For example, for CA7, if only the cluster is powered on (all CPUs in the cluster are powered off), the power delta is ~10mA@156MHz; if one CPU is powered on, the power delta is about 30mA@156MHz. I also checked the data for CA53, and it shows a similar result. So this conflicts with TC2's power data, where the cluster level's power is quite high (about 16 times the CPU level's: 2967 vs. 187). This means we can hardly get much benefit from the CPU level's low power states, because the cluster level contributes most of the power consumption. This does not make sense.
As said above, TC2 doesn't have a CPUOff state, which makes it really crippled in terms of power management. As soon as the cluster is powered up, all cores are sitting in WFI leaking (Ps) with caches being kept coherent and everything. As said above, we had to account for the core WFI power in the cluster active power (OPP), so it ends up becoming quite high.
So the numbers do make sense for TC2, it is just not a very well-designed SoC from a power management point of view. It was a very early test chip not designed for power management experiments at all, but it has really good power measurement infrastructure (energy counters) and everything is upstream and has been that for years. Your previous experience has most likely been with more representative platforms, so I expect numbers for other platforms to be in line with your experience. Juno, which is also a test chip, is closer to what you describe but still not really representative for product grade SoCs, but we don't have anything better with upstream support.
From formula F.4, we can express power as the combination of static leakage and dynamic leakage; IPA also uses static/dynamic leakage to describe its energy model. But EAS uses another approach, which provides the power data for every OPP and idle state. So that means on one platform we need to provide two kinds of power data.
IMHO, the static and dynamic leakage approach is simpler, because we usually use (mW/MHz) to describe the power efficiency of a specific CPU, though (mW/MHz) cannot describe power consumption very accurately if the voltage changes (see formula F.6; usually the voltage is increased at higher frequencies). But if we use mW/MHz, the calculation can be very simple: we just multiply it by the frequency to get the dynamic power.
So we only need to provide the parameters below:
  P-state: static leakage, power efficiency (mW/MHz), capacity (DMIPS/MHz);
  C-state: static leakage, power efficiency (mW/MHz);
What are your thoughts on unifying the energy model?
We want to unify the power models if at all possible. The IPA people are looking into it. The difficulty is that we are looking for different things, so the models have to capture enough detail to be useful for both.
Are you proposing to derive the individual P-state numbers from global numbers or do you propose to have the three parameters for each P-state in tables like we currently have them?
If you want to derive them from global numbers, you would need to compensate for voltage scaling for both Ps and Pd, so you would need the voltage for each state. Otherwise your energy efficiency will _improve_ as you increase the frequency.
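To illustrate the point about voltage compensation, here is a minimal C sketch comparing energy per unit of work (Power / capacity, cf. F.3) with and without per-OPP voltage. All function names and numbers are hypothetical and chosen only to show the trend; they are not measurements from any platform.

```c
#include <assert.h>

/*
 * Global-numbers model WITHOUT voltage compensation:
 * Ps is constant, Pd = mw_per_mhz * f, capacity = cap_per_mhz * f.
 * Energy per unit of work = Ps / (cap * f) + mw_per_mhz / cap,
 * which only gets smaller as f grows.
 */
static double energy_per_work_no_volt(double ps, double mw_per_mhz,
				      double cap_per_mhz, double f_mhz)
{
	assert(cap_per_mhz > 0.0 && f_mhz > 0.0);
	return (ps + mw_per_mhz * f_mhz) / (cap_per_mhz * f_mhz);
}

/* Same quantity, with Ps and Pd scaled by the OPP voltage as in F.5 and F.6. */
static double energy_per_work_volt(double i_coeff, double b_coeff, double volt,
				   double cap_per_mhz, double f_mhz)
{
	double ps = i_coeff * volt;                 /* F.5 */
	double pd = b_coeff * volt * volt * f_mhz;  /* F.6 */

	assert(cap_per_mhz > 0.0 && f_mhz > 0.0);
	return (ps + pd) / (cap_per_mhz * f_mhz);
}
```

With made-up values Ps = 100, 1 mW/MHz and capacity 1 per MHz, the first function gives a lower energy per unit of work at 1000 MHz than at 250 MHz. The second function, with the voltage raised from 1.0 at 250 MHz to 1.5 at 1000 MHz, gives a higher one, which is the behaviour one would expect from real hardware.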
It might work. I think the first step is to see if the derived curves would correlate well with real measurements. We would need a way to derive static leakage and power efficiency from measurements. I don't know if that can be easily done. Do you have any suggestions for that?
Deriving the table data using F.5 and F.6 would mean that we can only model systems that follow those formulas reasonably well. The current tables are pure measurement data with a little bit of extrapolation to find the cluster power, which should be a bit more flexible. I'm not sure if that really matters though.
Thanks,
Morten