Hi Morten,
Thanks for the review; please see my comments and further questions below.
On Fri, Sep 18, 2015 at 05:57:48PM +0100, Morten Rasmussen wrote:
Thanks for sharing this excellent write-up. I'm tempted to suggest that we add this to the documentation.
Glad it's helpful; feel free to use it if you want.
On Thu, Sep 17, 2015 at 04:02:09PM +0100, Leo Yan wrote:
Hi all,
Below are some thoughts and questions after reviewing EAS's energy model; my purpose is to clarify the energy model from a user's perspective, so the questions below will _ONLY_ focus on the model and not dig into the implementation.
This email is relatively long, but I think if we use formulas we can easily get on the same page; so I list the energy model's formulas, and then, based on them, I try to match TC2's power data and bring up some questions. I look forward to your suggestions and comments.
Basic Energy and Power Calculation Formulas
From the doc Documentation/scheduler/sched-energy.txt, we know that energy can be calculated as:
Energy [j] = Power [w] * Time [s] (F.1)
So let's assume there is a piece of code with a fixed number of instructions to be executed on a CPU; the execution duration depends on the CPU's pipeline and the CPU's frequency. So we can convert F.1 to F.2:
                                   Code [instructions]
    Energy [j] = Power [w] * ---------------------------------
                              (Inst Per Cycle) * Frequency

                                   Code [instructions]
               = Power [w] * ---------------------------------     (F.2)
                                       MIPS(f)
                 `-> 'f' is the frequency
Because MIPS(f) can be normalized as the CPU's capacity corresponding to the OPP, we can simply convert F.2 to F.3:
                                   Code [instructions]
    Energy [j] = Power [w] * ---------------------------------     (F.3)
                                    CPU_Capacity(f)
If we break down Power [w], we can split it into two parts: static leakage power and dynamic switching power:
Power [w] = Ps [w] + Pd [w] (F.4)
Static leakage power can be calculated with the formula below:

    Ps [w] = i * V [v]                                      (F.5)
      `-> 'i' is a coefficient determined by the silicon process
          V [v] is the voltage according to the OPP

Dynamic power can be calculated with the formula below:

    Pd [w] = b * V [v] * V [v] * Frequency                  (F.6)
      `-> 'b' is a coefficient determined by the silicon process
          V [v] is the voltage according to the OPP
Here we have two special cases. If the island's clock is gated, then Pd [w] = 0, so:

    Power [w] = Ps [w]                                      (F.7)

If the island is powered off, then Ps [w] = 0 and Pd [w] = 0, so:

    Power [w] = 0                                           (F.8)
So energy can be calculated as (combining F.3 and F.4):

                                         Code [instructions]
    Energy [j] = (Ps [w] + Pd [w]) * ---------------------------   (F.9)
                                          CPU_Capacity(f)
Formulas for duty cycle
We separate the logic (cluster or CPU) into two states: P-state and C-state. P-state and C-state have different power data, because after the logic enters a C-state it is clock gated or powered off. So if we expand the time axis over a relatively long period, we need to calculate the CPU's utilization percentage (for a fully running CPU, util = 100%). Let's simplify the ratio between "Code [instructions]" and "CPU_Capacity(f)" as the utilization, so the energy calculation can be depicted as:
                  Code [instructions]
    Util(f) = --------------------------                           (F.10)
                   CPU_Capacity(f)
Energy [j] = Power_Pstate [w] * Util(f) + Power_Cstate [w] * (1 - Util(f)) (F.11)
    Energy [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))
               + Sum(i=0..MAX_IDLE)(Power_Cstate [w](i) * Util_IDLE(i))   (F.12)

    where: Sum(i=0..MAX_OPP)Util_OPP(i) + Sum(i=0..MAX_IDLE)Util_IDLE(i) = 1
Formulas for clusters

    Energy [j] = Energy_cluster [j]
               + Sum(i=0..MAX_CPU_PER_CLUSTER)Energy_cpu(i) [j]           (F.13)
    Energy_cluster [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))
                       + Sum(i=WFI, ClusterOff)(Power_Cstate [w](i) * Util_IDLE(i))   (F.14)
A minor detail here is that a cluster and/or cpu may be idle (from a utilization point of view) but not actually in an idle state (from a hardware point of view). For example, all the cpus may be in WFI or cpu_power_down while the cluster still has power and clock going. You point that out towards the end as well. For this reason, the model has to consider this idle, but not really idle, state too. I called it 'active idle' in the past.
OK, now 'active idle' is quite clear. I'd like to discuss it further in the comments below.
    Energy_cpu [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))
                   + Sum(i=WFI, CPUOff)(Power_Cstate [w](i) * Util_IDLE(i))
Thoughts and Questions
Let's summarize EAS's energy model as below:
CPU::capacity_state::power : CPU's power [w] for a specific OPP
    Power(OPP) = Ps [w] + Pd [w]

CPU::idle_state::power : CPU's power [w] for a specific idle state
    Power(IDLE_WFI)    = Ps [w]
    Power(IDLE_CPUOff) = 0
CPU's IDLE_WFI means: the CPU is clock gated, so it has static leakage but no dynamic power.
Agreed, but if we imagine that we have a state between WFI and CPUOff which powers down part of the cpu core, but not everything (like CPUOff), it would consume:

    Power(IDLE_CPUalmostOff) = a * Ps [w]
      `-> a = fraction of transistors still powered on
Totally agree that a CPU may have other extra idle states; for a common solution, we should not impose limitations on the idle states.
F.5 assumes that all transistors are affected, which holds as long as all transistors in the power domains that we provide separate model data for (cpu core and cluster) are equally affected by each idle-state.
For one specific power state, whether it's a P-state or a C-state, we actually need to define it with three factors: voltage domain, power domain, and clock domain. Once these factors are well defined for a state, we can easily apply F.5/F.6.
So just like the cases of "IDLE_CPUalmostOff" and "IDLE_CPUOff", there must be some difference between them; for example, they may have different power domains but the same clock domain and voltage domain. So naturally we can calculate different power results for them.
F.6 makes a similar assumption about the toggling rate of all transistors scaling linearly with the frequency. I think that one is probably fine for the model precision that we are after, but I haven't verified it against actual measurements.
Here I need to clarify: F.5 and F.6 do _NOT_ assume anything about all transistors; it totally depends on the definitions of the three domains above, and then these two formulas can be applied correctly.
Usually, if there are errors, it's very likely because we cannot define these three domains clearly and thus introduce incorrect concepts.
CLUSTER::capacity_state::power : Cluster's power [w] for a specific OPP
    Power(OPP) = Ps [w] + Pd [w]

CLUSTER::idle_state::power : Cluster's power [w] for a specific idle state
    Power(IDLE_WFI)    = Ps [w] + Pd [w]
    Power(IDLE_CLSOff) = 0

Cluster's IDLE_WFI is quite special: it means all CPUs in the cluster have been powered off, but the cluster's logic (L2$, SCU, etc.) is powered on and its clock is enabled, so it includes the cluster level's static and dynamic power.
Right, this is the 'active idle' state I mentioned earlier.
Are these formulas matching the original design?
Very much, yes. The only difference is that in the current design I don't distinguish between static and dynamic power, so if you substitute Ps [w] + Pd [w] = P [w] it is the same.
Got it, it's fine to just use summed power data.
TC2's data for cluster's sleep:
static struct idle_state idle_states_cluster_a7[] = {
        { .power = 25 }, /* WFI */
        { .power = 10 }, /* cluster-sleep-l */
};

static struct idle_state idle_states_cluster_a15[] = {
        { .power = 70 }, /* WFI */
        { .power = 25 }, /* cluster-sleep-b */
};
For cluster-level sleep, the clock is gated and the domain is powered off, so the dynamic power and static leakage should both be zero, right?
In an ideal world, yes. These numbers come from actual measurements using the TC2 energy counters so this is down to practical issues. Something must still be leaking while the cluster is off which is included in the power domain monitored by the counters, or the energy counter circuits may not be 100% accurate. We didn't tweak the numbers to make them fit theory ;-)
Makes sense; a little inaccuracy is acceptable.
TC2's data for CPU's idle state:
static struct idle_state idle_states_core_a7[] = {
        { .power = 0 }, /* WFI */
};

static struct idle_state idle_states_core_a15[] = {
        { .power = 0 }, /* WFI */
};
The CPU has two idle states: 'WFI' and 'C2'. For the 'WFI' state the power should not be zero, because 'WFI' means internal clock gating, so according to F.7 there should be static leakage. BTW, for TC2 there is no corresponding idle state for 'C2', which is weird. Could you confirm whether it was deliberately removed?
I assume that by 'C2' you mean CPUOff. You seem to be assuming that all cpus have WFI and CPUOff. This is not the case. TC2 has no CPUOff state, so it wasn't removed, it was never there :-) It only has WFI (clock-gating each individual core) and CLSOff (power down the entire cluster). We need to be able to handle those systems too, as well as systems with more per-cpu idle-states.
Now I know why TC2 has such power data.
The WFI power is zero for practical reasons. It is not possible to derive the per-core WFI power with the energy counters. We can put all cpus into WFI and measure the cluster energy, which would be the result of F.13, but we have no way of figuring out how to decompose it into cluster and cpu energy contributions. We have to account for all the energy somewhere, so instead of assuming some arbitrary split between cluster and cpu energy, we assume that it is all cluster energy. Hence, the WFI power is accounted for in the cluster 'active idle' power.
IOW, it isn't missing, it is just accounted for somewhere else as we didn't have a way to figure out the true split between cluster and core.
Yes, it's hard to extract power data independently for the cluster level and the core level. The main reason is that it's hard to get the delta value for WFI if the SoC doesn't support CPU power off.
Just curious, would it be feasible to measure the WFI state on TC2 with the steps below?
- First measure the power data when the cluster is powered off;
- Then power on CPU0 only and place it into "WFI":
    Power_Delta0 = cluster level power + one CPU's "WFI";
- Then power on CPU1 as well, place it into "WFI", and get:
    Power_Delta1 = cluster level power + two CPUs' "WFI";
- So finally we get "WFI" power = Power_Delta1 - Power_Delta0;
The key point is step 2: when we power on one core, will the other cores in the same cluster automatically be powered on as well?
Talking about idle-state representation. The current idle-state tables are quite confusing. We only have per-cpu states listed in the per-cpu tables, and per-cluster in the per-cluster tables (+ active idle). This is why we have WFI for the core tables and 'active idle' (WFI) + CLSOff for the cluster tables for TC2. I'm planning on changing that so we have the full list of states in all tables, but with zeros or repeated power numbers for states that don't affect the associated power domain.
Here I think we should create a clear principle for the energy model and apply it. If we go back to review the "WFI" state, its power domain/voltage domain/clock domain are all at the CPU level, not the cluster level. So the most reasonable calculation for the 'active idle' state should be depicted as below:
Energy [j] = Energy_cluster [j] + Sum(i=0..MAX_CPU_PER_CLUSTER)Energy_cpu(i) [j]
    Energy_cluster [j] = Sum(i=0..MAX_OPP)(Power_Pstate [w](i) * Util_OPP(i))
      where: Sum(i=0..MAX_OPP)Util_OPP(i) = 1

    Energy_cpu [j] = Power(IDLE_WFI)
So that means in the 'active idle' state all cpus stay in "WFI", but the cluster level actually always stays in a P-state, not a C-state. This follows from the cluster level's power domain/clock domain always being ON during 'active idle'.
But now EAS considers the cluster level to be in an idle state for 'active idle', right?
So let's dig further into this question; it turns out to be closely related to how we look at the idle states. We need to create an idle voting mechanism for the different scheduler domain levels, and there should be a mechanism to fall back to the lower-level scheduler domain for idle state selection if the upper scheduler domain is in a P-state.
Below is an example of voting:

    0: power-on state
    1: idle state 1
    2: idle state 2
    ...
Example 1:
            SCHED_DOMAIN (CPU)    SCHED_DOMAIN (MC)
    CPU0            0                     1
    CPU1            0                     1
    CPU2            0                     1
    CPU3            0                     1
So all 4 CPUs vote 0 for the cluster level, which means the cluster stays powered on and all 4 CPUs go into idle state 1; the scheduler can easily see that SCHED_DOMAIN (CPU) (the cluster level) is not in an idle state, so it falls back to SCHED_DOMAIN (MC) (the cpu level) to find the correct idle state.
Example 2:
            SCHED_DOMAIN (CPU)    SCHED_DOMAIN (MC)
    CPU0            1                     1
    CPU1            1                     1
    CPU2            1                     1
    CPU3            0                     1
3 CPUs vote 1 to power off the cluster and 1 CPU votes 0 to power it on; the scheduler can easily see that the minimum vote for SCHED_DOMAIN (CPU) (the cluster level) is 0, which means the cluster stays powered on, so it falls back to SCHED_DOMAIN (MC) (the cpu level) to find the correct idle state for the CPU level.
Example 3:
            SCHED_DOMAIN (CPU)    SCHED_DOMAIN (MC)
    CPU0            1                     2
    CPU1            1                     2
    CPU2            1                     2
    CPU3            0                     1
3 CPUs vote 1 to power off the cluster and 1 CPU votes 0 to power it on; the minimum vote for SCHED_DOMAIN (CPU) (the cluster level) is 0, which means the cluster stays powered on, so the scheduler falls back to SCHED_DOMAIN (MC) (the cpu level) to find the correct idle state for each CPU. Example 3 is meant to demonstrate that there are two different idle states at the CPU level, so the scheduler needs to know exactly which idle state each individual CPU will fall back to.
TC2's data for P-state:
static struct capacity_state cap_states_cluster_a7[] = {
        /* Cluster only power */
        { .cap = 150, .power = 2967, }, /* 350 MHz */
        [...]
};

static struct capacity_state cap_states_core_a7[] = {
        /* Power per cpu */
        { .cap = 150, .power = 187, }, /* 350 MHz */
        [...]
};
From previous experience, the CPU level's power is much higher than the cluster level's. For example, for CA7, if we only power on the cluster (all CPUs in the cluster powered off), the power delta is ~10mA@156MHz; if we power on one CPU, the power delta is about 30mA@156MHz. I also checked the data for CA53, and it shows a similar result. So this conflicts with TC2's power data: you can see the cluster level's power is quite high (almost 15 times the CPU level's). This means we can hardly get any benefit from the CPU level's low power state, because the cluster level contributes most of the power consumption. This does not make sense.
As said above, TC2 doesn't have a CPUOff state, which makes it really crippled in terms of power management. As soon as the cluster is powered up, all cores are sitting in WFI leaking (Ps), with caches being kept coherent and everything. As said above, we had to account for the core WFI power in the cluster active power (OPP), so it ends up becoming quite high.
So the numbers do make sense for TC2; it is just not a very well-designed SoC from a power management point of view. It was a very early test chip not designed for power management experiments at all, but it has really good power measurement infrastructure (energy counters), and everything has been upstream for years. Your previous experience has most likely been with more representative platforms, so I expect numbers for other platforms to be in line with your experience. Juno, which is also a test chip, is closer to what you describe but still not really representative of product-grade SoCs; however, we don't have anything better with upstream support.
So the P-state power data for TC2 is actually the combination below :)

CLUSTER::capacity_state::power
    Power_CLUSTER(OPP) = Cluster (Ps [w] + Pd [w]) + CPU (Ps [w]) * 4
      `-> includes 4 CPUs' static leakage

CPU::capacity_state::power
    Power_CPU(OPP) = CPU (Pd [w])
[...]
Thanks, Leo Yan