On Tue, Sep 22, 2015 at 08:44:40PM +0100, Leo Yan wrote:
On Mon, Sep 21, 2015 at 05:31:37PM +0100, Morten Rasmussen wrote:
On Mon, Sep 21, 2015 at 06:58:30AM +0100, Leo Yan wrote:
On Fri, Sep 18, 2015 at 05:57:48PM +0100, Morten Rasmussen wrote:
Talking about idle-state representation. The current idle-state tables are quite confusing. We only have per-cpu states listed in the per-cpu tables, and per-cluster in the per-cluster tables (+ active idle). This is why we have WFI for the core tables and 'active idle' (WFI) + CLSOff for the cluster tables for TC2. I'm planning on changing that so we have the full list of states in all tables, but with zeros or repeated power numbers for states that don't affect the associated power domain.
Here I think we should create a clear principle for the energy model and apply it. If we go back and review the "WFI" state, its power domain/voltage domain/clock domain are all at the CPU level, not at the cluster level. So the most reasonable calculation for the 'active idle' state should be as depicted below:
Energy [j] = Energy_cluster [j] + Sum(i=0..MAX_CPU_PER_CLUSTER) Energy_cpu(i) [j]

Energy_cluster [j] = Sum(i=0..MAX_OPP) (Power_Pstate(i) [w] * Util_OPP(i))
    where Sum(i=0..MAX_OPP) Util_OPP(i) = 1

Energy_cpu(i) [j] = Power(IDLE_WFI)
So for the 'active idle' state, all cpus stay in the "WFI" state, but the cluster level actually always stays in a P-state rather than a C-state. This is because the cluster level's power domain/clock domain is always ON in 'active idle'.
But currently EAS considers the cluster level to be in an idle state for 'active idle', right?
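The calculation proposed above can be sketched as follows. This is only an illustration: the function names and all numbers are hypothetical, and since the formulas plug power values into terms labelled as energy, the sketch computes an average power (multiply by the time interval to get energy).

```python
# Sketch of the proposed 'active idle' estimate. Function names and all
# numbers are hypothetical; they are not taken from any real platform.

def cluster_active_idle_power(pstate_power_w, opp_util):
    """Utilization-weighted cluster power: Sum_i Power_Pstate(i) * Util_OPP(i)."""
    if len(pstate_power_w) != len(opp_util):
        raise ValueError("one utilization share per OPP is required")
    if abs(sum(opp_util) - 1.0) > 1e-9:
        raise ValueError("OPP utilization shares must sum to 1")
    return sum(p * u for p, u in zip(pstate_power_w, opp_util))


def active_idle_power(pstate_power_w, opp_util, wfi_power_w, nr_cpus):
    """Cluster term plus one WFI term per cpu in the cluster."""
    cluster = cluster_active_idle_power(pstate_power_w, opp_util)
    return cluster + nr_cpus * wfi_power_w


# Hypothetical cluster: three OPPs, four cpus, all cpus sitting in WFI.
pstate_power_w = [0.10, 0.20, 0.40]   # cluster power per P-state [W]
opp_util = [0.5, 0.25, 0.25]          # share of time spent at each OPP
total = active_idle_power(pstate_power_w, opp_util, wfi_power_w=0.01, nr_cpus=4)
print(f"estimated active idle power: {total:.2f} W")
```

The point of the weighting is that the cluster term follows whatever P-state the cluster happens to be in, instead of using one fixed idle number.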
Yes, but it isn't easy to generalize based on the TC2 model due to the limitations of TC2. From a model point of view we want to know which state the cpu/cluster is in: Running or idling. The C-states represent the hardware supported idle-states (controlling clock and/or power). An idle cluster or core may idle in one of these states or sit idle with everything powered up and clocked. The latter is 'active idle'. A cluster may be active idle if all the cpus are idling in some per-cpu idle-state and the cpuidle governor has chosen to leave it powered up (possibly due to target residency constraints). The same could in theory be the case for a cpu core. It could be spinning in the idle loop if cpuidle didn't decide to enter a C-state. On ARM WFI is practically free to enter, so we always enter a proper hardware idle-state whenever we are idle, even if it is only for a single clock cycle. Hence, we would never be active idling an ARM cpu, so WFI takes the role of active idle in this case. If WFI had a target_residency that would prevent cpus from entering it and leave them spinning, we would need an active idle state for the cpus as well.
In the model we treat active idle as an idle state despite the cpu/cluster being fully operational and running. The reason for this is that even though we are in some P-state, we aren't actually doing anything useful and the power consumption is likely to be very different from when we are busy. In the cluster active idle case, all the cpus are idling, which means nobody is accessing caches and memory hence the transistor toggling is very limited (though it might be affected by snooping traffic if another cluster is busy). If we used the busy P-state power, we would vastly over-estimate the active idle power for the cluster in most cases. In the cpu case (if we weren't guaranteed to enter WFI), we would be spinning in some simple loop that probably wouldn't exercise the entire cpu core and hopefully use a little less power (no cache access and expensive instructions).
Since we are technically running when active idling, one could argue that we should have an active idle power number for each P-state. For ARM that isn't an issue for per-core idling as we have WFI. For clusters we may want to consider it.
I should add that the P-state influence does not go away entirely for cores when they enter WFI. Ps (F.5) is still there since WFI is only clock gating, so the voltage of the P-state still has an effect. It isn't voltage squared, so I'm not sure if it is really a problem.
The short answer is: In active idle the cpu/cluster is in a P-state doing nothing. We can make WFI the active idle state per-core (cpu) on ARM as we are guaranteed to enter it when the cpu is idle.
Agreed, I have two concerns here:
- If we take the cluster's 'active idle' as an idle state, that means Pd [w] is totally ignored for it. Whatever frequency the cluster level is running at, the dynamic power will be ignored.
I wouldn't say we totally ignore Pd, we measure the total power P = Ps + Pd, but I agree with you that Pd depends on the P-state in which we are active idling. As I just added above, the same is also true for Ps (F.5). It is just worse for Pd (F.6) as it has voltage squared.
Below is some measured data for 'active idle' on CA7:

CPUFreq@156MHz:  11mA
CPUFreq@312MHz:  28mA
CPUFreq@624MHz:  36mA
CPUFreq@800MHz:  45mA
CPUFreq@1100MHz: 56mA

So in practice, if we use the lowest frequency for the cluster's 'active idle', there will be some deviation if the cluster is actually running at the highest frequency.
Yes, that is quite a difference, around 5x. The question is whether it actually affects the scheduling decisions if we include this in the model, or if we can get away with just picking something in the middle, like 36mA. If we pick 36mA, we would overestimate energy expense of idling the cluster in low-utilization scenarios, and under-estimate in high-utilization scenarios. I think it could give some strange results if active idling turns out to consume more energy than being busy for the lowest P-states. I can't come up with a scenario where it is a problem though. More thinking is needed I think.
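To put rough numbers on the over- and under-estimation, here is a small sketch using the CA7 currents quoted above. It assumes all frequencies are in MHz and compares currents only; converting to power would need the supply voltage at each OPP, which isn't given here.

```python
# Active idle current on CA7 as quoted earlier in the thread.
# Keys are frequencies, assumed to be in MHz; values are in mA.
measured_ma = {156: 11, 312: 28, 624: 36, 800: 45, 1100: 56}

FIXED_MA = 36  # the "something in the middle" candidate


def relative_error(fixed_ma, actual_ma):
    """Positive means the fixed value over-estimates, negative under-estimates."""
    return (fixed_ma - actual_ma) / actual_ma


errors = {freq: relative_error(FIXED_MA, ma) for freq, ma in measured_ma.items()}
spread = max(measured_ma.values()) / min(measured_ma.values())

for freq, err in errors.items():
    print(f"{freq:>5} MHz: {err:+.0%}")
print(f"highest/lowest ratio: {spread:.1f}x")
```

With a fixed 36mA the estimate is more than three times the measured value at 156MHz and roughly a third too low at 1100MHz, so the error is far from symmetric.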
If it turns out that we need to capture active idle more accurately in the model, we could extend the P-state table to have idle-power numbers for each state in addition to the busy power. We would need a special case in the idle energy calculation to use those numbers instead when we are in active idle and use the C-state data when we are in a true hardware idle state.
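A minimal sketch of what such an extended table could look like. The type names, field names and all power values are hypothetical and only illustrate the special case; this is not how the current tables are laid out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PState:
    freq_mhz: int
    busy_power: int   # power when busy at this P-state (arbitrary units)
    idle_power: int   # hypothetical extra column: active idle power here


@dataclass(frozen=True)
class CState:
    name: str
    power: int        # power in this hardware idle state (arbitrary units)


def idle_power(pstate: PState, cstate: Optional[CState] = None) -> int:
    """Pick the idle power number for the idle energy calculation.

    cstate is None for active idle (powered up and clocked, doing nothing),
    in which case the per-P-state idle number is used. Otherwise the C-state
    data is used and the current P-state is ignored.
    """
    if cstate is None:
        return pstate.idle_power
    return cstate.power


# Hypothetical cluster table; the numbers are made up for illustration.
P_STATES = [PState(500, 100, 10), PState(1000, 300, 40)]
CLUSTER_OFF = CState("CLSOff", 2)

print(idle_power(P_STATES[0]))               # active idle at the low P-state
print(idle_power(P_STATES[1]))               # active idle at the high P-state
print(idle_power(P_STATES[1], CLUSTER_OFF))  # true hardware idle state
```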
There may be more than one kind of 'active idle' state for a cluster. For example, when all cores in the cluster enter the 'WFI' state there is one corresponding 'active idle' state, and when all cores in the cluster enter the 'CPUOFF' state there is another corresponding 'active idle' state. Should we also handle these two kinds of 'active idle' state as the same one?
Furthermore, if only one CPU enters 'WFI' and the other CPUs in the cluster enter 'CPUOFF', how do we select the 'active idle' state?
Wouldn't it primarily affect the core energy consumption? I would associate the energy delta between all WFI and all CPUOff with the cores and not the cluster as I would have thought it was caused by powering off the cores. The cluster logic would be on and clocked in both cases and since the cores are idling they shouldn't cause any (different) Pd for the cluster in the two cases. Why would the selected core idle-state affect the cluster? Do you have an example?
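The attribution argued for above can be sketched like this: the per-core idle-state choice only changes the core term, while the cluster active idle term stays the same for all-WFI, all-CPUOff and mixed cases. The numbers and names are hypothetical, not measurements.

```python
# Hypothetical idle power numbers (arbitrary units), not measured data.
CORE_IDLE_POWER = {"WFI": 5, "CPUOff": 0}
CLUSTER_ACTIVE_IDLE_POWER = 20  # cluster logic on and clocked


def idle_power_split(core_states):
    """Return (sum of core idle power, cluster active idle power).

    The cluster term does not depend on which idle state each core picked;
    only the core term changes.
    """
    core = sum(CORE_IDLE_POWER[state] for state in core_states)
    return core, CLUSTER_ACTIVE_IDLE_POWER


all_wfi = idle_power_split(["WFI"] * 4)
all_off = idle_power_split(["CPUOff"] * 4)
mixed = idle_power_split(["WFI", "CPUOff", "CPUOff", "CPUOff"])
print(all_wfi, all_off, mixed)
```

Under this assumption the mixed case needs no separate cluster 'active idle' state; if hardware shows a cluster-level delta between these cases, the sketch would not hold.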
If we change to treating the 'active idle' state as a cluster-level P-state, the issues above can easily be dismissed.
Agreed, I think we should consider letting the active idle power depend on the actual P-state. Your numbers above definitely show it is something that needs further investigation. Thanks for sharing the numbers.
Thanks, Morten