On 17-Mar 20:05, Leo Yan wrote:
Let's first look at one example: a small-utilization task is woken up and we need to calculate the energy for two candidate CPUs. From the hardware design perspective, each candidate CPU cannot decide its OPP by itself because it is bound with other CPUs in the same clock domain (e.g. it shares the clock domain with the other CPUs in its cluster), so in the end we need to compute the energy for all CPUs in the cluster.
Let's use the CPU topology below as an example:

  Cluster_0    Cluster_1
  CPU_0        CPU_4
  CPU_1        CPU_5
  CPU_2        CPU_6
  CPU_3        CPU_7
The current code always calculates the energy for all CPUs in the bound clock domains; if the candidates are CPU_0 and CPU_4, the formula for the energy computation is as below.
E(CPU_x) stands for the CPU_x energy; the notation E(CPU_x)' means the CPU_x energy after the task is placed on it; TE(CPU_x) means the total energy across all CPUs when the task is placed on CPU_x; E_DIFF(CPU_x - CPU_y) means the energy difference between placing the task on CPU_x and on CPU_y, i.e. TE(CPU_x) - TE(CPU_y).
  TE(CPU_0) = E(CPU_0)' + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)' +
              E(CPU_4) + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)
  TE(CPU_4) = E(CPU_0) + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0) +
              E(CPU_4)' + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)'
E_DIFF(CPU_0 - CPU_4) = TE(CPU_0) - TE(CPU_4)
From the formulas above we can easily see that the CPU_1/2/3/5/6/7 energy computations are redundant.
This means that, for example, the contribution E(CPU_2) is the same in TE(CPU_0) and TE(CPU_4), right?
If that's what you mean, I'm not entirely convinced. This is true _IFF_ the OPP for Cluster_0 does not change while having the task in CPU_0 or having the task in the other cluster.
If the OPP for Cluster_0 should change, then E(CPU_2) can have two very different values in TE(CPU_0) and TE(CPU_4).
On the other hand, if we only consider the energy consumed by the woken task, the energy difference is between the target CPU's energy before and after the task is placed on it.
Sorry, I don't completely get this last part... and it seems it's the key to understanding the next paragraph :/
One benefit of this method is that it can avoid computing the energy of all CPUs in the same cluster, and instead focus only on the energy change introduced by the woken task on the target CPU.
Do you mean that we should ignore the blocked utilization?
This method is called 'task oriented computation' in this patch; the energy computation can be optimized as:
  TE(CPU_0) = E(CPU_0)' + E(CLS_0)' - E(CPU_0) - E(CLS_0)
  TE(CPU_4) = E(CPU_4)' + E(CLS_1)' - E(CPU_4) - E(CLS_1)
Again, IMHO, in the above formula, if the OPP could change we cannot disregard the energy variations of the other CPUs in Cluster_1.
E_DIFF(CPU_0 - CPU_4) = TE(CPU_0) - TE(CPU_4)
As a result, the number of energy computations drops from 20 terms (two candidates, ten terms each) to 8 (two candidates, four terms each); this significantly reduces the calculation overhead.
After switching to task oriented computation, there is one case where the computation might take longer than with the previous method: when the candidates are CPU_0 and CPU_1, and placing the task on either CPU will raise the OPP.
... but this can happen also in the previous case.
Let's say that CPU_4 has a utilization which is ~80% less than the capacity of its current OPP and that the task's PREV_CPU is CPU_0.
Let's also assume that the utilization of the task is big enough to require a capacity increase on CPU_4. Then the energy of all the other CPUs in Cluster_1 is different between the before and after cases.
In this case, the old method uses the computation below:
  TE(CPU_0) = E(CPU_0)' + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)
  TE(CPU_1) = E(CPU_0) + E(CPU_1)' + E(CPU_2) + E(CPU_3) + E(CLS_0)
E_DIFF(CPU_0 - CPU_1) = TE(CPU_0) - TE(CPU_1)
With task oriented computation, because the OPP increase impacts the other CPUs in the same cluster, the energies of all related CPUs need to be calculated:
  TE(CPU_0) = E(CPU_0)' + E(CPU_1)' + E(CPU_2)' + E(CPU_3)' + E(CLS_0)' -
              E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)
  TE(CPU_1) = E(CPU_0)' + E(CPU_1)' + E(CPU_2)' + E(CPU_3)' + E(CLS_0)' -
              E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)

(in each line the primed terms are evaluated with the task placed on that line's candidate CPU)
E_DIFF(CPU_0 - CPU_1) = TE(CPU_0) - TE(CPU_1)
We could use a more complex method for optimization, e.g. extend the eenv structure to cache the CPU energy data so it can be reused for the two candidates. This can be done as a later optimization.
As a side effect, this patch also resolves an energy calculation consistency issue: in some cases the energy calculation covers one cluster, in other cases multiple clusters, so the energy data semantics are not consistent across scenarios.
That's because we care about relative differences in energy consumption...
This computation inconsistency might introduce trouble for the PE filter.
... we should look into those issues. We know about some of them and already have patches, which were part of the 3.18 kernel used by Pixel devices but are not yet merged in 4.4, since we noticed that these cases, although broken, also do not happen on Android because of the schedtune configurations used by Pixel devices.
This patch fixes the issue by always calculating task-based energy.
To achieve the optimization, this patch utilizes the 'eenv->sg_cap' and 'eenv->sg_top' parameters. 'eenv->sg_cap' describes only the CPU capacity sharing attribution, i.e. the clock domain shared between CPUs, and from it we can know the final OPP selection. 'eenv->sg_top' defines which CPUs we care about: if the frequency does not change after placing the woken task, it is set to the first level scheduling group (i.e. the single CPU), so the energy computation is limited to that single CPU; otherwise it falls back to computing all CPUs in the same clock domain.

Ok, so that means that we always check for OPP changes... then it should work...
Change-Id: Ifb64ad77c6173388abf13e255e2ed8e8586a38bc
Signed-off-by: Leo Yan <leo.yan@linaro.org>
 kernel/sched/fair.c | 76 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 43 insertions(+), 33 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49eee75..4811fce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5473,7 +5473,7 @@ struct energy_env {
 		/* Estimated energy variation wrt previous CPU */
 		int nrg_delta;
-	} cpu[NR_CPUS];
+	} cpu[NR_CPUS*2];
	      ^^^^^^^^^^^
Why can't we have a before/after set of variables directly in the definition of eenv?
Given the usage below it should also turn out to be slightly more cache friendly, i.e.:

	struct energy_env {
	#define PREV_CAP 0
	#define NEXT_CAP 1
		struct {
			// ...
		} cpu[2][NR_CPUS];
	};
	/* The most energy efficient CPU for the specified energy_env::p */
	int next_cpu;

@@ -5791,45 +5791,55 @@ static int compute_energy(struct energy_env *eenv, int candidate)
  */
 static int compute_task_energy(struct energy_env *eenv, int cpu)
 {
-	struct sched_domain *sd, *sd_cap;
-	struct sched_group *sg;
-	int first_cpu;
+	struct sched_domain *sd;
+	unsigned int prev_cap_idx, next_cap_idx;
+	int cmp_idx, ret;

-	sd = rcu_dereference(per_cpu(sd_ea, cpu));
+	sd = rcu_dereference(per_cpu(sd_scs, cpu));
 	if (!sd)
 		return -1; /* Error */

-	sg = sd->groups;
-	do {
-		/* Skip SGs which do not contains a candidate CPU */
-		if (!cpumask_intersects(&eenv->cpus_mask, sched_group_cpus(sg)))
-			continue;
-
-		eenv->sg_top = sg;
-		first_cpu = cpumask_first(sched_group_cpus(sg));
-
-		/*
-		 * The CPU capacity sharing attribution is decided by hardware
-		 * design so we can decide the sg_cap value at the beginning
-		 * for specific CPU.
-		 */
-		sd_cap = rcu_dereference(per_cpu(sd_scs, first_cpu));
-		if (sd_cap && sd_cap->parent)
-			eenv->sg_cap = sd_cap->parent->groups;
-		else
-			eenv->sg_cap = sd_cap->groups;
-
-		find_new_capacity(eenv, cpu);
+	/*
+	 * The CPU capacity sharing attribution is decided by hardware
+	 * design so we can decide the sg_cap value at the beginning
+	 * for specific CPU.
+	 */
+	if (sd && sd->parent)
+		eenv->sg_cap = sd->parent->groups;
+	else
+		eenv->sg_cap = sd->groups;
+
+	/* Estimate capacity index before task placement */
+	cmp_idx = NR_CPUS + cpu;
+	prev_cap_idx = find_new_capacity(eenv, cmp_idx);
+	next_cap_idx = find_new_capacity(eenv, cpu);
^^^^
Based on the above suggestion, this parameter would be {PREV,NEXT}_CAP.
+	/*
+	 * Computation is iteration sched_group from bottom to up level for
+	 * energy accumulation, 'sg_top' is top most sched_group:
+	 * - If the CPU frequency has no change before and after task placed
+	 *   onto the target CPU, set 'sg_top' to sched_group for the target
+	 *   CPU; this means only to calculate the energy for this single CPU
+	 *   and ignore other CPUs in the same clock domain.
+1
+	 * - If found the OPP frequency is changed after task placement then
+	 *   need to calculate all CPUs who bound in the same clock domain,
+	 *   so set 'sg_top' to shared capacity scheduling group.
+	 */
+	if (prev_cap_idx != next_cap_idx)
+		eenv->sg_top = eenv->sg_cap;
+	else
+		eenv->sg_top = sd->groups;
-		/* energy is unscaled to reduce rounding errors */
-		if (compute_energy(eenv, cpu) == -EINVAL) {
-			eenv->next_cpu = eenv->prev_cpu;
-			return -EINVAL;
-		}
+	/* energy is unscaled to reduce rounding errors */
+	ret = compute_energy(eenv, cmp_idx);
^^^^^^^
Same here... that would be a {PREV,NEXT}_CAP.
+	if (ret < 0)
+		return ret;
-	} while (sg = sg->next, sg != sd->groups);
+
+	ret = compute_energy(eenv, cpu);
+	if (ret < 0)
+		return ret;
+
+	eenv->cpu[cpu].energy -= eenv->cpu[cmp_idx].energy;

With the suggestion above this would read:

	eenv->cpu[NEXT_CAP][cpu].energy -= eenv->cpu[PREV_CAP][cpu].energy;

 	return 0;
 }
--
1.9.1

--
#include <best/regards.h>
Patrick Bellasi