Let's firstly see one example for a small utilization task is waken up and need calculate energy for two candidate CPUs. From hardware design perspective, each candidate CPU cannot decide OPP by itself due it binds with other CPUs in the same clock domain, e.g. it binds clock domain in the cluster with other CPUs, at the end we need to compute energy for all CPUs in the cluster.
Let's use below CPU topology as the example:
Cluster_0 Cluster_1 CPU_0 CPU_4 CPU_1 CPU_5 CPU_2 CPU_6 CPU_3 CPU_7
Current code always calculate the energy for all CPUs in bound clock domain, if the candidates are CPU_0 and CPU_4, the formula for energy computation is as below:
E(CPU_x) stands the CPU_x energy, notation E(CPU_x)` means the CPU_x energy after take placed on it, TE(CPU_x) means the total energy for all computation for all CPUs when task is placed on CPU_x, E_DIFF(CPU_x - CPU_y) means the energy difference between CPU_x and CPU_y, it equals to TE(CPU_x) - TE(CPU_y).
TE(CPU_0) = E(CPU_0)` + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)` + E(CPU_4) + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)
TE(CPU_4) = E(CPU_0) + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0) + E(CPU_4)` + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)`
E_DIFF(CPU_0 - CPU_4) = TE(CPU_0) - TE(CPU_4)
From upper formula we easily get to know CPU_1/2/3/5/6/7 energy
computations are redundant, on the other hand if we only consider the energy consumption by waken task, the energy difference is between the target CPU energy data before and after task placed on it. This method have one benefit is it can avoid to compute all CPUs energy in the same cluster, and only can focus the energy change introduced by waken task on the target CPU, which this method is called 'task oriented computation' in this patch, the energy computation can be optimized as:
TE(CPU_0) = E(CPU_0)` + E(CLS_0)` - E(CPU_0) - E(CLS_0) TE(CPU_4) = E(CPU_4)` + E(CLS_1)` - E(CPU_4) - E(CLS_1)
E_DIFF(CPU_0 - CPU_4) = TE(CPU_0) - TE(CPU_4)
As the result, the computation iteration can be reduced from 20 times to 8 times; so this can significantly reduce calculation overload.
After using task oriented computation, it has one case the computation might take longer time than previous method. For instance, when candidates are CPU_0 and CPU1, and after place task on either CPU the OPP will be increased, in this case, the old method uses below computation:
TE(CPU_0) = E(CPU_0)` + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0) TE(CPU_1) = E(CPU_0) + E(CPU_1)` + E(CPU_2) + E(CPU_3) + E(CLS_0)
E_DIFF(CPU_0 - CPU_1) = TE(CPU_0) - TE(CPU_1)
Using task oriented computation, because the OPP increasing impacts other CPUs in the same cluster, so it needs to calculate all related CPUs energy:
TE(CPU_0) = E(CPU_0)` + E(CPU_1)' + E(CPU_2)' + E(CPU_3)' + E(CLS_0)` - E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)
TE(CPU_1) = E(CPU_0)` + E(CPU_1)' + E(CPU_2)' + E(CPU_3)' + E(CLS_0)` - E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)
E_DIFF(CPU_0 - CPU_1) = TE(CPU_0) - TE(CPU_1)
We can use more complex method for optimization, e.g. we can extend eenv structure to cache CPU energy data so can reuse them for two candidates. This can be used for later optimization.
As side effect, this patch also resolves energy calculation consistent issue, e.g. for some cases the energy calculation is for one cluster, some cases the energy calculation is for multiple clusters; so the energy data semantics are not consistent for different scenarios. Computation inconsistent issue might introduce trouble for PE filter. This patch fixes issue by always calculating task based energy.
To achieve the optimization, this patch utilizes 'eenv->sg_cap' and 'eenv->sg_top' parameters; the parameter 'eenv->sg_cap' is only about the CPU capacity shared attribution, so eventually it's to describe the clock domain shared within CPUs, from this parameter we can get to know the final OPP selection; we need utilize parameter 'eenv->sg_top' to define which CPU we take care about, if the frequency is not changed after placing waken task then it will set the first level scheduling group to it (means one the single CPU) so the energy computation is limited to this single CPU, otherwise it rolls back to compute all CPUs in the same clock domain.
Change-Id: Ifb64ad77c6173388abf13e255e2ed8e8586a38bc Signed-off-by: Leo Yan leo.yan@linaro.org --- kernel/sched/fair.c | 76 ++++++++++++++++++++++++++++++----------------------- 1 file changed, 43 insertions(+), 33 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 49eee75..4811fce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5473,7 +5473,7 @@ struct energy_env { /* Estimated energy variation wrt previous CPU */ int nrg_delta;
- } cpu[NR_CPUS]; + } cpu[NR_CPUS*2];
/* The morst energy efficient CPU for the specified energy_env::p */ int next_cpu; @@ -5791,45 +5791,55 @@ static int compute_energy(struct energy_env *eenv, int candidate) */ static int compute_task_energy(struct energy_env *eenv, int cpu) { - struct sched_domain *sd, *sd_cap; - struct sched_group *sg; - int first_cpu; + struct sched_domain *sd; + unsigned int prev_cap_idx, next_cap_idx; + int cmp_idx, ret;
- sd = rcu_dereference(per_cpu(sd_ea, cpu)); + sd = rcu_dereference(per_cpu(sd_scs, cpu)); if (!sd) return -1; /* Error */
- sg = sd->groups; - do { - /* Skip SGs which do not contains a candidate CPU */ - if (!cpumask_intersects(&eenv->cpus_mask, sched_group_cpus(sg))) - continue; - - eenv->sg_top = sg; - - first_cpu = cpumask_first(sched_group_cpus(sg)); - - /* - * The CPU capacity sharing attribution is decided by hardhware - * design so we can decide the sg_cp value at the beginning - * for specific CPU. - */ - sd_cap = rcu_dereference(per_cpu(sd_scs, first_cpu)); - if (sd_cap && sd_cap->parent) - eenv->sg_cap = sd_cap->parent->groups; - else - eenv->sg_cap = sd_cap->groups; - - find_new_capacity(eenv, cpu); + /* + * The CPU capacity sharing attribution is decided by hardhware + * design so we can decide the sg_cp value at the beginning + * for specific CPU. + */ + if (sd && sd->parent) + eenv->sg_cap = sd->parent->groups; + else + eenv->sg_cap = sd->groups; + + /* Estimate capacity index before task placement */ + cmp_idx = NR_CPUS + cpu; + prev_cap_idx = find_new_capacity(eenv, cmp_idx); + next_cap_idx = find_new_capacity(eenv, cpu); + + /* + * Computation is iteration sched_group from bottom to up level for + * energy accumulation, 'sg_top' is top most sched_group: + * - If the CPU frequency has no change before and after task placed + * onto the target CPU, set 'sg_top' to sched_group for the target + * CPU; this means only to calculate the energy for this single CPU + * and ignore other CPUs in the same clock domain. + * - If found the OPP frequency is changed after task placement then + * need to calculate all CPUs who bound in the same clock domain, + * so set 'sg_top' to shared capacity scheduling group. + */ + if (prev_cap_idx != next_cap_idx) + eenv->sg_top = eenv->sg_cap; + else + eenv->sg_top = sd->groups;
- /* energy is unscaled to reduce rounding errors */ - if (compute_energy(eenv, cpu) == -EINVAL) { - eenv->next_cpu = eenv->prev_cpu; - return -EINVAL; - } + /* energy is unscaled to reduce rounding errors */ + ret = compute_energy(eenv, cmp_idx); + if (ret < 0) + return ret;
- } while (sg = sg->next, sg != sd->groups); + ret = compute_energy(eenv, cpu); + if (ret < 0) + return ret;
+ eenv->cpu[cpu].energy -= eenv->cpu[cmp_idx].energy; return 0; }
-- 1.9.1