Let's first look at an example: a small-utilization task wakes up and we need to calculate the energy for two candidate CPUs; neither candidate CPU can decide its final OPP by itself because it is bound to other CPUs in the same clock domain, so in the end we have to calculate the energy of all CPUs.
Let's use the CPU topology below as the example:

   Cluster_0      Cluster_1
     CPU_0          CPU_4
     CPU_1          CPU_5
     CPU_2          CPU_6
     CPU_3          CPU_7
The current code always calculates the energy of all CPUs in the bound clock domain. If the candidate CPUs are CPU_0 and CPU_4, the energy is calculated as below (terms marked with ' are evaluated with the task placed on the candidate CPU):

  E(CPU_0) = E(CPU_0)' + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)'
           + E(CPU_4) + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)

  E(CPU_4) = E(CPU_0) + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)
           + E(CPU_4)' + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)'

  E_Diff(CPU_0 - CPU_4) = E(CPU_0) - E(CPU_4)
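Each candidate therefore needs 10 sched_group evaluations (8 CPUs plus 2 cluster levels), i.e. 20 evaluations for the pair.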
From the formulas above it is easy to see that the energy calculations for CPU_1/2/3/5/6/7 are redundant. If we account only for the energy consumed by the task after placing it onto one specific CPU (rather than computing the energy of all CPUs), the calculation can be optimized as:

  E(CPU_0) = E(CPU_0)' + E(CLS_0)' - E(CPU_0) - E(CLS_0)
  E(CPU_4) = E(CPU_4)' + E(CLS_1)' - E(CPU_4) - E(CLS_1)

  E_Diff(CPU_0 - CPU_4) = E(CPU_0) - E(CPU_4)
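To make the arithmetic concrete, here is a minimal C sketch of the task-oriented delta; cpu_energy()/cluster_energy() and their *_with_task variants are hypothetical helpers used only for illustration, not the kernel's energy model API:

  /*
   * Toy model: energy delta caused by placing the task on 'cpu'.
   * All four helpers are hypothetical, shown only to illustrate the
   * E(cpu)' + E(cls)' - E(cpu) - E(cls) formula above.
   */
  static unsigned long task_energy_delta(int cpu, int cluster)
  {
          /* E(cpu) + E(cls): energy without the task */
          unsigned long before = cpu_energy(cpu) + cluster_energy(cluster);
          /* E(cpu)' + E(cls)': energy with the task placed */
          unsigned long after  = cpu_energy_with_task(cpu) +
                                 cluster_energy_with_task(cluster);

          return after - before;
  }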
So the number of energy calculation iterations can be reduced from 20 to 8; this significantly reduces the energy calculation overhead.
After switching to task-oriented calculation, there is one case where the energy calculation may take longer than with the previous method: if the candidate CPUs are CPU_0 and CPU_1 and placing the task on either CPU raises the CPU OPP. In this case, the old code calculated the energy as:

  E(CPU_0) = E(CPU_0)' + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)
  E(CPU_1) = E(CPU_0) + E(CPU_1)' + E(CPU_2) + E(CPU_3) + E(CLS_0)

  E_Diff(CPU_1 - CPU_0) = E(CPU_1) - E(CPU_0)
Because the OPP increase impacts the other CPUs in the same clock domain, the task-oriented method needs to calculate the energy of all affected CPUs (the primed terms are evaluated with the task on CPU_0 and on CPU_1, respectively):

  E(CPU_0) = E(CPU_0)' + E(CPU_1)' + E(CPU_2)' + E(CPU_3)' + E(CLS_0)'
           - E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)

  E(CPU_1) = E(CPU_0)' + E(CPU_1)' + E(CPU_2)' + E(CPU_3)' + E(CLS_0)'
           - E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)

  E_Diff(CPU_1 - CPU_0) = E(CPU_1) - E(CPU_0)
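Counting the sched_group evaluations in this worst case: the old method needs 2 x 5 = 10, while the task-oriented method needs 2 x 10 = 20, hence the possible slowdown.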
We could use a more complex method for further optimization, e.g. first calculate the CPU_0 and CPU_1 OPPs and directly select the CPU with the more power-efficient OPP, or reuse the pre-placement energy data for both candidates. These methods are left for later optimization.
As a side effect, this patch also resolves an energy calculation consistency issue: in some cases the energy was calculated for one cluster and in other cases for multiple clusters, so the energy data semantics were not consistent across scenarios. This patch fixes the issue by always calculating task-based energy.
To achieve the optimization, this patch uses the 'eenv->sg_cap' and 'eenv->sg_top' parameters. 'eenv->sg_cap' is only about the shared CPU capacity attribute, so it effectively describes the clock domain shared among CPUs; from this parameter we can determine the final OPP selection. 'eenv->sg_top' defines which CPUs we care about: if the frequency does not change after placing the woken task, it is set to the first-level sched_group (i.e. the single CPU), so the energy calculation can be limited to that single CPU (see the sketch below).
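In other words, the scope selection boils down to comparing the capacity index before and after placement; this condensed sketch reuses the names from compute_task_energy() in the patch below:

  /* Widen the calculation scope only when the OPP changes */
  if (prev_cap_idx != next_cap_idx)
          eenv->sg_top = sd->parent->groups;      /* whole clock domain */
  else
          eenv->sg_top = sd->groups;              /* single CPU */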
On Hikey960, after fixing the LITTLE CPU frequency to 1402000Hz and the big CPU frequency to 1421000Hz, a 10s ftrace log of the home screen scenario shows the energy calculation duration improving as below:

For energy calculation between a LITTLE CPU and a big CPU, the duration decreases from 34660ns to 16565ns (52% decrease); for energy calculation between two CPUs in the same cluster, the duration decreases from 24342ns to 21093ns (13% decrease).
Change-Id: I7980fae37195547a663da0bbd7ff8d4a17313e9f
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 273 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 154 insertions(+), 119 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fa3c350..4d5d900 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5620,112 +5620,172 @@ static int group_idle_state(struct energy_env *eenv, struct sched_group *sg)
 }
 
 /*
- * sched_group_energy(): Computes the absolute energy consumption of cpus
- * belonging to the sched_group including shared resources shared only by
- * members of the group. Iterates over all cpus in the hierarchy below the
- * sched_group starting from the bottom working it's way up before going to
- * the next cpu until all cpus are covered at all levels. The current
- * implementation is likely to gather the same util statistics multiple times.
- * This can probably be done in a faster but more complex way.
- * Note: sched_group_energy() may fail when racing with sched_domain updates.
- */
-static int sched_group_energy(struct energy_env *eenv)
-{
-	struct cpumask visit_cpus;
+ * sched_group_energy(): Computes the absolute energy consumption of a
+ * specific sched_group. Here 'sched_group *sg' corresponds to one specific
+ * hardware level: at the lowest level it represents a single CPU and
+ * calculates the CPU energy; at the second level it represents a cluster
+ * and calculates the cluster energy based on the energy model parameters.
+ */
+static int sched_group_energy(struct energy_env *eenv, struct sched_group *sg)
+{
 	u64 total_energy = 0;
+	unsigned long group_util;
+	int sg_busy_energy, sg_idle_energy;
+	int cap_idx, idle_idx;
 
-	WARN_ON(!eenv->sg_top->sge);
+	cap_idx = eenv->cap_idx;
+	idle_idx = group_idle_state(eenv, sg);
+	group_util = group_norm_util(eenv, sg);
 
-	cpumask_copy(&visit_cpus, sched_group_cpus(eenv->sg_top));
+	sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power);
+	sg_idle_energy = ((SCHED_CAPACITY_SCALE-group_util)
+				* sg->sge->idle_states[idle_idx].power);
 
-	while (!cpumask_empty(&visit_cpus)) {
-		struct sched_group *sg_shared_cap = NULL;
-		int cpu = cpumask_first(&visit_cpus);
-		struct sched_domain *sd;
+	total_energy = sg_busy_energy + sg_idle_energy;
 
-		/*
-		 * Is the group utilization affected by cpus outside this
-		 * sched_group?
-		 */
-		sd = rcu_dereference(per_cpu(sd_scs, cpu));
+	eenv->energy += total_energy >> SCHED_CAPACITY_SHIFT;
+	return 0;
+}
 
-		if (sd && sd->parent)
-			sg_shared_cap = sd->parent->groups;
+/*
+ * sched_group_hierarchy_energy(): Computes the absolute energy consumption
+ * based on the sched_group hierarchy. We use 'eenv->sg_top' to indicate how
+ * many CPUs' energy should be calculated, e.g. if the target CPU OPP is the
+ * same before and after task placement, then we can calculate only this
+ * target CPU's energy and its higher hierarchy sched_group energy (e.g. the
+ * cluster level or higher level hardware logic energy consumption); in this
+ * case we don't need to calculate the other CPUs' energy in the same clock
+ * domain because their energy is unaffected by the woken task placement.
+ *
+ * If the target CPU OPP is different before and after task placement, then
+ * the OPP increase also impacts the other CPUs' energy in the same clock
+ * domain, so we set 'eenv->sg_top' one level higher to let the energy
+ * calculation include the other affected CPUs.
+ *
+ * For this purpose, sched_group_hierarchy_energy() iterates over all
+ * sched_groups in the hierarchy and checks whether the sched_group
+ * intersects with 'eenv->sg_top'; if it does, the sched_group is impacted
+ * by the task placement and we call sched_group_energy() to calculate the
+ * energy for that specific sched_group, so finally we accumulate the energy
+ * data related to the woken task.
+ */
+static int sched_group_hierarchy_energy(struct energy_env *eenv, int cpu)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
 
-		for_each_domain(cpu, sd) {
-			struct sched_group *sg = sd->groups;
+	eenv->energy = 0;
+	for_each_domain(cpu, sd) {
+		sg = sd->groups;
 
-			/* Has this sched_domain already been visited? */
-			if (sd->child && group_first_cpu(sg) != cpu)
-				break;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(eenv->sg_top),
+						sched_group_cpus(sg)))
+				continue;
 
-			do {
-				unsigned long group_util;
-				int sg_busy_energy, sg_idle_energy;
-				int cap_idx, idle_idx;
+			sched_group_energy(eenv, sg);
 
-				if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
-					eenv->sg_cap = sg_shared_cap;
-				else
-					eenv->sg_cap = sg;
+		} while (sg = sg->next, sg != sd->groups);
+	}
 
-				cap_idx = find_new_capacity(eenv, sg->sge);
+	return 0;
+}
 
-				if (sg->group_weight == 1) {
-					/* Remove capacity of src CPU (before task move) */
-					if (eenv->trg_cpu == eenv->src_cpu &&
-					    cpumask_test_cpu(eenv->src_cpu, sched_group_cpus(sg))) {
-						eenv->cap.before = sg->sge->cap_states[cap_idx].cap;
-						eenv->cap.delta -= eenv->cap.before;
-					}
-					/* Add capacity of dst CPU (after task move) */
-					if (eenv->trg_cpu == eenv->dst_cpu &&
-					    cpumask_test_cpu(eenv->dst_cpu, sched_group_cpus(sg))) {
-						eenv->cap.after = sg->sge->cap_states[cap_idx].cap;
-						eenv->cap.delta += eenv->cap.after;
-					}
-				}
+static inline bool cpu_in_sg(struct sched_group *sg, int cpu)
+{
+	return cpu != -1 && cpumask_test_cpu(cpu, sched_group_cpus(sg));
+}
 
-				idle_idx = group_idle_state(eenv, sg);
-				group_util = group_norm_util(eenv, sg);
+static inline unsigned long task_util(struct task_struct *p);
 
-				sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power);
-				sg_idle_energy = ((SCHED_CAPACITY_SCALE-group_util)
-						  * sg->sge->idle_states[idle_idx].power);
+/*
+ * compute_task_energy(): Computes the absolute energy consumption for the
+ * woken task on a specific CPU. It calculates the energy data twice: the
+ * first energy value is before task placement, the second is after the task
+ * is placed on the specific CPU; the difference between these two values is
+ * the energy introduced by this task.
+ *
+ * In the energy calculation, we use 'eenv' as a descriptor to track the
+ * energy calculation parameters and to hold the output result. Two
+ * parameters are important for the energy calculation:
+ *
+ * 'eenv->sg_cap' represents the sched_group whose CPUs share the same
+ * capacity (i.e. the CPUs in this sched_group share the same clock domain).
+ *
+ * 'eenv->sg_top' represents which CPUs are involved in the task energy
+ * calculation. If 'eenv->sg_top' points to the lowest level sched_group,
+ * the energy calculation is related to a single CPU only; otherwise it
+ * calculates the energy of all CPUs in the same clock domain.
+ */
+static int compute_task_energy(struct energy_env *eenv, int cpu)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
+	unsigned int prev_cap_idx, next_cap_idx;
+	unsigned long energy;
 
-				total_energy += sg_busy_energy + sg_idle_energy;
+	sd = rcu_dereference(per_cpu(sd_scs, cpu));
+	if (!sd)
+		return 0;
 
-				if (!sd->child)
-					cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
+	sg = sd->groups;
 
-				if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(eenv->sg_top)))
-					goto next_cpu;
+	/*
+	 * The CPU capacity sharing attribute is decided by the hardware
+	 * design, so we can decide the sg_cap value at the beginning for
+	 * a specific CPU.
+	 */
+	if (sd && sd->parent)
+		eenv->sg_cap = sd->parent->groups;
+	else
+		eenv->sg_cap = sd->groups;
 
-			} while (sg = sg->next, sg != sd->groups);
-		}
+	/* Estimate capacity index before task placement */
+	eenv->trg_cpu = -1;
+	prev_cap_idx = find_new_capacity(eenv, sg->sge);
 
-		/*
-		 * If we raced with hotplug and got an sd NULL-pointer;
-		 * returning a wrong energy estimation is better than
-		 * entering an infinite loop.
-		 */
-		if (cpumask_test_cpu(cpu, &visit_cpus))
-			return -EINVAL;
-next_cpu:
-		cpumask_clear_cpu(cpu, &visit_cpus);
-		continue;
+	/* Estimate capacity index after task placement */
+	eenv->trg_cpu = cpu;
+	next_cap_idx = find_new_capacity(eenv, sg->sge);
+
+	/*
+	 * If the CPU frequency has no change before vs. after task
+	 * placement, set sg_top to the single CPU; this means we limit the
+	 * energy calculation to this single CPU and ignore the other CPUs
+	 * in the same clock domain. If the OPP frequency is changed, we
+	 * need to calculate all impacted CPUs in the same clock domain, so
+	 * set sg_top to the shared capacity sched_group.
+	 */
+	if (prev_cap_idx != next_cap_idx)
+		eenv->sg_top = sd->parent->groups;
+	else
+		eenv->sg_top = sd->groups;
+
+	/* Remove capacity of src CPU (before task move) */
+	if (cpu == eenv->src_cpu) {
+		eenv->cap.before = sg->sge->cap_states[next_cap_idx].cap;
+		eenv->cap.delta -= eenv->cap.before;
+	/* Add capacity of dst CPU (after task move) */
+	} else if (cpu == eenv->dst_cpu) {
+		eenv->cap.after = sg->sge->cap_states[next_cap_idx].cap;
+		eenv->cap.delta += eenv->cap.after;
 	}
 
-	eenv->energy = total_energy >> SCHED_CAPACITY_SHIFT;
-	return 0;
-}
+	/* Calculate the energy before task placement */
+	eenv->trg_cpu = -1;
+	eenv->cap_idx = prev_cap_idx;
+	sched_group_hierarchy_energy(eenv, cpu);
+	energy = eenv->energy;
 
-static inline bool cpu_in_sg(struct sched_group *sg, int cpu)
-{
-	return cpu != -1 && cpumask_test_cpu(cpu, sched_group_cpus(sg));
-}
+	/* Calculate the energy after task placement */
+	eenv->trg_cpu = cpu;
+	eenv->cap_idx = next_cap_idx;
+	sched_group_hierarchy_energy(eenv, cpu);
 
-static inline unsigned long task_util(struct task_struct *p);
+	return eenv->energy - energy;
+}
 
 /*
  * energy_diff(): Estimate the energy impact of changing the utilization
@@ -5737,53 +5797,28 @@ static inline bool cpu_in_sg(struct sched_group *sg, int cpu)
 static inline int __energy_diff(struct energy_env *eenv)
 {
 	struct sched_domain *sd;
-	struct sched_group *sg;
-	int sd_cpu = -1, energy_before = 0, energy_after = 0;
+	int sd_cpu = -1;
 	int diff, margin;
 
-	struct energy_env eenv_before = {
-		.util_delta	= task_util(eenv->task),
-		.src_cpu	= eenv->src_cpu,
-		.dst_cpu	= eenv->dst_cpu,
-		.trg_cpu	= eenv->src_cpu,
-		.nrg		= { 0, 0, 0, 0},
-		.cap		= { 0, 0, 0 },
-		.task		= eenv->task,
-	};
-
 	if (eenv->src_cpu == eenv->dst_cpu)
 		return 0;
 
 	sd_cpu = (eenv->src_cpu != -1) ? eenv->src_cpu : eenv->dst_cpu;
-	sd = rcu_dereference(per_cpu(sd_ea, sd_cpu));
 
+	/*
+	 * The CPU capacity sharing attribute is fixed by the hardware
+	 * design, so we can decide the sg_cap value at the beginning for
+	 * a specific CPU.
+	 */
+	sd = rcu_dereference(per_cpu(sd_scs, sd_cpu));
 	if (!sd)
 		return 0; /* Error */
 
-	sg = sd->groups;
-
-	do {
-		if (cpu_in_sg(sg, eenv->src_cpu) || cpu_in_sg(sg, eenv->dst_cpu)) {
-			eenv_before.sg_top = eenv->sg_top = sg;
-
-			if (sched_group_energy(&eenv_before))
-				return 0; /* Invalid result abort */
-			energy_before += eenv_before.energy;
-
-			/* Keep track of SRC cpu (before) capacity */
-			eenv->cap.before = eenv_before.cap.before;
-			eenv->cap.delta = eenv_before.cap.delta;
-
-			if (sched_group_energy(eenv))
-				return 0; /* Invalid result abort */
-			energy_after += eenv->energy;
-		}
-	} while (sg = sg->next, sg != sd->groups);
+	eenv->nrg.before = compute_task_energy(eenv, eenv->src_cpu);
+	eenv->nrg.after = compute_task_energy(eenv, eenv->dst_cpu);
+	eenv->nrg.diff = eenv->nrg.after - eenv->nrg.before;
+	eenv->payoff = 0;
 
-	eenv->nrg.before = energy_before;
-	eenv->nrg.after = energy_after;
-	eenv->nrg.diff = eenv->nrg.after - eenv->nrg.before;
-	eenv->payoff = 0;
 #ifndef CONFIG_SCHED_TUNE
 	trace_sched_energy_diff(eenv->task, eenv->src_cpu, eenv->dst_cpu,
 				eenv->util_delta,
-- 
1.9.1