Hi all,
First of all, this patch set is for energy comparison optimization. The performance of the energy comparison matters if we want to add more candidate CPUs when picking the best CPU.
Another meaningful point of this patch set is to make the energy calculation task oriented. The current algorithm calculates CPU energy; this patch set changes the concept so that we learn how much energy is introduced by the woken task.
With this patch set, the measured energy calculation durations are below; the duration measurement relies on patch [1]. The statistics use the mean duration (unit: ns) and show the performance improvement from this patch set:
wl: workload runtime percentage with period = 5ms

            Without Patches   With Patches   Opt %
wl:  1%         11858             8457       28.7%
wl:  5%         13028             9534       26.8%
wl: 10%          9361             7831       16.3%
wl: 20%         10736             7999       25.5%
wl: 30%          8216             7210       12.2%
wl: 40%         15222             9538       37.3%
You could check the detailed testing results with LISA scripts [2][3].
This is following up some discussion we had at the SFO17 Connect, so could you review this patch set and let me know if it is good to commit on Gerrit for the Android common kernel?
[1] https://git.linaro.org/people/leo.yan/linux-eas-opt.git/commit/?h=android-hi...
[2] https://github.com/Leo-Yan/lisa/blob/lisa_20180115_add_metrics/ipynb/example...
[3] https://github.com/Leo-Yan/lisa/blob/lisa_20180115_add_metrics/ipynb/example...
Leo Yan (3):
  sched/fair: Optimize energy calculation with task oriented
  sched/fair: Use per cpu data to maintain energy environment
  sched/fair: Record energy and capacity data for every CPU
 kernel/sched/fair.c | 364 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 204 insertions(+), 160 deletions(-)
-- 
1.9.1
Let's first look at an example where a small-utilization task is woken up and we need to calculate the energy for two candidate CPUs; each candidate CPU cannot decide the final OPP by itself because it is bound with other CPUs in the same clock domain, so in the end we need to calculate the energy of all CPUs.
Let's use the CPU topology below as the example:
  Cluster_0    Cluster_1
    CPU_0        CPU_4
    CPU_1        CPU_5
    CPU_2        CPU_6
    CPU_3        CPU_7
The current code always calculates the energy of all CPUs in the bound clock domains; if the candidate CPUs are CPU_0 and CPU_4, the energy calculation formulas are as below (E(x)' denotes the energy term evaluated after the task has been placed):
  E(CPU_0) = E(CPU_0)' + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)'
           + E(CPU_4) + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)
  E(CPU_4) = E(CPU_0) + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)
           + E(CPU_4)' + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)'
E_Diff(CPU_0 - CPU_4) = E(CPU_0) - E(CPU_4)
From the formulas above we can easily see that the energy calculations for CPU_1/2/3/5/6/7 are redundant; so if we only account for the energy the task consumes (instead of computing the energy of all CPUs) after placing it on one specific CPU, the energy calculation can be optimized to:
  E(CPU_0) = E(CPU_0)' + E(CLS_0)' - E(CPU_0) - E(CLS_0)
  E(CPU_4) = E(CPU_4)' + E(CLS_1)' - E(CPU_4) - E(CLS_1)
E_Diff(CPU_0 - CPU_4) = E(CPU_0) - E(CPU_4)
So the number of energy calculation iterations is reduced from 20 to 8; this significantly reduces the energy calculation overhead.
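To sanity-check that the two formulations give the same E_Diff, here is a small standalone C sketch; all power numbers, array names and the two-cluster layout are invented purely for illustration and are not taken from the patch:

```c
#include <assert.h>

/* Invented per-CPU busy energy model: cpu_e[i] is CPU_i's energy at the
 * current OPP, cpu_e_after[i] is its energy once the woken task is placed
 * on it; cls_e[]/cls_e_after[] model the two cluster-level terms. */
static const int cpu_e[8]       = { 10, 10, 10, 10, 20, 20, 20, 20 };
static const int cpu_e_after[8] = { 14, 10, 10, 10, 26, 20, 20, 20 };
static const int cls_e[2]       = { 30, 50 };
static const int cls_e_after[2] = { 33, 55 };

/* Old method: sum every CPU and cluster term, substituting the "after
 * placement" value for the candidate CPU and its cluster; 10 terms per
 * candidate, 20 terms for the two candidates. */
static int energy_all_cpus(int candidate)
{
	int cluster = candidate / 4, e = 0, i;

	for (i = 0; i < 8; i++)
		e += (i == candidate) ? cpu_e_after[i] : cpu_e[i];
	for (i = 0; i < 2; i++)
		e += (i == cluster) ? cls_e_after[i] : cls_e[i];
	return e;
}

/* Task-oriented method: only the delta contributed by the candidate CPU
 * and its own cluster; 4 terms per candidate, 8 in total. */
static int energy_task_delta(int candidate)
{
	int cluster = candidate / 4;

	return cpu_e_after[candidate] + cls_e_after[cluster]
	     - cpu_e[candidate] - cls_e[cluster];
}
```

With this toy model, `energy_all_cpus(0) - energy_all_cpus(4)` and `energy_task_delta(0) - energy_task_delta(4)` give the same difference, while the latter evaluates far fewer terms.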
With the task-oriented calculation there is one case where the energy calculation may take longer than the previous method: if the candidate CPUs are CPU_0 and CPU_1, and placing the task on either CPU raises the CPU OPP. In this case, the old code calculates the energy as:
  E(CPU_0) = E(CPU_0)' + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)
  E(CPU_1) = E(CPU_0) + E(CPU_1)' + E(CPU_2) + E(CPU_3) + E(CLS_0)
E_Diff(CPU_1 - CPU_0) = E(CPU_1) - E(CPU_0)
Because the OPP increase impacts the other CPUs in the same clock domain, the task-oriented method needs to calculate the energy of all related CPUs:
  E(CPU_0) = E(CPU_0)' + E(CPU_1)' + E(CPU_2)' + E(CPU_3)' + E(CLS_0)'
           - E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)
  E(CPU_1) = E(CPU_0)' + E(CPU_1)' + E(CPU_2)' + E(CPU_3)' + E(CLS_0)'
           - E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)
E_Diff(CPU_1 - CPU_0) = E(CPU_1) - E(CPU_0)
We could use more complex methods to optimize this, e.g. first calculate the CPU_0 and CPU_1 OPPs and directly select the CPU with the more power-efficient OPP, or reuse the energy data from before task placement for the two candidates. These methods are left for later optimization.
As a side effect, this patch also resolves an energy calculation consistency issue: in some cases the energy is calculated for one cluster and in other cases for multiple clusters, so the energy data semantics differ between scenarios. This patch fixes that by always calculating task-based energy.
To achieve the optimization, this patch utilizes the 'eenv->sg_cap' and 'eenv->sg_top' parameters. 'eenv->sg_cap' only describes the CPU capacity sharing attribute, so eventually it describes the clock domain shared among CPUs; from it we can determine the final OPP selection. 'eenv->sg_top' defines which CPUs we care about: if the frequency does not change after placing the woken task, it is set to the first-level scheduling group (i.e. a single CPU), so the energy calculation can be limited to that single CPU.
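As a rough illustration of the 'eenv->sg_top' decision described above, here is a hypothetical standalone sketch; the helper name, the domain size and the return convention are all invented and only mirror the idea, not the patch's code:

```c
#include <assert.h>

/* Invented: number of CPUs sharing one clock domain in the example
 * topology (one cluster of the Hikey960-like layout). */
#define CPUS_PER_DOMAIN 4

/* If the capacity (OPP) index is unchanged by the task placement, only
 * the single target CPU needs evaluating (sg_top = single-CPU group);
 * otherwise every CPU sharing the clock domain must be evaluated
 * (sg_top = shared-capacity group). */
static int cpus_to_evaluate(int cap_idx_before, int cap_idx_after)
{
	if (cap_idx_before == cap_idx_after)
		return 1;
	return CPUS_PER_DOMAIN;
}
```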
On Hikey960, after fixing the LITTLE CPU frequency to 1402000Hz and the big CPU to 1421000Hz, with a 10s ftrace log of the home screen scenario the energy calculation duration improves as below:
For energy calculation between a LITTLE CPU and a big CPU, the duration decreases from 34660ns to 16565ns (a 52% reduction); for energy calculation between two CPUs in the same cluster, the duration decreases from 24342ns to 21093ns (a 13% reduction).
Change-Id: I7980fae37195547a663da0bbd7ff8d4a17313e9f
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 273 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 154 insertions(+), 119 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fa3c350..4d5d900 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5620,112 +5620,172 @@ static int group_idle_state(struct energy_env *eenv, struct sched_group *sg)
 }
 /*
- * sched_group_energy(): Computes the absolute energy consumption of cpus
- * belonging to the sched_group including shared resources shared only by
- * members of the group. Iterates over all cpus in the hierarchy below the
- * sched_group starting from the bottom working it's way up before going to
- * the next cpu until all cpus are covered at all levels. The current
- * implementation is likely to gather the same util statistics multiple times.
- * This can probably be done in a faster but more complex way.
- * Note: sched_group_energy() may fail when racing with sched_domain updates.
- */
-static int sched_group_energy(struct energy_env *eenv)
-{
-	struct cpumask visit_cpus;
+ * sched_group_energy(): Computes the absolute energy consumption of specific
+ * sched_group; In this function 'sched_group *sg' corresponds to one specific
+ * hardware level logic, for lowest level it presents CPU logic and calculates
+ * CPU energy, for second level it presents cluster logic and calculates cluster
+ * energy based on the energy parameters.
+ */
+static int sched_group_energy(struct energy_env *eenv, struct sched_group *sg)
+{
 	u64 total_energy = 0;
+	unsigned long group_util;
+	int sg_busy_energy, sg_idle_energy;
+	int cap_idx, idle_idx;
-	WARN_ON(!eenv->sg_top->sge);
+	cap_idx = eenv->cap_idx;
+	idle_idx = group_idle_state(eenv, sg);
+	group_util = group_norm_util(eenv, sg);
-	cpumask_copy(&visit_cpus, sched_group_cpus(eenv->sg_top));
+	sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power);
+	sg_idle_energy = ((SCHED_CAPACITY_SCALE-group_util)
+				* sg->sge->idle_states[idle_idx].power);
-	while (!cpumask_empty(&visit_cpus)) {
-		struct sched_group *sg_shared_cap = NULL;
-		int cpu = cpumask_first(&visit_cpus);
-		struct sched_domain *sd;
+	total_energy = sg_busy_energy + sg_idle_energy;
-		/*
-		 * Is the group utilization affected by cpus outside this
-		 * sched_group?
-		 */
-		sd = rcu_dereference(per_cpu(sd_scs, cpu));
+	eenv->energy += total_energy >> SCHED_CAPACITY_SHIFT;
+	return 0;
+}
-		if (sd && sd->parent)
-			sg_shared_cap = sd->parent->groups;
+/*
+ * sched_group_hierarchy_energy(): Computes the absolute energy consumption
+ * based on sched_group hierarchy. We use 'eenv->sg_top' to indicate how many
+ * CPUs energy should be calculated, e.g. if the target CPU OPP is same before
+ * and after task placement, then we can merely calculate this target CPU
+ * energy and it's higher hierarchy sched_group energy (e.g. for cluster level
+ * or more higher level hardware logical energy consumption); In this case
+ * we don't need calculate other CPUs energy in the same clock domain due other
+ * CPUs energy are no matter with waken task placement.
+ *
+ * If the target CPU OPP is different between before and after task placement,
+ * then this target CPU OPP increasing also impacts other CPUs energy in the
+ * same clock domain, so we need set 'eenv->sg_top' to one higher level so this
+ * can let energy calculation includes other interaction CPUs.
+ *
+ * For this purpose, sched_group_hierarchy_energy() iterates over all
+ * sched_groups in the hierarchy and check if the sched_group has intersection
+ * with 'eenv->sg_top'; if the sched_group is intersect with 'eenv->sg_top'
+ * that means the sched_group has been impacted by task placement and we calls
+ * sched_group_energy() to calculate the energy for the specific sched_group,
+ * finally we can accumulate the energy data related with waken task.
+ */
+static int sched_group_hierarchy_energy(struct energy_env *eenv, int cpu)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
-		for_each_domain(cpu, sd) {
-			struct sched_group *sg = sd->groups;
+	eenv->energy = 0;
+	for_each_domain(cpu, sd) {
+		sg = sd->groups;
-			/* Has this sched_domain already been visited? */
-			if (sd->child && group_first_cpu(sg) != cpu)
-				break;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(eenv->sg_top),
+						sched_group_cpus(sg)))
+				continue;
-			do {
-				unsigned long group_util;
-				int sg_busy_energy, sg_idle_energy;
-				int cap_idx, idle_idx;
+			sched_group_energy(eenv, sg);
-				if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
-					eenv->sg_cap = sg_shared_cap;
-				else
-					eenv->sg_cap = sg;
+		} while (sg = sg->next, sg != sd->groups);
+	}
-				cap_idx = find_new_capacity(eenv, sg->sge);
+	return 0;
+}
-				if (sg->group_weight == 1) {
-					/* Remove capacity of src CPU (before task move) */
-					if (eenv->trg_cpu == eenv->src_cpu &&
-					    cpumask_test_cpu(eenv->src_cpu, sched_group_cpus(sg))) {
-						eenv->cap.before = sg->sge->cap_states[cap_idx].cap;
-						eenv->cap.delta -= eenv->cap.before;
-					}
-					/* Add capacity of dst CPU (after task move) */
-					if (eenv->trg_cpu == eenv->dst_cpu &&
-					    cpumask_test_cpu(eenv->dst_cpu, sched_group_cpus(sg))) {
-						eenv->cap.after = sg->sge->cap_states[cap_idx].cap;
-						eenv->cap.delta += eenv->cap.after;
-					}
-				}
+static inline bool cpu_in_sg(struct sched_group *sg, int cpu)
+{
+	return cpu != -1 && cpumask_test_cpu(cpu, sched_group_cpus(sg));
+}
-				idle_idx = group_idle_state(eenv, sg);
-				group_util = group_norm_util(eenv, sg);
+static inline unsigned long task_util(struct task_struct *p);
-				sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power);
-				sg_idle_energy = ((SCHED_CAPACITY_SCALE-group_util)
-							* sg->sge->idle_states[idle_idx].power);
+/*
+ * compute_task_energy(): Computes the absolute energy consumption for
+ * waken task on specific CPU. It calculates the energy data twice,
+ * the first energy data is before task placement, the second energy data
+ * is after task placed on specific CPU; the difference value between
+ * these two energy is the energy introduced by this task.
+ *
+ * In the energy calculation, we use 'eenv' as a descriptor to track
+ * energy calculation parameters and assign output result in it. Here
+ * have two parameters are important for energy calculation:
+ *
+ * 'eenv->sg_cap' presents the relationship for the sched_group which
+ * shares the same capacity (or the CPUs in this sched_group share the
+ * same clock domain).
+ *
+ * 'eenv->sg_top' presents which CPUs involved for the task energy
+ * calculation. If 'eenv->sg_top' points the lowest level
+ * sched_group, this means the energy calculation is only related with
+ * single CPU, otherwise it calculates all CPUs energy in the same
+ * clock domain.
+ */
+static int compute_task_energy(struct energy_env *eenv, int cpu)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
+	unsigned int prev_cap_idx, next_cap_idx;
+	unsigned long energy;
-				total_energy += sg_busy_energy + sg_idle_energy;
+	sd = rcu_dereference(per_cpu(sd_scs, cpu));
+	if (!sd)
+		return 0;
-				if (!sd->child)
-					cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
+	sg = sd->groups;
-				if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(eenv->sg_top)))
-					goto next_cpu;
+	/*
+	 * The CPU capacity sharing attribution is decided by hardware
+	 * design so we can decide the sg_cap value at the beginning
+	 * for specific CPU.
+	 */
+	if (sd && sd->parent)
+		eenv->sg_cap = sd->parent->groups;
+	else
+		eenv->sg_cap = sd->groups;
-			} while (sg = sg->next, sg != sd->groups);
-		}
+	/* Estimate capacity index before task placement */
+	eenv->trg_cpu = -1;
+	prev_cap_idx = find_new_capacity(eenv, sg->sge);
-		/*
-		 * If we raced with hotplug and got an sd NULL-pointer;
-		 * returning a wrong energy estimation is better than
-		 * entering an infinite loop.
-		 */
-		if (cpumask_test_cpu(cpu, &visit_cpus))
-			return -EINVAL;
-next_cpu:
-		cpumask_clear_cpu(cpu, &visit_cpus);
-		continue;
+	/* Estimate capacity index after task placement */
+	eenv->trg_cpu = cpu;
+	next_cap_idx = find_new_capacity(eenv, sg->sge);
+
+	/*
+	 * Before and after task placement, if the CPU frequency has no change
+	 * then we can set sg_top to the single CPU; this means we can only
+	 * calculate the energy related with this single CPU and ignore other
+	 * CPUs in the same clock domain. If we found the OPP frequency is
+	 * changed then we need to calculate all impacted CPUs in the same
+	 * clock domain, so we need to set to shared capacity scheduling group.
+	 */
+	if (prev_cap_idx != next_cap_idx)
+		eenv->sg_top = sd->parent->groups;
+	else
+		eenv->sg_top = sd->groups;
+
+	/* Remove capacity of src CPU (before task move) */
+	if (cpu == eenv->src_cpu) {
+		eenv->cap.before = sg->sge->cap_states[next_cap_idx].cap;
+		eenv->cap.delta -= eenv->cap.before;
+	/* Add capacity of dst CPU (after task move) */
+	} else if (cpu == eenv->dst_cpu) {
+		eenv->cap.after = sg->sge->cap_states[next_cap_idx].cap;
+		eenv->cap.delta += eenv->cap.after;
 	}
-	eenv->energy = total_energy >> SCHED_CAPACITY_SHIFT;
-	return 0;
-}
+	/* Calculate the energy before task placement */
+	eenv->trg_cpu = -1;
+	eenv->cap_idx = prev_cap_idx;
+	sched_group_hierarchy_energy(eenv, cpu);
+	energy = eenv->energy;
-static inline bool cpu_in_sg(struct sched_group *sg, int cpu)
-{
-	return cpu != -1 && cpumask_test_cpu(cpu, sched_group_cpus(sg));
-}
+	/* Calculate the energy after task placement */
+	eenv->trg_cpu = cpu;
+	eenv->cap_idx = next_cap_idx;
+	sched_group_hierarchy_energy(eenv, cpu);
-static inline unsigned long task_util(struct task_struct *p);
+	return eenv->energy - energy;
+}
 /*
  * energy_diff(): Estimate the energy impact of changing the utilization
@@ -5737,53 +5797,28 @@ static inline bool cpu_in_sg(struct sched_group *sg, int cpu)
 static inline int __energy_diff(struct energy_env *eenv)
 {
 	struct sched_domain *sd;
-	struct sched_group *sg;
-	int sd_cpu = -1, energy_before = 0, energy_after = 0;
+	int sd_cpu = -1;
 	int diff, margin;
-	struct energy_env eenv_before = {
-		.util_delta	= task_util(eenv->task),
-		.src_cpu	= eenv->src_cpu,
-		.dst_cpu	= eenv->dst_cpu,
-		.trg_cpu	= eenv->src_cpu,
-		.nrg		= { 0, 0, 0, 0},
-		.cap		= { 0, 0, 0 },
-		.task		= eenv->task,
-	};
-
 	if (eenv->src_cpu == eenv->dst_cpu)
 		return 0;
 	sd_cpu = (eenv->src_cpu != -1) ? eenv->src_cpu : eenv->dst_cpu;
-	sd = rcu_dereference(per_cpu(sd_ea, sd_cpu));
+	/*
+	 * The CPU capacity sharing attribution is fixed by hardware
+	 * design so we can decide the sg_cap value at the beginning
+	 * for specific CPU.
+	 */
+	sd = rcu_dereference(per_cpu(sd_scs, sd_cpu));
 	if (!sd)
 		return 0; /* Error */
-	sg = sd->groups;
-
-	do {
-		if (cpu_in_sg(sg, eenv->src_cpu) || cpu_in_sg(sg, eenv->dst_cpu)) {
-			eenv_before.sg_top = eenv->sg_top = sg;
-
-			if (sched_group_energy(&eenv_before))
-				return 0; /* Invalid result abort */
-			energy_before += eenv_before.energy;
-
-			/* Keep track of SRC cpu (before) capacity */
-			eenv->cap.before = eenv_before.cap.before;
-			eenv->cap.delta = eenv_before.cap.delta;
-
-			if (sched_group_energy(eenv))
-				return 0; /* Invalid result abort */
-			energy_after += eenv->energy;
-		}
-	} while (sg = sg->next, sg != sd->groups);
+	eenv->nrg.before = compute_task_energy(eenv, eenv->src_cpu);
+	eenv->nrg.after = compute_task_energy(eenv, eenv->dst_cpu);
+	eenv->nrg.diff = eenv->nrg.after - eenv->nrg.before;
+	eenv->payoff = 0;
-	eenv->nrg.before = energy_before;
-	eenv->nrg.after = energy_after;
-	eenv->nrg.diff = eenv->nrg.after - eenv->nrg.before;
-	eenv->payoff = 0;
-
 #ifndef CONFIG_SCHED_TUNE
 	trace_sched_energy_diff(eenv->task,
 			eenv->src_cpu, eenv->dst_cpu, eenv->util_delta,
-- 
1.9.1
Hi all,
On Fri, Jan 19, 2018 at 12:34:20PM +0800, Leo Yan wrote:
[...]
Thanks a lot to Daniel for the review and suggestions. This patch is big and hard to digest, so I will split it into two smaller patches (or even more if I can) for easier reviewing:
- The first patch adds CPU frequency prediction and cancels redundant CPUs
  from the energy calculation;
- The second patch introduces task energy calculation.
I will prepare a new patch set for this. FYI.
[...]
Thanks, Leo Yan
This commit changes the code to use per-CPU data to maintain the energy environment structure, so it can be moved off the kernel stack, avoiding potential stack overflow if we add more items to the energy environment structure in later optimizations.
Change-Id: I7aed4c972c464ca683828d85b9b8f9311622da55
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4d5d900..6dee639 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5453,6 +5453,8 @@ struct energy_env {
 	} cap;
 };
+static DEFINE_PER_CPU(struct energy_env, energy_env);
+
 static int cpu_util_wake(int cpu, struct task_struct *p);
 /*
@@ -7079,14 +7081,14 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 	if (target_cpu != prev_cpu) {
 		int delta = 0;
-		struct energy_env eenv = {
-			.util_delta	= task_util(p),
-			.src_cpu	= prev_cpu,
-			.dst_cpu	= target_cpu,
-			.task		= p,
-			.trg_cpu	= target_cpu,
-		};
+		struct energy_env *eenv = this_cpu_ptr(&energy_env);
+		memset(eenv, 0x0, sizeof(*eenv));
+		eenv->util_delta = task_util(p);
+		eenv->task = p;
+		eenv->src_cpu = prev_cpu;
+		eenv->dst_cpu = target_cpu;
+		eenv->trg_cpu = target_cpu;
 #ifdef CONFIG_SCHED_WALT
 		if (!walt_disabled && sysctl_sched_use_walt_cpu_util &&
@@ -7100,14 +7102,14 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 			goto unlock;
 		}
-		if (energy_diff(&eenv) >= 0) {
+		if (energy_diff(eenv) >= 0) {
 			/* No energy saving for target_cpu, try backup */
 			target_cpu = tmp_backup;
-			eenv.dst_cpu = target_cpu;
-			eenv.trg_cpu = target_cpu;
+			eenv->dst_cpu = target_cpu;
+			eenv->trg_cpu = target_cpu;
 			if (tmp_backup < 0 ||
 			    tmp_backup == prev_cpu ||
-			    energy_diff(&eenv) >= 0) {
+			    energy_diff(eenv) >= 0) {
 				schedstat_inc(p->se.statistics.nr_wakeups_secb_no_nrg_sav);
 				schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav);
 				target_cpu = prev_cpu;
-- 
1.9.1
On 19/01/2018 05:34, Leo Yan wrote:
This commit changes to use per cpu data to maintain energy environment structure, so can move it out from kernel stack and avoid the stack overflow issue if we want to add more items into energy environment structure for later optimization.
Change-Id: I7aed4c972c464ca683828d85b9b8f9311622da55 Signed-off-by: Leo Yan leo.yan@linaro.org
[ ... ]
@@ -7079,14 +7081,14 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 	if (target_cpu != prev_cpu) {
 		int delta = 0;
struct energy_env eenv = {
.util_delta = task_util(p),
.src_cpu = prev_cpu,
.dst_cpu = target_cpu,
.task = p,
.trg_cpu = target_cpu,
};
struct energy_env *eenv = this_cpu_ptr(&energy_env);
memset(eenv, 0x0, sizeof(*eenv));
eenv->util_delta = task_util(p);
eenv->task = p;
eenv->src_cpu = prev_cpu;
eenv->dst_cpu = target_cpu;
eenv->trg_cpu = target_cpu;
Allocating a structure on the stack is just a register address increment; here, by contrast, the memset is much more expensive.
Regarding the stack overflow, I'm not sure it is worth creating a per-CPU variable. If the issue happens, it must be caught and fixed in a proper way.
Hi Daniel,
Thanks for reviewing.
On Wed, Jan 31, 2018 at 11:50:23AM +0100, Daniel Lezcano wrote:
On 19/01/2018 05:34, Leo Yan wrote:
This commit changes to use per cpu data to maintain energy environment structure, so can move it out from kernel stack and avoid the stack overflow issue if we want to add more items into energy environment structure for later optimization.
Change-Id: I7aed4c972c464ca683828d85b9b8f9311622da55 Signed-off-by: Leo Yan leo.yan@linaro.org
[ ... ]
@@ -7079,14 +7081,14 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 	if (target_cpu != prev_cpu) {
 		int delta = 0;
struct energy_env eenv = {
.util_delta = task_util(p),
.src_cpu = prev_cpu,
.dst_cpu = target_cpu,
.task = p,
.trg_cpu = target_cpu,
};
struct energy_env *eenv = this_cpu_ptr(&energy_env);
memset(eenv, 0x0, sizeof(*eenv));
eenv->util_delta = task_util(p);
eenv->task = p;
eenv->src_cpu = prev_cpu;
eenv->dst_cpu = target_cpu;
eenv->trg_cpu = target_cpu;
Allocating a structure on the stack is just a register address increment. Here, the result is the memset is much more expensive.
Regarding the stack overflow, I'm not sure it is worth to create a percpu variable. The issue must be caught and fixed in a proper way if it happens.
This patch is a preparation for patch 3/3, which adds more elements to record CPU energy as cached values so the calculations are not repeated. After we add the per-CPU data for caching energy and capacity, the structure is no longer scalable on the stack for systems with many CPUs.
If we only consider a mobile system with 8 CPUs, the energy_env structure size increases by 96 bytes (per-CPU data for three items: int nrg, int cap, int used; so 4 bytes x 3 x 8 = 96 bytes); if the system has 16 CPUs, the structure grows by 192 bytes. This is not big pressure for a 4KB/8KB kernel stack.
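The size arithmetic above can be checked with a minimal mock (assuming a 4-byte int and no struct padding; 'eenv_cpu' and 'eenv_growth' are invented names, not identifiers from the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Typical mobile platform assumed in the discussion above. */
#define NR_CPUS 8

/* Mock of the three per-CPU items patch 3/3 adds to energy_env. */
struct eenv_cpu {
	int nrg;
	int cap;
	int used;
};

/* Growth of energy_env: three ints per CPU. */
static size_t eenv_growth(void)
{
	return sizeof(struct eenv_cpu[NR_CPUS]);
}
```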
So we can drop this patch if we only consider mobile platform.
Thanks, Leo Yan
This commit adds energy and capacity recording items for every CPU, which helps when comparing CPU energy multiple times. For example, when comparing the previous CPU against several candidate CPUs, we only need to calculate the energy for the previous CPU once: the first comparison computes its energy data, and later comparisons can directly use the cached energy value.
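The caching behaviour described above can be sketched as a standalone toy (all names and the energy model are invented for illustration; this is not the scheduler code itself):

```c
#include <assert.h>

#define NR_CPUS 8

/* Hypothetical, simplified stand-in for the per-CPU cache: the first
 * request for a CPU computes and records its energy, later requests
 * return the cached value without recomputation. */
struct eenv_cache {
	int nrg[NR_CPUS];
	int used[NR_CPUS];
	int nr_computed;	/* counts real computations, for the demo */
};

static struct eenv_cache cache;	/* zero-initialized */

/* Invented stand-in for the expensive energy calculation. */
static int expensive_energy_calc(struct eenv_cache *c, int cpu)
{
	c->nr_computed++;
	return 100 + cpu;
}

static int task_energy(struct eenv_cache *c, int cpu)
{
	if (!c->used[cpu]) {
		c->nrg[cpu] = expensive_energy_calc(c, cpu);
		c->used[cpu] = 1;
	}
	return c->nrg[cpu];
}
```

A repeated query for the same CPU hits the cache, so the expensive calculation runs only once per CPU, which is exactly the saving the 'used' flag in the patch is after.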
Change-Id: I80ef8bfcb54408e376f4c6bb0181776cdf954367
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 97 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 52 insertions(+), 45 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6dee639..0f34d81 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5438,19 +5438,18 @@ struct energy_env {
 	int dst_cpu;
 	int trg_cpu;
 	int energy;
-	int payoff;
 	struct task_struct *task;
+
+	int nrg_delta;
+	int nrg_diff;
+	int cap_delta;
+	int payoff;
+
 	struct {
-		int before;
-		int after;
-		int delta;
-		int diff;
-	} nrg;
-	struct {
-		int before;
-		int after;
-		int delta;
-	} cap;
+		int nrg;
+		int cap;
+		int used;
+	} cpu[NR_CPUS];
 };
 static DEFINE_PER_CPU(struct energy_env, energy_env);
@@ -5728,6 +5727,9 @@ static int compute_task_energy(struct energy_env *eenv, int cpu)
 	unsigned int prev_cap_idx, next_cap_idx;
 	unsigned long energy;
+	if (eenv->cpu[cpu].used)
+		return 0;
+
 	sd = rcu_dereference(per_cpu(sd_scs, cpu));
 	if (!sd)
 		return 0;
@@ -5765,16 +5767,6 @@ static int compute_task_energy(struct energy_env *eenv, int cpu)
 	else
 		eenv->sg_top = sd->groups;
-	/* Remove capacity of src CPU (before task move) */
-	if (cpu == eenv->src_cpu) {
-		eenv->cap.before = sg->sge->cap_states[next_cap_idx].cap;
-		eenv->cap.delta -= eenv->cap.before;
-	/* Add capacity of dst CPU (after task move) */
-	} else if (cpu == eenv->dst_cpu) {
-		eenv->cap.after = sg->sge->cap_states[next_cap_idx].cap;
-		eenv->cap.delta += eenv->cap.after;
-	}
-
 	/* Calculate the energy before task placement */
 	eenv->trg_cpu = -1;
 	eenv->cap_idx = prev_cap_idx;
@@ -5786,6 +5778,10 @@ static int compute_task_energy(struct energy_env *eenv, int cpu)
 	eenv->cap_idx = next_cap_idx;
 	sched_group_hierarchy_energy(eenv, cpu);
+	eenv->cpu[cpu].cap = sg->sge->cap_states[next_cap_idx].cap;
+	eenv->cpu[cpu].nrg = eenv->energy - energy;
+	eenv->cpu[cpu].used = 1;
+
 	return eenv->energy - energy;
 }
@@ -5816,29 +5812,34 @@ static inline int __energy_diff(struct energy_env *eenv)
 	if (!sd)
 		return 0; /* Error */
-	eenv->nrg.before = compute_task_energy(eenv, eenv->src_cpu);
-	eenv->nrg.after = compute_task_energy(eenv, eenv->dst_cpu);
-	eenv->nrg.diff = eenv->nrg.after - eenv->nrg.before;
+	compute_task_energy(eenv, eenv->src_cpu);
+	compute_task_energy(eenv, eenv->dst_cpu);
+
+	eenv->cap_delta = eenv->cpu[eenv->dst_cpu].cap -
+			  eenv->cpu[eenv->src_cpu].cap;
+	eenv->nrg_diff = eenv->cpu[eenv->dst_cpu].nrg -
+			 eenv->cpu[eenv->src_cpu].nrg;
 	eenv->payoff = 0;
 #ifndef CONFIG_SCHED_TUNE
 	trace_sched_energy_diff(eenv->task,
-			eenv->src_cpu, eenv->dst_cpu, eenv->util_delta,
-			eenv->nrg.before, eenv->nrg.after, eenv->nrg.diff,
-			eenv->cap.before, eenv->cap.after, eenv->cap.delta,
-			eenv->nrg.delta, eenv->payoff);
+			eenv->src_cpu, eenv->dst_cpu, eenv->util_delta,
+			eenv->cpu[eenv->src_cpu].nrg, eenv->cpu[eenv->dst_cpu].nrg,
+			eenv->nrg_diff,
+			eenv->cpu[eenv->src_cpu].cap, eenv->cpu[eenv->dst_cpu].cap,
+			eenv->cap_delta, eenv->nrg_delta, eenv->payoff);
 #endif
 	/*
 	 * Dead-zone margin preventing too many migrations.
 	 */
-	margin = eenv->nrg.before >> 6; /* ~1.56% */
+	margin = eenv->cpu[eenv->src_cpu].nrg >> 6; /* ~1.56% */
-	diff = eenv->nrg.after - eenv->nrg.before;
+	diff = eenv->nrg_diff;
-	eenv->nrg.diff = (abs(diff) < margin) ? 0 : eenv->nrg.diff;
+	eenv->nrg_diff = (abs(diff) < margin) ? 0 : diff;
-	return eenv->nrg.diff;
+	return eenv->nrg_diff;
 }
 #ifdef CONFIG_SCHED_TUNE
@@ -5900,27 +5901,33 @@ static inline int __energy_diff(struct energy_env *eenv)
 	/* Return energy diff when boost margin is 0 */
 	if (boost == 0) {
 		trace_sched_energy_diff(eenv->task,
-				eenv->src_cpu, eenv->dst_cpu, eenv->util_delta,
-				eenv->nrg.before, eenv->nrg.after, eenv->nrg.diff,
-				eenv->cap.before, eenv->cap.after, eenv->cap.delta,
-				0, -eenv->nrg.diff);
-		return eenv->nrg.diff;
+				eenv->src_cpu, eenv->dst_cpu,
+				eenv->util_delta,
+				eenv->cpu[eenv->src_cpu].nrg,
+				eenv->cpu[eenv->dst_cpu].nrg,
+				eenv->nrg_diff,
+				eenv->cpu[eenv->src_cpu].cap,
+				eenv->cpu[eenv->dst_cpu].cap,
+				eenv->cap_delta, 0, -eenv->nrg_diff);
+		return eenv->nrg_diff;
 	}
 	/* Compute normalized energy diff */
-	nrg_delta = normalize_energy(eenv->nrg.diff);
-	eenv->nrg.delta = nrg_delta;
+	nrg_delta = normalize_energy(eenv->nrg_diff);
+	eenv->nrg_delta = nrg_delta;
 	eenv->payoff = schedtune_accept_deltas(
-			eenv->nrg.delta,
-			eenv->cap.delta,
+			eenv->nrg_delta,
+			eenv->cap_delta,
 			eenv->task);
 	trace_sched_energy_diff(eenv->task,
-			eenv->src_cpu, eenv->dst_cpu, eenv->util_delta,
-			eenv->nrg.before, eenv->nrg.after, eenv->nrg.diff,
-			eenv->cap.before, eenv->cap.after, eenv->cap.delta,
-			eenv->nrg.delta, eenv->payoff);
+			eenv->src_cpu, eenv->dst_cpu,
+			eenv->util_delta,
+			eenv->cpu[eenv->src_cpu].nrg, eenv->cpu[eenv->dst_cpu].nrg,
+			eenv->nrg_diff,
+			eenv->cpu[eenv->src_cpu].cap, eenv->cpu[eenv->dst_cpu].cap,
+			eenv->cap_delta, eenv->nrg_delta, eenv->payoff);
 	/*
 	 * When SchedTune is enabled, the energy_diff() function will return
-- 
1.9.1
On Fri, Jan 19, 2018 at 12:34:19PM +0800, Leo Yan wrote:
Hi all,
First of all, this patch set is for energy comparison optimization. The performance of the energy comparison matters if we want to add more candidate CPUs when picking the best CPU.
Another meaningful point of this patch set is to make the energy calculation task oriented. The current algorithm calculates CPU energy; this patch set changes the concept so that we learn how much energy is introduced by the woken task.
With this patch set, the measured energy calculation durations are below; the duration measurement relies on patch [1]. The statistics use the mean duration (unit: ns) and show the performance improvement from this patch set:
wl: workload runtime percentage with period = 5ms

            Without Patches   With Patches   Opt %
wl:  1%         11858             8457       28.7%
wl:  5%         13028             9534       26.8%
wl: 10%          9361             7831       16.3%
wl: 20%         10736             7999       25.5%
wl: 30%          8216             7210       12.2%
wl: 40%         15222             9538       37.3%
You could check the detailed testing results with LISA scripts [2][3].
FWIW, I also paste the duration results for the video/audio cases (with LISA script [4]):
          Without Patches   With Patches   Opt %
Video          6884             5194       24.5%
Audio         11242             7812       30.5%
This is following up some discussion we had at the SFO17 Connect, so could you review this patch set and let me know if it is good to commit on Gerrit for the Android common kernel?
Joel, Chris, Patrick & all,
I'd really like to hear your suggestions and comments :) Is it possible to upstream this patch set for the Android kernel?
[1] https://git.linaro.org/people/leo.yan/linux-eas-opt.git/commit/?h=android-hi...
[2] https://github.com/Leo-Yan/lisa/blob/lisa_20180115_add_metrics/ipynb/example...
[3] https://github.com/Leo-Yan/lisa/blob/lisa_20180115_add_metrics/ipynb/example...
[4] https://github.com/Leo-Yan/lisa/blob/lisa_20180115_add_metrics/ipynb/sched_d...
Leo Yan (3):
  sched/fair: Optimize energy calculation with task oriented
  sched/fair: Use per cpu data to maintain energy environment
  sched/fair: Record energy and capacity data for every CPU
 kernel/sched/fair.c | 364 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 204 insertions(+), 160 deletions(-)
-- 
1.9.1