On 17-Mar 20:05, Leo Yan wrote:
Let's first look at one example: a small-utilization task is woken up and we need to calculate the energy for two candidate CPUs. From the hardware design perspective, each candidate CPU cannot decide its OPP by itself because it is bound with other CPUs in the same clock domain (e.g. it shares the clock domain with the other CPUs in its cluster), so in the end we need to compute the energy for all CPUs in the cluster.
Let's use the CPU topology below as an example:

  Cluster_0    Cluster_1
  CPU_0        CPU_4
  CPU_1        CPU_5
  CPU_2        CPU_6
  CPU_3        CPU_7
The current code always calculates the energy for all CPUs in the bound clock domains; if the candidates are CPU_0 and CPU_4, the formula for the energy computation is as below.
E(CPU_x) stands for the CPU_x energy; the notation E(CPU_x)' means the CPU_x energy after the task is placed on it; TE(CPU_x) means the total energy across all CPUs when the task is placed on CPU_x; E_DIFF(CPU_x - CPU_y) means the energy difference between placing the task on CPU_x and on CPU_y, i.e. TE(CPU_x) - TE(CPU_y).
  TE(CPU_0) = E(CPU_0)' + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)' +
              E(CPU_4) + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)
  TE(CPU_4) = E(CPU_0) + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0) +
              E(CPU_4)' + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)'
E_DIFF(CPU_0 - CPU_4) = TE(CPU_0) - TE(CPU_4)
From the formulas above we can easily see that the CPU_1/2/3/5/6/7 energy computations are redundant.
This means that, for example, the contribution E(CPU_2) is the same in TE(CPU_0) and TE(CPU_4), right?
If that's what you mean, I'm not entirely convinced. This is true _IFF_ the OPP for Cluster_0 does not change while having the task in CPU_0 or having the task in the other cluster.
If the OPP for Cluster_0 should change, then E(CPU_2) can have two very different values in TE(CPU_0) and TE(CPU_4).
On the other hand, if we only consider the energy consumed by the woken task, the energy difference is between the target CPU's energy before and after the task is placed on it.
Sorry, I don't completely get this last part... and it seems it's the key to understanding the next paragraph :/
One benefit of this method is that it can avoid computing the energy of all CPUs in the same cluster, and instead focus only on the energy change introduced by the woken task on the target CPU.
Do you mean that we should ignore the blocked utilization?
This method is called 'task oriented computation' in this patch; the energy computation can be optimized as:
  TE(CPU_0) = E(CPU_0)' + E(CLS_0)' - E(CPU_0) - E(CLS_0)
  TE(CPU_4) = E(CPU_4)' + E(CLS_1)' - E(CPU_4) - E(CLS_1)
Again, IMHO, in the above formula, if the OPP could change we cannot disregard the energy variations of the other CPUs in Cluster_1.
E_DIFF(CPU_0 - CPU_4) = TE(CPU_0) - TE(CPU_4)
As a result, the number of energy computations drops from 20 terms (two candidates, ten terms each) to 8 (two candidates, four terms each); this significantly reduces the calculation overhead.
After switching to task oriented computation, there is one case where the computation might take longer than with the previous method: when the candidates are CPU_0 and CPU_1, and placing the task on either CPU will raise the OPP.
... but this can happen also in the previous case.
Let's say that CPU_4 has a utilization which is ~80% less than the capacity of its current OPP and that the task's PREV_CPU is CPU_0.
Let's also assume that the utilization of the task is big enough to require a capacity increase on CPU_4. Then the energy of all the other CPUs in Cluster_1 is different between the before and after cases.
In this case, the old method uses the computation below:
  TE(CPU_0) = E(CPU_0)' + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)
  TE(CPU_1) = E(CPU_0) + E(CPU_1)' + E(CPU_2) + E(CPU_3) + E(CLS_0)
E_DIFF(CPU_0 - CPU_1) = TE(CPU_0) - TE(CPU_1)
With task oriented computation, because the OPP increase impacts the other CPUs in the same cluster, the energies of all related CPUs need to be calculated:
  TE(CPU_0) = E(CPU_0)' + E(CPU_1)' + E(CPU_2)' + E(CPU_3)' + E(CLS_0)' -
              E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)
  TE(CPU_1) = E(CPU_0)' + E(CPU_1)' + E(CPU_2)' + E(CPU_3)' + E(CLS_0)' -
              E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)

(in each line the primed terms are evaluated with the task placed on that line's candidate CPU)
E_DIFF(CPU_0 - CPU_1) = TE(CPU_0) - TE(CPU_1)
We could use a more complex method for optimization, e.g. extend the eenv structure to cache the CPU energy data so it can be reused for the two candidates. This can be done as a later optimization.
As a side effect, this patch also resolves an energy calculation consistency issue: in some cases the energy calculation covers one cluster, in other cases multiple clusters, so the energy data semantics are not consistent across scenarios.
That's because we care about relative differences in energy consumption...
This computation inconsistency might introduce trouble for the PE filter.
... we should look into those issues. We know about some of them and already have patches, which were part of the 3.18 kernel used by Pixel devices but are not yet merged in 4.4, since we noticed that these cases, although broken, also do not happen on Android because of the schedtune configurations used by Pixel devices.
This patch fixes the issue by always calculating task-based energy.
To achieve the optimization, this patch utilizes the 'eenv->sg_cap' and 'eenv->sg_top' parameters. 'eenv->sg_cap' describes only the CPU capacity sharing attribution, i.e. the clock domain shared between CPUs, and from it we can know the final OPP selection. 'eenv->sg_top' defines which CPUs we care about: if the frequency does not change after placing the woken task, it is set to the first level scheduling group (i.e. the single CPU), so the energy computation is limited to that single CPU; otherwise it falls back to computing all CPUs in the same clock domain.

Ok, so that means that we always check for OPP changes... then it should work...
Change-Id: Ifb64ad77c6173388abf13e255e2ed8e8586a38bc
Signed-off-by: Leo Yan <leo.yan@linaro.org>
 kernel/sched/fair.c | 76 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 43 insertions(+), 33 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49eee75..4811fce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5473,7 +5473,7 @@ struct energy_env {
 		/* Estimated energy variation wrt previous CPU */
 		int nrg_delta;
-	} cpu[NR_CPUS];
+	} cpu[NR_CPUS*2];
	      ^^^^^^^^^^^
Why can't we have a before/after set of variables directly in the definition of eenv?
Given the usage below it should also turn out to be slightly more cache friendly, i.e.:

	struct energy_env {
	#define PREV_CAP 0
	#define NEXT_CAP 1
		struct {
			// ...
		} cpu[2][NR_CPUS];
	};
	/* The most energy efficient CPU for the specified energy_env::p */
	int next_cpu;

@@ -5791,45 +5791,55 @@ static int compute_energy(struct energy_env *eenv, int candidate)
  */
 static int compute_task_energy(struct energy_env *eenv, int cpu)
 {
-	struct sched_domain *sd, *sd_cap;
-	struct sched_group *sg;
-	int first_cpu;
+	struct sched_domain *sd;
+	unsigned int prev_cap_idx, next_cap_idx;
+	int cmp_idx, ret;

-	sd = rcu_dereference(per_cpu(sd_ea, cpu));
+	sd = rcu_dereference(per_cpu(sd_scs, cpu));
 	if (!sd)
 		return -1; /* Error */

-	sg = sd->groups;
-	do {
-		/* Skip SGs which do not contains a candidate CPU */
-		if (!cpumask_intersects(&eenv->cpus_mask, sched_group_cpus(sg)))
-			continue;
-
-		eenv->sg_top = sg;
-		first_cpu = cpumask_first(sched_group_cpus(sg));
-
-		/*
-		 * The CPU capacity sharing attribution is decided by hardware
-		 * design so we can decide the sg_cap value at the beginning
-		 * for specific CPU.
-		 */
-		sd_cap = rcu_dereference(per_cpu(sd_scs, first_cpu));
-		if (sd_cap && sd_cap->parent)
-			eenv->sg_cap = sd_cap->parent->groups;
-		else
-			eenv->sg_cap = sd_cap->groups;
-
-		find_new_capacity(eenv, cpu);
+	/*
+	 * The CPU capacity sharing attribution is decided by hardware
+	 * design so we can decide the sg_cap value at the beginning
+	 * for specific CPU.
+	 */
+	if (sd && sd->parent)
+		eenv->sg_cap = sd->parent->groups;
+	else
+		eenv->sg_cap = sd->groups;
+
+	/* Estimate capacity index before task placement */
+	cmp_idx = NR_CPUS + cpu;
+	prev_cap_idx = find_new_capacity(eenv, cmp_idx);
+	next_cap_idx = find_new_capacity(eenv, cpu);
^^^^
Based on the above suggestion, this parameter would be {PREV,NEXT}_CAP.
+	/*
+	 * Computation is iteration sched_group from bottom to up level for
+	 * energy accumulation, 'sg_top' is top most sched_group:
+	 * - If the CPU frequency has no change before and after task placed
+	 *   onto the target CPU, set 'sg_top' to sched_group for the target
+	 *   CPU; this means only to calculate the energy for this single CPU
+	 *   and ignore other CPUs in the same clock domain.
+1
+	 * - If found the OPP frequency is changed after task placement then
+	 *   need to calculate all CPUs who bound in the same clock domain,
+	 *   so set 'sg_top' to shared capacity scheduling group.
+	 */
+	if (prev_cap_idx != next_cap_idx)
+		eenv->sg_top = eenv->sg_cap;
+	else
+		eenv->sg_top = sd->groups;
-		/* energy is unscaled to reduce rounding errors */
-		if (compute_energy(eenv, cpu) == -EINVAL) {
-			eenv->next_cpu = eenv->prev_cpu;
-			return -EINVAL;
-		}
+	/* energy is unscaled to reduce rounding errors */
+	ret = compute_energy(eenv, cmp_idx);
^^^^^^^
Same here... that would be a {PREV,NEXT}_CAP.
+	if (ret < 0)
+		return ret;
-	} while (sg = sg->next, sg != sd->groups);
+
+	ret = compute_energy(eenv, cpu);
+	if (ret < 0)
+		return ret;
+
+	eenv->cpu[cpu].energy -= eenv->cpu[cmp_idx].energy;

With the suggestion above this would read:

	eenv->cpu[NEXT_CAP][cpu].energy -= eenv->cpu[PREV_CAP][cpu].energy;

 	return 0;
 }
--
1.9.1

--
#include <best/regards.h>
Patrick Bellasi