This patch set optimizes the energy computation on the Android kernel android-4.9-eas-dev branch [1].
Patches 0001-0012 refactor the code and apply some minor optimizations; without them the task-oriented energy computation would be hard to land in the current code.
Patch 0013 "sched/fair: Optimize energy computation with task oriented" is the core patch of the series; it implements the per-task energy calculation. Patch 0014 is a follow-up patch that caches computed energy values, trading extra memory for better performance.
Patch 0015 is a trivial experimental patch that removes the 'idle state estimation'.
The testing uses rt-app to generate synthetic workloads with duty cycles of 1%/5%/10%/20%/30%/40%; the duration is the measured interval of select_energy_cpu_idx(), which now evaluates three candidates in a single run. The results show that this patch set improves the energy computation duration:
+----------------+-------+-------+-------+-------+-------+-------+
| workload       |   1%  |   5%  |  10%  |  20%  |  30%  |  40%  |
+----------------+-------+-------+-------+-------+-------+-------+
| w/o patch set  | 17267 | 21227 | 17019 | 13914 | 15002 | 23412 |
| w/t patch set  | 10823 | 11924 | 10931 | 10785 | 11139 | 11223 |
+----------------+-------+-------+-------+-------+-------+-------+
| Opt percentage |   37% |   43% |   36% |   22% |   26% |   52% |
+----------------+-------+-------+-------+-------+-------+-------+
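For example, for the 1% duty-cycle workload the optimization percentage is derived as (17267 - 10823) / 17267 ~= 37%; the other columns follow the same formula.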
The detailed testing IPython notebooks can be found at [2][3].
[1] https://android.googlesource.com/kernel/common/+/android-4.9-eas-dev
[2] https://github.com/Leo-Yan/lisa/blob/2018_03_17_android_4.9_eas_dev_nrg_comp...
[3] https://github.com/Leo-Yan/lisa/blob/2018_03_17_android_4.9_eas_dev_nrg_comp...
Leo Yan (15):
  sched/fair: Prepare energy env cpumask before energy calculation
  sched/fair: Re-define return values for select_energy_cpu_idx()
  sched/fair: Reduce indent in select_energy_cpu_brute()
  sched/fair: Fix one minor typo
  sched/fair: Use per cpu data to maintain energy environment
  sched/fair: Use cpumask to track candidates for energy calculation
  sched/fair: Lift CPU iteration out of calc_sg_energy()
  sched/fair: Introduce new function compute_task_energy()
  sched/fair: Decide 'eenv->sg_cap' ahead energy computation
  sched/fair: Use eenv::sg_cap to select capacity index
  sched/fair: Estimate capacity index ahead energy computation
  sched/fair: Refactor compute_energy()
  sched/fair: Optimize energy computation with task oriented
  sched/fair: Optimize energy calculation with cached energy data
  sched/fair: Remove idle state estimation
 kernel/sched/fair.c | 542 +++++++++++++++++++++++++---------------------------
 1 file changed, 256 insertions(+), 286 deletions(-)
--
1.9.1
Move the energy env cpumask setup out of select_energy_cpu_idx() and prepare it ahead of time. This makes the code clearer: the energy environment is fully prepared before the energy calculation starts.
Change-Id: I9061492177ad8d4830cf8bbdaffdf7860b27e319
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 315c386..37231b8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5874,15 +5874,6 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv) if (!sd) return EAS_CPU_PRV;
- cpumask_clear(&eenv->cpus_mask); - for (cpu_idx = EAS_CPU_PRV; cpu_idx < EAS_CPU_CNT; ++cpu_idx) { - int cpu = eenv->cpu[cpu_idx].cpu_id; - - if (cpu < 0) - continue; - cpumask_set_cpu(cpu, &eenv->cpus_mask); - } - sg = sd->groups; do { /* Skip SGs which do not contains a candidate CPU */ @@ -7133,7 +7124,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync }, };
- #ifdef CONFIG_SCHED_WALT if (!walt_disabled && sysctl_sched_use_walt_cpu_util && p->state == TASK_WAKING) @@ -7147,6 +7137,14 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync goto unlock; }
+ cpumask_clear(&eenv.cpus_mask); + if (prev_cpu >= 0) + cpumask_set_cpu(prev_cpu, &eenv.cpus_mask); + if (next_cpu >= 0) + cpumask_set_cpu(next_cpu, &eenv.cpus_mask); + if (backup_cpu >= 0) + cpumask_set_cpu(backup_cpu, &eenv.cpus_mask); + /* Check if EAS_CPU_NXT is a more energy efficient CPU */ if (select_energy_cpu_idx(&eenv) != EAS_CPU_PRV) { schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav); -- 1.9.1
select_energy_cpu_idx() returns the target CPU index, and eenv->next_idx holds the target CPU index as well; it's pointless to return exactly the same value in two ways.
This patch changes the return value convention of select_energy_cpu_idx(): errors return a negative value and success returns 0. Callers then read the target CPU index from eenv->next_idx.
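As a minimal illustrative sketch (not part of the patch), a caller under the new convention looks roughly like:

	/*
	 * Sketch only: the return value signals success (0) or failure
	 * (negative); the chosen candidate is always read back from
	 * eenv->next_idx, which the function sets to EAS_CPU_PRV on error.
	 */
	if (select_energy_cpu_idx(&eenv) < 0)
		target_cpu = prev_cpu;	/* abort: stay on previous CPU */
	else
		target_cpu = eenv.cpu[eenv.next_idx].cpu_id;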
Change-Id: I57bc9c96566eaf6085c85ea9386d11422cafddbf
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 37231b8..0f0b307 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5850,16 +5850,15 @@ static inline bool cpu_in_sg(struct sched_group *sg, int cpu) * select_energy_cpu_idx(): estimate the energy impact of changing the * utilization distribution. * - * The eenv parameter specifies the changes: utilisation amount and a pair of - * possible CPU candidates (the previous CPU and a different target CPU). - * - * This function returns the index of a CPU candidate specified by the + * eenv::next_idx returns the index of a CPU candidate specified by the * energy_env which corresponds to the first CPU saving energy. * Thus, 0 (EAS_CPU_PRV) means that non of the CPU candidate is more energy - * efficient than running on prev_cpu. This is also the value returned in case - * of abort due to error conditions during the computations. - * A value greater than zero means that the first energy-efficient CPU is the - * one represented by eenv->cpu[eenv->next_idx].cpu_id. + * efficient than running on prev_cpu. A value greater than zero means that + * the first energy-efficient CPU is the one represented by + * eenv->cpu[eenv->next_idx].cpu_id. + * + * The function returns negative value in case of abort due to error + * conditions during the computations, success returns 0. */ static inline int select_energy_cpu_idx(struct energy_env *eenv) { @@ -5871,8 +5870,10 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv)
sd_cpu = eenv->cpu[EAS_CPU_PRV].cpu_id; sd = rcu_dereference(per_cpu(sd_ea, sd_cpu)); - if (!sd) - return EAS_CPU_PRV; + if (!sd) { + eenv->next_idx = EAS_CPU_PRV; + return -1; + }
sg = sd->groups; do { @@ -5882,8 +5883,10 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv)
eenv->sg_top = sg; /* energy is unscaled to reduce rounding errors */ - if (compute_energy(eenv) == -EINVAL) - return EAS_CPU_PRV; + if (compute_energy(eenv) == -EINVAL) { + eenv->next_idx = EAS_CPU_PRV; + return -EINVAL; + }
} while (sg = sg->next, sg != sd->groups);
@@ -5929,7 +5932,7 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv) } }
- return eenv->next_idx; + return 0; }
/* @@ -7145,17 +7148,18 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync if (backup_cpu >= 0) cpumask_set_cpu(backup_cpu, &eenv.cpus_mask);
+ select_energy_cpu_idx(&eenv); + /* Check if EAS_CPU_NXT is a more energy efficient CPU */ - if (select_energy_cpu_idx(&eenv) != EAS_CPU_PRV) { + if (eenv.next_idx != EAS_CPU_PRV) { schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav); schedstat_inc(this_rq()->eas_stats.secb_nrg_sav); - target_cpu = eenv.cpu[eenv.next_idx].cpu_id; - goto unlock; + } else { + schedstat_inc(p->se.statistics.nr_wakeups_secb_no_nrg_sav); + schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav); }
- schedstat_inc(p->se.statistics.nr_wakeups_secb_no_nrg_sav); - schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav); - target_cpu = prev_cpu; + target_cpu = eenv.cpu[eenv.next_idx].cpu_id; goto unlock; }
-- 1.9.1
select_energy_cpu_brute() can handle the case 'next_cpu == prev_cpu' first, which avoids a large amount of code indentation for the case 'next_cpu != prev_cpu'.
Change-Id: Iac08b5f1e6072805c0bee22680bd0ecab411cdba
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 91 +++++++++++++++++++++++++----------------------------
 1 file changed, 42 insertions(+), 49 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0f0b307..b67999a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7060,6 +7060,8 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync int target_cpu; int backup_cpu; int next_cpu; + int delta = 0; + struct energy_env eenv;
schedstat_inc(p->se.statistics.nr_wakeups_secb_attempts); schedstat_inc(this_rq()->eas_stats.secb_attempts); @@ -7108,63 +7110,54 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync }
target_cpu = prev_cpu; - if (next_cpu != prev_cpu) { - int delta = 0; - struct energy_env eenv = { - .p = p, - .util_delta = task_util(p), - /* Task's previous CPU candidate */ - .cpu[EAS_CPU_PRV] = { - .cpu_id = prev_cpu, - }, - /* Main alternative CPU candidate */ - .cpu[EAS_CPU_NXT] = { - .cpu_id = next_cpu, - }, - /* Backup alternative CPU candidate */ - .cpu[EAS_CPU_BKP] = { - .cpu_id = backup_cpu, - }, - }; + if (next_cpu == prev_cpu) { + schedstat_inc(p->se.statistics.nr_wakeups_secb_count); + schedstat_inc(this_rq()->eas_stats.secb_count); + goto unlock; + } + + eenv.p = p; + eenv.util_delta = task_util(p); + /* Task's previous CPU candidate */ + eenv.cpu[EAS_CPU_PRV].cpu_id = prev_cpu; + /* Main alternative CPU candidate */ + eenv.cpu[EAS_CPU_NXT].cpu_id = next_cpu; + /* Backup alternative CPU candidate */ + eenv.cpu[EAS_CPU_BKP].cpu_id = backup_cpu;
#ifdef CONFIG_SCHED_WALT - if (!walt_disabled && sysctl_sched_use_walt_cpu_util && - p->state == TASK_WAKING) - delta = task_util(p); + if (!walt_disabled && sysctl_sched_use_walt_cpu_util && + p->state == TASK_WAKING) + delta = task_util(p); #endif - /* Not enough spare capacity on previous cpu */ - if (__cpu_overutilized(prev_cpu, delta)) { - schedstat_inc(p->se.statistics.nr_wakeups_secb_insuff_cap); - schedstat_inc(this_rq()->eas_stats.secb_insuff_cap); - target_cpu = next_cpu; - goto unlock; - } - - cpumask_clear(&eenv.cpus_mask); - if (prev_cpu >= 0) - cpumask_set_cpu(prev_cpu, &eenv.cpus_mask); - if (next_cpu >= 0) - cpumask_set_cpu(next_cpu, &eenv.cpus_mask); - if (backup_cpu >= 0) - cpumask_set_cpu(backup_cpu, &eenv.cpus_mask); + /* Not enough spare capacity on previous cpu */ + if (__cpu_overutilized(prev_cpu, delta)) { + schedstat_inc(p->se.statistics.nr_wakeups_secb_insuff_cap); + schedstat_inc(this_rq()->eas_stats.secb_insuff_cap); + target_cpu = next_cpu; + goto unlock; + }
- select_energy_cpu_idx(&eenv); + cpumask_clear(&eenv.cpus_mask); + if (prev_cpu >= 0) + cpumask_set_cpu(prev_cpu, &eenv.cpus_mask); + if (next_cpu >= 0) + cpumask_set_cpu(next_cpu, &eenv.cpus_mask); + if (backup_cpu >= 0) + cpumask_set_cpu(backup_cpu, &eenv.cpus_mask);
- /* Check if EAS_CPU_NXT is a more energy efficient CPU */ - if (eenv.next_idx != EAS_CPU_PRV) { - schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav); - schedstat_inc(this_rq()->eas_stats.secb_nrg_sav); - } else { - schedstat_inc(p->se.statistics.nr_wakeups_secb_no_nrg_sav); - schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav); - } + select_energy_cpu_idx(&eenv);
- target_cpu = eenv.cpu[eenv.next_idx].cpu_id; - goto unlock; + /* Check if EAS_CPU_NXT is a more energy efficient CPU */ + if (eenv.next_idx != EAS_CPU_PRV) { + schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav); + schedstat_inc(this_rq()->eas_stats.secb_nrg_sav); + } else { + schedstat_inc(p->se.statistics.nr_wakeups_secb_no_nrg_sav); + schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav); }
- schedstat_inc(p->se.statistics.nr_wakeups_secb_count); - schedstat_inc(this_rq()->eas_stats.secb_count); + target_cpu = eenv.cpu[eenv.next_idx].cpu_id;
unlock: rcu_read_unlock(); -- 1.9.1
Fix a comment to substitute 'energy_env::p' for 'energy_env::task', since the field 'task' has been renamed to 'p'.
Change-Id: I6edded9eec9e7ee7710310e2aefd3c88d94606a9
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b67999a..6c0b918 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5495,7 +5495,7 @@ struct energy_env {
/* * Index (into energy_env::cpu) of the morst energy efficient CPU for - * the specified energy_env::task + * the specified energy_env::p */ int next_idx;
-- 1.9.1
This commit changes the energy environment structure to per-CPU data, moving it off the kernel stack; this avoids potential stack overflow when more items are added to the energy environment structure in later optimizations.
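For illustration, the resulting access pattern is sketched below; it assumes the task cannot migrate to another CPU while the per-CPU buffer is in use:

	/* Sketch: fetch this CPU's pre-allocated energy_env instead of
	 * declaring a large struct energy_env on the kernel stack.
	 */
	int cpu = smp_processor_id();
	struct energy_env *eenv = rcu_dereference(per_cpu(energy_env, cpu));

	eenv->p = p;
	eenv->util_delta = task_util(p);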
Change-Id: Ifdcb1bbae36cb8e074b31ec35f00c79f6a14b973
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 46 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6c0b918..9997a6a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6690,7 +6690,36 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return target; } - + +static struct kmem_cache *energy_env_cache __read_mostly; +static DEFINE_PER_CPU(struct energy_env *, energy_env); + +static void init_energy_env(void) +{ + struct energy_env *eenv; + int i; + + energy_env_cache = KMEM_CACHE(energy_env, 0); + + for_each_possible_cpu(i) { + eenv = kmem_cache_alloc(energy_env_cache, GFP_KERNEL | __GFP_ZERO); + if (!eenv) + goto err; + + rcu_assign_pointer(per_cpu(energy_env, i), eenv); + } + + return; +err: + for_each_possible_cpu(i) { + eenv = rcu_dereference(per_cpu(energy_env, i)); + if (!eenv) + continue; + + kmem_cache_free(energy_env_cache, eenv); + } +} + /* * cpu_util_wake: Compute cpu utilization with any contributions from * the waking task p removed. check_for_migration() looks for a better CPU of @@ -7061,14 +7090,13 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync int backup_cpu; int next_cpu; int delta = 0; - struct energy_env eenv; + int cpu = smp_processor_id(); + struct energy_env *eenv;
schedstat_inc(p->se.statistics.nr_wakeups_secb_attempts); schedstat_inc(this_rq()->eas_stats.secb_attempts);
if (sysctl_sched_sync_hint_enable && sync) { - int cpu = smp_processor_id(); - if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) { schedstat_inc(p->se.statistics.nr_wakeups_secb_sync); schedstat_inc(this_rq()->eas_stats.secb_sync); @@ -7116,14 +7144,15 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync goto unlock; }
- eenv.p = p; - eenv.util_delta = task_util(p); + eenv = rcu_dereference(per_cpu(energy_env, cpu)); + eenv->p = p; + eenv->util_delta = task_util(p); /* Task's previous CPU candidate */ - eenv.cpu[EAS_CPU_PRV].cpu_id = prev_cpu; + eenv->cpu[EAS_CPU_PRV].cpu_id = prev_cpu; /* Main alternative CPU candidate */ - eenv.cpu[EAS_CPU_NXT].cpu_id = next_cpu; + eenv->cpu[EAS_CPU_NXT].cpu_id = next_cpu; /* Backup alternative CPU candidate */ - eenv.cpu[EAS_CPU_BKP].cpu_id = backup_cpu; + eenv->cpu[EAS_CPU_BKP].cpu_id = backup_cpu;
#ifdef CONFIG_SCHED_WALT if (!walt_disabled && sysctl_sched_use_walt_cpu_util && @@ -7138,18 +7167,18 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync goto unlock; }
- cpumask_clear(&eenv.cpus_mask); + cpumask_clear(&eenv->cpus_mask); if (prev_cpu >= 0) - cpumask_set_cpu(prev_cpu, &eenv.cpus_mask); + cpumask_set_cpu(prev_cpu, &eenv->cpus_mask); if (next_cpu >= 0) - cpumask_set_cpu(next_cpu, &eenv.cpus_mask); + cpumask_set_cpu(next_cpu, &eenv->cpus_mask); if (backup_cpu >= 0) - cpumask_set_cpu(backup_cpu, &eenv.cpus_mask); + cpumask_set_cpu(backup_cpu, &eenv->cpus_mask);
- select_energy_cpu_idx(&eenv); + select_energy_cpu_idx(eenv);
/* Check if EAS_CPU_NXT is a more energy efficient CPU */ - if (eenv.next_idx != EAS_CPU_PRV) { + if (eenv->next_idx != EAS_CPU_PRV) { schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav); schedstat_inc(this_rq()->eas_stats.secb_nrg_sav); } else { @@ -7157,7 +7186,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav); }
- target_cpu = eenv.cpu[eenv.next_idx].cpu_id; + target_cpu = eenv->cpu[eenv->next_idx].cpu_id;
unlock: rcu_read_unlock(); @@ -10975,6 +11004,7 @@ __init void init_sched_fair_class(void) nohz.next_balance = jiffies; zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT); #endif + init_energy_env(); #endif /* SMP */
} -- 1.9.1
The energy calculation has two steps: the first step selects the CPU candidates, and the second step compares the energy of these candidates. Currently the first step uses three candidates (EAS_CPU_PRV, EAS_CPU_NXT and EAS_CPU_BKP), and the energy calculation iterates over these three CPU indexes as well.
Binding the code to these three candidates makes it hard to extend to more CPU candidates; a cpumask structure, on the other hand, is better suited both to tracking candidates and to extension.
This patch uses a cpumask structure to track the candidates and removes the three predefined candidates EAS_CPU_PRV, EAS_CPU_NXT and EAS_CPU_BKP.
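A sketch of the resulting pattern, simplified from the diff below: a candidate is just a bit in the mask, so supporting a fourth candidate is one extra cpumask_set_cpu() call and the evaluation loop stays unchanged.

	cpumask_clear(&eenv->cpus_mask);
	if (prev_cpu >= 0)
		cpumask_set_cpu(prev_cpu, &eenv->cpus_mask);
	if (next_cpu >= 0)
		cpumask_set_cpu(next_cpu, &eenv->cpus_mask);
	if (backup_cpu >= 0)
		cpumask_set_cpu(backup_cpu, &eenv->cpus_mask);

	for_each_cpu(cpu, &eenv->cpus_mask) {
		/* evaluate the energy impact of placing the task on 'cpu' */
	}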
Change-Id: I0fb1a904b6024ebfb3c26daa1a00dc8132eebb3a
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 151 ++++++++++++++++++++--------------------------------
 1 file changed, 57 insertions(+), 94 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 9997a6a..f4206c1 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5444,20 +5444,6 @@ static inline bool energy_aware(void) }
/* - * CPU candidates. - * - * These are labels to reference CPU candidates for an energy_diff. - * Currently we support only two possible candidates: the task's previous CPU - * and another candiate CPU. - * More advanced/aggressive EAS selection policies can consider more - * candidates. - */ -#define EAS_CPU_PRV 0 -#define EAS_CPU_NXT 1 -#define EAS_CPU_BKP 2 -#define EAS_CPU_CNT 3 - -/* * energy_diff - supports the computation of the estimated energy impact in * moving a "task"'s "util_delta" between different CPU candidates. */ @@ -5471,10 +5457,6 @@ struct energy_env {
/* CPU candidates to evaluate */ struct { - - /* CPU ID, must be in cpus_mask */ - int cpu_id; - /* * Index (into sched_group_energy::cap_states) of the OPP the * CPU needs to run at if the task is placed on it. @@ -5488,16 +5470,16 @@ struct energy_env { /* Estimated system energy */ unsigned int energy;
- /* Estimated energy variation wrt EAS_CPU_PRV */ + /* Estimated energy variation wrt previous CPU */ int nrg_delta;
- } cpu[EAS_CPU_CNT]; + } cpu[NR_CPUS];
- /* - * Index (into energy_env::cpu) of the morst energy efficient CPU for - * the specified energy_env::p - */ - int next_idx; + /* The morst energy efficient CPU for the specified energy_env::p */ + int next_cpu; + + /* Previous CPU */ + int prev_cpu;
/* Support data */ struct sched_group *sg_top; @@ -5530,7 +5512,7 @@ static unsigned long __cpu_norm_util(unsigned long util, unsigned long capacity) return (util << SCHED_CAPACITY_SHIFT)/capacity; }
-static unsigned long group_max_util(struct energy_env *eenv, int cpu_idx) +static unsigned long group_max_util(struct energy_env *eenv, int cpu_id) { unsigned long max_util = 0; unsigned long util; @@ -5545,7 +5527,7 @@ static unsigned long group_max_util(struct energy_env *eenv, int cpu_idx) * then we should add the (estimated) utilization of the task * assuming we will wake it up on that CPU. */ - if (unlikely(cpu == eenv->cpu[cpu_idx].cpu_id)) + if (unlikely(cpu == cpu_id)) util += eenv->util_delta;
max_util = max(max_util, util); @@ -5575,9 +5557,9 @@ static unsigned long group_max_util(struct energy_env *eenv, int cpu_idx) * estimate (more busy). */ static unsigned -long group_norm_util(struct energy_env *eenv, int cpu_idx) +long group_norm_util(struct energy_env *eenv, int cpu_id) { - unsigned long capacity = eenv->cpu[cpu_idx].cap; + unsigned long capacity = eenv->cpu[cpu_id].cap; unsigned long util, util_sum = 0; int cpu;
@@ -5590,7 +5572,7 @@ long group_norm_util(struct energy_env *eenv, int cpu_idx) * then we should add the (estimated) utilization of the task * assuming we will wake it up on that CPU. */ - if (unlikely(cpu == eenv->cpu[cpu_idx].cpu_id)) + if (unlikely(cpu == cpu_id)) util += eenv->util_delta;
util_sum += __cpu_norm_util(util, capacity); @@ -5599,29 +5581,29 @@ long group_norm_util(struct energy_env *eenv, int cpu_idx) return min_t(unsigned long, util_sum, SCHED_CAPACITY_SCALE); }
-static int find_new_capacity(struct energy_env *eenv, int cpu_idx) +static int find_new_capacity(struct energy_env *eenv, int cpu_id) { const struct sched_group_energy *sge = eenv->sg->sge; int idx, max_idx = sge->nr_cap_states - 1; - unsigned long util = group_max_util(eenv, cpu_idx); + unsigned long util = group_max_util(eenv, cpu_id);
/* default is max_cap if we don't find a match */ - eenv->cpu[cpu_idx].cap_idx = max_idx; - eenv->cpu[cpu_idx].cap = sge->cap_states[max_idx].cap; + eenv->cpu[cpu_id].cap_idx = max_idx; + eenv->cpu[cpu_id].cap = sge->cap_states[max_idx].cap;
for (idx = 0; idx < sge->nr_cap_states; idx++) { if (sge->cap_states[idx].cap >= util) { /* Keep track of SG's capacity */ - eenv->cpu[cpu_idx].cap_idx = idx; - eenv->cpu[cpu_idx].cap = sge->cap_states[idx].cap; + eenv->cpu[cpu_id].cap_idx = idx; + eenv->cpu[cpu_id].cap = sge->cap_states[idx].cap; break; } }
- return eenv->cpu[cpu_idx].cap_idx; + return eenv->cpu[cpu_id].cap_idx; }
-static int group_idle_state(struct energy_env *eenv, int cpu_idx) +static int group_idle_state(struct energy_env *eenv, int cpu_id) { struct sched_group *sg = eenv->sg; int i, state = INT_MAX; @@ -5635,10 +5617,8 @@ static int group_idle_state(struct energy_env *eenv, int cpu_idx) /* Take non-cpuidle idling into account (active idle/arch_cpu_idle()) */ state++;
- src_in_grp = cpumask_test_cpu(eenv->cpu[EAS_CPU_PRV].cpu_id, - sched_group_cpus(sg)); - dst_in_grp = cpumask_test_cpu(eenv->cpu[cpu_idx].cpu_id, - sched_group_cpus(sg)); + src_in_grp = cpumask_test_cpu(eenv->prev_cpu, sched_group_cpus(sg)); + dst_in_grp = cpumask_test_cpu(cpu_id, sched_group_cpus(sg)); if (src_in_grp == dst_in_grp) { /* both CPUs under consideration are in the same group or not in * either group, migration should leave idle state the same. @@ -5652,7 +5632,7 @@ static int group_idle_state(struct energy_env *eenv, int cpu_idx) */ for_each_cpu(i, sched_group_cpus(sg)) { grp_util += cpu_util_wake(i, eenv->p); - if (unlikely(i == eenv->cpu[cpu_idx].cpu_id)) + if (unlikely(i == cpu_id)) grp_util += eenv->util_delta; }
@@ -5707,39 +5687,36 @@ static void calc_sg_energy(struct energy_env *eenv) unsigned long sg_util; int cap_idx, idle_idx; int total_energy = 0; - int cpu_idx; - - for (cpu_idx = EAS_CPU_PRV; cpu_idx < EAS_CPU_CNT; ++cpu_idx) { + int cpu;
+ for_each_cpu(cpu, &eenv->cpus_mask) {
- if (eenv->cpu[cpu_idx].cpu_id == -1) - continue; /* Compute ACTIVE energy */ - cap_idx = find_new_capacity(eenv, cpu_idx); + cap_idx = find_new_capacity(eenv, cpu); busy_power = sg->sge->cap_states[cap_idx].power; /* * in order to calculate cpu_norm_util, we need to know which * capacity level the group will be at, so calculate that first */ - sg_util = group_norm_util(eenv, cpu_idx); + sg_util = group_norm_util(eenv, cpu);
busy_energy = sg_util * busy_power;
/* Compute IDLE energy */ - idle_idx = group_idle_state(eenv, cpu_idx); + idle_idx = group_idle_state(eenv, cpu); idle_power = sg->sge->idle_states[idle_idx].power;
idle_energy = SCHED_CAPACITY_SCALE - sg_util; idle_energy *= idle_power;
total_energy = busy_energy + idle_energy; - eenv->cpu[cpu_idx].energy += total_energy; + eenv->cpu[cpu].energy += total_energy; } }
/* * compute_energy() computes the absolute variation in energy consumption by - * moving eenv.util_delta from EAS_CPU_PRV to EAS_CPU_NXT. + * moving eenv::util_delta from eenv::prev_cpu to eenv::next_cpu. * * NOTE: compute_energy() may fail when racing with sched_domain updates, in * which case we abort by returning -EINVAL. @@ -5850,12 +5827,8 @@ static inline bool cpu_in_sg(struct sched_group *sg, int cpu) * select_energy_cpu_idx(): estimate the energy impact of changing the * utilization distribution. * - * eenv::next_idx returns the index of a CPU candidate specified by the - * energy_env which corresponds to the first CPU saving energy. - * Thus, 0 (EAS_CPU_PRV) means that non of the CPU candidate is more energy - * efficient than running on prev_cpu. A value greater than zero means that - * the first energy-efficient CPU is the one represented by - * eenv->cpu[eenv->next_idx].cpu_id. + * eenv::next_cpu returns a CPU candidate specified by the energy_env which + * corresponds to the first energy-efficient CPU. * * The function returns negative value in case of abort due to error * conditions during the computations, success returns 0. @@ -5864,14 +5837,13 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv) { struct sched_domain *sd; struct sched_group *sg; - int sd_cpu = -1; - int cpu_idx; + int cpu; int margin;
- sd_cpu = eenv->cpu[EAS_CPU_PRV].cpu_id; - sd = rcu_dereference(per_cpu(sd_ea, sd_cpu)); + cpu = cpumask_first(&eenv->cpus_mask); + sd = rcu_dereference(per_cpu(sd_ea, cpu)); if (!sd) { - eenv->next_idx = EAS_CPU_PRV; + eenv->next_cpu = eenv->prev_cpu; return -1; }
@@ -5884,49 +5856,46 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv) eenv->sg_top = sg; /* energy is unscaled to reduce rounding errors */ if (compute_energy(eenv) == -EINVAL) { - eenv->next_idx = EAS_CPU_PRV; + eenv->next_cpu = eenv->prev_cpu; return -EINVAL; }
} while (sg = sg->next, sg != sd->groups);
/* Scale energy before comparisons */ - for (cpu_idx = EAS_CPU_PRV; cpu_idx < EAS_CPU_CNT; ++cpu_idx) - eenv->cpu[cpu_idx].energy >>= SCHED_CAPACITY_SHIFT; + for_each_cpu(cpu, &eenv->cpus_mask) + eenv->cpu[cpu].energy >>= SCHED_CAPACITY_SHIFT;
/* * Compute the dead-zone margin used to prevent too many task * migrations with negligible energy savings. * An energy saving is considered meaningful if it reduces the energy - * consumption of EAS_CPU_PRV CPU candidate by at least ~1.56% + * consumption of previous CPU candidate by at least ~1.56% */ - margin = eenv->cpu[EAS_CPU_PRV].energy >> 6; + margin = eenv->cpu[eenv->prev_cpu].energy >> 6;
/* - * By default the EAS_CPU_PRV CPU is considered the most energy + * By default the previous CPU is considered the most energy * efficient, with a 0 energy variation. */ - eenv->next_idx = EAS_CPU_PRV; + eenv->next_cpu = eenv->prev_cpu;
/* * Compare the other CPU candidates to find a CPU which can be - * more energy efficient then EAS_CPU_PRV + * more energy efficient than previous CPU */ - for (cpu_idx = EAS_CPU_NXT; cpu_idx < EAS_CPU_CNT; ++cpu_idx) { - /* Skip not valid scheduled candidates */ - if (eenv->cpu[cpu_idx].cpu_id < 0) - continue; - /* Compute energy delta wrt EAS_CPU_PRV */ - eenv->cpu[cpu_idx].nrg_delta = - eenv->cpu[cpu_idx].energy - - eenv->cpu[EAS_CPU_PRV].energy; + for_each_cpu(cpu, &eenv->cpus_mask) { + + /* Compute energy delta wrt previous CPU */ + eenv->cpu[cpu].nrg_delta = + eenv->cpu[cpu].energy - eenv->cpu[eenv->prev_cpu].energy; /* filter energy variations within the dead-zone margin */ - if (abs(eenv->cpu[cpu_idx].nrg_delta) < margin) - eenv->cpu[cpu_idx].nrg_delta = 0; + if (abs(eenv->cpu[cpu].nrg_delta) < margin) + eenv->cpu[cpu].nrg_delta = 0; /* update the schedule candidate with min(nrg_delta) */ - if (eenv->cpu[cpu_idx].nrg_delta < - eenv->cpu[eenv->next_idx].nrg_delta) { - eenv->next_idx = cpu_idx; + if (eenv->cpu[cpu].nrg_delta < + eenv->cpu[eenv->next_cpu].nrg_delta) { + eenv->next_cpu = cpu; if (sched_feat(FBT_STRICT_ORDER)) break; } @@ -7147,12 +7116,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync eenv = rcu_dereference(per_cpu(energy_env, cpu)); eenv->p = p; eenv->util_delta = task_util(p); - /* Task's previous CPU candidate */ - eenv->cpu[EAS_CPU_PRV].cpu_id = prev_cpu; - /* Main alternative CPU candidate */ - eenv->cpu[EAS_CPU_NXT].cpu_id = next_cpu; - /* Backup alternative CPU candidate */ - eenv->cpu[EAS_CPU_BKP].cpu_id = backup_cpu;
#ifdef CONFIG_SCHED_WALT if (!walt_disabled && sysctl_sched_use_walt_cpu_util && @@ -7177,8 +7140,8 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
select_energy_cpu_idx(eenv);
- /* Check if EAS_CPU_NXT is a more energy efficient CPU */ - if (eenv->next_idx != EAS_CPU_PRV) { + /* Check if eenv::next_cpu is a more energy efficient CPU */ + if (eenv->next_cpu != eenv->prev_cpu) { schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav); schedstat_inc(this_rq()->eas_stats.secb_nrg_sav); } else { @@ -7186,7 +7149,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav); }
- target_cpu = eenv->cpu[eenv->next_idx].cpu_id; + target_cpu = eenv->next_cpu;
unlock: rcu_read_unlock(); -- 1.9.1
compute_energy() traverses the scheduling groups and calc_sg_energy() iterates over every candidate CPU when computing a scheduling group's energy. From a performance point of view this is good: the energy of all candidates can be calculated with a single invocation of these two functions. But since each function contains its own loop (one iterating scheduling groups, the other iterating candidate CPUs), the code is hard to read and its logic is complex; furthermore, the two functions are so tightly coupled that any optimization is difficult.
This patch lifts the CPU iteration out of calc_sg_energy() and moves the candidate iteration into select_energy_cpu_idx(). With this change the functions gain simple semantics: compute_energy()/calc_sg_energy() calculate the energy for one single CPU only. compute_energy() is now called 3 times instead of once, but calc_sg_energy(), which contributes the bulk of the computation, does the same amount of work with and without this patch.
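The restructuring can be summarized with the call-graph sketch below (illustrative only):

	/* Before this patch:
	 *   compute_energy()            - iterates sched groups
	 *     calc_sg_energy()          - iterates all candidate CPUs
	 *
	 * After this patch:
	 *   select_energy_cpu_idx()     - iterates candidate CPUs
	 *     compute_energy(cpu)       - iterates sched groups
	 *       calc_sg_energy(cpu)     - computes a single candidate
	 */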
Change-Id: Ic6d1843a3502baf6939eec1d36067cbfc11af817
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 68 ++++++++++++++++++++++++++---------------------------
 1 file changed, 33 insertions(+), 35 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f4206c1..618c80d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5678,7 +5678,7 @@ static int group_idle_state(struct energy_env *eenv, int cpu_id) * The required scaling will be performed just one time, by the calling * functions, once we accumulated the contributons for all the SGs. */ -static void calc_sg_energy(struct energy_env *eenv) +static void calc_sg_energy(struct energy_env *eenv, int cpu) { struct sched_group *sg = eenv->sg; int busy_energy, idle_energy; @@ -5687,31 +5687,27 @@ static void calc_sg_energy(struct energy_env *eenv) unsigned long sg_util; int cap_idx, idle_idx; int total_energy = 0; - int cpu; - - for_each_cpu(cpu, &eenv->cpus_mask) {
- /* Compute ACTIVE energy */ - cap_idx = find_new_capacity(eenv, cpu); - busy_power = sg->sge->cap_states[cap_idx].power; - /* - * in order to calculate cpu_norm_util, we need to know which - * capacity level the group will be at, so calculate that first - */ - sg_util = group_norm_util(eenv, cpu); + /* Compute ACTIVE energy */ + cap_idx = find_new_capacity(eenv, cpu); + busy_power = sg->sge->cap_states[cap_idx].power; + /* + * in order to calculate cpu_norm_util, we need to know which + * capacity level the group will be at, so calculate that first + */ + sg_util = group_norm_util(eenv, cpu);
- busy_energy = sg_util * busy_power; + busy_energy = sg_util * busy_power;
- /* Compute IDLE energy */ - idle_idx = group_idle_state(eenv, cpu); - idle_power = sg->sge->idle_states[idle_idx].power; + /* Compute IDLE energy */ + idle_idx = group_idle_state(eenv, cpu); + idle_power = sg->sge->idle_states[idle_idx].power;
- idle_energy = SCHED_CAPACITY_SCALE - sg_util; - idle_energy *= idle_power; + idle_energy = SCHED_CAPACITY_SCALE - sg_util; + idle_energy *= idle_power;
- total_energy = busy_energy + idle_energy; - eenv->cpu[cpu].energy += total_energy; - } + total_energy = busy_energy + idle_energy; + eenv->cpu[cpu].energy += total_energy; }
/* @@ -5721,7 +5717,7 @@ static void calc_sg_energy(struct energy_env *eenv) * NOTE: compute_energy() may fail when racing with sched_domain updates, in * which case we abort by returning -EINVAL. */ -static int compute_energy(struct energy_env *eenv) +static int compute_energy(struct energy_env *eenv, int candidate) { struct cpumask visit_cpus; int cpu_count; @@ -5773,7 +5769,7 @@ static int compute_energy(struct energy_env *eenv) * CPUs in the current visited SG. */ eenv->sg = sg; - calc_sg_energy(eenv); + calc_sg_energy(eenv, candidate);
if (!sd->child) { /* @@ -5847,20 +5843,22 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv) return -1; }
- sg = sd->groups; - do { - /* Skip SGs which do not contains a candidate CPU */ - if (!cpumask_intersects(&eenv->cpus_mask, sched_group_cpus(sg))) - continue; + for_each_cpu(cpu, &eenv->cpus_mask) { + sg = sd->groups; + do { + /* Skip SGs which do not contains a candidate CPU */ + if (!cpumask_intersects(&eenv->cpus_mask, sched_group_cpus(sg))) + continue;
- eenv->sg_top = sg; - /* energy is unscaled to reduce rounding errors */ - if (compute_energy(eenv) == -EINVAL) { - eenv->next_cpu = eenv->prev_cpu; - return -EINVAL; - } + eenv->sg_top = sg; + /* energy is unscaled to reduce rounding errors */ + if (compute_energy(eenv, cpu) == -EINVAL) { + eenv->next_cpu = eenv->prev_cpu; + return -EINVAL; + }
- } while (sg = sg->next, sg != sd->groups); + } while (sg = sg->next, sg != sd->groups); + }
/* Scale energy before comparisons */ for_each_cpu(cpu, &eenv->cpus_mask) -- 1.9.1
This patch is a minor refactoring that introduces the new function compute_task_energy(), which calculates the energy for the woken task on a specific CPU.
Change-Id: Iebcec22338fdea1f480d171a60056349d9a01259
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 58 ++++++++++++++++++++++++++++++++---------------------
 1 file changed, 35 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 618c80d..213efaf 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5814,6 +5814,37 @@ static int compute_energy(struct energy_env *eenv, int candidate) return 0; }
+/* + * compute_task_energy(): Computes the energy consumption for + * waken task on one specified candidate CPU. + */ +static int compute_task_energy(struct energy_env *eenv, int cpu) +{ + struct sched_domain *sd; + struct sched_group *sg; + + sd = rcu_dereference(per_cpu(sd_ea, cpu)); + if (!sd) + return -1; /* Error */ + + sg = sd->groups; + do { + /* Skip SGs which do not contains a candidate CPU */ + if (!cpumask_intersects(&eenv->cpus_mask, sched_group_cpus(sg))) + continue; + + eenv->sg_top = sg; + /* energy is unscaled to reduce rounding errors */ + if (compute_energy(eenv, cpu) == -EINVAL) { + eenv->next_cpu = eenv->prev_cpu; + return -EINVAL; + } + + } while (sg = sg->next, sg != sd->groups); + + return 0; +} + static inline bool cpu_in_sg(struct sched_group *sg, int cpu) { return cpu != -1 && cpumask_test_cpu(cpu, sched_group_cpus(sg)); @@ -5831,33 +5862,14 @@ static inline bool cpu_in_sg(struct sched_group *sg, int cpu) */ static inline int select_energy_cpu_idx(struct energy_env *eenv) { - struct sched_domain *sd; - struct sched_group *sg; int cpu; int margin;
- cpu = cpumask_first(&eenv->cpus_mask); - sd = rcu_dereference(per_cpu(sd_ea, cpu)); - if (!sd) { - eenv->next_cpu = eenv->prev_cpu; - return -1; - } - for_each_cpu(cpu, &eenv->cpus_mask) { - sg = sd->groups; - do { - /* Skip SGs which do not contains a candidate CPU */ - if (!cpumask_intersects(&eenv->cpus_mask, sched_group_cpus(sg))) - continue; - - eenv->sg_top = sg; - /* energy is unscaled to reduce rounding errors */ - if (compute_energy(eenv, cpu) == -EINVAL) { - eenv->next_cpu = eenv->prev_cpu; - return -EINVAL; - } - - } while (sg = sg->next, sg != sd->groups); + if (compute_task_energy(eenv, cpu) < 0) { + eenv->next_cpu = eenv->prev_cpu; + return -EINVAL; + } }
/* Scale energy before comparisons */ -- 1.9.1
eenv::sg_cap indicates which sched_group's CPUs are bound to the same clock domain; as a result the final capacity needs to consider all CPUs in eenv::sg_cap. Currently eenv::sg_cap is set in the middle of the flow for every sched_group energy calculation, but this is not necessary: its value can be prepared ahead of the energy computation.
This patch moves the eenv::sg_cap setup from compute_energy() to compute_task_energy(), so it is prepared ahead of the energy computation; this avoids setting it repeatedly.
Change-Id: I9e231fdc26bbf61aa09894c9ea0532b3aa2f3f6b
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 213efaf..d7bb3a8 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5738,20 +5738,9 @@ static int compute_energy(struct energy_env *eenv, int candidate) cpu_count = cpumask_weight(&visit_cpus);
while (!cpumask_empty(&visit_cpus)) { - struct sched_group *sg_shared_cap = NULL; int cpu = cpumask_first(&visit_cpus); struct sched_domain *sd;
- /* - * Is the group utilization affected by cpus outside this - * sched_group? - * This sd may have groups with cpus which were not present - * when we took visit_cpus. - */ - sd = rcu_dereference(per_cpu(sd_scs, cpu)); - if (sd && sd->parent) - sg_shared_cap = sd->parent->groups; - for_each_domain(cpu, sd) { struct sched_group *sg = sd->groups;
@@ -5760,10 +5749,6 @@ static int compute_energy(struct energy_env *eenv, int candidate) break;
do { - eenv->sg_cap = sg; - if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight) - eenv->sg_cap = sg_shared_cap; - /* * Compute the energy for all the candidate * CPUs in the current visited SG. @@ -5820,8 +5805,9 @@ static int compute_energy(struct energy_env *eenv, int candidate) */ static int compute_task_energy(struct energy_env *eenv, int cpu) { - struct sched_domain *sd; + struct sched_domain *sd, *sd_cap; struct sched_group *sg; + int first_cpu;
sd = rcu_dereference(per_cpu(sd_ea, cpu)); if (!sd) @@ -5834,6 +5820,20 @@ static int compute_task_energy(struct energy_env *eenv, int cpu) continue;
eenv->sg_top = sg; + + first_cpu = cpumask_first(sched_group_cpus(sg)); + + /* + * The CPU capacity sharing attribution is decided by hardhware + * design so we can decide the sg_cp value at the beginning + * for specific CPU. + */ + sd_cap = rcu_dereference(per_cpu(sd_scs, first_cpu)); + if (sd_cap && sd_cap->parent) + eenv->sg_cap = sd_cap->parent->groups; + else + eenv->sg_cap = sd_cap->groups; + /* energy is unscaled to reduce rounding errors */ if (compute_energy(eenv, cpu) == -EINVAL) { eenv->next_cpu = eenv->prev_cpu; -- 1.9.1
eenv::sg_cap points to the scheduling group whose capacity we care about for the energy computation, so change the capacity index selection to use it.
Change-Id: I26cadd7724ea4902910150129404321ac191a30e
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d7bb3a8..2596e51 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5583,7 +5583,7 @@ long group_norm_util(struct energy_env *eenv, int cpu_id)
static int find_new_capacity(struct energy_env *eenv, int cpu_id) { - const struct sched_group_energy *sge = eenv->sg->sge; + const struct sched_group_energy *sge = eenv->sg_cap->sge; int idx, max_idx = sge->nr_cap_states - 1; unsigned long util = group_max_util(eenv, cpu_id);
-- 1.9.1
Based on the sched_group and the woken task's utilization, we can estimate the capacity index ahead of the energy computation; the capacity index is the same for all CPUs in the group.
This patch moves the capacity index estimation ahead of the energy computation, so the index is estimated only once; the redundant per-CPU estimations are removed from the code.
Change-Id: I634f2fa47d2e762a1b5e570c7f792b67e202a1d2
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 2596e51..e5d5ed5 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5689,7 +5689,7 @@ static void calc_sg_energy(struct energy_env *eenv, int cpu) int total_energy = 0;
/* Compute ACTIVE energy */ - cap_idx = find_new_capacity(eenv, cpu); + cap_idx = eenv->cpu[cpu].cap_idx; busy_power = sg->sge->cap_states[cap_idx].power; /* * in order to calculate cpu_norm_util, we need to know which @@ -5834,6 +5834,8 @@ static int compute_task_energy(struct energy_env *eenv, int cpu) else eenv->sg_cap = sd_cap->groups;
+ find_new_capacity(eenv, cpu); + /* energy is unscaled to reduce rounding errors */ if (compute_energy(eenv, cpu) == -EINVAL) { eenv->next_cpu = eenv->prev_cpu; -- 1.9.1
During the energy calculation we iterate the cpumask bits according to eenv::sg_top; there are two race conditions between the energy calculation flow and the CPU hotplug flow, so in the middle of the calculation any CPU might be hotplugged in or out.
This patch checks two conditions to detect whether a hotplug race has been hit, and bails out with an error value if so. If cpu_count reaches zero before the CPU iteration has finished, some CPU has been concurrently hotplugged into the system: this is the first race condition. Once the calculation has completed, cpu_count is expected to be zero, meaning every CPU has been traversed exactly once; a non-zero value means some CPUs were hotplugged out and were therefore skipped by the calculation: this detects the second race condition. The patch refactors compute_energy() around these two checks and removes redundant variables.
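Condensed from the diff below, the two checks look like this (sketch, with the sched_group walk elided):

	cpu_count = cpumask_weight(sched_group_cpus(eenv->sg_top));

	/* ... inside the walk, at the bottom sched_domain level ... */
	if (!cpu_count)
		return -EINVAL;	/* race 1: a CPU was hotplugged in */
	cpu_count--;

	/* ... after the walk has finished ... */
	if (cpu_count)
		return -EINVAL;	/* race 2: a CPU was hotplugged out */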
Change-Id: I8dadea2229dabb8ddc60597978f724ccaae61311
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 102 +++++++++++++++++++++++-----------------------------
 1 file changed, 44 insertions(+), 58 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index e5d5ed5..49eee75 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5719,83 +5719,69 @@ static void calc_sg_energy(struct energy_env *eenv, int cpu) */ static int compute_energy(struct energy_env *eenv, int candidate) { - struct cpumask visit_cpus; int cpu_count; + int cpu; + struct sched_domain *sd;
WARN_ON(!eenv->sg_top->sge);
- cpumask_copy(&visit_cpus, sched_group_cpus(eenv->sg_top)); /* If a cpu is hotplugged in while we are in this function, - * it does not appear in the existing visit_cpus mask + * it does not appear in the existing eenv::sg_top mask * which came from the sched_group pointer of the * sched_domain pointed at by sd_ea for either the prev * or next cpu and was dereferenced in __energy_diff. - * Since we will dereference sd_scs later as we iterate + * Since we will dereference sd later as we iterate * through the CPUs we expect to visit, new CPUs can - * be present which are not in the visit_cpus mask. + * be present which are not in the eenv::sg_top mask. * Guard this with cpu_count. */ - cpu_count = cpumask_weight(&visit_cpus); + cpu_count = cpumask_weight(sched_group_cpus(eenv->sg_top)); + cpu = cpumask_first(sched_group_cpus(eenv->sg_top));
- while (!cpumask_empty(&visit_cpus)) { - int cpu = cpumask_first(&visit_cpus); - struct sched_domain *sd; + for_each_domain(cpu, sd) { + struct sched_group *sg = sd->groups;
- for_each_domain(cpu, sd) { - struct sched_group *sg = sd->groups; + do { + if (!cpumask_intersects(sched_group_cpus(eenv->sg_top), + sched_group_cpus(sg))) + continue;
- /* Has this sched_domain already been visited? */ - if (sd->child && group_first_cpu(sg) != cpu) - break; + /* + * Compute the energy for all the candidate + * CPUs in the current visited SG. + */ + eenv->sg = sg; + calc_sg_energy(eenv, candidate);
- do { + if (!sd->child) { /* - * Compute the energy for all the candidate - * CPUs in the current visited SG. + * cpu_count here is the number of cpus we + * expect to visit in this calculation. If + * we race against hotplug, we can have extra + * cpus added to the groups we are iterating + * which do not appear in the eenv::sg_top mask. + * In that case we are not able to calculate + * energy without restarting so we will bail + * out and use prev_cpu this time. */ - eenv->sg = sg; - calc_sg_energy(eenv, candidate); - - if (!sd->child) { - /* - * cpu_count here is the number of - * cpus we expect to visit in this - * calculation. If we race against - * hotplug, we can have extra cpus - * added to the groups we are - * iterating which do not appear in - * the visit_cpus mask. In that case - * we are not able to calculate energy - * without restarting so we will bail - * out and use prev_cpu this time. - */ - if (!cpu_count) - return -EINVAL; - cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg)); - cpu_count--; - } - - if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(eenv->sg_top))) - goto next_cpu; - - } while (sg = sg->next, sg != sd->groups); - } - - /* - * If we raced with hotplug and got an sd NULL-pointer; - * returning a wrong energy estimation is better than - * entering an infinite loop. - * Specifically: If a cpu is unplugged after we took - * the visit_cpus mask, it no longer has an sd_scs - * pointer, so when we dereference it, we get NULL. - */ - if (cpumask_test_cpu(cpu, &visit_cpus)) - return -EINVAL; -next_cpu: - cpumask_clear_cpu(cpu, &visit_cpus); - continue; + if (!cpu_count) + return -EINVAL; + cpu_count--; + } + } while (sg = sg->next, sg != sd->groups); }
+ /* + * If we raced with hotplug and got an sd NULL-pointer; + * returning a wrong energy estimation is better than + * entering an infinite loop. + * Specifically: If a cpu is unplugged after we took + * the eenv::sg_top mask, it no longer has an sd pointer, + * so when we dereference it, we get NULL. + */ + if (cpu_count) + return -EINVAL; + return 0; }
-- 1.9.1
Let's first look at an example: a task with small utilization is woken up and we need to calculate the energy for two candidate CPUs. From the hardware design perspective, a candidate CPU cannot decide its OPP by itself, because it shares a clock domain with the other CPUs in the same cluster; in the end we need to compute the energy for all CPUs in that cluster.
Let's use the CPU topology below as the example:
  Cluster_0    Cluster_1
    CPU_0        CPU_4
    CPU_1        CPU_5
    CPU_2        CPU_6
    CPU_3        CPU_7
The current code always calculates the energy for all CPUs in the bound clock domain. If the candidates are CPU_0 and CPU_4, the formula for the energy computation is as below:
E(CPU_x) stands for the energy of CPU_x, the notation E(CPU_x)` means the energy of CPU_x after the task is placed on it, TE(CPU_x) means the total energy of all CPUs when the task is placed on CPU_x, and E_DIFF(CPU_x - CPU_y) means the energy difference between placing the task on CPU_x and on CPU_y, which equals TE(CPU_x) - TE(CPU_y).
  TE(CPU_0) = E(CPU_0)` + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)`
            + E(CPU_4)  + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)
  TE(CPU_4) = E(CPU_0)  + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)
            + E(CPU_4)` + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)`
E_DIFF(CPU_0 - CPU_4) = TE(CPU_0) - TE(CPU_4)
From the formula above it is easy to see that the CPU_1/2/3/5/6/7 energy computations are redundant. On the other hand, if we only consider the energy consumed by the woken task, the energy difference reduces to the target CPU's energy before and after the task is placed on it. The benefit of this method is that it avoids computing the energy of every CPU in the cluster and focuses only on the energy change introduced by the woken task on the target CPU; this patch calls the method 'task oriented computation'. The energy computation can be optimized as:
  TE(CPU_0) = E(CPU_0)` + E(CLS_0)` - E(CPU_0) - E(CLS_0)
  TE(CPU_4) = E(CPU_4)` + E(CLS_1)` - E(CPU_4) - E(CLS_1)
E_DIFF(CPU_0 - CPU_4) = TE(CPU_0) - TE(CPU_4)
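As a worked example with made-up energy values: assume E(CPU_0) = 100, E(CPU_0)` = 150, E(CLS_0) = 40, E(CLS_0)` = 60, E(CPU_4) = 80, E(CPU_4)` = 110, E(CLS_1) = 30 and E(CLS_1)` = 45. Then TE(CPU_0) = 150 + 60 - 100 - 40 = 70 and TE(CPU_4) = 110 + 45 - 80 - 30 = 45, so E_DIFF(CPU_0 - CPU_4) = 25 > 0 and CPU_4 is the more energy-efficient candidate.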
As a result, the number of computation iterations is reduced from 20 to 8 (the old method computes 8 CPU energies plus 2 cluster energies for each of the 2 candidates, i.e. 10 * 2 = 20; the new method computes one CPU energy and one cluster energy, before and after placement, for each candidate, i.e. 4 * 2 = 8); this significantly reduces the calculation overhead.
After switching to task oriented computation, there is one case where the computation may take longer than with the previous method. For instance, when the candidates are CPU_0 and CPU_1 and placing the task on either CPU would raise the OPP, the old method uses the computation below:
  TE(CPU_0) = E(CPU_0)` + E(CPU_1)  + E(CPU_2) + E(CPU_3) + E(CLS_0)
  TE(CPU_1) = E(CPU_0)  + E(CPU_1)` + E(CPU_2) + E(CPU_3) + E(CLS_0)
E_DIFF(CPU_0 - CPU_1) = TE(CPU_0) - TE(CPU_1)
With task oriented computation, because the OPP increase impacts the other CPUs in the same cluster, the energies of all affected CPUs need to be calculated:
  TE(CPU_0) = E(CPU_0)` + E(CPU_1)` + E(CPU_2)` + E(CPU_3)` + E(CLS_0)`
            - E(CPU_0)  - E(CPU_1)  - E(CPU_2)  - E(CPU_3)  - E(CLS_0)
  TE(CPU_1) = E(CPU_0)` + E(CPU_1)` + E(CPU_2)` + E(CPU_3)` + E(CLS_0)`
            - E(CPU_0)  - E(CPU_1)  - E(CPU_2)  - E(CPU_3)  - E(CLS_0)
E_DIFF(CPU_0 - CPU_1) = TE(CPU_0) - TE(CPU_1)

(Although the two expressions look identical, the primed terms differ between them, because the task's utilization is accounted on CPU_0 in the first case and on CPU_1 in the second.)
A more elaborate method could optimize this further, e.g. by extending the eenv structure to cache CPU energy data so it can be reused for both candidates; this is left for a later optimization.
As a side effect, this patch also resolves an inconsistency in the energy calculation: previously the energy was calculated for one cluster in some cases and for multiple clusters in others, so the semantics of the energy data differed between scenarios. Such inconsistency can cause trouble for the PE filter. This patch fixes the issue by always calculating task-based energy.
To achieve the optimization, this patch utilizes the 'eenv->sg_cap' and 'eenv->sg_top' parameters. 'eenv->sg_cap' describes only the CPU capacity sharing attribute, i.e. the clock domain shared among CPUs, and from it we can determine the final OPP selection. 'eenv->sg_top' defines which CPUs we care about: if the frequency does not change after placing the woken task, it is set to the first level scheduling group (a single CPU), so the energy computation is limited to that single CPU; otherwise it falls back to computing all CPUs in the same clock domain.
Change-Id: Ifb64ad77c6173388abf13e255e2ed8e8586a38bc
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 76 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 43 insertions(+), 33 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 49eee75..4811fce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5473,7 +5473,7 @@ struct energy_env { /* Estimated energy variation wrt previous CPU */ int nrg_delta;
- } cpu[NR_CPUS]; + } cpu[NR_CPUS*2];
/* The morst energy efficient CPU for the specified energy_env::p */ int next_cpu; @@ -5791,45 +5791,55 @@ static int compute_energy(struct energy_env *eenv, int candidate) */ static int compute_task_energy(struct energy_env *eenv, int cpu) { - struct sched_domain *sd, *sd_cap; - struct sched_group *sg; - int first_cpu; + struct sched_domain *sd; + unsigned int prev_cap_idx, next_cap_idx; + int cmp_idx, ret;
- sd = rcu_dereference(per_cpu(sd_ea, cpu)); + sd = rcu_dereference(per_cpu(sd_scs, cpu)); if (!sd) return -1; /* Error */
- sg = sd->groups; - do { - /* Skip SGs which do not contains a candidate CPU */ - if (!cpumask_intersects(&eenv->cpus_mask, sched_group_cpus(sg))) - continue; - - eenv->sg_top = sg; - - first_cpu = cpumask_first(sched_group_cpus(sg)); - - /* - * The CPU capacity sharing attribution is decided by hardhware - * design so we can decide the sg_cp value at the beginning - * for specific CPU. - */ - sd_cap = rcu_dereference(per_cpu(sd_scs, first_cpu)); - if (sd_cap && sd_cap->parent) - eenv->sg_cap = sd_cap->parent->groups; - else - eenv->sg_cap = sd_cap->groups; - - find_new_capacity(eenv, cpu); + /* + * The CPU capacity sharing attribution is decided by hardhware + * design so we can decide the sg_cp value at the beginning + * for specific CPU. + */ + if (sd && sd->parent) + eenv->sg_cap = sd->parent->groups; + else + eenv->sg_cap = sd->groups; + + /* Estimate capacity index before task placement */ + cmp_idx = NR_CPUS + cpu; + prev_cap_idx = find_new_capacity(eenv, cmp_idx); + next_cap_idx = find_new_capacity(eenv, cpu); + + /* + * Computation is iteration sched_group from bottom to up level for + * energy accumulation, 'sg_top' is top most sched_group: + * - If the CPU frequency has no change before and after task placed + * onto the target CPU, set 'sg_top' to sched_group for the target + * CPU; this means only to calculate the energy for this single CPU + * and ignore other CPUs in the same clock domain. + * - If found the OPP frequency is changed after task placement then + * need to calculate all CPUs who bound in the same clock domain, + * so set 'sg_top' to shared capacity scheduling group. + */ + if (prev_cap_idx != next_cap_idx) + eenv->sg_top = eenv->sg_cap; + else + eenv->sg_top = sd->groups;
- /* energy is unscaled to reduce rounding errors */ - if (compute_energy(eenv, cpu) == -EINVAL) { - eenv->next_cpu = eenv->prev_cpu; - return -EINVAL; - } + /* energy is unscaled to reduce rounding errors */ + ret = compute_energy(eenv, cmp_idx); + if (ret < 0) + return ret;
- } while (sg = sg->next, sg != sd->groups); + ret = compute_energy(eenv, cpu); + if (ret < 0) + return ret;
+ eenv->cpu[cpu].energy -= eenv->cpu[cmp_idx].energy; return 0; }
-- 1.9.1
This patch is a typical space-for-time optimization: it trades more memory for better performance. eenv::cpu is extended with the layout below:
  +----------------+  ---> eenv::cpu[0]
  |    new nrg     |
  +----------------+  ---> eenv::cpu[NR_CPUS]
  |    nrg(CPU)    |
  +----------------+  ---> eenv::cpu[NR_CPUS*2]
  |  nrg(Cluster)  |
  +----------------+
If the task placement triggers a CPU frequency change, the energy value calculated without the task placement is saved into cpu[NR_CPUS*2..NR_CPUS*3-1]; if another CPU later needs the same value, it can fetch it directly. This avoids duplicating the calculation for CPUs in the same cluster.
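For illustration, a candidate's slots can be addressed as in the sketch below, where 'cpu' is the candidate CPU id and the cluster slot is keyed by the first CPU of the shared-capacity group (matching the cmp_idx computation in the diff):

	/* Index layout of the extended eenv::cpu[] array:
	 *   cpu[cpu]                 - energy with the task placed
	 *   cpu[cpu + NR_CPUS]       - single-CPU energy without the task
	 *   cpu[first + NR_CPUS * 2] - cluster energy without the task,
	 *                              reusable by every candidate in the
	 *                              same cluster
	 */
	int first = cpumask_first(sched_group_cpus(eenv->sg_top));
	int single_idx  = cpu + NR_CPUS;
	int cluster_idx = first + NR_CPUS * 2;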
Change-Id: Ic1fbdc3513a8f90148e3dfabe385f40fdbe609bd
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4811fce..b834fb3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5473,7 +5473,9 @@ struct energy_env {
 		/* Estimated energy variation wrt previous CPU */
 		int nrg_delta;

-	} cpu[NR_CPUS*2];
+		/* Flag for whether this entry has been calculated already */
+		int used;
+	} cpu[NR_CPUS*3];

 	/* The most energy efficient CPU for the specified energy_env::p */
 	int next_cpu;
@@ -5725,6 +5727,9 @@ static int compute_energy(struct energy_env *eenv, int candidate)

 	WARN_ON(!eenv->sg_top->sge);

+	if (eenv->cpu[candidate].used)
+		return 0;
+
 	/* If a cpu is hotplugged in while we are in this function,
 	 * it does not appear in the existing eenv::sg_top mask
 	 * which came from the sched_group pointer of the
@@ -5782,6 +5787,7 @@ static int compute_energy(struct energy_env *eenv, int candidate)
 	if (cpu_count)
 		return -EINVAL;

+	eenv->cpu[candidate].used = 1;
 	return 0;
 }

@@ -5810,8 +5816,7 @@ static int compute_task_energy(struct energy_env *eenv, int cpu)
 		eenv->sg_cap = sd->groups;

 	/* Estimate capacity index before task placement */
-	cmp_idx = NR_CPUS + cpu;
-	prev_cap_idx = find_new_capacity(eenv, cmp_idx);
+	prev_cap_idx = find_new_capacity(eenv, cpu + NR_CPUS);
 	next_cap_idx = find_new_capacity(eenv, cpu);

 	/*
@@ -5825,10 +5830,25 @@ static int compute_task_energy(struct energy_env *eenv, int cpu)
 	 * need to calculate all CPUs bound to the same clock domain,
 	 * so set 'sg_top' to the shared-capacity scheduling group.
 	 */
-	if (prev_cap_idx != next_cap_idx)
+	if (prev_cap_idx != next_cap_idx) {
 		eenv->sg_top = eenv->sg_cap;
-	else
+		/*
+		 * eenv::cpu[(NR_CPUS*2)..(NR_CPUS*3-1)] is used for the
+		 * cluster-level energy calculation; this value can also
+		 * be used by other CPUs in the same cluster to calculate
+		 * the difference.
+		 */
+		cmp_idx = cpumask_first(sched_group_cpus(eenv->sg_top));
+		cmp_idx += NR_CPUS * 2;
+		find_new_capacity(eenv, cmp_idx);
+	} else {
 		eenv->sg_top = sd->groups;
+		/*
+		 * eenv::cpu[NR_CPUS..(NR_CPUS*2-1)] is used for the
+		 * single-CPU energy calculation.
+		 */
+		cmp_idx = cpu + NR_CPUS;
+	}

 	/* energy is unscaled to reduce rounding errors */
 	ret = compute_energy(eenv, cmp_idx);
@@ -7146,6 +7166,9 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)
 	if (backup_cpu >= 0)
 		cpumask_set_cpu(backup_cpu, &eenv->cpus_mask);

+	/* Clear energy calculation data */
+	memset(eenv->cpu, 0, sizeof(eenv->cpu));
+
 	select_energy_cpu_idx(eenv);

 	/* Check if eenv::next_cpu is a more energy efficient CPU */
--
1.9.1
When calculating the energy difference before and after placing a task on a specific CPU, the idle state selection is sometimes based on the idle state index and sometimes estimated from utilization. As a result, different idle states can be selected even for the same CPU, and the idle state estimation code is entirely different from the current CPU idle governor.

To avoid introducing a large deviation into the energy calculation, let's remove the idle state estimation.
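For reference, a minimal sketch of what group_idle_state() reduces to after this change, assuming the surrounding eas-dev helpers (sched_group_cpus(), idle_get_state_idx(), cpu_rq()):

	static int group_idle_state(struct energy_env *eenv, int cpu_id)
	{
		struct sched_group *sg = eenv->sg;
		int i, state = INT_MAX;

		/* Find the shallowest idle state in the sched group. */
		for_each_cpu(i, sched_group_cpus(sg))
			state = min(state, idle_get_state_idx(cpu_rq(i)));

		/* Take non-cpuidle idling into account (active idle/arch_cpu_idle()) */
		state++;

		return state;
	}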
Change-Id: I4de0df9ce8f4e425206bd14de7c8069b488c4483
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 49 -------------------------------------------------
 1 file changed, 49 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b834fb3..4e43c9a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5609,8 +5609,6 @@ static int group_idle_state(struct energy_env *eenv, int cpu_id)
 {
 	struct sched_group *sg = eenv->sg;
 	int i, state = INT_MAX;
-	int src_in_grp, dst_in_grp;
-	long grp_util = 0;

 	/* Find the shallowest idle state in the sched group. */
 	for_each_cpu(i, sched_group_cpus(sg))
@@ -5619,53 +5617,6 @@ static int group_idle_state(struct energy_env *eenv, int cpu_id)
 	/* Take non-cpuidle idling into account (active idle/arch_cpu_idle()) */
 	state++;

-	src_in_grp = cpumask_test_cpu(eenv->prev_cpu, sched_group_cpus(sg));
-	dst_in_grp = cpumask_test_cpu(cpu_id, sched_group_cpus(sg));
-	if (src_in_grp == dst_in_grp) {
-		/* both CPUs under consideration are in the same group or not in
-		 * either group, migration should leave idle state the same.
-		 */
-		goto end;
-	}
-
-	/*
-	 * Try to estimate if a deeper idle state is
-	 * achievable when we move the task.
-	 */
-	for_each_cpu(i, sched_group_cpus(sg)) {
-		grp_util += cpu_util_wake(i, eenv->p);
-		if (unlikely(i == cpu_id))
-			grp_util += eenv->util_delta;
-	}
-
-	if (grp_util <=
-	    ((long)sg->sgc->max_capacity * (int)sg->group_weight)) {
-		/* after moving, this group is at most partly
-		 * occupied, so it should have some idle time.
-		 */
-		int max_idle_state_idx = sg->sge->nr_idle_states - 2;
-		int new_state = grp_util * max_idle_state_idx;
-
-		if (grp_util <= 0)
-			/* group will have no util, use lowest state */
-			new_state = max_idle_state_idx + 1;
-		else {
-			/* for partially idle, linearly map util to idle
-			 * states, excluding the lowest one. This does not
-			 * correspond to the state we expect to enter in
-			 * reality, but an indication of what might happen.
-			 */
-			new_state = min(max_idle_state_idx, (int)
-					(new_state / sg->sgc->max_capacity));
-			new_state = max_idle_state_idx - new_state;
-		}
-		state = new_state;
-	} else {
-		/* After moving, the group will be fully occupied
-		 * so assume it will not be idle at all.
-		 */
-		state = 0;
-	}
-end:
 	return state;
 }
-- 1.9.1
On Sat, Mar 17, 2018 at 08:05:15PM +0800, Leo Yan wrote:
select_energy_cpu_idx() returns the target CPU index, and eenv->next_idx is used to carry the target CPU index as well; it is pointless to use two methods to return exactly the same value.

This patch changes the definition of select_energy_cpu_idx()'s return value: errors return a negative value and success returns 0. The target CPU index is then taken from eenv->next_idx.
Yes, there is a duplication of information. Actually, the change could go the other way: make select_energy_cpu_idx() return the cpu_idx and kill the next_idx field in the structure.

The next_idx field is used only in the calling function, so it could be stored in a local variable.
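A minimal sketch of this alternative (hypothetical, not part of the series): the function keeps returning the candidate index (EAS_CPU_PRV on error, as in the original code) and the caller holds it in a local instead of reading eenv.next_idx:

	int next_idx;

	next_idx = select_energy_cpu_idx(&eenv);
	target_cpu = eenv.cpu[next_idx].cpu_id;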
Change-Id: I57bc9c96566eaf6085c85ea9386d11422cafddbf
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 42 +++++++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 37231b8..0f0b307 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5850,16 +5850,15 @@ static inline bool cpu_in_sg(struct sched_group *sg, int cpu)
  * select_energy_cpu_idx(): estimate the energy impact of changing the
  * utilization distribution.
  *
  * The eenv parameter specifies the changes: utilisation amount and a pair of
  * possible CPU candidates (the previous CPU and a different target CPU).
  *
- * This function returns the index of a CPU candidate specified by the
+ * eenv::next_idx returns the index of a CPU candidate specified by the
  * energy_env which corresponds to the first CPU saving energy.
  * Thus, 0 (EAS_CPU_PRV) means that none of the CPU candidates is more energy
- * efficient than running on prev_cpu. This is also the value returned in case
- * of abort due to error conditions during the computations.
- * A value greater than zero means that the first energy-efficient CPU is the
- * one represented by eenv->cpu[eenv->next_idx].cpu_id.
+ * efficient than running on prev_cpu. A value greater than zero means that
+ * the first energy-efficient CPU is the one represented by
+ * eenv->cpu[eenv->next_idx].cpu_id.
+ *
+ * The function returns a negative value in case of abort due to error
+ * conditions during the computations, success returns 0.
  */
 static inline int select_energy_cpu_idx(struct energy_env *eenv)
 {
@@ -5871,8 +5870,10 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv)

 	sd_cpu = eenv->cpu[EAS_CPU_PRV].cpu_id;
 	sd = rcu_dereference(per_cpu(sd_ea, sd_cpu));
-	if (!sd)
-		return EAS_CPU_PRV;
+	if (!sd) {
+		eenv->next_idx = EAS_CPU_PRV;
+		return -1;
+	}

 	sg = sd->groups;
 	do {
@@ -5882,8 +5883,10 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv)
 		eenv->sg_top = sg;

 		/* energy is unscaled to reduce rounding errors */
-		if (compute_energy(eenv) == -EINVAL)
-			return EAS_CPU_PRV;
+		if (compute_energy(eenv) == -EINVAL) {
+			eenv->next_idx = EAS_CPU_PRV;
+			return -EINVAL;
+		}

 	} while (sg = sg->next, sg != sd->groups);

@@ -5929,7 +5932,7 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv)
 		}
 	}

-	return eenv->next_idx;
+	return 0;
 }

 /*
@@ -7145,17 +7148,18 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)
 	if (backup_cpu >= 0)
 		cpumask_set_cpu(backup_cpu, &eenv.cpus_mask);

+	select_energy_cpu_idx(&eenv);
+
 	/* Check if EAS_CPU_NXT is a more energy efficient CPU */
-	if (select_energy_cpu_idx(&eenv) != EAS_CPU_PRV) {
+	if (eenv.next_idx != EAS_CPU_PRV) {
 		schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav);
 		schedstat_inc(this_rq()->eas_stats.secb_nrg_sav);
-		target_cpu = eenv.cpu[eenv.next_idx].cpu_id;
-		goto unlock;
 	} else {
 		schedstat_inc(p->se.statistics.nr_wakeups_secb_no_nrg_sav);
 		schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav);
-		target_cpu = prev_cpu;
-		goto unlock;
 	}
+
+	target_cpu = eenv.cpu[eenv.next_idx].cpu_id;
-- 1.9.1
--
http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs
Follow Linaro: http://www.facebook.com/pages/Linaro Facebook | http://twitter.com/#!/linaroorg Twitter | http://www.linaro.org/linaro-blog/ Blog
On Sat, Mar 17, 2018 at 08:05:16PM +0800, Leo Yan wrote:
Function select_energy_cpu_brute() can handle the case 'next_cpu == prev_cpu' first, which avoids a large amount of code indentation under the case 'next_cpu != prev_cpu'.
Change-Id: Iac08b5f1e6072805c0bee22680bd0ecab411cdba
Signed-off-by: Leo Yan leo.yan@linaro.org
Acked-by: Daniel Lezcano daniel.lezcano@linaro.org
On Mon, Mar 19, 2018 at 07:07:33AM +0100, Daniel Lezcano wrote:
On Sat, Mar 17, 2018 at 08:05:15PM +0800, Leo Yan wrote:
select_energy_cpu_idx() returns the target CPU index, and eenv->next_idx is used to carry the target CPU index as well; it is pointless to use two methods to return exactly the same value.

This patch changes the definition of select_energy_cpu_idx()'s return value: errors return a negative value and success returns 0. The target CPU index is then taken from eenv->next_idx.

Yes, there is a duplication of information. Actually, the change could go the other way: make select_energy_cpu_idx() return the cpu_idx and kill the next_idx field in the structure.

The next_idx field is used only in the calling function, so it could be stored in a local variable.
Yeah, good suggestion. Will fix in next version.
Thanks, Leo Yan
Hi Leo,
thanks for sharing these patches, which we have known you have been working on for a long time. I think they go back to our meeting last year in Mountain View...
I'm always a bit confused about the proper review channel. Usually we review Android-related bits on Google's gerrit... I'm wondering why you posted this series here while some of the patches are already uploaded on gerrit, e.g.
https://android-review.googlesource.com/c/kernel/common/+/555724
Anyway, I'll have a quick look at this series, and I'm sure you'll have a better chance to have a chat about these at Connect too.
Cheers Patrick
-- #include <best/regards.h>
Patrick Bellasi
Hi Leo,
On 17-Mar 20:05, Leo Yan wrote:
Move the energy env cpumask setting out of select_energy_cpu_idx() and prepare it ahead of time. This change makes the code clearer, since the energy environment variable can be prepared before the energy calculation.
Change-Id: I9061492177ad8d4830cf8bbdaffdf7860b27e319
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 315c386..37231b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5874,15 +5874,6 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv)
 	if (!sd)
 		return EAS_CPU_PRV;
-	cpumask_clear(&eenv->cpus_mask);
-	for (cpu_idx = EAS_CPU_PRV; cpu_idx < EAS_CPU_CNT; ++cpu_idx) {
-		int cpu = eenv->cpu[cpu_idx].cpu_id;
The intent of doing it here, and doing it with a loop, comes from the EAS_CPU_CNT definition, which should make it clear that we can potentially have more than the 3 candidates we have now.

That's also why I don't agree that moving this block outside makes the code clearer...
-		if (cpu < 0)
-			continue;
-		cpumask_set_cpu(cpu, &eenv->cpus_mask);
-	}
-
 	sg = sd->groups;
 	do {
 		/* Skip SGs which do not contain a candidate CPU */
@@ -7133,7 +7124,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)
 		},
 	};
-
 #ifdef CONFIG_SCHED_WALT
 	if (!walt_disabled && sysctl_sched_use_walt_cpu_util &&
 	    p->state == TASK_WAKING)
@@ -7147,6 +7137,14 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)
 		goto unlock;
 	}
+	cpumask_clear(&eenv.cpus_mask);
+	if (prev_cpu >= 0)
+		cpumask_set_cpu(prev_cpu, &eenv.cpus_mask);
+	if (next_cpu >= 0)
+		cpumask_set_cpu(next_cpu, &eenv.cpus_mask);
+	if (backup_cpu >= 0)
+		cpumask_set_cpu(backup_cpu, &eenv.cpus_mask);
+
 	/* Check if EAS_CPU_NXT is a more energy efficient CPU */
 	if (select_energy_cpu_idx(&eenv) != EAS_CPU_PRV) {
 		schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav);
-- 1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On 19-Mar 07:07, Daniel Lezcano wrote:
On Sat, Mar 17, 2018 at 08:05:15PM +0800, Leo Yan wrote:
select_energy_cpu_idx() returns the target CPU index, and eenv->next_idx is used to carry the target CPU index as well; it is pointless to use two methods to return exactly the same value.

This patch changes the definition of select_energy_cpu_idx()'s return value: errors return a negative value and success returns 0. The target CPU index is then taken from eenv->next_idx.

Yes, there is a duplication of information. Actually, the change could go the other way: make select_energy_cpu_idx() return the cpu_idx and kill the next_idx field in the structure.

The next_idx field is used only in the calling function, so it could be stored in a local variable.
Unfortunately it's not merged (yet) in eas-dev, but there is an energy_diff tracepoint which takes only eenv data.

That's the reason why we store next_idx in the eenv. Moreover, it makes the code more self-contained from a functional standpoint, since you always know that the eenv contains all the information which defines a certain energy-aware wakeup decision.

Returning next_idx is also another design decision, since it makes the code more streamlined to read.

Thus, overall I don't see much point in this patch if you consider the above two design decisions.
-- #include <best/regards.h>
Patrick Bellasi
On 17-Mar 20:05, Leo Yan wrote:
Function select_energy_cpu_brute() can handle the case 'next_cpu == prev_cpu' first, which avoids a large amount of code indentation under the case 'next_cpu != prev_cpu'.
The existing code is still within the 80-column margin and also not so bad to read.

Thus, in general I'm not a big fan of this kind of zero-functional-change patch... and I would really like to avoid them, at least when they are not part of a functional refactoring.
Moreover...
Change-Id: Iac08b5f1e6072805c0bee22680bd0ecab411cdba
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 91 +++++++++++++++++++++++++----------------------------
 1 file changed, 42 insertions(+), 49 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f0b307..b67999a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7060,6 +7060,8 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)
 	int target_cpu;
 	int backup_cpu;
 	int next_cpu;
+	int delta = 0;
+	struct energy_env eenv;

 	schedstat_inc(p->se.statistics.nr_wakeups_secb_attempts);
 	schedstat_inc(this_rq()->eas_stats.secb_attempts);
@@ -7108,63 +7110,54 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)
 	}

 	target_cpu = prev_cpu;
-	if (next_cpu != prev_cpu) {
-		int delta = 0;
-		struct energy_env eenv = {
-			.p		= p,
-			.util_delta	= task_util(p),
-			/* Task's previous CPU candidate */
-			.cpu[EAS_CPU_PRV] = {
-				.cpu_id = prev_cpu,
-			},
-			/* Main alternative CPU candidate */
-			.cpu[EAS_CPU_NXT] = {
-				.cpu_id = next_cpu,
-			},
-			/* Backup alternative CPU candidate */
-			.cpu[EAS_CPU_BKP] = {
-				.cpu_id = backup_cpu,
-			},
-		};
The block above exploits designated initializers to ensure zero initialization of all omitted members... for example eenv.energy, which is used by the following code to track the summed estimated energy of a candidate.
+	if (next_cpu == prev_cpu) {
+		schedstat_inc(p->se.statistics.nr_wakeups_secb_count);
+		schedstat_inc(this_rq()->eas_stats.secb_count);
+		goto unlock;
+	}
+
+	eenv.p = p;
+	eenv.util_delta = task_util(p);
+	/* Task's previous CPU candidate */
+	eenv.cpu[EAS_CPU_PRV].cpu_id = prev_cpu;
+	/* Main alternative CPU candidate */
+	eenv.cpu[EAS_CPU_NXT].cpu_id = next_cpu;
+	/* Backup alternative CPU candidate */
+	eenv.cpu[EAS_CPU_BKP].cpu_id = backup_cpu;
... AFAICS here we don't ensure eenv.cpu[XXX].energy = 0.

Thus, don't we get whatever is on the stack as the initial energy value?
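A hypothetical illustration of the point: designated initializers zero every omitted member, while plain field-by-field assignment of an uninitialized stack variable leaves the remaining members indeterminate:

	struct energy_env eenv_init = {
		.p = p,		/* all omitted members, e.g. cpu[i].energy, are 0 */
	};

	struct energy_env eenv_assign;	/* stack contents are indeterminate here */
	eenv_assign.p = p;		/* cpu[i].energy is still uninitialized */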
 #ifdef CONFIG_SCHED_WALT
-		if (!walt_disabled && sysctl_sched_use_walt_cpu_util &&
-		    p->state == TASK_WAKING)
-			delta = task_util(p);
+	if (!walt_disabled && sysctl_sched_use_walt_cpu_util &&
+	    p->state == TASK_WAKING)
+		delta = task_util(p);
 #endif

-		/* Not enough spare capacity on previous cpu */
-		if (__cpu_overutilized(prev_cpu, delta)) {
-			schedstat_inc(p->se.statistics.nr_wakeups_secb_insuff_cap);
-			schedstat_inc(this_rq()->eas_stats.secb_insuff_cap);
-			target_cpu = next_cpu;
-			goto unlock;
-		}
+	/* Not enough spare capacity on previous cpu */
+	if (__cpu_overutilized(prev_cpu, delta)) {
+		schedstat_inc(p->se.statistics.nr_wakeups_secb_insuff_cap);
+		schedstat_inc(this_rq()->eas_stats.secb_insuff_cap);
+		target_cpu = next_cpu;
+		goto unlock;
+	}

-		cpumask_clear(&eenv.cpus_mask);
-		if (prev_cpu >= 0)
-			cpumask_set_cpu(prev_cpu, &eenv.cpus_mask);
-		if (next_cpu >= 0)
-			cpumask_set_cpu(next_cpu, &eenv.cpus_mask);
-		if (backup_cpu >= 0)
-			cpumask_set_cpu(backup_cpu, &eenv.cpus_mask);
+	cpumask_clear(&eenv.cpus_mask);
+	if (prev_cpu >= 0)
+		cpumask_set_cpu(prev_cpu, &eenv.cpus_mask);
+	if (next_cpu >= 0)
+		cpumask_set_cpu(next_cpu, &eenv.cpus_mask);
+	if (backup_cpu >= 0)
+		cpumask_set_cpu(backup_cpu, &eenv.cpus_mask);

-		select_energy_cpu_idx(&eenv);
+	select_energy_cpu_idx(&eenv);
-		/* Check if EAS_CPU_NXT is a more energy efficient CPU */
-		if (eenv.next_idx != EAS_CPU_PRV) {
-			schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav);
-			schedstat_inc(this_rq()->eas_stats.secb_nrg_sav);
-		} else {
-			schedstat_inc(p->se.statistics.nr_wakeups_secb_no_nrg_sav);
-			schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav);
-		}
-
-		schedstat_inc(p->se.statistics.nr_wakeups_secb_count);
-		schedstat_inc(this_rq()->eas_stats.secb_count);
-
-		target_cpu = eenv.cpu[eenv.next_idx].cpu_id;
-	}
+	/* Check if EAS_CPU_NXT is a more energy efficient CPU */
+	if (eenv.next_idx != EAS_CPU_PRV) {
+		schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav);
+		schedstat_inc(this_rq()->eas_stats.secb_nrg_sav);
+	} else {
+		schedstat_inc(p->se.statistics.nr_wakeups_secb_no_nrg_sav);
+		schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav);
+	}
+
+	target_cpu = eenv.cpu[eenv.next_idx].cpu_id;
+	goto unlock;

 unlock:
 	rcu_read_unlock();
--
1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On 17-Mar 20:05, Leo Yan wrote:
This commit changes the code to use per-CPU data to maintain the energy environment structure, so it can be moved off the kernel stack; this avoids stack overflow issues if we want to add more items to the energy environment structure for later optimizations.
This relocation makes sense to us, to the point that:
- Chris already used this approach in the EAS re-expression for v4.14 kernels: https://android.googlesource.com/kernel/common/+/android-4.14/ which has been available for some months now.
- Quentin/Dietmar are using a similar approach for the "mainline" posting (soon to be posted on LKML)
Thus, this seems to be yet another implementation.
IMHO we should focus on reviewing/refactoring the v4.14 version, maybe by improving what's in that kernel to add some of the interesting features you are proposing in this series.
At the end it would be nice to reduce the amount of code variations we have around... thus, before merging such a big change in v4.9, I would like to see that we all agree we can use the same solution both in v4.14 and v4.9.
Another thing to consider is that the mainline solution is more likely to be aligned with what Dietmar is going to push on LKML (perhaps by today)... thus keep an eye on that too.
Change-Id: Ifdcb1bbae36cb8e074b31ec35f00c79f6a14b973
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 46 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6c0b918..9997a6a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6690,7 +6690,36 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	return target;
 }
+static struct kmem_cache *energy_env_cache __read_mostly;
+static DEFINE_PER_CPU(struct energy_env *, energy_env);
+
+static void init_energy_env(void)
+{
+	struct energy_env *eenv;
+	int i;
+
+	energy_env_cache = KMEM_CACHE(energy_env, 0);
+
+	for_each_possible_cpu(i) {
+		eenv = kmem_cache_alloc(energy_env_cache,
+					GFP_KERNEL | __GFP_ZERO);
+		if (!eenv)
+			goto err;
+
+		rcu_assign_pointer(per_cpu(energy_env, i), eenv);
+	}
+
+	return;
+
+err:
+	for_each_possible_cpu(i) {
+		eenv = rcu_dereference(per_cpu(energy_env, i));
+		if (!eenv)
+			continue;
+
+		kmem_cache_free(energy_env_cache, eenv);
+	}
+}
+
 /*
  * cpu_util_wake: Compute cpu utilization with any contributions from
  * the waking task p removed. check_for_migration() looks for a better CPU of
@@ -7061,14 +7090,13 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)
 	int backup_cpu;
 	int next_cpu;
 	int delta = 0;
-	struct energy_env eenv;
+	int cpu = smp_processor_id();
+	struct energy_env *eenv;

 	schedstat_inc(p->se.statistics.nr_wakeups_secb_attempts);
 	schedstat_inc(this_rq()->eas_stats.secb_attempts);

 	if (sysctl_sched_sync_hint_enable && sync) {
-		int cpu = smp_processor_id();
-
 		if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) {
 			schedstat_inc(p->se.statistics.nr_wakeups_secb_sync);
 			schedstat_inc(this_rq()->eas_stats.secb_sync);
@@ -7116,14 +7144,15 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)
 		goto unlock;
 	}

-	eenv.p = p;
-	eenv.util_delta = task_util(p);
+	eenv = rcu_dereference(per_cpu(energy_env, cpu));
+	eenv->p = p;
+	eenv->util_delta = task_util(p);
 	/* Task's previous CPU candidate */
-	eenv.cpu[EAS_CPU_PRV].cpu_id = prev_cpu;
+	eenv->cpu[EAS_CPU_PRV].cpu_id = prev_cpu;
 	/* Main alternative CPU candidate */
-	eenv.cpu[EAS_CPU_NXT].cpu_id = next_cpu;
+	eenv->cpu[EAS_CPU_NXT].cpu_id = next_cpu;
 	/* Backup alternative CPU candidate */
-	eenv.cpu[EAS_CPU_BKP].cpu_id = backup_cpu;
+	eenv->cpu[EAS_CPU_BKP].cpu_id = backup_cpu;
 #ifdef CONFIG_SCHED_WALT
 	if (!walt_disabled && sysctl_sched_use_walt_cpu_util &&
@@ -7138,18 +7167,18 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)
 		goto unlock;
 	}

-	cpumask_clear(&eenv.cpus_mask);
+	cpumask_clear(&eenv->cpus_mask);
 	if (prev_cpu >= 0)
-		cpumask_set_cpu(prev_cpu, &eenv.cpus_mask);
+		cpumask_set_cpu(prev_cpu, &eenv->cpus_mask);
 	if (next_cpu >= 0)
-		cpumask_set_cpu(next_cpu, &eenv.cpus_mask);
+		cpumask_set_cpu(next_cpu, &eenv->cpus_mask);
 	if (backup_cpu >= 0)
-		cpumask_set_cpu(backup_cpu, &eenv.cpus_mask);
+		cpumask_set_cpu(backup_cpu, &eenv->cpus_mask);

-	select_energy_cpu_idx(&eenv);
+	select_energy_cpu_idx(eenv);

 	/* Check if EAS_CPU_NXT is a more energy efficient CPU */
-	if (eenv.next_idx != EAS_CPU_PRV) {
+	if (eenv->next_idx != EAS_CPU_PRV) {
 		schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav);
 		schedstat_inc(this_rq()->eas_stats.secb_nrg_sav);
 	} else {
@@ -7157,7 +7186,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)
 		schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav);
 	}

-	target_cpu = eenv.cpu[eenv.next_idx].cpu_id;
+	target_cpu = eenv->cpu[eenv->next_idx].cpu_id;

 unlock:
 	rcu_read_unlock();
@@ -10975,6 +11004,7 @@ __init void init_sched_fair_class(void)
 	nohz.next_balance = jiffies;
 	zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT);
 #endif
+	init_energy_env();
 #endif /* SMP */
 }
1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On 17-Mar 20:05, Leo Yan wrote:
The energy calculation has two steps: the first step selects CPU candidates, and the second step compares the energy for these candidates. Currently the first step uses three candidates (EAS_CPU_PRV, EAS_CPU_NXT and EAS_CPU_BKP), and the energy calculation also iterates over these three CPU indexes.

If the code is bound to these three candidates, it is hard to extend to more CPU candidates; on the other hand, a cpumask structure is more suitable for tracking candidates and for extending to more of them.

This patch uses a cpumask to track the candidates, and removes the three predefined candidates EAS_CPU_PRV, EAS_CPU_NXT and EAS_CPU_BKP, as sketched below.
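A minimal sketch of the candidate handling this patch moves to (derived from the diff below; candidates become bits in eenv::cpus_mask instead of fixed EAS_CPU_* slots):

	cpumask_clear(&eenv->cpus_mask);
	if (prev_cpu >= 0)
		cpumask_set_cpu(prev_cpu, &eenv->cpus_mask);
	if (next_cpu >= 0)
		cpumask_set_cpu(next_cpu, &eenv->cpus_mask);
	if (backup_cpu >= 0)
		cpumask_set_cpu(backup_cpu, &eenv->cpus_mask);

	for_each_cpu(cpu, &eenv->cpus_mask) {
		/* evaluate candidate 'cpu' */
	}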
Again, I'm in favour of this approach, but I still think it should be aligned with the mainline proposal... which, as I said, Dietmar is going to post soon on LKML...
Change-Id: I0fb1a904b6024ebfb3c26daa1a00dc8132eebb3a
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 151 ++++++++++++++++++++--------------------------------
 1 file changed, 57 insertions(+), 94 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9997a6a..f4206c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5444,20 +5444,6 @@ static inline bool energy_aware(void)
 }

 /*
- * CPU candidates.
- *
- * These are labels to reference CPU candidates for an energy_diff.
- * Currently we support only two possible candidates: the task's previous CPU
- * and another candidate CPU.
- * More advanced/aggressive EAS selection policies can consider more
- * candidates.
- */
-#define EAS_CPU_PRV	0
-#define EAS_CPU_NXT	1
-#define EAS_CPU_BKP	2
-#define EAS_CPU_CNT	3
-
-/*
  * energy_diff - supports the computation of the estimated energy impact in
  * moving a "task"'s "util_delta" between different CPU candidates.
  */
@@ -5471,10 +5457,6 @@ struct energy_env {

 	/* CPU candidates to evaluate */
 	struct {
-		/* CPU ID, must be in cpus_mask */
-		int	cpu_id;
-
 		/*
 		 * Index (into sched_group_energy::cap_states) of the OPP the
 		 * CPU needs to run at if the task is placed on it.
@@ -5488,16 +5470,16 @@ struct energy_env {
 		/* Estimated system energy */
 		unsigned int energy;

-		/* Estimated energy variation wrt EAS_CPU_PRV */
+		/* Estimated energy variation wrt previous CPU */
 		int	nrg_delta;

-	} cpu[EAS_CPU_CNT];
+	} cpu[NR_CPUS];

-	/*
-	 * Index (into energy_env::cpu) of the most energy efficient CPU for
-	 * the specified energy_env::p
-	 */
-	int	next_idx;
+	/* The most energy efficient CPU for the specified energy_env::p */
+	int	next_cpu;
+
+	/* Previous CPU */
+	int	prev_cpu;

 	/* Support data */
 	struct sched_group	*sg_top;
@@ -5530,7 +5512,7 @@ static unsigned long __cpu_norm_util(unsigned long util, unsigned long capacity)
 	return (util << SCHED_CAPACITY_SHIFT)/capacity;
 }

-static unsigned long group_max_util(struct energy_env *eenv, int cpu_idx)
+static unsigned long group_max_util(struct energy_env *eenv, int cpu_id)
 {
 	unsigned long max_util = 0;
 	unsigned long util;
@@ -5545,7 +5527,7 @@ static unsigned long group_max_util(struct energy_env *eenv, int cpu_idx)
 		 * then we should add the (estimated) utilization of the task
 		 * assuming we will wake it up on that CPU.
 		 */
-		if (unlikely(cpu == eenv->cpu[cpu_idx].cpu_id))
+		if (unlikely(cpu == cpu_id))
 			util += eenv->util_delta;

 		max_util = max(max_util, util);
@@ -5575,9 +5557,9 @@ static unsigned long group_max_util(struct energy_env *eenv, int cpu_idx)
  *   estimate (more busy).
  */
 static unsigned
-long group_norm_util(struct energy_env *eenv, int cpu_idx)
+long group_norm_util(struct energy_env *eenv, int cpu_id)
 {
-	unsigned long capacity = eenv->cpu[cpu_idx].cap;
+	unsigned long capacity = eenv->cpu[cpu_id].cap;
 	unsigned long util, util_sum = 0;
 	int cpu;

@@ -5590,7 +5572,7 @@ long group_norm_util(struct energy_env *eenv, int cpu_idx)
 		 * then we should add the (estimated) utilization of the task
 		 * assuming we will wake it up on that CPU.
 		 */
-		if (unlikely(cpu == eenv->cpu[cpu_idx].cpu_id))
+		if (unlikely(cpu == cpu_id))
 			util += eenv->util_delta;

 		util_sum += __cpu_norm_util(util, capacity);
@@ -5599,29 +5581,29 @@ long group_norm_util(struct energy_env *eenv, int cpu_idx)
 	return min_t(unsigned long, util_sum, SCHED_CAPACITY_SCALE);
 }

-static int find_new_capacity(struct energy_env *eenv, int cpu_idx)
+static int find_new_capacity(struct energy_env *eenv, int cpu_id)
 {
 	const struct sched_group_energy *sge = eenv->sg->sge;
 	int idx, max_idx = sge->nr_cap_states - 1;
-	unsigned long util = group_max_util(eenv, cpu_idx);
+	unsigned long util = group_max_util(eenv, cpu_id);

 	/* default is max_cap if we don't find a match */
-	eenv->cpu[cpu_idx].cap_idx = max_idx;
-	eenv->cpu[cpu_idx].cap = sge->cap_states[max_idx].cap;
+	eenv->cpu[cpu_id].cap_idx = max_idx;
+	eenv->cpu[cpu_id].cap = sge->cap_states[max_idx].cap;

 	for (idx = 0; idx < sge->nr_cap_states; idx++) {
 		if (sge->cap_states[idx].cap >= util) {
 			/* Keep track of SG's capacity */
-			eenv->cpu[cpu_idx].cap_idx = idx;
-			eenv->cpu[cpu_idx].cap = sge->cap_states[idx].cap;
+			eenv->cpu[cpu_id].cap_idx = idx;
+			eenv->cpu[cpu_id].cap = sge->cap_states[idx].cap;
 			break;
 		}
 	}

-	return eenv->cpu[cpu_idx].cap_idx;
+	return eenv->cpu[cpu_id].cap_idx;
 }
-static int group_idle_state(struct energy_env *eenv, int cpu_idx)
+static int group_idle_state(struct energy_env *eenv, int cpu_id)
 {
 	struct sched_group *sg = eenv->sg;
 	int i, state = INT_MAX;
@@ -5635,10 +5617,8 @@ static int group_idle_state(struct energy_env *eenv, int cpu_idx)
 	/* Take non-cpuidle idling into account (active idle/arch_cpu_idle()) */
 	state++;

-	src_in_grp = cpumask_test_cpu(eenv->cpu[EAS_CPU_PRV].cpu_id,
-				      sched_group_cpus(sg));
-	dst_in_grp = cpumask_test_cpu(eenv->cpu[cpu_idx].cpu_id,
-				      sched_group_cpus(sg));
+	src_in_grp = cpumask_test_cpu(eenv->prev_cpu, sched_group_cpus(sg));
+	dst_in_grp = cpumask_test_cpu(cpu_id, sched_group_cpus(sg));
 	if (src_in_grp == dst_in_grp) {
 		/* both CPUs under consideration are in the same group or not in
 		 * either group, migration should leave idle state the same.
@@ -5652,7 +5632,7 @@ static int group_idle_state(struct energy_env *eenv, int cpu_idx)
 	 */
 	for_each_cpu(i, sched_group_cpus(sg)) {
 		grp_util += cpu_util_wake(i, eenv->p);
-		if (unlikely(i == eenv->cpu[cpu_idx].cpu_id))
+		if (unlikely(i == cpu_id))
 			grp_util += eenv->util_delta;
 	}
@@ -5707,39 +5687,36 @@ static void calc_sg_energy(struct energy_env *eenv)
 	unsigned long sg_util;
 	int cap_idx, idle_idx;
 	int total_energy = 0;
-	int cpu_idx;
+	int cpu;

-	for (cpu_idx = EAS_CPU_PRV; cpu_idx < EAS_CPU_CNT; ++cpu_idx) {
-		if (eenv->cpu[cpu_idx].cpu_id == -1)
-			continue;
+	for_each_cpu(cpu, &eenv->cpus_mask) {
 		/* Compute ACTIVE energy */
-		cap_idx = find_new_capacity(eenv, cpu_idx);
+		cap_idx = find_new_capacity(eenv, cpu);
 		busy_power = sg->sge->cap_states[cap_idx].power;
 		/*
 		 * in order to calculate cpu_norm_util, we need to know which
 		 * capacity level the group will be at, so calculate that first
 		 */
-		sg_util = group_norm_util(eenv, cpu_idx);
+		sg_util = group_norm_util(eenv, cpu);
 		busy_energy = sg_util * busy_power;

 		/* Compute IDLE energy */
-		idle_idx = group_idle_state(eenv, cpu_idx);
+		idle_idx = group_idle_state(eenv, cpu);
 		idle_power = sg->sge->idle_states[idle_idx].power;

 		idle_energy = SCHED_CAPACITY_SCALE - sg_util;
 		idle_energy *= idle_power;

 		total_energy = busy_energy + idle_energy;
-		eenv->cpu[cpu_idx].energy += total_energy;
+		eenv->cpu[cpu].energy += total_energy;
 	}
 }
 /*
  * compute_energy() computes the absolute variation in energy consumption by
- * moving eenv.util_delta from EAS_CPU_PRV to EAS_CPU_NXT.
+ * moving eenv::util_delta from eenv::prev_cpu to eenv::next_cpu.
  *
  * NOTE: compute_energy() may fail when racing with sched_domain updates, in
  * which case we abort by returning -EINVAL.
  */
@@ -5850,12 +5827,8 @@ static inline bool cpu_in_sg(struct sched_group *sg, int cpu)
  * select_energy_cpu_idx(): estimate the energy impact of changing the
  * utilization distribution.
  *
- * eenv::next_idx returns the index of a CPU candidate specified by the
- * energy_env which corresponds to the first CPU saving energy.
- * Thus, 0 (EAS_CPU_PRV) means that none of the CPU candidates is more energy
- * efficient than running on prev_cpu. A value greater than zero means that
- * the first energy-efficient CPU is the one represented by
- * eenv->cpu[eenv->next_idx].cpu_id.
+ * eenv::next_cpu returns a CPU candidate specified by the energy_env which
+ * corresponds to the first energy-efficient CPU.
  *
  * The function returns a negative value in case of abort due to error
  * conditions during the computations, success returns 0.
 */
@@ -5864,14 +5837,13 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int sd_cpu = -1;
-	int cpu_idx;
+	int cpu;
 	int margin;

-	sd_cpu = eenv->cpu[EAS_CPU_PRV].cpu_id;
-	sd = rcu_dereference(per_cpu(sd_ea, sd_cpu));
+	cpu = cpumask_first(&eenv->cpus_mask);
+	sd = rcu_dereference(per_cpu(sd_ea, cpu));
 	if (!sd) {
-		eenv->next_idx = EAS_CPU_PRV;
+		eenv->next_cpu = eenv->prev_cpu;
 		return -1;
 	}
@@ -5884,49 +5856,46 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv)
 		eenv->sg_top = sg;

 		/* energy is unscaled to reduce rounding errors */
 		if (compute_energy(eenv) == -EINVAL) {
-			eenv->next_idx = EAS_CPU_PRV;
+			eenv->next_cpu = eenv->prev_cpu;
 			return -EINVAL;
 		}

 	} while (sg = sg->next, sg != sd->groups);

 	/* Scale energy before comparisons */
-	for (cpu_idx = EAS_CPU_PRV; cpu_idx < EAS_CPU_CNT; ++cpu_idx)
-		eenv->cpu[cpu_idx].energy >>= SCHED_CAPACITY_SHIFT;
+	for_each_cpu(cpu, &eenv->cpus_mask)
+		eenv->cpu[cpu].energy >>= SCHED_CAPACITY_SHIFT;

 	/*
 	 * Compute the dead-zone margin used to prevent too many task
 	 * migrations with negligible energy savings.
 	 * An energy saving is considered meaningful if it reduces the energy
-	 * consumption of EAS_CPU_PRV CPU candidate by at least ~1.56%
+	 * consumption of previous CPU candidate by at least ~1.56%
 	 */
-	margin = eenv->cpu[EAS_CPU_PRV].energy >> 6;
+	margin = eenv->cpu[eenv->prev_cpu].energy >> 6;

 	/*
-	 * By default the EAS_CPU_PRV CPU is considered the most energy
+	 * By default the previous CPU is considered the most energy
 	 * efficient, with a 0 energy variation.
 	 */
-	eenv->next_idx = EAS_CPU_PRV;
+	eenv->next_cpu = eenv->prev_cpu;

 	/*
 	 * Compare the other CPU candidates to find a CPU which can be
-	 * more energy efficient then EAS_CPU_PRV
+	 * more energy efficient than previous CPU
 	 */
-	for (cpu_idx = EAS_CPU_NXT; cpu_idx < EAS_CPU_CNT; ++cpu_idx) {
-		/* Skip not valid scheduled candidates */
-		if (eenv->cpu[cpu_idx].cpu_id < 0)
-			continue;
-		/* Compute energy delta wrt EAS_CPU_PRV */
-		eenv->cpu[cpu_idx].nrg_delta =
-			eenv->cpu[cpu_idx].energy -
-			eenv->cpu[EAS_CPU_PRV].energy;
+	for_each_cpu(cpu, &eenv->cpus_mask) {
+		/* Compute energy delta wrt previous CPU */
+		eenv->cpu[cpu].nrg_delta =
+			eenv->cpu[cpu].energy - eenv->cpu[eenv->prev_cpu].energy;

 		/* filter energy variations within the dead-zone margin */
-		if (abs(eenv->cpu[cpu_idx].nrg_delta) < margin)
-			eenv->cpu[cpu_idx].nrg_delta = 0;
+		if (abs(eenv->cpu[cpu].nrg_delta) < margin)
+			eenv->cpu[cpu].nrg_delta = 0;

 		/* update the schedule candidate with min(nrg_delta) */
-		if (eenv->cpu[cpu_idx].nrg_delta <
-		    eenv->cpu[eenv->next_idx].nrg_delta) {
-			eenv->next_idx = cpu_idx;
+		if (eenv->cpu[cpu].nrg_delta <
+		    eenv->cpu[eenv->next_cpu].nrg_delta) {
+			eenv->next_cpu = cpu;
 			if (sched_feat(FBT_STRICT_ORDER))
 				break;
 		}
@@ -7147,12 +7116,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)
 	eenv = rcu_dereference(per_cpu(energy_env, cpu));
 	eenv->p = p;
 	eenv->util_delta = task_util(p);
-	/* Task's previous CPU candidate */
-	eenv->cpu[EAS_CPU_PRV].cpu_id = prev_cpu;
-	/* Main alternative CPU candidate */
-	eenv->cpu[EAS_CPU_NXT].cpu_id = next_cpu;
-	/* Backup alternative CPU candidate */
-	eenv->cpu[EAS_CPU_BKP].cpu_id = backup_cpu;

 #ifdef CONFIG_SCHED_WALT
 	if (!walt_disabled && sysctl_sched_use_walt_cpu_util &&
@@ -7177,8 +7140,8 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)

 	select_energy_cpu_idx(eenv);

-	/* Check if EAS_CPU_NXT is a more energy efficient CPU */
-	if (eenv->next_idx != EAS_CPU_PRV) {
+	/* Check if eenv::next_cpu is a more energy efficient CPU */
+	if (eenv->next_cpu != eenv->prev_cpu) {
 		schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav);
 		schedstat_inc(this_rq()->eas_stats.secb_nrg_sav);
 	} else {
@@ -7186,7 +7149,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync)
 		schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav);
 	}

-	target_cpu = eenv->cpu[eenv->next_idx].cpu_id;
+	target_cpu = eenv->next_cpu;

 unlock:
 	rcu_read_unlock();
--
1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On 17-Mar 20:05, Leo Yan wrote:
During the energy calculation, cpumask bits are iterated according to eenv::sg_top; there are two race conditions between the energy calculation flow and the CPU hotplug flow, so in the middle of the calculation any CPU might be hotplugged in or out.

This patch checks two conditions to detect whether a hotplug race has been hit, and if so it bails out directly with an error value. During the calculation, before the CPU iteration has finished, if cpu_count reaches zero it means some CPU has been concurrently hotplugged into the system; this is the first race condition. After the calculation has completed, we expect cpu_count to be zero, meaning all CPUs have been traversed one by one; otherwise some CPUs have been hotplugged out and were skipped in the calculation, which detects the second race condition. This patch refactors compute_energy() using these two race-condition checks and removes redundant variables.
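A condensed sketch of the two checks (derived from the diff below; the per-group energy accumulation is elided):

	cpu_count = cpumask_weight(sched_group_cpus(eenv->sg_top));

	for_each_domain(cpu, sd) {
		/* ... per-group energy accumulation ... */
		if (!sd->child) {
			if (!cpu_count)	/* race 1: a CPU was hotplugged in */
				return -EINVAL;
			cpu_count--;
		}
	}

	if (cpu_count)		/* race 2: a CPU was hotplugged out */
		return -EINVAL;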
In principle I agree with most of the changes you propose up to and including this patch. This is a massive refactoring of the way we currently compute the energy diff, without changing the external behaviour.
Still, I would suggest having these modifications properly mixed/merged into v4.14. Chris got the chance to almost completely rewrite EAS in v4.14 to:

1) get rid of the history which we don't really need, thus consolidating the proposal into a minimal set of clean patches

2) introduce most of the optimizations we have mainly discussed so far

This first set of patches from you, 01 to 12, is a good example of such optimizations. Some of these concepts are already there in v4.14:
https://android.googlesource.com/kernel/common/+/android-4.14/
Others can be easily ported/adapted to that code line. For example, I like your proposals for a per_cpu(eenv) using a bitmask to select the CPUs, as well as the optimization of sched group capacity computation and caching. Not everything is new if you look at the v4.14 codebase... but some things, I think, are still missing and could/should be added.
Regarding merging these patches into v4.9 instead, I think it's not entirely worth it, mainly because we already have two different proposals/variations: the v4.14 Android flavour and an upcoming mainline simplified model. I would really like to minimize code variations and try to consolidate what we have in the v4.14 code base... that version is and can still be different from the mainline proposal... but at least we will have only one.

For v4.9, once we agree on a unified and optimized expression in v4.14, we can always try to backport it to the older kernel too.
Change-Id: I8dadea2229dabb8ddc60597978f724ccaae61311
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 kernel/sched/fair.c | 102 +++++++++++++++++++++++-----------------------------
 1 file changed, 44 insertions(+), 58 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e5d5ed5..49eee75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5719,83 +5719,69 @@ static void calc_sg_energy(struct energy_env *eenv, int cpu)
  */
 static int compute_energy(struct energy_env *eenv, int candidate)
 {
-	struct cpumask visit_cpus;
 	int cpu_count;
+	int cpu;
+	struct sched_domain *sd;

 	WARN_ON(!eenv->sg_top->sge);

-	cpumask_copy(&visit_cpus, sched_group_cpus(eenv->sg_top));
 	/* If a cpu is hotplugged in while we are in this function,
-	 * it does not appear in the existing visit_cpus mask
+	 * it does not appear in the existing eenv::sg_top mask
 	 * which came from the sched_group pointer of the
 	 * sched_domain pointed at by sd_ea for either the prev
 	 * or next cpu and was dereferenced in __energy_diff.
-	 * Since we will dereference sd_scs later as we iterate
+	 * Since we will dereference sd later as we iterate
 	 * through the CPUs we expect to visit, new CPUs can
-	 * be present which are not in the visit_cpus mask.
+	 * be present which are not in the eenv::sg_top mask.
 	 * Guard this with cpu_count.
 	 */
-	cpu_count = cpumask_weight(&visit_cpus);
+	cpu_count = cpumask_weight(sched_group_cpus(eenv->sg_top));
+	cpu = cpumask_first(sched_group_cpus(eenv->sg_top));

-	while (!cpumask_empty(&visit_cpus)) {
-		int cpu = cpumask_first(&visit_cpus);
-		struct sched_domain *sd;
+	for_each_domain(cpu, sd) {
+		struct sched_group *sg = sd->groups;

-		for_each_domain(cpu, sd) {
-			struct sched_group *sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(eenv->sg_top),
+						sched_group_cpus(sg)))
+				continue;

-			do {
-				/* Has this sched_domain already been visited? */
-				if (sd->child && group_first_cpu(sg) != cpu)
-					break;
+			/* Has this sched_domain already been visited? */
+			if (sd->child && group_first_cpu(sg) != cpu)
+				break;

-				/*
-				 * Compute the energy for all the candidate
-				 * CPUs in the current visited SG.
-				 */
-				eenv->sg = sg;
-				calc_sg_energy(eenv, candidate);
+			/*
+			 * Compute the energy for all the candidate
+			 * CPUs in the current visited SG.
+			 */
+			eenv->sg = sg;
+			calc_sg_energy(eenv, candidate);

-				if (!sd->child) {
-					/*
-					 * cpu_count here is the number of
-					 * cpus we expect to visit in this
-					 * calculation. If we race against
-					 * hotplug, we can have extra cpus
-					 * added to the groups we are
-					 * iterating which do not appear in
-					 * the visit_cpus mask. In that case
-					 * we are not able to calculate energy
-					 * without restarting so we will bail
-					 * out and use prev_cpu this time.
-					 */
-					if (!cpu_count)
-						return -EINVAL;
-
-					cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
-					cpu_count--;
-				}
-
-				if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(eenv->sg_top)))
-					goto next_cpu;
-
-			} while (sg = sg->next, sg != sd->groups);
-		}
-
-		/*
-		 * If we raced with hotplug and got an sd NULL-pointer;
-		 * returning a wrong energy estimation is better than
-		 * entering an infinite loop.
-		 * Specifically: If a cpu is unplugged after we took
-		 * the visit_cpus mask, it no longer has an sd_scs
-		 * pointer, so when we dereference it, we get NULL.
-		 */
-		if (cpumask_test_cpu(cpu, &visit_cpus))
-			return -EINVAL;
-next_cpu:
-		cpumask_clear_cpu(cpu, &visit_cpus);
-		continue;
-	}
+			if (!sd->child) {
+				/*
+				 * cpu_count here is the number of cpus we
+				 * expect to visit in this calculation. If
+				 * we race against hotplug, we can have
+				 * extra cpus added to the groups we are
+				 * iterating which do not appear in the
+				 * eenv::sg_top mask. In that case we are
+				 * not able to calculate energy without
+				 * restarting so we will bail out and use
+				 * prev_cpu this time.
+				 */
+				if (!cpu_count)
+					return -EINVAL;
+
+				cpu_count--;
+			}
+		} while (sg = sg->next, sg != sd->groups);
+	}

+	/*
+	 * If we raced with hotplug and got an sd NULL-pointer;
+	 * returning a wrong energy estimation is better than
+	 * entering an infinite loop.
+	 * Specifically: If a cpu is unplugged after we took
+	 * the eenv::sg_top mask, it no longer has an sd pointer,
+	 * so when we dereference it, we get NULL.
+	 */
+	if (cpu_count)
+		return -EINVAL;
+
 	return 0;
 }
-- 1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On 17-Mar 20:05, Leo Yan wrote:
Let's first look at an example where a small-utilization task is woken up and we need to calculate the energy for two candidate CPUs. From a hardware design perspective, a candidate CPU cannot decide its OPP by itself, because it is bound to other CPUs in the same clock domain (e.g. it shares a clock domain with the other CPUs in its cluster), so in the end we need to compute the energy for all CPUs in the cluster.

Let's use the CPU topology below as the example:
	Cluster_0	Cluster_1
	  CPU_0		  CPU_4
	  CPU_1		  CPU_5
	  CPU_2		  CPU_6
	  CPU_3		  CPU_7
The current code always calculates the energy for all CPUs in the bound clock domain. If the candidates are CPU_0 and CPU_4, the formula for the energy computation is as below:

E(CPU_x) stands for the energy of CPU_x; the notation E(CPU_x)` means the energy of CPU_x after the task is placed on it; TE(CPU_x) means the total energy for all CPUs when the task is placed on CPU_x; E_DIFF(CPU_x - CPU_y) means the energy difference between placing the task on CPU_x and on CPU_y, which equals TE(CPU_x) - TE(CPU_y).
TE(CPU_0) = E(CPU_0)` + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)` + E(CPU_4) + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)
TE(CPU_4) = E(CPU_0) + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0) + E(CPU_4)` + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)`
E_DIFF(CPU_0 - CPU_4) = TE(CPU_0) - TE(CPU_4)
From the formula above we can easily see that the CPU_1/2/3/5/6/7 energy computations are redundant,
This means that, for example, the contribution E(CPU_2) is the same in TE(CPU_0) and TE(CPU_4), right?
If that's what you mean, I'm not entirely convinced. This is true _IFF_ the OPP for Cluster_0 does not change while having the task in CPU_0 or having the task in the other cluster.
If the OPP for Cluster_0 should change, then E(CPU_2) can have two very different values in TE(CPU_0) and TE(CPU_4).
on the other hand, if we only consider the energy consumed by the woken task, the energy difference is between the target CPU's energy before and after the task is placed on it.
Sorry, I don't completely get this last point... and it seems to be the key to understanding the next paragraph :/
One benefit of this method is that it avoids computing the energy of all CPUs in the same cluster and focuses only on the energy change introduced by the woken task on the target CPU;
Do you mean that we should ignore the blocked utilization?
this method is called 'task oriented computation' in this patch. The energy computation can be optimized as:
TE(CPU_0) = E(CPU_0)` + E(CLS_0)` - E(CPU_0) - E(CLS_0) TE(CPU_4) = E(CPU_4)` + E(CLS_1)` - E(CPU_4) - E(CLS_1)
Again, IMHO in the above formula, if the OPP could change we cannot disregard energy variations from other cpus in Cluster_1.
E_DIFF(CPU_0 - CPU_4) = TE(CPU_0) - TE(CPU_4)
As a result, the computation iterations can be reduced from 20 to 8 (old: 2 candidates x (8 CPUs + 2 cluster terms) = 20; new: 2 candidates x 4 terms = 8), which significantly reduces the calculation overhead.

With the task oriented computation, there is one case where the computation might take longer than the previous method. For instance, when the candidates are CPU_0 and CPU_1, and after placing the task on either CPU the OPP will be increased,
... but this can happen also in the previous case.
Let's say that CPU_4 has a utilization which is ~80% less than the capacity of its current OPP and that the task's PREV_CPU is CPU_0.

Let's also assume that the utilization of the task is big enough to require a capacity increase on CPU_4. Then the energy for all the other CPUs in Cluster_1 is different between the before and after cases.
in this case, the old method uses the computation below:
TE(CPU_0) = E(CPU_0)` + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0) TE(CPU_1) = E(CPU_0) + E(CPU_1)` + E(CPU_2) + E(CPU_3) + E(CLS_0)
E_DIFF(CPU_0 - CPU_1) = TE(CPU_0) - TE(CPU_1)
Using the task oriented computation, because the OPP increase impacts other CPUs in the same cluster, the energy of all related CPUs needs to be calculated:
TE(CPU_0) = E(CPU_0)` + E(CPU_1)` + E(CPU_2)` + E(CPU_3)` + E(CLS_0)`
            - E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)

TE(CPU_1) = E(CPU_0)` + E(CPU_1)` + E(CPU_2)` + E(CPU_3)` + E(CLS_0)`
            - E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)

E_DIFF(CPU_0 - CPU_1) = TE(CPU_0) - TE(CPU_1)
We could use more complex methods for further optimization; e.g. we can extend the eenv structure to cache CPU energy data so it can be reused for the two candidates. This is left for a later optimization.
As a side effect, this patch also resolves an inconsistency in the energy calculation: in some cases the energy is calculated for one cluster and in other cases for multiple clusters, so the energy data semantics are not consistent across scenarios.
That's because we care about relative differences in energy consumption...
This computation inconsistency might introduce trouble for the PE filter.
... we should look into those issues. We know about some of them and already have patches, which were part of the 3.18 kernel used by Pixel devices but are not yet merged in 4.4, since we noticed that these cases, although broken, also do not happen in Android because of the schedtune configurations used by Pixel devices.
This patch fixes the issue by always calculating task-based energy.
To achieve the optimization, this patch utilizes 'eenv->sg_cap' and 'eenv->sg_top' parameters; the parameter 'eenv->sg_cap' is only about the CPU capacity shared attribution, so eventually it's to describe the clock domain shared within CPUs, from this parameter we can get to know the final OPP selection; we need utilize parameter 'eenv->sg_top' to define which CPU we take care about, if the frequency is not changed after placing waken task then it will set the first level scheduling group to it (means one the single CPU) so the energy computation is limited to this single CPU, otherwise it rolls back to compute all CPUs
Ok, so that means that we always check for OPP changes... then it should work...
in the same clock domain.
Change-Id: Ifb64ad77c6173388abf13e255e2ed8e8586a38bc Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 76 ++++++++++++++++++++++++++++++----------------------- 1 file changed, 43 insertions(+), 33 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 49eee75..4811fce 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5473,7 +5473,7 @@ struct energy_env { /* Estimated energy variation wrt previous CPU */ int nrg_delta;
- } cpu[NR_CPUS];
+ } cpu[NR_CPUS*2];
^^^^^^^^^^^
Why can't we have a before/after set of variables directly in the definition of eenv?
Given the usage below it should also turns out to be slightly more cache friendly. i.e.
struct energy_env {
#define PREV_CAP 0
#define NEXT_CAP 1
	struct {
		// ...
	} cpu[2][NR_CPUS];
}
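With that layout the before/after bookkeeping at the bottom of compute_task_energy() needs no index arithmetic; a sketch, assuming the two macros above:

	eenv->cpu[NEXT_CAP][cpu].energy -= eenv->cpu[PREV_CAP][cpu].energy;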
/* The most energy efficient CPU for the specified energy_env::p */
int next_cpu;

@@ -5791,45 +5791,55 @@ static int compute_energy(struct energy_env *eenv, int candidate)
 */
static int compute_task_energy(struct energy_env *eenv, int cpu)
{
- struct sched_domain *sd, *sd_cap;
- struct sched_group *sg;
- int first_cpu;
+ struct sched_domain *sd;
+ unsigned int prev_cap_idx, next_cap_idx;
+ int cmp_idx, ret;
- sd = rcu_dereference(per_cpu(sd_ea, cpu));
+ sd = rcu_dereference(per_cpu(sd_scs, cpu));
if (!sd)
	return -1; /* Error */
- sg = sd->groups;
- do {
/* Skip SGs which do not contains a candidate CPU */
if (!cpumask_intersects(&eenv->cpus_mask, sched_group_cpus(sg)))
continue;
eenv->sg_top = sg;
first_cpu = cpumask_first(sched_group_cpus(sg));
/*
* The CPU capacity sharing attribution is decided by hardware
* design so we can decide the sg_cap value at the beginning
* for specific CPU.
*/
sd_cap = rcu_dereference(per_cpu(sd_scs, first_cpu));
if (sd_cap && sd_cap->parent)
eenv->sg_cap = sd_cap->parent->groups;
else
eenv->sg_cap = sd_cap->groups;
find_new_capacity(eenv, cpu);
- /*
* The CPU capacity sharing attribution is decided by hardware
* design so we can decide the sg_cap value at the beginning
* for specific CPU.
*/
- if (sd && sd->parent)
eenv->sg_cap = sd->parent->groups;
- else
eenv->sg_cap = sd->groups;
- /* Estimate capacity index before task placement */
- cmp_idx = NR_CPUS + cpu;
- prev_cap_idx = find_new_capacity(eenv, cmp_idx);
- next_cap_idx = find_new_capacity(eenv, cpu);
^^^^
Based on the above suggestion, this parameter would be {PREV,NEXT}_CAP.
- /*
* Computation is iteration sched_group from bottom to up level for
* energy accumulation, 'sg_top' is top most sched_group:
* - If the CPU frequency has no change before and after task placed
* onto the target CPU, set 'sg_top' to sched_group for the target
* CPU; this means only to calculate the energy for this single CPU
* and ignore other CPUs in the same clock domain.
+1
* - If found the OPP frequency is changed after task placement then
* need to calculate all CPUs who bound in the same clock domain,
* so set 'sg_top' to shared capacity scheduling group.
*/
- if (prev_cap_idx != next_cap_idx)
eenv->sg_top = eenv->sg_cap;
- else
eenv->sg_top = sd->groups;
/* energy is unscaled to reduce rounding errors */
if (compute_energy(eenv, cpu) == -EINVAL) {
eenv->next_cpu = eenv->prev_cpu;
return -EINVAL;
}
- /* energy is unscaled to reduce rounding errors */
- ret = compute_energy(eenv, cmp_idx);
^^^^^^^
Same here... that would be a {PREV,NEXT}_CAP.
- if (ret < 0)
return ret;
- } while (sg = sg->next, sg != sd->groups);
ret = compute_energy(eenv, cpu);
if (ret < 0)
return ret;
eenv->cpu[cpu].energy -= eenv->cpu[cmp_idx].energy;
eenv->cpu[NEXT_CAP][cpu].energy -= eenv->cpu[PREV_CAP][cpu].energy;
return 0; }
-- 1.9.1
-- #include <best/regards.h>
Patrick Bellasi
Hi Patrick,
On Mon, Mar 19, 2018 at 12:22:44PM +0000, Patrick Bellasi wrote:
Hi Leo,
thanks for sharing those patches which, since a long time, we know you are working on. I think they go back to our meeting last year in Mountain View...
This is my 4th attempt at upstreaming this patch set to the Android common kernel since last year's Mountain View meeting, and this patch set is all I will focus on for now.
I'm always a bit confused about the proper review channel. Usually we review Android related bits on Google's gerrit...
I want to share the patch set and get your opinions on it, so we know how to proceed with the next step; this is also helpful for Linaro's internal consolidation of EAS related work. For this purpose, I think eas-dev is a good place.

I also hope this can benefit SoC members who want to try these patches on their own production branches.

If you think this patch set is suitable for merging into AOSP, I am glad to send it via gerrit.
I'm wondering why you posted this series here while some of the patches are already uploaded on gerrit, e.g.
https://android-review.googlesource.com/c/kernel/common/+/555724
That is a single big patch with all the refactoring; as a result I found it hard to review, e.g. Daniel L. spent a lot of time reviewing it and it seems its underlying idea is not easy to understand. So I tried to split the big patch into small patches for easier reviewing.

Another reason is: after the refactoring patch "sched/fair: re-factor energy_diff to use a single (extensible) energy_env" was merged into the android-4.9-eas-dev branch, the energy calculation changed significantly, making the old patch impossible to rebase onto the latest code base. So I had to totally rework the patch set rather than do a simple rebase.
Anyway, I'll have a fast look at this series and I'm sure you'll have a better chance to have a chat about these at Connect too.
Thanks for reviewing.
Thanks, Leo Yan
On 17-Mar 20:05, Leo Yan wrote:
[...]
-- #include <best/regards.h>
Patrick Bellasi
On 17-Mar 20:05, Leo Yan wrote:
This patch is a typical optimization that trades more memory for better performance. eenv::cpu is extended with the following layout:
+----------------+ ---> eenv::cpu[0]
|    new nrg     |
+----------------+ ---> eenv::cpu[NR_CPUS]
|    nrg(CPU)    |
+----------------+ ---> eenv::cpu[NR_CPUS*2]
|  nrg(Cluster)  |
+----------------+
If the task placement triggers a CPU frequency change, then we save the energy value calculated without the task placement into cpu[NR_CPUS*2..NR_CPUS*3-1]; later, if another CPU needs this same value for its calculation, it can fetch it directly. This avoids duplicate calculations for CPUs in the same cluster.
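The slot mapping described above could also be captured with helper macros rather than open-coded arithmetic; a sketch (the macro names are made up, they are not part of the patch):

	#define EENV_IDX_NEW(cpu)	(cpu)			/* energy with the task placed */
	#define EENV_IDX_CPU(cpu)	((cpu) + NR_CPUS)	/* CPU energy without the task */
	#define EENV_IDX_CLS(cpu)	((cpu) + NR_CPUS * 2)	/* cluster energy without the task */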
Change-Id: Ic1fbdc3513a8f90148e3dfabe385f40fdbe609bd Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 33 ++++++++++++++++++++++++++++----- 1 file changed, 28 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 4811fce..b834fb3 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5473,7 +5473,9 @@ struct energy_env { /* Estimated energy variation wrt previous CPU */ int nrg_delta;
- } cpu[NR_CPUS*2];
+ /* Flag for whether this slot has been calculated already */
+ int used;
+ } cpu[NR_CPUS*3];
/* The most energy efficient CPU for the specified energy_env::p */
int next_cpu;
@@ -5725,6 +5727,9 @@ static int compute_energy(struct energy_env *eenv, int candidate)
WARN_ON(!eenv->sg_top->sge);
+ if (eenv->cpu[candidate].used)
+	return 0;
- /* If a cpu is hotplugged in while we are in this function,
- it does not appear in the existing eenv::sg_top mask
- which came from the sched_group pointer of the
@@ -5782,6 +5787,7 @@ static int compute_energy(struct energy_env *eenv, int candidate) if (cpu_count) return -EINVAL;
+ eenv->cpu[candidate].used = 1;
That's interesting... Chris should have something similar in v4.14... although not exactly the same.
Can't we recycle one of the already existing variables (e.g. nrg_delta) as a sentinel to flag when the slot is already in use?
Another alternative could be to recycle the MSBs of cpu_id to encode some flags like this one.
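A sketch of the cpu_id MSB idea (EENV_USED and the helpers are hypothetical; it assumes a slot holds a valid cpu_id >= 0 that never sets bit 30):

	#define EENV_USED	(1 << 30)

	static inline void eenv_set_used(struct energy_env *eenv, int idx)
	{
		eenv->cpu[idx].cpu_id |= EENV_USED;
	}

	static inline bool eenv_is_used(struct energy_env *eenv, int idx)
	{
		return eenv->cpu[idx].cpu_id & EENV_USED;
	}

	static inline int eenv_cpu_id(struct energy_env *eenv, int idx)
	{
		return eenv->cpu[idx].cpu_id & ~EENV_USED;
	}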
return 0; }
@@ -5810,8 +5816,7 @@ static int compute_task_energy(struct energy_env *eenv, int cpu) eenv->sg_cap = sd->groups;
/* Estimate capacity index before task placement */
- cmp_idx = NR_CPUS + cpu;
- prev_cap_idx = find_new_capacity(eenv, cmp_idx);
+ prev_cap_idx = find_new_capacity(eenv, cpu + NR_CPUS);
next_cap_idx = find_new_capacity(eenv, cpu);
/*
@@ -5825,10 +5830,25 @@ static int compute_task_energy(struct energy_env *eenv, int cpu) * need to calculate all CPUs who bound in the same clock domain, * so set 'sg_top' to shared capacity scheduling group. */
- if (prev_cap_idx != next_cap_idx)
- if (prev_cap_idx != next_cap_idx) { eenv->sg_top = eenv->sg_cap;
- else
/*
* eenv::cpu[(NR_CPUS*2)..(NR_CPUS*3-1)] is used for the whole
* cluster level energy calculation; this value can also be
* used by other CPUs in the same cluster to calculate the
* difference.
*/
cmp_idx = cpumask_first(sched_group_cpus(eenv->sg_top));
cmp_idx += NR_CPUS * 2;
find_new_capacity(eenv, cmp_idx);
- } else { eenv->sg_top = sd->groups;
/*
* eenv::cpu[(NR_CPUS)..(NR_CPUS*2-1)] is used for the single
* CPU energy calculation.
*/
cmp_idx = cpu + NR_CPUS;
- }
I like memoization techniques... but we should really try to get it working by defining a proper struct layout/nesting instead of playing math on indexes like here above... it's so error prone :/
/* energy is unscaled to reduce rounding errors */ ret = compute_energy(eenv, cmp_idx); @@ -7146,6 +7166,9 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync if (backup_cpu >= 0) cpumask_set_cpu(backup_cpu, &eenv->cpus_mask);
/* Clear energy calculation data */
memset(eenv->cpu, 0, sizeof(eenv->cpu));
select_energy_cpu_idx(eenv);
/* Check if eenv::next_cpu is a more energy efficient CPU */
-- 1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On 17-Mar 20:05, Leo Yan wrote:
When calculating the energy difference before and after task placement on a specific CPU, the idle state selection is sometimes based on the idle state index and sometimes estimated from the utilization. So different idle state selections can even happen on the same CPU, and the idle state estimation code is totally different from the current CPU idle governor.
Moreover, the estimated idle state is always a bit "random" and it seems to not give big advantages in terms of energy aware decisions...
To avoid introducing a big deviation in the energy calculation, let's remove the idle state estimation.
.. that's why the simplified energy model Dietmar is going to post on LKML does not consider idle (and cluster) energy.

If you reached the same conclusions, we should really review/test/push for that simplified model because it will make our life so much easier... especially considering all the complexity we have here, also after applying all these patches you propose :)
Change-Id: I4de0df9ce8f4e425206bd14de7c8069b488c4483 Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 49 ------------------------------------------------- 1 file changed, 49 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index b834fb3..4e43c9a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5609,8 +5609,6 @@ static int group_idle_state(struct energy_env *eenv, int cpu_id) { struct sched_group *sg = eenv->sg; int i, state = INT_MAX;
int src_in_grp, dst_in_grp;
long grp_util = 0;
/* Find the shallowest idle state in the sched group. */ for_each_cpu(i, sched_group_cpus(sg))
@@ -5619,53 +5617,6 @@ static int group_idle_state(struct energy_env *eenv, int cpu_id) /* Take non-cpuidle idling into account (active idle/arch_cpu_idle()) */ state++;
- src_in_grp = cpumask_test_cpu(eenv->prev_cpu, sched_group_cpus(sg));
- dst_in_grp = cpumask_test_cpu(cpu_id, sched_group_cpus(sg));
- if (src_in_grp == dst_in_grp) {
/* both CPUs under consideration are in the same group or not in
* either group, migration should leave idle state the same.
*/
goto end;
- }
- /*
* Try to estimate if a deeper idle state is
* achievable when we move the task.
*/
- for_each_cpu(i, sched_group_cpus(sg)) {
grp_util += cpu_util_wake(i, eenv->p);
if (unlikely(i == cpu_id))
grp_util += eenv->util_delta;
- }
- if (grp_util <=
((long)sg->sgc->max_capacity * (int)sg->group_weight)) {
/* after moving, this group is at most partly
* occupied, so it should have some idle time.
*/
int max_idle_state_idx = sg->sge->nr_idle_states - 2;
int new_state = grp_util * max_idle_state_idx;
if (grp_util <= 0)
/* group will have no util, use lowest state */
new_state = max_idle_state_idx + 1;
else {
/* for partially idle, linearly map util to idle
* states, excluding the lowest one. This does not
* correspond to the state we expect to enter in
* reality, but an indication of what might happen.
*/
new_state = min(max_idle_state_idx, (int)
(new_state / sg->sgc->max_capacity));
new_state = max_idle_state_idx - new_state;
}
state = new_state;
- } else {
/* After moving, the group will be fully occupied
* so assume it will not be idle at all.
*/
state = 0;
- }
-end: return state; }
-- 1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On Mon, Mar 19, 2018 at 12:24:09PM +0000, Patrick Bellasi wrote:
Hi Leo,
On 17-Mar 20:05, Leo Yan wrote:
[...]
- cpumask_clear(&eenv->cpus_mask);
- for (cpu_idx = EAS_CPU_PRV; cpu_idx < EAS_CPU_CNT; ++cpu_idx) {
int cpu = eenv->cpu[cpu_idx].cpu_id;
The intent of doing it here, and with a loop, is because of the EAS_CPU_CNT definition, which should make it clear that we can potentially think about having more than the 3 candidates we have now.
That's also why I don't agree that moving this block outside makes the code clearer...
I think we could connect this patch with patch 0006 'sched/fair: Use cpumask to track candidates for energy calculation', so that all cpumasks are initialized in select_energy_cpu_brute().
We can take this patch as refactoring for 'preparing the energy_env context', so we have two distinct steps: the first step prepares the eenv context (in select_energy_cpu_brute()) and the second step does the energy calculation & comparison (in select_energy_cpu_idx()).
Thanks, Leo Yan
if (cpu < 0)
continue;
cpumask_set_cpu(cpu, &eenv->cpus_mask);
- }
- sg = sd->groups; do { /* Skip SGs which do not contains a candidate CPU */
@@ -7133,7 +7124,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync }, };
#ifdef CONFIG_SCHED_WALT if (!walt_disabled && sysctl_sched_use_walt_cpu_util && p->state == TASK_WAKING) @@ -7147,6 +7137,14 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync goto unlock; }
cpumask_clear(&eenv.cpus_mask);
if (prev_cpu >= 0)
cpumask_set_cpu(prev_cpu, &eenv.cpus_mask);
if (next_cpu >= 0)
cpumask_set_cpu(next_cpu, &eenv.cpus_mask);
if (backup_cpu >= 0)
cpumask_set_cpu(backup_cpu, &eenv.cpus_mask);
- /* Check if EAS_CPU_NXT is a more energy efficient CPU */ if (select_energy_cpu_idx(&eenv) != EAS_CPU_PRV) { schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav);
-- 1.9.1
-- #include <best/regards.h>
Patrick Bellasi
On Mon, Mar 19, 2018 at 12:33:01PM +0000, Patrick Bellasi wrote:
On 19-Mar 07:07, Daniel Lezcano wrote:
On Sat, Mar 17, 2018 at 08:05:15PM +0800, Leo Yan wrote:
select_energy_cpu_idx() returns the target CPU index, and eenv->next_idx is used to hold the target CPU index as well; it's pointless to use two methods to return exactly the same value.

This patch changes the definition of the return value of select_energy_cpu_idx(): errors return a negative value and success returns 0. The caller then uses eenv->next_idx for the target CPU index.
Yes there is a duplication of the information. Actually, the change could be the contrary: make select_energy_cpu_idx return the cpu_idx and kill the next_idx field in the structure.
next_idx field is used only in the calling function, it could be stored in a local variable.
Unfortunately it's not merged (yet) in eas-dev, but there is an energy_diff tracepoint which gets eenv only data.
That's the reason why we store next_idx in the eenv, moreover it makes the code more self-contained from a functional standpoint since everywhere you know that the eenv contains all the information which defined a certain energy aware wakeup decision.
Using next_idx as a return value is also another design decision, since it makes the code more streamlined to read.
Thus, overall I don't see a big point on this patch if you consider the above two "design decisions".
I personally prefer to remove redundant variables from data structures where we can; this ultimately helps us keep the code as simple as possible. So I am biased toward Daniel's suggestion.

And after a quick think: removing the 'next_idx' field should not impact ftrace event recording, right?
Thanks, Leo Yan
Change-Id: I57bc9c96566eaf6085c85ea9386d11422cafddbf Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 42 +++++++++++++++++++++++------------------- 1 file changed, 23 insertions(+), 19 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 37231b8..0f0b307 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5850,16 +5850,15 @@ static inline bool cpu_in_sg(struct sched_group *sg, int cpu)
- select_energy_cpu_idx(): estimate the energy impact of changing the
- utilization distribution.
- The eenv parameter specifies the changes: utilisation amount and a pair of
- possible CPU candidates (the previous CPU and a different target CPU).
- This function returns the index of a CPU candidate specified by the
- eenv::next_idx returns the index of a CPU candidate specified by the
- energy_env which corresponds to the first CPU saving energy.
- Thus, 0 (EAS_CPU_PRV) means that non of the CPU candidate is more energy
- efficient than running on prev_cpu. This is also the value returned in case
- of abort due to error conditions during the computations.
- A value greater than zero means that the first energy-efficient CPU is the
- one represented by eenv->cpu[eenv->next_idx].cpu_id.
- efficient than running on prev_cpu. A value greater than zero means that
- the first energy-efficient CPU is the one represented by
- eenv->cpu[eenv->next_idx].cpu_id.
- The function returns negative value in case of abort due to error
*/
- conditions during the computations, success returns 0.
static inline int select_energy_cpu_idx(struct energy_env *eenv) { @@ -5871,8 +5870,10 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv)
sd_cpu = eenv->cpu[EAS_CPU_PRV].cpu_id; sd = rcu_dereference(per_cpu(sd_ea, sd_cpu));
- if (!sd)
return EAS_CPU_PRV;
if (!sd) {
eenv->next_idx = EAS_CPU_PRV;
return -1;
}
sg = sd->groups; do {
@@ -5882,8 +5883,10 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv)
eenv->sg_top = sg; /* energy is unscaled to reduce rounding errors */
if (compute_energy(eenv) == -EINVAL)
return EAS_CPU_PRV;
if (compute_energy(eenv) == -EINVAL) {
eenv->next_idx = EAS_CPU_PRV;
return -EINVAL;
}
} while (sg = sg->next, sg != sd->groups);
@@ -5929,7 +5932,7 @@ static inline int select_energy_cpu_idx(struct energy_env *eenv) } }
- return eenv->next_idx;
- return 0;
}
/* @@ -7145,17 +7148,18 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync if (backup_cpu >= 0) cpumask_set_cpu(backup_cpu, &eenv.cpus_mask);
select_energy_cpu_idx(&eenv);
- /* Check if EAS_CPU_NXT is a more energy efficient CPU */
if (select_energy_cpu_idx(&eenv) != EAS_CPU_PRV) {
if (eenv.next_idx != EAS_CPU_PRV) { schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav); schedstat_inc(this_rq()->eas_stats.secb_nrg_sav);
target_cpu = eenv.cpu[eenv.next_idx].cpu_id;
goto unlock;
} else {
schedstat_inc(p->se.statistics.nr_wakeups_secb_no_nrg_sav);
}schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav);
schedstat_inc(p->se.statistics.nr_wakeups_secb_no_nrg_sav);
schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav);
target_cpu = prev_cpu;
goto unlock; }target_cpu = eenv.cpu[eenv.next_idx].cpu_id;
-- 1.9.1
--
http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs
-- #include <best/regards.h>
Patrick Bellasi
On 20-Mar 00:15, Leo Yan wrote:
On Mon, Mar 19, 2018 at 12:24:09PM +0000, Patrick Bellasi wrote:
Hi Leo,
On 17-Mar 20:05, Leo Yan wrote:
[...]
- cpumask_clear(&eenv->cpus_mask);
- for (cpu_idx = EAS_CPU_PRV; cpu_idx < EAS_CPU_CNT; ++cpu_idx) {
int cpu = eenv->cpu[cpu_idx].cpu_id;
The intent of doing it here, and with a loop, is because of the EAS_CPU_CNT definition, which should make it clear that we can potentially think about having more than the 3 candidates we have now.
That's also why I don't agree that moving this block outside makes the code clearer...
I think we could connect this patch with patch 0006 'sched/fair: Use cpumask to track candidates for energy calculation', so that all cpumasks are initialized in select_energy_cpu_brute().
We can take this patch as refactoring for 'preparing the energy_env context', so we have two distinct steps: the first step prepares the eenv context (in select_energy_cpu_brute()) and the second step does the energy calculation & comparison (in select_energy_cpu_idx()).
Right, I appreciate the fine grained splitting into patches... but I agree that some could be merged to make it clearer what a refactoring is eventually needed for.
Thanks, Leo Yan
[...]
-- #include <best/regards.h>
Patrick Bellasi
On 20-Mar 00:37, Leo Yan wrote:
On Mon, Mar 19, 2018 at 12:33:01PM +0000, Patrick Bellasi wrote:
On 19-Mar 07:07, Daniel Lezcano wrote:
On Sat, Mar 17, 2018 at 08:05:15PM +0800, Leo Yan wrote:
select_energy_cpu_idx() returns the target CPU index, and eenv->next_idx is used to hold the target CPU index as well; it's pointless to use two methods to return exactly the same value.

This patch changes the definition of the return value of select_energy_cpu_idx(): errors return a negative value and success returns 0. The caller then uses eenv->next_idx for the target CPU index.
Yes there is a duplication of the information. Actually, the change could be the contrary: make select_energy_cpu_idx return the cpu_idx and kill the next_idx field in the structure.
next_idx field is used only in the calling function, it could be stored in a local variable.
Unfortunately it's not merged (yet) in eas-dev, but there is an energy_diff tracepoint which gets eenv only data.
That's the reason why we store next_idx in the eenv, moreover it makes the code more self-contained from a functional standpoint since everywhere you know that the eenv contains all the information which defined a certain energy aware wakeup decision.
Using next_idx as a return value is also another design decision, since it makes the code more streamlined to read.
Thus, overall I don't see a big point on this patch if you consider the above two "design decisions".
I personally prefer to remove redundant variables from data structures where we can; this ultimately helps us keep the code as simple as possible. So I am biased toward Daniel's suggestion.

And after a quick think: removing the 'next_idx' field should not impact ftrace event recording, right?
Depends on what you trace with the tracepoint. The idea was to have all the energy diff related info tracked by the eenv struct, so that we can know exactly how we got to a certain result.
Here, next_idx actually represents which CPU has been computed to be the most energy efficient candidate, which can be valuable info to track in a tracepoint.
Thanks, Leo Yan
[...]
-- #include <best/regards.h>
Patrick Bellasi
Hi Patrick,
On Mon, Mar 19, 2018 at 12:33:01PM +0000, Patrick Bellasi wrote:
On 19-Mar 07:07, Daniel Lezcano wrote:
On Sat, Mar 17, 2018 at 08:05:15PM +0800, Leo Yan wrote:
select_energy_cpu_idx() returns the target CPU index, and eenv->next_idx is used to hold the target CPU index as well; it's pointless to use two methods to return exactly the same value.

This patch changes the definition of the return value of select_energy_cpu_idx(): errors return a negative value and success returns 0. The caller then uses eenv->next_idx for the target CPU index.
Yes there is a duplication of the information. Actually, the change could be the contrary: make select_energy_cpu_idx return the cpu_idx and kill the next_idx field in the structure.
next_idx field is used only in the calling function, it could be stored in a local variable.
Unfortunately it's not merged (yet) in eas-dev, but there is an energy_diff tracepoint which gets eenv only data.
There is no real use in keeping this transient value as a field in a structure.
That's the reason why we store next_idx in the eenv, moreover it makes the code more self-contained from a functional standpoint since everywhere you know that the eenv contains all the information which defined a certain energy aware wakeup decision.
I disagree with the "self-contained approach is a good thing": that keeps the entire code out of the upstreaming path.
Moreover, the stack size in the kernel is limited to 16KB. This bloated structure will one day kill the stack.
Do you really think storing a value in a structure passed as a parameter, returning it from the same function, and then using it in one place really makes sense?
Using next_idx as a return value is also another design decision, since it makes the code more streamlined to read.
Really ?
Thus, overall I don't see a big point on this patch if you consider the above two "design decisions".
So the eenv is passed to the function, the function fills eenv->next_idx and returns next_idx, and eenv->next_idx is used by the caller in this one place only. That is not a design decision, it is an implementation detail, and it makes sense to do this cleanup.
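Concretely, the alternative would look something like this (a sketch, not an actual patch):

	/* returns the candidate index, EAS_CPU_PRV on error */
	int next_idx = select_energy_cpu_idx(&eenv);

	target_cpu = eenv.cpu[next_idx].cpu_id;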
On Mon, Mar 19, 2018 at 12:43:50PM +0000, Patrick Bellasi wrote:
On 17-Mar 20:05, Leo Yan wrote:
Function select_energy_cpu_brute() can handle the 'next_cpu == prev_cpu' case first, which avoids a large amount of code indentation under the 'next_cpu != prev_cpu' case.
The existing code is still within the 80 cols margin and also not so bad to read.
Thus I'm not in general a big fan of this kind of function-0-change patches... and I would really like to avoid them at least when they are not part of a functional refactoring.
I'm not very used to the EAS code, still ramping up. I don't see anything shocking in using the same indentation as the other blocks. When trying to read the code, it is clearer IMO.
Moreover...
Change-Id: Iac08b5f1e6072805c0bee22680bd0ecab411cdba Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 91 +++++++++++++++++++++++++---------------------------- 1 file changed, 42 insertions(+), 49 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0f0b307..b67999a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7060,6 +7060,8 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync int target_cpu; int backup_cpu; int next_cpu;
int delta = 0;
struct energy_env eenv;
schedstat_inc(p->se.statistics.nr_wakeups_secb_attempts); schedstat_inc(this_rq()->eas_stats.secb_attempts);
@@ -7108,63 +7110,54 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync }
target_cpu = prev_cpu;
- if (next_cpu != prev_cpu) {
int delta = 0;
struct energy_env eenv = {
.p = p,
.util_delta = task_util(p),
/* Task's previous CPU candidate */
.cpu[EAS_CPU_PRV] = {
.cpu_id = prev_cpu,
},
/* Main alternative CPU candidate */
.cpu[EAS_CPU_NXT] = {
.cpu_id = next_cpu,
},
/* Backup alternative CPU candidate */
.cpu[EAS_CPU_BKP] = {
.cpu_id = backup_cpu,
},
};
The block above exploits designated initializers to ensure zero initialization of all omitted members... for example eenv.energy, which is used by the following code to sum up the estimated energy of a candidate.
struct energy_env eenv = {};
- if (next_cpu == prev_cpu) {
schedstat_inc(p->se.statistics.nr_wakeups_secb_count);
schedstat_inc(this_rq()->eas_stats.secb_count);
goto unlock;
- }
- eenv.p = p;
- eenv.util_delta = task_util(p);
- /* Task's previous CPU candidate */
- eenv.cpu[EAS_CPU_PRV].cpu_id = prev_cpu;
- /* Main alternative CPU candidate */
- eenv.cpu[EAS_CPU_NXT].cpu_id = next_cpu;
- /* Backup alternative CPU candidate */
- eenv.cpu[EAS_CPU_BKP].cpu_id = backup_cpu;
... AFAICS here we don't ensure eenv.cpu[XXX].energy = 0

Thus don't we get whatever is on the stack as the initial energy value?
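To illustrate the point with a minimal userspace-style sketch (not the actual fair.c code):

	struct energy_env a = { .p = p };	/* omitted members are zeroed,
						 * including a.cpu[i].energy */
	struct energy_env b;			/* b.cpu[i].energy is whatever
						 * happened to be on the stack */
	memset(&b, 0, sizeof(b));		/* explicit clearing is needed
						 * before accumulating energy */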
On Sat, Mar 17, 2018 at 08:05:18PM +0800, Leo Yan wrote:
This commit changes to using per cpu data to maintain the energy environment structure, so it can be moved out of the kernel stack, avoiding stack overflow issues if we want to add more items to the energy environment structure for later optimizations.
Change-Id: Ifdcb1bbae36cb8e074b31ec35f00c79f6a14b973 Signed-off-by: Leo Yan leo.yan@linaro.org
kernel/sched/fair.c | 62 +++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 46 insertions(+), 16 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 6c0b918..9997a6a 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6690,7 +6690,36 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return target; }
+static struct kmem_cache *energy_env_cache __read_mostly; +static DEFINE_PER_CPU(struct energy_env *, energy_env);
+static void init_energy_env(void) +{
- struct energy_env *eenv;
- int i;
- energy_env_cache = KMEM_CACHE(energy_env, 0);
- for_each_possible_cpu(i) {
eenv = kmem_cache_alloc(energy_env_cache, GFP_KERNEL | __GFP_ZERO);
if (!eenv)
goto err;
rcu_assign_pointer(per_cpu(energy_env, i), eenv);
- }
- return;
+err:
- for_each_possible_cpu(i) {
eenv = rcu_dereference(per_cpu(energy_env, i));
if (!eenv)
continue;
kmem_cache_free(energy_env_cache, eenv);
- }
+}
This portion of code is not right.

First you create a pool of objects with kmem_cache_create() and then you use this pool of objects to allocate cached objects.
It should be like:
At init time:

	struct kmem_cache *mypool = kmem_cache_create("my_pool",
				sizeof(struct myobject), 0, SLAB_PANIC, NULL);

During runtime, replace kzalloc by:

	myobject = kmem_cache_zalloc(mypool, GFP_KERNEL);

and kfree by:

	kmem_cache_free(mypool, myobject);

And at exit time:

	kmem_cache_destroy(mypool);
Usually the pool of objects stays there for the entire life-cycle of the subsystem.

The pool of objects is visible in /proc/slabinfo.

However, this is only useful when you are allocating and freeing objects for each service (eg. a new process at fork time, a new timer, etc ...)
For this change, a static per cpu definition (not a pointer) is ok, then access it with per_cpu_ptr:
static DEFINE_PER_CPU(struct energy_env, energy_env);
[ ... ]
struct energy_env *eenv = per_cpu_ptr(&energy_env, cpu);
/*
- cpu_util_wake: Compute cpu utilization with any contributions from
- the waking task p removed. check_for_migration() looks for a better CPU of
@@ -7061,14 +7090,13 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync int backup_cpu; int next_cpu; int delta = 0;
- struct energy_env eenv;
int cpu = smp_processor_id();
struct energy_env *eenv;
schedstat_inc(p->se.statistics.nr_wakeups_secb_attempts); schedstat_inc(this_rq()->eas_stats.secb_attempts);
if (sysctl_sched_sync_hint_enable && sync) {
int cpu = smp_processor_id();
- if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) { schedstat_inc(p->se.statistics.nr_wakeups_secb_sync); schedstat_inc(this_rq()->eas_stats.secb_sync);
@@ -7116,14 +7144,15 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync goto unlock; }
- eenv.p = p;
- eenv.util_delta = task_util(p);
- eenv = rcu_dereference(per_cpu(energy_env, cpu));
- eenv->p = p;
- eenv->util_delta = task_util(p); /* Task's previous CPU candidate */
- eenv.cpu[EAS_CPU_PRV].cpu_id = prev_cpu;
- eenv->cpu[EAS_CPU_PRV].cpu_id = prev_cpu; /* Main alternative CPU candidate */
- eenv.cpu[EAS_CPU_NXT].cpu_id = next_cpu;
- eenv->cpu[EAS_CPU_NXT].cpu_id = next_cpu; /* Backup alternative CPU candidate */
- eenv.cpu[EAS_CPU_BKP].cpu_id = backup_cpu;
- eenv->cpu[EAS_CPU_BKP].cpu_id = backup_cpu;
#ifdef CONFIG_SCHED_WALT if (!walt_disabled && sysctl_sched_use_walt_cpu_util && @@ -7138,18 +7167,18 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync goto unlock; }
- cpumask_clear(&eenv.cpus_mask);
- cpumask_clear(&eenv->cpus_mask); if (prev_cpu >= 0)
cpumask_set_cpu(prev_cpu, &eenv.cpus_mask);
if (next_cpu >= 0)cpumask_set_cpu(prev_cpu, &eenv->cpus_mask);
cpumask_set_cpu(next_cpu, &eenv.cpus_mask);
if (backup_cpu >= 0)cpumask_set_cpu(next_cpu, &eenv->cpus_mask);
cpumask_set_cpu(backup_cpu, &eenv.cpus_mask);
cpumask_set_cpu(backup_cpu, &eenv->cpus_mask);
- select_energy_cpu_idx(&eenv);
select_energy_cpu_idx(eenv);
/* Check if EAS_CPU_NXT is a more energy efficient CPU */
- if (eenv.next_idx != EAS_CPU_PRV) {
- if (eenv->next_idx != EAS_CPU_PRV) { schedstat_inc(p->se.statistics.nr_wakeups_secb_nrg_sav); schedstat_inc(this_rq()->eas_stats.secb_nrg_sav); } else {
@@ -7157,7 +7186,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync schedstat_inc(this_rq()->eas_stats.secb_no_nrg_sav); }
- target_cpu = eenv.cpu[eenv.next_idx].cpu_id;
- target_cpu = eenv->cpu[eenv->next_idx].cpu_id;
unlock: rcu_read_unlock(); @@ -10975,6 +11004,7 @@ __init void init_sched_fair_class(void) nohz.next_balance = jiffies; zalloc_cpumask_var(&nohz.idle_cpus_mask, GFP_NOWAIT); #endif
- init_energy_env();
#endif /* SMP */
}
1.9.1
--
http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs
On 19-Mar 17:57, Daniel Lezcano wrote:
Hi Patrick,
On Mon, Mar 19, 2018 at 12:33:01PM +0000, Patrick Bellasi wrote:
On 19-Mar 07:07, Daniel Lezcano wrote:
On Sat, Mar 17, 2018 at 08:05:15PM +0800, Leo Yan wrote:
select_energy_cpu_idx() returns the target CPU index, and eenv->next_idx is used to define target CPU index as well; it's pointless to use two methods to return exactly same value.
This patch is to change the definition for return value of select_energy_cpu_idx(), errors return negative value and success returns 0. And later use eenv->next_idx for target CPU index.
Yes there is a duplication of the information. Actually, the change could be the contrary: make select_energy_cpu_idx return the cpu_idx and kill the next_idx field in the structure.
next_idx field is used only in the calling function, it could be stored in a local variable.
Unfortunately it's not merged (yet) in eas-dev, but there is an energy_diff tracepoint which gets eenv only data.
There is no real usage of this transient value for a field in a structure.
That's the reason why we store next_idx in the eenv, moreover it makes the code more self-contained from a functional standpoint since everywhere you know that the eenv contains all the information which defined a certain energy aware wakeup decision.
I disagree with the "self-contained approach is a good thing": that keeps the entire code out of the upstreaming path.
The mainlining path is already quite different from this code. Hopefully you will like it better once it is posted.
Moreover, the stack size in the kernel is limited to 16KB. This bloated structure will one day kill the stack.
And that's why I'm totally ok with one of the following patches where Leo moves this struct into a per-cpu variable...
Do you really think storing a value in a structure passed in parameter and returning it from the same function and then using it in one place does really make sense ?
It could, in case you need to read the same information somewhere else... which, you're right, is not the case in the current code.
Using next_idx as a return value is also another design decision, since it makes the core more streamline to read.
Really ?
Yes, considering other patches which are not currently merged, exactly like this one. However, I agree that this cached value should/could be added once the other usages are added.
It's likely like that now because when we forward ported patches from v3.18 we skipped the patches with the usages, with the idea to reduce the overall code footprint. Nevertheless, when you are required to do these forward/backward porting, sometimes it makes things simpler to keep patches almost similar.
Thus, overall I don't see a big point on this patch if you consider the above two "design decisions".
So the eenv is passed to the function, this one fills eenv->next_idx and returns next_idx and the eenv->next_idx is used by the caller only on this place only. That is not a design decision, it is an implementation and it makes sense to do this cleanup.
At the current status of the code I would say yes; however, again, I'm not against refactorings/cleanups... there are many to do... we should just avoid creating many different code flavours.
I can also agree that this patch makes sense... but I would rather see this patch aligned with the most updated cleanup we already have.
-- #include <best/regards.h>
Patrick Bellasi
On Mon, Mar 19, 2018 at 06:28:08PM +0100, Daniel Lezcano wrote:
On Sat, Mar 17, 2018 at 08:05:18PM +0800, Leo Yan wrote:
This commit changes to using per cpu data to maintain the energy environment structure, so it can be moved out of the kernel stack, avoiding stack overflow issues if we want to add more items to the energy environment structure for later optimizations.
[...]
This portion of code is not right.

First you create a pool of objects with kmem_cache_create() and then you use this pool of objects to allocate cached objects.
It should be like:
At init time:

	struct kmem_cache *mypool = kmem_cache_create("my_pool",
				sizeof(struct myobject), 0, SLAB_PANIC, NULL);

During runtime, replace kzalloc by:

	myobject = kmem_cache_zalloc(mypool, GFP_KERNEL);

and kfree by:

	kmem_cache_free(mypool, myobject);

And at exit time:

	kmem_cache_destroy(mypool);
Usually the pool of objects stays there for the entire life-cycle of the subsystem.

The pool of objects is visible in /proc/slabinfo.

However, this is only useful when you are allocating and freeing objects for each service (eg. a new process at fork time, a new timer, etc ...)
For this change, a static per cpu definition (not a pointer) is ok, then access it with per_cpu_ptr:
static DEFINE_PER_CPU(struct energy_env, energy_env);
[ ... ]
struct energy_env *eenv = per_cpu_ptr(&energy_env, cpu);
Will fix this.

At the beginning I tried to avoid bloating the kernel image at link time. After your reminder: the static data is placed into the .bss section, so it's fine to use a static definition.
Thanks, Leo Yan
[...]
--
http://www.linaro.org/ Linaro.org │ Open source software for ARM SoCs
On Mon, Mar 19, 2018 at 03:52:48PM +0000, Patrick Bellasi wrote:
On 17-Mar 20:05, Leo Yan wrote:
Let's first look at an example where a task with small utilization is woken up and we need to calculate the energy for two candidate CPUs. From the hardware design perspective, a candidate CPU cannot decide its OPP by itself because it shares a clock domain with the other CPUs in its cluster, so in the end we need to compute the energy for all CPUs in the cluster.
Let's use below CPU topology as the example:
Cluster_0          Cluster_1
  CPU_0              CPU_4
  CPU_1              CPU_5
  CPU_2              CPU_6
  CPU_3              CPU_7
The current code always calculates the energy for all CPUs in the bound clock domain. If the candidates are CPU_0 and CPU_4, the formula for the energy computation is as below:

E(CPU_x) stands for the energy of CPU_x, the notation E(CPU_x)` means the energy of CPU_x after the task is placed on it, TE(CPU_x) means the total energy across all CPUs when the task is placed on CPU_x, and E_DIFF(CPU_x - CPU_y) means the energy difference between placing the task on CPU_x and on CPU_y, i.e. TE(CPU_x) - TE(CPU_y).
TE(CPU_0) = E(CPU_0)` + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)` + E(CPU_4) + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)
TE(CPU_4) = E(CPU_0) + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0) + E(CPU_4)` + E(CPU_5) + E(CPU_6) + E(CPU_7) + E(CLS_1)`
E_DIFF(CPU_0 - CPU_4) = TE(CPU_0) - TE(CPU_4)
From the formulas above we can easily see that the CPU_1/2/3/5/6/7 energy computations are redundant,
This means that, for example, the contribution E(CPU_2) is the same in TE(CPU_0) and TE(CPU_4), right?
Yes.
If that's what you mean, I'm not entirely convinced. This is true _IFF_ the OPP for Cluster_0 does not change while having the task in CPU_0 or having the task in the other cluster.
If the OPP for Cluster_0 should change, then E(CPU_2) can have two very different values in TE(CPU_0) and TE(CPU_4).
You are totally right. If cluster_0 OPP is to be changed, we must compute E(CPU_2) for this case. I should clarify this prerequisite in the commit log.
on the other hand, if we only consider the energy consumption of the woken task, the energy difference is between the target CPU's energy before and after the task is placed on it.
Sorry, I don't completely get this last... ... and it seems it's they key to understand to next paragraph :/
Let me rephrase it: let's assume the task is placed onto one candidate CPU; if this candidate CPU has the same OPP with and without the woken task on it, then we can simply calculate the energy introduced by the task:
E_DIFF = E(CPU_target)` - E(CPU_target)
One benefit of this method is that it avoids computing the energy of all CPUs in the same cluster and can focus only on the energy change introduced by the woken task on the target CPU,
Do you mean that we should ignore the blocked utilization?
No; if the candidate CPU has no OPP change after the woken task placement, we can skip computing the energy of all the other, irrelevant CPUs in the same cluster.
this method is called 'task oriented computation' in this patch; the energy computation can be optimized as:
TE(CPU_0) = E(CPU_0)` + E(CLS_0)` - E(CPU_0) - E(CLS_0)
TE(CPU_4) = E(CPU_4)` + E(CLS_1)` - E(CPU_4) - E(CLS_1)
Again, IMHO in the above formula, if the OPP could change we cannot disregard energy variations from other cpus in Cluster_1.
Yes. I will refine the commit log.
E_DIFF(CPU_0 - CPU_4) = TE(CPU_0) - TE(CPU_4)
As a result, the number of computation iterations can be reduced from 20 to 8, which significantly reduces the calculation overhead.
After adopting task oriented computation, there is one case where the computation might take longer than with the previous method. For instance, when the candidates are CPU_0 and CPU_1, and after placing the task on either CPU the OPP will be increased,
... but this can happen also in the previous case.
Let's say that CPU_4 has a utilization which is ~80% less than the capacity of its current OPP and that the task's PREV_CPU is CPU_0.
Let's also assume that the utilization of the task is big enough to require a capacity increase on CPU_4. Then, the energy of all the other CPUs in Cluster_1 is different between the before and after cases.
Agree. I will list all possible cases in commit log in next version.
in this case, the old method uses the computation below:
TE(CPU_0) = E(CPU_0)` + E(CPU_1) + E(CPU_2) + E(CPU_3) + E(CLS_0)
TE(CPU_1) = E(CPU_0) + E(CPU_1)` + E(CPU_2) + E(CPU_3) + E(CLS_0)
E_DIFF(CPU_0 - CPU_1) = TE(CPU_0) - TE(CPU_1)
Using task oriented computation, because the OPP increase impacts the other CPUs in the same cluster, the energy of all related CPUs needs to be calculated:
TE(CPU_0) = E(CPU_0)` + E(CPU_1)` + E(CPU_2)` + E(CPU_3)` + E(CLS_0)` - E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)

TE(CPU_1) = E(CPU_0)` + E(CPU_1)` + E(CPU_2)` + E(CPU_3)` + E(CLS_0)` - E(CPU_0) - E(CPU_1) - E(CPU_2) - E(CPU_3) - E(CLS_0)
E_DIFF(CPU_0 - CPU_1) = TE(CPU_0) - TE(CPU_1)
We could use a more complex method for optimization, e.g. extend the eenv structure to cache the CPU energy data so it can be reused for the two candidates. This is left for a later optimization.
As a side effect, this patch also resolves an energy calculation consistency issue: in some cases the energy calculation covers one cluster and in other cases it covers multiple clusters, so the energy data semantics are not consistent across scenarios.
That's because we care about relative differences in energy consumption...
The computation inconsistency issue might introduce trouble for the PE filter.
... we should look into those issues. We know about some of them and already have patches, which were part of the 3.18 kernel used by Pixel devices but are not yet merged in 4.4, since we noticed that these cases, although broken, also do not happen on Android because of the schedtune configurations used by Pixel devices.
To describe the issue I observed more clearly, please see the code below related to the dead-zone margin [1]:
	/*
	 * Compute the dead-zone margin used to prevent too many task
	 * migrations with negligible energy savings.
	 * An energy saving is considered meaningful if it reduces the energy
	 * consumption of EAS_CPU_PRV CPU candidate by at least ~1.56%
	 */
	margin = eenv->cpu[EAS_CPU_PRV].energy >> 6;
So the dead-zone margin is eenv->cpu[EAS_CPU_PRV].energy / 64; but the value of eenv->cpu[EAS_CPU_PRV].energy is not consistent: it might be one cluster's energy (when all candidates are in the same cluster) or two clusters' energy (when the candidates are in two clusters).
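[Editor's note: to make the effect concrete with made-up numbers: if the baseline covers one cluster and eenv->cpu[EAS_CPU_PRV].energy is, say, 2048, the margin is 2048 >> 6 = 32; if the same decision happens to be computed over two clusters, the baseline roughly doubles and the margin becomes ~64. The dead-zone threshold therefore silently depends on where the candidates sit, even though it is 1/64 ~= 1.56% of the baseline in both cases.]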
[1] https://android.googlesource.com/kernel/common/+/android-4.9-eas-dev/kernel/...
This patch fixes the issue by always calculating task based energy.
To achieve the optimization, this patch utilizes the 'eenv->sg_cap' and 'eenv->sg_top' parameters. 'eenv->sg_cap' only describes the CPU capacity sharing attribute, so it eventually describes the clock domain shared among CPUs; from this parameter we can learn the final OPP selection. The 'eenv->sg_top' parameter defines which CPUs we care about: if the frequency does not change after placing the woken task, it is set to the first level scheduling group (i.e. the single CPU) so the energy computation is limited to this single CPU; otherwise it falls back to computing all CPUs
Ok, so that means that we always check for OPP changes... then it should work...
in the same clock domain.
Change-Id: Ifb64ad77c6173388abf13e255e2ed8e8586a38bc
Signed-off-by: Leo Yan leo.yan@linaro.org
 kernel/sched/fair.c | 76 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 43 insertions(+), 33 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49eee75..4811fce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5473,7 +5473,7 @@ struct energy_env {
 		/* Estimated energy variation wrt previous CPU */
 		int nrg_delta;
-	} cpu[NR_CPUS];
+	} cpu[NR_CPUS*2];
^^^^^^^^^^^
Why can't we have a before/after set of variables directly in the definition of eenv?
Given the usage below, it should also turn out to be slightly more cache friendly, i.e.
struct energy_env {
#define PREV_CAP 0
#define NEXT_CAP 1
	struct {
		// ...
	} cpu[2][NR_CPUS];
}
Yeah, thanks for the good suggestion. Will refine the patch with it.
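[Editor's note: to illustrate the suggested layout, a sketch under the naming above, not the final patch:]

#define PREV_CAP	0
#define NEXT_CAP	1

struct energy_env_sketch {
	struct {
		long energy;
		int cap_idx;
	} cpu[2][NR_CPUS];	/* before/after rows, indexed by CPU */
};

/* The before/after pair for one candidate is addressed explicitly,
 * rather than through the implicit NR_CPUS + cpu offset convention. */
static long candidate_delta(struct energy_env_sketch *eenv, int cpu)
{
	return eenv->cpu[NEXT_CAP][cpu].energy -
	       eenv->cpu[PREV_CAP][cpu].energy;
}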
 	/* The most energy efficient CPU for the specified energy_env::p */
 	int next_cpu;

@@ -5791,45 +5791,55 @@ static int compute_energy(struct energy_env *eenv, int candidate)
  */
 static int compute_task_energy(struct energy_env *eenv, int cpu)
 {
-	struct sched_domain *sd, *sd_cap;
-	struct sched_group *sg;
-	int first_cpu;
+	struct sched_domain *sd;
+	unsigned int prev_cap_idx, next_cap_idx;
+	int cmp_idx, ret;

-	sd = rcu_dereference(per_cpu(sd_ea, cpu));
+	sd = rcu_dereference(per_cpu(sd_scs, cpu));
 	if (!sd)
 		return -1; /* Error */
-	sg = sd->groups;
-	do {
-		/* Skip SGs which do not contains a candidate CPU */
-		if (!cpumask_intersects(&eenv->cpus_mask, sched_group_cpus(sg)))
-			continue;
-
-		eenv->sg_top = sg;
-
-		first_cpu = cpumask_first(sched_group_cpus(sg));
-
-		/*
-		 * The CPU capacity sharing attribution is decided by hardware
-		 * design so we can decide the sg_cap value at the beginning
-		 * for specific CPU.
-		 */
-		sd_cap = rcu_dereference(per_cpu(sd_scs, first_cpu));
-		if (sd_cap && sd_cap->parent)
-			eenv->sg_cap = sd_cap->parent->groups;
-		else
-			eenv->sg_cap = sd_cap->groups;
-
-		find_new_capacity(eenv, cpu);
+	/*
+	 * The CPU capacity sharing attribution is decided by hardware
+	 * design so we can decide the sg_cap value at the beginning
+	 * for specific CPU.
+	 */
+	if (sd && sd->parent)
+		eenv->sg_cap = sd->parent->groups;
+	else
+		eenv->sg_cap = sd->groups;
+
+	/* Estimate capacity index before task placement */
+	cmp_idx = NR_CPUS + cpu;
+	prev_cap_idx = find_new_capacity(eenv, cmp_idx);
+	next_cap_idx = find_new_capacity(eenv, cpu);
^^^^
Based on the above suggestion, this parameter would be {PREV,NEXT}_CAP.
+	/*
+	 * Computation is iteration sched_group from bottom to up level for
+	 * energy accumulation, 'sg_top' is top most sched_group:
+	 * - If the CPU frequency has no change before and after task placed
+	 *   onto the target CPU, set 'sg_top' to sched_group for the target
+	 *   CPU; this means only to calculate the energy for this single CPU
+	 *   and ignore other CPUs in the same clock domain.
+1
+	 * - If found the OPP frequency is changed after task placement then
+	 *   need to calculate all CPUs who bound in the same clock domain,
+	 *   so set 'sg_top' to shared capacity scheduling group.
+	 */
+	if (prev_cap_idx != next_cap_idx)
+		eenv->sg_top = eenv->sg_cap;
+	else
+		eenv->sg_top = sd->groups;
-		/* energy is unscaled to reduce rounding errors */
-		if (compute_energy(eenv, cpu) == -EINVAL) {
-			eenv->next_cpu = eenv->prev_cpu;
-			return -EINVAL;
-		}
+	/* energy is unscaled to reduce rounding errors */
+	ret = compute_energy(eenv, cmp_idx);
^^^^^^^
Same here... that would be a {PREV,NEXT}_CAP.
+	if (ret < 0)
+		return ret;

-	} while (sg = sg->next, sg != sd->groups);
+	ret = compute_energy(eenv, cpu);
+	if (ret < 0)
+		return ret;
+
+	eenv->cpu[cpu].energy -= eenv->cpu[cmp_idx].energy;
eenv->cpu[NEXT_CAP][cpu].energy -= eenv->cpu[PREV_CAP][cpu].energy;
 	return 0;
 }
--
1.9.1
--
#include <best/regards.h>

Patrick Bellasi