This patch series is a follow-up to the EASv5 power profiling on Hikey. The profiling results for rt-app-31/38/44 were inconsistent; this turned out to be fixable with the four patches below. With these patches applied, we get a good improvement for these cases (numbers in mW):
Energy      BestComb  Mainline(ndm)  noEAS(ndm)  EAS(ndm)  EAS(sched)  EAS(Applied Patches)
mp3              412         604.41        551.79    528.99      530.20               491.10
rt-app-6         676         864.18        846.72    792.88      840.33               759.96
rt-app-13        968        1222.47       1210.35   1673.04     1332.13              1253.99
rt-app-19       1348        1412.08       1474.86   1612.12     1421.28              1355.49
rt-app-25       1619        1718.67       1710.73   2104.41     2028.25              1584.25
rt-app-31       1878        1968.08       1965.87   2318.11     2976.59              1903.69
rt-app-38       2283        2580.23       2540.45   2576.46     2724.32              2241.29
rt-app-44       2578        3092.66       3056.92   2913.91     2669.91              2406.45
rt-app-50       2848        3492.36       3423.26   3489.14     3429.41              3290.25
This patch series is ONLY for EXPERIMENTAL purposes.
Leo Yan (4):
  sched/fair: EASv5: Fix CPU shared capacity issue
  sched/fair: EASv5: snapshot CPU's utilization
  sched/fair: EASv5: Add CPU's total utilization
  sched/fair: EASv5: update new capacity index
 kernel/sched/fair.c  | 88 +++++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |  1 +
 2 files changed, 74 insertions(+), 15 deletions(-)
--
1.9.1
When clusters share a clock domain, they are bound to the same OPP; so when calculating energy data, all CPUs in that domain have to be considered together. But the scheduler has no top-level sched group covering all of these CPUs, so it does not iterate over all of them and ends up with a wrong calculation result.

This patch changes eenv->sg_cap to a cpumask and sets all related CPUs in it.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9d0e11..63aef51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4892,7 +4892,7 @@ static inline bool energy_aware(void)

 struct energy_env {
         struct sched_group      *sg_top;
-        struct sched_group      *sg_cap;
+        struct cpumask          sg_cap;
         int                     cap_idx;
         int                     usage_delta;
         int                     src_cpu;
@@ -4952,7 +4952,7 @@ unsigned long group_max_usage(struct energy_env *eenv)
         int i, delta;
         unsigned long max_usage = 0;

-        for_each_cpu(i, sched_group_cpus(eenv->sg_cap)) {
+        for_each_cpu(i, &eenv->sg_cap) {
                 delta = calc_usage_delta(eenv, i);
                 max_usage = max(max_usage, __get_cpu_usage(i, delta));
         }
@@ -5054,8 +5054,19 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
          * sched_group?
          */
         sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
-        if (sd && sd->parent)
+
+        cpumask_clear(&eenv->sg_cap);
+        if (sd && sd->parent) {
                 sg_shared_cap = sd->parent->groups;
+                cpumask_or(&eenv->sg_cap, &eenv->sg_cap,
+                           sched_group_cpus(sg_shared_cap));
+        } else if (sd) {
+                sg_shared_cap = sd->groups;
+                do {
+                        cpumask_or(&eenv->sg_cap, &eenv->sg_cap,
+                                   sched_group_cpus(sg_shared_cap));
+                } while (sg_shared_cap = sg_shared_cap->next, sg_shared_cap != sd->groups);
+        }

         for_each_domain(cpu, sd) {
                 sg = sd->groups;
@@ -5069,11 +5080,6 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
                         int sg_busy_energy, sg_idle_energy;
                         int cap_idx, idle_idx;

-                        if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
-                                eenv->sg_cap = sg_shared_cap;
-                        else
-                                eenv->sg_cap = sg;
-
                         cap_idx = find_new_capacity(eenv, sg->sge);

                         if (sg->group_weight == 1) {
@@ -5170,6 +5176,7 @@ static int energy_diff(struct energy_env *eenv)
                 energy_before += sched_group_energy(&eenv_before);
                 energy_after += sched_group_energy(eenv);
         }
+        } while (sg = sg->next, sg != sd->groups);

         eenv->nrg.before = energy_before;
--
1.9.1
When calculating the energy difference, EAS computes the energy twice: the first time for the state before the task's migration, and the second time for the state after the task has been migrated to the selected CPU. There is a high chance that these two calculations do not run in exactly the same context, e.g. the CPU frequency may have been changed by the CPUFreq governor in between. The calculated energy difference is then skewed and can no longer be trusted.

So this patch creates a snapshot of each CPU's utilization; both subsequent calculations are based on this snapshot, which ensures they run against the same context.
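Conceptually, the intended flow is the following (a sketch only; snapshot_cpu_util() is an illustrative helper name, the patch itself open-codes the copy in energy_diff()):

        /* Sketch; snapshot_cpu_util() is illustrative, not part of the patch. */
        snapshot_cpu_util(&eenv_before, eenv);  /* one consistent view of CPU utilization */
        energy_before = sched_group_energy(&eenv_before);
        energy_after = sched_group_energy(eenv);        /* same snapshot, hence same context */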
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63aef51..01bb3f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4890,6 +4890,11 @@ static inline bool energy_aware(void)
         return sched_feat(ENERGY_AWARE);
 }

+struct energy_env_snapshot {
+        int util[8];
+        int block[8];
+};
+
 struct energy_env {
         struct sched_group      *sg_top;
         struct cpumask          sg_cap;
@@ -4900,6 +4905,8 @@ struct energy_env {
         int                     energy;
         int                     energy_payoff;
         struct task_struct      *task;
+        struct energy_env_snapshot snapshot;
+
         struct {
                 int before;
                 int after;
@@ -4913,6 +4920,24 @@ struct energy_env {
         } cap;
 };

+static unsigned long __eas_get_cpu_usage(struct energy_env *eenv, int cpu, int delta)
+{
+        int sum;
+        unsigned long usage = eenv->snapshot.util[cpu];
+        unsigned long blocked = eenv->snapshot.block[cpu];
+        unsigned long capacity_orig = capacity_orig_of(cpu);
+
+        sum = usage + blocked + delta;
+
+        if (sum < 0)
+                return 0;
+
+        if (sum >= capacity_orig)
+                return capacity_orig;
+
+        return sum;
+}
+
 /*
  * __cpu_norm_usage() returns the cpu usage relative to a specific capacity,
  * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
@@ -4927,9 +4952,10 @@ struct energy_env {
  *
  *   norm_usage = running_time/time ~ usage/capacity
  */
-static unsigned long __cpu_norm_usage(int cpu, unsigned long capacity, int delta)
+static unsigned long __cpu_norm_usage(struct energy_env *eenv, int cpu,
+                                      unsigned long capacity, int delta)
 {
-        int usage = __get_cpu_usage(cpu, delta);
+        int usage = __eas_get_cpu_usage(eenv, cpu, delta);

         if (usage >= capacity)
                 return SCHED_CAPACITY_SCALE;
@@ -4954,7 +4980,7 @@ unsigned long group_max_usage(struct energy_env *eenv)

         for_each_cpu(i, &eenv->sg_cap) {
                 delta = calc_usage_delta(eenv, i);
-                max_usage = max(max_usage, __get_cpu_usage(i, delta));
+                max_usage = max(max_usage, __eas_get_cpu_usage(eenv, i, delta));
         }

         return max_usage;
@@ -4978,7 +5004,7 @@ long group_norm_usage(struct energy_env *eenv, struct sched_group *sg)

         for_each_cpu(i, sched_group_cpus(sg)) {
                 delta = calc_usage_delta(eenv, i);
-                usage_sum += __cpu_norm_usage(i, capacity, delta);
+                usage_sum += __cpu_norm_usage(eenv, i, capacity, delta);
         }

         if (usage_sum > SCHED_CAPACITY_SCALE)
@@ -5135,6 +5161,7 @@ static int energy_diff(struct energy_env *eenv)
         struct sched_domain *sd;
         struct sched_group *sg;
         int sd_cpu = -1, energy_before = 0, energy_after = 0;
+        int i;

         struct energy_env eenv_before = {
                 .usage_delta    = 0,
@@ -5154,6 +5181,17 @@ static int energy_diff(struct energy_env *eenv)
                 return 0; /* Error */

         sg = sd->groups;
+
+        for_each_cpu(i, sched_domain_span(sd)) {
+                eenv_before.snapshot.util[i] =
+                        cpu_rq(i)->cfs.utilization_load_avg;
+                eenv_before.snapshot.block[i] =
+                        cpu_rq(i)->cfs.utilization_blocked_avg;
+
+                eenv->snapshot.util[i] = eenv_before.snapshot.util[i];
+                eenv->snapshot.block[i] = eenv_before.snapshot.block[i];
+        }
+
         do {
                 if (eenv->src_cpu != -1 && cpumask_test_cpu(eenv->src_cpu,
                                                 sched_group_cpus(sg))) {
--
1.9.1
The previous code calculates a CPU's total utilization from its runnable tasks and its blocked tasks. These two variables are updated separately, so there is a chance of reading intermediate values; for example, when enqueue_entity_load_avg() wakes up a task, it first decreases cfs_rq::utilization_blocked_avg and only then increases cfs_rq::utilization_load_avg, with a window in between caused by update_entity_load_avg(). As a result it is easy to read both values as zero, even though a big workload task is running on the CPU.
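A simplified sketch of the window described above (illustrative only, not the exact kernel code path):

        /*
         * Wakeup path, simplified; the real code is in
         * enqueue_entity_load_avg()/update_entity_load_avg().
         */
        cfs_rq->utilization_blocked_avg -= se->avg.utilization_avg_contrib;
        /* <-- window: a reader here sees the task in neither sum */
        update_entity_load_avg(se, 0);
        cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;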
So add one more variable, util_total, to track the sum of the CPU's utilization and avoid the intermediate-value issue.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c  | 38 +++++++++++++++++++++++++-------------
 kernel/sched/sched.h |  1 +
 2 files changed, 26 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01bb3f0..6203a20 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2872,7 +2872,10 @@ static inline void update_entity_load_avg(struct sched_entity *se,
                                                 -utilization_delta);
         }

-        trace_sched_load_avg_cpu(cpu, cfs_rq);
+        cfs_rq->util_total = cfs_rq->utilization_load_avg +
+                cfs_rq->utilization_blocked_avg;
+
+        trace_sched_load_avg_cpu(cpu, &cpu_rq(cpu)->cfs);
 }

 /*
@@ -2883,6 +2886,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 {
         u64 now = cfs_rq_clock_task(cfs_rq) >> 20;
         u64 decays;
+        int cpu = cpu_of(rq_of(cfs_rq));

         decays = now - cfs_rq->last_decay;
         if (!decays && !force_update)
@@ -2908,6 +2912,9 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
         }

         __update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
+
+        cfs_rq->util_total = cfs_rq->utilization_load_avg +
+                cfs_rq->utilization_blocked_avg;
 }

 /* Add the load generated by se into cfs_rq's child load-average */
@@ -2958,6 +2965,9 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
         cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
         /* we force update consideration on load-balancer moves */
         update_cfs_rq_blocked_load(cfs_rq, !wakeup);
+
+        cfs_rq->util_total = cfs_rq->utilization_load_avg +
+                cfs_rq->utilization_blocked_avg;
 }

 /*
@@ -2969,6 +2979,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
                                                   struct sched_entity *se,
                                                   int sleep)
 {
+        int cpu = cpu_of(rq_of(cfs_rq));
+
         update_entity_load_avg(se, 1);
         /* we force update consideration on load-balancer moves */
         update_cfs_rq_blocked_load(cfs_rq, !sleep);
@@ -2981,6 +2993,9 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
                         se->avg.utilization_avg_contrib;
                 se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
         } /* migrations, e.g. sleep=0 leave decay_count == 0 */
+
+        cfs_rq->util_total = cfs_rq->utilization_load_avg +
+                cfs_rq->utilization_blocked_avg;
 }

 /*
@@ -4891,8 +4906,7 @@ static inline bool energy_aware(void)
 }

 struct energy_env_snapshot {
-        int util[8];
-        int block[8];
+        int util_total[8];
 };

 struct energy_env {
@@ -4923,11 +4937,10 @@ struct energy_env {
 static unsigned long __eas_get_cpu_usage(struct energy_env *eenv, int cpu, int delta)
 {
         int sum;
-        unsigned long usage = eenv->snapshot.util[cpu];
-        unsigned long blocked = eenv->snapshot.block[cpu];
+        unsigned long util_total = eenv->snapshot.util_total[cpu];
         unsigned long capacity_orig = capacity_orig_of(cpu);

-        sum = usage + blocked + delta;
+        sum = util_total + delta;

         if (sum < 0)
                 return 0;
@@ -5183,13 +5196,9 @@ static int energy_diff(struct energy_env *eenv)
         sg = sd->groups;

         for_each_cpu(i, sched_domain_span(sd)) {
-                eenv_before.snapshot.util[i] =
-                        cpu_rq(i)->cfs.utilization_load_avg;
-                eenv_before.snapshot.block[i] =
-                        cpu_rq(i)->cfs.utilization_blocked_avg;
-
-                eenv->snapshot.util[i] = eenv_before.snapshot.util[i];
-                eenv->snapshot.block[i] = eenv_before.snapshot.block[i];
+                eenv_before.snapshot.util_total[i] = cpu_rq(i)->cfs.util_total;
+                eenv->snapshot.util_total[i] =
+                        eenv_before.snapshot.util_total[i];
         }

         do {
@@ -8972,6 +8981,9 @@ static void task_move_group_fair(struct task_struct *p, int queued)
                 cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
                 cfs_rq->utilization_blocked_avg +=
                                                 se->avg.utilization_avg_contrib;
+
+                cfs_rq->util_total = cfs_rq->utilization_load_avg +
+                        cfs_rq->utilization_blocked_avg;
 #endif
         }
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 849b6b5..deb1fae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -382,6 +382,7 @@ struct cfs_rq {
          */
         unsigned long runnable_load_avg, blocked_load_avg;
         unsigned long utilization_load_avg, utilization_blocked_avg;
+        unsigned long util_total;
         atomic64_t decay_counter;
         u64 last_decay;
         atomic_long_t removed_load, removed_utilization;
--
1.9.1
When calculating the new CPU capacity index, eenv->cap_idx needs to be updated, since it is used later by group_norm_usage(). Otherwise group_norm_usage() calculates with a stale index of zero.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6203a20..ce293ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5032,12 +5032,13 @@ static int find_new_capacity(struct energy_env *eenv,
         unsigned long util = group_max_usage(eenv);

         for (idx = 0; idx < sge->nr_cap_states; idx++) {
-                if (sge->cap_states[idx].cap >= util)
+                if (sge->cap_states[idx].cap >= util) {
+                        eenv->cap_idx = idx;
                         return idx;
+                }
         }

         eenv->cap_idx = idx;
-
         return idx;
 }
--
1.9.1
Hi Leo,
Sorry for the huge delay, but I was on a 4-week sabbatical.
On 10/11/15 09:58, Leo Yan wrote:
This patch series is a follow-up to the EASv5 power profiling on Hikey. The profiling results for rt-app-31/38/44 were inconsistent; this turned out to be fixable with the four patches below. With these patches applied, we get a good improvement for these cases (numbers in mW):
Energy      BestComb  Mainline(ndm)  noEAS(ndm)  EAS(ndm)  EAS(sched)  EAS(Applied Patches)
mp3              412         604.41        551.79    528.99      530.20               491.10
rt-app-6         676         864.18        846.72    792.88      840.33               759.96
rt-app-13        968        1222.47       1210.35   1673.04     1332.13              1253.99
rt-app-19       1348        1412.08       1474.86   1612.12     1421.28              1355.49
rt-app-25       1619        1718.67       1710.73   2104.41     2028.25              1584.25
rt-app-31       1878        1968.08       1965.87   2318.11     2976.59              1903.69
rt-app-38       2283        2580.23       2540.45   2576.46     2724.32              2241.29
rt-app-44       2578        3092.66       3056.92   2913.91     2669.91              2406.45
rt-app-50       2848        3492.36       3423.26   3489.14     3429.41              3290.25
Would you be able to switch to Patrick's schedtest to run the EAS tests there? This would enable us to compare the results much more easily.
I know that there is a little bit of work remaining to integrate the Hikey board into schedtest (the actual board integration and the integration of the ARM Energy Probe).
If you don't have time to do this right now, would you be able to share an instrumented Hikey board with us so we could do the integration?
The main point is that, to be able to evaluate patches for possible EAS integration which might add SMP (and/or exotic PM configurations, like a frequency domain spanning two clusters), we need a common understanding of what these test results actually mean in detail. I don't see another way than using the same tool for the tests.
Could you please switch to EAS RFC 5.2? It has the PELT rewrite, so some of your fixes should no longer be necessary or are already addressed.
linux-arm.org/linux-power.git energy_model_rfc_v5.2
This patch series is ONLY for EXPERIMENTAL purposes.
Leo Yan (4):
  sched/fair: EASv5: Fix CPU shared capacity issue
  sched/fair: EASv5: snapshot CPU's utilization
  sched/fair: EASv5: Add CPU's total utilization
  sched/fair: EASv5: update new capacity index
 kernel/sched/fair.c  | 88 +++++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |  1 +
 2 files changed, 74 insertions(+), 15 deletions(-)
On 10/11/15 09:58, Leo Yan wrote:
When clusters share a clock domain, they are bound to the same OPP; so when calculating energy data, all CPUs in that domain have to be considered together. But the scheduler has no top-level sched group covering all of these CPUs, so it does not iterate over all of them and ends up with a wrong calculation result.

This patch changes eenv->sg_cap to a cpumask and sets all related CPUs in it.
The 'put a struct cpumask sg_cap on the stack' idea should work, but so far we have favoured a different solution. Mainly also to cover single-cluster systems for EAS, we thought it would be better to introduce an extra sched domain 'SYS' (in your case on top of 'DIE') to:
(1) Hold cluster energy model (EM) data on a single-cluster system
(2) Offer this 'sg_shared_cap = sd->parent->groups' thing on a platform like Hikey
(3) Let cpu and cluster level EM data survive if cpus get off-lined
Obviously, this sd level can't be part of the actual scheduling decisions so CFS code has to change slightly for this to work. I attached the appropriate patch. It's only tested lightly on TC2 and JUNO and might not apply on the current code line.
It would be nice if you could test it and give feedback on whether it's something you're fine with. Thanks!
We should consider integrating one or the other solution into EAS RFC 6 to cover your topology for EAS as well.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9d0e11..63aef51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4892,7 +4892,7 @@ static inline bool energy_aware(void)

 struct energy_env {
         struct sched_group      *sg_top;
-        struct sched_group      *sg_cap;
+        struct cpumask          sg_cap;
         int                     cap_idx;
         int                     usage_delta;
         int                     src_cpu;
@@ -4952,7 +4952,7 @@ unsigned long group_max_usage(struct energy_env *eenv)
         int i, delta;
         unsigned long max_usage = 0;

-        for_each_cpu(i, sched_group_cpus(eenv->sg_cap)) {
+        for_each_cpu(i, &eenv->sg_cap) {
                 delta = calc_usage_delta(eenv, i);
                 max_usage = max(max_usage, __get_cpu_usage(i, delta));
         }
@@ -5054,8 +5054,19 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
          * sched_group?
          */
         sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
-        if (sd && sd->parent)
+
+        cpumask_clear(&eenv->sg_cap);
+        if (sd && sd->parent) {
                 sg_shared_cap = sd->parent->groups;
+                cpumask_or(&eenv->sg_cap, &eenv->sg_cap,
+                           sched_group_cpus(sg_shared_cap));
This path would cover platforms like TC2 and JUNO.
+        } else if (sd) {
+                sg_shared_cap = sd->groups;
+                do {
+                        cpumask_or(&eenv->sg_cap, &eenv->sg_cap,
+                                   sched_group_cpus(sg_shared_cap));
+                } while (sg_shared_cap = sg_shared_cap->next, sg_shared_cap != sd->groups);
+        }
This path would cover Hikey.
         for_each_domain(cpu, sd) {
                 sg = sd->groups;
@@ -5069,11 +5080,6 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
                         int sg_busy_energy, sg_idle_energy;
                         int cap_idx, idle_idx;

-                        if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
-                                eenv->sg_cap = sg_shared_cap;
-                        else
-                                eenv->sg_cap = sg;
-
                         cap_idx = find_new_capacity(eenv, sg->sge);

                         if (sg->group_weight == 1) {
@@ -5170,6 +5176,7 @@ static int energy_diff(struct energy_env *eenv)
                 energy_before += sched_group_energy(&eenv_before);
                 energy_after += sched_group_energy(eenv);
         }
+        } while (sg = sg->next, sg != sd->groups);

         eenv->nrg.before = energy_before;
On 10/11/15 09:58, Leo Yan wrote:
When calculating the energy difference, EAS computes the energy twice: the first time for the state before the task's migration, and the second time for the state after the task has been migrated to the selected CPU. There is a high chance that these two calculations do not run in exactly the same context, e.g. the CPU frequency may have been changed by the CPUFreq governor in between. The calculated energy difference is then skewed and can no longer be trusted.

So this patch creates a snapshot of each CPU's utilization; both subsequent calculations are based on this snapshot, which ensures they run against the same context.
If you do this, IMHO you're just changing from one scenario in which you base your calculation on erroneous data to another. In your case, if the system underneath EAS has changed, you're operating on stale data for before _and_ after now.
The result we base the task placement decision on is a difference of energy, and it is impossible to calculate it atomically. So what difference does it make to have the error in the resulting energy number, compared to the case where you calculate the energy number purely on stale data but the system you put the task on has changed in the meantime? The worst-case scenario is a wrong task placement. The whole cfs code is designed in such a way that you have to be able to work with stale data for the sched domain/sched group statistics. I don't see an advantage right now.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63aef51..01bb3f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4890,6 +4890,11 @@ static inline bool energy_aware(void)
         return sched_feat(ENERGY_AWARE);
 }

+struct energy_env_snapshot {
+        int util[8];
+        int block[8];
+};
+
 struct energy_env {
         struct sched_group      *sg_top;
         struct cpumask          sg_cap;
@@ -4900,6 +4905,8 @@ struct energy_env {
         int                     energy;
         int                     energy_payoff;
         struct task_struct      *task;
+        struct energy_env_snapshot snapshot;
+
         struct {
                 int before;
                 int after;
@@ -4913,6 +4920,24 @@ struct energy_env {
         } cap;
 };

+static unsigned long __eas_get_cpu_usage(struct energy_env *eenv, int cpu, int delta)
+{
+        int sum;
+        unsigned long usage = eenv->snapshot.util[cpu];
+        unsigned long blocked = eenv->snapshot.block[cpu];
+        unsigned long capacity_orig = capacity_orig_of(cpu);
+
+        sum = usage + blocked + delta;
+
+        if (sum < 0)
+                return 0;
+
+        if (sum >= capacity_orig)
+                return capacity_orig;
+
+        return sum;
+}
+
 /*
  * __cpu_norm_usage() returns the cpu usage relative to a specific capacity,
  * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
@@ -4927,9 +4952,10 @@ struct energy_env {
  *
  *   norm_usage = running_time/time ~ usage/capacity
  */
-static unsigned long __cpu_norm_usage(int cpu, unsigned long capacity, int delta)
+static unsigned long __cpu_norm_usage(struct energy_env *eenv, int cpu,
+                                      unsigned long capacity, int delta)
 {
-        int usage = __get_cpu_usage(cpu, delta);
+        int usage = __eas_get_cpu_usage(eenv, cpu, delta);

         if (usage >= capacity)
                 return SCHED_CAPACITY_SCALE;
@@ -4954,7 +4980,7 @@ unsigned long group_max_usage(struct energy_env *eenv)

         for_each_cpu(i, &eenv->sg_cap) {
                 delta = calc_usage_delta(eenv, i);
-                max_usage = max(max_usage, __get_cpu_usage(i, delta));
+                max_usage = max(max_usage, __eas_get_cpu_usage(eenv, i, delta));
         }

         return max_usage;
@@ -4978,7 +5004,7 @@ long group_norm_usage(struct energy_env *eenv, struct sched_group *sg)

         for_each_cpu(i, sched_group_cpus(sg)) {
                 delta = calc_usage_delta(eenv, i);
-                usage_sum += __cpu_norm_usage(i, capacity, delta);
+                usage_sum += __cpu_norm_usage(eenv, i, capacity, delta);
         }

         if (usage_sum > SCHED_CAPACITY_SCALE)
@@ -5135,6 +5161,7 @@ static int energy_diff(struct energy_env *eenv)
         struct sched_domain *sd;
         struct sched_group *sg;
         int sd_cpu = -1, energy_before = 0, energy_after = 0;
+        int i;

         struct energy_env eenv_before = {
                 .usage_delta    = 0,
@@ -5154,6 +5181,17 @@ static int energy_diff(struct energy_env *eenv)
                 return 0; /* Error */

         sg = sd->groups;
+
+        for_each_cpu(i, sched_domain_span(sd)) {
+                eenv_before.snapshot.util[i] =
+                        cpu_rq(i)->cfs.utilization_load_avg;
+                eenv_before.snapshot.block[i] =
+                        cpu_rq(i)->cfs.utilization_blocked_avg;
+
+                eenv->snapshot.util[i] = eenv_before.snapshot.util[i];
+                eenv->snapshot.block[i] = eenv_before.snapshot.block[i];
+        }
+
         do {
                 if (eenv->src_cpu != -1 && cpumask_test_cpu(eenv->src_cpu,
                                                 sched_group_cpus(sg))) {
On 10/11/15 09:58, Leo Yan wrote:
The previous code calculates a CPU's total utilization from its runnable tasks and its blocked tasks. These two variables are updated separately, so there is a chance of reading intermediate values; for example, when enqueue_entity_load_avg() wakes up a task, it first decreases cfs_rq::utilization_blocked_avg and only then increases cfs_rq::utilization_load_avg, with a window in between caused by update_entity_load_avg(). As a result it is easy to read both values as zero, even though a big workload task is running on the CPU.
This shouldn't be necessary any more in EAS RFC 5.1/5.2 with the PELT rewrite. We still have cfs_rq->avg.util_avg and cfs_rq->removed_util_avg though, so the former isn't always up to date (e.g. in case of task migrations and dying tasks).
So add one more variable, util_total, to track the sum of the CPU's utilization and avoid the intermediate-value issue.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c  | 38 +++++++++++++++++++++++++-------------
 kernel/sched/sched.h |  1 +
 2 files changed, 26 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01bb3f0..6203a20 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2872,7 +2872,10 @@ static inline void update_entity_load_avg(struct sched_entity *se,
                                                 -utilization_delta);
         }

-        trace_sched_load_avg_cpu(cpu, cfs_rq);
+        cfs_rq->util_total = cfs_rq->utilization_load_avg +
+                cfs_rq->utilization_blocked_avg;
+
+        trace_sched_load_avg_cpu(cpu, &cpu_rq(cpu)->cfs);
 }

 /*
@@ -2883,6 +2886,7 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 {
         u64 now = cfs_rq_clock_task(cfs_rq) >> 20;
         u64 decays;
+        int cpu = cpu_of(rq_of(cfs_rq));

         decays = now - cfs_rq->last_decay;
         if (!decays && !force_update)
@@ -2908,6 +2912,9 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
         }

         __update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
+
+        cfs_rq->util_total = cfs_rq->utilization_load_avg +
+                cfs_rq->utilization_blocked_avg;
 }

 /* Add the load generated by se into cfs_rq's child load-average */
@@ -2958,6 +2965,9 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
         cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
         /* we force update consideration on load-balancer moves */
         update_cfs_rq_blocked_load(cfs_rq, !wakeup);
+
+        cfs_rq->util_total = cfs_rq->utilization_load_avg +
+                cfs_rq->utilization_blocked_avg;
 }

 /*
@@ -2969,6 +2979,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
                                                   struct sched_entity *se,
                                                   int sleep)
 {
+        int cpu = cpu_of(rq_of(cfs_rq));
+
         update_entity_load_avg(se, 1);
         /* we force update consideration on load-balancer moves */
         update_cfs_rq_blocked_load(cfs_rq, !sleep);
@@ -2981,6 +2993,9 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
                         se->avg.utilization_avg_contrib;
                 se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
         } /* migrations, e.g. sleep=0 leave decay_count == 0 */
+
+        cfs_rq->util_total = cfs_rq->utilization_load_avg +
+                cfs_rq->utilization_blocked_avg;
 }

 /*
@@ -4891,8 +4906,7 @@ static inline bool energy_aware(void)
 }

 struct energy_env_snapshot {
-        int util[8];
-        int block[8];
+        int util_total[8];
 };

 struct energy_env {
@@ -4923,11 +4937,10 @@ struct energy_env {
 static unsigned long __eas_get_cpu_usage(struct energy_env *eenv, int cpu, int delta)
 {
         int sum;
-        unsigned long usage = eenv->snapshot.util[cpu];
-        unsigned long blocked = eenv->snapshot.block[cpu];
+        unsigned long util_total = eenv->snapshot.util_total[cpu];
         unsigned long capacity_orig = capacity_orig_of(cpu);

-        sum = usage + blocked + delta;
+        sum = util_total + delta;

         if (sum < 0)
                 return 0;
@@ -5183,13 +5196,9 @@ static int energy_diff(struct energy_env *eenv)
         sg = sd->groups;

         for_each_cpu(i, sched_domain_span(sd)) {
-                eenv_before.snapshot.util[i] =
-                        cpu_rq(i)->cfs.utilization_load_avg;
-                eenv_before.snapshot.block[i] =
-                        cpu_rq(i)->cfs.utilization_blocked_avg;
-
-                eenv->snapshot.util[i] = eenv_before.snapshot.util[i];
-                eenv->snapshot.block[i] = eenv_before.snapshot.block[i];
+                eenv_before.snapshot.util_total[i] = cpu_rq(i)->cfs.util_total;
+                eenv->snapshot.util_total[i] =
+                        eenv_before.snapshot.util_total[i];
         }

         do {
@@ -8972,6 +8981,9 @@ static void task_move_group_fair(struct task_struct *p, int queued)
                 cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
                 cfs_rq->utilization_blocked_avg +=
                                                 se->avg.utilization_avg_contrib;
+
+                cfs_rq->util_total = cfs_rq->utilization_load_avg +
+                        cfs_rq->utilization_blocked_avg;
 #endif
         }
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 849b6b5..deb1fae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -382,6 +382,7 @@ struct cfs_rq {
          */
         unsigned long runnable_load_avg, blocked_load_avg;
         unsigned long utilization_load_avg, utilization_blocked_avg;
+        unsigned long util_total;
         atomic64_t decay_counter;
         u64 last_decay;
         atomic_long_t removed_load, removed_utilization;
On 10/11/15 09:58, Leo Yan wrote:
When calculating the new CPU capacity index, eenv->cap_idx needs to be updated, since it is used later by group_norm_usage(). Otherwise group_norm_usage() calculates with a stale index of zero.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6203a20..ce293ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5032,12 +5032,13 @@ static int find_new_capacity(struct energy_env *eenv,
         unsigned long util = group_max_usage(eenv);

         for (idx = 0; idx < sge->nr_cap_states; idx++) {
-                if (sge->cap_states[idx].cap >= util)
+                if (sge->cap_states[idx].cap >= util) {
+                        eenv->cap_idx = idx;
                         return idx;
+                }
Nice catch! We also figured this out in the meantime, so it's part of EAS RFC 5.2.
         }

         eenv->cap_idx = idx;
-
         return idx;
 }
Hi Dietmar,
On Thu, Dec 17, 2015 at 07:18:52PM +0000, Dietmar Eggemann wrote:
Sorry for the huge delay, but I was on a 4-week sabbatical.
On 10/11/15 09:58, Leo Yan wrote:
This patch series is a follow-up to the EASv5 power profiling on Hikey. The profiling results for rt-app-31/38/44 were inconsistent; this turned out to be fixable with the four patches below. With these patches applied, we get a good improvement for these cases (numbers in mW):
Energy      BestComb  Mainline(ndm)  noEAS(ndm)  EAS(ndm)  EAS(sched)  EAS(Applied Patches)
mp3              412         604.41        551.79    528.99      530.20               491.10
rt-app-6         676         864.18        846.72    792.88      840.33               759.96
rt-app-13        968        1222.47       1210.35   1673.04     1332.13              1253.99
rt-app-19       1348        1412.08       1474.86   1612.12     1421.28              1355.49
rt-app-25       1619        1718.67       1710.73   2104.41     2028.25              1584.25
rt-app-31       1878        1968.08       1965.87   2318.11     2976.59              1903.69
rt-app-38       2283        2580.23       2540.45   2576.46     2724.32              2241.29
rt-app-44       2578        3092.66       3056.92   2913.91     2669.91              2406.45
rt-app-50       2848        3492.36       3423.26   3489.14     3429.41              3290.25
Would you be able to switch to Patrick's schedtest to run the EAS tests there? This would enable us to compare the results much more easily.
Sure, will do this.
I know that there is a little bit of work remaining to integrate the Hikey board into schedtest (the actual board integration and the integration of the ARM Energy Probe).
Board integration is finished, but AEP integration needs more time. I will finish this next week and share it with you.
If you don't have time to do this right now, would you be able to share an instrumented Hikey board with us so we could do the integration?
I usually have to finish other, higher-priority tasks first and use the remaining time for EAS profiling on Hikey; so I strongly suggest you set up a Hikey environment on your side, and I will fully support this.

For sharing a board, let's track this in a separate offline email.
The main point is that, to be able to evaluate patches for possible EAS integration which might add SMP (and/or exotic PM configurations, like a frequency domain spanning two clusters), we need a common understanding of what these test results actually mean in detail. I don't see another way than using the same tool for the tests.
Could you please switch to EAS RFC 5.2? It has the PELT rewrite, so some of your fixes should no longer be necessary or are already addressed.
linux-arm.org/linux-power.git energy_model_rfc_v5.2
Totally agree; before I continue with more EAS profiling, I will first switch to EAS RFC 5.2 and the new profiling tool (schedtest).
Thanks, Leo Yan
This patch series is ONLY for EXPERIMENTAL purposes.
Leo Yan (4):
  sched/fair: EASv5: Fix CPU shared capacity issue
  sched/fair: EASv5: snapshot CPU's utilization
  sched/fair: EASv5: Add CPU's total utilization
  sched/fair: EASv5: update new capacity index
 kernel/sched/fair.c  | 88 +++++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |  1 +
 2 files changed, 74 insertions(+), 15 deletions(-)
On Thu, Dec 17, 2015 at 07:18:56PM +0000, Dietmar Eggemann wrote:
On 10/11/15 09:58, Leo Yan wrote:
When clusters share a clock domain, they are bound to the same OPP; so when calculating energy data, all CPUs in that domain have to be considered together. But the scheduler has no top-level sched group covering all of these CPUs, so it does not iterate over all of them and ends up with a wrong calculation result.

This patch changes eenv->sg_cap to a cpumask and sets all related CPUs in it.
The 'put a struct cpumask sg_cap on the stack' idea should work, but so far we have favoured a different solution. Mainly also to cover single-cluster systems for EAS, we thought it would be better to introduce an extra sched domain 'SYS' (in your case on top of 'DIE') to:
Your idea is better than my patch, because it maps more clearly onto the hardware topology. I also have some questions below so I can understand your patch well :)
(1) Hold cluster energy model (EM) data on a single-cluster system
For a single-cluster system, there is only the 'MC' sched domain (at the cpu level), so there is no 'CPU' sched domain to hold the cluster's EM data. So we should add the sched domain 'SYS' to hold the cluster's EM data, right?
(2) Offer this 'sg_shared_cap = sd->parent->groups' thing on a platform like Hikey
(3) Let cpu and cluster level EM data survive if cpus get off-lined
Just curious: what benefit do we get from this? When a CPU or a whole cluster has been off-lined, we should not care about their EM data anymore.
Obviously, this sd level can't be part of the actual scheduling decisions so CFS code has to change slightly for this to work. I attached the appropriate patch. It's only tested lightly on TC2 and JUNO and might not apply on the current code line.
It would be nice if you could test it and give feedback on whether it's something you're fine with. Thanks!
Do you suggest I test this based on RFCv5 or RFCv6?
We should consider integrating one or the other solution into EAS RFC 6 to cover your topology for EAS as well.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9d0e11..63aef51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4892,7 +4892,7 @@ static inline bool energy_aware(void)

 struct energy_env {
         struct sched_group      *sg_top;
-        struct sched_group      *sg_cap;
+        struct cpumask          sg_cap;
         int                     cap_idx;
         int                     usage_delta;
         int                     src_cpu;
@@ -4952,7 +4952,7 @@ unsigned long group_max_usage(struct energy_env *eenv)
         int i, delta;
         unsigned long max_usage = 0;

-        for_each_cpu(i, sched_group_cpus(eenv->sg_cap)) {
+        for_each_cpu(i, &eenv->sg_cap) {
                 delta = calc_usage_delta(eenv, i);
                 max_usage = max(max_usage, __get_cpu_usage(i, delta));
         }
@@ -5054,8 +5054,19 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
          * sched_group?
          */
         sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
-        if (sd && sd->parent)
+
+        cpumask_clear(&eenv->sg_cap);
+        if (sd && sd->parent) {
                 sg_shared_cap = sd->parent->groups;
+                cpumask_or(&eenv->sg_cap, &eenv->sg_cap,
+                           sched_group_cpus(sg_shared_cap));
This path would cover platforms like TC2 and JUNO.
Correct.
+        } else if (sd) {
+                sg_shared_cap = sd->groups;
+                do {
+                        cpumask_or(&eenv->sg_cap, &eenv->sg_cap,
+                                   sched_group_cpus(sg_shared_cap));
+                } while (sg_shared_cap = sg_shared_cap->next, sg_shared_cap != sd->groups);
+        }
This path would cover Hikey.
Correct.
Thanks, Leo Yan
         for_each_domain(cpu, sd) {
                 sg = sd->groups;
@@ -5069,11 +5080,6 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
                         int sg_busy_energy, sg_idle_energy;
                         int cap_idx, idle_idx;

-                        if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
-                                eenv->sg_cap = sg_shared_cap;
-                        else
-                                eenv->sg_cap = sg;
-
                         cap_idx = find_new_capacity(eenv, sg->sge);

                         if (sg->group_weight == 1) {
@@ -5170,6 +5176,7 @@ static int energy_diff(struct energy_env *eenv)
                 energy_before += sched_group_energy(&eenv_before);
                 energy_after += sched_group_energy(eenv);
         }
+        } while (sg = sg->next, sg != sd->groups);

         eenv->nrg.before = energy_before;
On Thu, Dec 17, 2015 at 07:19:00PM +0000, Dietmar Eggemann wrote:
On 10/11/15 09:58, Leo Yan wrote:
When calculating the energy difference, EAS computes the energy twice: the first time for the state before the task's migration, and the second time for the state after the task has been migrated to the selected CPU. There is a high chance that these two calculations do not run in exactly the same context, e.g. the CPU frequency may have been changed by the CPUFreq governor in between. The calculated energy difference is then skewed and can no longer be trusted.

So this patch creates a snapshot of each CPU's utilization; both subsequent calculations are based on this snapshot, which ensures they run against the same context.
If you do this, IMHO you're just changing from one scenario in which you base your calculation on erroneous data to another. In your case, if the system underneath EAS has changed, you're operating on stale data for before _and_ after now.
The result we base the task placement decision on is a difference of energy, and it is impossible to calculate it atomically. So what difference does it make to have the error in the resulting energy number, compared to the case where you calculate the energy number purely on stale data but the system you put the task on has changed in the meantime? The worst-case scenario is a wrong task placement. The whole cfs code is designed in such a way that you have to be able to work with stale data for the sched domain/sched group statistics. I don't see an advantage right now.
The key thing is: we calculate the energy diff by calling sched_group_energy() twice, but between these two calls the CPU's frequency can change.

For example, if the CPU is running at 200MHz during the first call to sched_group_energy() and at 1.2GHz during the second call, the two energy results will differ widely. In the end I can see many wrong CPU selection decisions.
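A rough worked example with made-up numbers (hypothetical capacity/power values, not Hikey's real energy model) shows the size of the distortion. Assume nothing about the placement actually changes between the two calls, so the diff should be ~0:

        first call  (200MHz OPP):  util reads 150, cap = 200
                                   -> norm = 150/200  = 0.75 -> 0.75 * 100 mW =  75 mW
        second call (1.2GHz OPP):  util reads 600, cap = 1024
                                   -> norm = 600/1024 ~ 0.59 -> 0.59 * 500 mW ~ 293 mW

The ~218 mW "difference" is then entirely an artifact of the OPP change, not of the task placement being evaluated.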
I suppose this error happens with much higher probability on an SMP platform (like Hikey) than on a big.LITTLE system. A big.LITTLE system has no power-efficiency overlap between the two clusters, so the final energy-difference calculation is not impacted (even if a LITTLE CPU suddenly changes to its highest OPP).

So the snapshot just makes sure sched_group_energy() is called twice with exactly the same context, so that the energy-difference calculation is effectively atomic.
Thanks, Leo Yan
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63aef51..01bb3f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4890,6 +4890,11 @@ static inline bool energy_aware(void)
         return sched_feat(ENERGY_AWARE);
 }

+struct energy_env_snapshot {
+        int util[8];
+        int block[8];
+};
+
 struct energy_env {
         struct sched_group      *sg_top;
         struct cpumask          sg_cap;
@@ -4900,6 +4905,8 @@ struct energy_env {
         int                     energy;
         int                     energy_payoff;
         struct task_struct      *task;
+        struct energy_env_snapshot snapshot;
+
         struct {
                 int before;
                 int after;
@@ -4913,6 +4920,24 @@ struct energy_env {
         } cap;
 };

+static unsigned long __eas_get_cpu_usage(struct energy_env *eenv, int cpu, int delta)
+{
+        int sum;
+        unsigned long usage = eenv->snapshot.util[cpu];
+        unsigned long blocked = eenv->snapshot.block[cpu];
+        unsigned long capacity_orig = capacity_orig_of(cpu);
+
+        sum = usage + blocked + delta;
+
+        if (sum < 0)
+                return 0;
+
+        if (sum >= capacity_orig)
+                return capacity_orig;
+
+        return sum;
+}
+
 /*
  * __cpu_norm_usage() returns the cpu usage relative to a specific capacity,
  * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
@@ -4927,9 +4952,10 @@ struct energy_env {
  *
  *   norm_usage = running_time/time ~ usage/capacity
  */
-static unsigned long __cpu_norm_usage(int cpu, unsigned long capacity, int delta)
+static unsigned long __cpu_norm_usage(struct energy_env *eenv, int cpu,
+                                      unsigned long capacity, int delta)
 {
-        int usage = __get_cpu_usage(cpu, delta);
+        int usage = __eas_get_cpu_usage(eenv, cpu, delta);

         if (usage >= capacity)
                 return SCHED_CAPACITY_SCALE;
@@ -4954,7 +4980,7 @@ unsigned long group_max_usage(struct energy_env *eenv)

         for_each_cpu(i, &eenv->sg_cap) {
                 delta = calc_usage_delta(eenv, i);
-                max_usage = max(max_usage, __get_cpu_usage(i, delta));
+                max_usage = max(max_usage, __eas_get_cpu_usage(eenv, i, delta));
         }

         return max_usage;
@@ -4978,7 +5004,7 @@ long group_norm_usage(struct energy_env *eenv, struct sched_group *sg)

         for_each_cpu(i, sched_group_cpus(sg)) {
                 delta = calc_usage_delta(eenv, i);
-                usage_sum += __cpu_norm_usage(i, capacity, delta);
+                usage_sum += __cpu_norm_usage(eenv, i, capacity, delta);
         }

         if (usage_sum > SCHED_CAPACITY_SCALE)
@@ -5135,6 +5161,7 @@ static int energy_diff(struct energy_env *eenv)
         struct sched_domain *sd;
         struct sched_group *sg;
         int sd_cpu = -1, energy_before = 0, energy_after = 0;
+        int i;

         struct energy_env eenv_before = {
                 .usage_delta    = 0,
@@ -5154,6 +5181,17 @@ static int energy_diff(struct energy_env *eenv)
                 return 0; /* Error */

         sg = sd->groups;
+
+        for_each_cpu(i, sched_domain_span(sd)) {
+                eenv_before.snapshot.util[i] =
+                        cpu_rq(i)->cfs.utilization_load_avg;
+                eenv_before.snapshot.block[i] =
+                        cpu_rq(i)->cfs.utilization_blocked_avg;
+
+                eenv->snapshot.util[i] = eenv_before.snapshot.util[i];
+                eenv->snapshot.block[i] = eenv_before.snapshot.block[i];
+        }
+
         do {
                 if (eenv->src_cpu != -1 && cpumask_test_cpu(eenv->src_cpu,
                                                 sched_group_cpus(sg))) {
Hi Leo,
On 12/18/2015 03:29 AM, Leo Yan wrote:
On Thu, Dec 17, 2015 at 07:18:56PM +0000, Dietmar Eggemann wrote:
On 10/11/15 09:58, Leo Yan wrote:
[...]
Your idea is better than my patch, because it maps more clearly onto the hardware topology. I also have some questions below so I can understand your patch well :)
(1) Hold cluster energy model (EM) data on a single-cluster system
For a single-cluster system, there is only the 'MC' sched domain (at the cpu level), so there is no 'CPU' sched domain to hold the cluster's EM data. So we should
Yes, this is the plan for a single-cluster EAS system. (s/'CPU'/'DIE')
add the sched domain 'SYS' to hold the cluster's EM data, right?
Yeah, something like this. You only need the SYS sd with one sg spanning the whole system; no EM info is needed on this sd for Hikey.
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 5883f9e262ec..805627f61fc5 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -354,6 +354,7 @@ static struct sched_domain_topology_level arm64_topology[] = {
         { cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
 #endif
         { cpu_cpu_mask, NULL, cpu_cluster_energy, SD_INIT_NAME(DIE) },
+        { cpu_cpu_mask, NULL, NULL, SD_INIT_NAME(SYS) },
         { NULL, },
 };
In the meantime I played a little bit with this patch and I think you have to add some code to cover your functionality of having a top sd with one sg spanning all cpus. The original patch only covers the cases where individual cpus are offline on a big.LITTLE system, and a single-cluster system where we want to attach EM data to the SYS sd.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 487bb3c4627b..8b6cb48b8985 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5929,7 +5929,9 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
                                 SD_PREFER_SIBLING |
                                 SD_SHARE_POWERDOMAIN |
                                 SD_SHARE_CAP_STATES);
-        if (parent->groups->sge) {
+        if (parent->groups->sge ||
+            (sd->groups->sge && parent->groups == parent->groups->next &&
+             parent->span_weight == num_online_cpus())) {
                 parent->flags &= ~SD_LOAD_BALANCE;
                 return 0;
         }
The additional if condition is necessary since you don't want to attach EM data (parent->groups->sge) onto your SYS sd.
JUNO looks like this with these two changes on top of patch 'sched: EAS & cpu hotplug interoperability':
root@genericarmv8:~# cat /proc/schedstat (only for cpu0)
cpu0 0 0 4436 1854 3368 1751 1034319260 310590900 2555
domain0 39 1136 1136 0 0 0 0 0 1136 1 1 0 0 0 0 0 1 1436 1335 94 ...
domain1 3f 900 852 46 25311 2 0 1 770 0 0 0 0 0 0 0 0 1429 1235 ...
domain2 3f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/{name,flags}
MC
DIE
SYS
33583
4143
4142 <-- !SD_LOAD_BALANCE
(2) Offer this 'sg_shared_cap = sd->parent->groups' thing on a platform like Hikey
(3) Let cpu and cluster level EM data survive if cpus get off-lined
Just curious: what benefit do we get from this? When a CPU or a whole cluster has been off-lined, we should not care about their EM data anymore.
If the entire cluster is offline, then yes, we shouldn't take these cpus or the cluster contribution into consideration. But I was talking about situations where all but one cpu of a cluster are offline, or about the cpus in the other cluster of a two-cluster system.
From the patch header:
For Energy-Aware Scheduling (EAS) to work properly, even in the case that cpus are hot-plugged out, the energy model (EM) data on all energy-aware sched domains has to be present for all online cpus.
Mainline sd hierarchy setup code will remove sd's which are not useful for task scheduling, e.g. in the following situations:
1. Only one cpu remains in one cluster of a two cluster system.
This remaining cpu only has DIE and no MC sd.
2. A complete cluster in a two-cluster system is hot-plugged out.
The cpus of the remaining cluster only have MC and no DIE sd.
Obviously, this sd level can't be part of the actual scheduling decisions so CFS code has to change slightly for this to work. I attached the appropriate patch. It's only tested lightly on TC2 and JUNO and might not apply on the current code line.
It would be nice if you could test it and give feedback on whether it's something you're fine with. Thanks!
Do you suggest I test this based on RFCv5 or RFCv6?
Definitely RFCv5.2. EAS RFCv6 does not exist yet.
--
Dietmar
[...]