### Basic Ideas ###
This patch set is rebased on EASv1.2 for power optimization on Hikey960.
ARM big.LITTLE systems come in many variants. On some platforms all clusters use the same CPU architecture, but each cluster has a different manufacturing process (or clock design), so the clusters can have different OPP settings. On this kind of system the 'LITTLE' cores and 'big' cores share the same micro-architecture, but we can still gain power benefit from the 'LITTLE' cores because they have better power efficiency than the 'big' cores at the same OPP. On the other hand, on such systems the 'LITTLE' cores' power efficiency usually does not differ hugely from the 'big' cores'; furthermore, the final CPU power saving percentage at system level is discounted twice, so when we optimize power for some scenarios the improvement may not be as significant as expected. In other words, power optimization is not a priority issue on these platforms.
Regarding how the CPU power saving is discounted at the whole-system level: the first discount is the CPU duty cycle, and the second discount is the SoC/board baseline power. We can estimate the system-level CPU power saving percentage with the formula below:
  CPU power saving percentage:                     CPU_PS%
  CPU duty cycle:                                  CPU_DC%
  Ratio between CPU power and whole system power:  CPU_SYS%
So the estimated system-level power saving percentage is:

  CPU_PS% * CPU_DC% * CPU_SYS%
Let's look at one example: we have two CA53 clusters, and the 'LITTLE' cluster is 30% more power efficient than the 'big' cluster, so CPU_PS% = 30%; video playback (1080p) has a CPU duty cycle of CPU_DC% = 30% (1 core); the ratio between CPU power and system power is CPU_SYS% = 15%. So by using a 'LITTLE' core instead of a 'big' core we can save:
CPU_PS% * CPU_DC% * CPU_SYS% = 30% * 30% * 15% = 1.35%
Naturally 1.35% does not look like a significant improvement; but for some cases there is the concept of delta power, and comparing against the delta power shows the importance of the saving. Taking video playback as an example, the delta power percentage (DP%) is:
DP% = (video_playback_power - home_screen_power) / video_playback_power
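As a worked illustration (the power numbers here are assumed, chosen only to match the DP% used below): with video_playback_power = 1000mW and home_screen_power = 850mW,

  DP% = (1000 - 850) / 1000 = 15%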
So DP% is an important criterion for phone models when evaluating scenarios against the Android idle system. If we take DP% = 15%, then a power saving of 1.35% is meaningful when compared against that 15%.
Another kind of big.LITTLE system has very different power efficiency between clusters. If we review the power efficiency on Hikey960, looking at the coefficient (mW/MHz), in the worst case the CA73 is 6.2 times the CA53; if we select the median OPPs as reference, the CA73 is 2.42 times the CA53. So the highest CPU_PS%(max) = 86% and the median CPU_PS%(median) = 70%. Applying the formula above, in theory on Hikey960 we can save:
  CPU_PS%(max)    * CPU_DC% * CPU_SYS% = 86% * 30% * 15% = 3.87%
  CPU_PS%(median) * CPU_DC% * CPU_SYS% = 70% * 30% * 15% = 3.15%
We can see that a power saving percentage of 3.87%/3.15% is significant relative to DP% (15%).
So on Hikey960, for scenarios with a high CPU duty cycle and sustained power consumption, power optimization is important.
### Implementations ###
Below is the detailed implementation of the optimizations:
a) Add back CPU selection based on power efficiency
EASv1.2 has the function find_best_target(); this function mainly focuses on finding the idlest CPU to reduce scheduling latency, but in some cases it misses selecting the most power-efficient CPU.

So patches 0001/0002 mainly add back CPU selection based on power efficiency; we still keep the function find_best_target(), but it is only used for the "boost" and "prefer_idle" cases, while the power efficiency path is used for normal cases.
b) EAS core algorithm optimization
The EAS core algorithm should resolve the problems below:

1. Support more than two clusters;
2. Keep CPUs at the lowest OPP as far as possible, and pack small tasks
   when the system is idle;
3. Directly migrate a woken task to the best CPU to meet its performance
   requirement; this means the task can be migrated to a higher capacity
   CPU and vice versa;
4. Consistent results for energy calculation and a simple implementation.
Patches 0003/0004 change CPU selection to a per-cluster basis; the scheduler first selects candidate CPUs within every cluster, so every cluster contributes at most one candidate CPU (or none, if no CPU in the cluster can meet the requirement). Finally, all energy difference calculations happen among these candidate CPUs. This gives us several benefits. The first one is that the scheduler is no longer coupled with the previous CPU; the old code always compared energy between the previous CPU and a new possible CPU, but in some cases the previous CPU is a completely wrong CPU for the task, so the comparison is actually pointless. Applying patches 0003/0004 also introduces one side effect: a task can be directly migrated from a lower capacity CPU to a higher capacity CPU (LITTLE -> big). This usually did not happen with the old code, because in the energy comparison the lower capacity CPU could beat the higher capacity CPU, so the task missed the chance to migrate to the higher capacity CPU.
Patches 0005/0006 select the best CPU within a cluster. In the task wakeup path, the EAS core algorithm is responsible for CPU selection; it should achieve two targets: keep CPUs at the lowest OPP as far as possible, and spread tasks if we can predict that the OPP is likely to increase after placing the woken task on one specific CPU. So patches 0005/0006 find the CPU which has the lowest OPP and the highest utilization compared with other CPUs at the same OPP; this way we can rely on the EAS core algorithm to spread tasks when a CPU's OPP would increase, and to pack tasks after the CPU has decreased to the lowest OPP.
After applying patches 0003/0004, there are many more energy comparisons between one big CPU and one LITTLE CPU. As a result, the EAS core algorithm was observed to be fragile in some corner cases. So patches 0007/0008/0009 make the energy calculation more robust; in particular, patches 0008/0009 introduce an extra signal "util_waken_avg". By using this signal we can remove the woken task's utilization from the CPU's utilization, so finally all CPU signals are cleaned of the woken task's stale utilization value.
Patch 0010 is a significant change for the EAS core algorithm; the main idea is to change the energy calculation from CPU oriented to task oriented. Based on the energy model we can easily answer the question: if we place the task onto one specific CPU, how much power is consumed by this task? So essentially we can calculate the energy the task consumes on a specific CPU, learn the power consumption for every possible CPU, and finally filter out which CPU saves the most power. After changing to task-oriented energy calculation, it is also smoother to generate the perf idx and energy idx on a task-oriented rather than CPU-oriented basis, so hopefully this can also benefit the schedTune PE filter.
c) Tipping point optimization
The power saving optimization mainly focuses on how to defer the system tipping point, so that the energy aware path can be enabled in most cases; but deferring the tipping point also hurts performance if the system cannot get over the tipping point for overloaded scenarios (like benchmarks).
So the target is: optimize power without performance regression.
Patch 0011 is Thara's patch v1 "Per Sched domain over utilization"; the patch gives a good method for storing the per-sched-domain flag. I tweaked it with the criteria below for over-utilization (a standalone sketch of the three checks follows the list):
1. If a single CPU has more than 80% util, set the lowest level sched
   domain as 'overutilized'; this is the tipping point for the 'inner
   overutilized' flag.

2. If any CPU has a 'misfit' task, or the cluster's overall util is more
   than 80% of the cluster's overall capacity, set the parent level sched
   domain as 'overutilized'; this is the tipping point for the 'outer
   overutilized' flag.

3. If the overall util is more than 50% of the overall capacity of all
   CPUs, set the root domain's 'overutilized' flag. The 50% is actually a
   quite high bar: e.g. with two clusters it means the overall util is
   more than half the combined capacity, i.e. it has gone beyond one whole
   cluster's capacity, so we kick the 'global' tipping point and spread
   tasks across the two clusters.
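To make these three criteria concrete, below is a minimal standalone sketch; this is illustration only, not the patch code, and the topology, per-CPU capacity and utilization numbers are all assumed. With the assumed numbers it reports only cluster0 as 'inner overutilized' (one CPU above 80% of its capacity):

/*
 * Standalone sketch (not the patch code) of the three over-utilization
 * criteria; the topology and utilization numbers below are assumed.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS     8
#define NR_CLUSTERS 2
#define CAP         1024        /* per-CPU capacity, assumed uniform */

int main(void)
{
        unsigned long util[NR_CPUS] = { 900, 100, 50, 0, 300, 0, 0, 0 };
        int cluster[NR_CPUS]        = {   0,   0,  0, 0,   1, 1, 1, 1 };
        bool misfit[NR_CPUS]        = { false };

        unsigned long cluster_util[NR_CLUSTERS] = { 0 }, total_util = 0;
        int cluster_cpus[NR_CLUSTERS] = { 0 };
        bool inner[NR_CLUSTERS] = { false }, outer[NR_CLUSTERS] = { false };

        for (int i = 0; i < NR_CPUS; i++) {
                int c = cluster[i];

                cluster_util[c] += util[i];
                cluster_cpus[c]++;
                total_util += util[i];

                /* 1. single CPU util > 80%: 'inner overutilized' */
                if (util[i] * 100 > CAP * 80)
                        inner[c] = true;

                /* 2a. misfit task: 'outer overutilized' */
                if (misfit[i])
                        outer[c] = true;
        }

        for (int c = 0; c < NR_CLUSTERS; c++) {
                /* 2b. cluster util > 80% of cluster capacity: 'outer' */
                if (cluster_util[c] * 100 > (unsigned long)cluster_cpus[c] * CAP * 80)
                        outer[c] = true;

                printf("cluster%d: inner=%d outer=%d\n", c, inner[c], outer[c]);
        }

        /* 3. overall util > 50% of all CPUs' capacity: 'global' tipping point */
        printf("global=%d\n", total_util * 100 > (unsigned long)NR_CPUS * CAP * 50);

        return 0;
}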
So with the per-sched-domain flags we can defer the 'global' tipping point and rely on them as the switch for the energy aware path. The patch also moves the energy aware function to the beginning of the wakeup path, so the function energy_aware_wake_cpu() gets more chances to execute while the system is under the tipping point; only when the system is over the tipping point does it fall back to the traditional wakeup balance to select the idlest CPU.
### Testing result ###
On Hikey960, below is the testing environment:

- Android AOSP kernel 4.4
  https://android.googlesource.com/kernel/hikey-linaro
  branch: android-hikey-linaro-4.4
- CPUFreq governor: sched-freq
- Fixed DDR: 400MHz
- Fixed GPU: 533MHz
- HDMI: unplugged
- WIFI: disabled
Please note: video playback (1080p) uses a software codec with the VLC player on Android; camera recording uses the synthesized workload camera-long.json to simulate the camera scenario.
Test_Case         Referenced_Phone  PELT_Optimized  PELT_Optimized  WALT_Optimized  WALT_Optimized
                  (mW) [*]          (mW)            (Percentage)    (mW)            (Percentage)
homescreen         800                -5.05 [**]     -0.63%          -10.46          -1.31%
Audio(MP3)         200 (LCD OFF)       5.33           2.66%           60.62          30.31%
Video(1080p)      1000               133.09          13.31%           26.10           2.61%
Camera Recording  2000               163.94           8.20%          -79.57 [***]    -3.98%
[*] The referenced phone is not any specific phone model; here I give some very rough power data for well optimized commercial phones. These data are based on past experience and are not very precise. The power data are measured at the battery measurement point at 4.2V.
[**] Positive value: power reduced by this patch set
     Negative value: power increased by this patch set
[***] The Camera Recording + WALT power data is much worse with this patch set; this will be explained in the "Conclusion" section.
Testing raw data: http://people.linaro.org/~leo.yan/eas_upstream/hikey960_result/
### Conclusion ###
Firstly, Hikey960 is a good candidate platform for verifying power saving optimization :)
This patch set with the PELT signal gives good results on Hikey960, especially for video playback (saving 133.09mW) and camera recording (saving 163.94mW). For audio playback it saves 5.33mW; for homescreen there is a slight regression (an increase of 5.05mW). I suspect this is related to task packing on the LITTLE cores, but it needs further investigation.
This patch set with the WALT signal gives good results for audio playback and video playback, but it is broken for the camera recording case. After reviewing the trace log, the main issue is that many tasks' WALT signals reach into the range 100~200, so there are many comparisons between the LITTLE CPU at 1844MHz and the big CPU at 903MHz. From the power modeling parameters, the big CPU at 903MHz has a lower power cost than the LITTLE CPU at 1844MHz, so tasks are frequently migrated onto the big cores. Compared with the WALT signal, the PELT signal co-works well with the power modeling parameters, so we can see the energy aware algorithm avoids migrating tasks to big cores too easily. (This seems to me to raise the question: which signal best matches the EAS core algorithm?)
Some known issues:
- The CPUFreq governor impacts power consumption a lot; sched-freq easily ramps up to 1844MHz, so we need to check whether there is a mechanism to optimize the policy and reduce the chance of setting 1844MHz;

  A further test is to try other governors: schedutil, interactive.
- RT threads are not energy aware yet, so they get migrated to big cores;
- The load balancing flow has no energy aware optimization;
- The DDR frequency is fixed for now; if DDR frequency scaling is enabled, the power modeling will change significantly. This needs a devfreq driver for DDR, and the power modeling must be tuned for it.
- Though these patches have been verified on Juno with no harm to performance, a performance comparison still needs to be done on Hikey960.
Leo Yan (12):
  sched/fair: add function find_nrg_efficient_target()
  sched/fair: enable energy efficiency selection
  sched/fair: use new function to select CPU from sched group
  sched/fair: select candidate CPUs by cluster basis
  sched/fair: refine find_new_capacity()
  sched/fair: optimize CPU selection with lowest OPP
  sched/fair: increase resolution for energy calculation
  sched/fair: introduce signal util_waken_avg for CPU
  sched/fair: select idle CPU as backup for waken up path
  sched/fair: task oriented energy calculation
  sched/fair: update idle CPU blocked load in update_sg_lb_stats()
  sched/fair: add trace event for sched group energy
Thara Gopinath (1):
  Per Sched domain over utilization
 include/linux/sched.h        |   2 +-
 include/trace/events/sched.h |  45 ++++
 kernel/sched/fair.c          | 593 +++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h         |   1 +
 4 files changed, 484 insertions(+), 157 deletions(-)
 mode change 100644 => 100755 include/trace/events/sched.h
 mode change 100644 => 100755 kernel/sched/fair.c

--
1.9.1
EASv1.2 unified the CPU selection for woken tasks with the function find_best_target(); this function tries to select an idle CPU or the CPU with the lowest utilization as the candidate CPU, which reduces scheduling latency and boosts performance for interactive scenarios.

On the other hand, this function is not comprehensive for power saving, and it is fragile on big.LITTLE cluster systems.

E.g. if the "prefer_idle" flag is set, and a small task that was running on a big core is woken up after sleeping, the function find_best_target() iterates the scheduling groups to find the best target for this task. The first iteration starts from the previous CPU's scheduling group, so if the previous CPU is a big core it iterates the big cores' group first. As a result it has a much higher chance to select an idling big core and bail out directly, rather than select an idle CPU from the LITTLE cluster.

Another case: if the "prefer_idle" flag is cleared, the first iteration covers the scheduling group for the LITTLE cluster and the second iteration the group corresponding to the big cluster; the function find_best_target() then has a high chance to select a big CPU for the variables 'target_cpu' and 'best_idle_cpu'. This is not an optimal selection from the power saving perspective; it means we may miss the chance to select a LITTLE CPU at a lower OPP.

So this patch adds back the function find_nrg_efficient_target(), which selects the CPU from the energy efficiency perspective.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9afae4..45b4080 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6310,6 +6310,73 @@ static inline int find_best_target(struct task_struct *p, bool boosted, bool pre
 	return target_cpu;
 }
+static inline int find_nrg_efficient_target(struct task_struct *p,
+					    struct sched_domain *sd)
+{
+	struct sched_group *sg, *sg_target;
+	int target_max_cap = INT_MAX;
+	int target_cpu = task_cpu(p);
+	unsigned long task_util_boosted, new_util;
+	int i;
+
+	sg = sd->groups;
+	sg_target = sg;
+
+	/*
+	 * Find group with sufficient capacity. We only get here if no cpu is
+	 * overutilized. We may end up overutilizing a cpu by adding the task,
+	 * but that should not be any worse than select_idle_sibling().
+	 * load_balance() should sort it out later as we get above the tipping
+	 * point.
+	 */
+	do {
+		/* Assuming all cpus are the same in group */
+		int max_cap_cpu = group_first_cpu(sg);
+
+		/*
+		 * Assume smaller max capacity means more energy-efficient.
+		 * Ideally we should query the energy model for the right
+		 * answer but it easily ends up in an exhaustive search.
+		 */
+		if (capacity_of(max_cap_cpu) < target_max_cap &&
+		    task_fits_max(p, max_cap_cpu)) {
+			sg_target = sg;
+			target_max_cap = capacity_of(max_cap_cpu);
+		}
+	} while (sg = sg->next, sg != sd->groups);
+
+	task_util_boosted = boosted_task_util(p);
+	/* Find cpu with sufficient capacity */
+	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
+		/*
+		 * p's blocked utilization is still accounted for on prev_cpu
+		 * so prev_cpu will receive a negative bias due to the double
+		 * accounting. However, the blocked utilization may be zero.
+		 */
+		new_util = cpu_util(i) + task_util_boosted;
+
+		/*
+		 * Ensure minimum capacity to grant the required boost.
+		 * The target CPU can be already at a capacity level higher
+		 * than the one required to boost the task.
+		 */
+		if (new_util > capacity_orig_of(i))
+			continue;
+
+		if (new_util < capacity_curr_of(i)) {
+			target_cpu = i;
+			if (cpu_rq(i)->nr_running)
+				break;
+		}
+
+		/* cpu has capacity at higher OPP, keep it as fallback */
+		if (target_cpu == task_cpu(p))
+			target_cpu = i;
+	}
+
+	return target_cpu;
+}
+
 /*
  * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
  * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
--
1.9.1
This patch enables the energy efficiency selection path, and renames the function find_best_target() to find_idlest_target() so it is only used when the "prefer_idle" flag is set or when the task has a boost margin > 0.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 45b4080..ecc156c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6186,7 +6186,8 @@ static int start_cpu(bool boosted)
 	return boosted ? rd->max_cap_orig_cpu : rd->min_cap_orig_cpu;
 }
-static inline int find_best_target(struct task_struct *p, bool boosted, bool prefer_idle)
+static inline int find_idlest_target(struct task_struct *p, bool boosted,
+				     bool prefer_idle)
 {
 	int target_cpu = -1;
 	unsigned long target_util = prefer_idle ? ULONG_MAX : 0;
@@ -6434,15 +6435,18 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 		goto unlock;
 	/* Find a cpu with sufficient capacity */
-	tmp_target = find_best_target(p, boosted, prefer_idle);
+	if (boosted || prefer_idle) {
+		tmp_target = find_idlest_target(p, boosted, prefer_idle);
+		if (tmp_target >= 0)
+			target_cpu = tmp_target;
-	if (tmp_target >= 0) {
-		target_cpu = tmp_target;
-		if ((boosted || prefer_idle) && idle_cpu(target_cpu)) {
+		if (prefer_idle && idle_cpu(target_cpu)) {
 			schedstat_inc(p, se.statistics.nr_wakeups_secb_idle_bt);
 			schedstat_inc(this_rq(), eas_stats.secb_idle_bt);
 			goto unlock;
 		}
+	} else {
+		target_cpu = find_nrg_efficient_target(p, sd);
 	}
 	if (target_cpu != prev_cpu) {
--
1.9.1
This patch adds the new function energy_aware_select_candidate_cpu(), which selects a candidate CPU from a sched group. The function uses the same logic as before, but having this dedicated function helps the subsequent optimizations.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 73 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 43 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecc156c..47f6365 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6311,14 +6311,51 @@ static inline int find_idlest_target(struct task_struct *p, bool boosted,
 	return target_cpu;
 }
+static int energy_aware_select_candidate_cpu(struct task_struct *p,
+					     struct sched_group *sg)
+{
+	int i, cpu = -1;
+	unsigned long task_util_boosted, new_util;
+
+	task_util_boosted = boosted_task_util(p);
+
+	/* Find cpu with sufficient capacity */
+	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg)) {
+		/*
+		 * p's blocked utilization is still accounted for on prev_cpu
+		 * so prev_cpu will receive a negative bias due to the double
+		 * accounting. However, the blocked utilization may be zero.
+		 */
+		new_util = cpu_util(i) + task_util_boosted;
+
+		/*
+		 * Ensure minimum capacity to grant the required boost.
+		 * The target CPU can be already at a capacity level higher
+		 * than the one required to boost the task.
+		 */
+		if (new_util > capacity_orig_of(i))
+			continue;
+
+		if (new_util < capacity_curr_of(i)) {
+			cpu = i;
+			if (cpu_rq(i)->nr_running)
+				break;
+		}
+
+		/* cpu has capacity at higher OPP, keep it as fallback */
+		if (cpu == task_cpu(p))
+			cpu = i;
+	}
+
+	return cpu;
+}
+
 static inline int find_nrg_efficient_target(struct task_struct *p,
 					    struct sched_domain *sd)
 {
 	struct sched_group *sg, *sg_target;
 	int target_max_cap = INT_MAX;
-	int target_cpu = task_cpu(p);
-	unsigned long task_util_boosted, new_util;
-	int i;
+	int target_cpu = task_cpu(p), cpu;
 	sg = sd->groups;
 	sg_target = sg;
@@ -6346,34 +6383,10 @@ static inline int find_nrg_efficient_target(struct task_struct *p,
 		}
 	} while (sg = sg->next, sg != sd->groups);
-	task_util_boosted = boosted_task_util(p);
-	/* Find cpu with sufficient capacity */
-	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
-		/*
-		 * p's blocked utilization is still accounted for on prev_cpu
-		 * so prev_cpu will receive a negative bias due to the double
-		 * accounting. However, the blocked utilization may be zero.
-		 */
-		new_util = cpu_util(i) + task_util_boosted;
-
-		/*
-		 * Ensure minimum capacity to grant the required boost.
-		 * The target CPU can be already at a capacity level higher
-		 * than the one required to boost the task.
-		 */
-		if (new_util > capacity_orig_of(i))
-			continue;
-		if (new_util < capacity_curr_of(i)) {
-			target_cpu = i;
-			if (cpu_rq(i)->nr_running)
-				break;
-		}
-
-		/* cpu has capacity at higher OPP, keep it as fallback */
-		if (target_cpu == task_cpu(p))
-			target_cpu = i;
-	}
+	cpu = energy_aware_select_candidate_cpu(p, sg_target);
+	if (cpu != -1)
+		target_cpu = cpu;
 	return target_cpu;
 }
--
1.9.1
In the old code the energy aware flow compares energy between two different CPUs and selects the most power efficient CPU for the woken task. But there are several cases the old code can hardly handle:

- The old code compares the energy between two candidate CPUs, and one of them must be the previous CPU. For some big tasks the previous CPU cannot provide the required capacity, but the previous CPU is still selected because it has less energy. So from the performance perspective this case wrongly takes the previous CPU as a candidate;

- It is possible that the previous CPU is not the most power efficient CPU in the cluster (e.g. another CPU in the same cluster could run the woken task at a lower OPP); if we always compare against the previous CPU's energy, the most power efficient CPU may be missed. This case can happen when the scheduler tries to figure out whether the task should stay at the big cluster's low OPP or migrate to the LITTLE cluster's high OPP: if the previous CPU is a big core at a high OPP, then we should also consider the other CPUs in the big cluster;

- If the system has more than two clusters, the old code cannot handle this more complex case.

This patch selects candidate CPUs on a per-cluster basis: it tries to select a candidate CPU from every sched group (corresponding to every cluster on an ARM big.LITTLE platform). Finally the scheduler can select the most power efficient CPU among these candidate CPUs.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 82 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 47 insertions(+), 35 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 47f6365..487fbe5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5285,6 +5285,9 @@ static inline bool energy_aware(void)
 }
 struct energy_env {
+	cpumask_t search_cpus;
+	int target_cpu;
+
 	struct sched_group *sg_top;
 	struct sched_group *sg_cap;
 	int cap_idx;
@@ -6350,15 +6353,14 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 	return cpu;
 }
-static inline int find_nrg_efficient_target(struct task_struct *p,
-					    struct sched_domain *sd)
+static inline void find_nrg_efficient_target(struct task_struct *p,
+					     struct sched_domain *sd,
+					     struct energy_env *eenv)
 {
-	struct sched_group *sg, *sg_target;
-	int target_max_cap = INT_MAX;
-	int target_cpu = task_cpu(p), cpu;
+	struct sched_group *sg;
+	int cpu;
 	sg = sd->groups;
-	sg_target = sg;
 	/*
 	 * Find group with sufficient capacity. We only get here if no cpu is
@@ -6376,19 +6378,14 @@ static inline int find_nrg_efficient_target(struct task_struct *p,
 		 * Ideally we should query the energy model for the right
 		 * answer but it easily ends up in an exhaustive search.
 		 */
-		if (capacity_of(max_cap_cpu) < target_max_cap &&
-		    task_fits_max(p, max_cap_cpu)) {
-			sg_target = sg;
-			target_max_cap = capacity_of(max_cap_cpu);
+		if (task_fits_max(p, max_cap_cpu)) {
+			cpu = energy_aware_select_candidate_cpu(p, sg);
+			if (cpu != -1)
+				cpumask_set_cpu(cpu, &eenv->search_cpus);
 		}
 	} while (sg = sg->next, sg != sd->groups);
-
-	cpu = energy_aware_select_candidate_cpu(p, sg_target);
-	if (cpu != -1)
-		target_cpu = cpu;
-
-	return target_cpu;
+	return;
 }
 /*
@@ -6420,6 +6417,8 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 	struct sched_domain *sd;
 	int target_cpu = prev_cpu, tmp_target;
 	bool boosted, prefer_idle;
+	struct energy_env eenv;
+	int cpu;
 	schedstat_inc(p, se.statistics.nr_wakeups_secb_attempts);
 	schedstat_inc(this_rq(), eas_stats.secb_attempts);
@@ -6447,6 +6446,8 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 	if (!sd)
 		goto unlock;
+	cpumask_clear(&eenv.search_cpus);
+
 	/* Find a cpu with sufficient capacity */
 	if (boosted || prefer_idle) {
 		tmp_target = find_idlest_target(p, boosted, prefer_idle);
@@ -6458,35 +6459,46 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 			schedstat_inc(this_rq(), eas_stats.secb_idle_bt);
 			goto unlock;
 		}
+
+		cpumask_set_cpu(target_cpu, &eenv.search_cpus);
+		cpumask_set_cpu(prev_cpu, &eenv.search_cpus);
 	} else {
-		target_cpu = find_nrg_efficient_target(p, sd);
+		find_nrg_efficient_target(p, sd, &eenv);
 	}
-	if (target_cpu != prev_cpu) {
-		struct energy_env eenv = {
-			.util_delta	= task_util(p),
-			.src_cpu	= prev_cpu,
-			.dst_cpu	= target_cpu,
-			.task		= p,
-		};
+	eenv.target_cpu = -1;
-		/* Not enough spare capacity on previous cpu */
-		if (cpu_overutilized(prev_cpu)) {
-			schedstat_inc(p, se.statistics.nr_wakeups_secb_insuff_cap);
-			schedstat_inc(this_rq(), eas_stats.secb_insuff_cap);
-			goto unlock;
+	for_each_cpu(cpu, &eenv.search_cpus) {
+
+		if (eenv.target_cpu == -1) {
+			eenv.target_cpu = cpu;
+			continue;
 		}
-		if (energy_diff(&eenv) >= 0) {
-			schedstat_inc(p, se.statistics.nr_wakeups_secb_no_nrg_sav);
-			schedstat_inc(this_rq(), eas_stats.secb_no_nrg_sav);
-			target_cpu = prev_cpu;
-			goto unlock;
+		if (unlikely(!task_util(p))) {
+			if (capacity_orig_of(cpu) < capacity_orig_of(eenv.target_cpu))
+				eenv.target_cpu = cpu;
+
+			continue;
 		}
+		eenv.util_delta = task_util(p);
+		eenv.src_cpu = eenv.target_cpu;
+		eenv.dst_cpu = cpu;
+		eenv.task = p;
+
+		if (energy_diff(&eenv) < 0)
+			eenv.target_cpu = cpu;
+	}
+
+	if (eenv.target_cpu == -1) {
+		schedstat_inc(p, se.statistics.nr_wakeups_secb_no_nrg_sav);
+		schedstat_inc(this_rq(), eas_stats.secb_no_nrg_sav);
+		target_cpu = prev_cpu;
+	} else {
 		schedstat_inc(p, se.statistics.nr_wakeups_secb_nrg_sav);
 		schedstat_inc(this_rq(), eas_stats.secb_nrg_sav);
-		goto unlock;
+		target_cpu = eenv.target_cpu;
 	}
 	schedstat_inc(p, se.statistics.nr_wakeups_secb_count);
--
1.9.1
This patch refines the function find_new_capacity() with more general arguments, so it can be used as a general function to predict the OPP index. It also uses 80% as the margin when predicting the OPP index.
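As a worked illustration of the 80% margin (assuming capacity_margin = 1280, i.e. 1024 * 100 / 80, the value commonly used in EAS kernels): the condition 'cap_states[idx].cap * 1024 >= util * capacity_margin' picks the first OPP whose capacity is at least util * 1280 / 1024 = 1.25 * util. E.g. for util = 512 the first OPP with capacity >= 640 is chosen, so the utilization stays at or below 80% of the selected capacity.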
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
 mode change 100644 => 100755 kernel/sched/fair.c
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
old mode 100644
new mode 100755
index 487fbe5..42a40bf
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5383,18 +5383,20 @@ long group_norm_util(struct energy_env *eenv, struct sched_group *sg)
 	return util_sum;
 }
-static int find_new_capacity(struct energy_env *eenv,
-			     const struct sched_group_energy * const sge)
+static int find_new_capacity(const struct sched_group_energy const *sge,
+			     unsigned long util)
 {
 	int idx;
-	unsigned long util = group_max_util(eenv);
 	for (idx = 0; idx < sge->nr_cap_states; idx++) {
-		if (sge->cap_states[idx].cap >= util)
+		if (sge->cap_states[idx].cap * 1024 >=
+		    util * capacity_margin)
 			break;
 	}
-	eenv->cap_idx = idx;
+	/* roll back to max index */
+	if (idx == sge->nr_cap_states)
+		idx = idx - 1;
 	return idx;
 }
@@ -5465,7 +5467,8 @@ static int sched_group_energy(struct energy_env *eenv)
 		else
 			eenv->sg_cap = sg;
-		cap_idx = find_new_capacity(eenv, sg->sge);
+		cap_idx = find_new_capacity(sg->sge, group_max_util(eenv));
+		eenv->cap_idx = cap_idx;
 		if (sg->group_weight == 1) {
 			/* Remove capacity of src CPU (before task move) */
--
1.9.1
In the previous code, energy aware scheduling selects a CPU based on whether the CPU's current capacity can meet the woken task's requirement, and prefers a running CPU over an idle CPU to reduce wakeup latency. This has a side effect which usually creates a vicious circle: after placing the task on a CPU, the CPUFreq governor is likely to increase the frequency once it detects the higher CPU utilization; the next woken task is then also placed on the same CPU, because that CPU already has the raised frequency and is the preferred (running) CPU. As a result, the CPUFreq governor keeps increasing the frequency while more and more tasks are packed onto one CPU.

Another observed issue: if the system has one big task and several small tasks, and a LITTLE core can meet the big task's capacity requirement, the previous code is more likely to pack tasks onto one CPU; that CPU then has more chance to become overutilized, and in the end the big task is migrated to a big core even though a standalone LITTLE core could meet its capacity requirement.

So this patch is based on the two rules below for power saving:

- If the CPUs share voltage and clock, then CPUs at a higher OPP consume
  more power than at a lower OPP; so prefer to spread tasks in order to
  stay at the lowest OPP as far as possible;
- After the lowest OPP has been achieved, it is good to pack tasks so we
  can reduce the number of CPU wakeups.

This patch follows the two rules above: the most important criterion is to select the CPU which can meet the task's requirement at the lowest OPP; furthermore, it packs tasks as far as possible while keeping the CPU at the lowest OPP.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 45 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 36 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42a40bf..0486edb 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6321,18 +6321,25 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 				struct sched_group *sg)
 {
 	int i, cpu = -1;
-	unsigned long task_util_boosted, new_util;
+	int cap_idx = INT_MAX, idx;
+	unsigned long task_util_boosted, new_util, wake_util;
+	struct sched_domain *sd;
+	const struct sched_group_energy *sge;
+	int prev_cpu = task_cpu(p);
 	task_util_boosted = boosted_task_util(p);
 	/* Find cpu with sufficient capacity */
 	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg)) {
+
+		wake_util = cpu_util(i);
+
 		/*
 		 * p's blocked utilization is still accounted for on prev_cpu
 		 * so prev_cpu will receive a negative bias due to the double
 		 * accounting. However, the blocked utilization may be zero.
 		 */
-		new_util = cpu_util(i) + task_util_boosted;
+		new_util = wake_util + task_util_boosted;
 		/*
 		 * Ensure minimum capacity to grant the required boost.
@@ -6342,15 +6349,35 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 		if (new_util > capacity_orig_of(i))
 			continue;
-		if (new_util < capacity_curr_of(i)) {
-			cpu = i;
-			if (cpu_rq(i)->nr_running)
-				break;
-		}
+		/*
+		 * According to waken up task and CPU utilization, predict
+		 * the CPU OPP. So select CPU with two criterias from power
+		 * saving perspective:
+		 *
+		 *   - CPU can stay at lowest OPP as possible;
+		 *   - For same OPP, CPU has highest utilization so bias to
+		 *     pack tasks as possible.
+		 */
+		sd = rcu_dereference(per_cpu(sd_ea, i));
+		sge = sd->groups->sge;
+		idx = find_new_capacity(sge, new_util);
+
+		/* Select cpu with possible lower OPP */
+		if (cap_idx > idx) {
-		/* cpu has capacity at higher OPP, keep it as fallback */
-		if (cpu == task_cpu(p))
+			cap_idx = idx;
 			cpu = i;
+
+		/* Optimization for CPUs with same OPP */
+		} else if (cap_idx == idx) {
+
+			if (cpu == prev_cpu)
+				continue;
+
+			/* Keep previous CPU and pack tasks if possible */
+			if (i == prev_cpu || wake_util > cpu_util(cpu))
+				cpu = i;
+		}
 	}
 	return cpu;
--
1.9.1
In the old code the energy calculation results are right shifted by SCHED_CAPACITY_SHIFT, which essentially decreases the resolution of the calculation. After applying the patch "sched/fair: select candidate CPUs by cluster basis" there are many energy comparisons between one LITTLE core and one big core, and many of them wrongly place the task on the big core; this is caused by the resolution issue.
Let's use the Juno-r2 modeling parameters as an example:
        Capacity  Power
  CA53  267       37
  CA72  501       174
Below is one example of the task's and the CPUs' utilization:
  task_util(p) = 2

  cpu_util(src_cpu::ca53) = 805;    cpu_util(src_cpu::ca53)` = 797;
  cpu_util(dst_cpu::ca72) = 12;     cpu_util(dst_cpu::ca72)` = 16;
               ^                                   ^
        before task migration             after task migration
So if we compare the energy difference for moving task 'p' from 'src_cpu::ca53' to 'dst_cpu::ca72', the migration appears to save energy '1', so task 'p' was wrongly placed on the big core:
energy_delta(src_cpu::ca53) = (797 * 37) / 1024 - (805 * 37) / 1024 = -1
energy_delta(dst_cpu::ca72) = (16 * 174) / 1024 - (12 * 174) / 1024 = 0
This patch removes the right shift by SCHED_CAPACITY_SHIFT to increase the calculation resolution. With this patch applied, the case above is clearly fixed; the energy calculation becomes:
energy_delta(src_cpu::ca53) = (797 * 37) - (805 * 37) = -296
energy_delta(dst_cpu::ca72) = (16 * 174) - (12 * 174) = 696
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0486edb..d9a2969 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5487,11 +5487,9 @@ static int sched_group_energy(struct energy_env *eenv)
 		idle_idx = group_idle_state(sg);
 		group_util = group_norm_util(eenv, sg);
-		sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
-						>> SCHED_CAPACITY_SHIFT;
+		sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power);
 		sg_idle_energy = ((SCHED_LOAD_SCALE-group_util)
-						* sg->sge->idle_states[idle_idx].power)
-						>> SCHED_CAPACITY_SHIFT;
+						* sg->sge->idle_states[idle_idx].power);
 		total_energy += sg_busy_energy + sg_idle_energy;
@@ -5626,9 +5624,6 @@ normalize_energy(int energy_diff)
 	/* Do scaling using positive numbers to increase the range */
 	normalized_nrg = (energy_diff < 0) ? -energy_diff : energy_diff;
-	/* Scale by energy magnitude */
-	normalized_nrg <<= SCHED_CAPACITY_SHIFT;
-
 	/* Normalize on max energy for target platform */
 	normalized_nrg = reciprocal_divide(
 			normalized_nrg, schedtune_target_nrg.rdiv);
--
1.9.1
In the old code the scheduler uses the util_avg signal to calculate the energy difference, which has the following attribute:

The previous CPU's util_avg retains the decayed value of the woken task, so naturally we subtract the woken task's util_avg from the CPU's util_avg for the energy difference calculation. But in some cases the CPU's util_avg has decayed to 0 while the woken task's util_avg kept a big value from before it slept; the calculation then cannot reflect the energy decrease when the woken task is migrated away from this CPU.

This patch introduces the signal util_waken_avg for the CPU. It is based on Morten's patch ('sched/fair: Compute task/cpu utilization at wake-up more correctly'), so we can get every CPU's pure utilization value with the task's retained utilization completely removed; we use util_waken_avg to present this pure value. This gives a good basis for CPU utilization, so the scheduler can estimate a more accurate CPU utilization as util_waken_avg + task_util(p); this improves the correctness of the energy calculation.
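As a worked illustration (numbers assumed): a task with util_avg = 200 sleeps on CPU0 while CPU0's util_avg decays to 0. If we estimate CPU0's utilization from the raw util_avg, the "task stays" and "task migrates away" cases both read as roughly 0, so the energy saved by migrating the task away from CPU0 cannot be seen. With util_waken_avg = 0 as the pure baseline, the two cases are estimated as 0 + task_util(p) = 200 versus 0, which restores the difference.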
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 include/linux/sched.h |  2 +-
 kernel/sched/fair.c   | 27 ++++++++++++++++++---------
 2 files changed, 19 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad2c304..5b1c7d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1280,7 +1280,7 @@ struct load_weight {
 struct sched_avg {
 	u64 last_update_time, load_sum;
 	u32 util_sum, period_contrib;
-	unsigned long load_avg, util_avg;
+	unsigned long load_avg, util_avg, util_waken_avg;
 };
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9a2969..6e7279c 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5326,7 +5326,7 @@ struct energy_env {
  */
 static unsigned long __cpu_norm_util(int cpu, unsigned long capacity, int delta)
 {
-	int util = __cpu_util(cpu, delta);
+	int util = cpu_rq(cpu)->cfs.avg.util_waken_avg + delta;
 	if (util >= capacity)
 		return SCHED_CAPACITY_SCALE;
@@ -5334,12 +5334,14 @@ static unsigned long __cpu_norm_util(int cpu, unsigned long capacity, int delta)
 	return (util << SCHED_CAPACITY_SHIFT)/capacity;
 }
+static inline unsigned long task_util(struct task_struct *p);
+
 static int calc_util_delta(struct energy_env *eenv, int cpu)
 {
-	if (cpu == eenv->src_cpu)
-		return -eenv->util_delta;
-	if (cpu == eenv->dst_cpu)
-		return eenv->util_delta;
+	if (cpu == eenv->src_cpu && !eenv->util_delta)
+		return task_util(eenv->task);
+	if (cpu == eenv->dst_cpu && eenv->util_delta)
+		return task_util(eenv->task);
 	return 0;
 }
@@ -5351,7 +5353,7 @@ unsigned long group_max_util(struct energy_env *eenv)
 	for_each_cpu(i, sched_group_cpus(eenv->sg_cap)) {
 		delta = calc_util_delta(eenv, i);
-		max_util = max(max_util, __cpu_util(i, delta));
+		max_util = max(max_util, cpu_rq(i)->cfs.avg.util_waken_avg + delta);
 	}
 	return max_util;
@@ -6325,9 +6327,15 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 	task_util_boosted = boosted_task_util(p);
 	/* Find cpu with sufficient capacity */
-	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg)) {
+	for_each_cpu(i, sched_group_cpus(sg)) {
+
+		wake_util = cpu_util_wake(i, p);
-		wake_util = cpu_util(i);
+		/* update waken avg */
+		cpu_rq(i)->cfs.avg.util_waken_avg = wake_util;
+
+		if (unlikely(!cpumask_test_cpu(i, tsk_cpus_allowed(p))))
+			continue;
 		/*
 		 * p's blocked utilization is still accounted for on prev_cpu
@@ -6370,7 +6378,8 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 				continue;
 			/* Keep previous CPU and pack tasks if possible */
-			if (i == prev_cpu || wake_util > cpu_util(cpu))
+			if (i == prev_cpu ||
+			    wake_util > cpu_rq(cpu)->cfs.avg.util_waken_avg)
 				cpu = i;
 		}
 	}
--
1.9.1
In the task wakeup path, select an idle CPU as a backup for the two cases below:

- If the cluster has an idle CPU but none of its CPUs can meet the task's capacity requirement, then obviously we should fall back to the idle CPU;

- If the CPU is not staying at the lowest OPP, we should spread tasks as far as possible to give more chance for the OPP to decrease, and to avoid overly long scheduling latency from packing tasks onto the same CPU.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e7279c..9370b5b 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6317,12 +6317,13 @@ static inline int find_idlest_target(struct task_struct *p, bool boosted,
 static int energy_aware_select_candidate_cpu(struct task_struct *p,
 				struct sched_group *sg)
 {
-	int i, cpu = -1;
+	int i, cpu = -1, best_idle_cpu = -1;
 	int cap_idx = INT_MAX, idx;
 	unsigned long task_util_boosted, new_util, wake_util;
 	struct sched_domain *sd;
 	const struct sched_group_energy *sge;
 	int prev_cpu = task_cpu(p);
+	unsigned long best_idle_cpu_util = ULONG_MAX;
 	task_util_boosted = boosted_task_util(p);
@@ -6338,6 +6339,14 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 			continue;
 		/*
+		 * Find idle CPU as backup and bias to most recent sleep one
+		 */
+		if (idle_cpu(i) && wake_util < best_idle_cpu_util) {
+			best_idle_cpu = i;
+			best_idle_cpu_util = wake_util;
+		}
+
+		/*
 		 * p's blocked utilization is still accounted for on prev_cpu
 		 * so prev_cpu will receive a negative bias due to the double
 		 * accounting. However, the blocked utilization may be zero.
@@ -6352,6 +6361,7 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 		if (new_util > capacity_orig_of(i))
 			continue;
+
 		/*
 		 * According to waken up task and CPU utilization, predict
 		 * the CPU OPP. So select CPU with two criterias from power
@@ -6379,11 +6389,28 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 			/* Keep previous CPU and pack tasks if possible */
 			if (i == prev_cpu ||
-			    wake_util > cpu_rq(cpu)->cfs.avg.util_waken_avg)
+			    wake_util > cpu_rq(cpu)->cfs.avg.util_waken_avg) {
 				cpu = i;
+			}
 		}
 	}
+	/* directly return if has not found any idle CPU */
+	if (best_idle_cpu == -1)
+		return cpu;
+
+	/*
+	 * Fallback to idle CPU for two cases:
+	 *
+	 * - Have not found proper target CPU but have one idle CPU;
+	 * - The target CPU is possible to increase OPP after migrate task
+	 *   on it but have one backup idle CPU;
+	 */
+	if (cpu == -1)
+		cpu = best_idle_cpu;
+	else if (!idle_cpu(cpu) && cap_idx > 0)
+		cpu = best_idle_cpu;
+
 	return cpu;
 }
--
1.9.1
In the previous code the energy calculation is CPU focused: the main idea is to calculate how much power is consumed by the CPUs before and after the task migration. This idea inherently binds the energy comparison to a new selected CPU and the previous CPU, but in some cases the previous CPU is not an ideal CPU for the task; another shortcoming is that we never get to know how much power the task consumes when it is placed on one specific CPU.

The more intuitive method is to make the energy calculation task oriented: calculate the power consumption of the woken task on every possible CPU and select the CPU with the lowest power consumption.

This patch reworks the energy calculation to follow the task oriented idea. To achieve this it introduces a new struct task_energy to calculate the task's energy when placed on a specific CPU; struct energy_env is still used to maintain the energy comparison context between different CPUs, and can still be used for the PE filter.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 239 ++++++++++++++++++++++++++++------------------------
 1 file changed, 128 insertions(+), 111 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9370b5b..6833524 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5285,31 +5285,41 @@ static inline bool energy_aware(void)
 }
 struct energy_env {
-	cpumask_t search_cpus;
-	int target_cpu;
+	cpumask_t search_cpus;		/* possible CPUs */
+	int cpu_best;			/* best CPU */
+	int cpu_comp;			/* compared CPU */
+
+	struct task_struct *task;	/* waken task */
+	int task_util;			/* waken task util */
-	struct sched_group *sg_top;
-	struct sched_group *sg_cap;
-	int cap_idx;
-	int util_delta;
-	int src_cpu;
-	int dst_cpu;
-	int energy;
 	int payoff;
-	struct task_struct *task;
+
 	struct {
-		int before;
-		int after;
+		int best;
+		int comp;
 		int delta;
 		int diff;
 	} nrg;
+
 	struct {
-		int before;
-		int after;
+		int best;
+		int comp;
 		int delta;
 	} cap;
 };
+struct task_energy {
+	int cpu;			/* CPU */
+	struct task_struct *task;	/* waken task */
+	int task_util;			/* waken task util */
+
+	struct sched_group *sg_top;
+	struct sched_group *sg_cap;
+	int cap_idx;
+	int cap;
+	int nrg;
+};
+
 /*
  * __cpu_norm_util() returns the cpu util relative to a specific capacity,
  * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
@@ -5336,23 +5346,22 @@ static unsigned long __cpu_norm_util(int cpu, unsigned long capacity, int delta)
 static inline unsigned long task_util(struct task_struct *p);
-static int calc_util_delta(struct energy_env *eenv, int cpu)
+static int calc_util_delta(struct task_energy *tsk_nrg, int cpu)
 {
-	if (cpu == eenv->src_cpu && !eenv->util_delta)
-		return task_util(eenv->task);
-	if (cpu == eenv->dst_cpu && eenv->util_delta)
-		return task_util(eenv->task);
+	if (cpu == tsk_nrg->cpu)
+		return tsk_nrg->task_util;
+
 	return 0;
 }
 static
-unsigned long group_max_util(struct energy_env *eenv)
+unsigned long group_max_util(struct task_energy *tsk_nrg)
 {
 	int i, delta;
 	unsigned long max_util = 0;
-	for_each_cpu(i, sched_group_cpus(eenv->sg_cap)) {
-		delta = calc_util_delta(eenv, i);
+	for_each_cpu(i, sched_group_cpus(tsk_nrg->sg_cap)) {
+		delta = calc_util_delta(tsk_nrg, i);
 		max_util = max(max_util, cpu_rq(i)->cfs.avg.util_waken_avg + delta);
 	}
@@ -5369,14 +5378,14 @@ unsigned long group_max_util(struct energy_env *eenv)
  * estimate (more busy).
  */
 static unsigned
-long group_norm_util(struct energy_env *eenv, struct sched_group *sg)
+long group_norm_util(struct task_energy *tsk_nrg, struct sched_group *sg)
 {
 	int i, delta;
 	unsigned long util_sum = 0;
-	unsigned long capacity = sg->sge->cap_states[eenv->cap_idx].cap;
+	unsigned long capacity = sg->sge->cap_states[tsk_nrg->cap_idx].cap;
 	for_each_cpu(i, sched_group_cpus(sg)) {
-		delta = calc_util_delta(eenv, i);
+		delta = calc_util_delta(tsk_nrg, i);
 		util_sum += __cpu_norm_util(i, capacity, delta);
 	}
@@ -5427,16 +5436,16 @@ static int group_idle_state(struct sched_group *sg)
  * This can probably be done in a faster but more complex way.
  * Note: sched_group_energy() may fail when racing with sched_domain updates.
  */
-static int sched_group_energy(struct energy_env *eenv)
+static int sched_group_energy(struct task_energy *tsk_nrg)
 {
 	struct sched_domain *sd;
 	int cpu, total_energy = 0;
 	struct cpumask visit_cpus;
 	struct sched_group *sg;
-	WARN_ON(!eenv->sg_top->sge);
+	WARN_ON(!tsk_nrg->sg_top->sge);
-	cpumask_copy(&visit_cpus, sched_group_cpus(eenv->sg_top));
+	cpumask_copy(&visit_cpus, sched_group_cpus(tsk_nrg->sg_top));
 	while (!cpumask_empty(&visit_cpus)) {
 		struct sched_group *sg_shared_cap = NULL;
@@ -5465,30 +5474,21 @@ static int sched_group_energy(struct energy_env *eenv)
 			int cap_idx, idle_idx;
 			if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
-				eenv->sg_cap = sg_shared_cap;
+				tsk_nrg->sg_cap = sg_shared_cap;
 			else
-				eenv->sg_cap = sg;
+				tsk_nrg->sg_cap = sg;
-			cap_idx = find_new_capacity(sg->sge, group_max_util(eenv));
-			eenv->cap_idx = cap_idx;
+			cap_idx = find_new_capacity(sg->sge, group_max_util(tsk_nrg));
+			tsk_nrg->cap_idx = cap_idx;
 			if (sg->group_weight == 1) {
-				/* Remove capacity of src CPU (before task move) */
-				if (eenv->util_delta == 0 &&
-				    cpumask_test_cpu(eenv->src_cpu, sched_group_cpus(sg))) {
-					eenv->cap.before = sg->sge->cap_states[cap_idx].cap;
-					eenv->cap.delta -= eenv->cap.before;
-				}
-				/* Add capacity of dst CPU (after task move) */
-				if (eenv->util_delta != 0 &&
-				    cpumask_test_cpu(eenv->dst_cpu, sched_group_cpus(sg))) {
-					eenv->cap.after = sg->sge->cap_states[cap_idx].cap;
-					eenv->cap.delta += eenv->cap.after;
+				if (cpumask_test_cpu(tsk_nrg->cpu, sched_group_cpus(sg))) {
+					tsk_nrg->cap = sg->sge->cap_states[cap_idx].cap;
 				}
 			}
 			idle_idx = group_idle_state(sg);
-			group_util = group_norm_util(eenv, sg);
+			group_util = group_norm_util(tsk_nrg, sg);
 			sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power);
 			sg_idle_energy = ((SCHED_LOAD_SCALE-group_util)
 						* sg->sge->idle_states[idle_idx].power);
@@ -5498,7 +5498,7 @@ static int sched_group_energy(struct energy_env *eenv)
 			if (!sd->child)
 				cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
-			if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(eenv->sg_top)))
+			if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(tsk_nrg->sg_top)))
 				goto next_cpu;
 		} while (sg = sg->next, sg != sd->groups);
@@ -5516,7 +5516,7 @@ next_cpu:
 		continue;
 	}
-	eenv->energy = total_energy;
+	tsk_nrg->nrg = total_energy;
 	return 0;
 }
@@ -5532,25 +5532,25 @@ static inline bool cpu_in_sg(struct sched_group *sg, int cpu)
  * utilization is removed from or added to the system (e.g. task wake-up). If
  * both are specified, the utilization is migrated.
  */
-static inline int __energy_diff(struct energy_env *eenv)
+static inline int task_energy(struct energy_env *eenv)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int sd_cpu = -1, energy_before = 0, energy_after = 0;
-	int diff, margin;
-
-	struct energy_env eenv_before = {
-		.util_delta	= 0,
-		.src_cpu	= eenv->src_cpu,
-		.dst_cpu	= eenv->dst_cpu,
-		.nrg		= { 0, 0, 0, 0},
-		.cap		= { 0, 0, 0 },
+	int sd_cpu = -1;
+
+	struct task_energy tsk_nrg = {
+		.cpu		= eenv->cpu_comp,
+		.task		= eenv->task,
+		.task_util	= 0,
 	};
-	if (eenv->src_cpu == eenv->dst_cpu)
-		return 0;
+	struct task_energy tsk_nrg_after = {
+		.cpu		= eenv->cpu_comp,
+		.task		= eenv->task,
+		.task_util	= eenv->task_util,
+	};
-	sd_cpu = (eenv->src_cpu != -1) ? eenv->src_cpu : eenv->dst_cpu;
+	sd_cpu = eenv->cpu_comp;
 	sd = rcu_dereference(per_cpu(sd_ea, sd_cpu));
 	if (!sd)
@@ -5559,39 +5559,23 @@ static inline int __energy_diff(struct energy_env *eenv)
 	sg = sd->groups;
 	do {
-		if (cpu_in_sg(sg, eenv->src_cpu) || cpu_in_sg(sg, eenv->dst_cpu)) {
-			eenv_before.sg_top = eenv->sg_top = sg;
-
-			if (sched_group_energy(&eenv_before))
-				return 0; /* Invalid result abort */
-			energy_before += eenv_before.energy;
-
-			/* Keep track of SRC cpu (before) capacity */
-			eenv->cap.before = eenv_before.cap.before;
-			eenv->cap.delta = eenv_before.cap.delta;
-
-			if (sched_group_energy(eenv))
-				return 0; /* Invalid result abort */
-			energy_after += eenv->energy;
+		if (cpu_in_sg(sg, tsk_nrg.cpu)) {
+			tsk_nrg.sg_top = sg;
+			tsk_nrg_after.sg_top = sg;
+			break;
 		}
 	} while (sg = sg->next, sg != sd->groups);
-	eenv->nrg.before = energy_before;
-	eenv->nrg.after = energy_after;
-	eenv->nrg.diff = eenv->nrg.after - eenv->nrg.before;
-	eenv->payoff = 0;
+	if (sched_group_energy(&tsk_nrg))
+		return 0; /* Invalid result abort */
-	/*
-	 * Dead-zone margin preventing too many migrations.
-	 */
-
-	margin = eenv->nrg.before >> 6; /* ~1.56% */
-
-	diff = eenv->nrg.after - eenv->nrg.before;
+	if (sched_group_energy(&tsk_nrg_after))
+		return 0; /* Invalid result abort */
-	eenv->nrg.diff = (abs(diff) < margin) ? 0 : eenv->nrg.diff;
+	eenv->nrg.comp = tsk_nrg_after.nrg - tsk_nrg.nrg;
+	eenv->cap.comp = tsk_nrg_after.cap;
-	return eenv->nrg.diff;
+	return 0;
 }
 #ifdef CONFIG_SCHED_TUNE
@@ -5633,18 +5617,18 @@ normalize_energy(int energy_diff)
 	return (energy_diff < 0) ? -normalized_nrg : normalized_nrg;
 }
-static inline int
-energy_diff(struct energy_env *eenv)
+static inline int task_energy_diff(struct energy_env *eenv)
 {
 	int boost = schedtune_task_boost(eenv->task);
-	int nrg_delta, ret;
+	int nrg_delta, diff;
 	/* Conpute "absolute" energy diff */
-	__energy_diff(eenv);
+	eenv->nrg.diff = eenv->nrg.comp - eenv->nrg.best;
+	eenv->cap.delta = eenv->cap.comp - eenv->cap.best;
 	/* Return energy diff when boost margin is 0 */
 	if (boost == 0) {
-		ret = eenv->nrg.diff;
+		diff = eenv->nrg.diff;
 		goto out;
 	}
@@ -5665,18 +5649,34 @@ energy_diff(struct energy_env *eenv)
 	 * positive payoff, which is the condition for the acceptance of
 	 * a scheduling decision
 	 */
-	ret = -eenv->payoff;
+	diff = -eenv->payoff;
 out:
 	trace_sched_energy_diff(eenv->task,
-			eenv->src_cpu, eenv->dst_cpu, eenv->util_delta,
-			eenv->nrg.before, eenv->nrg.after, eenv->nrg.diff,
-			eenv->cap.before, eenv->cap.after, eenv->cap.delta,
-			eenv->nrg.delta, eenv->payoff);
+			eenv->cpu_best, eenv->cpu_comp, eenv->task_util,
+			eenv->nrg.best, eenv->nrg.comp, eenv->nrg.diff,
+			eenv->cap.best, eenv->cap.comp, eenv->cap.delta,
+			eenv->nrg.delta, eenv->payoff);
-	return ret;
+	return diff;
 }
 #else /* CONFIG_SCHED_TUNE */
+
+static inline int task_energy_diff(struct energy_env *eenv)
+{
+	/* Conpute "absolute" energy diff */
+	eenv->nrg.diff = eenv->nrg.comp - eenv->nrg.best;
+	eenv->cap.delta = eenv->cap.comp - eenv->cap.best;
+
+	trace_sched_energy_diff(eenv->task,
+			eenv->cpu_best, eenv->cpu_comp, eenv->task_util,
+			eenv->nrg.best, eenv->nrg.comp, eenv->nrg.diff,
+			eenv->cap.best, eenv->cap.comp, eenv->cap.delta,
+			eenv->nrg.delta, eenv->payoff);
+
+	return eenv->nrg.diff;
+}
+
 #define energy_diff(eenv) __energy_diff(eenv)
 #endif
@@ -6527,39 +6527,56 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 		find_nrg_efficient_target(p, sd, &eenv);
 	}
-	eenv.target_cpu = -1;
+	eenv.cpu_best = -1;
+	eenv.cpu_comp = -1;
+	eenv.task = p;
+	eenv.task_util = task_util(p);
+	eenv.payoff = 0;
+
+	/* directly return for only one CPU case */
+	if (cpumask_weight(&eenv.search_cpus) == 1) {
+		target_cpu = cpumask_first(&eenv.search_cpus);
+		goto unlock;
+	}
 	for_each_cpu(cpu, &eenv.search_cpus) {
-		if (eenv.target_cpu == -1) {
-			eenv.target_cpu = cpu;
+		if (eenv.cpu_best == -1) {
+			eenv.cpu_best = cpu;
+			eenv.cpu_comp = cpu;
+
+			task_energy(&eenv);
+
+			/* init energy data */
+			eenv.nrg.best = eenv.nrg.comp;
+			eenv.cap.best = eenv.cap.comp;
 			continue;
 		}
 		if (unlikely(!task_util(p))) {
-			if (capacity_orig_of(cpu) < capacity_orig_of(eenv.target_cpu))
-				eenv.target_cpu = cpu;
+			if (capacity_orig_of(cpu) < capacity_orig_of(eenv.cpu_best))
+				eenv.cpu_best = cpu;
 			continue;
 		}
-		eenv.util_delta = task_util(p);
-		eenv.src_cpu = eenv.target_cpu;
-		eenv.dst_cpu = cpu;
-		eenv.task = p;
-
-		if (energy_diff(&eenv) < 0)
-			eenv.target_cpu = cpu;
+		eenv.cpu_comp = cpu;
+		task_energy(&eenv);
+		if (task_energy_diff(&eenv) < 0) {
+			eenv.cpu_best = cpu;
+			eenv.nrg.best = eenv.nrg.comp;
+			eenv.cap.best = eenv.cap.comp;
+		}
 	}
-	if (eenv.target_cpu == -1) {
+	if (eenv.cpu_best == -1) {
 		schedstat_inc(p, se.statistics.nr_wakeups_secb_no_nrg_sav);
 		schedstat_inc(this_rq(), eas_stats.secb_no_nrg_sav);
 		target_cpu = prev_cpu;
 	} else {
 		schedstat_inc(p, se.statistics.nr_wakeups_secb_nrg_sav);
 		schedstat_inc(this_rq(), eas_stats.secb_nrg_sav);
-		target_cpu = eenv.target_cpu;
+		target_cpu = eenv.cpu_best;
 	}
 	schedstat_inc(p, se.statistics.nr_wakeups_secb_count);
--
1.9.1
From: Thara Gopinath <thara.gopinath@linaro.org>
The current implementation of over-utilization aborts energy aware scheduling if any CPU in the system is over-utilized. This patch introduces an over-utilization flag per sched group level instead of a single system-wide flag. Load balancing is done at a sched domain where any of its sched groups is over-utilized. If energy aware scheduling is enabled and no sched group in a sched domain is over-utilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level.
The implementation is based on two points:

1. For every CPU, in every sched domain, the first group is the group
   that contains the CPU itself;
2. Sched groups are shared between CPUs.

Thus if a sched group finds it needs to spread tasks, it should set the corresponding overutilized flag properly. There are three kinds of overutilized flag to consider:
- Inner overutilized: if one CPU wants to spread tasks within the same cluster, the overutilized flag is set on the first sched group of the lowest sched domain. This flag indicates that task spreading is required from this CPU, and asks the other CPUs in the lowest sched domain to take over possible tasks from it;

- Outer overutilized: if one CPU wants to spread tasks to another cluster, the overutilized flag is set on the first sched group of the parent sched domain. This ensures load balancing at the overutilized sched domain level; it means the CPU is seeking help from another cluster, so CPUs in the other cluster can migrate tasks to improve performance;

- Global overutilized: if the whole system is busy, we set the root domain flag to bypass energy aware scheduling and go back entirely to the traditional load balance. This exploits overall performance by spreading tasks as much as possible.
For example, consider a big.LITTLE system with two LITTLE CPUs (CPU A and CPU B) and two big CPUs (CPU C and CPU D). In this system the hierarchy will be as follows:
CPU A
  SD level 1 - SG1(CPUA), SG2(CPUB)
  SD level 2 - SG5(CPUA, CPUB), SG6(CPUC, CPUD)
  RD

CPU B
  SD level 1 - SG2(CPUB), SG1(CPUA)
  SD level 2 - SG5(CPUA, CPUB), SG6(CPUC, CPUD)
  RD

CPU C
  SD level 1 - SG3(CPUC), SG4(CPUD)
  SD level 2 - SG6(CPUC, CPUD), SG5(CPUA, CPUB)
  RD

CPU D
  SD level 1 - SG4(CPUD), SG3(CPUC)
  SD level 2 - SG6(CPUC, CPUD), SG5(CPUA, CPUB)
  RD
In the above system, if CPU A is not running at the lowest OPP, the overutilized flag is set on SG1 so the scheduler can load balance between CPU A and CPU B.

If CPU A is overutilized or has a misfit task, the overutilized flag is set on SG5 (the first sched group of the parent sched domain). During load balancing at SD level 2, the scheduler iterates over all sched groups' overutilized flags, and if any flag is set it executes load balancing in this sched domain.

If the system's overall utilization is bigger than 50% of the overall CPU capacity, the flag is set/checked at the root domain; this means the overall utilization has crossed at least one cluster's capacity.
[ Changed by Leo to support discrete flags for inner/outer/global over-utilization ]
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c  | 145 ++++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |   1 +
 2 files changed, 122 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6833524..2a263f7 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4658,6 +4658,68 @@ static inline void hrtick_update(struct rq *rq)
 #ifdef CONFIG_SMP
 static bool cpu_overutilized(int cpu);
 unsigned long boosted_cpu_util(int cpu);
+
+/*
+ * 1. Inner overutilized:
+ *
+ *    Load balancing will happen only at SD level 1, so this only
+ *    takes effect inside the cluster.
+ *
+ * 2. Outer overutilized:
+ *
+ *    If the CPU has a misfit task on it, there is no doubt we should
+ *    migrate the task to another higher capacity CPU.
+ *
+ *    Or, if one CPU is overutilized, we assume the scheduler has by
+ *    now done good enough work to explore the cluster's internal
+ *    capacity; an overutilized CPU means we finally need to seek
+ *    another cluster to provide more computing capacity.
+ *
+ * 3. Global overutilized:
+ *
+ *    Setting the root domain flag means exploring performance as much
+ *    as possible by spreading out tasks.
+ */
+static void set_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->groups->overutilized = true;
+}
+
+static void clear_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->groups->overutilized = false;
+}
+
+static void set_rd_overutilized(struct root_domain *rd)
+{
+	rd->overutilized = true;
+}
+
+static void clear_rd_overutilized(struct root_domain *rd)
+{
+	rd->overutilized = false;
+}
+
+static bool is_sd_overutilized(struct sched_domain *sd)
+{
+	struct sched_group *group = sd->groups;
+	int cpu = smp_processor_id();
+
+	if (cpu_rq(cpu)->rd->overutilized)
+		return true;
+
+	do {
+		if (group->overutilized)
+			return true;
+
+	} while (group = group->next, group != sd->groups);
+
+	return false;
+}
+
 #else
 #define boosted_cpu_util(cpu) cpu_util(cpu)
 #endif
@@ -4686,6 +4748,7 @@ static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
+	struct sched_domain *sd;
 	struct sched_entity *se = &p->se;
 #ifdef CONFIG_SMP
 	int task_new = flags & ENQUEUE_WAKEUP_NEW;
@@ -4758,11 +4821,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se) {
 		walt_inc_cumulative_runnable_avg(rq, p);
-		if (!task_new && !rq->rd->overutilized &&
-		    cpu_overutilized(rq->cpu)) {
-			rq->rd->overutilized = true;
-			trace_sched_overutilized(true);
+
+		rcu_read_lock();
+		sd = rcu_dereference(rq->sd);
+		if (!task_new) {
+			if (cpu_overutilized(rq->cpu) && sd)
+				set_sd_overutilized(sd);
+
+			if (rq->misfit_task && sd && sd->parent)
+				set_sd_overutilized(sd->parent);
 		}
+		rcu_read_unlock();
 		/*
 		 * We want to potentially trigger a freq switch
@@ -7754,6 +7823,7 @@ struct sd_lb_stats {
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
+	unsigned long total_util;	/* Total util of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7773,6 +7843,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.local = NULL,
 		.total_load = 0UL,
 		.total_capacity = 0UL,
+		.total_util = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
 			.sum_nr_running = 0,
@@ -8100,10 +8171,11 @@ group_type group_classify(struct sched_group *group,
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs,
-			bool *overload, bool *overutilized)
+			bool *overload, bool *misfit)
 {
 	unsigned long load;
 	int i, nr_running;
+	bool overutilized = false;
memset(sgs, 0, sizeof(*sgs));
@@ -8136,7 +8208,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->idle_cpus++;
 		if (cpu_overutilized(i)) {
-			*overutilized = true;
+			overutilized = true;
 			if (!sgs->group_misfit_task && rq->misfit_task)
 				sgs->group_misfit_task = capacity_of(i);
 		}
@@ -8153,6 +8225,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	sgs->group_no_capacity = group_is_overloaded(env, sgs);
 	sgs->group_type = group_classify(group, sgs);
+
+	if (sgs->group_weight == 1)
+		group->overutilized = overutilized;
 }
 /**
@@ -8270,7 +8345,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
-	bool overload = false, overutilized = false;
+	bool overload = false, misfit = false;
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
@@ -8292,7 +8367,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-						&overload, &overutilized);
+						&overload, &misfit);
 		if (local_group)
 			goto next_group;
@@ -8332,6 +8407,7 @@ next_group:
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
+		sds->total_util += sgs->group_util;
 		sg = sg->next;
 	} while (sg != env->sd->groups);
@@ -8346,18 +8422,28 @@ next_group:
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
-		/* Update over-utilization (tipping point, U >= 0) indicator */
-		if (env->dst_rq->rd->overutilized != overutilized) {
-			env->dst_rq->rd->overutilized = overutilized;
-			trace_sched_overutilized(overutilized);
-		}
+		/*
+		 * If total utilization is more than half of capacity,
+		 * at least the average CPU utilization is crossing half
+		 * of the max capacity CPU; so this is a quite high bar
+		 * for setting the root domain's overutilized flag.
+		 */
+		if (sds->total_capacity < sds->total_util * 2)
+			set_rd_overutilized(env->dst_rq->rd);
+		else
+			clear_rd_overutilized(env->dst_rq->rd);
 	} else {
-		if (!env->dst_rq->rd->overutilized && overutilized) {
-			env->dst_rq->rd->overutilized = true;
-			trace_sched_overutilized(true);
-		}
+		/*
+		 * If the domain util is greater than the domain
+		 * capacity, load balancing needs to be done at the
+		 * next sched domain level as well.
+		 */
+		if ((sds->total_capacity * 1024 <
+		     sds->total_util * capacity_margin) || misfit)
+			set_sd_overutilized(env->sd->parent);
+		else
+			clear_sd_overutilized(env->sd->parent);
 	}
-
 }
 /**
@@ -8600,7 +8686,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);
-	if (energy_aware() && !env->dst_rq->rd->overutilized)
+	if (energy_aware() && !is_sd_overutilized(env->sd))
 		goto out_balanced;
 	local = &sds.local_stat;
@@ -9514,6 +9600,10 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+
+		if (energy_aware() && !is_sd_overutilized(sd))
+			continue;
+
 		/*
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains. Decay ~1% per second.
@@ -9805,6 +9895,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	struct sched_domain *sd;
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -9815,12 +9906,18 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		task_tick_numa(rq, curr);
 #ifdef CONFIG_SMP
-	if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr))) {
-		rq->rd->overutilized = true;
-		trace_sched_overutilized(true);
-	}
-
 	rq->misfit_task = !task_fits_max(curr, rq->cpu);
+
+	rcu_read_lock();
+	sd = rcu_dereference(rq->sd);
+
+	if (cpu_overutilized(task_cpu(curr)) && sd)
+		set_sd_overutilized(sd);
+
+	if (rq->misfit_task && sd && sd->parent)
+		set_sd_overutilized(sd->parent);
+
+	rcu_read_unlock();
 #endif
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce364dd..c1b03a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -925,6 +925,7 @@ struct sched_group {
 	unsigned int group_weight;
 	struct sched_group_capacity *sgc;
 	const struct sched_group_energy *sge;
+	bool overutilized;
 	/*
 	 * The CPUs this group covers.
--
1.9.1
An idle CPU keeps a stale utilization value, and this value is not updated until the CPU wakes up. In the worst case, an idle CPU may stay in idle states for a very long time (possibly seconds); if the CPU had quite high utilization before entering idle, the scheduler will keep considering it "overutilized".
This is a defect in the scheduler load metrics, and as a result it misleads the scheduler's tipping point decisions. E.g., the scheduler calls update_sg_lb_stats() to iterate over all CPUs to make sure none is overutilized, and then clears the flag to indicate the system is under the tipping point; if any idle CPU has a stale utilization value that unfortunately reads as "overutilized", update_sg_lb_stats() will wrongly consider the system to be over the tipping point, even though the idle CPU has been in idle states for a long time.
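For scale, recall PELT decays utilization with a half-life of 32ms (y^32 = 0.5). A userspace back-of-envelope sketch of how far the stale value drifts from reality (decayed_util() is a hypothetical helper, not kernel code):

#include <math.h>
#include <stdio.h>

/* utilization remaining after 'ms' idle milliseconds under PELT
 * decay, assuming the classic half-life of 32ms (y^32 = 0.5) */
static unsigned long decayed_util(unsigned long util, unsigned int ms)
{
	return (unsigned long)(util * pow(0.5, ms / 32.0));
}

int main(void)
{
	/* a CPU that entered idle at util 800 should read ~91 after
	 * 100ms idle -- far below any overutilized bar -- but without
	 * decay the scheduler keeps seeing the stale 800 */
	printf("%lu\n", decayed_util(800, 100));
	return 0;
}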
So essentially we need a proper method to decay idle CPU utilization. One possible method is to wake up idle CPUs on the scheduler tick so they exit idle states, update their own utilization values and sleep again if there is no task on them; but this method is suboptimal and potentially harms energy if CPUs enter and exit idle states merely to decay load metrics.
This patch instead uses load balancing as a good occasion to decay idle CPUs' blocked load, so that the system eventually gets correct load metrics for idle CPUs.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2a263f7..3278b563 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8204,9 +8204,14 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
-		if (!nr_running && idle_cpu(i))
+		if (!nr_running && idle_cpu(i)) {
 			sgs->idle_cpus++;
+			/* update idle CPU blocked load */
+			if (cpu_util(i))
+				update_blocked_averages(i);
+		}
+
 		if (cpu_overutilized(i)) {
 			overutilized = true;
 			if (!sgs->group_misfit_task && rq->misfit_task)
--
1.9.1
Add trace event for sched group energy calculation.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 include/trace/events/sched.h | 45 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  5 +++++
 2 files changed, 50 insertions(+)
 mode change 100644 => 100755 include/trace/events/sched.h
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
old mode 100644
new mode 100755
index 433d391..d002d01
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -1141,6 +1141,51 @@ TRACE_EVENT(walt_migration_update_sum,
 );
 #endif /* CONFIG_SCHED_WALT */
+/*
+ * Tracepoint for sched group energy
+ */
+TRACE_EVENT(sched_group_energy,
+
+	TP_PROTO(const struct cpumask *mask,
+		 int util_delta, int cap_idx, int idle_idx,
+		 unsigned long group_util,
+		 int sg_busy_energy, int sg_idle_energy,
+		 int total_energy),
+
+	TP_ARGS(mask, util_delta, cap_idx, idle_idx, group_util,
+		sg_busy_energy, sg_idle_energy, total_energy),
+
+	TP_STRUCT__entry(
+		__bitmask(cpumask, num_possible_cpus())
+		__field(int,		util_delta)
+		__field(int,		cap_idx)
+		__field(int,		idle_idx)
+		__field(unsigned long,	group_util)
+		__field(int,		sg_busy_energy)
+		__field(int,		sg_idle_energy)
+		__field(int,		total_energy)
+	),
+
+	TP_fast_assign(
+		__assign_bitmask(cpumask, cpumask_bits(mask),
+				 num_possible_cpus());
+		__entry->util_delta	= util_delta;
+		__entry->cap_idx	= cap_idx;
+		__entry->idle_idx	= idle_idx;
+		__entry->group_util	= group_util;
+		__entry->sg_busy_energy	= sg_busy_energy;
+		__entry->sg_idle_energy	= sg_idle_energy;
+		__entry->total_energy	= total_energy;
+	),
+
+	TP_printk("cpus=%s util_delta=%d cap_idx=%d idle_idx=%d "
+		  "group_util=%lu sg_busy_energy=%d sg_idle_energy=%d "
+		  "total_energy=%d",
+		  __get_bitmask(cpumask), __entry->util_delta,
+		  __entry->cap_idx, __entry->idle_idx,
+		  __entry->group_util, __entry->sg_busy_energy,
+		  __entry->sg_idle_energy, __entry->total_energy)
+);
+
 #endif /* CONFIG_SMP */
 #endif /* _TRACE_SCHED_H */

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3278b563..aab8c1c 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5564,6 +5564,11 @@ static int sched_group_energy(struct task_energy *tsk_nrg)
total_energy += sg_busy_energy + sg_idle_energy;
+		trace_sched_group_energy(sched_group_cpus(sg),
+					 tsk_nrg->task_util, cap_idx, idle_idx,
+					 group_util, sg_busy_energy,
+					 sg_idle_energy, total_energy);
+
 		if (!sd->child)
 			cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
--
1.9.1
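Once applied, the new tracepoint can be exercised like any other sched trace event, e.g. (paths assume tracefs mounted under /sys/kernel/debug):

  # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_group_energy/enable
  # cat /sys/kernel/debug/tracing/trace_pipe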
On Mon, Jun 26, 2017 at 01:27:25PM +0800, Leo Yan wrote:
[...]
c) Tipping point optimization
The power saving optimization mainly focuses on deferring the system tipping point so the energy aware path can stay enabled in most cases; but deferring the tipping point also hurts performance if the system cannot cross the tipping point in overloaded scenarios (like benchmarks).
So the target is: optimize power without performance regression.
Patch 0011 is Thara's patch v1 "Per Sched domain over utilization"; the patch gives a good method for storing the per sched domain flag. I tweaked it with the criteria below for overutilization (see the sketch after this list):
- If a single CPU's util is more than 80% of its capacity, the lowest level sched domain is set 'overutilized'; this is the tipping point for the 'inner overutilized' flag.
- If any CPU has a 'misfit' task, or the cluster's overall util is more than 80% of the cluster's overall capacity, the parent level sched domain is set 'overutilized'; this is the tipping point for the 'outer overutilized' flag.
- If the overall util is more than 50% of the overall capacity of all CPUs, the root domain's 'overutilized' flag is set. The 50% is actually a quite high bar: e.g. with two clusters it means the overall util is beyond half of the combined capacity, i.e. it has completely exceeded one cluster's capacity, so we hit the 'global' tipping point and spread tasks across both clusters.
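As a minimal sketch, the three criteria map onto checks like these (assuming capacity_margin keeps the EAS default of 1280, so "util * 1280 > capacity * 1024" is the ~80% bar; the helper names are illustrative, not from the patches):

/* inner (single CPU) and outer (whole cluster): the ~80% bar */
static inline bool util_over_80pct(unsigned long util, unsigned long cap)
{
	return cap * 1024 < util * capacity_margin;
}

/* global (root domain): 50% of the combined capacity */
static inline bool util_over_50pct(unsigned long util, unsigned long cap)
{
	return cap < util * 2;
}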
So with the 'per sched domain flag', we can defer the 'global' tipping point and rely on it as the switch for the energy aware path. Patch 0011 also moves the energy aware function to the beginning of the wakeup path, which gives energy_aware_wake_cpu() more chances to run while the system is under the tipping point; only when the system is over the tipping point does it go back to the traditional wakeup balance to select the idlest CPU.
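The wakeup-path change then amounts to a gate like the following (a simplified sketch, not the literal diff; energy_aware_wake_cpu() and is_sd_overutilized() come from this series, while the surrounding select_task_rq_fair() plumbing is elided):

	/* early in select_task_rq_fair() */
	if (energy_aware() && !is_sd_overutilized(sd)) {
		/* under the tipping point: take the energy aware path */
		new_cpu = energy_aware_wake_cpu(p, prev_cpu, sync);
		goto unlock;
	}

	/* over the tipping point: fall through to the traditional
	 * wakeup balance and pick the idlest CPU */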
Hi Thara, Vincent,
I have seen Thara's patch v3 "Per Sched domain over utilization", but it arrived almost when I finished this round of testing.
So my patch set includes Thara's v1 patch; we can take it as a quick pilot experiment on Hikey960. After getting review comments, we can decide whether to invest more time to port and verify your v3 patch.
Thanks, Leo Yan