### Basic Ideas ###
This patch set is rebased on EASv1.2 for power optimization on Hikey960.
ARM big.LITTLE systems come in many variants. Some platforms use the same CPU architecture for all clusters, but each cluster has a different manufacturing process (or clock design), so the clusters can have different OPP settings. On this kind of system the 'LITTLE' and 'big' cores share the same architecture, but we can still get a power benefit from the 'LITTLE' core because it has better power efficiency than the 'big' core at the same OPP. On the other hand, on this kind of system the 'LITTLE' core's power efficiency usually does not differ hugely from the 'big' core's; furthermore, the final CPU power saving percentage is discounted twice. So when optimizing power for some scenarios, the gain may not be as significant as expected, which means power optimization is not a priority issue on these platforms.
Regarding the CPU power discounting for the whole system: the first discount is related to the CPU duty cycle, the second discount is related to the SoC/board baseline power. We can estimate the system-level CPU power saving percentage with the formula below:
  CPU power saving percentage: CPU_PS%
  CPU duty cycle: CPU_DC%
  The percentage between CPU power and whole system power: CPU_SYS%

So finally the estimated power saving percentage is:

  CPU_PS% * CPU_DC% * CPU_SYS%
Let's look at one example. We have two CA53 clusters, and the 'LITTLE' cluster is 30% more power efficient than the 'big' cluster, so CPU_PS% = 30%; video playback (1080p) has a CPU duty cycle of CPU_DC% = 30% (1 core); the ratio between CPU power and system power is CPU_SYS% = 15%. So the power we can finally save by using the 'LITTLE' core instead of the 'big' core is:
CPU_PS% * CPU_DC% * CPU_SYS% = 30% * 30% * 15% = 1.35%
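The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name system_power_saving() is made up for illustration, and the inputs are the fractions quoted in the example.

```python
def system_power_saving(cpu_ps, cpu_dc, cpu_sys):
    """Estimated system-level saving as a fraction (all inputs are fractions)."""
    return cpu_ps * cpu_dc * cpu_sys


# Two CA53 clusters, 1080p video playback, values taken from the text.
saving = system_power_saving(cpu_ps=0.30, cpu_dc=0.30, cpu_sys=0.15)
print(f"estimated system-level saving: {saving * 100:.2f}%")
```

Each factor is a discount on the previous one, so the result can never exceed the smallest of the three percentages.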
Naturally, 1.35% does not look like a significant improvement. But for some cases there is the concept of delta power, and comparing the saving against the delta power shows its importance. Using video playback as an example, the delta power percentage (DP%) is:
DP% = (video_playback_power - home_screen_power) / video_playback_power
So DP% is an important criterion for phone models when checking a scenario against the idle Android system. If we assume DP% = 15% and a power saving percentage of 1.35%, then the 1.35% saving is meaningful when compared against 15%.
Another kind of big.LITTLE system has a big difference in power efficiency. If we review the power efficiency coefficient (mW/MHz) on Hikey960, in the worst case the CA73 is 6.2 times the CA53; if we take the median OPPs as reference, the CA73 is 2.42 times the CA53. So the highest CPU_PS%(max) = 86% and the median CPU_PS%(median) = 70%. Let's check how much we can save in theory on Hikey960 for the case above:
  CPU_PS%(max) * CPU_DC% * CPU_SYS% = 86% * 30% * 15% = 3.87%
  CPU_PS%(median) * CPU_DC% * CPU_SYS% = 70% * 30% * 15% = 3.15%
We can see that a power saving percentage of 3.87%/3.15% is significant relative to DP% (15%).
So on Hikey960, if there are scenarios with a high CPU duty cycle and sustained power consumption, power optimization is important for them.
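To make the comparison against DP% concrete, the sketch below expresses each estimated saving as a share of the delta power. The helper name share_of_delta_power() is hypothetical; the percentages are the ones quoted in the text.

```python
def share_of_delta_power(saving_pct, dp_pct):
    """Fraction of the delta power that the estimated saving would remove."""
    return saving_pct / dp_pct


DP_PCT = 15.0  # assumed delta power percentage for video playback
for label, saving_pct in (("two CA53 clusters", 1.35),
                          ("Hikey960, max", 3.87),
                          ("Hikey960, median", 3.15)):
    share = share_of_delta_power(saving_pct, DP_PCT)
    print(f"{label}: {share * 100:.1f}% of delta power")
```

Seen this way, the Hikey960 savings correspond to roughly a fifth to a quarter of the delta power, against roughly a tenth for the two-CA53-cluster example.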
### Implementations ###
Below are the detailed implementations of the optimizations:
a) Add back CPU selection based on power efficiency
EASv1.2 has the function find_best_target(); this function mainly focuses on finding the idlest CPU to reduce scheduling latency, but in some cases it misses the most power efficient CPU.
So patches 0001/0002 mainly add back CPU selection based on power efficiency. We still keep find_best_target(), but it is only used for the "boost" and "prefer_idle" cases; normal cases use the power efficiency path.
b) EAS core algorithm optimization
The EAS core algorithm should resolve the items below:

1. Support more than two clusters;
2. Keep CPUs at the lowest possible OPP, and pack small tasks when the system is idle;
3. Directly migrate a waking task to the best CPU to meet its performance requirement; this means the task can be migrated to a higher capacity CPU and vice versa;
4. Consistent results for energy calculation and a simple implementation.
Patches 0003/0004 change CPU selection to a per-cluster basis: the scheduler first selects candidate CPUs within every cluster, so each cluster either has one candidate CPU or has no CPU that can meet the requirement. All energy difference calculations then happen among these candidate CPUs. This gives us several benefits. The first is that the scheduler is no longer coupled to the previous CPU; the old code always compares energy between the previous CPU and one new possible CPU, but in some cases the previous CPU is completely the wrong CPU for the task, so the comparison is actually pointless. Patches 0003/0004 also introduce one side effect: a task can be migrated directly from a lower capacity CPU to a higher capacity CPU (LITTLE -> big). This usually does not happen in the old code, because in the energy comparison the lower capacity CPU beats the higher capacity CPU, so the task misses the chance to migrate to a higher capacity CPU.
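The per-cluster candidate idea can be sketched as below. This is not the kernel code: the Cpu class, the capacity numbers, the "first fitting CPU" placeholder policy and the toy energy model are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    cluster: str
    capacity: int  # hypothetical capacity scale
    util: int      # current utilization on the same scale


def pick_candidates(cpus, task_util):
    """Return at most one candidate CPU per cluster that can fit the task."""
    candidates = {}
    for cpu in cpus:
        if cpu.util + task_util > cpu.capacity:
            continue  # this CPU cannot meet the requirement
        # Placeholder policy: keep the first fitting CPU of each cluster.
        candidates.setdefault(cpu.cluster, cpu)
    return candidates


def select_cpu(cpus, task_util, energy_cost):
    """Compare energy only among the per-cluster candidates."""
    candidates = pick_candidates(cpus, task_util)
    if not candidates:
        return None
    return min(candidates.values(), key=lambda c: energy_cost(c, task_util))


# Toy energy model (assumption): cost per unit of utilization by cluster.
COST = {"LITTLE": 1.0, "big": 3.0}


def toy_energy(cpu, task_util):
    return COST[cpu.cluster] * task_util


cpus = [Cpu(0, "LITTLE", 400, 350), Cpu(1, "LITTLE", 400, 100),
        Cpu(4, "big", 1024, 0), Cpu(5, "big", 1024, 200)]
small = select_cpu(cpus, 100, toy_energy)  # fits on LITTLE, cheaper there
large = select_cpu(cpus, 500, toy_energy)  # only a big CPU can fit it
print(small.cpu_id, large.cpu_id)
```

Note that the previous CPU does not appear anywhere in the comparison, and a task that no longer fits on a LITTLE CPU ends up directly on a big CPU because the LITTLE cluster simply has no candidate.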
Patches 0005/0006 select the best CPU within a cluster. In the task wakeup path, the EAS core algorithm is responsible for task placement; it should achieve two targets: keep CPUs at the lowest possible OPP, and spread tasks if we can predict that the OPP is likely to increase after placing the waking task on one specific CPU. So patches 0005/0006 find the CPU with the lowest OPP and, among CPUs at the same OPP, the highest utilization. This way we can rely on the EAS core algorithm to spread tasks while the CPU OPP is increasing and to pack tasks once the CPU has dropped back to the lowest OPP.
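A minimal sketch of this "lowest OPP first, then highest utilization" rule is shown below. The OPP capacity table, the utilization values and the helper names are hypothetical, and the real patches work on kernel data structures rather than dictionaries.

```python
import bisect


def required_opp(opp_caps, util):
    """Index of the lowest OPP whose capacity covers util, or None."""
    idx = bisect.bisect_left(opp_caps, util)
    return idx if idx < len(opp_caps) else None


def best_cpu_in_cluster(cpu_utils, opp_caps, task_util):
    """Pick the CPU needing the lowest OPP; break ties by highest utilization."""
    best_key, best_cpu = None, None
    for cpu, util in cpu_utils.items():
        opp = required_opp(opp_caps, util + task_util)
        if opp is None:
            continue  # the task does not fit on this CPU at any OPP
        key = (opp, -util)  # lower OPP wins, then higher current utilization
        if best_key is None or key < best_key:
            best_key, best_cpu = key, cpu
    return best_cpu


opp_caps = [200, 400, 600]          # capacity at each OPP, ascending (made up)
cpu_utils = {0: 50, 1: 120, 2: 180}

packed = best_cpu_in_cluster(cpu_utils, opp_caps, task_util=60)
spread = best_cpu_in_cluster(cpu_utils, opp_caps, task_util=100)
print(packed, spread)
```

With the small task, CPU 1 is the busiest CPU that still stays at the lowest OPP, so the task is packed there. With the larger task, only CPU 0 can take it without raising its OPP, so the task is spread.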
After applying patches 0003/0004, there are many more energy comparisons between one big CPU and one LITTLE CPU. As a result, the EAS core algorithm was observed to be fragile in some corner cases. So patches 0007/0008/0009 make the energy calculation more robust. In particular, patches 0008/0009 introduce an extra signal, "util_waken_avg"; with this signal we can remove the waking task's contribution from a CPU's utilization, so that all CPU signals are cleaned of the waking task's stale utilization value.
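The effect of cleaning the CPU signal can be sketched as follows. This assumes util_waken_avg holds the waking task's stale contribution on a given CPU; the helper names and the numbers are made up for illustration.

```python
def clean_cpu_util(cpu_util_avg, util_waken_avg):
    """CPU utilization with the waking task's stale contribution removed."""
    return max(cpu_util_avg - util_waken_avg, 0)


def util_after_placement(cpu_util_avg, util_waken_avg, task_util):
    """Utilization the CPU would have if the waking task were placed on it."""
    return clean_cpu_util(cpu_util_avg, util_waken_avg) + task_util


# Previous CPU still carries 120 of stale utilization from the waking task.
prev_cpu = util_after_placement(300, 120, task_util=120)
other_cpu = util_after_placement(200, 0, task_util=120)
print(prev_cpu, other_cpu)
```

Without the subtraction, the previous CPU would be estimated at 420 instead of 300, i.e. the task would be counted twice there and the energy comparison would be biased against it.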
Patch 0010 is a significant change to the EAS core algorithm; the main idea is to change the energy calculation from CPU oriented to task oriented. Based on the energy model we can easily answer the question: if the task is placed on one specific CPU, how much power is consumed by this task? So essentially we can calculate the energy the task consumes on a specific CPU, get the power consumption for every possible CPU, and finally pick the CPU that saves the most power. With task oriented energy calculation it is also smoother to generate perf idx and energy idx per task rather than per CPU, so hopefully this also benefits the schedTune PE filter.
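One way to picture task oriented energy calculation is the sketch below. The model used here (busy power at the OPP needed after placement, scaled by the task's share of that OPP's capacity) and all numbers are assumptions for illustration, not the formula used by patch 0010.

```python
def task_energy(opps, cpu_util, task_util):
    """Estimated cost of the task on one CPU, or None if it does not fit.

    opps: list of (capacity, busy_power_mw) pairs in ascending order.
    """
    total = cpu_util + task_util
    for capacity, busy_power in opps:
        if total <= capacity:
            # The task is charged for its share of the busy power at this OPP.
            return busy_power * task_util / capacity
    return None


# Hypothetical energy model tables.
LITTLE_OPPS = [(200, 50), (400, 150)]
BIG_OPPS = [(400, 300), (1024, 1500)]

costs = {
    "LITTLE cpu": task_energy(LITTLE_OPPS, cpu_util=50, task_util=100),
    "big cpu": task_energy(BIG_OPPS, cpu_util=0, task_util=100),
}
best = min((c for c in costs if costs[c] is not None), key=costs.get)
print(costs, best)
```

The question being answered is always "what does this task cost on that CPU", so the same per-task number can be compared across any number of candidate CPUs.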
c) Tipping point optimization
Power saving optimization mainly focuses on how to defer the system tipping point so that the energy aware path stays enabled in most cases. But deferring the tipping point also hurts performance cases if the system cannot get over the tipping point in overloaded scenarios (like benchmarks).
So the target is: optimize power without performance regression.
Patch 0011 is Thara's patch v1 "Per Sched domain over utilization"; the patch gives a good method for storing the per sched domain flag. I tweaked it with the criteria below for overutilization:

1. If a single CPU is at more than 80% util, set the lowest level sched domain as 'overutilized'; this is the tipping point for the 'inner overutilized' flag.
2. If any CPU has a 'misfit' task, or the cluster's overall util > 80% of the cluster's overall capacity, set the parent level sched domain as 'overutilized'; this is the tipping point for the 'outer overutilized' flag.
3. If the overall util > 50% of the overall capacity of all CPUs, set the root domain's 'overutilized' flag. 50% is actually quite a high bar: e.g. with two clusters it means the overall util is above the midpoint of the two clusters' combined capacity, which also means the overall util is entirely beyond one cluster's capacity, so we kick the 'global' tipping point and spread tasks across the two clusters.
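The three criteria above can be sketched as a small function. The data layout (a dictionary of per-CPU (util, capacity) pairs) and the capacity values 462 and 1024 are assumptions for illustration; the real patch stores the flags in sched domain structures.

```python
def overutilized_flags(clusters, misfit=False):
    """Return (inner, outer, root) overutilized flags.

    clusters maps a cluster name to a list of (util, capacity), one per CPU.
    inner is per cluster; outer and root are single booleans.
    """
    inner = {}
    outer = misfit  # a misfit task trips the 'outer' flag directly
    total_util = total_cap = 0
    for name, cpus in clusters.items():
        # 1. any single CPU above 80% of its capacity
        inner[name] = any(u * 100 > 80 * c for u, c in cpus)
        util = sum(u for u, _ in cpus)
        cap = sum(c for _, c in cpus)
        # 2. cluster util above 80% of the cluster capacity
        if util * 100 > 80 * cap:
            outer = True
        total_util += util
        total_cap += cap
    # 3. overall util above 50% of the capacity of all CPUs
    root = total_util * 100 > 50 * total_cap
    return inner, outer, root


idle_big = [(0, 1024)] * 4
one_busy = {"LITTLE": [(400, 462), (50, 462), (50, 462), (50, 462)],
            "big": [(100, 1024), (0, 1024), (0, 1024), (0, 1024)]}
little_full = {"LITTLE": [(400, 462)] * 4, "big": idle_big}
both_busy = {"LITTLE": [(400, 462)] * 4, "big": [(400, 1024)] * 4}

print(overutilized_flags(one_busy))
print(overutilized_flags(little_full))
print(overutilized_flags(both_busy))
```

In this toy setup a fully loaded LITTLE cluster trips the 'inner' and 'outer' flags but not the root flag, which shows how the 'global' tipping point is deferred.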
So with the 'per sched domain flag' we can defer the 'global' tipping point and rely on it as the switch for the energy aware path. Patch 0011 moves the energy aware function to the beginning of the wakeup path, which gives energy_aware_wake_cpu() more chances to execute while the system is under the tipping point; only when the system is over the tipping point does it fall back to the traditional wakeup balance to select the idlest CPU.
### Testing result ###
On Hikey960, the testing environment is as below:

- Android AOSP kernel 4.4
  https://android.googlesource.com/kernel/hikey-linaro
  branch: android-hikey-linaro-4.4
- CPUFreq governor: sched-freq
- Fixed DDR: 400MHz
- Fixed GPU: 533MHz
- HDMI: unplugged
- WIFI: disabled

Please note that video playback (1080p) uses a software codec with the VLC player on Android, and camera recording uses the synthesized workload camera-long.json to simulate the camera scenario.
Test_Case         Referenced_Phone  PELT_Optimized  PELT_Optimized  WALT_Optimized  WALT_Optimized
                  (mW) [*]          (mW)            (Percentage)    (mW)            (Percentage)
homescreen        800               -5.05 [**]      -0.63%          -10.46          -1.31%
Audio(MP3)        200 (LCD OFF)     5.33            2.66%           60.62           30.31%
Video(1080p)      1000              133.09          13.31%          26.10           2.61%
Camera Recording  2000              163.94          8.20%           -79.57 [***]    -3.98%
[*] The reference phone is not any specific phone model; here I give some very rough power data for well optimized commercial phones. So these are only data based on old experience and they are not very precise. The power data is taken at the battery measurement point at 4.2V.

[**] Positive value: power reduced by this patch set
     Negative value: power increased by this patch set

[***] Camera Recording + WALT power data is much worse with this patch set; this is explained in the "Conclusion" section.
Testing raw data: http://people.linaro.org/~leo.yan/eas_upstream/hikey960_result/
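The percentage columns in the table above appear to be the measured saving divided by the reference phone power. The sketch below recomputes them under that assumption; the dictionary layout and the function name are made up for illustration.

```python
# (reference_mw, pelt_saving_mw, walt_saving_mw), values copied from the table.
RESULTS = {
    "homescreen": (800, -5.05, -10.46),
    "Audio(MP3)": (200, 5.33, 60.62),
    "Video(1080p)": (1000, 133.09, 26.10),
    "Camera Recording": (2000, 163.94, -79.57),
}


def saving_percentage(saving_mw, reference_mw):
    """Saving relative to the reference phone power, in percent."""
    return 100.0 * saving_mw / reference_mw


for case, (ref, pelt, walt) in RESULTS.items():
    print(f"{case}: PELT {saving_percentage(pelt, ref):.2f}%, "
          f"WALT {saving_percentage(walt, ref):.2f}%")
```

Because the reference values are rough estimates, the percentages are only as precise as those estimates.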
### Conclusion ###
Firstly Hikey960 is a good candidate platform to verify power saving optimization :)
This patch set with the PELT signal has good results on Hikey960, especially for video playback (saving 133.09mW) and camera recording (saving 163.94mW). For audio playback it saves 5.33mW; for homescreen there is a small regression (5.05mW increase). I suspect this is related to task packing on the LITTLE cores, but it needs investigation.
This patch set with the WALT signal has good results for audio playback and video playback, but it is broken for the camera recording case. After reviewing the trace log, the main issue is that many tasks' WALT signal can reach the range 100~200, so there are many comparisons between a LITTLE CPU at 1844MHz and a big CPU at 903MHz. According to the power modeling parameters, the big CPU at 903MHz has a lower power efficiency coefficient (mW/MHz), i.e. it costs less power, than the LITTLE CPU at 1844MHz, so tasks are frequently migrated onto big cores. Compared to the WALT signal, the PELT signal cooperates well with the power modeling parameters, so the energy aware algorithm can keep tasks from easily migrating to big cores. (It seems to me this raises the question: which signal matches the EAS core algorithm?)
Some known issues:
- The CPUFreq governor impacts power consumption a lot; sched-freq easily reaches 1844MHz, so we need to check whether there is a mechanism to tune the policy to reduce the chance of setting 1844MHz. Another test is to use other governors: schedutil, interactive.
- RT threads are not energy aware for now, so they are migrated to big cores;
- The load balancing flow has no energy aware optimization;
- DDR frequency is fixed for now; if DDR frequency scaling is enabled, the power modeling will change significantly. This needs a devfreq driver for DDR and tuning of the power modeling.
- Though these patches have been verified on Juno to do no harm to performance, a performance comparison is still needed on Hikey960.
Leo Yan (12):
  sched/fair: add function find_nrg_efficient_target()
  sched/fair: enable energy efficiency selection
  sched/fair: use new function to select CPU from sched group
  sched/fair: select candidate CPUs by cluster basis
  sched/fair: refine find_new_capacity()
  sched/fair: optimize CPU selection with lowest OPP
  sched/fair: increase resolution for energy calculation
  sched/fair: introduce signal util_waken_avg for CPU
  sched/fair: select idle CPU as backup for waken up path
  sched/fair: task oriented energy calculation
  sched/fair: update idle CPU blocked load in update_sg_lb_stats()
  sched/fair: add trace event for sched group energy

Thara Gopinath (1):
  Per Sched domain over utilization

 include/linux/sched.h        |   2 +-
 include/trace/events/sched.h |  45 ++++
 kernel/sched/fair.c          | 593 +++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h         |   1 +
 4 files changed, 484 insertions(+), 157 deletions(-)
 mode change 100644 => 100755 include/trace/events/sched.h
 mode change 100644 => 100755 kernel/sched/fair.c

--
1.9.1