### Basic Ideas ###
This patch set is rebased on EASv1.2 for power optimization on Hikey960.
ARM big.LITTLE systems come in many variants. On some platforms all clusters use the same CPU architecture, but each cluster has a different manufacturing process (or clock design), so the clusters can have different OPP settings. On this kind of system the 'LITTLE' cores and 'big' cores share the same micro-architecture, but we can still gain power benefit from the 'LITTLE' cores because they have better power efficiency than the 'big' cores at the same OPP. On the other hand, on such systems the 'LITTLE' cores' power efficiency usually does not differ hugely from the 'big' cores'; furthermore, the final CPU power saving percentage at system level is discounted twice, so when we optimize power for some scenarios the improvement may not be as significant as expected. In other words, power optimization is not a priority issue on these platforms.
Regarding how the CPU power saving is discounted at the whole-system level: the first discount is the CPU duty cycle, and the second discount is the SoC/board baseline power. We can estimate the system-level CPU power saving percentage with the formula below:
  CPU power saving percentage:                     CPU_PS%
  CPU duty cycle:                                  CPU_DC%
  Ratio between CPU power and whole system power:  CPU_SYS%
So the estimated system-level power saving percentage is:

  CPU_PS% * CPU_DC% * CPU_SYS%
Let's look at one example: we have two CA53 clusters, and the 'LITTLE' cluster is 30% more power efficient than the 'big' cluster, so CPU_PS% = 30%; video playback (1080p) has a CPU duty cycle of CPU_DC% = 30% (1 core); the ratio between CPU power and system power is CPU_SYS% = 15%. So by using a 'LITTLE' core instead of a 'big' core we can save:
CPU_PS% * CPU_DC% * CPU_SYS% = 30% * 30% * 15% = 1.35%
Naturally 1.35% does not look like a significant improvement; but for some cases there is the concept of delta power, and comparing against the delta power shows the importance of the saving. Taking video playback as an example, the delta power percentage (DP%) is:
DP% = (video_playback_power - home_screen_power) / video_playback_power
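As a worked illustration (the power numbers here are assumed, chosen only to match the DP% used below): with video_playback_power = 1000mW and home_screen_power = 850mW,

  DP% = (1000 - 850) / 1000 = 15%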
So DP% is an important criterion for phone models when evaluating scenarios against the Android idle system. If we take DP% = 15%, then a power saving of 1.35% is meaningful when compared against that 15%.
Another kind of big.LITTLE system has very different power efficiency between clusters. If we review the power efficiency on Hikey960, looking at the coefficient (mW/MHz), in the worst case the CA73 is 6.2 times the CA53; if we select the median OPPs as reference, the CA73 is 2.42 times the CA53. So the highest CPU_PS%(max) = 86% and the median CPU_PS%(median) = 70%. Applying the formula above, in theory on Hikey960 we can save:
  CPU_PS%(max)    * CPU_DC% * CPU_SYS% = 86% * 30% * 15% = 3.87%
  CPU_PS%(median) * CPU_DC% * CPU_SYS% = 70% * 30% * 15% = 3.15%
We can see that a power saving percentage of 3.87%/3.15% is significant relative to DP% (15%).
So on Hikey960, for scenarios with a high CPU duty cycle and sustained power consumption, power optimization is important.
### Implementations ###
Below is the detailed implementation of the optimizations:
a) Add back CPU selection based on power efficiency
EASv1.2 has the function find_best_target(); this function mainly focuses on finding the idlest CPU to reduce scheduling latency, but in some cases it misses selecting the most power-efficient CPU.

So patches 0001/0002 mainly add back CPU selection based on power efficiency; we still keep the function find_best_target(), but it is only used for the "boost" and "prefer_idle" cases, while the power efficiency path is used for normal cases.
b) EAS core algorithm optimization
The EAS core algorithm should resolve the problems below:

1. Support more than two clusters;
2. Keep CPUs at the lowest OPP as far as possible, and pack small tasks
   when the system is idle;
3. Directly migrate a woken task to the best CPU to meet its performance
   requirement; this means the task can be migrated to a higher capacity
   CPU and vice versa;
4. Consistent results for energy calculation and a simple implementation.
Patches 0003/0004 change CPU selection to a per-cluster basis; the scheduler first selects candidate CPUs within every cluster, so every cluster contributes at most one candidate CPU (or none, if no CPU in the cluster can meet the requirement). Finally, all energy difference calculations happen among these candidate CPUs. This gives us several benefits. The first one is that the scheduler is no longer coupled with the previous CPU; the old code always compared energy between the previous CPU and a new possible CPU, but in some cases the previous CPU is a completely wrong CPU for the task, so the comparison is actually pointless. Applying patches 0003/0004 also introduces one side effect: a task can be directly migrated from a lower capacity CPU to a higher capacity CPU (LITTLE -> big). This usually did not happen with the old code, because in the energy comparison the lower capacity CPU could beat the higher capacity CPU, so the task missed the chance to migrate to the higher capacity CPU.
Patches 0005/0006 select the best CPU within a cluster. In the task wakeup path, the EAS core algorithm is responsible for CPU selection; it should achieve two targets: keep CPUs at the lowest OPP as far as possible, and spread tasks if we can predict that the OPP is likely to increase after placing the woken task on one specific CPU. So patches 0005/0006 find the CPU which has the lowest OPP and the highest utilization compared with other CPUs at the same OPP; this way we can rely on the EAS core algorithm to spread tasks when a CPU's OPP would increase, and to pack tasks after the CPU has decreased to the lowest OPP.
After applying patches 0003/0004, there are many more energy comparisons between one big CPU and one LITTLE CPU. As a result, the EAS core algorithm was observed to be fragile in some corner cases. So patches 0007/0008/0009 make the energy calculation more robust; in particular, patches 0008/0009 introduce an extra signal "util_waken_avg". By using this signal we can remove the woken task's utilization from the CPU's utilization, so finally all CPU signals are cleaned of the woken task's stale utilization value.
Patch 0010 is a significant change for the EAS core algorithm; the main idea is to change the energy calculation from CPU oriented to task oriented. Based on the energy model we can easily answer the question: if we place the task onto one specific CPU, how much power is consumed by this task? So essentially we can calculate the energy the task consumes on a specific CPU, learn the power consumption for every possible CPU, and finally filter out which CPU saves the most power. After changing to task-oriented energy calculation, it is also smoother to generate the perf idx and energy idx on a task-oriented rather than CPU-oriented basis, so hopefully this can also benefit the schedTune PE filter.
c) Tipping point optimization
The power saving optimization mainly focuses on how to defer the system tipping point, so that the energy aware path can be enabled in most cases; but deferring the tipping point also hurts performance if the system cannot get over the tipping point for overloaded scenarios (like benchmarks).
So the target is: optimize power without performance regression.
Patch 0011 is Thara's patch v1 "Per Sched domain over utilization"; the patch gives a good method for storing the per-sched-domain flag. I tweaked it with the criteria below for over-utilization (a standalone sketch of the three checks follows the list):
1. If a single CPU has more than 80% util, set the lowest level sched
   domain as 'overutilized'; this is the tipping point for the 'inner
   overutilized' flag.

2. If any CPU has a 'misfit' task, or the cluster's overall util is more
   than 80% of the cluster's overall capacity, set the parent level sched
   domain as 'overutilized'; this is the tipping point for the 'outer
   overutilized' flag.

3. If the overall util is more than 50% of the overall capacity of all
   CPUs, set the root domain's 'overutilized' flag. The 50% is actually a
   quite high bar: e.g. with two clusters it means the overall util is
   more than half the combined capacity, i.e. it has gone beyond one whole
   cluster's capacity, so we kick the 'global' tipping point and spread
   tasks across the two clusters.
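To make these three criteria concrete, below is a minimal standalone sketch; this is illustration only, not the patch code, and the topology, per-CPU capacity and utilization numbers are all assumed. With the assumed numbers it reports only cluster0 as 'inner overutilized' (one CPU above 80% of its capacity):

/*
 * Standalone sketch (not the patch code) of the three over-utilization
 * criteria; the topology and utilization numbers below are assumed.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS     8
#define NR_CLUSTERS 2
#define CAP         1024        /* per-CPU capacity, assumed uniform */

int main(void)
{
        unsigned long util[NR_CPUS] = { 900, 100, 50, 0, 300, 0, 0, 0 };
        int cluster[NR_CPUS]        = {   0,   0,  0, 0,   1, 1, 1, 1 };
        bool misfit[NR_CPUS]        = { false };

        unsigned long cluster_util[NR_CLUSTERS] = { 0 }, total_util = 0;
        int cluster_cpus[NR_CLUSTERS] = { 0 };
        bool inner[NR_CLUSTERS] = { false }, outer[NR_CLUSTERS] = { false };

        for (int i = 0; i < NR_CPUS; i++) {
                int c = cluster[i];

                cluster_util[c] += util[i];
                cluster_cpus[c]++;
                total_util += util[i];

                /* 1. single CPU util > 80%: 'inner overutilized' */
                if (util[i] * 100 > CAP * 80)
                        inner[c] = true;

                /* 2a. misfit task: 'outer overutilized' */
                if (misfit[i])
                        outer[c] = true;
        }

        for (int c = 0; c < NR_CLUSTERS; c++) {
                /* 2b. cluster util > 80% of cluster capacity: 'outer' */
                if (cluster_util[c] * 100 > (unsigned long)cluster_cpus[c] * CAP * 80)
                        outer[c] = true;

                printf("cluster%d: inner=%d outer=%d\n", c, inner[c], outer[c]);
        }

        /* 3. overall util > 50% of all CPUs' capacity: 'global' tipping point */
        printf("global=%d\n", total_util * 100 > (unsigned long)NR_CPUS * CAP * 50);

        return 0;
}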
So with the per-sched-domain flags we can defer the 'global' tipping point and rely on them as the switch for the energy aware path. The patch also moves the energy aware function to the beginning of the wakeup path, so the function energy_aware_wake_cpu() gets more chances to execute while the system is under the tipping point; only when the system is over the tipping point does it fall back to the traditional wakeup balance to select the idlest CPU.
### Testing result ###
On Hikey960, below is the testing environment:

- Android AOSP kernel 4.4
  https://android.googlesource.com/kernel/hikey-linaro
  branch: android-hikey-linaro-4.4
- CPUFreq governor: sched-freq
- Fixed DDR: 400MHz
- Fixed GPU: 533MHz
- HDMI: unplugged
- WIFI: disabled
Please note: video playback (1080p) uses a software codec with the VLC player on Android; camera recording uses the synthesized workload camera-long.json to simulate the camera scenario.
Test_Case         Referenced_Phone  PELT_Optimized  PELT_Optimized  WALT_Optimized  WALT_Optimized
                  (mW) [*]          (mW)            (Percentage)    (mW)            (Percentage)
homescreen         800                -5.05 [**]     -0.63%          -10.46          -1.31%
Audio(MP3)         200 (LCD OFF)       5.33           2.66%           60.62          30.31%
Video(1080p)      1000               133.09          13.31%           26.10           2.61%
Camera Recording  2000               163.94           8.20%          -79.57 [***]    -3.98%
[*] The referenced phone is not any specific phone model; here I give some very rough power data for well optimized commercial phones. These data are based on past experience and are not very precise. The power data are measured at the battery measurement point at 4.2V.
[**] Positive value: power reduced by this patch set
     Negative value: power increased by this patch set
[***] The Camera Recording + WALT power data is much worse with this patch set; this will be explained in the "Conclusion" section.
Testing raw data: http://people.linaro.org/~leo.yan/eas_upstream/hikey960_result/
### Conclusion ###
Firstly, Hikey960 is a good candidate platform for verifying power saving optimization :)
This patch set with the PELT signal gives good results on Hikey960, especially for video playback (saving 133.09mW) and camera recording (saving 163.94mW). For audio playback it saves 5.33mW; for homescreen there is a slight regression (an increase of 5.05mW). I suspect this is related to task packing on the LITTLE cores, but it needs further investigation.
This patch set with the WALT signal gives good results for audio playback and video playback, but it is broken for the camera recording case. After reviewing the trace log, the main issue is that many tasks' WALT signals reach into the range 100~200, so there are many comparisons between the LITTLE CPU at 1844MHz and the big CPU at 903MHz. From the power modeling parameters, the big CPU at 903MHz has a lower power cost than the LITTLE CPU at 1844MHz, so tasks are frequently migrated onto the big cores. Compared with the WALT signal, the PELT signal co-works well with the power modeling parameters, so we can see the energy aware algorithm avoids migrating tasks to big cores too easily. (This seems to me to raise the question: which signal best matches the EAS core algorithm?)
Some known issues:
- The CPUFreq governor impacts power consumption a lot; sched-freq easily ramps up to 1844MHz, so we need to check whether there is a mechanism to optimize the policy and reduce the chance of setting 1844MHz;

  A further test is to try other governors: schedutil, interactive.
- RT threads are not energy aware yet, so they get migrated to big cores;
- The load balancing flow has no energy aware optimization;
- The DDR frequency is fixed for now; if DDR frequency scaling is enabled, the power modeling will change significantly. This needs a devfreq driver for DDR, and the power modeling must be tuned for it.
- Though these patches have been verified on Juno with no harm to performance, a performance comparison still needs to be done on Hikey960.
Leo Yan (12):
  sched/fair: add function find_nrg_efficient_target()
  sched/fair: enable energy efficiency selection
  sched/fair: use new function to select CPU from sched group
  sched/fair: select candidate CPUs by cluster basis
  sched/fair: refine find_new_capacity()
  sched/fair: optimize CPU selection with lowest OPP
  sched/fair: increase resolution for energy calculation
  sched/fair: introduce signal util_waken_avg for CPU
  sched/fair: select idle CPU as backup for waken up path
  sched/fair: task oriented energy calculation
  sched/fair: update idle CPU blocked load in update_sg_lb_stats()
  sched/fair: add trace event for sched group energy
Thara Gopinath (1):
  Per Sched domain over utilization
 include/linux/sched.h        |   2 +-
 include/trace/events/sched.h |  45 ++++
 kernel/sched/fair.c          | 593 +++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h         |   1 +
 4 files changed, 484 insertions(+), 157 deletions(-)
 mode change 100644 => 100755 include/trace/events/sched.h
 mode change 100644 => 100755 kernel/sched/fair.c

--
1.9.1
EASv1.2 unified the CPU selection for woken tasks with the function find_best_target(); this function tries to select an idle CPU or the CPU with the lowest utilization as the candidate CPU, which reduces scheduling latency and boosts performance for interactive scenarios.

On the other hand, this function is not comprehensive for power saving, and it is fragile on big.LITTLE cluster systems.

E.g. if the "prefer_idle" flag is set, and a small task that was running on a big core is woken up after sleeping, the function find_best_target() iterates the scheduling groups to find the best target for this task. The first iteration starts from the previous CPU's scheduling group, so if the previous CPU is a big core it iterates the big cores' group first. As a result it has a much higher chance to select an idling big core and bail out directly, rather than select an idle CPU from the LITTLE cluster.

Another case: if the "prefer_idle" flag is cleared, the first iteration covers the scheduling group for the LITTLE cluster and the second iteration the group corresponding to the big cluster; the function find_best_target() then has a high chance to select a big CPU for the variables 'target_cpu' and 'best_idle_cpu'. This is not an optimal selection from the power saving perspective; it means we may miss the chance to select a LITTLE CPU at a lower OPP.

So this patch adds back the function find_nrg_efficient_target(), which selects the CPU from the energy efficiency perspective.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9afae4..45b4080 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6310,6 +6310,73 @@ static inline int find_best_target(struct task_struct *p, bool boosted, bool pre
 	return target_cpu;
 }
+static inline int find_nrg_efficient_target(struct task_struct *p,
+					    struct sched_domain *sd)
+{
+	struct sched_group *sg, *sg_target;
+	int target_max_cap = INT_MAX;
+	int target_cpu = task_cpu(p);
+	unsigned long task_util_boosted, new_util;
+	int i;
+
+	sg = sd->groups;
+	sg_target = sg;
+
+	/*
+	 * Find group with sufficient capacity. We only get here if no cpu is
+	 * overutilized. We may end up overutilizing a cpu by adding the task,
+	 * but that should not be any worse than select_idle_sibling().
+	 * load_balance() should sort it out later as we get above the tipping
+	 * point.
+	 */
+	do {
+		/* Assuming all cpus are the same in group */
+		int max_cap_cpu = group_first_cpu(sg);
+
+		/*
+		 * Assume smaller max capacity means more energy-efficient.
+		 * Ideally we should query the energy model for the right
+		 * answer but it easily ends up in an exhaustive search.
+		 */
+		if (capacity_of(max_cap_cpu) < target_max_cap &&
+		    task_fits_max(p, max_cap_cpu)) {
+			sg_target = sg;
+			target_max_cap = capacity_of(max_cap_cpu);
+		}
+	} while (sg = sg->next, sg != sd->groups);
+
+	task_util_boosted = boosted_task_util(p);
+	/* Find cpu with sufficient capacity */
+	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
+		/*
+		 * p's blocked utilization is still accounted for on prev_cpu
+		 * so prev_cpu will receive a negative bias due to the double
+		 * accounting. However, the blocked utilization may be zero.
+		 */
+		new_util = cpu_util(i) + task_util_boosted;
+
+		/*
+		 * Ensure minimum capacity to grant the required boost.
+		 * The target CPU can be already at a capacity level higher
+		 * than the one required to boost the task.
+		 */
+		if (new_util > capacity_orig_of(i))
+			continue;
+
+		if (new_util < capacity_curr_of(i)) {
+			target_cpu = i;
+			if (cpu_rq(i)->nr_running)
+				break;
+		}
+
+		/* cpu has capacity at higher OPP, keep it as fallback */
+		if (target_cpu == task_cpu(p))
+			target_cpu = i;
+	}
+
+	return target_cpu;
+}
+
 /*
  * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
  * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
--
1.9.1
This patch enables the energy efficiency selection path, and renames the function find_best_target() to find_idlest_target() so it is only used when the "prefer_idle" flag is set or when the task has a boost margin > 0.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 45b4080..ecc156c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6186,7 +6186,8 @@ static int start_cpu(bool boosted)
 	return boosted ? rd->max_cap_orig_cpu : rd->min_cap_orig_cpu;
 }
-static inline int find_best_target(struct task_struct *p, bool boosted, bool prefer_idle)
+static inline int find_idlest_target(struct task_struct *p, bool boosted,
+				     bool prefer_idle)
 {
 	int target_cpu = -1;
 	unsigned long target_util = prefer_idle ? ULONG_MAX : 0;
@@ -6434,15 +6435,18 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 		goto unlock;
 	/* Find a cpu with sufficient capacity */
-	tmp_target = find_best_target(p, boosted, prefer_idle);
+	if (boosted || prefer_idle) {
+		tmp_target = find_idlest_target(p, boosted, prefer_idle);
+		if (tmp_target >= 0)
+			target_cpu = tmp_target;
-	if (tmp_target >= 0) {
-		target_cpu = tmp_target;
-		if ((boosted || prefer_idle) && idle_cpu(target_cpu)) {
+		if (prefer_idle && idle_cpu(target_cpu)) {
 			schedstat_inc(p, se.statistics.nr_wakeups_secb_idle_bt);
 			schedstat_inc(this_rq(), eas_stats.secb_idle_bt);
 			goto unlock;
 		}
+	} else {
+		target_cpu = find_nrg_efficient_target(p, sd);
 	}
 	if (target_cpu != prev_cpu) {
--
1.9.1
This patch adds the new function energy_aware_select_candidate_cpu(), which selects a candidate CPU from a sched group. The function uses the same logic as before, but having this dedicated function helps the subsequent optimizations.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 73 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 43 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecc156c..47f6365 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6311,14 +6311,51 @@ static inline int find_idlest_target(struct task_struct *p, bool boosted,
 	return target_cpu;
 }
+static int energy_aware_select_candidate_cpu(struct task_struct *p,
+					     struct sched_group *sg)
+{
+	int i, cpu = -1;
+	unsigned long task_util_boosted, new_util;
+
+	task_util_boosted = boosted_task_util(p);
+
+	/* Find cpu with sufficient capacity */
+	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg)) {
+		/*
+		 * p's blocked utilization is still accounted for on prev_cpu
+		 * so prev_cpu will receive a negative bias due to the double
+		 * accounting. However, the blocked utilization may be zero.
+		 */
+		new_util = cpu_util(i) + task_util_boosted;
+
+		/*
+		 * Ensure minimum capacity to grant the required boost.
+		 * The target CPU can be already at a capacity level higher
+		 * than the one required to boost the task.
+		 */
+		if (new_util > capacity_orig_of(i))
+			continue;
+
+		if (new_util < capacity_curr_of(i)) {
+			cpu = i;
+			if (cpu_rq(i)->nr_running)
+				break;
+		}
+
+		/* cpu has capacity at higher OPP, keep it as fallback */
+		if (cpu == task_cpu(p))
+			cpu = i;
+	}
+
+	return cpu;
+}
+
 static inline int find_nrg_efficient_target(struct task_struct *p,
 					    struct sched_domain *sd)
 {
 	struct sched_group *sg, *sg_target;
 	int target_max_cap = INT_MAX;
-	int target_cpu = task_cpu(p);
-	unsigned long task_util_boosted, new_util;
-	int i;
+	int target_cpu = task_cpu(p), cpu;
 	sg = sd->groups;
 	sg_target = sg;
@@ -6346,34 +6383,10 @@ static inline int find_nrg_efficient_target(struct task_struct *p,
 		}
 	} while (sg = sg->next, sg != sd->groups);
-	task_util_boosted = boosted_task_util(p);
-	/* Find cpu with sufficient capacity */
-	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
-		/*
-		 * p's blocked utilization is still accounted for on prev_cpu
-		 * so prev_cpu will receive a negative bias due to the double
-		 * accounting. However, the blocked utilization may be zero.
-		 */
-		new_util = cpu_util(i) + task_util_boosted;
-
-		/*
-		 * Ensure minimum capacity to grant the required boost.
-		 * The target CPU can be already at a capacity level higher
-		 * than the one required to boost the task.
-		 */
-		if (new_util > capacity_orig_of(i))
-			continue;
-		if (new_util < capacity_curr_of(i)) {
-			target_cpu = i;
-			if (cpu_rq(i)->nr_running)
-				break;
-		}
-
-		/* cpu has capacity at higher OPP, keep it as fallback */
-		if (target_cpu == task_cpu(p))
-			target_cpu = i;
-	}
+	cpu = energy_aware_select_candidate_cpu(p, sg_target);
+	if (cpu != -1)
+		target_cpu = cpu;
 	return target_cpu;
 }
--
1.9.1
In the old code the energy aware flow compares energy between two different CPUs and selects the most power efficient CPU for the woken task. But there are several cases the old code can hardly handle:

- The old code compares the energy between two candidate CPUs, and one of them must be the previous CPU. For some big tasks the previous CPU cannot provide the required capacity, but the previous CPU is still selected because it has less energy. So from the performance perspective this case wrongly takes the previous CPU as a candidate;

- It is possible that the previous CPU is not the most power efficient CPU in the cluster (e.g. another CPU in the same cluster could run the woken task at a lower OPP); if we always compare against the previous CPU's energy, the most power efficient CPU may be missed. This case can happen when the scheduler tries to figure out whether the task should stay at the big cluster's low OPP or migrate to the LITTLE cluster's high OPP: if the previous CPU is a big core at a high OPP, then we should also consider the other CPUs in the big cluster;

- If the system has more than two clusters, the old code cannot handle this more complex case.

This patch selects candidate CPUs on a per-cluster basis: it tries to select a candidate CPU from every sched group (corresponding to every cluster on an ARM big.LITTLE platform). Finally the scheduler can select the most power efficient CPU among these candidate CPUs.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 82 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 47 insertions(+), 35 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 47f6365..487fbe5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5285,6 +5285,9 @@ static inline bool energy_aware(void)
 }
 struct energy_env {
+	cpumask_t search_cpus;
+	int target_cpu;
+
 	struct sched_group *sg_top;
 	struct sched_group *sg_cap;
 	int cap_idx;
@@ -6350,15 +6353,14 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 	return cpu;
 }
-static inline int find_nrg_efficient_target(struct task_struct *p,
-					    struct sched_domain *sd)
+static inline void find_nrg_efficient_target(struct task_struct *p,
+					     struct sched_domain *sd,
+					     struct energy_env *eenv)
 {
-	struct sched_group *sg, *sg_target;
-	int target_max_cap = INT_MAX;
-	int target_cpu = task_cpu(p), cpu;
+	struct sched_group *sg;
+	int cpu;
 	sg = sd->groups;
-	sg_target = sg;
 	/*
 	 * Find group with sufficient capacity. We only get here if no cpu is
@@ -6376,19 +6378,14 @@ static inline int find_nrg_efficient_target(struct task_struct *p,
 		 * Ideally we should query the energy model for the right
 		 * answer but it easily ends up in an exhaustive search.
 		 */
-		if (capacity_of(max_cap_cpu) < target_max_cap &&
-		    task_fits_max(p, max_cap_cpu)) {
-			sg_target = sg;
-			target_max_cap = capacity_of(max_cap_cpu);
+		if (task_fits_max(p, max_cap_cpu)) {
+			cpu = energy_aware_select_candidate_cpu(p, sg);
+			if (cpu != -1)
+				cpumask_set_cpu(cpu, &eenv->search_cpus);
 		}
 	} while (sg = sg->next, sg != sd->groups);
-
-	cpu = energy_aware_select_candidate_cpu(p, sg_target);
-	if (cpu != -1)
-		target_cpu = cpu;
-
-	return target_cpu;
+	return;
 }
 /*
@@ -6420,6 +6417,8 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 	struct sched_domain *sd;
 	int target_cpu = prev_cpu, tmp_target;
 	bool boosted, prefer_idle;
+	struct energy_env eenv;
+	int cpu;
 	schedstat_inc(p, se.statistics.nr_wakeups_secb_attempts);
 	schedstat_inc(this_rq(), eas_stats.secb_attempts);
@@ -6447,6 +6446,8 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 	if (!sd)
 		goto unlock;
+	cpumask_clear(&eenv.search_cpus);
+
 	/* Find a cpu with sufficient capacity */
 	if (boosted || prefer_idle) {
 		tmp_target = find_idlest_target(p, boosted, prefer_idle);
@@ -6458,35 +6459,46 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 			schedstat_inc(this_rq(), eas_stats.secb_idle_bt);
 			goto unlock;
 		}
+
+		cpumask_set_cpu(target_cpu, &eenv.search_cpus);
+		cpumask_set_cpu(prev_cpu, &eenv.search_cpus);
 	} else {
-		target_cpu = find_nrg_efficient_target(p, sd);
+		find_nrg_efficient_target(p, sd, &eenv);
 	}
-	if (target_cpu != prev_cpu) {
-		struct energy_env eenv = {
-			.util_delta	= task_util(p),
-			.src_cpu	= prev_cpu,
-			.dst_cpu	= target_cpu,
-			.task		= p,
-		};
+	eenv.target_cpu = -1;
-		/* Not enough spare capacity on previous cpu */
-		if (cpu_overutilized(prev_cpu)) {
-			schedstat_inc(p, se.statistics.nr_wakeups_secb_insuff_cap);
-			schedstat_inc(this_rq(), eas_stats.secb_insuff_cap);
-			goto unlock;
+	for_each_cpu(cpu, &eenv.search_cpus) {
+
+		if (eenv.target_cpu == -1) {
+			eenv.target_cpu = cpu;
+			continue;
 		}
-		if (energy_diff(&eenv) >= 0) {
-			schedstat_inc(p, se.statistics.nr_wakeups_secb_no_nrg_sav);
-			schedstat_inc(this_rq(), eas_stats.secb_no_nrg_sav);
-			target_cpu = prev_cpu;
-			goto unlock;
+		if (unlikely(!task_util(p))) {
+			if (capacity_orig_of(cpu) < capacity_orig_of(eenv.target_cpu))
+				eenv.target_cpu = cpu;
+
+			continue;
 		}
+		eenv.util_delta = task_util(p);
+		eenv.src_cpu = eenv.target_cpu;
+		eenv.dst_cpu = cpu;
+		eenv.task = p;
+
+		if (energy_diff(&eenv) < 0)
+			eenv.target_cpu = cpu;
+	}
+
+	if (eenv.target_cpu == -1) {
+		schedstat_inc(p, se.statistics.nr_wakeups_secb_no_nrg_sav);
+		schedstat_inc(this_rq(), eas_stats.secb_no_nrg_sav);
+		target_cpu = prev_cpu;
+	} else {
 		schedstat_inc(p, se.statistics.nr_wakeups_secb_nrg_sav);
 		schedstat_inc(this_rq(), eas_stats.secb_nrg_sav);
-		goto unlock;
+		target_cpu = eenv.target_cpu;
 	}
 	schedstat_inc(p, se.statistics.nr_wakeups_secb_count);
--
1.9.1
This patch refines the function find_new_capacity() with more general arguments, so it can be used as a general function to predict the OPP index. It also uses 80% as the margin when predicting the OPP index.
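As a worked illustration of the 80% margin (assuming capacity_margin = 1280, i.e. 1024 * 100 / 80, the value commonly used in EAS kernels): the condition 'cap_states[idx].cap * 1024 >= util * capacity_margin' picks the first OPP whose capacity is at least util * 1280 / 1024 = 1.25 * util. E.g. for util = 512 the first OPP with capacity >= 640 is chosen, so the utilization stays at or below 80% of the selected capacity.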
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)
 mode change 100644 => 100755 kernel/sched/fair.c
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
old mode 100644
new mode 100755
index 487fbe5..42a40bf
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5383,18 +5383,20 @@ long group_norm_util(struct energy_env *eenv, struct sched_group *sg)
 	return util_sum;
 }
-static int find_new_capacity(struct energy_env *eenv,
-			     const struct sched_group_energy * const sge)
+static int find_new_capacity(const struct sched_group_energy const *sge,
+			     unsigned long util)
 {
 	int idx;
-	unsigned long util = group_max_util(eenv);
 	for (idx = 0; idx < sge->nr_cap_states; idx++) {
-		if (sge->cap_states[idx].cap >= util)
+		if (sge->cap_states[idx].cap * 1024 >=
+		    util * capacity_margin)
 			break;
 	}
-	eenv->cap_idx = idx;
+	/* roll back to max index */
+	if (idx == sge->nr_cap_states)
+		idx = idx - 1;
 	return idx;
 }
@@ -5465,7 +5467,8 @@ static int sched_group_energy(struct energy_env *eenv)
 		else
 			eenv->sg_cap = sg;
-		cap_idx = find_new_capacity(eenv, sg->sge);
+		cap_idx = find_new_capacity(sg->sge, group_max_util(eenv));
+		eenv->cap_idx = cap_idx;
 		if (sg->group_weight == 1) {
 			/* Remove capacity of src CPU (before task move) */
--
1.9.1
In the previous code, energy aware scheduling selects a CPU based on whether the CPU's current capacity can meet the woken task's requirement, and prefers a running CPU over an idle CPU to reduce wakeup latency. This has a side effect which usually creates a vicious circle: after placing the task on a CPU, the CPUFreq governor is likely to increase the frequency once it detects the higher CPU utilization; the next woken task is then also placed on the same CPU, because that CPU already has the raised frequency and is the preferred (running) CPU. As a result, the CPUFreq governor keeps increasing the frequency while more and more tasks are packed onto one CPU.

Another observed issue: if the system has one big task and several small tasks, and a LITTLE core can meet the big task's capacity requirement, the previous code is more likely to pack tasks onto one CPU; that CPU then has more chance to become overutilized, and in the end the big task is migrated to a big core even though a standalone LITTLE core could meet its capacity requirement.

So this patch is based on the two rules below for power saving:

- If the CPUs share voltage and clock, then CPUs at a higher OPP consume
  more power than at a lower OPP; so prefer to spread tasks in order to
  stay at the lowest OPP as far as possible;
- After the lowest OPP has been achieved, it is good to pack tasks so we
  can reduce the number of CPU wakeups.

This patch follows the two rules above: the most important criterion is to select the CPU which can meet the task's requirement at the lowest OPP; furthermore, it packs tasks as far as possible while keeping the CPU at the lowest OPP.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 45 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 36 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42a40bf..0486edb 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6321,18 +6321,25 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 				struct sched_group *sg)
 {
 	int i, cpu = -1;
-	unsigned long task_util_boosted, new_util;
+	int cap_idx = INT_MAX, idx;
+	unsigned long task_util_boosted, new_util, wake_util;
+	struct sched_domain *sd;
+	const struct sched_group_energy *sge;
+	int prev_cpu = task_cpu(p);
 	task_util_boosted = boosted_task_util(p);
 	/* Find cpu with sufficient capacity */
 	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg)) {
+
+		wake_util = cpu_util(i);
+
 		/*
 		 * p's blocked utilization is still accounted for on prev_cpu
 		 * so prev_cpu will receive a negative bias due to the double
 		 * accounting. However, the blocked utilization may be zero.
 		 */
-		new_util = cpu_util(i) + task_util_boosted;
+		new_util = wake_util + task_util_boosted;
 		/*
 		 * Ensure minimum capacity to grant the required boost.
@@ -6342,15 +6349,35 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 		if (new_util > capacity_orig_of(i))
 			continue;
-		if (new_util < capacity_curr_of(i)) {
-			cpu = i;
-			if (cpu_rq(i)->nr_running)
-				break;
-		}
+		/*
+		 * According to waken up task and CPU utilization, predict
+		 * the CPU OPP. So select CPU with two criterias from power
+		 * saving perspective:
+		 *
+		 *   - CPU can stay at lowest OPP as possible;
+		 *   - For same OPP, CPU has highest utilization so bias to
+		 *     pack tasks as possible.
+		 */
+		sd = rcu_dereference(per_cpu(sd_ea, i));
+		sge = sd->groups->sge;
+		idx = find_new_capacity(sge, new_util);
+
+		/* Select cpu with possible lower OPP */
+		if (cap_idx > idx) {
-		/* cpu has capacity at higher OPP, keep it as fallback */
-		if (cpu == task_cpu(p))
+			cap_idx = idx;
 			cpu = i;
+
+		/* Optimization for CPUs with same OPP */
+		} else if (cap_idx == idx) {
+
+			if (cpu == prev_cpu)
+				continue;
+
+			/* Keep previous CPU and pack tasks if possible */
+			if (i == prev_cpu || wake_util > cpu_util(cpu))
+				cpu = i;
+		}
 	}
 	return cpu;
--
1.9.1
In the old code the energy calculation results are right shifted by SCHED_CAPACITY_SHIFT, which essentially decreases the resolution of the calculation. After applying the patch "sched/fair: select candidate CPUs by cluster basis" there are many energy comparisons between one LITTLE core and one big core, and many of them wrongly place the task on the big core; this is caused by the resolution issue.
Let's use the Juno-r2 modeling parameters as an example:
        Capacity  Power
  CA53  267       37
  CA72  501       174
Below is one example of the task's and the CPUs' utilization:
  task_util(p) = 2

  cpu_util(src_cpu::ca53) = 805;    cpu_util(src_cpu::ca53)` = 797;
  cpu_util(dst_cpu::ca72) = 12;     cpu_util(dst_cpu::ca72)` = 16;
               ^                                   ^
        before task migration             after task migration
So if we compare the energy difference for moving task 'p' from 'src_cpu::ca53' to 'dst_cpu::ca72', the migration appears to save energy '1', so task 'p' was wrongly placed on the big core:
energy_delta(src_cpu::ca53) = (797 * 37) / 1024 - (805 * 37) / 1024 = -1
energy_delta(dst_cpu::ca72) = (16 * 174) / 1024 - (12 * 174) / 1024 = 0
This patch removes the right shift by SCHED_CAPACITY_SHIFT to increase the calculation resolution. With this patch applied, the case above is clearly fixed; the energy calculation becomes:
energy_delta(src_cpu::ca53) = (797 * 37) - (805 * 37) = -296
energy_delta(dst_cpu::ca72) = (16 * 174) - (12 * 174) = 696
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0486edb..d9a2969 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5487,11 +5487,9 @@ static int sched_group_energy(struct energy_env *eenv)
 		idle_idx = group_idle_state(sg);
 		group_util = group_norm_util(eenv, sg);
-		sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
-						>> SCHED_CAPACITY_SHIFT;
+		sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power);
 		sg_idle_energy = ((SCHED_LOAD_SCALE-group_util)
-						* sg->sge->idle_states[idle_idx].power)
-						>> SCHED_CAPACITY_SHIFT;
+						* sg->sge->idle_states[idle_idx].power);
 		total_energy += sg_busy_energy + sg_idle_energy;
@@ -5626,9 +5624,6 @@ normalize_energy(int energy_diff)
 	/* Do scaling using positive numbers to increase the range */
 	normalized_nrg = (energy_diff < 0) ? -energy_diff : energy_diff;
-	/* Scale by energy magnitude */
-	normalized_nrg <<= SCHED_CAPACITY_SHIFT;
-
 	/* Normalize on max energy for target platform */
 	normalized_nrg = reciprocal_divide(
 			normalized_nrg, schedtune_target_nrg.rdiv);
--
1.9.1
In the old code the scheduler uses the util_avg signal to calculate the energy difference, which has the following attribute:

The previous CPU's util_avg retains the decayed value of the woken task, so naturally we subtract the woken task's util_avg from the CPU's util_avg for the energy difference calculation. But in some cases the CPU's util_avg has decayed to 0 while the woken task's util_avg kept a big value from before it slept; the calculation then cannot reflect the energy decrease when the woken task is migrated away from this CPU.

This patch introduces the signal util_waken_avg for the CPU. It is based on Morten's patch ('sched/fair: Compute task/cpu utilization at wake-up more correctly'), so we can get every CPU's pure utilization value with the task's retained utilization completely removed; we use util_waken_avg to present this pure value. This gives a good basis for CPU utilization, so the scheduler can estimate a more accurate CPU utilization as util_waken_avg + task_util(p); this improves the correctness of the energy calculation.
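As a worked illustration (numbers assumed): a task with util_avg = 200 sleeps on CPU0 while CPU0's util_avg decays to 0. If we estimate CPU0's utilization from the raw util_avg, the "task stays" and "task migrates away" cases both read as roughly 0, so the energy saved by migrating the task away from CPU0 cannot be seen. With util_waken_avg = 0 as the pure baseline, the two cases are estimated as 0 + task_util(p) = 200 versus 0, which restores the difference.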
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 include/linux/sched.h |  2 +-
 kernel/sched/fair.c   | 27 ++++++++++++++++++---------
 2 files changed, 19 insertions(+), 10 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad2c304..5b1c7d0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1280,7 +1280,7 @@ struct load_weight {
 struct sched_avg {
 	u64 last_update_time, load_sum;
 	u32 util_sum, period_contrib;
-	unsigned long load_avg, util_avg;
+	unsigned long load_avg, util_avg, util_waken_avg;
 };
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d9a2969..6e7279c 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5326,7 +5326,7 @@ struct energy_env {
  */
 static unsigned long __cpu_norm_util(int cpu, unsigned long capacity, int delta)
 {
-	int util = __cpu_util(cpu, delta);
+	int util = cpu_rq(cpu)->cfs.avg.util_waken_avg + delta;
 	if (util >= capacity)
 		return SCHED_CAPACITY_SCALE;
@@ -5334,12 +5334,14 @@ static unsigned long __cpu_norm_util(int cpu, unsigned long capacity, int delta)
 	return (util << SCHED_CAPACITY_SHIFT)/capacity;
 }
+static inline unsigned long task_util(struct task_struct *p);
+
 static int calc_util_delta(struct energy_env *eenv, int cpu)
 {
-	if (cpu == eenv->src_cpu)
-		return -eenv->util_delta;
-	if (cpu == eenv->dst_cpu)
-		return eenv->util_delta;
+	if (cpu == eenv->src_cpu && !eenv->util_delta)
+		return task_util(eenv->task);
+	if (cpu == eenv->dst_cpu && eenv->util_delta)
+		return task_util(eenv->task);
 	return 0;
 }
@@ -5351,7 +5353,7 @@ unsigned long group_max_util(struct energy_env *eenv)
 	for_each_cpu(i, sched_group_cpus(eenv->sg_cap)) {
 		delta = calc_util_delta(eenv, i);
-		max_util = max(max_util, __cpu_util(i, delta));
+		max_util = max(max_util, cpu_rq(i)->cfs.avg.util_waken_avg + delta);
 	}
 	return max_util;
@@ -6325,9 +6327,15 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 	task_util_boosted = boosted_task_util(p);
 	/* Find cpu with sufficient capacity */
-	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg)) {
+	for_each_cpu(i, sched_group_cpus(sg)) {
+
+		wake_util = cpu_util_wake(i, p);
-		wake_util = cpu_util(i);
+		/* update waken avg */
+		cpu_rq(i)->cfs.avg.util_waken_avg = wake_util;
+
+		if (unlikely(!cpumask_test_cpu(i, tsk_cpus_allowed(p))))
+			continue;
 		/*
 		 * p's blocked utilization is still accounted for on prev_cpu
@@ -6370,7 +6378,8 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 				continue;
 			/* Keep previous CPU and pack tasks if possible */
-			if (i == prev_cpu || wake_util > cpu_util(cpu))
+			if (i == prev_cpu ||
+			    wake_util > cpu_rq(cpu)->cfs.avg.util_waken_avg)
 				cpu = i;
 		}
 	}
--
1.9.1
In the task wakeup path, select an idle CPU as a backup for the two cases below:

- If the cluster has an idle CPU but none of its CPUs can meet the task's capacity requirement, then obviously we should fall back to the idle CPU;

- If the CPU is not staying at the lowest OPP, we should spread tasks as far as possible to give more chance for the OPP to decrease, and to avoid overly long scheduling latency from packing tasks onto the same CPU.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e7279c..9370b5b 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6317,12 +6317,13 @@ static inline int find_idlest_target(struct task_struct *p, bool boosted,
 static int energy_aware_select_candidate_cpu(struct task_struct *p,
 				struct sched_group *sg)
 {
-	int i, cpu = -1;
+	int i, cpu = -1, best_idle_cpu = -1;
 	int cap_idx = INT_MAX, idx;
 	unsigned long task_util_boosted, new_util, wake_util;
 	struct sched_domain *sd;
 	const struct sched_group_energy *sge;
 	int prev_cpu = task_cpu(p);
+	unsigned long best_idle_cpu_util = ULONG_MAX;
 	task_util_boosted = boosted_task_util(p);
@@ -6338,6 +6339,14 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 			continue;
 		/*
+		 * Find idle CPU as backup and bias to most recent sleep one
+		 */
+		if (idle_cpu(i) && wake_util < best_idle_cpu_util) {
+			best_idle_cpu = i;
+			best_idle_cpu_util = wake_util;
+		}
+
+		/*
 		 * p's blocked utilization is still accounted for on prev_cpu
 		 * so prev_cpu will receive a negative bias due to the double
 		 * accounting. However, the blocked utilization may be zero.
@@ -6352,6 +6361,7 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 		if (new_util > capacity_orig_of(i))
 			continue;
+
 		/*
 		 * According to waken up task and CPU utilization, predict
 		 * the CPU OPP. So select CPU with two criterias from power
@@ -6379,11 +6389,28 @@ static int energy_aware_select_candidate_cpu(struct task_struct *p,
 			/* Keep previous CPU and pack tasks if possible */
 			if (i == prev_cpu ||
-			    wake_util > cpu_rq(cpu)->cfs.avg.util_waken_avg)
+			    wake_util > cpu_rq(cpu)->cfs.avg.util_waken_avg) {
 				cpu = i;
+			}
 		}
 	}
+	/* directly return if has not found any idle CPU */
+	if (best_idle_cpu == -1)
+		return cpu;
+
+	/*
+	 * Fallback to idle CPU for two cases:
+	 *
+	 * - Have not found proper target CPU but have one idle CPU;
+	 * - The target CPU is possible to increase OPP after migrate task
+	 *   on it but have one backup idle CPU;
+	 */
+	if (cpu == -1)
+		cpu = best_idle_cpu;
+	else if (!idle_cpu(cpu) && cap_idx > 0)
+		cpu = best_idle_cpu;
+
 	return cpu;
 }
--
1.9.1
In the previous code the energy calculation is CPU focused: the main idea is to calculate how much power is consumed by the CPUs before and after the task migration. This idea inherently binds the energy comparison to a new selected CPU and the previous CPU, but in some cases the previous CPU is not an ideal CPU for the task; another shortcoming is that we never get to know how much power the task consumes when it is placed on one specific CPU.

The more intuitive method is to make the energy calculation task oriented: calculate the power consumption of the woken task on every possible CPU and select the CPU with the lowest power consumption.

This patch reworks the energy calculation to follow the task oriented idea. To achieve this it introduces a new struct task_energy to calculate the task's energy when placed on a specific CPU; struct energy_env is still used to maintain the energy comparison context between different CPUs, and can still be used for the PE filter.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 239 ++++++++++++++++++++++++++++------------------------
 1 file changed, 128 insertions(+), 111 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9370b5b..6833524 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5285,31 +5285,41 @@ static inline bool energy_aware(void)
 }
 struct energy_env {
-	cpumask_t search_cpus;
-	int target_cpu;
+	cpumask_t search_cpus;		/* possible CPUs */
+	int cpu_best;			/* best CPU */
+	int cpu_comp;			/* compared CPU */
+
+	struct task_struct *task;	/* waken task */
+	int task_util;			/* waken task util */
-	struct sched_group *sg_top;
-	struct sched_group *sg_cap;
-	int cap_idx;
-	int util_delta;
-	int src_cpu;
-	int dst_cpu;
-	int energy;
 	int payoff;
-	struct task_struct *task;
+
 	struct {
-		int before;
-		int after;
+		int best;
+		int comp;
 		int delta;
 		int diff;
 	} nrg;
+
 	struct {
-		int before;
-		int after;
+		int best;
+		int comp;
 		int delta;
 	} cap;
 };
+struct task_energy {
+	int cpu;			/* CPU */
+	struct task_struct *task;	/* waken task */
+	int task_util;			/* waken task util */
+
+	struct sched_group *sg_top;
+	struct sched_group *sg_cap;
+	int cap_idx;
+	int cap;
+	int nrg;
+};
+
 /*
  * __cpu_norm_util() returns the cpu util relative to a specific capacity,
  * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
@@ -5336,23 +5346,22 @@ static unsigned long __cpu_norm_util(int cpu, unsigned long capacity, int delta)
 static inline unsigned long task_util(struct task_struct *p);
-static int calc_util_delta(struct energy_env *eenv, int cpu)
+static int calc_util_delta(struct task_energy *tsk_nrg, int cpu)
 {
-	if (cpu == eenv->src_cpu && !eenv->util_delta)
-		return task_util(eenv->task);
-	if (cpu == eenv->dst_cpu && eenv->util_delta)
-		return task_util(eenv->task);
+	if (cpu == tsk_nrg->cpu)
+		return tsk_nrg->task_util;
+
 	return 0;
 }
 static
-unsigned long group_max_util(struct energy_env *eenv)
+unsigned long group_max_util(struct task_energy *tsk_nrg)
 {
 	int i, delta;
 	unsigned long max_util = 0;
-	for_each_cpu(i, sched_group_cpus(eenv->sg_cap)) {
-		delta = calc_util_delta(eenv, i);
+	for_each_cpu(i, sched_group_cpus(tsk_nrg->sg_cap)) {
+		delta = calc_util_delta(tsk_nrg, i);
 		max_util = max(max_util, cpu_rq(i)->cfs.avg.util_waken_avg + delta);
 	}
@@ -5369,14 +5378,14 @@ unsigned long group_max_util(struct energy_env *eenv)
  * estimate (more busy).
  */
 static unsigned
-long group_norm_util(struct energy_env *eenv, struct sched_group *sg)
+long group_norm_util(struct task_energy *tsk_nrg, struct sched_group *sg)
 {
 	int i, delta;
 	unsigned long util_sum = 0;
-	unsigned long capacity = sg->sge->cap_states[eenv->cap_idx].cap;
+	unsigned long capacity = sg->sge->cap_states[tsk_nrg->cap_idx].cap;
 	for_each_cpu(i, sched_group_cpus(sg)) {
-		delta = calc_util_delta(eenv, i);
+		delta = calc_util_delta(tsk_nrg, i);
 		util_sum += __cpu_norm_util(i, capacity, delta);
 	}
@@ -5427,16 +5436,16 @@ static int group_idle_state(struct sched_group *sg)
  * This can probably be done in a faster but more complex way.
  * Note: sched_group_energy() may fail when racing with sched_domain updates.
  */
-static int sched_group_energy(struct energy_env *eenv)
+static int sched_group_energy(struct task_energy *tsk_nrg)
 {
 	struct sched_domain *sd;
 	int cpu, total_energy = 0;
 	struct cpumask visit_cpus;
 	struct sched_group *sg;
-	WARN_ON(!eenv->sg_top->sge);
+	WARN_ON(!tsk_nrg->sg_top->sge);
-	cpumask_copy(&visit_cpus, sched_group_cpus(eenv->sg_top));
+	cpumask_copy(&visit_cpus, sched_group_cpus(tsk_nrg->sg_top));
 	while (!cpumask_empty(&visit_cpus)) {
 		struct sched_group *sg_shared_cap = NULL;
@@ -5465,30 +5474,21 @@ static int sched_group_energy(struct energy_env *eenv)
 			int cap_idx, idle_idx;
 			if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
-				eenv->sg_cap = sg_shared_cap;
+				tsk_nrg->sg_cap = sg_shared_cap;
 			else
-				eenv->sg_cap = sg;
+				tsk_nrg->sg_cap = sg;
-			cap_idx = find_new_capacity(sg->sge, group_max_util(eenv));
-			eenv->cap_idx = cap_idx;
+			cap_idx = find_new_capacity(sg->sge, group_max_util(tsk_nrg));
+			tsk_nrg->cap_idx = cap_idx;
 			if (sg->group_weight == 1) {
-				/* Remove capacity of src CPU (before task move) */
-				if (eenv->util_delta == 0 &&
-				    cpumask_test_cpu(eenv->src_cpu, sched_group_cpus(sg))) {
-					eenv->cap.before = sg->sge->cap_states[cap_idx].cap;
-					eenv->cap.delta -= eenv->cap.before;
-				}
-				/* Add capacity of dst CPU (after task move) */
-				if (eenv->util_delta != 0 &&
-				    cpumask_test_cpu(eenv->dst_cpu, sched_group_cpus(sg))) {
-					eenv->cap.after = sg->sge->cap_states[cap_idx].cap;
-					eenv->cap.delta += eenv->cap.after;
+				if (cpumask_test_cpu(tsk_nrg->cpu, sched_group_cpus(sg))) {
+					tsk_nrg->cap = sg->sge->cap_states[cap_idx].cap;
 				}
 			}
 			idle_idx = group_idle_state(sg);
-			group_util = group_norm_util(eenv, sg);
+			group_util = group_norm_util(tsk_nrg, sg);
 			sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power);
 			sg_idle_energy = ((SCHED_LOAD_SCALE-group_util)
 						* sg->sge->idle_states[idle_idx].power);
@@ -5498,7 +5498,7 @@ static int sched_group_energy(struct energy_env *eenv)
 			if (!sd->child)
 				cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
-			if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(eenv->sg_top)))
+			if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(tsk_nrg->sg_top)))
 				goto next_cpu;
 		} while (sg = sg->next, sg != sd->groups);
@@ -5516,7 +5516,7 @@ next_cpu:
 		continue;
 	}
-	eenv->energy = total_energy;
+	tsk_nrg->nrg = total_energy;
 	return 0;
 }
@@ -5532,25 +5532,25 @@ static inline bool cpu_in_sg(struct sched_group *sg, int cpu)
  * utilization is removed from or added to the system (e.g. task wake-up). If
  * both are specified, the utilization is migrated.
  */
-static inline int __energy_diff(struct energy_env *eenv)
+static inline int task_energy(struct energy_env *eenv)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int sd_cpu = -1, energy_before = 0, energy_after = 0;
-	int diff, margin;
-
-	struct energy_env eenv_before = {
-		.util_delta	= 0,
-		.src_cpu	= eenv->src_cpu,
-		.dst_cpu	= eenv->dst_cpu,
-		.nrg		= { 0, 0, 0, 0},
-		.cap		= { 0, 0, 0 },
+	int sd_cpu = -1;
+
+	struct task_energy tsk_nrg = {
+		.cpu		= eenv->cpu_comp,
+		.task		= eenv->task,
+		.task_util	= 0,
 	};
-	if (eenv->src_cpu == eenv->dst_cpu)
-		return 0;
+	struct task_energy tsk_nrg_after = {
+		.cpu		= eenv->cpu_comp,
+		.task		= eenv->task,
+		.task_util	= eenv->task_util,
+	};
-	sd_cpu = (eenv->src_cpu != -1) ? eenv->src_cpu : eenv->dst_cpu;
+	sd_cpu = eenv->cpu_comp;
 	sd = rcu_dereference(per_cpu(sd_ea, sd_cpu));
 	if (!sd)
@@ -5559,39 +5559,23 @@ static inline int __energy_diff(struct energy_env *eenv)
 	sg = sd->groups;
 	do {
-		if (cpu_in_sg(sg, eenv->src_cpu) || cpu_in_sg(sg, eenv->dst_cpu)) {
-			eenv_before.sg_top = eenv->sg_top = sg;
-
-			if (sched_group_energy(&eenv_before))
-				return 0; /* Invalid result abort */
-			energy_before += eenv_before.energy;
-
-			/* Keep track of SRC cpu (before) capacity */
-			eenv->cap.before = eenv_before.cap.before;
-			eenv->cap.delta = eenv_before.cap.delta;
-
-			if (sched_group_energy(eenv))
-				return 0; /* Invalid result abort */
-			energy_after += eenv->energy;
+		if (cpu_in_sg(sg, tsk_nrg.cpu)) {
+			tsk_nrg.sg_top = sg;
+			tsk_nrg_after.sg_top = sg;
+			break;
 		}
 	} while (sg = sg->next, sg != sd->groups);
-	eenv->nrg.before = energy_before;
-	eenv->nrg.after = energy_after;
-	eenv->nrg.diff = eenv->nrg.after - eenv->nrg.before;
-	eenv->payoff = 0;
+	if (sched_group_energy(&tsk_nrg))
+		return 0; /* Invalid result abort */
-	/*
-	 * Dead-zone margin preventing too many migrations.
-	 */
-
-	margin = eenv->nrg.before >> 6; /* ~1.56% */
-
-	diff = eenv->nrg.after - eenv->nrg.before;
+	if (sched_group_energy(&tsk_nrg_after))
+		return 0; /* Invalid result abort */
-	eenv->nrg.diff = (abs(diff) < margin) ? 0 : eenv->nrg.diff;
+	eenv->nrg.comp = tsk_nrg_after.nrg - tsk_nrg.nrg;
+	eenv->cap.comp = tsk_nrg_after.cap;
-	return eenv->nrg.diff;
+	return 0;
 }
 #ifdef CONFIG_SCHED_TUNE
@@ -5633,18 +5617,18 @@ normalize_energy(int energy_diff)
 	return (energy_diff < 0) ? -normalized_nrg : normalized_nrg;
 }
-static inline int
-energy_diff(struct energy_env *eenv)
+static inline int task_energy_diff(struct energy_env *eenv)
 {
 	int boost = schedtune_task_boost(eenv->task);
-	int nrg_delta, ret;
+	int nrg_delta, diff;
 	/* Conpute "absolute" energy diff */
-	__energy_diff(eenv);
+	eenv->nrg.diff = eenv->nrg.comp - eenv->nrg.best;
+	eenv->cap.delta = eenv->cap.comp - eenv->cap.best;
 	/* Return energy diff when boost margin is 0 */
 	if (boost == 0) {
-		ret = eenv->nrg.diff;
+		diff = eenv->nrg.diff;
 		goto out;
 	}
@@ -5665,18 +5649,34 @@ energy_diff(struct energy_env *eenv)
 	 * positive payoff, which is the condition for the acceptance of
 	 * a scheduling decision
 	 */
-	ret = -eenv->payoff;
+	diff = -eenv->payoff;
 out:
 	trace_sched_energy_diff(eenv->task,
-			eenv->src_cpu, eenv->dst_cpu, eenv->util_delta,
-			eenv->nrg.before, eenv->nrg.after, eenv->nrg.diff,
-			eenv->cap.before, eenv->cap.after, eenv->cap.delta,
-			eenv->nrg.delta, eenv->payoff);
+			eenv->cpu_best, eenv->cpu_comp, eenv->task_util,
+			eenv->nrg.best, eenv->nrg.comp, eenv->nrg.diff,
+			eenv->cap.best, eenv->cap.comp, eenv->cap.delta,
+			eenv->nrg.delta, eenv->payoff);
-	return ret;
+	return diff;
 }
 #else /* CONFIG_SCHED_TUNE */
+
+static inline int task_energy_diff(struct energy_env *eenv)
+{
+	/* Conpute "absolute" energy diff */
+	eenv->nrg.diff = eenv->nrg.comp - eenv->nrg.best;
+	eenv->cap.delta = eenv->cap.comp - eenv->cap.best;
+
+	trace_sched_energy_diff(eenv->task,
+			eenv->cpu_best, eenv->cpu_comp, eenv->task_util,
+			eenv->nrg.best, eenv->nrg.comp, eenv->nrg.diff,
+			eenv->cap.best, eenv->cap.comp, eenv->cap.delta,
+			eenv->nrg.delta, eenv->payoff);
+
+	return eenv->nrg.diff;
+}
+
 #define energy_diff(eenv) __energy_diff(eenv)
 #endif
@@ -6527,39 +6527,56 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu, int sync
 		find_nrg_efficient_target(p, sd, &eenv);
 	}
-	eenv.target_cpu = -1;
+	eenv.cpu_best = -1;
+	eenv.cpu_comp = -1;
+	eenv.task = p;
+	eenv.task_util = task_util(p);
+	eenv.payoff = 0;
+
+	/* directly return for only one CPU case */
+	if (cpumask_weight(&eenv.search_cpus) == 1) {
+		target_cpu = cpumask_first(&eenv.search_cpus);
+		goto unlock;
+	}
 	for_each_cpu(cpu, &eenv.search_cpus) {
-		if (eenv.target_cpu == -1) {
-			eenv.target_cpu = cpu;
+		if (eenv.cpu_best == -1) {
+			eenv.cpu_best = cpu;
+			eenv.cpu_comp = cpu;
+
+			task_energy(&eenv);
+
+			/* init energy data */
+			eenv.nrg.best = eenv.nrg.comp;
+			eenv.cap.best = eenv.cap.comp;
 			continue;
 		}
 		if (unlikely(!task_util(p))) {
-			if (capacity_orig_of(cpu) < capacity_orig_of(eenv.target_cpu))
-				eenv.target_cpu = cpu;
+			if (capacity_orig_of(cpu) < capacity_orig_of(eenv.cpu_best))
+				eenv.cpu_best = cpu;
 			continue;
 		}
-		eenv.util_delta = task_util(p);
-		eenv.src_cpu = eenv.target_cpu;
-		eenv.dst_cpu = cpu;
-		eenv.task = p;
-
-		if (energy_diff(&eenv) < 0)
-			eenv.target_cpu = cpu;
+		eenv.cpu_comp = cpu;
+		task_energy(&eenv);
+		if (task_energy_diff(&eenv) < 0) {
+			eenv.cpu_best = cpu;
+			eenv.nrg.best = eenv.nrg.comp;
+			eenv.cap.best = eenv.cap.comp;
+		}
 	}
-	if (eenv.target_cpu == -1) {
+	if (eenv.cpu_best == -1) {
 		schedstat_inc(p, se.statistics.nr_wakeups_secb_no_nrg_sav);
 		schedstat_inc(this_rq(), eas_stats.secb_no_nrg_sav);
 		target_cpu = prev_cpu;
 	} else {
 		schedstat_inc(p, se.statistics.nr_wakeups_secb_nrg_sav);
 		schedstat_inc(this_rq(), eas_stats.secb_nrg_sav);
-		target_cpu = eenv.target_cpu;
+		target_cpu = eenv.cpu_best;
 	}
 	schedstat_inc(p, se.statistics.nr_wakeups_secb_count);
--
1.9.1
From: Thara Gopinath <thara.gopinath@linaro.org>
The current implementation of over-utilization aborts energy aware scheduling if any CPU in the system is over-utilized. This patch introduces an over-utilization flag per sched group level instead of a single system-wide flag. Load balancing is done at a sched domain where any of its sched groups is over-utilized. If energy aware scheduling is enabled and no sched group in a sched domain is over-utilized, load balancing is skipped for that sched domain and energy aware scheduling continues at that level.
The implementation is based on two points:

1. For every CPU, in every sched domain, the first group is the group
   that contains the CPU itself;
2. Sched groups are shared between CPUs.

Thus if a sched group finds it needs to spread tasks, it should set the corresponding overutilized flag properly. There are three kinds of overutilized flag to consider:
- Inner overutilized: if one CPU wants to spread tasks within the same cluster, the overutilized flag is set on the first sched group of the lowest sched domain. This flag indicates that task spreading is required from this CPU, and asks the other CPUs in the lowest sched domain to take over possible tasks from it;

- Outer overutilized: if one CPU wants to spread tasks to another cluster, the overutilized flag is set on the first sched group of the parent sched domain. This ensures load balancing at the overutilized sched domain level; it means the CPU is seeking help from another cluster, so CPUs in the other cluster can migrate tasks to improve performance;

- Global overutilized: if the whole system is busy, we set the root domain flag to bypass energy aware scheduling and go back entirely to the traditional load balance. This exploits overall performance by spreading tasks as much as possible.
For example, consider a big.LITTLE system with two LITTLE CPUs (CPU A and CPU B) and two big CPUs (CPU C and CPU D). In this system the hierarchy will be as follows:
CPU A
  SD level 1 - SG1(CPUA), SG2(CPUB)
  SD level 2 - SG5(CPUA, CPUB), SG6(CPUC, CPUD)
  RD

CPU B
  SD level 1 - SG2(CPUB), SG1(CPUA)
  SD level 2 - SG5(CPUA, CPUB), SG6(CPUC, CPUD)
  RD

CPU C
  SD level 1 - SG3(CPUC), SG4(CPUD)
  SD level 2 - SG6(CPUC, CPUD), SG5(CPUA, CPUB)
  RD

CPU D
  SD level 1 - SG4(CPUD), SG3(CPUC)
  SD level 2 - SG6(CPUC, CPUD), SG5(CPUA, CPUB)
  RD
In the above system, if CPU A is not running at the lowest OPP, the overutilized flag is set on SG1 so the scheduler can load balance between CPU A and CPU B.

If CPU A is overutilized or has a misfit task, the overutilized flag is set on SG5 (the first sched group of the parent sched domain). During load balancing at SD level 2, the scheduler iterates over all sched groups' overutilized flags, and if any flag is set it executes load balancing in this sched domain.

If the system's overall utilization is bigger than 50% of the overall CPU capacity, the flag is set/checked at the root domain; this means the overall utilization has crossed at least one cluster's capacity.
[ Changed by Leo to support discrete flags for inner/outer/global over-utilization ]
Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c  | 145 ++++++++++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |   1 +
 2 files changed, 122 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6833524..2a263f7 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4658,6 +4658,68 @@ static inline void hrtick_update(struct rq *rq)
 #ifdef CONFIG_SMP
 static bool cpu_overutilized(int cpu);
 unsigned long boosted_cpu_util(int cpu);
+
+/*
+ * 1. Inner overutilized:
+ *
+ *    Load balancing will happen only at SD level 1, so this only
+ *    takes effect inside the cluster.
+ *
+ * 2. Outer overutilized:
+ *
+ *    If the CPU has a misfit task on it, there is no doubt we should
+ *    migrate the task to another higher capacity CPU.
+ *
+ *    Or, if one CPU is overutilized, we assume the scheduler has by
+ *    now done good enough work to explore the cluster's internal
+ *    capacity; an overutilized CPU means we finally need to seek
+ *    another cluster to provide more computing capacity.
+ *
+ * 3. Global overutilized:
+ *
+ *    Setting the root domain flag means exploring performance as much
+ *    as possible by spreading out tasks.
+ */
+static void set_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->groups->overutilized = true;
+}
+
+static void clear_sd_overutilized(struct sched_domain *sd)
+{
+	if (sd)
+		sd->groups->overutilized = false;
+}
+
+static void set_rd_overutilized(struct root_domain *rd)
+{
+	rd->overutilized = true;
+}
+
+static void clear_rd_overutilized(struct root_domain *rd)
+{
+	rd->overutilized = false;
+}
+
+static bool is_sd_overutilized(struct sched_domain *sd)
+{
+	struct sched_group *group = sd->groups;
+	int cpu = smp_processor_id();
+
+	if (cpu_rq(cpu)->rd->overutilized)
+		return true;
+
+	do {
+		if (group->overutilized)
+			return true;
+
+	} while (group = group->next, group != sd->groups);
+
+	return false;
+}
+
 #else
 #define boosted_cpu_util(cpu) cpu_util(cpu)
 #endif
@@ -4686,6 +4748,7 @@ static void
 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	struct cfs_rq *cfs_rq;
+	struct sched_domain *sd;
 	struct sched_entity *se = &p->se;
 #ifdef CONFIG_SMP
 	int task_new = flags & ENQUEUE_WAKEUP_NEW;
@@ -4758,11 +4821,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!se) {
 		walt_inc_cumulative_runnable_avg(rq, p);
-		if (!task_new && !rq->rd->overutilized &&
-		    cpu_overutilized(rq->cpu)) {
-			rq->rd->overutilized = true;
-			trace_sched_overutilized(true);
+
+		rcu_read_lock();
+		sd = rcu_dereference(rq->sd);
+		if (!task_new) {
+			if (cpu_overutilized(rq->cpu) && sd)
+				set_sd_overutilized(sd);
+
+			if (rq->misfit_task && sd && sd->parent)
+				set_sd_overutilized(sd->parent);
 		}
+		rcu_read_unlock();
 		/*
 		 * We want to potentially trigger a freq switch
@@ -7754,6 +7823,7 @@ struct sd_lb_stats {
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
+	unsigned long total_util;	/* Total util of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7773,6 +7843,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.local = NULL,
 		.total_load = 0UL,
 		.total_capacity = 0UL,
+		.total_util = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
 			.sum_nr_running = 0,
@@ -8100,10 +8171,11 @@ group_type group_classify(struct sched_group *group,
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs,
-			bool *overload, bool *overutilized)
+			bool *overload, bool *misfit)
 {
 	unsigned long load;
 	int i, nr_running;
+	bool overutilized = false;
memset(sgs, 0, sizeof(*sgs));
@@ -8136,7 +8208,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->idle_cpus++;
 		if (cpu_overutilized(i)) {
-			*overutilized = true;
+			overutilized = true;
 			if (!sgs->group_misfit_task && rq->misfit_task)
 				sgs->group_misfit_task = capacity_of(i);
 		}
@@ -8153,6 +8225,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	sgs->group_no_capacity = group_is_overloaded(env, sgs);
 	sgs->group_type = group_classify(group, sgs);
+
+	if (sgs->group_weight == 1)
+		group->overutilized = overutilized;
 }
 /**
@@ -8270,7 +8345,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
-	bool overload = false, overutilized = false;
+	bool overload = false, misfit = false;
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
@@ -8292,7 +8367,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-						&overload, &overutilized);
+						&overload, &misfit);
 		if (local_group)
 			goto next_group;
@@ -8332,6 +8407,7 @@ next_group:
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
+		sds->total_util += sgs->group_util;
 		sg = sg->next;
 	} while (sg != env->sd->groups);
@@ -8346,18 +8422,28 @@ next_group:
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
-		/* Update over-utilization (tipping point, U >= 0) indicator */
-		if (env->dst_rq->rd->overutilized != overutilized) {
-			env->dst_rq->rd->overutilized = overutilized;
-			trace_sched_overutilized(overutilized);
-		}
+		/*
+		 * If total utilization is more than half of capacity,
+		 * at least the average CPU utilization is crossing half
+		 * of the max capacity CPU; so this is a quite high bar
+		 * for setting the root domain's overutilized flag.
+		 */
+		if (sds->total_capacity < sds->total_util * 2)
+			set_rd_overutilized(env->dst_rq->rd);
+		else
+			clear_rd_overutilized(env->dst_rq->rd);
 	} else {
-		if (!env->dst_rq->rd->overutilized && overutilized) {
-			env->dst_rq->rd->overutilized = true;
-			trace_sched_overutilized(true);
-		}
+		/*
+		 * If the domain util is greater than the domain
+		 * capacity, load balancing needs to be done at the
+		 * next sched domain level as well.
+		 */
+		if ((sds->total_capacity * 1024 <
+		     sds->total_util * capacity_margin) || misfit)
+			set_sd_overutilized(env->sd->parent);
+		else
+			clear_sd_overutilized(env->sd->parent);
 	}
-
 }
 /**
@@ -8600,7 +8686,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);
-	if (energy_aware() && !env->dst_rq->rd->overutilized)
+	if (energy_aware() && !is_sd_overutilized(env->sd))
 		goto out_balanced;
 	local = &sds.local_stat;
@@ -9514,6 +9600,10 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+
+		if (energy_aware() && !is_sd_overutilized(sd))
+			continue;
+
 		/*
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains. Decay ~1% per second.
@@ -9805,6 +9895,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	struct sched_domain *sd;
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -9815,12 +9906,18 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		task_tick_numa(rq, curr);
 #ifdef CONFIG_SMP
-	if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr))) {
-		rq->rd->overutilized = true;
-		trace_sched_overutilized(true);
-	}
-
 	rq->misfit_task = !task_fits_max(curr, rq->cpu);
+
+	rcu_read_lock();
+	sd = rcu_dereference(rq->sd);
+
+	if (cpu_overutilized(task_cpu(curr)) && sd)
+		set_sd_overutilized(sd);
+
+	if (rq->misfit_task && sd && sd->parent)
+		set_sd_overutilized(sd->parent);
+
+	rcu_read_unlock();
 #endif
 }

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ce364dd..c1b03a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -925,6 +925,7 @@ struct sched_group {
 	unsigned int group_weight;
 	struct sched_group_capacity *sgc;
 	const struct sched_group_energy *sge;
+	bool overutilized;
 	/*
 	 * The CPUs this group covers.
--
1.9.1
An idle CPU keeps a stale utilization value, and this value is not updated until the CPU wakes up. In the worst case, an idle CPU may stay in idle states for a very long time (possibly seconds); if the CPU had quite high utilization before entering idle, the scheduler will keep considering it "overutilized".
This is a defect in the scheduler load metrics, and as a result it misleads the scheduler's tipping point decisions. E.g., the scheduler calls update_sg_lb_stats() to iterate over all CPUs to make sure none is overutilized, and then clears the flag to indicate the system is under the tipping point; if any idle CPU has a stale utilization value that unfortunately reads as "overutilized", update_sg_lb_stats() will wrongly consider the system to be over the tipping point, even though the idle CPU has been in idle states for a long time.
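For scale, recall PELT decays utilization with a half-life of 32ms (y^32 = 0.5). A userspace back-of-envelope sketch of how far the stale value drifts from reality (decayed_util() is a hypothetical helper, not kernel code):

#include <math.h>
#include <stdio.h>

/* utilization remaining after 'ms' idle milliseconds under PELT
 * decay, assuming the classic half-life of 32ms (y^32 = 0.5) */
static unsigned long decayed_util(unsigned long util, unsigned int ms)
{
	return (unsigned long)(util * pow(0.5, ms / 32.0));
}

int main(void)
{
	/* a CPU that entered idle at util 800 should read ~91 after
	 * 100ms idle -- far below any overutilized bar -- but without
	 * decay the scheduler keeps seeing the stale 800 */
	printf("%lu\n", decayed_util(800, 100));
	return 0;
}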
So essentially we need a proper method to decay idle CPU utilization. One possible method is to wake up idle CPUs on the scheduler tick so they exit idle states, update their own utilization values and sleep again if there is no task on them; but this method is suboptimal and potentially harms energy if CPUs enter and exit idle states merely to decay load metrics.
This patch instead uses load balancing as a good occasion to decay idle CPUs' blocked load, so that the system eventually gets correct load metrics for idle CPUs.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2a263f7..3278b563 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8204,9 +8204,14 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		/*
 		 * No need to call idle_cpu() if nr_running is not 0
 		 */
-		if (!nr_running && idle_cpu(i))
+		if (!nr_running && idle_cpu(i)) {
 			sgs->idle_cpus++;
+			/* update idle CPU blocked load */
+			if (cpu_util(i))
+				update_blocked_averages(i);
+		}
+
 		if (cpu_overutilized(i)) {
 			overutilized = true;
 			if (!sgs->group_misfit_task && rq->misfit_task)
--
1.9.1
Add trace event for sched group energy calculation.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 include/trace/events/sched.h | 45 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c          |  5 +++++
 2 files changed, 50 insertions(+)
 mode change 100644 => 100755 include/trace/events/sched.h
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
old mode 100644
new mode 100755
index 433d391..d002d01
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -1141,6 +1141,51 @@ TRACE_EVENT(walt_migration_update_sum,
 );
 #endif /* CONFIG_SCHED_WALT */
+/*
+ * Tracepoint for sched group energy
+ */
+TRACE_EVENT(sched_group_energy,
+
+	TP_PROTO(const struct cpumask *mask,
+		 int util_delta, int cap_idx, int idle_idx,
+		 unsigned long group_util,
+		 int sg_busy_energy, int sg_idle_energy,
+		 int total_energy),
+
+	TP_ARGS(mask, util_delta, cap_idx, idle_idx, group_util,
+		sg_busy_energy, sg_idle_energy, total_energy),
+
+	TP_STRUCT__entry(
+		__bitmask(cpumask, num_possible_cpus())
+		__field(int,		util_delta)
+		__field(int,		cap_idx)
+		__field(int,		idle_idx)
+		__field(unsigned long,	group_util)
+		__field(int,		sg_busy_energy)
+		__field(int,		sg_idle_energy)
+		__field(int,		total_energy)
+	),
+
+	TP_fast_assign(
+		__assign_bitmask(cpumask, cpumask_bits(mask),
+				 num_possible_cpus());
+		__entry->util_delta	= util_delta;
+		__entry->cap_idx	= cap_idx;
+		__entry->idle_idx	= idle_idx;
+		__entry->group_util	= group_util;
+		__entry->sg_busy_energy	= sg_busy_energy;
+		__entry->sg_idle_energy	= sg_idle_energy;
+		__entry->total_energy	= total_energy;
+	),
+
+	TP_printk("cpus=%s util_delta=%d cap_idx=%d idle_idx=%d "
+		  "group_util=%lu sg_busy_energy=%d sg_idle_energy=%d "
+		  "total_energy=%d",
+		  __get_bitmask(cpumask), __entry->util_delta,
+		  __entry->cap_idx, __entry->idle_idx,
+		  __entry->group_util, __entry->sg_busy_energy,
+		  __entry->sg_idle_energy, __entry->total_energy)
+);
+
 #endif /* CONFIG_SMP */
 #endif /* _TRACE_SCHED_H */

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3278b563..aab8c1c 100755
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5564,6 +5564,11 @@ static int sched_group_energy(struct task_energy *tsk_nrg)
total_energy += sg_busy_energy + sg_idle_energy;
+		trace_sched_group_energy(sched_group_cpus(sg),
+					 tsk_nrg->task_util, cap_idx, idle_idx,
+					 group_util, sg_busy_energy,
+					 sg_idle_energy, total_energy);
+
 		if (!sd->child)
 			cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
--
1.9.1
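Once applied, the new tracepoint can be exercised like any other sched trace event, e.g. (paths assume tracefs mounted under /sys/kernel/debug):

  # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_group_energy/enable
  # cat /sys/kernel/debug/tracing/trace_pipe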
On Mon, Jun 26, 2017 at 01:27:25PM +0800, Leo Yan wrote:
[...]
c) Tipping point optimization
The power saving optimization mainly focuses on deferring the system tipping point so the energy aware path can stay enabled in most cases; but deferring the tipping point also hurts performance if the system cannot cross the tipping point in overloaded scenarios (like benchmarks).
So the target is: optimize power without performance regression.
Patch 0011 is Thara's patch v1 "Per Sched domain over utilization"; the patch gives a good method for storing the per sched domain flag. I tweaked it with the criteria below for overutilization (see the sketch after this list):
- If a single CPU's util is more than 80% of its capacity, the lowest level sched domain is set 'overutilized'; this is the tipping point for the 'inner overutilized' flag.
- If any CPU has a 'misfit' task, or the cluster's overall util is more than 80% of the cluster's overall capacity, the parent level sched domain is set 'overutilized'; this is the tipping point for the 'outer overutilized' flag.
- If the overall util is more than 50% of the overall capacity of all CPUs, the root domain's 'overutilized' flag is set. The 50% is actually a quite high bar: e.g. with two clusters it means the overall util is beyond half of the combined capacity, i.e. it has completely exceeded one cluster's capacity, so we hit the 'global' tipping point and spread tasks across both clusters.
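As a minimal sketch, the three criteria map onto checks like these (assuming capacity_margin keeps the EAS default of 1280, so "util * 1280 > capacity * 1024" is the ~80% bar; the helper names are illustrative, not from the patches):

/* inner (single CPU) and outer (whole cluster): the ~80% bar */
static inline bool util_over_80pct(unsigned long util, unsigned long cap)
{
	return cap * 1024 < util * capacity_margin;
}

/* global (root domain): 50% of the combined capacity */
static inline bool util_over_50pct(unsigned long util, unsigned long cap)
{
	return cap < util * 2;
}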
So with the 'per sched domain flag', we can defer the 'global' tipping point and rely on it as the switch for the energy aware path. Patch 0011 also moves the energy aware function to the beginning of the wakeup path, which gives energy_aware_wake_cpu() more chances to run while the system is under the tipping point; only when the system is over the tipping point does it go back to the traditional wakeup balance to select the idlest CPU.
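The wakeup-path change then amounts to a gate like the following (a simplified sketch, not the literal diff; energy_aware_wake_cpu() and is_sd_overutilized() come from this series, while the surrounding select_task_rq_fair() plumbing is elided):

	/* early in select_task_rq_fair() */
	if (energy_aware() && !is_sd_overutilized(sd)) {
		/* under the tipping point: take the energy aware path */
		new_cpu = energy_aware_wake_cpu(p, prev_cpu, sync);
		goto unlock;
	}

	/* over the tipping point: fall through to the traditional
	 * wakeup balance and pick the idlest CPU */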
Hi Thara, Vincent,
I have seen Thara's patch v3 "Per Sched domain over utilization", but it arrived almost when I finished this round of testing.
So my patch set includes Thara's v1 patch; we can take it as a quick pilot experiment on Hikey960. After getting review comments, we can decide whether to invest more time to port and verify your v3 patch.
Thanks, Leo Yan