During OLTP workload runs, threads can end up on CPUs with a lot of
softIRQ activity, thus delaying progress. For more reliable and
faster runs, if the system can spare it, these threads should be
scheduled on CPUs with lower IRQ/RT activity.
Currently, the scheduler takes into account the original capacity of
CPUs when providing 'hints' for the select_idle_sibling code path to
return an idle CPU. However, the rest of the select_idle_* code paths
remain capacity agnostic. Further, these code paths are only aware of
the original capacity, not the capacity stolen by IRQ/RT activity.
This patch introduces capacity awareness in the scheduler (CAS), which
avoids CPUs that might have their capacities reduced (due to IRQ/RT
activity) when trying to schedule threads (on the push side) in the
system. This awareness has been added to the fair scheduling class.
It does so using the following algorithm:
1) The scaled capacities are already calculated, as with rt_avg.
2) Any CPU which is running below 80% capacity is considered running low
on capacity[*].
3) During idle CPU search if a CPU is found running low on capacity, it
is skipped if better CPUs are available.
4) If none of the CPUs are better in terms of idleness and capacity, then
the low-capacity CPU is considered to be the best available CPU.
The performance numbers:
---------------------------------------------------------------------------
CAS shows up to 1.5% improvement on x86 when running a 'SELECT' database
workload.
I also used barrier.c (OpenMP code) as a micro-benchmark. It does a number
of iterations with a barrier sync at the end of each for loop.
I was also running ping on CPU 0 as:
'ping -l 10000 -q -s 10 -f host2'
The results below should be read as:
* 'Baseline without ping' is how the workload would've behaved if there
was no IRQ activity.
* Compare 'Baseline with ping' and 'Baseline without ping' to see the
effect of ping
* Compare 'Baseline with ping' and 'CAS with ping' to see the improvement
CAS can give over baseline
The program (barrier.c) can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html
Following are the results for the iterations per second with this
micro-benchmark (higher is better), on a 20 core x86 machine:
+-------+----------------+----------------+------------------+
|Num. |CAS |Baseline |Baseline without |
|Threads|with ping |with ping |ping |
+-------+-------+--------+-------+--------+-------+----------+
| |Mean |Std. Dev|Mean |Std. Dev|Mean |Std. Dev |
+-------+-------+--------+-------+--------+-------+----------+
|1 | 511.7 | 6.9 | 508.3 | 17.3 | 514.6 | 4.7 |
|2 | 486.8 | 16.3 | 463.9 | 17.4 | 510.8 | 3.9 |
|4 | 466.1 | 11.7 | 451.4 | 12.5 | 489.3 | 4.1 |
|8 | 433.6 | 3.7 | 427.5 | 2.2 | 447.6 | 5.0 |
|16 | 391.9 | 7.9 | 385.5 | 16.4 | 396.2 | 0.3 |
|32 | 269.3 | 5.3 | 266.0 | 6.6 | 276.8 | 0.2 |
+-------+-------+--------+-------+--------+-------+----------+
Following are the runtime(s) with hackbench and ping activity as
described above (lower is better), on a 20 core x86 machine:
+---------------+------+--------+--------+
|Num. |CAS |Baseline|Baseline|
|Tasks |with |with |without |
|(groups of 40) |ping |ping |ping |
+---------------+------+--------+--------+
| |Mean |Mean |Mean |
+---------------+------+--------+--------+
|1 | 0.97 | 0.97 | 0.68 |
|2 | 1.36 | 1.36 | 1.30 |
|4 | 2.57 | 2.57 | 1.84 |
|8 | 3.31 | 3.34 | 2.86 |
|16 | 5.63 | 5.71 | 4.61 |
|25 | 7.99 | 8.23 | 6.78 |
+---------------+------+--------+--------+
[*] Question (RFC part):
---------------------------------------------------------------------------
In the previous discussion of this patch, the threshold used to decide
whether a CPU is running low on capacity was calculated dynamically. In
the tests I have done, 80% seems to be a good threshold.
Would it be OK to choose a fixed cutoff?
Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation to a fixed cutoff, as
having global state can be avoided.
Previous discussion can be found at:
---------------------------------------------------------------------------
https://patchwork.kernel.org/patch/9741351/
Signed-off-by: Rohit Jain <rohit.k.jain(a)oracle.com>
---
kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 66 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e..3c26c13 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5298,6 +5298,11 @@ static unsigned long cpu_avg_load_per_task(int cpu)
return 0;
}
+static inline bool full_capacity(int cpu)
+{
+ return (capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10));
+}
+
static void record_wakee(struct task_struct *p)
{
/*
@@ -5516,9 +5521,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
{
unsigned long load, min_load = ULONG_MAX;
unsigned int min_exit_latency = UINT_MAX;
+ unsigned int backup_cap = 0;
u64 latest_idle_timestamp = 0;
int least_loaded_cpu = this_cpu;
int shallowest_idle_cpu = -1;
+ int shallowest_idle_cpu_backup = -1;
int i;
/* Check if we have any choice: */
@@ -5538,7 +5545,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
*/
min_exit_latency = idle->exit_latency;
latest_idle_timestamp = rq->idle_stamp;
- shallowest_idle_cpu = i;
+ if (full_capacity(i)) {
+ shallowest_idle_cpu = i;
+ } else if (capacity_of(i) > backup_cap) {
+ shallowest_idle_cpu_backup = i;
+ backup_cap = capacity_of(i);
+ }
} else if ((!idle || idle->exit_latency == min_exit_latency) &&
rq->idle_stamp > latest_idle_timestamp) {
/*
@@ -5547,7 +5559,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
* a warmer cache.
*/
latest_idle_timestamp = rq->idle_stamp;
- shallowest_idle_cpu = i;
+ if (full_capacity(i)) {
+ shallowest_idle_cpu = i;
+ } else if (capacity_of(i) > backup_cap) {
+ shallowest_idle_cpu_backup = i;
+ backup_cap = capacity_of(i);
+ }
}
} else if (shallowest_idle_cpu == -1) {
load = weighted_cpuload(i);
@@ -5558,7 +5575,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
}
}
- return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
+ if (shallowest_idle_cpu != -1)
+ return shallowest_idle_cpu;
+
+ return (shallowest_idle_cpu_backup != -1 ?
+ shallowest_idle_cpu_backup : least_loaded_cpu);
}
#ifdef CONFIG_SCHED_SMT
@@ -5620,7 +5641,9 @@ void __update_idle_core(struct rq *rq)
static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
{
struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
- int core, cpu;
+ int core, cpu, rcpu, rcpu_backup;
+ unsigned int backup_cap = 0;
+ rcpu = rcpu_backup = -1;
if (!static_branch_likely(&sched_smt_present))
return -1;
@@ -5637,10 +5660,20 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
cpumask_clear_cpu(cpu, cpus);
if (!idle_cpu(cpu))
idle = false;
+
+ if (full_capacity(cpu)) {
+ rcpu = cpu;
+ } else if ((rcpu == -1) && (capacity_of(cpu) > backup_cap)) {
+ backup_cap = capacity_of(cpu);
+ rcpu_backup = cpu;
+ }
}
- if (idle)
- return core;
+ if (idle) {
+ if (rcpu == -1)
+ return (rcpu_backup != -1 ? rcpu_backup : core);
+ return rcpu;
+ }
}
/*
@@ -5656,7 +5689,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
*/
static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
{
- int cpu;
+ int cpu, backup_cpu = -1;
+ unsigned int backup_cap = 0;
if (!static_branch_likely(&sched_smt_present))
return -1;
@@ -5664,11 +5698,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
for_each_cpu(cpu, cpu_smt_mask(target)) {
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
- if (idle_cpu(cpu))
- return cpu;
+ if (idle_cpu(cpu)) {
+ if (full_capacity(cpu))
+ return cpu;
+ if (capacity_of(cpu) > backup_cap) {
+ backup_cap = capacity_of(cpu);
+ backup_cpu = cpu;
+ }
+ }
}
- return -1;
+ return backup_cpu;
}
#else /* CONFIG_SCHED_SMT */
@@ -5697,6 +5737,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
u64 time, cost;
s64 delta;
int cpu, nr = INT_MAX;
+ int backup_cpu = -1;
+ unsigned int backup_cap = 0;
this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
if (!this_sd)
@@ -5727,10 +5769,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
return -1;
if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
continue;
- if (idle_cpu(cpu))
- break;
+ if (idle_cpu(cpu)) {
+ if (full_capacity(cpu)) {
+ backup_cpu = -1;
+ break;
+ } else if (capacity_of(cpu) > backup_cap) {
+ backup_cap = capacity_of(cpu);
+ backup_cpu = cpu;
+ }
+ }
}
+ if (backup_cpu >= 0)
+ cpu = backup_cpu;
time = local_clock() - time;
cost = this_sd->avg_scan_cost;
delta = (s64)(time - cost) / 8;
@@ -5747,13 +5798,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
struct sched_domain *sd;
int i;
- if (idle_cpu(target))
+ if (idle_cpu(target) && full_capacity(target))
return target;
/*
* If the previous cpu is cache affine and idle, don't be stupid.
*/
- if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+ if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev)
+ && full_capacity(prev))
return prev;
sd = rcu_dereference(per_cpu(sd_llc, target));
--
2.7.4
With Android UI and benchmarks the latency of cpufreq response to
certain scheduling events can become very critical. Currently, callbacks
into cpufreq governors are only made from the scheduler if the target
CPU of the event is the same as the current CPU. This means there are
certain situations where a target CPU may not run the cpufreq governor
for some time.
One testcase [1] to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that the new tasks should receive maximum
demand initially, this should result in CPU0 increasing frequency
immediately. But because of the above-mentioned limitation, this
does not occur.
This series updates the scheduler core to call the cpufreq callbacks for
remote CPUs as well and updates the registered hooks to handle that.
This is tested with a couple of use cases (Android: hackbench,
recentfling, galleryfling, vellamo; Ubuntu: hackbench) on the ARM Hikey
board (64 bit octa-core, single policy). Only galleryfling showed minor
improvements, while the others didn't show much deviation.
The reason is that this patch only targets a corner case, where the
following must all be true to improve performance, and that doesn't
happen too often with these tests:
- Task is migrated to another CPU.
- The task has high demand, and should take the target CPU to higher
OPPs.
- And the target CPU doesn't call into the cpufreq governor until the
next tick.
Rebased over: pm/linux-next
V4->V5:
- Drop cpu field from "struct update_util_data" and add it in "struct
sugov_cpu" instead.
- Can't have separate patches now because of the above change and so
merged all the patches from V4 into a single patch.
- Add a comment suggested by PeterZ.
- Commit log of 1/2 is improved to contain more details.
- A new patch (which was posted during V1) is also added to take care of
platforms where any CPU can do DVFS on behalf of any other CPU, even
if they are part of different cpufreq policies. This has been
requested by Saravana several times already and, as the series is
quite straightforward now, I decided to include it.
V3->V4:
- Respect iowait boost flag and util updates for the all remote
callbacks.
- Minor updates in commit log of 2/3.
V2->V3:
- Rearranged/merged patches as suggested by Rafael (looks much better
now)
- Also handle new hook added to intel-pstate driver.
- The final code remains the same as V2, except for the above hook.
V1->V2:
- Don't support remote callbacks for unshared cpufreq policies.
- Don't support remote callbacks where local CPU isn't part of the
target CPU's cpufreq policy.
- Dropped dvfs_possible_from_any_cpu flag.
--
viresh
[1] http://pastebin.com/7LkMSRxE
Viresh Kumar (2):
sched: cpufreq: Allow remote cpufreq callbacks
cpufreq: Process remote callbacks from any CPU if the platform permits
drivers/cpufreq/cpufreq-dt.c | 1 +
drivers/cpufreq/cpufreq_governor.c | 3 +++
drivers/cpufreq/intel_pstate.c | 8 ++++++++
include/linux/cpufreq.h | 23 +++++++++++++++++++++++
kernel/sched/cpufreq_schedutil.c | 31 ++++++++++++++++++++++++++-----
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 8 +++++---
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 10 ++--------
9 files changed, 70 insertions(+), 18 deletions(-)
--
2.13.0.71.gd7076ec9c9cb
Hello EAS-dev!
ARM is pleased to announce the EAS r1.3 release.
This is the next tick in our regular updates to EAS in AOSP, including documentation and testing updates.
In particular, this release is the first major update to EAS in Android Common Kernel 4.9.
Changes in EAS 1.3
* Validation on real devices and additional development boards (Hikey960)
* Increased test coverage
* Upstream schedutil backporting
* Schedutil is now the recommended CPUFreq governor
* General EAS refactoring improvements (find_best_target changes)
* android common kernel-4.9 brought to EAS equivalence with 4.4
Android Common Kernel 4.4:
https://android.googlesource.com/kernel/common/+/android-4.4
Android Common Kernel 4.9:
https://android-review.googlesource.com/#/c/444387/
Once merged into android-4.9, the gerrit web interface will tell you that the patches have been merged;
however, the changeset link should stay active.
Documentation:
https://developer.arm.com/-/media/developer/developers/open-source/energy-a…
Specifically about schedutil:
We have backported schedutil patches up to 38d4ea229d, which was included in v4.12.
(https://github.com/torvalds/linux/commit/38d4ea229d25d30be6bf41bcd6cd663a58…)
"cpufreq: schedutil: Trace frequency only if it has changed".
The version included in android-4.9 includes backported patches to the same level. This brings
schedutil in both versions of Android up to v4.12.
We have satisfied ourselves in testing that this version of schedutil works well enough to be used
in place of schedfreq both for performance and energy usage.
EAS Updates:
We have done a large refactoring of find_best_target as it was beginning to become difficult
to make further improvements without impacting other behaviors. The refactored version
has exactly the same behaviour in the refactor commit, and it has allowed us to further
refine the task of selecting a CPU during wakeup. We added the ability to return a second
target CPU from find_best_target, which is chosen using a different strategy. When the first
target is not allowable due to the energy/performance trade-off not being good enough,
we now check the alternative strategy as well (but only if the primary strategy fails).
A new tracepoint was added to help in understanding EAS task placement decisions -
sched_find_best_target - which traces the task, schedtune flags and CPUs which were
selected by find_best_target for energy evaluation.
More patches were added to improve system behavior with idle CPUs. We now prevent
an idle CPU from holding the system in overutilized mode (if it was overutilized just
before going into nohz mode), allowing EAS to handle task placements again sooner.
In addition, when misfit tasks are present, we bypass some of the normal nohz balance
rate-limiting to reduce the time needed for those tasks to be redistributed.
Finally, we added the ability for EAS to forecast the idle state which could potentially be
selected under the utilization conditions when calculating the energy for a particular
sched group. The forecast is intentionally simple as it is done during wakeup - we
reserve the deepest idle state for completely idle groups and otherwise linearly
map the group utilization to idle states. In previous versions, EAS used the current
idle state when estimating energy. This change allows EAS to see the potential
impact of moving the last task from one group to another and move tasks if appropriate.
Android-4.9:
android-4.9 has not yet had the same level of testing that android-4.4 has due
to us having a limited set of platforms which can run a 4.9 kernel version. For most of
this dev cycle we have only had access to Juno, and we have confirmed that our tests
behave the same on 4.9 as they do on 4.4. In the last week or two, the Hikey960 board
has gained a usable BSP for running android-4.9, so we have also been testing that
but it is too early to share those results.
We continue to develop EAS on AOSP in public. Please feel free to participate in
testing patches, reviewing code and generally being a good open-source citizen.
Best Regards,
Chris Redpath
Hi,
With Android UI and benchmarks the latency of cpufreq response to
certain scheduling events can become very critical. Currently, callbacks
into schedutil are only made from the scheduler if the target CPU of the
event is the same as the current CPU. This means there are certain
situations where a target CPU may not run schedutil for some time.
One testcase to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that new tasks should receive maximum demand
initially, this should result in CPU0 increasing frequency immediately.
Because of the above-mentioned limitation, though, this does not occur.
This is verified using ftrace with the sample [1] application.
Maybe the ideal solution is to always allow remote callbacks but that
has its own challenges:
o There is no protection required for single CPU per policy case today,
and adding any kind of locking there, to supply remote callbacks,
isn't really a good idea.
o If the local CPU isn't part of the same cpufreq policy as the target
CPU, then we wouldn't be able to do fast switching at all and would
have to use some kind of bottom half to schedule work on the target
CPU to do the real switching. That may be overkill as well.
And so this series only allows remote callbacks for target CPUs that
share the cpufreq policy with the local CPU.
This series is tested with a couple of use cases (Android: hackbench,
recentfling, galleryfling, vellamo; Ubuntu: hackbench) on the ARM Hikey
board (64 bit octa-core, single policy). Only galleryfling showed minor
improvements, while the others didn't show much deviation.
The reason is that this patchset only targets a corner case, where the
following must all be true to improve performance, and that doesn't
happen too often with these tests:
- Task is migrated to another CPU.
- The task has maximum demand initially, and should take the CPU to
higher OPPs.
- And the target CPU doesn't call into schedutil until the next tick.
V2->V3:
- Rearranged/merged patches as suggested by Rafael (looks much better
now)
- Also handle new hook added to intel-pstate driver.
- The final code remains the same as V2, except for the above hook.
V1->V2:
- Don't support remote callbacks for unshared cpufreq policies.
- Don't support remote callbacks where local CPU isn't part of the
target CPU's cpufreq policy.
- Dropped dvfs_possible_from_any_cpu flag.
--
viresh
[1] http://pastebin.com/7LkMSRxE
Viresh Kumar (3):
sched: cpufreq: Allow remote cpufreq callbacks
cpufreq: schedutil: Process remote callback for shared policies
cpufreq: governor: Process remote callback for shared policies
drivers/cpufreq/cpufreq_governor.c | 4 ++++
drivers/cpufreq/intel_pstate.c | 8 ++++++++
include/linux/sched/cpufreq.h | 1 +
kernel/sched/cpufreq.c | 1 +
kernel/sched/cpufreq_schedutil.c | 19 ++++++++++++++-----
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 8 +++++---
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 10 ++--------
9 files changed, 37 insertions(+), 18 deletions(-)
--
2.13.0.71.gd7076ec9c9cb
Hi,
I had some IRC discussions with Peter and V4 is based on his feedback.
Here is the diff between V3 and V4:
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index d64754fb912e..df9aa1ee53ff 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -79,6 +79,10 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
s64 delta_ns;
bool update;
+ /* Allow remote callbacks only on the CPUs sharing cpufreq policy */
+ if (!cpumask_test_cpu(smp_processor_id(), sg_policy->policy->cpus))
+ return false;
+
if (sg_policy->work_in_progress)
return false;
@@ -225,10 +229,6 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
unsigned int next_f;
bool busy;
- /* Remote callbacks aren't allowed for policies which aren't shared */
- if (smp_processor_id() != hook->cpu)
- return;
-
sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;
@@ -298,14 +298,9 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time,
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_util);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
- struct cpufreq_policy *policy = sg_policy->policy;
unsigned long util, max;
unsigned int next_f;
- /* Allow remote callbacks only on the CPUs sharing cpufreq policy */
- if (!cpumask_test_cpu(smp_processor_id(), policy->cpus))
- return;
-
sugov_get_util(&util, &max, hook->cpu);
raw_spin_lock(&sg_policy->update_lock);
-------------------------8<-------------------------
With Android UI and benchmarks the latency of cpufreq response to
certain scheduling events can become very critical. Currently, callbacks
into schedutil are only made from the scheduler if the target CPU of the
event is the same as the current CPU. This means there are certain
situations where a target CPU may not run schedutil for some time.
One testcase to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that new tasks should receive maximum demand
initially, this should result in CPU0 increasing frequency immediately.
Because of the above-mentioned limitation, though, this does not occur.
This is verified using ftrace with the sample [1] application.
Maybe the ideal solution is to always allow remote callbacks but that
has its own challenges:
o There is no protection required for single CPU per policy case today,
and adding any kind of locking there, to supply remote callbacks,
isn't really a good idea.
o If the local CPU isn't part of the same cpufreq policy as the target
CPU, then we wouldn't be able to do fast switching at all and would
have to use some kind of bottom half to schedule work on the target
CPU to do the real switching. That may be overkill as well.
And so this series only allows remote callbacks for target CPUs that
share the cpufreq policy with the local CPU.
This series is tested with a couple of use cases (Android: hackbench,
recentfling, galleryfling, vellamo; Ubuntu: hackbench) on the ARM Hikey
board (64 bit octa-core, single policy). Only galleryfling showed minor
improvements, while the others didn't show much deviation.
The reason is that this patchset only targets a corner case, where the
following must all be true to improve performance, and that doesn't
happen too often with these tests:
- Task is migrated to another CPU.
- The task has maximum demand initially, and should take the CPU to
higher OPPs.
- And the target CPU doesn't call into schedutil until the next tick.
V3->V4:
- Respect iowait boost flag and util updates for the all remote
callbacks.
- Minor updates in commit log of 2/3.
V2->V3:
- Rearranged/merged patches as suggested by Rafael (looks much better
now)
- Also handle new hook added to intel-pstate driver.
- The final code remains the same as V2, except for the above hook.
V1->V2:
- Don't support remote callbacks for unshared cpufreq policies.
- Don't support remote callbacks where local CPU isn't part of the
target CPU's cpufreq policy.
- Dropped dvfs_possible_from_any_cpu flag.
--
viresh
Viresh Kumar (3):
sched: cpufreq: Allow remote cpufreq callbacks
cpufreq: schedutil: Process remote callback for shared policies
cpufreq: governor: Process remote callback for shared policies
drivers/cpufreq/cpufreq_governor.c | 4 ++++
drivers/cpufreq/intel_pstate.c | 8 ++++++++
include/linux/sched/cpufreq.h | 1 +
kernel/sched/cpufreq.c | 1 +
kernel/sched/cpufreq_schedutil.c | 14 +++++++++-----
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 8 +++++---
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 10 ++--------
9 files changed, 32 insertions(+), 18 deletions(-)
--
2.13.0.71.gd7076ec9c9cb
Thanks guys for all the great info! I will take another look and see what I
can do now that I have a better idea of how to go about it. Once again,
it's appreciated that you guys are working out in the open. I know many
others that are also keeping up with this mailing list. It has been a great
learning experience.
Kind Regards,
Zachariah Kennedy
Good day!
I have been following EAS development for some time now. Currently, I have
implemented EAS in my own personal kernel for the OnePlus 3. It was largely
based on the work done for the Pixel, and I am happy to say that currently
I have gotten better performance and battery life compared to stock CAF
with HMP.
These questions will be based on the ACK android-4.4 branch
My first question is regarding tunings for EAS. I have seen many different
values thrown around for a while, but I was curious about what everyone close
to the project is using for schedutil up/down_rate_limit. Currently the
stock values are 1000 (for up and down). Is this still the case for those
testing the newest EAS changes using schedutil?
Also what about stune? I know stock pixel is using 50 for
top-app\schedtune.boost for interactions but that turns out to be overkill
with schedutil.
Lastly, I purchased the OnePlus 5 with the SD835 just so I can port EAS
to it as well. I am looking forward to testing how EAS scales with the
extra cores compared to the SD820/821. One main question
regarding the SD835: I wanted to see if anyone on the EAS-DEV list has
developed an energy model for the SD835 (MSM8998). Even if it is just
preliminary, I would appreciate any help with this. I do not have a proper
energy meter yet.
This is something I am truly interested in. I love the openness of all the
Devs close to this project. I have become a better developer having
participated and watching from the sidelines. Thanks guys for your hard
work.
Kind Regards,
Zachariah Kennedy
Hi,
Here is the second version of this series. The first [1] version was
sent several months back.
With Android UI and benchmarks the latency of cpufreq response to
certain scheduling events can become very critical. Currently, callbacks
into schedutil are only made from the scheduler if the target CPU of the
event is the same as the current CPU. This means there are certain
situations where a target CPU may not run schedutil for some time.
One testcase to show this behavior is where a task starts running on
CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
system is configured such that new tasks should receive maximum demand
initially, this should result in CPU0 increasing frequency immediately.
Because of the above-mentioned limitation, though, this does not occur.
This is verified using ftrace with the sample [2] application.
Maybe the ideal solution is to always allow remote callbacks but that
has its own challenges:
o There is no protection required for single CPU per policy case today,
and adding any kind of locking there, to supply remote callbacks,
isn't really a good idea.
o If the local CPU isn't part of the same cpufreq policy as the target
CPU, then we wouldn't be able to do fast switching at all and would
have to use some kind of bottom half to schedule work on the target
CPU to do the real switching. That may be overkill as well.
Taking the above challenges into consideration, this version proposes a
much simpler diff compared to the first version.
This series only allows remote callbacks for target CPUs that share the
cpufreq policy with the local CPU. Locking is mostly in place everywhere
and we wouldn't be required to change a lot of things.
This series is tested with a couple of use cases (Android: hackbench,
recentfling, galleryfling, vellamo; Ubuntu: hackbench) on the ARM Hikey
board (64 bit octa-core, single policy). Only galleryfling showed minor
improvements, while the others didn't show much deviation.
The reason is that this patchset only targets a corner case, where the
following must all be true to improve performance, and that doesn't
happen too often with these tests:
- Task is migrated to another CPU.
- The task has maximum demand initially, and should take the CPU to
higher OPPs.
- And the target CPU doesn't call into schedutil until the next tick.
V1->V2:
- Don't support remote callbacks for unshared cpufreq policies.
- Don't support remote callbacks where local CPU isn't part of the
target CPU's cpufreq policy.
- Dropped dvfs_possible_from_any_cpu flag.
--
viresh
[1] https://marc.info/?l=linux-pm&m=148906015927796&w=2
[2] http://pastebin.com/7LkMSRxE
Steve Muckle (1):
intel_pstate: Ignore scheduler cpufreq callbacks on remote CPUs
Viresh Kumar (3):
cpufreq: schedutil: Process remote callback for shared policies
cpufreq: governor: Process remote callback for shared policies
sched: cpufreq: Enable remote sched cpufreq callbacks
drivers/cpufreq/cpufreq_governor.c | 4 ++++
drivers/cpufreq/intel_pstate.c | 3 +++
include/linux/sched/cpufreq.h | 1 +
kernel/sched/cpufreq.c | 1 +
kernel/sched/cpufreq_schedutil.c | 19 ++++++++++++++-----
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 8 +++++---
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 10 ++--------
9 files changed, 32 insertions(+), 18 deletions(-)
--
2.13.0.71.gd7076ec9c9cb
### Basic Ideas ###
This patch set is rebased on EASv1.2 for power optimization on Hikey960.
The ARM big.LITTLE systems have many variants. Some platforms use the
same CPU architecture for multiple clusters, but each cluster has a
different manufacturing process (or clock design), so the clusters can
have different OPP settings; on this kind of system the 'LITTLE' and
'big' cores have the same architecture, but we can still get a power
benefit from the 'LITTLE' core because it has better power efficiency
than the 'big' core at the same OPP.
On the other hand, on this kind of system the 'LITTLE' core's power
efficiency usually doesn't differ hugely from the 'big' core's;
furthermore, the final CPU power saving percentage is discounted twice,
so when optimizing power for some scenarios the improvement may not be
as significant as expected; in other words, power optimization is not a
priority issue on these platforms.
Regarding the CPU power discounting for the whole system: the first
discount is related to the CPU duty cycle, the second to the SoC/board
baseline power. We can estimate the system-level CPU power saving
percentage with the formula below:
CPU power saving percentage: CPU_PS%
CPU duty cycle: CPU_DC%
The ratio between CPU power and whole-system power: CPU_SYS%
So the estimated power saving percentage is:
CPU_PS% * CPU_DC% * CPU_SYS%
Let's look at one example. We have two CA53 clusters; the 'LITTLE'
cluster has 30% better power efficiency than the 'big' cluster, so
CPU_PS% = 30%. Video playback (1080p) has a CPU duty cycle of
CPU_DC% = 30% (1 core), and the ratio between CPU power and system
power is CPU_SYS% = 15%. So by using a 'LITTLE' core instead of a 'big'
core we can save:
CPU_PS% * CPU_DC% * CPU_SYS% = 30% * 30% * 15% = 1.35%
Naturally a 1.35% saving does not look like a significant improvement;
but in some cases there is the concept of delta power, and comparing
against the delta power shows the importance of the saving. Using video
playback as the example, the delta power percentage (DP%) is:
DP% = (video_playback_power - home_screen_power) / video_playback_power
DP% is an important criterion for phone models when checking certain
scenarios against the Android idle system. If DP% = 15% and the power
saving percentage is 1.35%, then a 1.35% saving is meaningful when we
compare 1.35% vs 15%.
Another kind of big.LITTLE system has a big difference in power
efficiency. If we review the power efficiency on Hikey960, we can see
from the coefficients (mW/MHz) that in the worst case the CA73 is 6.2
times the CA53, and if we select the median OPPs as reference the CA73
is 2.42 times the CA53. So the highest saving is CPU_PS%(max) = 86%
and the median is CPU_PS%(median) = 70%. Using the formula above, in
theory we can save the following on Hikey960:
CPU_PS%(max) * CPU_DC% * CPU_SYS% = 86% * 30% * 15% = 3.87%
CPU_PS%(median) * CPU_DC% * CPU_SYS% = 70% * 30% * 15% = 3.15%
We can see the power saving percentages 3.87%/3.15% are significant
relative to DP% (15%). So on Hikey960, for scenarios with a high CPU
duty cycle and sustained power consumption, power optimization is
important.
### Implementations ###
Below is the detailed implementation of the optimizations:
a) Add back CPU selection based on power efficiency
EASv1.2 has the function find_best_target(); it mainly focuses on the
idlest CPU to reduce scheduling latency, but in some cases it misses
the most power-efficient CPU.
So patches 0001/0002 mainly add back CPU selection based on power
efficiency; we still keep the function find_best_target(), but it is
only used for the "boost" and "prefer_idle" cases, and the power
efficiency path is used for normal cases.
b) EAS core algorithm optimization
The EAS core algorithm should resolve the following problems:
1. Support more than two clusters;
2. Keep CPUs at the lowest possible OPP, and pack small tasks when
the system is idle;
3. Directly migrate a woken task to the best CPU to meet its
performance requirement; this means the task can be migrated to a
higher capacity CPU and vice versa;
4. Consistent results for energy calculation and a simple
implementation.
Patches 0003/0004 change CPU selection to a per-cluster basis; the
scheduler first selects candidate CPUs within every cluster, so each
cluster contributes either one candidate CPU or none if no CPU in the
cluster can meet the requirement. All energy difference calculations
then happen only among these candidate CPUs. This gives us several
benefits. The first is that the scheduler is no longer coupled to the
previous CPU; the old code always compared energy between the previous
CPU and a new possible CPU, but in some cases the previous CPU is
completely the wrong CPU for the task, so the comparison is actually
pointless. Applying patches 0003/0004 also introduces one side effect:
the task can be directly migrated from a lower capacity CPU to a
higher capacity CPU (LITTLE -> big). This usually did not happen in
the old code, because in the energy comparison the lower capacity CPU
could beat the higher capacity CPU, so the task missed the chance to
migrate to the higher capacity CPU.
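The per-cluster candidate scheme described above can be sketched as
follows; the structure, function names, and selection heuristic (most
spare capacity as the in-cluster candidate) are hypothetical stand-ins
for the real kernel code, not the actual patches:

```c
#include <limits.h>

#define NR_CLUSTERS      2
#define CPUS_PER_CLUSTER 4

/* Hypothetical per-CPU data, not the kernel's real structures. */
struct cpu_stat {
	int spare_cap;		/* capacity left after current load */
	int energy_cost;	/* estimated energy if the task runs here */
};

/*
 * First pick one candidate CPU per cluster (the one with the most
 * spare capacity that can fit the task), then compare energy only
 * among those candidates.  Note the task's previous CPU plays no
 * special role here, which is the point of patches 0003/0004.
 * Returns a global CPU index, or -1 if no CPU can fit the task.
 */
int select_energy_cpu(struct cpu_stat cpus[NR_CLUSTERS][CPUS_PER_CLUSTER],
		      int task_util)
{
	int best_cpu = -1, best_energy = INT_MAX;

	for (int c = 0; c < NR_CLUSTERS; c++) {
		int cand = -1;

		/* Candidate within this cluster: most spare capacity. */
		for (int i = 0; i < CPUS_PER_CLUSTER; i++) {
			if (cpus[c][i].spare_cap < task_util)
				continue;	/* cannot fit the task */
			if (cand < 0 ||
			    cpus[c][i].spare_cap > cpus[c][cand].spare_cap)
				cand = i;
		}
		if (cand < 0)
			continue;		/* cluster has no fit */

		/* Energy comparison only between cluster candidates. */
		if (cpus[c][cand].energy_cost < best_energy) {
			best_energy = cpus[c][cand].energy_cost;
			best_cpu = c * CPUS_PER_CLUSTER + cand;
		}
	}
	return best_cpu;
}
```

Because each cluster nominates its own candidate, a big-cluster CPU
can win directly even when the task previously ran on a LITTLE CPU,
matching the side effect described above.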
Patches 0005/0006 select the best CPU within a cluster. In the task
wakeup path, the EAS core algorithm is responsible for CPU selection;
it should achieve two targets: keep CPUs at the lowest possible OPP,
and spread tasks when we can predict that the OPP is likely to
increase after placing the woken task on a specific CPU. So patches
0005/0006 find the CPU with the lowest OPP and, among CPUs at that
same OPP, the highest utilization; this lets the EAS core algorithm
spread tasks when a CPU's OPP is increasing and pack tasks after a CPU
has dropped back to the lowest OPP.
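The within-cluster choice described above (lowest OPP first, highest
utilization among CPUs at that OPP) can be sketched like this; the
fields and values are illustrative, not the real kernel signals:

```c
/*
 * Within one cluster, prefer the CPU at the lowest OPP; among CPUs at
 * that same OPP, prefer the highest utilization so that small tasks
 * pack rather than spread.  The algorithm then spreads tasks only
 * when a placement would push the OPP up.
 */
struct cpu_cand {
	int opp;	/* current OPP index, lower means slower/cheaper */
	int util;	/* current utilization */
};

int pick_cluster_candidate(const struct cpu_cand *cpu, int nr)
{
	int best = -1;

	for (int i = 0; i < nr; i++) {
		if (best < 0 ||
		    cpu[i].opp < cpu[best].opp ||
		    (cpu[i].opp == cpu[best].opp &&
		     cpu[i].util > cpu[best].util))
			best = i;
	}
	return best;
}
```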
After applying patches 0003/0004, there are many more energy
comparisons between one big CPU and one LITTLE CPU. As a result, the
EAS core algorithm was observed to be fragile in some corner cases. So
patches 0007/0008/0009 make the energy calculation more robust;
patches 0008/0009 in particular introduce an extra signal,
"util_waken_avg". Using this signal we can remove the woken task's
contribution from the CPU's utilization, so all CPU signals are
cleaned of the woken task's stale utilization value.
Patch 0010 is a significant change to the EAS core algorithm; the main
idea is to change the energy calculation from CPU oriented to task
oriented. Based on the energy model we can easily answer the question:
if we place the task onto one specific CPU, how much power is consumed
by this task? So essentially we can calculate the energy the task
consumes on each specific CPU, learn the power consumption for every
possible CPU, and finally filter out which CPU saves the most power.
After changing to task oriented energy calculation, it is also more
natural to generate the perf idx and energy idx on a task oriented
rather than CPU oriented basis, so hopefully this can benefit the
schedTune PE filter as well.
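A rough sketch of the task-oriented idea, under the assumption that a
task's energy on a CPU is its share of that CPU's busy power at the
operating point its placement would trigger; the helper and its inputs
are hypothetical, not the actual patch 0010 code:

```c
/*
 * Estimate the energy the task itself would consume if placed on a
 * given CPU: the task's fraction of the CPU's busy power at the
 * (possibly raised) OPP after placement.  Comparing this value across
 * candidate CPUs answers "how much power does this task cost on each
 * CPU?".  All values are in abstract units.
 */
int task_energy_on_cpu(int task_util, int cpu_util, int cpu_cap,
		       int busy_power_at_new_opp)
{
	int new_util = cpu_util + task_util;

	if (new_util > cpu_cap)
		new_util = cpu_cap;	/* utilization saturates at capacity */

	/* Task's share of busy power at the new operating point. */
	return busy_power_at_new_opp * task_util / new_util;
}
```

The best CPU for the task is then simply the one minimizing this
per-task cost, rather than the one minimizing a whole-CPU energy
delta.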
c) Tipping point optimization
The power saving optimization mainly focuses on deferring the system
tipping point so the energy aware path can be enabled in most cases;
but deferring the tipping point also hurts performance cases if the
system cannot get over the tipping point in overloaded scenarios (like
benchmarks). So the target is: optimize power without performance
regression.
Patch 0011 is Thara's patch v1 "Per Sched domain over utilization";
the patch gives a good method for storing the per sched domain flag. I
tweaked it with the following criteria for overutilization:
1. If a single CPU is at more than 80% utilization, set the lowest
level sched domain as 'overutilized'; this is the tipping point for
the 'inner overutilized' flag.
2. If any CPU has a 'misfit' task, or the cluster's overall
utilization is > 80% of the cluster's overall capacity, set the
parent level sched domain as 'overutilized'; this is the tipping
point for the 'outer overutilized' flag.
3. If the overall utilization is > 50% of all CPUs' overall capacity,
set the root domain's 'overutilized' flag. The 50% is actually a
quite high bar; e.g. with two clusters it means the overall
utilization is beyond the middle of the two clusters' combined
capacity, i.e. it has completely exceeded one cluster's capacity,
so we hit the 'global' tipping point and spread tasks across the
two clusters.
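The three criteria can be sketched as simple predicates; the 80%/50%
thresholds come from the description above, while the function names
and plain-integer arithmetic are hypothetical simplifications of the
kernel's fixed-point code:

```c
#include <stdbool.h>

/* Inner tipping point: a single CPU above 80% of its capacity. */
bool cpu_overutilized(unsigned long util, unsigned long cap)
{
	return util * 100 > cap * 80;
}

/*
 * Outer tipping point: any misfit task, or the cluster's overall
 * utilization above 80% of the cluster's overall capacity.
 */
bool cluster_overutilized(unsigned long cl_util, unsigned long cl_cap,
			  bool has_misfit_task)
{
	return has_misfit_task || cl_util * 100 > cl_cap * 80;
}

/* Global tipping point: overall utilization above 50% of all capacity. */
bool system_overutilized(unsigned long total_util, unsigned long total_cap)
{
	return total_util * 100 > total_cap * 50;
}
```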
So with the 'per sched domain flag' we can defer the 'global' tipping
point and rely on it as a switch for the energy aware path. Patch 0011
moves the energy aware function to the beginning of the wakeup path;
this gives the function energy_aware_wake_cpu() more chances to
execute while the system is under the tipping point. Only when the
system is over the tipping point does it fall back to the traditional
wakeup balance to select the idlest CPU.
### Testing result ###
On Hikey960, below is the testing environment:
- Android AOSP kernel 4.4
https://android.googlesource.com/kernel/hikey-linaro
branch: android-hikey-linaro-4.4
- CPUFreq governor: sched-freq
- Fixed DDR: 400MHz
- Fixed GPU: 533MHz
- HDMI: unplugged
- WIFI: disabled
Please note, the video playback (1080p) uses a software codec with the
VLC player on Android; camera recording uses the synthesized workload
camera-long.json to simulate the camera scenario.
Test_Case         Referenced_Phone  PELT_Optimized  PELT_Optimized  WALT_Optimized  WALT_Optimized
                  (mW) [*]          (mW)            (Percentage)    (mW)            (Percentage)
homescreen        800               -5.05 [**]      -0.63%          -10.46          -1.31%
Audio (MP3)       200 (LCD OFF)     5.33            2.66%           60.62           30.31%
Video (1080p)     1000              133.09          13.31%          26.10           2.61%
Camera Recording  2000              163.94          8.20%           -79.57 [***]    -3.98%
[*] The reference phone is not any specific phone model; these are
only very rough power data for well optimized commercial phones,
based on old experience, so they are not very precise. The power
data is measured at the battery measurement point at 4.2V.
[**] Positive value: power reduced by this patch set
Negative value: power increased by this patch set
[***] Camera recording + WALT power data is much worse with this patch
set; this is explained in the "Conclusion" section.
Testing raw data: http://people.linaro.org/~leo.yan/eas_upstream/hikey960_result/
### Conclusion ###
Firstly, Hikey960 is a good candidate platform for verifying power
saving optimization :)
This patch set with the PELT signal has good results on Hikey960,
especially for the video playback (saving 133.09mW) and camera
recording (saving 163.94mW) cases. For audio playback it saves 5.33mW;
for homescreen there is a slight regression (an increase of 5.05mW). I
suspect this is related to task packing on the LITTLE cores, but it
needs investigation.
This patch set with the WALT signal has good results for audio
playback and video playback, but it is broken for the camera recording
case. After reviewing the trace log, the main issue is that many
tasks' WALT signals reach into the range 100~200, so there are many
comparisons between the LITTLE CPU at 1844MHz and the big CPU at
903MHz. From the power model parameters, the big CPU at 903MHz has
lower power efficiency than the LITTLE CPU at 1844MHz, so tasks are
migrated onto big cores frequently. Compared to the WALT signal, the
PELT signal works well with the power model parameters, so we can see
the energy aware algorithm avoids easily migrating tasks to big cores.
(This seems to me to raise the question: which signal best matches the
EAS core algorithm?)
Some known issues:
- The CPUFreq governor impacts power consumption a lot; sched-freq
easily reaches 1844MHz, so we need to check whether there is a
mechanism to optimize the policy to reduce the chance of setting
1844MHz. Another test is to use other governors: schedutil,
interactive.
- RT threads are not energy aware yet, so they are migrated to big
cores;
- The load balancing flow has no energy aware optimization;
- The DDR frequency is currently fixed; if DDR frequency scaling is
enabled, the power model will change significantly. This needs a
devfreq driver for DDR, and the power model tuned accordingly.
- Though these patches have been verified on Juno to have no harm to
performance, a performance comparison on Hikey960 is still needed.
Leo Yan (12):
sched/fair: add function find_nrg_efficient_target()
sched/fair: enable energy efficiency selection
sched/fair: use new function to select CPU from sched group
sched/fair: select candidate CPUs by cluster basis
sched/fair: refine find_new_capacity()
sched/fair: optimize CPU selection with lowest OPP
sched/fair: increase resolution for energy calculation
sched/fair: introduce signal util_waken_avg for CPU
sched/fair: select idle CPU as backup for waken up path
sched/fair: task oriented energy calculation
sched/fair: update idle CPU blocked load in update_sg_lb_stats()
sched/fair: add trace event for sched group energy
Thara Gopinath (1):
Per Sched domain over utilization
include/linux/sched.h | 2 +-
include/trace/events/sched.h | 45 ++++
kernel/sched/fair.c | 593 +++++++++++++++++++++++++++++++------------
kernel/sched/sched.h | 1 +
4 files changed, 484 insertions(+), 157 deletions(-)
mode change 100644 => 100755 include/trace/events/sched.h
mode change 100644 => 100755 kernel/sched/fair.c
--
1.9.1
Hello eas-dev, I wanted to give you all an update on EAS product codeline development for Android.
As you may have noticed, the Android Common Kernel branch android-4.4 (https://android.googlesource.com/kernel/common.git/+/android-4.4) now has the EAS product codeline merged. All the patches in there have been validated on a big.LITTLE device. Some of the more experimental patches which were part of the EASr1.2 stack did not make the cut yet, but we will continue to develop them.
All development for the Android Common Kernel will be done in the open so that interested people can see the code, pull the patches and participate in code reviews on the AOSP Gerrit. Development is expected to be continuous - we plan a number of enhancements and upstream backports over the coming months. The main focus for EAS development on the product codeline is to have a kernel which has good performance and efficiency on big.LITTLE devices running Android. As part of this we aim to reduce the delta to mainline as much as possible while of course maintaining any differences necessary for mobile devices.
EAS is intended to be an upstream technology eventually and as we upstream various components we will be backporting them to the product codeline where suitable.
We intend not only to push patches which we consider ready (for code review) but also more ‘experimental’ patches for things that we are working on, an example is here:
https://android-review.googlesource.com/#/c/411501/
This is a stack consisting of 11 patches. The first 6 patches make some changes to the performance index filtering in schedtune, and have a topic of 'fix_performance_index'. On top of those are 2 patches with a topic of 'small_optimizations' which are some optimizations which can be done to existing code. Finally there are 3 patches with a topic of 'experimental_utilest' which is a backport of the current mainline-focussed util_est solution.
Util_est is a filtered version of the existing PELT signals, intended to address some of the generally acknowledged responsiveness issues PELT has when compared to WALT. There are multiple mainline ideas about making PELT more suitable for the uses we have; util_est is not the only one and may not be attractive to maintainers, but it is included here to help evaluate the signal characteristics.
Where we have multiple patches which make sense to review together, we will send them as a single set and set the topics as appropriate. The review patch stacks will be structured as less controversial changes at the start, and more experimental/risky patches at the end - it's perfectly possible that some will be merged and not others.
As we push patches for review, we intend to post here announcing them.
Since we will now be doing continuous development in the open, the EAS releases (EASr1.3 etc.) become documentation points where we describe the features which are merged and are expected to be merged soon. The development model for android is to have frequent merges of the common kernel, and we feel this fits.
For EAS product codeline testing, we are starting to use Hikey960 as it has big.LITTLE (4xA73+4xA53). Note that we find it essential to add a heatsink and a fan, so firmware thermal capping doesn’t affect EAS performance. We’ve been using a 14mm x 14mm heatsink on the SoC:
http://uk.farnell.com/abl-heatsinks/bga-std-015/heat-sink-bga-standard-26-5…
with a 5v fan blowing on it. We are using Baylibre ACME for power measurement, which is supported in LISA (https://github.com/ARM-software/lisa) which we're using for development testing and trace analysis.
The current (21-June-2017) Hikey960 status is that cpufreq/cpuidle is working using the ARM Trusted Firmware. There is an open issue relating to OPP switch time on the little cluster which is being investigated. We are also engaged with Leo in validating the energy model.
For Android Common Kernel 4.9 (https://android.googlesource.com/kernel/common.git/+/android-4.9) we have a first pass of a patch stack bringing it up to parity with android-4.4 which is being tested on Juno and original hikey. This will be posted for review against the android-4.9 branch and is a change to the existing EAS code that’s there. The tests are passing on Juno, but please note there is significantly less testing than for kernel 4.4
We very much welcome broad participation in the future direction of EAS for products - to participate in EAS product codeline development, please ensure you are registered for a googlesource Gerrit userid which you can get at https://android.googlesource.com/new-password
Thanks,
Chris