Currently the energy calculation in EAS does not consider RT pressure, so it
is quite possible to select a CPU with high RT pressure for CFS tasks and
keep accumulating utilization on it; as a result, other CPUs with low RT
pressure lose the chance to run CFS tasks and to reduce contention between
CFS and RT tasks, which is not optimal from a performance point of view.
It also hurts power, because packing RT and CFS tasks onto a single CPU
makes it more likely that the CPU frequency will be increased.
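As a rough illustration of the direction (a sketch only, not the code in
these patches), the utilization used for CPU selection and energy estimation
should account for RT/DL pressure on top of the CFS utilization, so that
CPUs squeezed by RT activity look correspondingly less attractive. The
cpu_util_rt()/cpu_util_dl() helper names below are placeholders:

/*
 * Sketch of a summed-utilization helper in the spirit of the
 * cpu_util_sum() introduced by patch 2/4; cpu_util_rt()/cpu_util_dl()
 * stand for "utilization consumed by RT/DL" and are not necessarily
 * the helpers used in the actual patches.
 */
static inline unsigned long __cpu_util_sum(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long util = cpu_util(cpu);	/* CFS utilization */

	util += cpu_util_rt(rq);		/* RT pressure */
	util += cpu_util_dl(rq);		/* DL pressure */

	/* Clamp to the CPU's original capacity */
	return min(util, capacity_orig_of(cpu));
}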
We can measure the summed CPU utilization and the CPU frequency, and use the
standard deviation across the CPUs of a cluster to check whether tasks are
spread well for a medium workload. Below is the comparison for video playback
on Hikey960, before and after applying this patch set (using the schedutil
CPUFreq governor):
                  Without Patch Set             |            With Patch Set
 CPU   Min(Util)  Mean(Util)  Max(Util)         |  Min(Util)  Mean(Util)  Max(Util)
  0        7          67         205            |      8          52         170
  1        4          53         227            |      9          47         188
  2        4          57         191            |      8          38         192
  3        4          35         165            |     16          47         146
 s.d.      1.5        13.3        25.9          |      3.9         5.83       20.9
  4        0          35         160            |     10          34         129
  5        0          24         129            |      0          30         115
  6        0          18         123            |      0          18          95
  7        0          12          84            |      0          21          73
 s.d.      0           9.8        31.2          |      5           7.5        24.4
The standard deviation of the mean CPU utilization decreases after applying
this patch set (LITTLE cluster: 13.3 vs 5.83, big cluster: 9.8 vs 7.5).
This is also confirmed by the average CPU frequency:
                 Without Patch Set   |   With Patch Set
                 Average Frequency   |   Average Frequency
 LITTLE cluster       737MHz         |        646MHz
 big cluster          916MHz         |        922MHz
Leo Yan (4):
sched/fair: Select maximum spare capacity for idle candidate CPUs
sched: Introduce cpu_util_sum()/__cpu_util_sum() functions
sched/fair: Consider RT pressure for find_best_target()
sched/fair: Consider RT/DL pressure for energy calculation
kernel/sched/fair.c | 22 +++++++++++++++++++---
kernel/sched/sched.h | 29 +++++++++++++++++++++++++++++
2 files changed, 48 insertions(+), 3 deletions(-)
--
1.9.1
Here are some patches that are generally minor changes, posted together.
Patches 1/5 and 2/5 are related to skipping cpufreq updates on the dequeue of
the last task before the CPU enters idle; that part is mostly just a rebase
of [1]. Patches 3/5 and 4/5 fix some minor things I noticed after the remote
cpufreq update work, and patch 5/5 is a small clean up of find_idlest_group.
Let me know your thoughts, and thanks. I've based these patches on peterz's
queue.git master branch.
[1] https://patchwork.kernel.org/patch/9936555/
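As background for patch 3/5: schedutil decides whether a CPU has been busy
by comparing its idle_calls counter against a previously saved value, and
with remote updates that counter has to come from the CPU being updated
rather than the CPU running the update. A simplified reconstruction of that
check, assuming the series adds a per-CPU tick_nohz_get_idle_calls_cpu()
variant (the exact posted code may differ):

/*
 * Simplified reconstruction of the schedutil busy check touched by
 * patch 3/5: read the idle_calls counter of the CPU being updated
 * (sg_cpu->cpu), not of the CPU performing the update.
 */
static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
{
	unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
	bool not_idle = idle_calls == sg_cpu->saved_idle_calls;

	sg_cpu->saved_idle_calls = idle_calls;
	return not_idle;
}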
Joel Fernandes (5):
Revert "sched/fair: Drop always true parameter of
update_cfs_rq_load_avg()"
sched/fair: Skip frequency update if CPU about to idle
cpufreq: schedutil: Use idle_calls counter of the remote CPU
sched/fair: Correct obsolete comment about cpufreq_update_util
sched/fair: remove impossible condition from find_idlest_group_cpu
include/linux/tick.h | 1 +
kernel/sched/cpufreq_schedutil.c | 2 +-
kernel/sched/fair.c | 44 ++++++++++++++++++++++++++++------------
kernel/sched/sched.h | 1 +
kernel/time/tick-sched.c | 13 ++++++++++++
5 files changed, 47 insertions(+), 14 deletions(-)
--
2.15.0.rc2.357.g7e34df9404-goog
capacity_spare_wake in the slow path influences the choice of the idlest
group, as we search for the group with maximum spare capacity. In scenarios
where RT pressure is high, a suboptimal group can be chosen, hurting the
performance of the task being woken up.
Several tests with results are included below to show improvements with
this change.
1) Hackbench on Pixel 2 Android device (4x4 ARM64 Octa core)
------------------------------------------------------------
Here we have RT activity running on the big CPU cluster, induced with rt-app,
with hackbench running in parallel. The RT tasks are bound to the 4 CPUs of
the big cluster (CPUs 4, 5, 6, 7) and have a 100ms period with runtime=20ms,
sleep=80ms.
Hackbench shows a big improvement (30.7%) when the number of tasks is 8, and
11.6% when it is 32. Note: data is completion time in seconds (lower is better).
The number of loops is 50000 for 8 and 16 tasks, and 20000 for 32 tasks.
+--------+-----+-------+-------------------+---------------------------+
| groups | fds | tasks | Without Patch | With Patch |
+--------+-----+-------+---------+---------+-----------------+---------+
| | | | Mean | Stdev | Mean | Stdev |
+--------+-----+-------+---------+---------+-----------------+---------+
| 1 | 8 | 8 | 1.0534 | 0.13722 | 0.7293 (+30.7%) | 0.02653 |
| 2 | 8 | 16 | 1.6219 | 0.16631 | 1.6391 (-1%) | 0.24001 |
| 4 | 8 | 32 | 1.2538 | 0.13086 | 1.1080 (+11.6%) | 0.16201 |
+--------+-----+-------+---------+---------+-----------------+---------+
2) Rohit ran the barrier.c test (details below) with the following improvements:
------------------------------------------------------------------------
This was Rohit's original use case for the patch he posted at [1]; however,
his recent tests show that this patch can replace his slow-path changes [1],
and there is no need to selectively scan/skip CPUs in find_idlest_group_cpu
in the slow path to get the improvement he sees.
barrier.c (OpenMP code) is used as a micro-benchmark. It does a number of
iterations with a barrier sync at the end of each for loop.
Here barrier.c runs along with ping on CPUs 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'
barrier.c can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html
Following are the results in iterations per second with this micro-benchmark
(higher is better), on a 2-socket, 44-core, 88-thread Intel x86 machine:
+--------+------------------+---------------------------+
|Threads | Without patch | With patch |
| | | |
+--------+--------+---------+-----------------+---------+
| | Mean | Std Dev | Mean | Std Dev |
+--------+--------+---------+-----------------+---------+
|1 | 539.36 | 60.16 | 572.54 (+6.15%) | 40.95 |
|2 | 481.01 | 19.32 | 530.64 (+10.32%)| 56.16 |
|4 | 474.78 | 22.28 | 479.46 (+0.99%) | 18.89 |
|8 | 450.06 | 24.91 | 447.82 (-0.50%) | 12.36 |
|16 | 436.99 | 22.57 | 441.88 (+1.12%) | 7.39 |
|32 | 388.28 | 55.59 | 429.4 (+10.59%)| 31.14 |
|64 | 314.62 | 6.33 | 311.81 (-0.89%) | 11.99 |
+--------+--------+---------+-----------------+---------+
3) ping+hackbench test on a bare-metal server (Rohit ran this test)
----------------------------------------------------------------
Here hackbench runs in threaded mode, along with ping on CPUs 0 and 1 as:
'ping -l 10000 -q -s 10 -f hostX'
This test runs on a 2-socket, 20-core, 40-thread Intel x86 machine.
The number of loops is 10000 and the runtime is in seconds (lower is better).
+--------------+-----------------+--------------------------+
|Task Groups | Without patch | With patch |
| +-------+---------+----------------+---------+
|(Groups of 40)| Mean | Std Dev | Mean | Std Dev |
+--------------+-------+---------+----------------+---------+
|1 | 0.851 | 0.007 | 0.828 (+2.77%)| 0.032 |
|2 | 1.083 | 0.203 | 1.087 (-0.37%)| 0.246 |
|4 | 1.601 | 0.051 | 1.611 (-0.62%)| 0.055 |
|8 | 2.837 | 0.060 | 2.827 (+0.35%)| 0.031 |
|16 | 5.139 | 0.133 | 5.107 (+0.63%)| 0.085 |
|25 | 7.569 | 0.142 | 7.503 (+0.88%)| 0.143 |
+--------------+-------+---------+----------------+---------+
[1] https://patchwork.kernel.org/patch/9991635/
Cc: Dietmar Eggemann <dietmar.eggemann(a)arm.com>
Cc: Vincent Guittot <vincent.guittot(a)linaro.org>
Cc: Morten Rasmussen <morten.rasmussen(a)arm.com>
Cc: Brendan Jackman <brendan.jackman(a)arm.com>
Cc: Matt Fleming <matt(a)codeblueprint.co.uk>
Tested-by: Rohit Jain <rohit.k.jain(a)oracle.com>
Signed-off-by: Joel Fernandes <joelaf(a)google.com>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 740602ce799f..487e485b3560 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5742,7 +5742,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p);
static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
{
- return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+ return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);
}
/*
--
2.15.0.rc2.357.g7e34df9404-goog
Hi,
I tried an experiment this weekend - basically I have RT threads bound
to the big CPUs running a fixed-period load, with hackbench running with
all CPUs allowed. The system is a Pixel 2, an ARM big.LITTLE 8-core (4x4).
Basically, I changed capacity_orig_of to capacity_of in capacity_spare_wake
and wake_cap, and I see a good performance improvement. That makes sense,
because wake_cap then sends the task wake-up to the slow path if RT activity
is eating into the CFS capacity on the prev/current CPU, and
capacity_spare_wake finds a better group, with the spare capacity reduced by
the RT pressure.
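For reference, here is roughly what the wake_cap side of that experiment
looks like (my reconstruction against a ~v4.14 fair.c, not a posted patch);
the only change is reading capacity_of() instead of capacity_orig_of():

/*
 * Reconstruction of the wake_cap() experiment described above, not a
 * posted patch: capacity_of() reflects the capacity left after RT/IRQ
 * pressure, while capacity_orig_of() does not.
 */
static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
{
	long min_cap, max_cap;

	min_cap = min(capacity_of(prev_cpu), capacity_of(cpu));
	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;

	/* Minimum capacity is close to max, no need to abort wake_affine */
	if (max_cap - min_cap < max_cap >> 3)
		return 0;

	/* Bring task utilization in sync with prev_cpu */
	sync_entity_load_avg(&p->se);

	return min_cap * 1024 < task_util(p) * capacity_margin;
}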
One concern I had with such a change to wake_cap was that it might affect
upstream cases that may still want to do a select_idle_sibling even when the
capacity on the previous/waker's CPU is not enough after deducting RT
pressure. In that case, I think the wake_cap change to use capacity_of would
cause those cases to enter the slow path.
Could you let me know your thoughts about such a change? I heard that
capacity_of was attempted before and there might be some cases to consider.
Is there anything from your previous experience with this change that you
could share? At least for capacity_spare_wake, the improvements seem
worthwhile, and dramatic in some cases. I also have some more changes in
mind for find_idlest_group, but I wanted to start a discussion on the spare
capacity idea first.
This is related to Rohit's work on RT capacity awareness; I was talking to
him and we were discussing ideas on the implementation.
thanks,
- Joel
The blocked load and shares of root cfs_rqs are currently only
updated by the CPU owning the rq. That means that if a CPU goes
suddenly from being busy to totally idle, its load and shares are
not updated.
Schedutil works around this problem by ignoring the util of CPUs
that were last updated more than a tick ago. However, the stale
load does impact task placement: code paths that look at load and
util (in particular the slow path of select_task_rq_fair) can
leave the idle CPUs unused while other CPUs become unnecessarily
overloaded. Furthermore, the stale shares can impact CPU time
allotment.
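For reference, the schedutil workaround mentioned above looks roughly like
this (a trimmed reconstruction with the iowait boost handling omitted; it
may not match upstream byte for byte):

/*
 * Trimmed reconstruction of the existing schedutil behaviour referred
 * to above: when aggregating utilization across a shared frequency
 * domain, skip any CPU whose last update is older than one tick, since
 * it has probably gone idle and its utilization may be stale.
 */
static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
{
	struct sugov_policy *sg_policy = sg_cpu->sg_policy;
	struct cpufreq_policy *policy = sg_policy->policy;
	unsigned long util = 0, max = 1;
	unsigned int j;

	for_each_cpu(j, policy->cpus) {
		struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
		unsigned long j_util, j_max;
		s64 delta_ns = time - j_sg_cpu->last_update;

		/* Ignore CPUs not updated within the last tick: likely idle */
		if (delta_ns > TICK_NSEC)
			continue;

		j_util = j_sg_cpu->util;
		j_max = j_sg_cpu->max;
		if (j_util * max > j_max * util) {
			util = j_util;
			max = j_max;
		}
	}

	return get_next_freq(sg_policy, util, max);
}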
Two complementary solutions are proposed here:
1. When a task wakes up, if necessary an idle CPU is woken as if to
perform a NOHZ idle balance, which is then aborted once the load
of NOHZ idle CPUs has been updated. This solves the problem but
brings with it extra CPU wakeups, which have an energy cost.
2. During newly-idle load balancing, the load of remote nohz-idle
CPUs in the sched_domain is updated. When all of the idle CPUs
were updated in that step, the nohz.next_update field
is pushed further into the future. This field is used to determine
the need for triggering the newly-added NOHZ kick. So if such
newly-idle balances are happening often enough, no additional CPU
wakeups are required to keep all the CPUs' loads updated (a rough
sketch of this path follows below).
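To make solution (2) a bit more concrete, here is a rough sketch (not the
posted patches; update_nohz_stats() is a placeholder name for "refresh this
remote CPU's blocked load", and nohz.next_update is taken from the
description above):

/*
 * Rough sketch of solution (2): during a newly-idle balance, refresh
 * the blocked load of the other nohz-idle CPUs; if all of them were
 * refreshed, push nohz.next_update out so the NOHZ kick from solution
 * (1) is not needed for a while. update_nohz_stats() is a placeholder.
 */
static void nohz_newidle_balance(struct rq *this_rq)
{
	unsigned long next = jiffies + msecs_to_jiffies(LOAD_AVG_PERIOD);
	bool all_done = true;
	int cpu;

	for_each_cpu(cpu, nohz.idle_cpus_mask) {
		if (cpu == this_rq->cpu)
			continue;

		/* Update the remote CPU's blocked load and shares */
		if (!update_nohz_stats(cpu_rq(cpu)))
			all_done = false;
	}

	/* Everything is fresh: defer the next forced NOHZ update */
	if (all_done)
		nohz.next_update = next;
}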
[eas-dev] Patch 2/3 here is to highlight a change I made from
Vincent's original patch, so that it can be reviewed more
easily - if the modification is accepted then I'll squash
it before posting this to LKML proper.
Brendan Jackman (2):
sched/fair: Refactor nohz blocked load updates
sched/fair: Update blocked load from newly idle balance
Vincent Guittot (1):
sched: force update of blocked load of idle cpus
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 106 ++++++++++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 2 +
3 files changed, 96 insertions(+), 13 deletions(-)
--
2.14.1
Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation so that having global state
can be avoided.
v2->v3:
* Split up the patch for find_idlest_cpu and select_idle_sibling code
paths.
v3->v4:
* Rebased it to peterz's tree (apologies for wrong tree for v3)
v4->v5:
* Changed the threshold to 768 from 819 for easier shifts
* Changed the find_idlest_cpu code path to be simpler
* Changed the select_idle_core code path to search for
idlest+full_capacity core
* Added scaled capacity awareness to wake_affine_idle code path
---------------------------------------------------------------------------
During OLTP workload runs, threads can end up on CPUs with a lot of
softIRQ activity, thus delaying progress. For more reliable and
faster runs, if the system can spare it, these threads should be
scheduled on CPUs with lower IRQ/RT activity.
Currently, the scheduler takes into account the original capacity of
CPUs when providing 'hints' for the select_idle_sibling code path to return
an idle CPU. However, the rest of the select_idle_* code paths remain
capacity agnostic. Further, these code paths are only aware of the
original capacity and not of the capacity stolen by IRQ/RT activity.
This patch set introduces capacity awareness in the scheduler (CAS), which
avoids CPUs that might have their capacities reduced (due to IRQ/RT activity)
when trying to schedule threads (on the push side) in the system. This
awareness has been added to the fair scheduling class.
It does so using the following algorithm (a rough sketch in code follows
the list):
1) The scaled capacities are already calculated via the rt_avg accounting.
2) Any CPU which is running below 80% capacity is considered running low
on capacity.
3) During idle CPU search if a CPU is found running low on capacity, it
is skipped if better CPUs are available.
4) If none of the CPUs are better in terms of idleness and capacity, then
the low-capacity CPU is considered to be the best available CPU.
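A minimal sketch of the capacity test described above (illustrative only,
not the posted patches; the 768 threshold comes from the v4->v5 changelog
entry):

/*
 * Illustrative sketch of the "low on capacity" test, not the posted
 * patches: a CPU has full capacity when the capacity left after IRQ/RT
 * time (capacity_of()) is at least 768/1024 of its original capacity.
 */
#define CAPACITY_THRESHOLD	768	/* out of SCHED_CAPACITY_SCALE (1024) */

static inline bool full_capacity(int cpu)
{
	return capacity_of(cpu) * SCHED_CAPACITY_SCALE >=
	       capacity_orig_of(cpu) * CAPACITY_THRESHOLD;
}

During the idle-CPU search, a CPU failing this test is only picked when no
better (idle and full-capacity) candidate is found.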
The performance numbers:
---------------------------------------------------------------------------
CAS shows up to 1.5% improvement on x86 when running a 'SELECT' database
workload.
For the microbenchmark results, I used hackbench in process mode, along
with ping running on CPUs 0, 1 and 2 as:
'ping -l 10000 -q -s 10 -f hostX'
The results below should be read as:
* 'Baseline without ping' is how the workload would've behaved if there
was no IRQ activity.
* Compare 'Baseline with ping' and 'Baseline without ping' to see the
effect of ping
* Compare 'Baseline with ping' and 'CAS with ping' to see the improvement
CAS can give over baseline
Following are the runtime(s) with hackbench and ping activity as
described above (lower is better), on a 44 core 2 socket x86 machine:
+---------------+------+--------+--------+
|Num. |CAS |Baseline|Baseline|
|Tasks |with |with |without |
|(groups of 40) |ping |ping |ping |
+---------------+------+--------+--------+
| |Mean |Mean |Mean |
+---------------+------+--------+--------+
|1 | 0.55 | 0.59 | 0.53 |
|2 | 0.66 | 0.81 | 0.51 |
|4 | 0.99 | 1.16 | 0.95 |
|8 | 1.92 | 1.93 | 1.88 |
|16 | 3.24 | 3.26 | 3.15 |
|32 | 5.93 | 5.98 | 5.68 |
|64 | 11.55| 11.94 | 10.89 |
+---------------+------+--------+--------+
Rohit Jain (3):
sched/fair: Introduce scaled capacity awareness in find_idlest_cpu
code path
sched/fair: Introduce scaled capacity awareness in select_idle_sibling
code path
sched/fair: Introduce scaled capacity awareness in wake_affine_idle
code path
kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 53 insertions(+), 13 deletions(-)
--
2.7.4