During OLTP workload runs, threads can end up on CPUs with a lot of softIRQ activity, thus delaying progress. For more reliable and faster runs, if the system can spare it, these threads should be scheduled on CPUs with lower IRQ/RT activity.
Currently, the scheduler takes the original capacity of CPUs into account when providing 'hints' for the select_idle_sibling code path to return an idle CPU. However, the rest of the select_idle_* code paths remain capacity agnostic. Further, these code paths are only aware of the original capacity and not of the capacity stolen by IRQ/RT activity.
This patch introduces capacity awareness in the scheduler (CAS), which avoids CPUs whose capacity may be reduced (due to IRQ/RT activity) when trying to schedule threads (on the push side) in the system. This awareness has been added to the fair scheduling class.
It does so using the following algorithm: 1) The scaled capacities are already calculated (as part of the rt_avg accounting).
2) Any CPU running below 80% of its original capacity is considered to be running low on capacity.
3) During the idle CPU search, if a CPU is found to be running low on capacity, it is skipped if better CPUs are available.
4) If none of the CPUs are better in terms of idleness and capacity, then the low-capacity CPU is considered to be the best available CPU.
The performance numbers*:
---------------------------------------------------------------------------
CAS shows up to 1.5% improvement on x86 when running a 'SELECT' database workload.
I also used barrier.c (OpenMP code) as a micro-benchmark. It runs a number of iterations, with a barrier sync at the end of each for loop.
I was also running ping on CPU 0 as: 'ping -l 10000 -q -s 10 -f host2'
The results below should be read as:
* 'Baseline without ping' is how the workload would've behaved if there was no IRQ activity.
* Compare 'Baseline with ping' and 'Baseline without ping' to see the effect of ping.
* Compare 'Baseline with ping' and 'CAS with ping' to see the improvement CAS can give over the baseline.
The program (barrier.c) can be found at: http://www.spinics.net/lists/kernel/msg2506955.html
Following are the results for the iterations per second with this micro-benchmark (higher is better), on a 20 core x86 machine:
+-------+----------------+----------------+------------------+
|Num.   |CAS             |Baseline        |Baseline without  |
|Threads|with ping       |with ping       |ping              |
+-------+-------+--------+-------+--------+-------+----------+
|       |Mean   |Std. Dev|Mean   |Std. Dev|Mean   |Std. Dev  |
+-------+-------+--------+-------+--------+-------+----------+
|1      | 511.7 | 6.9    | 508.3 | 17.3   | 514.6 | 4.7      |
|2      | 486.8 | 16.3   | 463.9 | 17.4   | 510.8 | 3.9      |
|4      | 466.1 | 11.7   | 451.4 | 12.5   | 489.3 | 4.1      |
|8      | 433.6 | 3.7    | 427.5 | 2.2    | 447.6 | 5.0      |
|16     | 391.9 | 7.9    | 385.5 | 16.4   | 396.2 | 0.3      |
|32     | 269.3 | 5.3    | 266.0 | 6.6    | 276.8 | 0.2      |
+-------+-------+--------+-------+--------+-------+----------+
Following are the runtime(s) with hackbench and ping activity as described above (lower is better), on a 20 core x86 machine:
+---------------+------+--------+--------+
|Num.           |CAS   |Baseline|Baseline|
|Tasks          |with  |with    |without |
|(groups of 40) |ping  |ping    |ping    |
+---------------+------+--------+--------+
|               |Mean  |Mean    |Mean    |
+---------------+------+--------+--------+
|1              | 0.97 | 0.97   | 0.68   |
|2              | 1.36 | 1.36   | 1.30   |
|4              | 2.57 | 2.57   | 1.84   |
|8              | 3.31 | 3.34   | 2.86   |
|16             | 5.63 | 5.71   | 4.61   |
|25             | 7.99 | 8.23   | 6.78   |
+---------------+------+--------+--------+
*Performance numbers for ARM:
---------------------------------------------------------------------------
I was asked in the v2 review to show the efficacy on ARM; however, I am having some difficulty getting hold of an ARM machine. Would it be possible for someone to try this out on ARM?
Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation, since keeping global state can be avoided.

v2->v3:
* Split up the patch for the find_idlest_cpu and select_idle_sibling code paths.
Previous discussion can be found at:
---------------------------------------------------------------------------
https://patchwork.kernel.org/patch/9741351/
Rohit Jain (2):
  sched: Introduce scaled capacity awareness in find_idlest_cpu code path
  sched: Introduce scaled capacity awareness in select_idle_sibling code path

 kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 66 insertions(+), 14 deletions(-)
--
2.7.4
While looking for idle CPUs for a waking task, we should also account for the delays caused by the bandwidth stolen by RT/IRQ tasks.
This patch does that by trying to find a higher-capacity CPU with minimum wakeup latency.
Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/sched/fair.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e..f66ac8c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5298,6 +5298,11 @@ static unsigned long cpu_avg_load_per_task(int cpu)
 	return 0;
 }
+static inline bool full_capacity(int cpu)
+{
+	return (capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10));
+}
+
 static void record_wakee(struct task_struct *p)
 {
	/*
@@ -5516,9 +5521,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 {
	unsigned long load, min_load = ULONG_MAX;
	unsigned int min_exit_latency = UINT_MAX;
+	unsigned int backup_cap = 0;
	u64 latest_idle_timestamp = 0;
	int least_loaded_cpu = this_cpu;
	int shallowest_idle_cpu = -1;
+	int shallowest_idle_cpu_backup = -1;
	int i;
	/* Check if we have any choice: */
@@ -5538,7 +5545,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
			 */
			min_exit_latency = idle->exit_latency;
			latest_idle_timestamp = rq->idle_stamp;
-			shallowest_idle_cpu = i;
+			if (full_capacity(i)) {
+				shallowest_idle_cpu = i;
+			} else if (capacity_of(i) > backup_cap) {
+				shallowest_idle_cpu_backup = i;
+				backup_cap = capacity_of(i);
+			}
		} else if ((!idle || idle->exit_latency == min_exit_latency) &&
			   rq->idle_stamp > latest_idle_timestamp) {
			/*
@@ -5547,7 +5559,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
			 * a warmer cache.
			 */
			latest_idle_timestamp = rq->idle_stamp;
-			shallowest_idle_cpu = i;
+			if (full_capacity(i)) {
+				shallowest_idle_cpu = i;
+			} else if (capacity_of(i) > backup_cap) {
+				shallowest_idle_cpu_backup = i;
+				backup_cap = capacity_of(i);
+			}
		} else if (shallowest_idle_cpu == -1) {
			load = weighted_cpuload(i);
@@ -5558,7 +5575,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
		}
	}
-	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
+	if (shallowest_idle_cpu != -1)
+		return shallowest_idle_cpu;
+
+	return (shallowest_idle_cpu_backup != -1 ?
+		shallowest_idle_cpu_backup : least_loaded_cpu);
 }
 #ifdef CONFIG_SCHED_SMT
--
2.7.4
While looking for CPUs to place running tasks on, the scheduler completely ignores the capacity stolen away by RT/IRQ tasks.
This patch fixes that.
Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/sched/fair.c | 53 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 42 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f66ac8c..bd48916 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5641,7 +5641,9 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-	int core, cpu;
+	int core, cpu, rcpu, rcpu_backup;
+	unsigned int backup_cap = 0;
+	rcpu = rcpu_backup = -1;
	if (!static_branch_likely(&sched_smt_present))
		return -1;
@@ -5658,10 +5660,20 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
			cpumask_clear_cpu(cpu, cpus);
			if (!idle_cpu(cpu))
				idle = false;
+
+			if (full_capacity(cpu)) {
+				rcpu = cpu;
+			} else if ((rcpu == -1) && (capacity_of(cpu) > backup_cap)) {
+				backup_cap = capacity_of(cpu);
+				rcpu_backup = cpu;
+			}
		}
-		if (idle)
-			return core;
+		if (idle) {
+			if (rcpu == -1)
+				return (rcpu_backup != -1 ? rcpu_backup : core);
+			return rcpu;
+		}
	}
	/*
@@ -5677,7 +5689,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
 {
-	int cpu;
+	int cpu, backup_cpu = -1;
+	unsigned int backup_cap = 0;
	if (!static_branch_likely(&sched_smt_present))
		return -1;
@@ -5685,11 +5698,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
	for_each_cpu(cpu, cpu_smt_mask(target)) {
		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
-		if (idle_cpu(cpu))
-			return cpu;
+		if (idle_cpu(cpu)) {
+			if (full_capacity(cpu))
+				return cpu;
+			if (capacity_of(cpu) > backup_cap) {
+				backup_cap = capacity_of(cpu);
+				backup_cpu = cpu;
+			}
+		}
	}
-	return -1;
+	return backup_cpu;
 }
 #else /* CONFIG_SCHED_SMT */
@@ -5718,6 +5737,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
	u64 time, cost;
	s64 delta;
	int cpu, nr = INT_MAX;
+	int backup_cpu = -1;
+	unsigned int backup_cap = 0;
	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
	if (!this_sd)
@@ -5748,10 +5769,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
			return -1;
		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
			continue;
-		if (idle_cpu(cpu))
-			break;
+		if (idle_cpu(cpu)) {
+			if (full_capacity(cpu)) {
+				backup_cpu = -1;
+				break;
+			} else if (capacity_of(cpu) > backup_cap) {
+				backup_cap = capacity_of(cpu);
+				backup_cpu = cpu;
+			}
+		}
	}
+	if (backup_cpu >= 0)
+		cpu = backup_cpu;
	time = local_clock() - time;
	cost = this_sd->avg_scan_cost;
	delta = (s64)(time - cost) / 8;
@@ -5768,13 +5798,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
	struct sched_domain *sd;
	int i;
-	if (idle_cpu(target))
+	if (idle_cpu(target) && full_capacity(target))
		return target;
	/*
	 * If the previous cpu is cache affine and idle, don't be stupid.
	 */
-	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev)
+	    && full_capacity(prev))
		return prev;
	sd = rcu_dereference(per_cpu(sd_llc, target));
--
2.7.4