On Tue, Aug 22, 2017 at 1:46 PM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
Hi,
Just to clarify: I sent this patch for review, to make sure it aligns well with EAS and to make sure I am not undoing any of the efforts.
Thanks, Rohit
On 08/08/2017 04:28 PM, Rohit Jain wrote:
During OLTP workload runs, threads can end up on CPUs with a lot of softIRQ activity, thus delaying progress. For more reliable and faster runs, if the system can spare it, these threads should be scheduled on CPUs with lower IRQ/RT activity.
Which path handles the accounting of the IRQ portion of the capacity? I know RT pressure is accounted for, but I wasn't aware of the IRQ part.
Currently, the scheduler takes into account the original capacity of CPUs when providing 'hints' for the select_idle_sibling code path to return an idle CPU. However, the rest of the select_idle_* code paths remain capacity-agnostic. Further, these code paths are only aware of the original capacity and not the capacity stolen by IRQ/RT activity.
This patch introduces capacity awareness in the scheduler (CAS), which avoids CPUs whose capacity might be reduced (due to IRQ/RT activity) when scheduling threads (on the push side). This awareness has been added to the fair scheduling class.
It does so using the following algorithm:
- As in rt_avg, the scaled capacities are already calculated.
- Any CPU which is running below 80% of its original capacity is
  considered to be running low on capacity [*] (a standalone sketch of
  the cutoff arithmetic follows this list).
- During the idle CPU search, if a CPU is found running low on capacity,
  it is skipped if better CPUs are available.
- If none of the CPUs are better in terms of idleness and capacity, then
  the low-capacity CPU is considered to be the best available CPU.
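For reference, the 80% cutoff in the patch below is implemented with integer math as capacity_orig * 819 >> 10, and 819/1024 ~= 0.7998. Here is a minimal standalone sketch of that check; full_capacity_example() and the sample capacity values are made up for illustration and only stand in for the kernel's capacity_of()/capacity_orig_of():

#include <stdbool.h>
#include <stdio.h>

/*
 * "Full capacity" here means at least ~80% of the original capacity
 * remains, i.e. cap >= cap_orig * 819 / 1024.
 */
static bool full_capacity_example(unsigned long cap, unsigned long cap_orig)
{
        return cap >= (cap_orig * 819 >> 10);
}

int main(void)
{
        /* With cap_orig == 1024 (SCHED_CAPACITY_SCALE), the cutoff is 819. */
        printf("cap 768 -> %d\n", full_capacity_example(768, 1024)); /* 0: low on capacity */
        printf("cap 900 -> %d\n", full_capacity_example(900, 1024)); /* 1: full capacity */
        return 0;
}

The multiply-and-shift keeps the comparison in cheap integer math, which presumably matters since the check sits in the idle-search fast path.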
The performance numbers:
CAS shows up to a 1.5% improvement on x86 when running a 'SELECT' database workload.
I also used barrier.c (OpenMP code) as a micro-benchmark. It runs a number of iterations with a barrier sync at the end of each for loop (a rough sketch of this loop structure is included after the link below).
I was also running ping on CPU 0 as: 'ping -l 10000 -q -s 10 -f host2'
The results below should be read as follows:
- 'Baseline without ping' is how the workload would have behaved if there was no IRQ activity.
- Compare 'Baseline with ping' against 'Baseline without ping' to see the effect of ping.
- Compare 'CAS with ping' against 'Baseline with ping' to see the improvement CAS can give over the baseline.
The program (barrier.c) can be found at: http://www.spinics.net/lists/kernel/msg2506955.html
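The actual benchmark is at the link above; purely for context, a rough sketch of the kind of loop it measures (not the real barrier.c, just an illustration of "do per-thread work, then barrier-sync at the end of each pass") would look something like this:

#include <omp.h>
#include <stdio.h>

#define ITERS 100000

int main(void)
{
        double start = omp_get_wtime();

        /*
         * Every thread runs the same loop and they all meet at the barrier
         * at the end of each pass, so a single slow (IRQ-loaded) CPU delays
         * everyone.
         */
        #pragma omp parallel
        for (int i = 0; i < ITERS; i++) {
                /* per-thread work would go here */
                #pragma omp barrier
        }

        printf("%.1f iterations/sec\n", ITERS / (omp_get_wtime() - start));
        return 0;
}

Built with something like 'gcc -fopenmp', this makes it easy to see why a softIRQ-loaded CPU drags down the whole barrier workload.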
Following are the results for the iterations per second with this micro-benchmark (higher is better), on a 20 core x86 machine:
+-------+----------------+----------------+------------------+
|Num.   |CAS             |Baseline        |Baseline without  |
|Threads|with ping       |with ping       |ping              |
+-------+-------+--------+-------+--------+-------+----------+
|       |Mean   |Std. Dev|Mean   |Std. Dev|Mean   |Std. Dev  |
+-------+-------+--------+-------+--------+-------+----------+
|1      | 511.7 |   6.9  | 508.3 |  17.3  | 514.6 |    4.7   |
|2      | 486.8 |  16.3  | 463.9 |  17.4  | 510.8 |    3.9   |
|4      | 466.1 |  11.7  | 451.4 |  12.5  | 489.3 |    4.1   |
|8      | 433.6 |   3.7  | 427.5 |   2.2  | 447.6 |    5.0   |
|16     | 391.9 |   7.9  | 385.5 |  16.4  | 396.2 |    0.3   |
|32     | 269.3 |   5.3  | 266.0 |   6.6  | 276.8 |    0.2   |
+-------+-------+--------+-------+--------+-------+----------+
Following are the runtime(s) with hackbench and ping activity as described above (lower is better), on a 20 core x86 machine:
+---------------+------+--------+--------+
|Num.           |CAS   |Baseline|Baseline|
|Tasks          |with  |with    |without |
|(groups of 40) |ping  |ping    |ping    |
+---------------+------+--------+--------+
|               |Mean  |Mean    |Mean    |
+---------------+------+--------+--------+
|1              | 0.97 | 0.97   | 0.68   |
|2              | 1.36 | 1.36   | 1.30   |
|4              | 2.57 | 2.57   | 1.84   |
|8              | 3.31 | 3.34   | 2.86   |
|16             | 5.63 | 5.71   | 4.61   |
|25             | 7.99 | 8.23   | 6.78   |
+---------------+------+--------+--------+
[*] Question (RFC part):
In the previous discussion of this patch, the threshold used to decide whether a CPU is running low on capacity was calculated dynamically. In the tests I have done, 80% seems to be a good threshold.
Would it be OK to choose a fixed cutoff?
I think it's fine.
Your patch touches both the affine and non-affine paths, though, so I believe you should split it.
Also, I think it has diverged from upstream, as the code it modifies has since changed (check PeterZ's tree), so you should probably rebase, split the patches, and post them again.
-Joel
Changelog:
v1->v2:
- Changed the dynamic threshold calculation to a fixed cutoff, since having global state can be avoided.
Previous discussion can be found at:
https://patchwork.kernel.org/patch/9741351/
Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
 kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 66 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e..3c26c13 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5298,6 +5298,11 @@ static unsigned long cpu_avg_load_per_task(int cpu)
         return 0;
 }
 
+static inline bool full_capacity(int cpu)
+{
+        return (capacity_of(cpu) >= (capacity_orig_of(cpu)*819 >> 10));
+}
+
 static void record_wakee(struct task_struct *p)
 {
         /*
@@ -5516,9 +5521,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 {
         unsigned long load, min_load = ULONG_MAX;
         unsigned int min_exit_latency = UINT_MAX;
+        unsigned int backup_cap = 0;
         u64 latest_idle_timestamp = 0;
         int least_loaded_cpu = this_cpu;
         int shallowest_idle_cpu = -1;
+        int shallowest_idle_cpu_backup = -1;
         int i;
 
         /* Check if we have any choice: */
@@ -5538,7 +5545,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
                                  */
                                 min_exit_latency = idle->exit_latency;
                                 latest_idle_timestamp = rq->idle_stamp;
-                                shallowest_idle_cpu = i;
+                                if (full_capacity(i)) {
+                                        shallowest_idle_cpu = i;
+                                } else if (capacity_of(i) > backup_cap) {
+                                        shallowest_idle_cpu_backup = i;
+                                        backup_cap = capacity_of(i);
+                                }
                         } else if ((!idle || idle->exit_latency == min_exit_latency) &&
                                    rq->idle_stamp > latest_idle_timestamp) {
                                 /*
@@ -5547,7 +5559,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
                                  * a warmer cache.
                                  */
                                 latest_idle_timestamp = rq->idle_stamp;
-                                shallowest_idle_cpu = i;
+                                if (full_capacity(i)) {
+                                        shallowest_idle_cpu = i;
+                                } else if (capacity_of(i) > backup_cap) {
+                                        shallowest_idle_cpu_backup = i;
+                                        backup_cap = capacity_of(i);
+                                }
                         }
                 } else if (shallowest_idle_cpu == -1) {
                         load = weighted_cpuload(i);
@@ -5558,7 +5575,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
                 }
         }
 
-        return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
+        if (shallowest_idle_cpu != -1)
+                return shallowest_idle_cpu;
+
+        return (shallowest_idle_cpu_backup != -1 ?
+                        shallowest_idle_cpu_backup : least_loaded_cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -5620,7 +5641,9 @@ void __update_idle_core(struct rq *rq)
 static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
 {
         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
-        int core, cpu;
+        int core, cpu, rcpu, rcpu_backup;
+        unsigned int backup_cap = 0;
 
+        rcpu = rcpu_backup = -1;
         if (!static_branch_likely(&sched_smt_present))
                 return -1;
@@ -5637,10 +5660,20 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
                         cpumask_clear_cpu(cpu, cpus);
                         if (!idle_cpu(cpu))
                                 idle = false;
+                        if (full_capacity(cpu)) {
+                                rcpu = cpu;
+                        } else if ((rcpu == -1) &&
+                                   (capacity_of(cpu) > backup_cap)) {
+                                backup_cap = capacity_of(cpu);
+                                rcpu_backup = cpu;
+                        }
                 }
 
-                if (idle)
-                        return core;
+                if (idle) {
+                        if (rcpu == -1)
+                                return (rcpu_backup != -1 ? rcpu_backup : core);
+                        return rcpu;
+                }
         }
 
         /*
@@ -5656,7 +5689,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
  */
 static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
 {
-        int cpu;
+        int cpu, backup_cpu = -1;
+        unsigned int backup_cap = 0;
 
         if (!static_branch_likely(&sched_smt_present))
                 return -1;
@@ -5664,11 +5698,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
         for_each_cpu(cpu, cpu_smt_mask(target)) {
                 if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
                         continue;
-                if (idle_cpu(cpu))
-                        return cpu;
+                if (idle_cpu(cpu)) {
+                        if (full_capacity(cpu))
+                                return cpu;
+                        if (capacity_of(cpu) > backup_cap) {
+                                backup_cap = capacity_of(cpu);
+                                backup_cpu = cpu;
+                        }
+                }
         }
 
-        return -1;
+        return backup_cpu;
 }
 
 #else /* CONFIG_SCHED_SMT */
@@ -5697,6 +5737,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
         u64 time, cost;
         s64 delta;
         int cpu, nr = INT_MAX;
+        int backup_cpu = -1;
+        unsigned int backup_cap = 0;
 
         this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
         if (!this_sd)
@@ -5727,10 +5769,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
                         return -1;
                 if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
                         continue;
-                if (idle_cpu(cpu))
-                        break;
+                if (idle_cpu(cpu)) {
+                        if (full_capacity(cpu)) {
+                                backup_cpu = -1;
+                                break;
+                        } else if (capacity_of(cpu) > backup_cap) {
+                                backup_cap = capacity_of(cpu);
+                                backup_cpu = cpu;
+                        }
+                }
         }
 
+        if (backup_cpu >= 0)
+                cpu = backup_cpu;
         time = local_clock() - time;
         cost = this_sd->avg_scan_cost;
         delta = (s64)(time - cost) / 8;
@@ -5747,13 +5798,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
         struct sched_domain *sd;
         int i;
 
-        if (idle_cpu(target))
+        if (idle_cpu(target) && full_capacity(target))
                 return target;
 
         /*
          * If the previous cpu is cache affine and idle, don't be stupid.
          */
-        if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+        if (prev != target && cpus_share_cache(prev, target) &&
+            idle_cpu(prev) && full_capacity(prev))
                 return prev;
 
         sd = rcu_dereference(per_cpu(sd_llc, target));