Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation as the having global state
can be avoided.
v2->v3:
* Split up the patch for find_idlest_cpu and select_idle_sibling code
paths.
v3->v4:
* Rebased it to peterz's tree (apologies for wrong tree for v3)
v4->v5:
* Changed the threshold to 768 from 819 for easier shifts
* Changed the find_idlest_cpu code path to be simpler
* Changed the select_idle_core code path to search for
idlest+full_capacity core
* Added scaled capacity awareness to wake_affine_idle code path
---------------------------------------------------------------------------
During OLTP workload runs, threads can end up on CPUs with a lot of
softIRQ activity, thus delaying progress. For more reliable and
faster runs, if the system can spare it, these threads should be
scheduled on CPUs with lower IRQ/RT activity.
Currently, the scheduler takes into account the original capacity of
CPUs when providing 'hints' for select_idle_sibling code path to return
an idle CPU. However, the rest of the select_idle_* code paths remain
capacity agnostic. Further, these code paths are only aware of the
original capacity and not the capacity stolen by IRQ/RT activity.
This patch introduces capacity awarness in scheduler (CAS) which avoids
CPUs which might have their capacities reduced (due to IRQ/RT activity)
when trying to schedule threads (on the push side) in the system. This
awareness has been added into the fair scheduling class.
It does so by, using the following algorithm:
1) As in rt_avg the scaled capacities are already calculated.
2) Any CPU which is running below 80% capacity is considered running low
on capacity.
3) During idle CPU search if a CPU is found running low on capacity, it
is skipped if better CPUs are available.
4) If none of the CPUs are better in terms of idleness and capacity, then
the low-capacity CPU is considered to be the best available CPU.
The performance numbers:
---------------------------------------------------------------------------
CAS shows upto 1.5% improvement on x86 when running 'SELECT' database
workload.
For microbenchmark results, I used hackbench running with process along
with, running ping on CPU 0,1 and 2 as:
'ping -l 10000 -q -s 10 -f hostX'
The results below should be read as:
* 'Baseline without ping' is how the workload would've behaved if there
was no IRQ activity.
* Compare 'Baseline with ping' and 'Baseline without ping' to see the
effect of ping
* Compare 'Baseline with ping' and 'CAS with ping' to see the improvement
CAS can give over baseline
Following are the runtime(s) with hackbench and ping activity as
described above (lower is better), on a 44 core 2 socket x86 machine:
+---------------+------+--------+--------+
|Num. |CAS |Baseline|Baseline|
|Tasks |with |with |without |
|(groups of 40) |ping |ping |ping |
+---------------+------+--------+--------+
| |Mean |Mean |Mean |
+---------------+------+--------+--------+
|1 | 0.55 | 0.59 | 0.53 |
|2 | 0.66 | 0.81 | 0.51 |
|4 | 0.99 | 1.16 | 0.95 |
|8 | 1.92 | 1.93 | 1.88 |
|16 | 3.24 | 3.26 | 3.15 |
|32 | 5.93 | 5.98 | 5.68 |
|64 | 11.55| 11.94 | 10.89 |
+---------------+------+--------+--------+
Rohit Jain (3):
sched/fair: Introduce scaled capacity awareness in find_idlest_cpu
code path
sched/fair: Introduce scaled capacity awareness in select_idle_sibling
code path
sched/fair: Introduce scaled capacity awareness in wake_affine_idle
code path
kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 53 insertions(+), 13 deletions(-)
--
2.7.4
The current implementation of overutilization, aborts energy aware
scheduling if any cpu in the system is over-utilized. This patch introduces
over utilization flag per sched domain level instead of a single flag
system wide. Load balancing is done at the sched domain where any
of the cpu is over utilized. If energy aware scheduling is
enabled and no cpu in a sched domain is overuttilized,
load balancing is skipped for that sched domain and energy aware
scheduling continues at that level.
The implementation takes advantage of the shared sched_domain structure
that is common across all the sched domains at a level. The new flag
introduced is placed in this structure so that all the sched domains the
same level share the flag. In case of an overutilized cpu, the flag gets
set at level1 sched_domain. The flag at the parent sched_domain level gets
set in either of the two following scenarios.
1. There is a misfit task in one of the cpu's in this sched_domain.
2. The total utilization of the domain is greater than the domain capacity
The flag is cleared if no cpu in a sched domain is overutilized.
This implementation still can have corner scenarios with respect to
misfit tasks. For example consider a sched group with n cpus and
n+1 70%utilized tasks. Ideally this is a case for load balance to happen
in a parent sched domain. But neither the total group utilization is
high enough for the load balance to be triggered
in the parent domain nor there is a cpu with a single overutilized task so
that aload balance is triggered in a parent domain. But again this could be
a purely academic sceanrio, as during task wake up these tasks will be placed
more appropriately.
Signed-off-by: Thara Gopinath <thara.gopinath(a)linaro.org>
---
V2->V3:
- Rebased on latest kernel.
- The previous check for misfit task is replaced with the
newely introduced rq->misfit_task flag.
V1->V2:
- Removed overutilized flag from sched_group structure.
- In case of misfit task, it is ensured that a load balance is
triggered in a parent sched domain with assymetric cpu capacities.
include/linux/sched/topology.h | 1 +
kernel/sched/fair.c | 137 +++++++++++++++++++++++++++++++++--------
kernel/sched/sched.h | 3 -
kernel/sched/topology.c | 8 +--
4 files changed, 117 insertions(+), 32 deletions(-)
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 3137750..ae44044 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -88,6 +88,7 @@ struct sched_domain_shared {
atomic_t ref;
atomic_t nr_busy_cpus;
int has_idle_cores;
+ bool overutilized;
};
struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a9ac67c..34bdfeb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4791,6 +4791,29 @@ static inline void hrtick_update(struct rq *rq)
static bool cpu_overutilized(int cpu);
+static bool
+is_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ return sd->shared->overutilized;
+ else
+ return false;
+}
+
+static void
+set_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ sd->shared->overutilized = true;
+}
+
+static void
+clear_sd_overutilized(struct sched_domain *sd)
+{
+ if (sd)
+ sd->shared->overutilized = false;
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -4800,6 +4823,7 @@ static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
+ struct sched_domain *sd;
struct sched_entity *se = &p->se;
int task_new = !(flags & ENQUEUE_WAKEUP);
@@ -4843,9 +4867,12 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se) {
add_nr_running(rq, 1);
- if (!task_new && !rq->rd->overutilized &&
- cpu_overutilized(rq->cpu))
- rq->rd->overutilized = true;
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (!task_new && !is_sd_overutilized(sd) &&
+ cpu_overutilized(rq->cpu))
+ set_sd_overutilized(sd);
+ rcu_read_unlock();
}
hrtick_update(rq);
}
@@ -6276,8 +6303,7 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
unsigned long max_spare = 0;
struct sched_domain *sd;
- rcu_read_lock();
-
+ /* The rcu lock is/should be held in the caller function */
sd = rcu_dereference(per_cpu(sd_ea, prev_cpu));
if (!sd)
@@ -6315,8 +6341,6 @@ static int select_energy_cpu_brute(struct task_struct *p, int prev_cpu)
}
unlock:
- rcu_read_unlock();
-
if (energy_cpu == prev_cpu && !cpu_overutilized(prev_cpu))
return prev_cpu;
@@ -6350,10 +6374,16 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
&& cpumask_test_cpu(cpu, &p->cpus_allowed);
}
- if (energy_aware() && !(cpu_rq(prev_cpu)->rd->overutilized))
- return select_energy_cpu_brute(p, prev_cpu);
-
rcu_read_lock();
+ sd = rcu_dereference(cpu_rq(prev_cpu)->sd);
+ if (energy_aware() &&
+ !is_sd_overutilized(sd)) {
+ new_cpu = select_energy_cpu_brute(p, prev_cpu);
+ goto unlock;
+ }
+
+ sd = NULL;
+
for_each_domain(cpu, tmp) {
if (!(tmp->flags & SD_LOAD_BALANCE))
break;
@@ -6418,6 +6448,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
}
/* while loop will break here if sd == NULL */
}
+
+unlock:
rcu_read_unlock();
return new_cpu;
@@ -7478,6 +7510,7 @@ struct sd_lb_stats {
struct sched_group *local; /* Local group in this sd */
unsigned long total_load; /* Total load of all groups in sd */
unsigned long total_capacity; /* Total capacity of all groups in sd */
+ unsigned long total_util; /* Total util of all groups in sd */
unsigned long avg_load; /* Average load across all groups in sd */
struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7497,6 +7530,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
.local = NULL,
.total_load = 0UL,
.total_capacity = 0UL,
+ .total_util = 0UL,
.busiest_stat = {
.avg_load = 0UL,
.sum_nr_running = 0,
@@ -7792,7 +7826,7 @@ group_type group_classify(struct sched_group *group,
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
int local_group, struct sg_lb_stats *sgs,
- bool *overload, bool *overutilized)
+ bool *overload, bool *overutilized, bool *misfit_task)
{
unsigned long load;
int i, nr_running;
@@ -7831,8 +7865,16 @@ static inline void update_sg_lb_stats(struct lb_env *env,
!sgs->group_misfit_task && rq->misfit_task)
sgs->group_misfit_task = capacity_of(i);
- if (cpu_overutilized(i))
+ if (cpu_overutilized(i)) {
*overutilized = true;
+ /*
+ * If the cpu is overutilized and if there is only one
+ * current task in cfs runqueue, it is potentially a misfit
+ * task.
+ */
+ if (rq->misfit_task)
+ *misfit_task = true;
+ }
}
/* Adjust by relative CPU capacity of the group */
@@ -7974,12 +8016,12 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
*/
static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
- struct sched_domain *child = env->sd->child;
+ struct sched_domain *child = env->sd->child, *sd;
struct sched_group *sg = env->sd->groups;
struct sg_lb_stats *local = &sds->local_stat;
struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;
- bool overload = false, overutilized = false;
+ bool overload = false, overutilized = false, misfit_task = false;
if (child && child->flags & SD_PREFER_SIBLING)
prefer_sibling = 1;
@@ -8001,7 +8043,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
}
update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
- &overload, &overutilized);
+ &overload, &overutilized,
+ &misfit_task);
if (local_group)
goto next_group;
@@ -8032,6 +8075,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* Now, start updating sd_lb_stats */
sds->total_load += sgs->group_load;
sds->total_capacity += sgs->group_capacity;
+ sds->total_util += sgs->group_util;
sg = sg->next;
} while (sg != env->sd->groups);
@@ -8045,14 +8089,45 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
/* update overload indicator if we are at root domain */
if (env->dst_rq->rd->overload != overload)
env->dst_rq->rd->overload = overload;
+ }
- /* Update over-utilization (tipping point, U >= 0) indicator */
- if (env->dst_rq->rd->overutilized != overutilized)
- env->dst_rq->rd->overutilized = overutilized;
- } else {
- if (!env->dst_rq->rd->overutilized && overutilized)
- env->dst_rq->rd->overutilized = true;
+ if (overutilized)
+ set_sd_overutilized(env->sd);
+ else
+ clear_sd_overutilized(env->sd);
+
+ /*
+ * If there is a misfit task in one cpu in this sched_domain
+ * it is likely that the imbalance cannot be sorted out among
+ * the cpu's in this sched_domain. In this case set the
+ * overutilized flag at the parent sched_domain.
+ */
+ if (misfit_task) {
+
+ sd = env->sd->parent;
+
+ /*
+ * In case of a misfit task, load balance at the parent
+ * sched domain level will make sense only if the the cpus
+ * have a different capacity. If cpus at a domain level have
+ * the same capacity, the misfit task cannot be well
+ * accomodated in any of the cpus and there in no point in
+ * trying a load balance at this level
+ */
+ while (sd) {
+ if (sd->flags & SD_ASYM_CPUCAPACITY) {
+ set_sd_overutilized(sd);
+ break;
+ }
+ sd = sd->parent;
+ }
}
+
+ /* If the domain util is greater that domain capacity, load balancing
+ * needs to be done at the next sched domain level as well
+ */
+ if (sds->total_capacity * 1024 < sds->total_util * capacity_margin)
+ set_sd_overutilized(env->sd->parent);
}
/**
@@ -8279,8 +8354,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
*/
update_sd_lb_stats(env, &sds);
- if (energy_aware() && !env->dst_rq->rd->overutilized)
- goto out_balanced;
+ if (energy_aware()) {
+ if (!is_sd_overutilized(env->sd))
+ goto out_balanced;
+ }
local = &sds.local_stat;
busiest = &sds.busiest_stat;
@@ -9164,6 +9241,11 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
rcu_read_lock();
for_each_domain(cpu, sd) {
+ if (energy_aware()) {
+ if (!is_sd_overutilized(sd))
+ continue;
+ }
+
/*
* Decay the newidle max times here because this is a regular
* visit to all the domains. Decay ~1% per second.
@@ -9466,6 +9548,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ struct sched_domain *sd;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
@@ -9477,8 +9560,12 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
rq->misfit_task = !task_fits_capacity(curr, capacity_of(rq->cpu));
- if (!rq->rd->overutilized && cpu_overutilized(task_cpu(curr)))
- rq->rd->overutilized = true;
+ rcu_read_lock();
+ sd = rcu_dereference(rq->sd);
+ if (!is_sd_overutilized(sd) &&
+ cpu_overutilized(task_cpu(curr)))
+ set_sd_overutilized(sd);
+ rcu_read_unlock();
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8d27d5b..1604ef2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -585,9 +585,6 @@ struct root_domain {
/* Indicate more than one runnable task for any CPU */
bool overload;
- /* Indicate one or more cpus over-utilized (tipping point) */
- bool overutilized;
-
/*
* The bit corresponding to a CPU gets set here if such CPU has more
* than one runnable -deadline task (as it is below for RT tasks).
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 263e549..e5ba6fc 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1040,11 +1040,11 @@ sd_init(struct sched_domain_topology_level *tl,
* For all levels sharing cache; connect a sched_domain_shared
* instance.
*/
- if (sd->flags & SD_SHARE_PKG_RESOURCES) {
- sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
- atomic_inc(&sd->shared->ref);
+ sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+ atomic_inc(&sd->shared->ref);
+
+ if (sd->flags & SD_SHARE_PKG_RESOURCES)
atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
- }
sd->private = sdd;
--
2.1.4
During OLTP workload runs, threads can end up on CPUs with a lot of
softIRQ activity, thus delaying progress. For more reliable and
faster runs, if the system can spare it, these threads should be
scheduled on CPUs with lower IRQ/RT activity.
Currently, the scheduler takes into account the original capacity of
CPUs when providing 'hints' for select_idle_sibling code path to return
an idle CPU. However, the rest of the select_idle_* code paths remain
capacity agnostic. Further, these code paths are only aware of the
original capacity and not the capacity stolen by IRQ/RT activity.
This patch introduces capacity awarness in scheduler (CAS) which avoids
CPUs which might have their capacities reduced (due to IRQ/RT activity)
when trying to schedule threads (on the push side) in the system. This
awareness has been added into the fair scheduling class.
It does so by, using the following algorithm:
1) As in rt_avg the scaled capacities are already calculated.
2) Any CPU which is running below 80% capacity is considered running low
on capacity.
3) During idle CPU search if a CPU is found running low on capacity, it
is skipped if better CPUs are available.
4) If none of the CPUs are better in terms of idleness and capacity, then
the low-capacity CPU is considered to be the best available CPU.
The performance numbers:
---------------------------------------------------------------------------
CAS shows upto 1.5% improvement on x86 when running 'SELECT' database
workload.
I also used barrier.c (open_mp code) as a micro-benchmark. It does a number
of iterations and barrier sync at the end of each for loop.
I was also running ping on CPU 0 as:
'ping -l 10000 -q -s 10 -f host2'
The results below should be read as:
* 'Baseline without ping' is how the workload would've behaved if there
was no IRQ activity.
* Compare 'Baseline with ping' and 'Baseline without ping' to see the
effect of ping
* Compare 'Baseline with ping' and 'CAS with ping' to see the improvement
CAS can give over baseline
The program (barrier.c) can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html
Following are the results for the iterations per second with this
micro-benchmark (higher is better), on a 20 core x86 machine:
+-------+----------------+----------------+------------------+
|Num. |CAS |Baseline |Baseline without |
|Threads|with ping |with ping |ping |
+-------+-------+--------+-------+--------+-------+----------+
| |Mean |Std. Dev|Mean |Std. Dev|Mean |Std. Dev |
+-------+-------+--------+-------+--------+-------+----------+
|1 | 511.7 | 6.9 | 508.3 | 17.3 | 514.6 | 4.7 |
|2 | 486.8 | 16.3 | 463.9 | 17.4 | 510.8 | 3.9 |
|4 | 466.1 | 11.7 | 451.4 | 12.5 | 489.3 | 4.1 |
|8 | 433.6 | 3.7 | 427.5 | 2.2 | 447.6 | 5.0 |
|16 | 391.9 | 7.9 | 385.5 | 16.4 | 396.2 | 0.3 |
|32 | 269.3 | 5.3 | 266.0 | 6.6 | 276.8 | 0.2 |
+-------+-------+--------+-------+--------+-------+----------+
Following are the runtime(s) with hackbench and ping activity as
described above (lower is better), on a 20 core x86 machine:
+---------------+------+--------+--------+
|Num. |CAS |Baseline|Baseline|
|Tasks |with |with |without |
|(groups of 40) |ping |ping |ping |
+---------------+------+--------+--------+
| |Mean |Mean |Mean |
+---------------+------+--------+--------+
|1 | 0.97 | 0.97 | 0.68 |
|2 | 1.36 | 1.36 | 1.30 |
|4 | 2.57 | 2.57 | 1.84 |
|8 | 3.31 | 3.34 | 2.86 |
|16 | 5.63 | 5.71 | 4.61 |
|25 | 7.99 | 8.23 | 6.78 |
+---------------+------+--------+--------+
Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation as the having global state
can be avoided.
v2->v3:
* Split up the patch for find_idlest_cpu and select_idle_sibling code
paths.
v3->v4:
* Rebased it to peterz's tree (apologies for wrong tree for v3)
Previous discussion can be found at:
---------------------------------------------------------------------------
https://patchwork.kernel.org/patch/9741351/https://lists.linaro.org/pipermail/eas-dev/2017-August/000933.html
Rohit Jain (3):
sched/fair: Introduce scaled capacity awareness in find_idlest_cpu
code path
sched/fair: Introduce scaled capacity awareness in select_idle_sibling
code path
ignore_this_patch: Fixing compilation error on Peter's tree
kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++---------
kernel/time/tick-sched.c | 1 +
2 files changed, 68 insertions(+), 14 deletions(-)
--
2.7.4
For multi-tenancy currently there are mechanisms to share the system CPUs
by time-sharing (e.g: CFS) and by dividing up the system in 'rigid'
containers by using system calls like sched_setaffinity. There is no
existing way in the linux kernel today, for flexible workloads where there
is a need to give the whole system while still maintaining a notion of
preference to CPUs.
This patch introduces a new CPU mask, 'cpus_preferred' within the
task_struct structure and allows applications a way to specify a set of
CPUs which the application would like to run on. The scheduler will try
to honor the applications' request the best it can, however if the
scheduler finds that there are no idle CPUs within the preferred list,
it shall run the application anywhere within the system.
This can be used to design soft containers which allows a tenant to use
more capacity than he is entitled to when others aren't fully using
theirs. The advantage of space sharing the system as opposed to time
sharing is that you maintain more cache locality when the soft
containers are being utilized.
Since this behavior is observed on every scheduling decision, the
application gets to run on its preferred CPUs as long as the application
does not overuse its specified resources.
The design of soft containers still needs more user-space code however,
this is what is needed from the kernel.
FAQs:
Q) What if I set "hard" affinity after I set a preference by using soft
affinity?
A: Hard affinity will over-ride any previous soft affinity.
Q) What if my application had already specified a "hard" affinity? Can I
still provide a set of CPUs for soft affinity?
A: Yes, it will work as long as the new soft affinity is a subset of the
"hard" affinity.
Q) Can I have mutually exclusive hard and soft affinities?
A: No, soft affinity is always a subset of hard affinity.
Note:
Ignore the kernel/sched/tick-sched.c change. It is just fixing a build
error on Peter's tree.
Rohit Jain (2):
sched: Introduce new flags to sched_setaffinity to support soft
affinity.
sched: Actual changes after adding SCHED_SOFT_AFFINITY to make it work
with the scheduler
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/init_task.h | 1 +
include/linux/sched.h | 4 +-
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/sched.h | 3 +
kernel/compat.c | 2 +-
kernel/sched/core.c | 167 ++++++++++++++++++++++++++++-----
kernel/sched/cpudeadline.c | 4 +-
kernel/sched/cpupri.c | 4 +-
kernel/sched/fair.c | 116 +++++++++++++++++------
kernel/time/tick-sched.c | 1 +
12 files changed, 250 insertions(+), 60 deletions(-)
--
2.7.4
Good day!
Been really enjoying watching development of eas on Android Gerrit and in
other places so first let me say thanks again for all the great work.
With that, I had a couple questions about idle states in the energy model.
I am mainly curious about the number of tuples used for the idle states. On
the Pixel they used "2 2 0" for the CPU Idle States, I would think that
would be for:
wfi
fpc-def
fpc
But for Cluster Idle states they used "0 0" but I would think there should
be 3 tuples since the Cluster Idle States are:
l2-wfi
l2-gdhs
l2-fpc
So why only the two "0 0" for the Pixel EM?
Now this leads up to my ultimate question and that is about the SD835 (from
the OnePlus5)
For CPU Idle States we have:
wfi
ret
pc
But we disable idle_enable for "ret". So would this mean that in my own EM
for the OP5 I should only have 2 tuples?
And for Cluster Idle States we have:
l2-wfi
l2-dynret
l2-ret
l2-pc
But on the OP5 we disable l2-dynret and l2-ret. So once again, should I
only have 2 tuples for the number of idle states used or a tuple for each
physical idle state possible?
Kind Regards,
Zachariah Kennedy
During OLTP workload runs, threads can end up on CPUs with a lot of
softIRQ activity, thus delaying progress. For more reliable and
faster runs, if the system can spare it, these threads should be
scheduled on CPUs with lower IRQ/RT activity.
Currently, the scheduler takes into account the original capacity of
CPUs when providing 'hints' for select_idle_sibling code path to return
an idle CPU. However, the rest of the select_idle_* code paths remain
capacity agnostic. Further, these code paths are only aware of the
original capacity and not the capacity stolen by IRQ/RT activity.
This patch introduces capacity awarness in scheduler (CAS) which avoids
CPUs which might have their capacities reduced (due to IRQ/RT activity)
when trying to schedule threads (on the push side) in the system. This
awareness has been added into the fair scheduling class.
It does so by, using the following algorithm:
1) As in rt_avg the scaled capacities are already calculated.
2) Any CPU which is running below 80% capacity is considered running low
on capacity.
3) During idle CPU search if a CPU is found running low on capacity, it
is skipped if better CPUs are available.
4) If none of the CPUs are better in terms of idleness and capacity, then
the low-capacity CPU is considered to be the best available CPU.
The performance numbers*:
---------------------------------------------------------------------------
CAS shows upto 1.5% improvement on x86 when running 'SELECT' database
workload.
I also used barrier.c (open_mp code) as a micro-benchmark. It does a number
of iterations and barrier sync at the end of each for loop.
I was also running ping on CPU 0 as:
'ping -l 10000 -q -s 10 -f host2'
The results below should be read as:
* 'Baseline without ping' is how the workload would've behaved if there
was no IRQ activity.
* Compare 'Baseline with ping' and 'Baseline without ping' to see the
effect of ping
* Compare 'Baseline with ping' and 'CAS with ping' to see the improvement
CAS can give over baseline
The program (barrier.c) can be found at:
http://www.spinics.net/lists/kernel/msg2506955.html
Following are the results for the iterations per second with this
micro-benchmark (higher is better), on a 20 core x86 machine:
+-------+----------------+----------------+------------------+
|Num. |CAS |Baseline |Baseline without |
|Threads|with ping |with ping |ping |
+-------+-------+--------+-------+--------+-------+----------+
| |Mean |Std. Dev|Mean |Std. Dev|Mean |Std. Dev |
+-------+-------+--------+-------+--------+-------+----------+
|1 | 511.7 | 6.9 | 508.3 | 17.3 | 514.6 | 4.7 |
|2 | 486.8 | 16.3 | 463.9 | 17.4 | 510.8 | 3.9 |
|4 | 466.1 | 11.7 | 451.4 | 12.5 | 489.3 | 4.1 |
|8 | 433.6 | 3.7 | 427.5 | 2.2 | 447.6 | 5.0 |
|16 | 391.9 | 7.9 | 385.5 | 16.4 | 396.2 | 0.3 |
|32 | 269.3 | 5.3 | 266.0 | 6.6 | 276.8 | 0.2 |
+-------+-------+--------+-------+--------+-------+----------+
Following are the runtime(s) with hackbench and ping activity as
described above (lower is better), on a 20 core x86 machine:
+---------------+------+--------+--------+
|Num. |CAS |Baseline|Baseline|
|Tasks |with |with |without |
|(groups of 40) |ping |ping |ping |
+---------------+------+--------+--------+
| |Mean |Mean |Mean |
+---------------+------+--------+--------+
|1 | 0.97 | 0.97 | 0.68 |
|2 | 1.36 | 1.36 | 1.30 |
|4 | 2.57 | 2.57 | 1.84 |
|8 | 3.31 | 3.34 | 2.86 |
|16 | 5.63 | 5.71 | 4.61 |
|25 | 7.99 | 8.23 | 6.78 |
+---------------+------+--------+--------+
*Performance numbers for ARM:
---------------------------------------------------------------------------
I was asked to show the efficacy on ARM in v2 review, however I am
having some difficulty gathering an ARM machine. Would it be possible
for someone to give try this out on ARM?
Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation as the having global state
can be avoided.
v2->v3:
* Split up the patch for find_idlest_cpu and select_idle_sibling code
paths.
Previous discussion can be found at:
---------------------------------------------------------------------------
https://patchwork.kernel.org/patch/9741351/
Rohit Jain (2):
sched: Introduce scaled capacity awareness in find_idlest_cpu code
path
sched: Introduce scaled capacity awareness in select_idle_sibling code
path
kernel/sched/fair.c | 80 +++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 66 insertions(+), 14 deletions(-)
--
2.7.4