Changelog:
---------------------------------------------------------------------------
v1->v2:
* Changed the dynamic threshold calculation, since having global state can
  be avoided.

v2->v3:
* Split up the patch for the find_idlest_cpu and select_idle_sibling code
  paths.

v3->v4:
* Rebased it to peterz's tree (apologies for the wrong tree for v3).

v4->v5:
* Changed the threshold to 768 from 819 for easier shifts.
* Changed the find_idlest_cpu code path to be simpler.
* Changed the select_idle_core code path to search for the idlest core at
  full capacity.
* Added scaled capacity awareness to the wake_affine_idle code path.
---------------------------------------------------------------------------
During OLTP workload runs, threads can end up on CPUs with a lot of softIRQ activity, thus delaying progress. For more reliable and faster runs, if the system can spare it, these threads should be scheduled on CPUs with lower IRQ/RT activity.
Currently, the scheduler takes into account the original capacity of CPUs
when providing 'hints' for the select_idle_sibling code path to return an
idle CPU. However, the rest of the select_idle_* code paths remain
capacity-agnostic. Further, these code paths are only aware of the original
capacity and not the capacity stolen by IRQ/RT activity.

This patch introduces capacity awareness in the scheduler (CAS), which
avoids CPUs that may have their capacities reduced (due to IRQ/RT activity)
when trying to schedule threads (on the push side) in the system. This
awareness has been added to the fair scheduling class.
It does so using the following algorithm:

1) The scaled capacities are already available, since rt_avg accounting
   scales them down for IRQ/RT activity.
2) Any CPU running at less than ~75% of its original capacity (the 768/1024
   threshold from the v5 changelog; earlier versions used ~80%) is
   considered to be running low on capacity.
3) During the idle CPU search, a CPU found to be running low on capacity is
   skipped if better CPUs are available.
4) If none of the CPUs are better in terms of idleness and capacity, then
   the low-capacity CPU is considered to be the best available CPU.

A small stand-alone illustration of the threshold test in step 2 follows
this list.
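As a quick stand-alone illustration of the threshold test in step 2 (this
is only a sketch with made-up numbers; the in-kernel helper introduced by
the series is full_capacity(), which takes a CPU number and uses
capacity_of()/capacity_orig_of()):

#include <stdbool.h>
#include <stdio.h>

/* Example values on the kernel's 0..1024 capacity scale (invented). */
static const unsigned long capacity_orig = 1024;  /* original capacity */
static const unsigned long capacity_cur  =  700;  /* after IRQ/RT pressure */

/* Mirrors the patch's threshold: "full capacity" means retaining at least
 * 768/1024 (75%) of the original capacity. */
static bool full_capacity_example(void)
{
        return capacity_cur >= (capacity_orig * 768 >> 10);
}

int main(void)
{
        /* 700 < 768, so this CPU counts as running low on capacity. */
        printf("full capacity: %s\n", full_capacity_example() ? "yes" : "no");
        return 0;
}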
The performance numbers:
---------------------------------------------------------------------------
CAS shows up to 1.5% improvement on x86 when running the 'SELECT' database
workload.
For microbenchmark results, I used hackbench in process mode, along with
ping running on CPUs 0, 1 and 2 as: 'ping -l 10000 -q -s 10 -f hostX'
The results below should be read as:
* 'Baseline without ping' is how the workload would've behaved if there was no IRQ activity.
* Compare 'Baseline with ping' and 'Baseline without ping' to see the effect of ping
* Compare 'Baseline with ping' and 'CAS with ping' to see the improvement CAS can give over baseline
Following are the runtime(s) with hackbench and ping activity as described above (lower is better), on a 44 core 2 socket x86 machine:
+---------------+------+--------+--------+
|Num.           |CAS   |Baseline|Baseline|
|Tasks          |with  |with    |without |
|(groups of 40) |ping  |ping    |ping    |
+---------------+------+--------+--------+
|               |Mean  |Mean    |Mean    |
+---------------+------+--------+--------+
|1              | 0.55 | 0.59   | 0.53   |
|2              | 0.66 | 0.81   | 0.51   |
|4              | 0.99 | 1.16   | 0.95   |
|8              | 1.92 | 1.93   | 1.88   |
|16             | 3.24 | 3.26   | 3.15   |
|32             | 5.93 | 5.98   | 5.68   |
|64             | 11.55| 11.94  | 10.89  |
+---------------+------+--------+--------+
Rohit Jain (3):
  sched/fair: Introduce scaled capacity awareness in find_idlest_cpu code path
  sched/fair: Introduce scaled capacity awareness in select_idle_sibling code path
  sched/fair: Introduce scaled capacity awareness in wake_affine_idle code path

 kernel/sched/fair.c | 66 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 53 insertions(+), 13 deletions(-)
-- 2.7.4
While looking for idle CPUs for a waking task, we should also account for
the delays caused by the bandwidth reduction due to RT/IRQ tasks.
This patch does that by trying to find a higher capacity CPU with minimum wake up latency.
Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/sched/fair.c | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0107280..eaede50 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5579,6 +5579,11 @@ static unsigned long capacity_orig_of(int cpu)
         return cpu_rq(cpu)->cpu_capacity_orig;
 }

+static inline bool full_capacity(int cpu)
+{
+        return (capacity_of(cpu) >= (capacity_orig_of(cpu)*768 >> 10));
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
         struct rq *rq = cpu_rq(cpu);
@@ -5865,8 +5870,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
         unsigned long load, min_load = ULONG_MAX;
         unsigned int min_exit_latency = UINT_MAX;
         u64 latest_idle_timestamp = 0;
+        unsigned int backup_cap = 0;
         int least_loaded_cpu = this_cpu;
         int shallowest_idle_cpu = -1;
+        int shallowest_idle_cpu_backup = -1;
         int i;

         /* Check if we have any choice: */
@@ -5876,6 +5883,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
         /* Traverse only the allowed CPUs */
         for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) {
                 if (idle_cpu(i)) {
+                        int idle_candidate = -1;
                         struct rq *rq = cpu_rq(i);
                         struct cpuidle_state *idle = idle_get_state(rq);
                         if (idle && idle->exit_latency < min_exit_latency) {
@@ -5886,7 +5894,7 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
                                  */
                                 min_exit_latency = idle->exit_latency;
                                 latest_idle_timestamp = rq->idle_stamp;
-                                shallowest_idle_cpu = i;
+                                idle_candidate = i;
                         } else if ((!idle || idle->exit_latency == min_exit_latency) &&
                                    rq->idle_stamp > latest_idle_timestamp) {
                                 /*
@@ -5895,7 +5903,16 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
                                  * a warmer cache.
                                  */
                                 latest_idle_timestamp = rq->idle_stamp;
-                                shallowest_idle_cpu = i;
+                                idle_candidate = i;
+                        }
+
+                        if (idle_candidate != -1) {
+                                if (full_capacity(idle_candidate)) {
+                                        shallowest_idle_cpu = idle_candidate;
+                                } else if (capacity_of(idle_candidate) > backup_cap) {
+                                        shallowest_idle_cpu_backup = idle_candidate;
+                                        backup_cap = capacity_of(idle_candidate);
+                                }
                         }
                 } else if (shallowest_idle_cpu == -1) {
                         load = weighted_cpuload(cpu_rq(i));
@@ -5906,7 +5923,11 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
                 }
         }

-        return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
+        if (shallowest_idle_cpu != -1)
+                return shallowest_idle_cpu;
+
+        return (shallowest_idle_cpu_backup != -1 ?
+                        shallowest_idle_cpu_backup : least_loaded_cpu);
 }

 #ifdef CONFIG_SCHED_SMT
--
2.7.4
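For readers, the selection priority implemented by the hunks above can be
summarized outside of kernel context. The following is only a toy
user-space sketch (the struct fields, the values and the pick_cpu() name
are invented, and the cpuidle exit-latency bookkeeping of the real function
is omitted): prefer an idle CPU at full capacity, otherwise remember the
highest-capacity idle CPU as a backup, and fall back to the least-loaded
CPU only when nothing idle is found.

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy per-CPU state on the kernel's 0..1024 capacity scale (invented). */
struct cpu {
        bool idle;
        unsigned int cap_orig;  /* like capacity_orig_of() */
        unsigned int cap;       /* like capacity_of(), after RT/IRQ pressure */
        unsigned long load;     /* weighted load, only used when busy */
};

static bool full_capacity(const struct cpu *c)
{
        return c->cap >= (c->cap_orig * 768 >> 10);
}

static int pick_cpu(const struct cpu *cpus, int n)
{
        unsigned long min_load = ULONG_MAX;
        unsigned int backup_cap = 0;
        int idle_full = -1, idle_backup = -1, least_loaded = 0;

        for (int i = 0; i < n; i++) {
                if (cpus[i].idle) {
                        if (full_capacity(&cpus[i]))
                                idle_full = i;
                        else if (cpus[i].cap > backup_cap) {
                                backup_cap = cpus[i].cap;
                                idle_backup = i;
                        }
                } else if (idle_full == -1 && cpus[i].load < min_load) {
                        min_load = cpus[i].load;
                        least_loaded = i;
                }
        }

        if (idle_full != -1)
                return idle_full;
        return idle_backup != -1 ? idle_backup : least_loaded;
}

int main(void)
{
        struct cpu cpus[] = {
                { .idle = false, .cap_orig = 1024, .cap = 1024, .load = 300 },
                { .idle = true,  .cap_orig = 1024, .cap =  500, .load = 0   }, /* IRQ-heavy */
                { .idle = true,  .cap_orig = 1024, .cap =  900, .load = 0   }, /* full capacity */
        };

        printf("picked CPU %d\n", pick_cpu(cpus, 3));   /* prints 2 */
        return 0;
}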
Hi Joel, Atish,
Moving off-line discussions to LKML so everyone's on the same page: I
actually like this version now, and it is outperforming my previous code,
so I am on board with it. It makes the code simpler too.

Since we need a fast way of returning an idle cpu in the
select_idle_sibling path, I think that can remain as it is (or maybe we can
argue about the patch on that thread).

If what I said above makes sense to everyone, I will send out a v6.
As always, please let me know what you think.
Thanks, Rohit
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 56f343b..a1f622c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5724,7 +5724,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p);

 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
 {
-        return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+        return capacity_of(cpu) - cpu_util_wake(cpu, p);
 }

 /*
@@ -5870,6 +5870,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
         unsigned long load, min_load = ULONG_MAX;
         unsigned int min_exit_latency = UINT_MAX;
         u64 latest_idle_timestamp = 0;
+        unsigned int idle_cpu_cap = 0;
         int least_loaded_cpu = this_cpu;
         int shallowest_idle_cpu = -1;
         int i;
@@ -5881,6 +5882,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
         /* Traverse only the allowed CPUs */
         for_each_cpu_and(i, sched_group_span(group), &p->cpus_allowed) {
                 if (idle_cpu(i)) {
+                        int idle_candidate = -1;
                         struct rq *rq = cpu_rq(i);
                         struct cpuidle_state *idle = idle_get_state(rq);
                         if (idle && idle->exit_latency < min_exit_latency) {
@@ -5891,7 +5893,7 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
                                  */
                                 min_exit_latency = idle->exit_latency;
                                 latest_idle_timestamp = rq->idle_stamp;
-                                shallowest_idle_cpu = i;
+                                idle_candidate = i;
                         } else if ((!idle || idle->exit_latency == min_exit_latency) &&
                                    rq->idle_stamp > latest_idle_timestamp) {
                                 /*
@@ -5900,8 +5902,14 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
                                  * a warmer cache.
                                  */
                                 latest_idle_timestamp = rq->idle_stamp;
-                                shallowest_idle_cpu = i;
+                                idle_candidate = i;
                         }
+
+                        if (idle_candidate != -1 &&
+                            (capacity_of(idle_candidate) > idle_cpu_cap)) {
+                                shallowest_idle_cpu = idle_candidate;
+                                idle_cpu_cap = capacity_of(idle_candidate);
+                        }
                 } else if (shallowest_idle_cpu == -1) {
                         load = weighted_cpuload(cpu_rq(i));
                         if (load < min_load || (load == min_load && i == this_cpu)) {
--
2.7.4
On 10/07/2017 04:48 PM, Rohit Jain wrote:
> While looking for idle CPUs for a waking task, we should also account for
> the delays caused by the bandwidth reduction due to RT/IRQ tasks.
>
> This patch does that by trying to find a higher capacity CPU with minimum
> wake up latency.
>
> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
> ---
>  kernel/sched/fair.c | 27 ++++++++++++++++++++++++---
>  1 file changed, 24 insertions(+), 3 deletions(-)
[..]
On Thu, Oct 12, 2017 at 10:03 AM, Rohit Jain rohit.k.jain@oracle.com wrote:
> Hi Joel, Atish,
>
> Moving off-line discussions to LKML so everyone's on the same page: I
> actually like this version now, and it is outperforming my previous
> code, so I am on board with it. It makes the code simpler too.

I think you should have explained what the version does differently.
Nobody can read your mind.

> Since we need a fast way of returning an idle cpu in the
> select_idle_sibling path, I think that can remain as it is (or maybe we
> can argue about the patch on that thread).

This is hardly an explanation of the diff below.

> If what I said above makes sense to everyone, I will send out a v6.
>
> As always, please let me know what you think.

More below:

> Thanks,
> Rohit
[..]
> @@ -5900,8 +5902,14 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>                                   * a warmer cache.
>                                   */
>                                  latest_idle_timestamp = rq->idle_stamp;
> -                                shallowest_idle_cpu = i;
> +                                idle_candidate = i;
>                          }
> +
> +                        if (idle_candidate != -1 &&
> +                            (capacity_of(idle_candidate) > idle_cpu_cap)) {
> +                                shallowest_idle_cpu = idle_candidate;
> +                                idle_cpu_cap = capacity_of(idle_candidate);
> +                        }
>                  } else if (shallowest_idle_cpu == -1) {
>                          load = weighted_cpuload(cpu_rq(i));
This is broken: in case idle_candidate != -1 but the idle_cpu_cap check
makes the condition false, you're still setting min_exit_latency, which is
wrong.

Also this means if you have 2 CPUs and 1 is in a shallower idle state than
the other, but lower in capacity, then it would select the CPU with the
less shallow idle state, right? So 'shallowest_idle_cpu' loses its meaning.
thanks,
- Joel
[..]
Hi Joel,
On 10/12/2017 02:47 PM, Joel Fernandes wrote:
> On Thu, Oct 12, 2017 at 10:03 AM, Rohit Jain <rohit.k.jain@oracle.com> wrote:
>> Hi Joel, Atish,
>>
>> Moving off-line discussions to LKML so everyone's on the same page: I
>> actually like this version now, and it is outperforming my previous
>> code, so I am on board with it. It makes the code simpler too.
>
> I think you should have explained what the version does differently.
> Nobody can read your mind.
I apologize for being terse (will do better next time).

This is based on your (offline) suggestion, and rightly so, that
find_idlest_group today bases its decision on capacity_spare_wake(), which
in turn only looks at the original capacity of the CPU. This diff changes
that to use the current capacity after it has been scaled down (due to
IRQ/RT activity, etc.). A small stand-alone illustration of that change
follows below.

Also, this diff changes find_idlest_group_cpu() to stop searching for CPUs
based on the full_capacity() test, and instead to find the idle CPU with
the maximum available capacity. This way we can avoid all the 'backup'
handling needed in the v5 version below it.

As you can see, the code works out to be much simpler with the new search.
This is acceptable because we are doing a full CPU search over the
sched_group_span anyway.
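To make the capacity_spare_wake() change concrete, here is a tiny
stand-alone example. The numbers are invented and the macros are merely
stand-ins for capacity_orig_of(), capacity_of() and cpu_util_wake(); it
only contrasts spare capacity computed from the original capacity with
spare capacity computed from the RT/IRQ-scaled capacity.

#include <stdio.h>

/* Invented example values on the kernel's 0..1024 capacity scale. */
#define CAPACITY_ORIG   1024U   /* stand-in for capacity_orig_of(cpu) */
#define CAPACITY_CUR     700U   /* stand-in for capacity_of(cpu), after RT/IRQ */
#define TASK_UTIL        600U   /* stand-in for cpu_util_wake(cpu, p) */

int main(void)
{
        /* Old behaviour: spare capacity ignores RT/IRQ pressure. */
        long spare_orig = (long)CAPACITY_ORIG - TASK_UTIL;      /* 424 */
        /* New behaviour: spare capacity reflects what is really left. */
        long spare_scaled = (long)CAPACITY_CUR - TASK_UTIL;     /* 100 */

        printf("spare (orig-based)   = %ld\n", spare_orig);
        printf("spare (scaled-based) = %ld\n", spare_scaled);
        return 0;
}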
[..]
>> @@ -5900,8 +5902,14 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
>>                                   * a warmer cache.
>>                                   */
>>                                  latest_idle_timestamp = rq->idle_stamp;
>> -                                shallowest_idle_cpu = i;
>> +                                idle_candidate = i;
>>                          }
>> +
>> +                        if (idle_candidate != -1 &&
>> +                            (capacity_of(idle_candidate) > idle_cpu_cap)) {
>> +                                shallowest_idle_cpu = idle_candidate;
>> +                                idle_cpu_cap = capacity_of(idle_candidate);
>> +                        }
> This is broken: in case idle_candidate != -1 but the idle_cpu_cap check
> makes the condition false, you're still setting min_exit_latency, which
> is wrong.

Yes, you're right. I will fix this.

> Also this means if you have 2 CPUs and 1 is in a shallower idle state
> than the other, but lower in capacity, then it would select the CPU with
> the less shallow idle state, right? So 'shallowest_idle_cpu' loses its
> meaning.

OK, I will change the name.
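Purely as an illustration of the kind of restructuring being discussed
here -- this is not the actual v6 code, and all names, fields and values
below are invented for a user-space sketch -- the bookkeeping can be
committed only when a candidate is actually accepted, with the idle-state
exit latency kept as a tie-breaker among equal-capacity idle CPUs:

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy CPU description; all values are invented for illustration. */
struct cpu {
        bool idle;
        unsigned int exit_latency;      /* cpuidle exit latency */
        unsigned int capacity;          /* remaining capacity after RT/IRQ */
};

/* Pick the idle CPU with the highest remaining capacity; update the
 * exit-latency bookkeeping only when a candidate is accepted, so a
 * rejected candidate never leaves stale state behind. */
static int pick_idle_cpu(const struct cpu *cpus, int n)
{
        unsigned int min_exit_latency = UINT_MAX;
        unsigned int best_cap = 0;
        int best = -1;

        for (int i = 0; i < n; i++) {
                if (!cpus[i].idle)
                        continue;
                if (cpus[i].capacity < best_cap)
                        continue;       /* rejected: no state touched */
                if (cpus[i].capacity == best_cap &&
                    cpus[i].exit_latency >= min_exit_latency)
                        continue;       /* same capacity, deeper idle state */

                /* Accepted: commit all bookkeeping together. */
                best_cap = cpus[i].capacity;
                min_exit_latency = cpus[i].exit_latency;
                best = i;
        }
        return best;
}

int main(void)
{
        struct cpu cpus[] = {
                { .idle = true, .exit_latency = 10, .capacity =  600 },
                { .idle = true, .exit_latency = 50, .capacity = 1024 },
                { .idle = true, .exit_latency = 20, .capacity = 1024 },
        };

        printf("picked CPU %d\n", pick_idle_cpu(cpus, 3));      /* prints 2 */
        return 0;
}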
Thanks, Rohit
[..]
While looking for CPUs to place running tasks on, the scheduler completely ignores the capacity stolen away by RT/IRQ tasks. This patch changes that behavior to also take the scaled capacity into account.
Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/sched/fair.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eaede50..5b1f7b9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6004,7 +6004,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int

         for_each_cpu(cpu, cpu_smt_mask(core)) {
                 cpumask_clear_cpu(cpu, cpus);
-                if (!idle_cpu(cpu))
+                if (!idle_cpu(cpu) || !full_capacity(cpu))
                         idle = false;
         }

@@ -6025,7 +6025,8 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
  */
 static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
 {
-        int cpu;
+        int cpu, backup_cpu = -1;
+        unsigned int backup_cap = 0;

         if (!static_branch_likely(&sched_smt_present))
                 return -1;
@@ -6033,11 +6034,17 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
         for_each_cpu(cpu, cpu_smt_mask(target)) {
                 if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
                         continue;
-                if (idle_cpu(cpu))
-                        return cpu;
+                if (idle_cpu(cpu)) {
+                        if (full_capacity(cpu))
+                                return cpu;
+                        if (capacity_of(cpu) > backup_cap) {
+                                backup_cap = capacity_of(cpu);
+                                backup_cpu = cpu;
+                        }
+                }
         }

-        return -1;
+        return backup_cpu;
 }

 #else /* CONFIG_SCHED_SMT */
@@ -6066,6 +6073,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
         u64 time, cost;
         s64 delta;
         int cpu, nr = INT_MAX;
+        int backup_cpu = -1;
+        unsigned int backup_cap = 0;

         this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
         if (!this_sd)
@@ -6096,10 +6105,19 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
                         return -1;
                 if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
                         continue;
-                if (idle_cpu(cpu))
-                        break;
+                if (idle_cpu(cpu)) {
+                        if (full_capacity(cpu)) {
+                                backup_cpu = -1;
+                                break;
+                        } else if (capacity_of(cpu) > backup_cap) {
+                                backup_cap = capacity_of(cpu);
+                                backup_cpu = cpu;
+                        }
+                }
         }

+        if (backup_cpu >= 0)
+                cpu = backup_cpu;
         time = local_clock() - time;
         cost = this_sd->avg_scan_cost;
         delta = (s64)(time - cost) / 8;
@@ -6116,13 +6134,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
         struct sched_domain *sd;
         int i;

-        if (idle_cpu(target))
+        if (idle_cpu(target) && full_capacity(target))
                 return target;

         /*
          * If the previous cpu is cache affine and idle, don't be stupid.
          */
-        if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+        if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev)
+            && full_capacity(prev))
                 return prev;

         sd = rcu_dereference(per_cpu(sd_llc, target));
--
2.7.4
Minor nit: version number missing
On 10/07/2017 06:48 PM, Rohit Jain wrote:
> While looking for CPUs to place running tasks on, the scheduler completely
> ignores the capacity stolen away by RT/IRQ tasks. This patch changes that
> behavior to also take the scaled capacity into account.
>
> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
> ---
>  kernel/sched/fair.c | 37 ++++++++++++++++++++++++++++---------
>  1 file changed, 28 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eaede50..5b1f7b9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6004,7 +6004,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>
>          for_each_cpu(cpu, cpu_smt_mask(core)) {
>                  cpumask_clear_cpu(cpu, cpus);
> -                if (!idle_cpu(cpu))
> +                if (!idle_cpu(cpu) || !full_capacity(cpu))
Do we need to skip the entire core just because the 1st CPU in the core
doesn't have full capacity? Let's say that is the only idle core available.
It will then go and try select_idle_cpu() to find the idlest CPU. Is it
worth spending extra time searching for an idle CPU with full capacity when
there are idle cores available?
Regards, Atish
[..]
Hi Atish,
Thanks for the comments
On 10/10/2017 08:54 AM, Atish Patra wrote:
<snip>
>>
>> Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
>> ---
>>  kernel/sched/fair.c | 37 ++++++++++++++++++++++++++++---------
>>  1 file changed, 28 insertions(+), 9 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index eaede50..5b1f7b9 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6004,7 +6004,7 @@ static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int
>>          for_each_cpu(cpu, cpu_smt_mask(core)) {
>>                  cpumask_clear_cpu(cpu, cpus);
>> -                if (!idle_cpu(cpu))
>> +                if (!idle_cpu(cpu) || !full_capacity(cpu))
>
> Do we need to skip the entire core just because the 1st CPU in the core
> doesn't have full capacity? Let's say that is the only idle core
> available. It will then go and try select_idle_cpu() to find the idlest
> CPU. Is it worth spending extra time searching for an idle CPU with full
> capacity when there are idle cores available?
This has been previously discussed: https://lkml.org/lkml/2017/10/3/1001
Returning the best CPU within the idle core did not result in a statistically significant performance benefit, hence I went with Joel's suggestion to keep the code simple.
Thanks, Rohit
<snip>
wake_affine_idle() returns true if the CPU can run the task. Since it
currently ignores capacity, add a check there so that it only returns true
if the CPU is at full capacity.
Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b1f7b9..f4761f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5660,7 +5660,7 @@ static bool wake_affine_idle(struct sched_domain *sd, struct task_struct *p,
                              int this_cpu, int prev_cpu, int sync)
 {
-        if (idle_cpu(this_cpu))
+        if (idle_cpu(this_cpu) && full_capacity(this_cpu))
                 return true;

         if (sync && cpu_rq(this_cpu)->nr_running == 1)
--
2.7.4