With the current code, tasks are packed onto a single CPU while the system is under the tipping point. As a result, one CPU can be very busy while the other CPUs in the same cluster sit idle, which causes a performance issue: below the tipping point there is no mechanism to spread tasks within the same cluster.
Relying on the "over-utilized" state as the tipping point for spreading tasks has two problems. First, "over-utilized" is a rigid condition: a CPU takes a long time to reach 80% of its capacity, which delays meeting the tasks' performance requirements. Second, once the tipping point is crossed, the scheduler migrates tasks directly to the big cluster rather than first spreading them within the little cluster.
This patch adds "half-utilized" as an intermediate state: if any CPU exceeds 50% utilization, that CPU is considered "half-utilized" and the scheduler tries to spread tasks within the same cluster; this applies to any scheduling domain (or cluster). Two condition checks need to change: one in the wakeup path and one in the idle-balance path. Once a CPU is over "half-utilized", both paths try to spread tasks within the lowest scheduling domain of the cluster.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 44 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04bb3d9..bc16787 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4183,6 +4183,8 @@ static inline void hrtick_update(struct rq *rq)
 #ifdef CONFIG_SMP
 static bool cpu_overutilized(int cpu);
+static bool cpu_halfutilized(int cpu);
+static bool need_spread_task(int cpu);
 static inline unsigned long boosted_cpu_util(int cpu);
 #else
 #define boosted_cpu_util(cpu) cpu_util(cpu)
@@ -5309,6 +5311,34 @@ static bool cpu_overutilized(int cpu)
 	return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
 }
 
+static bool cpu_halfutilized(int cpu)
+{
+	return capacity_of(cpu) < (cpu_util(cpu) * 2);
+}
+
+static bool need_spread_task(int cpu)
+{
+	struct sched_domain *sd;
+	int spread = 0, i;
+
+	if (cpu_rq(cpu)->rd->overutilized)
+		return 1;
+
+	sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	if (!sd)
+		return 0;
+
+	for_each_cpu(i, sched_domain_span(sd)) {
+		if (cpu_rq(i)->cfs.h_nr_running >= 1 &&
+		    cpu_halfutilized(i)) {
+			spread = 1;
+			break;
+		}
+	}
+
+	return spread;
+}
+
 #ifdef CONFIG_SCHED_TUNE
 
 static long
@@ -5921,7 +5951,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 
 	if (!sd) {
-		if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
+		if (energy_aware() && !need_spread_task(cpu))
 			new_cpu = energy_aware_wake_cpu(p, prev_cpu, sync);
 		else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
 			new_cpu = select_idle_sibling(p, new_cpu);
@@ -7834,8 +7864,19 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 */
 	update_sd_lb_stats(env, &sds);
 
-	if (energy_aware() && !env->dst_rq->rd->overutilized)
-		goto out_balanced;
+	if (energy_aware() && !env->dst_rq->rd->overutilized) {
+		struct sched_domain *sd;
+		int cpu = env->dst_cpu;
+
+		sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+		if (!cpumask_equal(sched_domain_span(sd),
+				   sched_domain_span(env->sd)))
+			goto out_balanced;
+
+		if (!need_spread_task(cpu))
+			goto out_balanced;
+	}
 
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
-- 
1.9.1