This patch series optimizes performance and refines the patches according to review comments.
- Patch 0001 gives the previous CPU more chance to be selected, so the task can benefit from a hot cache;
- In the EAS code, the critical path is task wakeup via energy_aware_wake_cpu(); this function is meant to select the candidate CPU that saves the most energy. It has two underlying responsibilities: the first is to select the most power-efficient CPU for the task within a cluster; the second is to migrate the task from a big core to a little core if the little core can meet the performance requirement.
For the first responsibility, selecting the most power-efficient CPU within a cluster, EAS prefers a non-idle CPU, so as a result it packs tasks onto one CPU as much as possible. This is not an optimal solution for two reasons: first, it introduces long scheduling latency once multiple tasks sit on the same rq; second, it easily ends up packing small tasks onto one CPU running at a higher operating point. This is the foremost issue observed when there are multiple tasks: neither power nor performance reaches an optimal result.
So patch 0002 addresses this by trying to select a CPU that can stay at the lowest possible OPP (see the rough sketch below).
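A rough sketch of the idea in patch 0002 (simplified pseudo-code of the wakeup loop; the real change is in energy_aware_wake_cpu(), see the patch below; the candidate mask is a placeholder):

	for_each_cpu(i, <candidate cpus in the target group>) {
		int new_util = cpu_util(i) + boosted_task_util(p);

		if (new_util > capacity_orig_of(i))
			continue;

		/* predicted OPP index after placing the task on cpu i */
		cap_idx = find_cpu_new_capacity(i, new_util);

		/*
		 * Prefer the cpu with the lowest predicted OPP index; on a
		 * tie, keep packing onto a non-idle cpu and give the
		 * previous cpu more chance to run.
		 */
	}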
- The current code has no mechanism to spread tasks throughout the little cluster, so tasks are packed onto one CPU as long as no CPU is “over-utilized”. In this case only one CPU is very busy while the other CPUs in the same cluster stay idle.
Patch 0003 spreads tasks in the lowest scheduling domain (at cluster level) after adding an intermediate state named "half-utilized" (a simplified form of the check is quoted below). This may be a temporary solution; a better solution is likely to unify this with the "over-utilized" flag.
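For reference, the intermediate state added by patch 0003 boils down to a simple check (quoted in simplified form from the patch):

	/* a CPU is "half-utilized" once its utilization passes 50% of its capacity */
	static bool cpu_halfutilized(int cpu)
	{
		return capacity_of(cpu) < (cpu_util(cpu) * 2);
	}

Once the CPU has at least two runnable CFS tasks and some CPU in its lowest scheduling domain is half-utilized, the wakeup and idle balance paths spread tasks within that domain instead of continuing to pack.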
- In CFS, PELT signals take a long time to ramp up to a high value and to decay back to a small value; on the other hand, EAS does not take the load_avg value (runnable time) into account and only looks at the util_avg value (running time). So these issues really come down to the underlying signals.
We hope for a better method to accelerate the PELT signals and get rid of the issue introduced by long runnable time. Patch 0004 can be taken as a temporary solution: when there is a big difference between load_avg and util_avg, switch to the inflated (load) value, which also reflects runnable time.
Patch 0004 also has a side effect on the misfit flag. If any CPU has a “misfit” task on it, EAS sets the imbalance value to the CPU capacity and migrates such load from the little core to a big core. “Misfit” works well when there is only one big task on the little CPU, so the CPU cannot meet the task’s performance requirement according to task_fits_max(p, rq->cpu); but if there are two tasks on the little CPU, each task’s utilization is only about half of the CPU capacity, so EAS considers the CPU able to meet the task requirement. Patch 0004 makes it easier to set misfit to true: rq->misfit_task = !task_fits_max(p, rq->cpu) (a worked example with made-up numbers follows below).
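A worked example with made-up numbers: assume a little CPU with capacity 512 and two always-running nice-0 tasks sharing it. Each task only reaches util_avg of about 256, so the fit check sees

	512 * 1024 (= 524288)  >  256 * capacity_margin 1280 (= 327680)

and reports that the task fits, so misfit is never set even though the CPU is saturated. Both tasks are runnable nearly all of the time, however, so their load_avg keeps growing well beyond 256; judged by the load metric (as patch 0004 does), the tasks no longer fit and rq->misfit_task gets set.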
- In energy_aware_wake_cpu() it is possible to directly migrate a task from a little core to a big core, but the conditions are rigid: condition 1 is that the CPU capacity cannot meet the task’s requirement; condition 2 is that the source CPU is “over-utilized”. If the source CPU is not “over-utilized” (condition 2), then even though the little CPU cannot meet the task’s requirement, EAS still compares CPU energy and in the end selects the previous little CPU.
Patch 0005 adds an extra path to directly migrate the task from the little core to a big core (sketched below).
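In short, the extra path added by patch 0005 reduces to roughly the following (simplified from the patch):

	/*
	 * The previous CPU cannot supply enough capacity: pick the bigger
	 * target CPU directly. But if a schedtune boost margin is set for
	 * the task, leave the decision to the energy-diff (PE) filter as
	 * before.
	 */
	if (capacity_of(target_cpu) > capacity_of(task_cpu(p)) &&
	    !schedtune_task_margin(p))
		return target_cpu;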
- For very heavy multi-threaded workloads, we observed that tasks are not migrated within the big cluster, and it is also hard to migrate tasks from the big cluster to the little cluster even when the little cluster has idle CPUs available to run them. So EAS needs optimizing to handle this case, likely by falling back to the normal CFS behaviour.
Patches 0006 and 0008 fix these related issues.
- SMP load balance may migrate a small task onto a big core, but usually at that point we only want big tasks to migrate, so this hurts both power and performance. Patch 0007 avoids migrating small tasks to a higher capacity CPU, which gives real big tasks more chance to migrate there (the added check is sketched below).
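The check added by patch 0007 in can_migrate_task() is roughly the following (simplified from the patch): a task is treated as "small" when its utilization is below 1/8 of the source CPU capacity, and such a task is not pulled to a bigger CPU.

	if (energy_aware() &&
	    capacity_of(env->dst_cpu) > capacity_of(env->src_cpu) &&
	    task_util(p) * 8 < capacity_of(env->src_cpu))
		return 0;	/* don't migrate a small task to a higher capacity CPU */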
Leo Yan (8):
  sched/fair: optimize to more chance to select previous CPU
  sched/fair: select CPU based on using lowest capacity
  sched/fair: support to spread task in lowest schedule domain
  sched/fair: use load metrics to replace util when have big difference
  sched/fair: add path to migrate to higher capacity CPU
  sched/fair: force idle balance when busiest group is overloaded
  sched/fair: avoid small task to migrate to higher capacity CPU
  sched/fair: set imbalance for too many tasks on rq

 kernel/sched/fair.c | 193 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 173 insertions(+), 20 deletions(-)

--
1.9.1
In the current EAS wakeup path, any possible CPU with higher capacity that can meet the task's requirement may be selected. This patch prefers to fall back to the previous CPU where possible, which avoids unnecessary task migration between clusters.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e5ffe8..4a6190b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5569,7 +5569,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg, *sg_target;
 	int target_max_cap = INT_MAX;
-	int target_cpu = task_cpu(p);
+	int target_cpu = -1;
 	int i;
 
 	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
@@ -5621,11 +5621,18 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 				break;
 		}
 
-		/* cpu has capacity at higher OPP, keep it as fallback */
-		if (target_cpu == task_cpu(p))
+		/*
+		 * cpu has capacity at higher OPP, keep it as fallback;
+		 * give the previous cpu more chance to run
+		 */
+		if (task_cpu(p) == i || target_cpu == -1)
 			target_cpu = i;
 	}
 
+	/* If have not select any CPU, then to use previous CPU */
+	if (target_cpu == -1)
+		return task_cpu(p);
+
 	if (target_cpu != task_cpu(p)) {
 		struct energy_env eenv = {
 			.util_delta	= task_util(p),
--
1.9.1
In the current code, energy-aware scheduling selects a CPU based on the CPU's current capacity. This creates a feedback loop: when the CPUFreq governor raises the frequency, the scheduler tries to put more tasks onto that single CPU; the CPUFreq governor then detects more load on the CPU and raises the frequency further. Step by step, small tasks get packed onto one CPU at quite a high operating point.

The current code wants to avoid waking up more idle CPUs, saving the power consumed in the CPU wakeup and sleep paths. But the result is the contrary: packing small tasks onto a single CPU at a high operating point also worsens power.

This patch changes the selection to compare the CPU's expected utilization with the CPU's capacity states and predict the CPU's operating point after placing the task on it. So we can easily select the CPU with the lowest operating point after placing the task on it. Beyond that, it still packs tasks onto one CPU if the CPU can stay at the lowest operating point.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 47 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 37 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4a6190b..804e8c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4905,6 +4905,25 @@ static int find_new_capacity(struct energy_env *eenv,
 	return idx;
 }
 
+static int find_cpu_new_capacity(int cpu, unsigned long util)
+{
+	struct sched_domain *sd;
+	const struct sched_group_energy *sge;
+	int idx;
+
+	sd = rcu_dereference(per_cpu(sd_ea, cpu));
+	sge = sd->groups->sge;
+
+	for (idx = 0; idx < sge->nr_cap_states; idx++)
+		if (sge->cap_states[idx].cap >= util)
+			break;
+
+	if (idx == sge->nr_cap_states)
+		idx = idx - 1;
+
+	return idx;
+}
+
 static int group_idle_state(struct sched_group *sg)
 {
 	int i, state = INT_MAX;
@@ -5569,6 +5588,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg, *sg_target;
 	int target_max_cap = INT_MAX;
+	int target_cap_idx = INT_MAX;
 	int target_cpu = -1;
 	int i;
 
@@ -5611,22 +5631,29 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 		 * accounting. However, the blocked utilization may be zero.
 		 */
 		int new_util = cpu_util(i) + boosted_task_util(p);
+		int cap_idx;
 
 		if (new_util > capacity_orig_of(i))
			continue;
 
-		if (new_util < capacity_curr_of(i)) {
-			target_cpu = i;
-			if (cpu_rq(i)->nr_running)
-				break;
-		}
+		cap_idx = find_cpu_new_capacity(i, new_util);
+		if (target_cap_idx > cap_idx) {
 
-		/*
-		 * cpu has capacity at higher OPP, keep it as fallback;
-		 * give the previous cpu more chance to run
-		 */
-		if (task_cpu(p) == i || target_cpu == -1)
+			/* Select cpu with possible lower OPP */
 			target_cpu = i;
+			target_cap_idx = cap_idx;
+
+		} else if (target_cap_idx == cap_idx) {
+
+			/* Pack tasks if possible */
+			if (cpu_rq(i)->nr_running) {
+				if (!cpu_rq(target_cpu)->nr_running)
+					target_cpu = i;
+				/* Give the previous cpu more chance to run */
+				else if (task_cpu(p) == i)
+					target_cpu = i;
+			}
+		}
 	}
 
 	/* If have not select any CPU, then to use previous CPU */
--
1.9.1
In the current code, tasks are packed onto one CPU when the system is under the tipping point. In this case it is possible that only one CPU is very busy while the other CPUs in the same cluster stay idle. So a performance issue occurs: under the tipping point there is no mechanism to spread tasks within the same cluster.

Relying on "over-utilized" as the tipping point to spread tasks has two issues: first, "over-utilized" is a rigid condition and the CPU needs a long time to reach 80% of its capacity, which delays meeting the task's performance requirement; second, once over the tipping point the scheduler migrates tasks directly to the big cluster rather than spreading tasks within the little cluster.

This patch adds "half-utilized" as an intermediate state: if a CPU is over 50% utilization we consider it "half-utilized" and try to spread tasks within the same cluster; this holds for any scheduling domain (or cluster). Two places need their condition check changed: one is the wakeup path, the other is the idle balance path; once "half-utilized", both of them try to spread tasks in the lowest scheduling domain of the cluster.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 804e8c8..747d27d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4213,6 +4213,8 @@ static void update_capacity_of(int cpu)
 }
 
 static bool cpu_overutilized(int cpu);
+static bool cpu_halfutilized(int cpu);
+static bool need_spread_task(int cpu);
 
 /*
  * The enqueue_task method is called before nr_running is
@@ -5284,6 +5286,32 @@ static bool cpu_overutilized(int cpu)
 	return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
 }
 
+static bool cpu_halfutilized(int cpu)
+{
+	return capacity_of(cpu) < (cpu_util(cpu) * 2);
+}
+
+static bool need_spread_task(int cpu)
+{
+	struct sched_domain *sd;
+	int spread = 0, i;
+
+	sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+
+	if (!sd)
+		return 0;
+
+	for_each_cpu(i, sched_domain_span(sd)) {
+		if (cpu_rq(cpu)->cfs.h_nr_running >= 2 &&
+		    cpu_halfutilized(i)) {
+			spread = 1;
+			break;
+		}
+	}
+
+	return spread;
+}
+
 #ifdef CONFIG_SCHED_TUNE
 
 static unsigned long
@@ -5733,7 +5761,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 
 	if (!sd) {
-		if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
+		if (energy_aware() && !need_spread_task(cpu))
 			new_cpu = energy_aware_wake_cpu(p, prev_cpu);
 		else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
 			new_cpu = select_idle_sibling(p, new_cpu);
@@ -7683,8 +7711,19 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	trace_sched_sd_lb_stats(sched_group_cpus(env->sd->groups), sds.total_load,
 				sds.total_capacity, sds.avg_load);
 
-	if (energy_aware() && !env->dst_rq->rd->overutilized)
-		goto out_balanced;
+	if (energy_aware() && !env->dst_rq->rd->overutilized) {
+
+		struct sched_domain *sd;
+		int cpu = env->dst_cpu;
+
+		sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+		if (!cpumask_equal(sched_domain_span(sd),
+				   sched_domain_span(env->sd)))
+			goto out_balanced;
+
+		if (!need_spread_task(cpu))
+			goto out_balanced;
+	}
 
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
--
1.9.1
When load_avg is much higher than util_avg, it indicates either that the task has a higher priority and therefore a bigger weight in load_avg, or that the task spends much more time in the "runnable" state than in the "running" state.

This patch changes to use the load metric rather than the util metric if any of the conditions below is met:
- load * capacity_margin > SCHED_CAPACITY_SCALE * SCHED_LOAD_SCALE: the task demands computation beyond ~80% of full CPU capacity;
- util * capacity_margin > capacity_of(cpu) * SCHED_LOAD_SCALE: the task's utilization reaches ~80% of the CPU's capacity;
- util * capacity_margin < load * SCHED_LOAD_SCALE: util is more than ~20% below load, so the task spends significant extra time in the runnable state waiting to run, or the task has a higher priority than nice 0.
In any of these cases, use the load signal rather than the util signal.

Finally, we constrain the util value to the range [0..arch_scale_cpu_capacity(cpu)] so the tweaked value falls into the correct range.

This patch has another side effect on the misfit flag. After applying this patch, a task with considerable runnable time can more easily set misfit to true: rq->misfit_task = !task_fits_max(p, rq->cpu). This benefits the case where there are two tasks on a little CPU: each task's utilization is only about half of the CPU capacity, so EAS wrongly considers the CPU able to meet the task's requirement and does not migrate the task to a higher capacity CPU. After applying this patch, such a task switches to the load metric and the issue is fixed as well.
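As a quick sanity check of the thresholds (assuming the usual SCHED_LOAD_SCALE = SCHED_CAPACITY_SCALE = 1024 and capacity_margin = 1280), the three conditions in task_has_big_load() work out to roughly:

	load * 1280 > 1024 * 1024              <=>  load > ~819, i.e. the task demands
	                                            more than ~80% of full capacity;
	util * 1280 > capacity_of(cpu) * 1024  <=>  util is above ~80% of the CPU's capacity;
	util * 1280 < load * 1024              <=>  util is below ~80% of load, i.e. the task
	                                            spends a noticeable share of its time
	                                            runnable but not running, or its weight
	                                            is above the nice-0 weight.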
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 56 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 747d27d..54d80908 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4198,6 +4198,8 @@ static inline void hrtick_update(struct rq *rq)
 #endif
 
 static inline unsigned long boosted_cpu_util(int cpu);
+static inline unsigned long boosted_task_util(struct task_struct *task);
+static inline unsigned long schedtune_task_margin(struct task_struct *task);
 
 static void update_capacity_of(int cpu)
 {
@@ -5249,15 +5251,66 @@ static inline unsigned long task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }
 
+static inline unsigned long task_load(struct task_struct *p)
+{
+	return p->se.avg.load_avg;
+}
+
 unsigned int capacity_margin = 1280; /* ~20% margin */
 
-static inline unsigned long boosted_task_util(struct task_struct *task);
+/*
+ * Change to use load metrics if can meet two conditions:
+ *
+ * - load * capacity_margin > SCHED_CAPACITY_SCALE * SCHED_LOAD_SCALE,
+ *   this means tasks require CPU computation reach CPU 80% capacity;
+ * - util * capacity_margin > capacity_of(cpu) * SCHED_LOAD_SCALE,
+ *   this means tasks CPU reach 80% utilization;
+ * - load is 20% higher than util, so task have extra 20% time for
+ *   runnable state and waiting to run; Or the task has higher prioirty
+ *   than nice 0; then consider to use load signal rather than util signal.
+ *
+ */
+static inline bool task_has_big_load(struct task_struct *p)
+{
+	unsigned long util = task_util(p);
+	unsigned long load = task_load(p);
+	int cpu = task_cpu(p);
+
+	if (load * capacity_margin > SCHED_CAPACITY_SCALE * SCHED_LOAD_SCALE)
+		return true;
+
+	if (util * capacity_margin > capacity_of(cpu) * SCHED_LOAD_SCALE)
+		return true;
+
+	if (util * capacity_margin < load * SCHED_LOAD_SCALE)
+		return true;
+
+	return false;
+}
+
+static inline unsigned long task_tweaked_util(struct task_struct *p)
+{
+	int cpu = task_cpu(p);
+	unsigned long util = task_util(p);
+	unsigned long load = task_load(p);
+	unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
+	if (task_has_big_load(p))
+		util = max_t(unsigned long, util, load);
+
+	util = clamp(util, 0UL, (unsigned long)scale_cpu);
+	return util;
+}
 
 static inline bool __task_fits(struct task_struct *p, int cpu, int util)
 {
 	unsigned long capacity = capacity_of(cpu);
+	unsigned long margin = schedtune_task_margin(p);
 
-	util += boosted_task_util(p);
+	if (margin)
+		util += boosted_task_util(p);
+	else
+		util += task_tweaked_util(p);
 
 	return (capacity * 1024) > (util * capacity_margin);
 }
@@ -5393,7 +5446,7 @@ schedtune_cpu_margin(unsigned long util, int cpu)
 	return 0;
 }
 
-static inline unsigned int
+static inline unsigned long
 schedtune_task_margin(struct task_struct *task)
 {
 	return 0;
--
1.9.1
The current code has only one path to directly migrate a task from a lower capacity CPU to a higher capacity CPU in the wakeup balance: task_fits_max() returns false and the previous CPU is over-utilized. So it is hard for tasks to migrate to a higher capacity CPU.

This patch adds a path to directly migrate a task from a lower capacity CPU (LITTLE core) to a higher capacity CPU (big core) when the lower capacity CPU is found unable to meet the performance requirement. In this path, if a boost margin has been set for the task, the decision on whether to migrate is left to the PE filter.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54d80908..71f020d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5741,6 +5741,19 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target)
 	if (target_cpu == -1)
 		return task_cpu(p);
 
+	/*
+	 * Destination CPU has higher capacity than previous CPU,
+	 * so that means pervious CPU has no enough capacity to meet
+	 * the waken up task. Therefore directly return back and select
+	 * destination CPU.
+	 *
+	 * If has set boost margin for this task, then leave to PE filter
+	 * to decide if can migrate task.
+	 */
+	if (capacity_of(target_cpu) > capacity_of(task_cpu(p)) &&
+	    !schedtune_task_margin(p))
+		return target_cpu;
+
 	if (target_cpu != task_cpu(p)) {
 		struct energy_env eenv = {
 			.util_delta	= task_util(p),
--
1.9.1
When nohz load balance executes, it calls find_busiest_group() to find out which scheduling group is the busiest. There is one case where the busiest group is overloaded but the local group still has spare capacity; the current code skips this situation because of the check below:
	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
	    busiest->group_no_capacity)
		goto force_balance;
This is because env->idle is CPU_IDLE, not CPU_NEWLY_IDLE, for the idle CPUs during nohz load balance, so in the end it does not force balance. In the worse situation, it skips the load balance entirely once it hits the conditions below:
	/*
	 * If the local group is busier than the selected busiest group
	 * don't try and pull any tasks.
	 */
	if (local->avg_load >= busiest->avg_load)
		goto out_balanced;

	/*
	 * Don't pull any tasks if this group is already above the domain
	 * average load.
	 */
	if (local->avg_load >= sds.avg_load)
		goto out_balanced;
So this patch forces balance for an idle CPU in this case.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 71f020d..42b8801 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7843,8 +7843,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
-	    busiest->group_no_capacity)
+	if ((env->idle == CPU_NEWLY_IDLE || env->idle == CPU_IDLE) &&
+	    group_has_capacity(env, local) && busiest->group_no_capacity)
 		goto force_balance;
 
 	/* Misfitting tasks should be dealt with regardless of the avg load */
--
1.9.1
EAS defines “any CPU is over-utilized” as the tipping point criterion; once this criterion is met, EAS considers the system over the tipping point and falls back to SMP load balance. SMP load balance may then migrate small tasks onto big cores, but usually we only want big tasks to migrate, so in the end this hurts both power and performance.

This patch adds one more check in can_migrate_task() to avoid migrating a small task to a higher capacity CPU.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 42b8801..06b0e1b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6551,11 +6551,19 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 	/*
 	 * We do not migrate tasks that are:
-	 * 1) throttled_lb_pair, or
-	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 1) energy_aware is enable and small task is not migrate to higher
+	 *    capacity CPU
+	 * 2) throttled_lb_pair, or
+	 * 3) cannot be migrated to this CPU due to cpus_allowed, or
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
+
+	if (energy_aware() &&
+	    (capacity_of(env->dst_cpu) > capacity_of(env->src_cpu)) &&
+	    (task_util(p) * 8 < capacity_of(env->src_cpu)))
+		return 0;
+
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
--
1.9.1
If the number of runnable tasks in one scheduling group is bigger than the number of CPUs, the sched_group's load_avg signal will be underestimated, because the CPUs' load_avg values cannot accumulate the load of all the runnable tasks.

On the other hand, another sched_group may have fewer tasks than CPUs. As a result, the first sched_group's load_per_task will be much lower than the second group's value.

So this patch takes this situation into account and sets the imbalance to busiest->load_per_task.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 06b0e1b..ab99b6f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7604,6 +7604,12 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
 	local = &sds->local_stat;
 	busiest = &sds->busiest_stat;
 
+	if (busiest->sum_nr_running >= busiest->group_weight &&
+	    local->sum_nr_running < local->group_weight) {
+		env->imbalance = busiest->load_per_task;
+		return;
+	}
+
 	if (!local->sum_nr_running)
 		local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
 	else if (busiest->load_per_task > local->load_per_task)
--
1.9.1