This patch series optimizes both power and performance on big.LITTLE systems. Most of the optimization methodology is the same as in the RFCv2 version, so please refer to the detailed description in [1].
In this series, the new enhancement for performance is to spread tasks across more clusters when the highest capacity cores are detected to be busy. The criterion for "big core busy" is simply that the big core is not idle: the patch "sched/fair: avoid small task to migrate to higher capacity CPU" already filters migrations so that only tasks with a relatively big load move to big cores, so if a task is running on a big core, that core's utilization is not small. If all big cores have a task running, the system is usually quite busy, so we should go back to selecting the idlest CPU instead of using "want_affine". This is mainly done by the patches "sched/fair: avoid migrate single task to busy big CPU" and "sched/fair: select idle CPUs when big cluster is busy".
This patch series also optimizes power for both PELT and WALT signals; this is done by the patch "sched/fair: save power for when use walt signals".
[1] https://lists.linaro.org/pipermail/eas-dev/2016-July/000522.html
Leo Yan (18):
  sched/fair: optimize to more chance to select previous CPU
  sched/fair: select CPU based on using lowest capacity
  sched/fair: support to spread task in lowest schedule domain
  sched/fair: add path to migrate to higher capacity CPU
  sched/fair: force idle balance when busiest group is overloaded
  sched/fair: avoid small task to migrate to higher capacity CPU
  sched/fair: set imbalance for too many tasks on rq
  sched/fair: kick nohz idle balance for misfit task
  sched/fair: consider over utilized only for CPU is not idle
  sched/fair: filter task for energy aware path
  sched/fair: replace capacity_of by capacity_orig_of
  sched/fair: refine when task is allowed only run one CPU
  Documentation: EAS performance tunning for sysfs
  sched/fair: avoid migrate single task to busy big CPU
  sched/fair: fix building error for schedtune_task_margin
  sched/fair: save power for when use walt signals
  sched/fair: check task boosted value on destination CPU
  sched/fair: select idle CPUs when big cluster is busy

 Documentation/scheduler/sched-energy.txt |  87 ++++++++++
 kernel/sched/fair.c                      | 286 ++++++++++++++++++++++++++++---
 2 files changed, 348 insertions(+), 25 deletions(-)

--
1.9.1
In the previous EAS wakeup path, any possible CPU with higher capacity that can meet the task requirement may be selected. This patch prefers to fall back to the previous CPU where possible, so it can avoid unnecessary task migrations between clusters.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index efa516d..724b36c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5718,7 +5718,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
     struct sched_domain *sd;
     struct sched_group *sg, *sg_target;
     int target_max_cap = INT_MAX;
-    int target_cpu = task_cpu(p);
+    int target_cpu = -1;
     unsigned long task_util_boosted, new_util;
     int i;
@@ -5787,10 +5787,18 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
                 break;
             }

-            /* cpu has capacity at higher OPP, keep it as fallback */
-            if (target_cpu == task_cpu(p))
+            /*
+             * cpu has capacity at higher OPP, keep it as fallback;
+             * give the previous cpu more chance to run
+             */
+            if (task_cpu(p) == i || target_cpu == -1)
                 target_cpu = i;
         }
+
+        /* If have not select any CPU, then to use previous CPU */
+        if (target_cpu == -1)
+            return task_cpu(p);
+
     } else {
         /*
          * Find a cpu with sufficient capacity
@@ -5807,7 +5815,8 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
             target_cpu = tmp_target;
             if ((boosted || prefer_idle) && idle_cpu(target_cpu))
                 return target_cpu;
-        }
+        } else
+            target_cpu = task_cpu(p);
     }

     if (target_cpu != task_cpu(p)) {
--
1.9.1
In the current code, energy-aware scheduling selects a CPU based on the CPU's current capacity. This creates a feedback loop: when the CPUFreq governor raises the frequency, the scheduler tries to pack more tasks onto one single CPU; the CPUFreq governor then detects more load on that CPU and raises the frequency further. So, step by step, small tasks get packed onto one CPU running at a quite high operating point.
The current code wants to avoid waking up more idle CPUs, saving the power consumed on the CPU wakeup and sleep paths. But, as a contrary result, packing small tasks onto a single CPU at a high operating point also worsens power.
So this patch changes the selection to compare the CPU's current utilization with the CPU's capacity and predict the CPU's operating point after placing the task on it. We can then easily select the CPU with the lowest operating point after placement. Beyond this, it still packs tasks onto one CPU if that CPU can stay at the lowest operating point.
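As an illustration of the selection rule (a standalone sketch with made-up capacity states and utilization numbers, not the kernel's real sched_group_energy tables), the CPU whose predicted post-placement utilization maps to the lowest capacity-state index wins:

  /*
   * Illustration only: hypothetical capacity states for a little CPU,
   * listed in ascending order.
   */
  static const unsigned long cap_states[] = { 150, 300, 450 };

  /* Map a predicted utilization to the lowest capacity state able to serve it. */
  static int opp_idx(unsigned long util)
  {
      int idx;

      for (idx = 0; idx < 3; idx++)
          if (cap_states[idx] >= util)
              break;

      if (idx == 3)
          idx = idx - 1;  /* clamp to the highest state */

      return idx;
  }

  /*
   * Example with task util = 120:
   *   CPU A: util 100 -> 220 after placement -> opp_idx() == 1
   *   CPU B: util 250 -> 370 after placement -> opp_idx() == 2
   * CPU A is preferred because it can serve the task at a lower OPP;
   * packing is still allowed when two CPUs end up at the same index.
   */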
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 39 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 724b36c..04bb3d9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4893,6 +4893,25 @@ static int find_new_capacity(struct energy_env *eenv,
     return idx;
 }

+static int find_cpu_new_capacity(int cpu, unsigned long util)
+{
+    struct sched_domain *sd;
+    const struct sched_group_energy *sge;
+    int idx;
+
+    sd = rcu_dereference(per_cpu(sd_ea, cpu));
+    sge = sd->groups->sge;
+
+    for (idx = 0; idx < sge->nr_cap_states; idx++)
+        if (sge->cap_states[idx].cap >= util)
+            break;
+
+    if (idx == sge->nr_cap_states)
+        idx = idx - 1;
+
+    return idx;
+}
+
 static int group_idle_state(struct sched_group *sg)
 {
     int i, state = INT_MAX;
@@ -5718,6 +5737,7 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
     struct sched_domain *sd;
     struct sched_group *sg, *sg_target;
     int target_max_cap = INT_MAX;
+    int target_cap_idx = INT_MAX;
     int target_cpu = -1;
     unsigned long task_util_boosted, new_util;
     int i;
@@ -5766,6 +5786,9 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
         task_util_boosted = boosted_task_util(p);
         /* Find cpu with sufficient capacity */
         for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
+
+            int cap_idx;
+
             /*
              * p's blocked utilization is still accounted for on prev_cpu
              * so prev_cpu will receive a negative bias due to the double
@@ -5781,18 +5804,24 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
             if (new_util > capacity_orig_of(i))
                 continue;

-            if (new_util < capacity_curr_of(i)) {
-                target_cpu = i;
-                if (cpu_rq(i)->nr_running)
-                    break;
-            }
+            cap_idx = find_cpu_new_capacity(i, new_util);
+            if (target_cap_idx > cap_idx) {

-            /*
-             * cpu has capacity at higher OPP, keep it as fallback;
-             * give the previous cpu more chance to run
-             */
-            if (task_cpu(p) == i || target_cpu == -1)
+                /* Select cpu with possible lower OPP */
                 target_cpu = i;
+                target_cap_idx = cap_idx;
+
+            } else if (target_cap_idx == cap_idx) {
+
+                /* Pack tasks if possible */
+                if (cpu_rq(i)->nr_running) {
+                    if (!cpu_rq(target_cpu)->nr_running)
+                        target_cpu = i;
+                    /* Give the previous cpu more chance to run */
+                    else if (task_cpu(p) == i)
+                        target_cpu = i;
+                }
+            }
         }

         /* If have not select any CPU, then to use previous CPU */
--
1.9.1
In the current code, if under the tipping point, tasks are packed onto one CPU. In this case it is possible to have only one very busy CPU while the other CPUs in the same cluster are in idle states. So a performance issue occurs: under the tipping point, there is no mechanism to spread tasks within the same cluster.
Relying on "over-utilized" as the tipping point to spread tasks has two issues: first, "over-utilized" is a rigid condition, and a CPU needs a long time to reach 80% of its capacity, which delays meeting the task's performance requirement; second, after crossing the tipping point, the scheduler migrates tasks directly to the big cluster rather than spreading them within the little cluster.
This patch adds "half-utilized" as an intermediate state: if any CPU is over 50% utilization, it is considered "half-utilized", and as a result tasks are spread within the same cluster; this holds for any scheduling domain (or cluster). Two condition checks need to change: one in the wakeup path and one in the idle balance path. Once "half-utilized", both will try to spread tasks within the lowest scheduling domain of the cluster.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 47 ++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 44 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04bb3d9..bc16787 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4183,6 +4183,8 @@ static inline void hrtick_update(struct rq *rq)

 #ifdef CONFIG_SMP
 static bool cpu_overutilized(int cpu);
+static bool cpu_halfutilized(int cpu);
+static bool need_spread_task(int cpu);
 static inline unsigned long boosted_cpu_util(int cpu);
 #else
 #define boosted_cpu_util(cpu) cpu_util(cpu)
@@ -5309,6 +5311,34 @@ static bool cpu_overutilized(int cpu)
     return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
 }

+static bool cpu_halfutilized(int cpu)
+{
+    return capacity_of(cpu) < (cpu_util(cpu) * 2);
+}
+
+static bool need_spread_task(int cpu)
+{
+    struct sched_domain *sd;
+    int spread = 0, i;
+
+    if (cpu_rq(cpu)->rd->overutilized)
+        return 1;
+
+    sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+    if (!sd)
+        return 0;
+
+    for_each_cpu(i, sched_domain_span(sd)) {
+        if (cpu_rq(i)->cfs.h_nr_running >= 1 &&
+            cpu_halfutilized(i)) {
+            spread = 1;
+            break;
+        }
+    }
+
+    return spread;
+}
+
 #ifdef CONFIG_SCHED_TUNE

 static long
@@ -5921,7 +5951,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
     }

     if (!sd) {
-        if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
+        if (energy_aware() && !need_spread_task(cpu))
             new_cpu = energy_aware_wake_cpu(p, prev_cpu, sync);
         else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
             new_cpu = select_idle_sibling(p, new_cpu);
@@ -7834,8 +7864,19 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
      */
     update_sd_lb_stats(env, &sds);

-    if (energy_aware() && !env->dst_rq->rd->overutilized)
-        goto out_balanced;
+    if (energy_aware() && !env->dst_rq->rd->overutilized) {
+
+        struct sched_domain *sd;
+        int cpu = env->dst_cpu;
+
+        sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+        if (!cpumask_equal(sched_domain_span(sd),
+                   sched_domain_span(env->sd)))
+            goto out_balanced;
+
+        if (!need_spread_task(cpu))
+            goto out_balanced;
+    }

     local = &sds.local_stat;
     busiest = &sds.busiest_stat;
--
1.9.1
The current code has only one path that directly migrates a task from a lower capacity CPU to a higher capacity CPU in the wakeup balance: task_fits_max() returns false and the previous CPU is over-utilized. So it is hard for tasks to migrate to higher capacity CPUs.
This patch adds a path to directly migrate a task from a lower capacity CPU (LITTLE core) to a higher capacity CPU (big core) if the lower capacity CPU cannot meet the performance requirement. In this path, when the task has a boost margin set, the decision on whether to migrate is left to the PE filter.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc16787..2a94895 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5858,6 +5858,19 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
         if (target_cpu == -1)
             return task_cpu(p);

+        /*
+         * Destination CPU has higher capacity than previous CPU,
+         * so that means pervious CPU has no enough capacity to meet
+         * the waken up task. Therefore directly return back and select
+         * destination CPU.
+         *
+         * If has set boost margin for this task, then leave to PE filter
+         * to decide if can migrate task.
+         */
+        if (capacity_of(target_cpu) > capacity_of(task_cpu(p)) &&
+            !schedtune_task_margin(p))
+            return target_cpu;
+
     } else {
         /*
          * Find a cpu with sufficient capacity
--
1.9.1
When nohz idle balance (nohz_idle_balance()) runs, it calls find_busiest_group() to find out which scheduling group is the busiest. There is a case where the busiest group is overloaded but the local group still has spare capacity, yet the current code skips it because of the check below:
    /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
    if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
        busiest->group_no_capacity)
        goto force_balance;
This is because env->idle is CPU_IDLE for the idle CPUs during nohz idle balance, so the balance is not forced. Even worse, load balancing is skipped entirely once either of the conditions below is met:
    /*
     * If the local group is busier than the selected busiest group
     * don't try and pull any tasks.
     */
    if (local->avg_load >= busiest->avg_load)
        goto out_balanced;

    /*
     * Don't pull any tasks if this group is already above the domain
     * average load.
     */
    if (local->avg_load >= sds.avg_load)
        goto out_balanced;
So this patch forces the balance for idle CPUs as well.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2a94895..2c9abc3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7915,8 +7915,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
         goto force_balance;

     /* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-    if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
-        busiest->group_no_capacity)
+    if ((env->idle == CPU_NEWLY_IDLE || env->idle == CPU_IDLE) &&
+        group_has_capacity(env, local) && busiest->group_no_capacity)
         goto force_balance;

     /* Misfitting tasks should be dealt with regardless of the avg load */
--
1.9.1
EAS defines "any CPU is over-utilized" as the tipping point criterion; once this criterion is met, EAS considers the system past the tipping point and falls back to SMP load balancing. SMP load balancing may then migrate small tasks onto big cores, but usually only big tasks are worth migrating, so this hurts both power and performance.
This patch adds more checking in can_migrate_task() to avoid migrating a small task to a higher capacity CPU: if the task's utilization is less than 1/4 of the source CPU's capacity, it is considered a small task and does not need to migrate to a big core; and if the destination CPU is already over-utilized and not idle, migrating the task there is pointless.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2c9abc3..5d845c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6665,11 +6665,25 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)

     /*
      * We do not migrate tasks that are:
-     * 1) throttled_lb_pair, or
-     * 2) cannot be migrated to this CPU due to cpus_allowed, or
-     * 3) running (obviously), or
-     * 4) are cache-hot on their current CPU.
+     * 1) energy_aware is enabled and small task is not migrated to higher
+     *    capacity CPU
+     * 2) throttled_lb_pair, or
+     * 3) cannot be migrated to this CPU due to cpus_allowed, or
+     * 4) running (obviously), or
+     * 5) are cache-hot on their current CPU.
      */
+
+    if (energy_aware() &&
+        (capacity_orig_of(env->dst_cpu) > capacity_orig_of(env->src_cpu))) {
+
+        if (task_util(p) * 4 < capacity_orig_of(env->src_cpu))
+            return 0;
+
+        if (cpu_overutilized(env->dst_cpu) &&
+            !idle_cpu(env->dst_cpu))
+            return 0;
+    }
+
     if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
         return 0;
--
1.9.1
If the number of runnable tasks in one scheduling group is bigger than the number of CPUs, that sched_group's load_avg signal will be underestimated, because the per-CPU load_avg value cannot accumulate the load of all the running tasks.
On the other hand, another sched_group may have fewer tasks than CPUs. As a result, the first sched_group's load_per_task will be much lower than the second group's value.
So this patch considers this situation and sets the imbalance to busiest->load_per_task.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d845c2..9c96a0a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7713,6 +7713,12 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
     local = &sds->local_stat;
     busiest = &sds->busiest_stat;

+    if (busiest->sum_nr_running >= busiest->group_weight &&
+        local->sum_nr_running < local->group_weight) {
+        env->imbalance = busiest->load_per_task;
+        return;
+    }
+
     if (!local->sum_nr_running)
         local->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
     else if (busiest->load_per_task > local->load_per_task)
--
1.9.1
If there is a misfit task on a CPU, the current code does not handle this situation in nohz idle balance. As a result, the misfit task can stay running on a little core for a long time.
So this patch checks whether the CPU has a misfit task. If it does, a nohz idle balance is kicked so that an active balance can finally be executed.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c96a0a5..5b9e227 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9013,6 +9013,10 @@ static inline bool nohz_kick_needed(struct rq *rq)
         (!energy_aware() || cpu_overutilized(cpu)))
         return true;

+    /* Do idle load balance if there have misfit task */
+    if (energy_aware() && rq->misfit_task)
+        return true;
+
     rcu_read_lock();
     sd = rcu_dereference(per_cpu(sd_busy, cpu));
     if (sd && !energy_aware()) {
--
1.9.1
The current code checks the tipping point during load balance when calculating scheduling group load and utilization. But it does not distinguish whether the CPU is idle, so it may wrongly consider an idle CPU over-utilized when that CPU has a stale utilization value which has never been updated since the CPU entered the idle state.
There are two ways to fix this issue: one is to check whether the CPU is idle in cpu_overutilized(), so it can directly return false when the CPU has been staying in the idle state; the other is to check the CPU idle state in update_sg_lb_stats(), to avoid setting the tipping point for idle CPUs.
This patch fixes the issue with the second method.
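For comparison, the first (not adopted) method would roughly have folded the idle check into cpu_overutilized() itself; a sketch based on the existing helper, not part of this patch:

  static bool cpu_overutilized(int cpu)
  {
      /*
       * An idle CPU cannot be over-utilized; its utilization value
       * may simply be stale from before it entered idle.
       */
      if (idle_cpu(cpu))
          return false;

      return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
  }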
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b9e227..594d050 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7435,7 +7435,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
         if (!nr_running && idle_cpu(i))
             sgs->idle_cpus++;

-        if (cpu_overutilized(i)) {
+        if (cpu_overutilized(i) && !idle_cpu(i)) {
             *overutilized = true;
             if (!sgs->group_misfit_task && rq->misfit_task)
                 sgs->group_misfit_task = capacity_of(i);
--
1.9.1
In the current code, once over the tipping point all tasks are spread out. This hurts power if there is only one big task along with many other small tasks. So add criteria to filter tasks back onto the energy-aware path: first, the task has a boost margin set, which means it should be controlled by the PE filter rather than spread by SMP load balance; second, if a lower capacity CPU can meet the task's computation requirement, also force the task onto the energy-aware path.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 594d050..604793d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5339,6 +5339,41 @@ static bool need_spread_task(int cpu)
     return spread;
 }

+static bool need_filter_task(struct task_struct *p)
+{
+    int cpu = task_cpu(p);
+    int origin_max_cap = capacity_orig_of(cpu);
+    int target_max_cap = cpu_rq(cpu)->rd->max_cpu_capacity.val;
+    unsigned long margin = schedtune_task_margin(p);
+    struct sched_domain *sd;
+    struct sched_group *sg;
+
+    if (margin)
+        return 1;
+
+    sd = rcu_dereference(per_cpu(sd_ea, cpu));
+    if (!sd)
+        return 0;
+
+    sg = sd->groups;
+    do {
+        int first_cpu = group_first_cpu(sg);
+
+        if (capacity_orig_of(first_cpu) < target_max_cap &&
+            task_util(p) * 4 < capacity_orig_of(first_cpu))
+            target_max_cap = capacity_orig_of(first_cpu);
+
+    } while (sg = sg->next, sg != sd->groups);
+
+    if (capacity_orig_of(smp_processor_id()) > target_max_cap)
+        return 1;
+
+    if (target_max_cap < origin_max_cap)
+        return 1;
+
+    return 0;
+}
+
 #ifdef CONFIG_SCHED_TUNE

 static long
@@ -5964,7 +5999,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
     }

     if (!sd) {
-        if (energy_aware() && !need_spread_task(cpu))
+        if (energy_aware() &&
+            (!need_spread_task(cpu) || need_filter_task(p)))
             new_cpu = energy_aware_wake_cpu(p, prev_cpu, sync);
         else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
             new_cpu = select_idle_sibling(p, new_cpu);
--
1.9.1
In some cases we just want to distinguish and compare the capacity of two CPUs, so change to use capacity_orig_of(). The current capacity_of() returns a dynamic value reduced by RT threads, which can give a misleading comparison between two CPUs in the same cluster.
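For reference, the two helpers in kernel/sched/fair.c roughly look like the sketch below (comments are mine): capacity_orig_of() is the static per-CPU maximum, while capacity_of() is the dynamic value after RT/IRQ pressure is subtracted.

  static inline unsigned long capacity_of(int cpu)
  {
      /* Dynamic: reduced at run time by RT/IRQ pressure */
      return cpu_rq(cpu)->cpu_capacity;
  }

  static inline unsigned long capacity_orig_of(int cpu)
  {
      /* Static: the CPU's full capacity at its highest OPP */
      return cpu_rq(cpu)->cpu_capacity_orig;
  }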
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 604793d..ac4f509 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8138,12 +8138,11 @@ static int need_active_balance(struct lb_env *env)
         return 1;
     }

-    if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
-        env->src_rq->cfs.h_nr_running == 1 &&
-        cpu_overutilized(env->src_cpu) &&
-        !cpu_overutilized(env->dst_cpu)) {
-        return 1;
-    }
+    if ((capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
+        env->src_rq->cfs.h_nr_running == 1 &&
+        cpu_overutilized(env->src_cpu) &&
+        !cpu_overutilized(env->dst_cpu))
+        return 1;

     return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
--
1.9.1
Add a check for whether the task is permitted to run on only one CPU, so we can return directly when such a task is woken up.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac4f509..9ac593b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5966,6 +5966,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
     int want_affine = 0;
     int sync = wake_flags & WF_SYNC;

+    if (p->nr_cpus_allowed == 1)
+        return prev_cpu;
+
     if (sd_flag & SD_BALANCE_WAKE)
         want_affine = (!wake_wide(p) && task_fits_max(p, cpu) &&
                   cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) ||
--
1.9.1
Add extra methods from sysfs to optimize performance:
- Set the migration cost to 0, which gives more chance to spread tasks across different CPUs in the same cluster;
- Set busy_factor to 1, which gives more chance for active load balance to migrate running tasks.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 Documentation/scheduler/sched-energy.txt | 87 ++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
index dab2f90..bfd4eb8 100644
--- a/Documentation/scheduler/sched-energy.txt
+++ b/Documentation/scheduler/sched-energy.txt
@@ -360,3 +360,90 @@ of the cpu from idle/busy power of the shared resources. The cpu can be tricked
 into different per-cpu idle states by disabling the other states. Based on
 various combinations of measurements with specific cpus busy and disabling
 idle-states it is possible to extrapolate the idle-state power.
+
+Performance tunning method
+==========================
+
+Below setting may impact heavily for performance tunning when spread tasks:
+
+#!/system/bin/sh
+
+echo 'enable ENERGY_AWARE feature'
+echo ENERGY_AWARE > /sys/kernel/debug/sched_features
+echo 1 > /proc/sys/kernel/sched_is_big_little
+echo 0 > /proc/sys/kernel/sched_sync_hint_enable
+echo 0 > /proc/sys/kernel/sched_initial_task_util
+echo 1 > /proc/sys/kernel/sched_cstate_aware
+
+if [ "$1" = "pelt" ]; then
+
+echo 'set for pelt'
+echo 0 > /proc/sys/kernel/sched_use_walt_cpu_util
+echo 0 > /proc/sys/kernel/sched_use_walt_task_util
+
+elif [ "$1" = "walt" ]; then
+
+echo 'set for walt'
+echo 1 > /proc/sys/kernel/sched_use_walt_cpu_util
+echo 1 > /proc/sys/kernel/sched_use_walt_task_util
+echo 10000000 > /proc/sys/kernel/sched_walt_cpu_high_irqload
+
+fi
+
+echo 'set sched_migration_cost_ns=0'
+echo 0 > /proc/sys/kernel/sched_migration_cost_ns
+
+echo 'set interactive governor'
+echo interactive > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
+
+echo 'set busy_factor=1'
+echo 1 > /proc/sys/kernel/sched_domain/cpu0/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu0/domain1/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu1/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu1/domain1/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu2/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu2/domain1/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu3/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu3/domain1/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu4/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu4/domain1/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu5/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu5/domain1/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu6/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu6/domain1/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu7/domain0/busy_factor
+echo 1 > /proc/sys/kernel/sched_domain/cpu7/domain1/busy_factor
+
+echo 'final checking'
+set -v
+
+cat /sys/kernel/debug/sched_features
+cat /proc/sys/kernel/sched_is_big_little
+cat /proc/sys/kernel/sched_sync_hint_enable
+cat /proc/sys/kernel/sched_initial_task_util
+cat /proc/sys/kernel/sched_cstate_aware
+
+cat /proc/sys/kernel/sched_use_walt_cpu_util
+cat /proc/sys/kernel/sched_use_walt_task_util
+cat /proc/sys/kernel/sched_walt_cpu_high_irqload
+
+cat /proc/sys/kernel/sched_migration_cost_ns
+
+cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
+
+cat /proc/sys/kernel/sched_domain/cpu0/domain0/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu0/domain1/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu1/domain0/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu1/domain1/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu2/domain0/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu2/domain1/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu3/domain0/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu3/domain1/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu4/domain0/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu4/domain1/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu5/domain0/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu5/domain1/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu6/domain0/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu6/domain1/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu7/domain0/busy_factor
+cat /proc/sys/kernel/sched_domain/cpu7/domain1/busy_factor
--
1.9.1
If the big CPU is already busy (half-utilized with <= 2 tasks, or over-utilized with more than 2 tasks), it is pointless to migrate a big task from a little CPU to that big CPU. Otherwise it makes the big CPU over-utilized and finally triggers spreading tasks across the two clusters.
So this patch avoids migrating a single big task to a busy big CPU.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9ac593b..d403816 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5339,6 +5339,34 @@ static bool need_spread_task(int cpu)
     return spread;
 }

+static bool need_want_affine(struct task_struct *p, int cpu)
+{
+    int capacity = capacity_orig_of(cpu);
+    int max_capacity = cpu_rq(cpu)->rd->max_cpu_capacity.val;
+    unsigned long margin = schedtune_task_margin(p);
+    struct sched_domain *sd;
+    int affine = 0, i;
+
+    if (margin)
+        return 1;
+
+    if (capacity != max_capacity)
+        return 1;
+
+    sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+    if (!sd)
+        return 1;
+
+    for_each_cpu(i, sched_domain_span(sd)) {
+        if (idle_cpu(i)) {
+            affine = 1;
+            break;
+        }
+    }
+
+    return affine;
+}
+
 static bool need_filter_task(struct task_struct *p)
 {
     int cpu = task_cpu(p);
@@ -5972,7 +6000,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
     if (sd_flag & SD_BALANCE_WAKE)
         want_affine = (!wake_wide(p) && task_fits_max(p, cpu) &&
                   cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) ||
-                  energy_aware();
+                  (energy_aware() && need_want_affine(p, cpu));

     rcu_read_lock();
     for_each_domain(cpu, tmp) {
@@ -8143,9 +8171,14 @@ static int need_active_balance(struct lb_env *env)

     if ((capacity_orig_of(env->src_cpu) < capacity_orig_of(env->dst_cpu)) &&
         env->src_rq->cfs.h_nr_running == 1 &&
-        cpu_overutilized(env->src_cpu) &&
-        !cpu_overutilized(env->dst_cpu))
-        return 1;
+        cpu_overutilized(env->src_cpu)) {
+
+        if (idle_cpu(env->dst_cpu))
+            return 1;
+
+        if (!idle_cpu(env->dst_cpu) && !cpu_overutilized(env->dst_cpu))
+            return 1;
+    }

     return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
--
1.9.1
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d403816..80c5a91 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4182,6 +4182,10 @@ static inline void hrtick_update(struct rq *rq)
 #endif

 #ifdef CONFIG_SMP
+
+static inline long
+schedtune_task_margin(struct task_struct *task);
+
 static bool cpu_overutilized(int cpu);
 static bool cpu_halfutilized(int cpu);
 static bool need_spread_task(int cpu);
@@ -5478,7 +5482,7 @@ schedtune_cpu_margin(unsigned long util, int cpu)
     return 0;
 }

-static inline int
+static inline long
 schedtune_task_margin(struct task_struct *task)
 {
     return 0;
--
1.9.1
When using WALT signals, tasks migrate more easily from little cores to big cores, which introduces quite high power consumption in some scenarios.
This patch adds extra checking for WALT signals: if the task's util_avg is less than 1/4 of the CPU's capacity, don't migrate it to a higher capacity CPU even if its WALT signal value is quite high. This is because WALT accounts both running and runnable time for a task, so even a small task may have a high WALT value; util_avg is therefore used to filter out which tasks really have a big load.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 80c5a91..11d438b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5391,9 +5391,17 @@ static bool need_filter_task(struct task_struct *p)
     do {
         int first_cpu = group_first_cpu(sg);

-        if (capacity_orig_of(first_cpu) < target_max_cap &&
-            task_util(p) * 4 < capacity_orig_of(first_cpu))
-            target_max_cap = capacity_orig_of(first_cpu);
+        if (capacity_orig_of(first_cpu) >= target_max_cap)
+            continue;
+
+        if (sysctl_sched_use_walt_task_util) {
+            if (task_fits_max(p, first_cpu) ||
+                p->se.avg.util_avg * 4 < capacity_orig_of(first_cpu))
+                target_max_cap = capacity_orig_of(first_cpu);
+        } else {
+            if (task_util(p) * 4 < capacity_orig_of(first_cpu))
+                target_max_cap = capacity_orig_of(first_cpu);
+        }

     } while (sg = sg->next, sg != sd->groups);

@@ -6747,8 +6755,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
     if (energy_aware() &&
         (capacity_orig_of(env->dst_cpu) > capacity_orig_of(env->src_cpu))) {

-        if (task_util(p) * 4 < capacity_orig_of(env->src_cpu))
-            return 0;
+        if (sysctl_sched_use_walt_task_util) {
+            if (task_fits_max(p, env->src_cpu) ||
+                p->se.avg.util_avg * 4 < capacity_orig_of(env->src_cpu))
+                return 0;
+
+        } else {
+            if (task_util(p) * 4 < capacity_orig_of(env->src_cpu))
+                return 0;
+        }

         if (cpu_overutilized(env->dst_cpu) &&
             !idle_cpu(env->dst_cpu))
--
1.9.1
Also add a check for whether the destination CPU can meet the boosted utilization value; if it cannot, directly skip the task migration.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 11d438b..1001e27 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6768,6 +6768,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
         if (cpu_overutilized(env->dst_cpu) &&
             !idle_cpu(env->dst_cpu))
             return 0;
+
+        if (!is_idle_task(env->dst_rq->curr)) {
+
+            if (capacity_orig_of(env->dst_cpu) * 1024 <
+                (boosted_task_util(env->dst_rq->curr) * capacity_margin))
+                return 0;
+        }
     }

     if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
--
1.9.1
This patch optimizes task spreading when all big cores are busy: in that case, go back and check whether the little cores are idle, so tasks can be spread out as much as possible and scheduling latency is avoided.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1001e27..17dcd8e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5733,6 +5733,29 @@ next:
         target = best_idle;

 done:
+
+    /*
+     * In this case means all big cores are not idle, so spread task
+     * as possible if system is quite busy.
+     */
+    if (!idle_cpu(target) && !need_want_affine(p, target)) {
+
+        int cpu = smp_processor_id();
+
+        for (i = 0; i < NR_CPUS; i++) {
+
+            if (!cpu_online(i) ||
+                !cpumask_test_cpu(i, tsk_cpus_allowed(p)))
+                continue;
+
+            if (idle_cpu(i))
+                target = i;
+        }
+
+        if (!idle_cpu(target) && cpu_rq(cpu)->cfs.h_nr_running == 1)
+            target = cpu;
+    }
+
     return target;
 }
--
1.9.1