o This patch series optimizes power. Power optimization has to address two factors: the first is to find methods to save power and to avoid unnecessary task migrations to big cores; the second is that it must not degrade performance.
So this patch series is built on top of the performance optimization patch series [1] and carries out further work for power saving, with the target of optimizing power without degrading performance.
RFCv3 introduced power optimization related patches, but those patches were not general enough. E.g. RFCv3 defined the criterion for a small task as: task_util(p) < 1/4 * cpu_capacity(cpu), which is very hard to apply across all SoCs. This patch series tries to figure out a more general method for this.
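For reference, below is a minimal sketch of the RFCv3-style check (is_small_task() is a hypothetical helper name used only for illustration; capacity_orig_of() stands in for the cpu_capacity() mentioned above):

  /* Sketch of the RFCv3-style heuristic: a task counts as "small" when
   * its utilization is below a quarter of the CPU's capacity.  The fixed
   * 1/4 threshold is the SoC-specific part this series tries to avoid.
   */
  static inline bool is_small_task(struct task_struct *p, int cpu)
  {
          return task_util(p) < capacity_orig_of(cpu) / 4;
  }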
o Below is background info for the power optimization:
As a first step of power optimization, we should make sure tasks can spread out within a cluster. This has two benefits: one is that it helps decrease the frequency of every cluster, the other is that spreading tasks within a cluster exploits the available CPU capacity and avoids any single CPU becoming overutilized, which in turn avoids migrating tasks to big cores. This is done by patch 0001.
If there are big tasks that really need to migrate onto a big core, we should ensure the big tasks are migrated to the big core first rather than small tasks. So patch 0002 introduces an rb tree to track the biggest task on a RQ, and patch 0003 uses the rb tree to migrate the biggest tasks to higher capacity CPUs.
Patch 0004 has the biggest effect on power saving: it checks whether the wakeup task can run on a low capacity CPU. If so, it forces the energy aware scheduling path to run even when the system is over the tipping point. The criterion for whether a wakeup task can run on a low capacity CPU is: does any low capacity CPU have enough spare bandwidth to meet the woken task's requirement; this ensures that even when the task keeps running on a low capacity CPU, performance is not sacrificed.
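Roughly, the criterion boils down to the check below. This is only a simplified sketch of what patch 0004 implements as can_task_run_low_capacity_cpu(); fits_on_lower_capacity_cpu() is a hypothetical name, and task_fits_spare() is the existing spare-capacity test in the EAS tree:

  /* Sketch: true if some online CPU with lower original capacity than
   * the task's current CPU still has enough spare capacity for the task.
   */
  static bool fits_on_lower_capacity_cpu(struct task_struct *p)
  {
          int i;

          for_each_cpu_and(i, tsk_cpus_allowed(p), cpu_online_mask) {
                  if (capacity_orig_of(i) < capacity_orig_of(task_cpu(p)) &&
                      task_fits_spare(p, i))
                          return true;
          }

          return false;
  }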
o Test result:
First, the patch series "EASv5.2+: Performance Optimization And Fixes" was applied and power and performance were tested; then, on top of that code base, this power saving patch series was applied. Finally the power and performance data were compared.
For the power comparison the test case is video playback (1080p); below are the results on a Juno board:
Items            | LITTLE Nrg | big Nrg   | Total Nrg
----------------------------------------------------------------
Perf opt         | 11.0520992 | 9.7118762 | 20.7639754
Perf + Power opt | 11.4157602 | 8.7319138 | 20.147674
Comparison       | +3.29%     | -10.09%   | -2.97%
[1] https://lists.linaro.org/pipermail/eas-dev/2016-October/000610.html
Leo Yan (4):
  sched/fair: select lowest capacity CPU with packing tasks
  sched/fair: support to track biggest task util on rq
  sched/fair: migrate highest utilization task to higher capacity CPU
  sched/fair: check if wakeup task can run low capacity CPU
 include/linux/sched.h |   1 +
 kernel/sched/fair.c   | 213 +++++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h  |   4 +
 3 files changed, 200 insertions(+), 18 deletions(-)
--
1.9.1
In the previous code, energy aware scheduling selects a CPU based on whether the CPU's current capacity can meet the wakeup task's requirement, and it prefers a running CPU over an idle CPU so as to reduce wakeup latency. This has a side effect which usually creates a vicious circle: after the task is placed onto a CPU, the CPUFreq governor is likely to increase the frequency once it detects the CPU has higher utilization; when another task wakes up later, it will also be placed onto the same CPU, because that CPU now runs at a raised frequency and it is the preferred CPU since it has a running task. As a result, the CPUFreq governor keeps increasing the frequency while more and more tasks are packed onto one CPU.
Another observed issue is: if the system has one big task and several small tasks, and a LITTLE core can meet the big task's capacity requirement, the previous code is more likely to pack tasks onto one CPU, so that CPU has a higher chance of becoming overutilized; in the end the big task is migrated to a big core even though a standalone LITTLE core could have met its capacity requirement.
So this patch is based on the two rules below for power saving:
- Higher frequencies consume more power than lower frequencies, so prefer
  spreading tasks in order to stay at a lower frequency rather than
  increasing the frequency;
- Once the lowest possible frequency is achieved, it's better to pack
  tasks so as to reduce the number of CPU wakeups.
This patch follows the two rules above: the most important criterion is to select the CPU which can meet the task's requirement at the lowest capacity state; among such CPUs it then selects the one with the highest utilization that can still stay at that lowest frequency, so tasks are packed as much as possible while keeping the lowest frequency.
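As a minimal sketch of the selection logic (not the actual change below; select_lowest_opp_cpu() is a hypothetical helper, find_cpu_new_capacity() is introduced by this patch, and the sync/boost/fallback handling of the real code is omitted):

  /* Hypothetical summary of the selection logic: pick the candidate
   * whose required OPP index is lowest; on a tie, prefer the candidate
   * with the higher current utilization so tasks get packed while the
   * cluster stays at the same (lowest) frequency.
   */
  static int select_lowest_opp_cpu(struct task_struct *p,
                                   const struct cpumask *candidates)
  {
          unsigned long util_boosted = boosted_task_util(p);
          int best_cpu = -1, best_idx = INT_MAX;
          int i;

          for_each_cpu(i, candidates) {
                  unsigned long new_util = cpu_util(i) + util_boosted;
                  int idx;

                  if (new_util > capacity_orig_of(i))
                          continue;

                  idx = find_cpu_new_capacity(i, new_util);
                  if (idx < best_idx ||
                      (idx == best_idx && cpu_util(i) > cpu_util(best_cpu))) {
                          best_cpu = i;
                          best_idx = idx;
                  }
          }

          return (best_cpu == -1) ? task_cpu(p) : best_cpu;
  }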
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 45 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 050e8b9..713b031 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4931,6 +4931,25 @@ static int find_new_capacity(struct energy_env *eenv,
 	return idx;
 }
 
+static int find_cpu_new_capacity(int cpu, unsigned long util)
+{
+	struct sched_domain *sd;
+	const struct sched_group_energy *sge;
+	int idx;
+
+	sd = rcu_dereference(per_cpu(sd_ea, cpu));
+	sge = sd->groups->sge;
+
+	for (idx = 0; idx < sge->nr_cap_states; idx++)
+		if (sge->cap_states[idx].cap >= util)
+			break;
+
+	if (idx == sge->nr_cap_states)
+		idx = idx - 1;
+
+	return idx;
+}
+
 static int group_idle_state(struct sched_group *sg)
 {
 	int i, state = INT_MAX;
@@ -5756,9 +5775,10 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
 	struct sched_domain *sd;
 	struct sched_group *sg, *sg_target;
 	int target_max_cap = INT_MAX;
-	int target_cpu = task_cpu(p);
-	unsigned long task_util_boosted, new_util;
-	int i;
+	int target_cap_idx = INT_MAX;
+	int target_cpu = -1;
+	unsigned long task_util_boosted, new_util, wake_util;
+	int cap_idx, i;
 
 	if (sysctl_sched_sync_hint_enable && sync) {
 		int cpu = smp_processor_id();
@@ -5804,12 +5824,15 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
 		task_util_boosted = boosted_task_util(p);
 		/* Find cpu with sufficient capacity */
 		for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
+
+			wake_util = cpu_util(i);
+
 			/*
 			 * p's blocked utilization is still accounted for on prev_cpu
 			 * so prev_cpu will receive a negative bias due to the double
 			 * accounting. However, the blocked utilization may be zero.
 			 */
-			new_util = cpu_util(i) + task_util_boosted;
+			new_util = wake_util + task_util_boosted;
 
 			/*
 			 * Ensure minimum capacity to grant the required boost.
@@ -5819,16 +5842,25 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
 			if (new_util > capacity_orig_of(i))
 				continue;
 
-			if (new_util < capacity_curr_of(i)) {
-				target_cpu = i;
-				if (cpu_rq(i)->nr_running)
-					break;
-			}
+			cap_idx = find_cpu_new_capacity(i, new_util);
+			if (target_cap_idx > cap_idx) {
 
-			/* cpu has capacity at higher OPP, keep it as fallback */
-			if (target_cpu == task_cpu(p))
+				/* Select cpu with possible lower OPP */
 				target_cpu = i;
+				target_cap_idx = cap_idx;
+
+			} else if (target_cap_idx == cap_idx) {
+
+				/* Pack tasks if possible */
+				if (wake_util > cpu_util(target_cpu))
+					target_cpu = i;
+			}
 		}
+
+		/* If have not select any CPU, then to use previous CPU */
+		if (target_cpu == -1)
+			return task_cpu(p);
+
 	} else {
 		/*
 		 * Find a cpu with sufficient capacity
@@ -5845,7 +5877,8 @@ static int energy_aware_wake_cpu(struct task_struct *p, int target, int sync)
 			target_cpu = tmp_target;
 			if ((boosted || prefer_idle) && idle_cpu(target_cpu))
 				return target_cpu;
-		}
+		} else
+			target_cpu = task_cpu(p);
 	}
 
 	if (target_cpu != task_cpu(p)) {
--
1.9.1
The CFS scheduler uses rq->cfs_tasks to track all tasks that have been enqueued on a rq, and this list is used when iterating the tasks of the source rq for load balance. The list is simply kept in enqueue order, so it does not take task utilization into account at all.
This patch introduces an rb tree for the rq, so every rq has one rb tree to track all runnable tasks and the running task on it. The highest utilization task is kept as the leftmost node, so it is easy to know which task has the highest utilization on the rq.
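For illustration, a minimal sketch of how a consumer (for example the load balance path in patch 0003) could walk the tree in descending utilization order, using the __pick_first_queue_util() / __pick_next_queue_util() helpers added below (walk_rq_by_util() itself is hypothetical):

  /* Hypothetical example: visit tasks on @rq from highest to lowest
   * utilization.  Must be called with the rq lock held so the tree
   * cannot change underneath us.
   */
  static void walk_rq_by_util(struct rq *rq)
  {
          struct sched_entity *se;

          for (se = __pick_first_queue_util(rq); se;
               se = __pick_next_queue_util(se)) {
                  struct task_struct *p = task_of(se);

                  trace_printk("task %s util_avg %lu\n",
                               p->comm, se->avg.util_avg);
          }
  }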
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 95 +++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h  |  4 +++
 3 files changed, 100 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ede29e8..4fa171d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1323,6 +1323,7 @@ struct ravg {
 struct sched_entity {
 	struct load_weight	load;		/* for load-balancing */
 	struct rb_node		run_node;
+	struct rb_node		util_node;
 	struct list_head	group_node;
 	unsigned int		on_rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 713b031..7183000 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -605,6 +605,90 @@ int sched_proc_update_handler(struct ctl_table *table, int write,
 #endif
 
 /*
+ * Enqueue/dequeue task into rb-tree based on task util metrics,
+ * so can easily to fetch biggest task from RQ.
+ */
+static inline int task_util_before(struct sched_entity *a,
+				   struct sched_entity *b)
+{
+	return (s64)(a->avg.util_avg - b->avg.util_avg) > 0;
+}
+
+static void __enqueue_util(struct rq *rq, struct sched_entity *se)
+{
+	struct rb_node **link = &rq->util_node.rb_node;
+	struct rb_node *parent = NULL;
+	struct sched_entity *entry;
+	int leftmost = 1;
+
+	/*
+	 * Find the right place in the rbtree:
+	 */
+	while (*link) {
+		parent = *link;
+		entry = rb_entry(parent, struct sched_entity, util_node);
+		/*
+		 * We dont care about collisions. Nodes with
+		 * the same key stay together.
+		 */
+		if (task_util_before(se, entry)) {
+			link = &parent->rb_left;
+		} else {
+			link = &parent->rb_right;
+			leftmost = 0;
+		}
+	}
+
+	/*
+	 * Maintain a cache of leftmost tree entries (it is frequently
+	 * used):
+	 */
+	if (leftmost)
+		rq->util_node_leftmost = &se->util_node;
+
+	rb_link_node(&se->util_node, parent, link);
+	rb_insert_color(&se->util_node, &rq->util_node);
+}
+
+static void __dequeue_util(struct rq *rq, struct sched_entity *se)
+{
+	if (rq->util_node_leftmost == &se->util_node) {
+		struct rb_node *next_node;
+
+		next_node = rb_next(&se->util_node);
+		rq->util_node_leftmost = next_node;
+	}
+
+	rb_erase(&se->util_node, &rq->util_node);
+}
+
+static void __update_queue_util(struct rq *rq, struct sched_entity *se)
+{
+	__dequeue_util(rq, se);
+	__enqueue_util(rq, se);
+}
+
+struct sched_entity *__pick_first_queue_util(struct rq *rq)
+{
+	struct rb_node *left = rq->util_node_leftmost;
+
+	if (!left)
+		return NULL;
+
+	return rb_entry(left, struct sched_entity, util_node);
+}
+
+static struct sched_entity *__pick_next_queue_util(struct sched_entity *se)
+{
+	struct rb_node *next = rb_next(&se->util_node);
+
+	if (!next)
+		return NULL;
+
+	return rb_entry(next, struct sched_entity, util_node);
+}
+
+/*
  * delta /= w
  */
 static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
@@ -2369,6 +2453,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		account_numa_enqueue(rq, task_of(se));
 		list_add(&se->group_node, &rq->cfs_tasks);
+		__enqueue_util(rq, se);
 	}
 #endif
 	cfs_rq->nr_running++;
@@ -2383,6 +2468,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (entity_is_task(se)) {
 		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
 		list_del_init(&se->group_node);
+		__dequeue_util(rq_of(cfs_rq), se);
 	}
 	cfs_rq->nr_running--;
 }
@@ -3246,6 +3332,9 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
 		update_load_avg(se, 1);
+
+		if (entity_is_task(se))
+			__update_queue_util(rq_of(cfs_rq), se);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
@@ -3346,6 +3435,9 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 		__enqueue_entity(cfs_rq, prev);
 		/* in !on_rq case, update occurred at dequeue */
 		update_load_avg(prev, 0);
+
+		if (entity_is_task(prev))
+			__update_queue_util(rq_of(cfs_rq), prev);
 	}
 	cfs_rq->curr = NULL;
 }
@@ -3364,6 +3456,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	update_load_avg(curr, 1);
 	update_cfs_shares(cfs_rq);
 
+	if (entity_is_task(curr))
+		__update_queue_util(rq_of(cfs_rq), curr);
+
 #ifdef CONFIG_SCHED_HRTICK
 	/*
 	 * queued ticks are scheduled to match the slice, so don't bother
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2f2b959..b77c17a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -606,6 +606,10 @@ struct rq {
 	seqcount_t ave_seqcnt;
 #endif
 
+	/* track most high utilization tasks */
+	struct rb_root util_node;
+	struct rb_node *util_node_leftmost;
+
 	/* capture load from *all* tasks on this cpu: */
 	struct load_weight load;
 	unsigned long nr_load_updates;
--
1.9.1
In the previous code, if there is an imbalance the scheduler selects tasks for migration more or less arbitrarily. E.g. it may migrate only one big task, or alternatively migrate two small tasks whose total utilization is similar to the big task's. This works well on SMP platforms, which eventually spread tasks as much as possible, but it introduces potential performance and power issues on asymmetric architectures (like ARM big.LITTLE). In the example, the sub-optimal result is to migrate the two small tasks but leave the big task on the LITTLE core; the big task then cannot use the big core's high performance, and the two small tasks consume more power on the big core even though the LITTLE cores could meet their capacity requirement.
This patch uses the rq's rb tree for task utilization tracking to easily select the highest utilization task and give it the chance to migrate to a higher capacity CPU. After the highest utilization task has been detached it is also removed from the rb tree, so the code can still give other, lower utilization tasks a chance to migrate to the higher capacity CPU if there is remaining imbalance.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7183000..ec43670 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6745,16 +6745,32 @@ static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
 	int tsk_cache_hot;
+	struct sched_entity *se;
 
 	lockdep_assert_held(&env->src_rq->lock);
 
 	/*
 	 * We do not migrate tasks that are:
-	 * 1) throttled_lb_pair, or
-	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 1) energy_aware is enabled and highest utilization task is
+	 *    migrated to higher capacity CPU,
+	 * 2) throttled_lb_pair, or
+	 * 3) cannot be migrated to this CPU due to cpus_allowed, or
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
+
+	if (energy_aware() &&
+	    (capacity_orig_of(env->dst_cpu) > capacity_orig_of(env->src_cpu))) {
+
+		se = __pick_first_queue_util(env->src_rq);
+		if (&p->se != se)
+			return 0;
+
+		if (cpu_overutilized(env->dst_cpu) &&
+		    !idle_cpu(env->dst_cpu))
+			return 0;
+	}
+
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
--
1.9.1
Usually, after the system is over the tipping point, the optimal result is to filter out the big tasks and migrate them onto big cores while keeping the small tasks on LITTLE cores. But in the current code, once the system is over the tipping point and the "overutilized" flag is set on the root domain, the wakeup balance completely disables energy aware scheduling. So even if the woken task is a small task, it has no chance to use the energy aware algorithm to select a proper CPU for power saving, and small tasks may be migrated onto big cores after falling back to the traditional load balance.
This patch adds a check in the wakeup path: if any low capacity CPU has spare capacity for the woken task, it forces the energy aware scheduling path to be executed, regardless of whether the system is over the tipping point or not. Another situation that always runs energy aware scheduling is a task with a boosted margin, which means the task placement should be decided by the schedTune PE filter; so such tasks are required to go through the energy aware scheduling function.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec43670..429a10b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5442,6 +5442,39 @@ static bool cpu_overutilized(int cpu)
 	return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
 }
 
+static bool can_task_run_low_capacity_cpu(struct task_struct *p)
+{
+	int cpu = task_cpu(p);
+	int origin_max_cap = capacity_orig_of(cpu);
+	int target_max_cap = cpu_rq(cpu)->rd->max_cpu_capacity.val;
+	int i;
+
+	for_each_cpu_and(i, tsk_cpus_allowed(p), cpu_online_mask) {
+		if (capacity_orig_of(i) < target_max_cap &&
+		    task_fits_spare(p, i))
+			target_max_cap = capacity_orig_of(i);
+	}
+
+	if (capacity_orig_of(smp_processor_id()) > target_max_cap)
+		return 1;
+
+	if (origin_max_cap > target_max_cap)
+		return 1;
+
+	return 0;
+}
+
+static bool should_run_energy_aware_path(int cpu, struct task_struct *p)
+{
+	unsigned long margin = schedtune_task_margin(p);
+
+	if (margin)
+		return 1;
+
+	return !cpu_rq(cpu)->rd->overutilized ||
+	       can_task_run_low_capacity_cpu(p);
+}
+
 #ifdef CONFIG_SCHED_TUNE
 
 static long
@@ -6049,8 +6082,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	}
 
 	if (!sd) {
-		if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
-			new_cpu = energy_aware_wake_cpu(p, prev_cpu, sync);
+		if (energy_aware() && should_run_energy_aware_path(prev_cpu, p))
+			new_cpu = energy_aware_wake_cpu(p, cpu, sync);
 		else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
 			new_cpu = select_idle_sibling(p, new_cpu);
--
1.9.1