o This patch series evaluates whether an rb tree can be used to track task load and util on the rq. One concern with this approach is that an rb tree has O(log(N)) computational complexity, so maintaining it introduces extra overhead. To check this, hackbench is used as a stress test: it spawns a large number of message sender and receiver tasks, generating many enqueue and dequeue operations, so it shows whether rb tree maintenance adds significant overhead (thanks a lot to Chris for suggesting this).

Another concern is that the scheduler already provides the LB_MIN feature; with LB_MIN enabled the scheduler avoids migrating tasks with load < 16, which to some extent also filters out small tasks and leaves the big ones as migration candidates. So we need to compare power data between this patch series and simply enabling LB_MIN.
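For reference, the LB_MIN filter amounts to a single check in the per-task loop of detach_tasks(); roughly (paraphrased from kernel/sched/fair.c, details may differ slightly between kernel versions):

	load = task_h_load(p);

	/* LB_MIN: skip tiny tasks unless balancing has failed repeatedly */
	if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
		goto next;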
o Testing results:

Tested hackbench on Hikey (CA53 x8 CPUs) with SMP load balancing:
time sh -c 'for i in `seq 100`; do /data/hackbench -p -P > /dev/null; done'

             real      user      system
  baseline   6m00.57s  1m41.72s  34m38.18s
  rb tree    5m55.79s  1m33.68s  34m08.38s
For the hackbench test case we can see that with the rb tree the result is even slightly better than the baseline kernel.
Tested video playback on Juno for LB_MIN vs rb tree:
LB_MIN       Nrg:LITTLE    Nrg:Big      Nrg:Sum
---------------------------------------------------
             11.3122       8.983429     20.295629
             11.337446     8.174061     19.511507
             11.256941     8.547895     19.804836
             10.994329     9.633028     20.627357
             11.483148     8.522364     20.005512
avg.         11.2768128    8.7721554    20.0489682
rb tree      Nrg:LITTLE    Nrg:Big      Nrg:Sum
---------------------------------------------------
             11.384301     8.412714     19.797015
             11.673992     8.455219     20.129211
             11.586081     8.414606     20.000687
             11.423509     8.64781      20.071319
             11.43709      8.595252     20.032342
avg.         11.5009946    8.5051202    20.0061148
vs LB_MIN    +1.99%        -3.04%       -0.21%
o Known issues:
For patch 2, detach_tasks() iterates the rb tree for candidate tasks. Whenever a task has been detached, it calls rb_first() to fetch the first node again and restarts the iteration from there; it would be better to use rb_next(), but switching to rb_next() introduces a panic.
Any suggestions for a better implementation are welcome.
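One guess about the panic (a sketch only, not verified against this series): detaching a task dequeues it, which rb_erase()s its seq_node, so a subsequent rb_next() on that node walks stale links. Fetching the successor before detaching avoids that; a simplified skeleton, with the imbalance bookkeeping omitted:

	struct rb_node *node, *next;
	struct sched_entity *se;
	struct task_struct *p;

	node = rb_first(&env->src_rq->seq_node);
	while (node) {
		next = rb_next(node);		/* grab the successor first */
		se = rb_entry(node, struct sched_entity, seq_node);
		p = task_of(se);

		if (can_migrate_task(p, env))
			detach_task(p, env);	/* this erases se->seq_node */

		node = next;
	}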
Leo Yan (3):
  sched/fair: support to track biggest task on rq
  sched/fair: select biggest task for migration
  sched: remove unused rq::cfs_tasks
 include/linux/sched.h        |   1 +
 include/linux/sched/sysctl.h |   1 +
 kernel/sched/core.c          |   2 -
 kernel/sched/fair.c          | 123 ++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h         |   5 +-
 kernel/sysctl.c              |   7 +++
 6 files changed, 116 insertions(+), 23 deletions(-)
-- 1.9.1
The CFS scheduler uses rq::cfs_tasks as a list to track the tasks that are enqueued on its rq; during load balance the scheduler iterates this list to migrate tasks. Tasks are added to the list in effectively random order, so task migration ends up being random as well, without considering task load or utilization.
This patch introduces an extra rb tree in the rq and uses it to order tasks by their load or util metrics. With this rb tree it is easy to find which task has the biggest load, which is very helpful for distinguishing heavily loaded tasks.
Traditional SMP load balancing should honour SMP nice, so this patch introduces the flag sysctl_sched_task_seq_use_load, which can be set through procfs (/proc/sys/kernel/sched_task_seq_use_load). If sysctl_sched_task_seq_use_load = 1, the rb tree uses the load metric as its key; otherwise it uses the util_avg metric as the key, which can be used by energy-aware scheduling to filter out high-utilization tasks.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 include/linux/sched.h        |  1 +
 include/linux/sched/sysctl.h |  1 +
 kernel/sched/fair.c          | 66 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h         |  3 ++
 kernel/sysctl.c              |  7 +++++
 5 files changed, 78 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h index ede29e8..6e9a2f4 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1323,6 +1323,7 @@ struct ravg { struct sched_entity { struct load_weight load; /* for load-balancing */ struct rb_node run_node; + struct rb_node seq_node; struct list_head group_node; unsigned int on_rq;
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index d68e88c..3d28019 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -43,6 +43,7 @@ extern unsigned int sysctl_sched_is_big_little; extern unsigned int sysctl_sched_sync_hint_enable; extern unsigned int sysctl_sched_initial_task_util; extern unsigned int sysctl_sched_cstate_aware; +extern unsigned int sysctl_sched_task_seq_use_load; #ifdef CONFIG_SCHED_WALT extern unsigned int sysctl_sched_use_walt_cpu_util; extern unsigned int sysctl_sched_use_walt_task_util; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 3dfaf21..f2148d9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -57,6 +57,7 @@ unsigned int sysctl_sched_is_big_little = 0; unsigned int sysctl_sched_sync_hint_enable = 1; unsigned int sysctl_sched_initial_task_util = 0; unsigned int sysctl_sched_cstate_aware = 1; +unsigned int sysctl_sched_task_seq_use_load = 0;
#ifdef CONFIG_SCHED_WALT unsigned int sysctl_sched_use_walt_cpu_util = 1; @@ -2357,6 +2358,60 @@ static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p) } #endif /* CONFIG_NUMA_BALANCING */
+/* + * Enqueue/dequeue task into rb-tree based on task load or util metrics, + * so can easily to fetch biggest task from RQ. + */ +static inline int task_sequence_before(struct sched_entity *a, + struct sched_entity *b) +{ + s64 seq; + + if (sysctl_sched_task_seq_use_load) + seq = (s64)(task_h_load(task_of(a)) - task_h_load(task_of(b))); + else + seq = (s64)(a->avg.util_avg - b->avg.util_avg); + + return seq > 0; +} + +static void __enqueue_sequence(struct rq *rq, struct sched_entity *se) +{ + struct rb_node **link = &rq->seq_node.rb_node; + struct rb_node *parent = NULL; + struct sched_entity *entry; + + /* + * Find the right place in the rbtree: + */ + while (*link) { + parent = *link; + entry = rb_entry(parent, struct sched_entity, seq_node); + /* + * We dont care about collisions. Nodes with + * the same key stay together. + */ + if (task_sequence_before(se, entry)) + link = &parent->rb_left; + else + link = &parent->rb_right; + } + + rb_link_node(&se->seq_node, parent, link); + rb_insert_color(&se->seq_node, &rq->seq_node); +} + +static void __dequeue_sequence(struct rq *rq, struct sched_entity *se) +{ + rb_erase(&se->seq_node, &rq->seq_node); +} + +static void __update_sequence(struct rq *rq, struct sched_entity *se) +{ + __dequeue_sequence(rq, se); + __enqueue_sequence(rq, se); +} + static void account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) { @@ -2369,6 +2424,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
account_numa_enqueue(rq, task_of(se)); list_add(&se->group_node, &rq->cfs_tasks); + __enqueue_sequence(rq, se); } #endif cfs_rq->nr_running++; @@ -2383,6 +2439,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) if (entity_is_task(se)) { account_numa_dequeue(rq_of(cfs_rq), task_of(se)); list_del_init(&se->group_node); + __dequeue_sequence(rq_of(cfs_rq), se); } cfs_rq->nr_running--; } @@ -3246,6 +3303,9 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) update_stats_wait_end(cfs_rq, se); __dequeue_entity(cfs_rq, se); update_load_avg(se, 1); + + if (entity_is_task(se)) + __update_sequence(rq_of(cfs_rq), se); }
update_stats_curr_start(cfs_rq, se); @@ -3346,6 +3406,9 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev) __enqueue_entity(cfs_rq, prev); /* in !on_rq case, update occurred at dequeue */ update_load_avg(prev, 0); + + if (entity_is_task(prev)) + __update_sequence(rq_of(cfs_rq), prev); } cfs_rq->curr = NULL; } @@ -3364,6 +3427,9 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued) update_load_avg(curr, 1); update_cfs_shares(cfs_rq);
+ if (entity_is_task(curr)) + __update_sequence(rq_of(cfs_rq), curr); + #ifdef CONFIG_SCHED_HRTICK /* * queued ticks are scheduled to match the slice, so don't bother diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 2f2b959..7284a45 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -606,6 +606,9 @@ struct rq { seqcount_t ave_seqcnt; #endif
+ /* track highest load task */ + struct rb_root seq_node; + /* capture load from *all* tasks on this cpu: */ struct load_weight load; unsigned long nr_load_updates; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index d964422..14a3cfe 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -311,6 +311,13 @@ static struct ctl_table kern_table[] = { .mode = 0644, .proc_handler = proc_dointvec, }, + { + .procname = "sched_task_seq_use_load", + .data = &sysctl_sched_task_seq_use_load, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = proc_dointvec, + }, #ifdef CONFIG_SCHED_WALT { .procname = "sched_use_walt_cpu_util", -- 1.9.1
Based on the rq's rb tree, it is possible to directly select the biggest task for migration. This patch changes the load balancer to iterate the runnable tasks via the rb tree, from the biggest task down to the smallest, when checking whether the migration criteria are met. As a result, load balancing gets more chances to migrate big tasks.
The rb tree is used in two functions: in detach_tasks(), to balance between the busiest group and the local group, and in detach_one_task(), for active load balance.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/fair.c | 55 ++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 38 insertions(+), 17 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index f2148d9..0c68c80 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6847,25 +6847,40 @@ static void detach_task(struct task_struct *p, struct lb_env *env) static struct task_struct *detach_one_task(struct lb_env *env) { struct task_struct *p, *n; + struct rb_node *node; + struct sched_entity *se;
lockdep_assert_held(&env->src_rq->lock);
- list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) { - if (!can_migrate_task(p, env)) - continue; + p = NULL;
- detach_task(p, env); + node = rb_first(&env->src_rq->seq_node); + while (node) {
- /* - * Right now, this is only the second place where - * lb_gained[env->idle] is updated (other is detach_tasks) - * so we can safely collect stats here rather than - * inside detach_tasks(). - */ - schedstat_inc(env->sd, lb_gained[env->idle]); - return p; + se = rb_entry(node, struct sched_entity, seq_node); + n = task_of(se); + + if (can_migrate_task(n, env)) { + p = n; + break; + } + + node = rb_next(&se->seq_node); } - return NULL; + + if (!p) + return NULL; + + detach_task(p, env); + + /* + * Right now, this is only the second place where + * lb_gained[env->idle] is updated (other is detach_tasks) + * so we can safely collect stats here rather than + * inside detach_tasks(). + */ + schedstat_inc(env->sd, lb_gained[env->idle]); + return p; }
static const unsigned int sched_nr_migrate_break = 32; @@ -6878,17 +6893,22 @@ static const unsigned int sched_nr_migrate_break = 32; */ static int detach_tasks(struct lb_env *env) { - struct list_head *tasks = &env->src_rq->cfs_tasks; struct task_struct *p; unsigned long load; int detached = 0; + struct sched_entity *se; + struct rb_node *node;
lockdep_assert_held(&env->src_rq->lock);
if (env->imbalance <= 0) return 0;
- while (!list_empty(tasks)) { + node = rb_first(&env->src_rq->seq_node); + while (node) { + + se = rb_entry(node, struct sched_entity, seq_node); + /* * We don't want to steal all, otherwise we may be treated likewise, * which could at worst lead to a livelock crash. @@ -6896,7 +6916,7 @@ static int detach_tasks(struct lb_env *env) if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) break;
- p = list_first_entry(tasks, struct task_struct, se.group_node); + p = task_of(se);
env->loop++; /* We've more or less seen every task there is, call it quits */ @@ -6944,9 +6964,10 @@ static int detach_tasks(struct lb_env *env) if (env->imbalance <= 0) break;
+ node = rb_first(&env->src_rq->seq_node); continue; next: - list_move_tail(&p->se.group_node, tasks); + node = rb_next(&se->seq_node); }
/* -- 1.9.1
On Thu, Oct 27, 2016 at 01:28:51AM +0800, Leo Yan wrote:
@@ -6878,17 +6893,22 @@ static const unsigned int sched_nr_migrate_break = 32; */ static int detach_tasks(struct lb_env *env) {
struct list_head *tasks = &env->src_rq->cfs_tasks; struct task_struct *p; unsigned long load; int detached = 0;
struct sched_entity *se;
struct rb_node *node; lockdep_assert_held(&env->src_rq->lock); if (env->imbalance <= 0) return 0;
while (!list_empty(tasks)) {
node = rb_first(&env->src_rq->seq_node);
while (node) {
se = rb_entry(node, struct sched_entity, seq_node);
/* * We don't want to steal all, otherwise we may be treated likewise, * which could at worst lead to a livelock crash.
@@ -6896,7 +6916,7 @@ static int detach_tasks(struct lb_env *env) if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) break;
p = list_first_entry(tasks, struct task_struct, se.group_node);
p = task_of(se); env->loop++; /* We've more or less seen every task there is, call it quits */
@@ -6944,9 +6964,10 @@ static int detach_tasks(struct lb_env *env) if (env->imbalance <= 0) break;
node = rb_first(&env->src_rq->seq_node); continue;
next:
list_move_tail(&p->se.group_node, tasks);
node = rb_next(&se->seq_node);
I think you subtly break existing detach_tasks() behaviour here. list_move_tail() is cleverly used to rotate the list as we work through it such that the starting point is not the same each time we try to balance.
The rationale is that we may bail out before we have iterated through the entire list (env->loop++ < env->loop_max) if we think we have spent too much time trying to pull tasks. If we didn't pull any tasks we may start from the exact same position in the list and reach the same result for subsequent load-balancing attempts. To prevent iterating over the same env->loop_max tasks each time, we need to change the starting point. list_move_tail() does that by moving tasks that we couldn't migrate to the tail of the list.
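(To make the rotation concrete: if the list holds tasks [A, B, C, D] and env->loop_max is 2, and neither A nor B can be migrated, both are moved to the tail, leaving [C, D, A, B]; the next balance attempt then starts with C and D rather than re-examining A and B.)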
For that reason you can't use your rb-tree as a drop-in replacement for the task list.
Morten
Hi Morten,
On Thu, Oct 27, 2016 at 04:49:54PM +0100, Morten Rasmussen wrote:
On Thu, Oct 27, 2016 at 01:28:51AM +0800, Leo Yan wrote:
@@ -6878,17 +6893,22 @@ static const unsigned int sched_nr_migrate_break = 32; */ static int detach_tasks(struct lb_env *env) {
struct list_head *tasks = &env->src_rq->cfs_tasks; struct task_struct *p; unsigned long load; int detached = 0;
struct sched_entity *se;
struct rb_node *node; lockdep_assert_held(&env->src_rq->lock); if (env->imbalance <= 0) return 0;
while (!list_empty(tasks)) {
node = rb_first(&env->src_rq->seq_node);
while (node) {
se = rb_entry(node, struct sched_entity, seq_node);
/* * We don't want to steal all, otherwise we may be treated likewise, * which could at worst lead to a livelock crash.
@@ -6896,7 +6916,7 @@ static int detach_tasks(struct lb_env *env) if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) break;
p = list_first_entry(tasks, struct task_struct, se.group_node);
p = task_of(se); env->loop++; /* We've more or less seen every task there is, call it quits */
@@ -6944,9 +6964,10 @@ static int detach_tasks(struct lb_env *env) if (env->imbalance <= 0) break;
node = rb_first(&env->src_rq->seq_node); continue;
next:
list_move_tail(&p->se.group_node, tasks);
node = rb_next(&se->seq_node);
I think you subtly break existing detach_tasks() behaviour here. list_move_tail() is cleverly used to rotate the list as we work through it such that the starting point is not the same each time we try to balance.
The rationale is that we may bail out before we have iterated through the entire list (env->loop++ < env->loop_max) if we think we have spent too much time trying to pull tasks. If didn't pull any tasks we may start from the exact same position in the list and reach the same result for subsequent load-balancing attempts. To prevent that we continue iterating over the same env->loop_max tasks each time, we need to change the starting point. list_move_tail() does that by moving tasks that we couldn't migrate to the tail of the list.
For that reason you can't use your rb-tree as a drop-in replacement for the task list.
Thanks for reviewing.
IMHO, for this reason we could add a cached node pointing to the last node iterated in the previous attempt before bailing out, so that next time we start iterating from the cached node rather than from the root node. Do you think this is feasible?
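A rough sketch of what such a cache might look like (hypothetical only; the 'seq_resume' field is invented here, and it must be invalidated whenever the cached entity is dequeued):

	/* in struct rq: struct rb_node *seq_resume; */

	static void __dequeue_sequence(struct rq *rq, struct sched_entity *se)
	{
		/* don't leave a dangling resume pointer behind */
		if (rq->seq_resume == &se->seq_node)
			rq->seq_resume = rb_next(&se->seq_node);

		rb_erase(&se->seq_node, &rq->seq_node);
	}

	/* in detach_tasks(): resume from the cache, fall back to rb_first() */
	node = rq->seq_resume ? rq->seq_resume : rb_first(&rq->seq_node);

	/* ... and when bailing out early: */
	rq->seq_resume = node;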
Thanks, Leo Yan
On 27 October 2016 at 18:12, Leo Yan leo.yan@linaro.org wrote:
Hi Morten,
On Thu, Oct 27, 2016 at 04:49:54PM +0100, Morten Rasmussen wrote:
On Thu, Oct 27, 2016 at 01:28:51AM +0800, Leo Yan wrote:
@@ -6878,17 +6893,22 @@ static const unsigned int sched_nr_migrate_break = 32; */ static int detach_tasks(struct lb_env *env) {
struct list_head *tasks = &env->src_rq->cfs_tasks; struct task_struct *p; unsigned long load; int detached = 0;
struct sched_entity *se;
struct rb_node *node; lockdep_assert_held(&env->src_rq->lock); if (env->imbalance <= 0) return 0;
while (!list_empty(tasks)) {
node = rb_first(&env->src_rq->seq_node);
while (node) {
se = rb_entry(node, struct sched_entity, seq_node);
/* * We don't want to steal all, otherwise we may be treated likewise, * which could at worst lead to a livelock crash.
@@ -6896,7 +6916,7 @@ static int detach_tasks(struct lb_env *env) if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) break;
p = list_first_entry(tasks, struct task_struct, se.group_node);
p = task_of(se); env->loop++; /* We've more or less seen every task there is, call it quits */
@@ -6944,9 +6964,10 @@ static int detach_tasks(struct lb_env *env) if (env->imbalance <= 0) break;
node = rb_first(&env->src_rq->seq_node); continue;
next:
list_move_tail(&p->se.group_node, tasks);
node = rb_next(&se->seq_node);
I think you subtly break existing detach_tasks() behaviour here. list_move_tail() is cleverly used to rotate the list as we work through it such that the starting point is not the same each time we try to balance.
The rationale is that we may bail out before we have iterated through the entire list (env->loop++ < env->loop_max) if we think we have spent too much time trying to pull tasks. If didn't pull any tasks we may start from the exact same position in the list and reach the same result for subsequent load-balancing attempts. To prevent that we continue iterating over the same env->loop_max tasks each time, we need to change the starting point. list_move_tail() does that by moving tasks that we couldn't migrate to the tail of the list.
For that reason you can't use your rb-tree as a drop-in replacement for the task list.
Thanks for reveiwing.
IMHO, for this reason I think it's easily to add a cache node to point to last one node which iterate in previous time before bail out. So next time we can start to iterate from the cache node but not the root node. Do you think this is feasible?
But in this case, don't you lose all the benefit of sorting the tasks in the rb tree?
Thanks, Leo Yan
On Thu, Oct 27, 2016 at 06:14:14PM +0200, Vincent Guittot wrote:
On 27 October 2016 at 18:12, Leo Yan leo.yan@linaro.org wrote:
Hi Morten,
On Thu, Oct 27, 2016 at 04:49:54PM +0100, Morten Rasmussen wrote:
On Thu, Oct 27, 2016 at 01:28:51AM +0800, Leo Yan wrote:
@@ -6878,17 +6893,22 @@ static const unsigned int sched_nr_migrate_break = 32; */ static int detach_tasks(struct lb_env *env) {
struct list_head *tasks = &env->src_rq->cfs_tasks; struct task_struct *p; unsigned long load; int detached = 0;
struct sched_entity *se;
struct rb_node *node; lockdep_assert_held(&env->src_rq->lock); if (env->imbalance <= 0) return 0;
while (!list_empty(tasks)) {
node = rb_first(&env->src_rq->seq_node);
while (node) {
se = rb_entry(node, struct sched_entity, seq_node);
/* * We don't want to steal all, otherwise we may be treated likewise, * which could at worst lead to a livelock crash.
@@ -6896,7 +6916,7 @@ static int detach_tasks(struct lb_env *env) if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) break;
p = list_first_entry(tasks, struct task_struct, se.group_node);
p = task_of(se); env->loop++; /* We've more or less seen every task there is, call it quits */
@@ -6944,9 +6964,10 @@ static int detach_tasks(struct lb_env *env) if (env->imbalance <= 0) break;
node = rb_first(&env->src_rq->seq_node); continue;
next:
list_move_tail(&p->se.group_node, tasks);
node = rb_next(&se->seq_node);
I think you subtly break existing detach_tasks() behaviour here. list_move_tail() is cleverly used to rotate the list as we work through it such that the starting point is not the same each time we try to balance.
The rationale is that we may bail out before we have iterated through the entire list (env->loop++ < env->loop_max) if we think we have spent too much time trying to pull tasks. If didn't pull any tasks we may start from the exact same position in the list and reach the same result for subsequent load-balancing attempts. To prevent that we continue iterating over the same env->loop_max tasks each time, we need to change the starting point. list_move_tail() does that by moving tasks that we couldn't migrate to the tail of the list.
For that reason you can't use your rb-tree as a drop-in replacement for the task list.
Thanks for reveiwing.
IMHO, for this reason I think it's easily to add a cache node to point to last one node which iterate in previous time before bail out. So next time we can start to iterate from the cache node but not the root node. Do you think this is feasible?
But in this case, don't you lose all interest of sorting the tasks in rb tree ?
I think the more important thing is to get clear on whether it makes sense to iterate the rq's tasks in order from the biggest task to the smallest in the first place.

If so, then every balance attempt should start from the biggest task, which gives big tasks more chance to be migrated.

Will do more analysis of this.
Thanks, Leo Yan
After introducing the rb tree for tracking tasks on the rq, the rq::cfs_tasks list is not used anymore, so this patch removes it.
Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 kernel/sched/core.c  | 2 --
 kernel/sched/fair.c  | 2 --
 kernel/sched/sched.h | 2 --
 3 files changed, 6 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 778335a..5d80bf8 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -7812,8 +7812,6 @@ void __init sched_init(void) rq->irqload_ts = 0; #endif
- INIT_LIST_HEAD(&rq->cfs_tasks); - rq_attach_root(rq, &def_root_domain); #ifdef CONFIG_NO_HZ_COMMON rq->nohz_flags = 0; diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0c68c80..77ca4df 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -2423,7 +2423,6 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se) struct rq *rq = rq_of(cfs_rq);
account_numa_enqueue(rq, task_of(se)); - list_add(&se->group_node, &rq->cfs_tasks); __enqueue_sequence(rq, se); } #endif @@ -2438,7 +2437,6 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se) update_load_sub(&rq_of(cfs_rq)->load, se->load.weight); if (entity_is_task(se)) { account_numa_dequeue(rq_of(cfs_rq), task_of(se)); - list_del_init(&se->group_node); __dequeue_sequence(rq_of(cfs_rq), se); } cfs_rq->nr_running--; diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 7284a45..3907bc9 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -659,8 +659,6 @@ struct rq { int cpu; int online;
- struct list_head cfs_tasks; - u64 rt_avg; u64 age_stamp; u64 idle_stamp; -- 1.9.1
Hi Leo,
On Thu, Oct 27, 2016 at 01:28:49AM +0800, Leo Yan wrote:
o This patch series is to evaluate if can use rb tree to track task load and util on rq; there have some concern for this method is: rb tree has O(log(N)) computation complexity, so this will introduce extra workload by rb tree's maintainence. For this concern using hackbench to do stress testing, hackbench will generate mass tasks for message sender and receiver, so there will have many enqueue and dequeue operations, so we can use hackbench to get to know if rb tree will introduce big workload or not (Thanks a lot for Chris suggestion for this).
Another concern is scheduler has provided LB_MIN feature, after enable feature LB_MIN the scheduler will avoid to migrate task with load < 16. Somehow this also can help to filter out big tasks for migration. So we need compare power data between this patch series with directly setting LB_MIN.
o Testing result:
Tested hackbench on Hikey with CA53x8 CPUs with SMP load balance:
time sh -c 'for i in `seq 100`; do /data/hackbench -p -P > /dev/null; done' real user system baseline 6m00.57s 1m41.72s 34m38.18s rb tree 5m55.79s 1m33.68s 34m08.38s
For hackbench test case we can see with rb tree it even has better result than baseline kernel.
It is around a 1% difference, so pretty much in the noise? Have you tried longer runs of hackbench? For the capacity awareness postings I'm using:
  perf stat --null --repeat 10 -- \
      perf bench sched messaging -g 50 -l 1000
I'm still curious how many tasks are actually on the rq for hackbench, but it seems that overhead isn't a major issue.
Tested video playback on Juno for LB_MIN vs rb tree:
LB_MIN Nrg:LITTLE Nrg:Big Nrg:Sum
11.3122 8.983429 20.295629 11.337446 8.174061 19.511507 11.256941 8.547895 19.804836 10.994329 9.633028 20.627357 11.483148 8.522364 20.005512 avg. 11.2768128 8.7721554 20.0489682
stdev 0.431777914
rb tree Nrg:LITTLE Nrg:Big Nrg:Sum
11.384301 8.412714 19.797015 11.673992 8.455219 20.129211 11.586081 8.414606 20.000687 11.423509 8.64781 20.071319 11.43709 8.595252 20.032342 avg. 11.5009946 8.5051202 20.0061148
stdev 0.1263371635
vs LB_MIN +1.99% -3.04% -0.21%
Should I read this as the energy benefit of the rb-tree solution being negligible? It seems to be much smaller than the error margins.
Do we have evidence that the util-rb-tree solution is any better than using LB_MIN?
Thanks, Morten
Hi Morten,
On Thu, Oct 27, 2016 at 02:38:12PM +0100, Morten Rasmussen wrote:
[...]
o Testing result:
Tested hackbench on Hikey with CA53x8 CPUs with SMP load balance:
time sh -c 'for i in `seq 100`; do /data/hackbench -p -P > /dev/null; done' real user system baseline 6m00.57s 1m41.72s 34m38.18s rb tree 5m55.79s 1m33.68s 34m08.38s
For hackbench test case we can see with rb tree it even has better result than baseline kernel.
It is around 1% difference, so pretty much in the noise? Have tried longer runs of hackbench? For the capacity awareness postings I'm using:
Sorry, in my previous testing I wrongly used WALT signals, so please ignore those results.
I did more testing with PELT signals; below are the results:
baseline:
  5m57.86s real   1m42.36s user   34m23.30s system   (PELT)
  5m56.60s real   1m41.45s user   34m16.23s system   (PELT)

with rb-tree patches:
  5m43.84s real   1m35.82s user   32m56.43s system   (PELT)
  5m46.67s real   1m35.39s user   33m18.07s system   (PELT)
The real run time does not decrease by much (10s ~ 14s), but the system time, which is accumulated across all cores, decreases noticeably (58.16s ~ 86.87s).
perf stat --null --repeat 10 -- \ perf bench sched messaging -g 50 -l 1000
I'm still curious how many tasks that are actually on the rq for hackbench, but it seems that overhead isn't a major issue.
From the code, hackbench launches 40 processes at a time (20 processes as message senders and 20 as receivers).
Do you think this is similar to the perf test you mentioned?
Tested video playback on Juno for LB_MIN vs rb tree:
LB_MIN Nrg:LITTLE Nrg:Big Nrg:Sum
11.3122 8.983429 20.295629 11.337446 8.174061 19.511507 11.256941 8.547895 19.804836 10.994329 9.633028 20.627357 11.483148 8.522364 20.005512 avg. 11.2768128 8.7721554 20.0489682
stdev 0.431777914
rb tree Nrg:LITTLE Nrg:Big Nrg:Sum
11.384301 8.412714 19.797015 11.673992 8.455219 20.129211 11.586081 8.414606 20.000687 11.423509 8.64781 20.071319 11.43709 8.595252 20.032342 avg. 11.5009946 8.5051202 20.0061148
stdev 0.1263371635
vs LB_MIN +1.99% -3.04% -0.21%
Should I read this as the energy benefits of the rb-tree solution is the negligible? It seems to be much smaller than the error margins.
From the power data on Juno, the overall difference is quite small. But if we look at the energy per cluster, we can see the benefit of reducing the big cluster's energy by running more tasks on the LITTLE cluster.
From my previous experience, the power optimization patches can save about 10% power on another big.LITTLE system for a camera use case, but they only save 2.97% CPU power for video playback on Juno. This is because video playback does not have enough small tasks; for cases with many small tasks we can see more benefit.
Do you have a suggestion for a power test case on Juno?
Does we have evidence that the util-rb-tree solution is any better than using LB_MIN?
I think the evidence is that the big cluster saves 3.04% energy compared to LB_MIN; that is not a big difference on Juno. The difference could well be larger in scenarios with many tasks (like the camera case).
But yes, this should be verified on other SoCs for more confidence.
Thanks, Leo Yan
On Thu, Oct 27, 2016 at 10:28:41PM +0800, Leo Yan wrote:
Hi Morten,
On Thu, Oct 27, 2016 at 02:38:12PM +0100, Morten Rasmussen wrote:
[...]
o Testing result:
Tested hackbench on Hikey with CA53x8 CPUs with SMP load balance:
time sh -c 'for i in `seq 100`; do /data/hackbench -p -P > /dev/null; done' real user system baseline 6m00.57s 1m41.72s 34m38.18s rb tree 5m55.79s 1m33.68s 34m08.38s
For hackbench test case we can see with rb tree it even has better result than baseline kernel.
It is around 1% difference, so pretty much in the noise? Have tried longer runs of hackbench? For the capacity awareness postings I'm using:
Sorry in my previous testing I wrongly used WALT signals, so please ignore it.
I did more testing on it for PELT signals, below are testing result:
baseline: 5m57.86s real 1m42.36s user 34m23.30s system (PELT) 5m56.60s real 1m41.45s user 34m16.23s system (PELT)
with rb-tree patches: 5m43.84s real 1m35.82s user 32m56.43s system (PELT) 5m46.67s real 1m35.39s user 33m18.07s system (PELT)
The real run time is not decreased too much (10s ~ 14s), but we can see the system time can decrease obviously (58.16s ~ 86.97s) for multi-core's processing.
So we have ~3% improvement for real run time and ~6% user time, which seems significant.
The next question is why. It is an SMP system, so it shouldn't matter how the task list is ordered, and I doubt that an RB tree is faster to use than the list data structure. So I can't really explain the numbers.
perf stat --null --repeat 10 -- \ perf bench sched messaging -g 50 -l 1000
I'm still curious how many tasks that are actually on the rq for hackbench, but it seems that overhead isn't a major issue.
From the code, at the same time hackbench launches 40 processes (20 processes for messager's senders and 20 for receivers).
How many processes in total?
The perf bench command gives you 20 senders + 20 receivers in each of the 50 groups, for a total of 2000 processes.
Do you think this is similiar testing with perf you meantioned?
Tested video playback on Juno for LB_MIN vs rb tree:
LB_MIN Nrg:LITTLE Nrg:Big Nrg:Sum
11.3122 8.983429 20.295629 11.337446 8.174061 19.511507 11.256941 8.547895 19.804836 10.994329 9.633028 20.627357 11.483148 8.522364 20.005512 avg. 11.2768128 8.7721554 20.0489682
stdev 0.431777914
rb tree Nrg:LITTLE Nrg:Big Nrg:Sum
11.384301 8.412714 19.797015 11.673992 8.455219 20.129211 11.586081 8.414606 20.000687 11.423509 8.64781 20.071319 11.43709 8.595252 20.032342 avg. 11.5009946 8.5051202 20.0061148
stdev 0.1263371635
vs LB_MIN +1.99% -3.04% -0.21%
Should I read this as the energy benefits of the rb-tree solution is the negligible? It seems to be much smaller than the error margins.
From the power data on Juno, the final difference is quite small. But if we review the energy for big cluster and little cluster, we can see the benefit by reducing big core's energy and uses more LITTLE core for tasks.
From my previous experience, the power optimization patches can save power 10% on another b.L system for camera case, but it only can save CPU power 2.97% for video playback on Juno. This is caused by video playback still have no enough small tasks, for the case with many small tasks we can see more benefit.
Do you have suggestion for power testing case on Juno?
In the end it is real energy consumption that matters; shifting energy from big to little doesn't really help if the sum is the same, which seems to be the case for the Juno test.
You are suggesting that video doesn't have enough small tasks. Would it be possible to use rtapp to generate a bunch of small tasks of roughly similar size to those you see for the camera case?
Does we have evidence that the util-rb-tree solution is any better than using LB_MIN?
I think the evidence is big core saving 3.04% energy than LB_MIN, this is not big difference on Juno. The difference is possible to enlarged for the scenario with many tasks (like camera case).
The energy sum difference is only 0.21%.
But yes, this should be verified on other SoCs for more confidence.
A synthetic test and a good theory to explain the numbers are also quite helpful in convincing people :)
Thanks, Morten
On Thu, Oct 27, 2016 at 05:14:58PM +0100, Morten Rasmussen wrote:
[...]
Sorry in my previous testing I wrongly used WALT signals, so please ignore it.
I did more testing on it for PELT signals, below are testing result:
baseline: 5m57.86s real 1m42.36s user 34m23.30s system (PELT) 5m56.60s real 1m41.45s user 34m16.23s system (PELT)
with rb-tree patches: 5m43.84s real 1m35.82s user 32m56.43s system (PELT) 5m46.67s real 1m35.39s user 33m18.07s system (PELT)
The real run time is not decreased too much (10s ~ 14s), but we can see the system time can decrease obviously (58.16s ~ 86.97s) for multi-core's processing.
So we have ~3% improvement for real run time and ~6% user time, which seems significant.
The next question is why? It is an SMP system, it shouldn't matter how the task list is ordered and I doubt that an RB tree is faster to use than the list data structure. So I can't really explain the numbers.
I have not dug into the trace logs yet. At the beginning I expected the data to be worse after applying the rb tree patches, but I got the opposite result.
perf stat --null --repeat 10 -- \ perf bench sched messaging -g 50 -l 1000
I'm still curious how many tasks that are actually on the rq for hackbench, but it seems that overhead isn't a major issue.
From the code, at the same time hackbench launches 40 processes (20 processes for messager's senders and 20 for receivers).
How many processes in total?
40 processes x 10 groups = 400 processes for hackbench.
The perf bench command gives you 20 sender + 20 receivers in each of the 50 groups for a total of 2000 processes.
Do you think this is similiar testing with perf you meantioned?
Tested video playback on Juno for LB_MIN vs rb tree:
LB_MIN Nrg:LITTLE Nrg:Big Nrg:Sum
11.3122 8.983429 20.295629 11.337446 8.174061 19.511507 11.256941 8.547895 19.804836 10.994329 9.633028 20.627357 11.483148 8.522364 20.005512 avg. 11.2768128 8.7721554 20.0489682
stdev 0.431777914
rb tree Nrg:LITTLE Nrg:Big Nrg:Sum
11.384301 8.412714 19.797015 11.673992 8.455219 20.129211 11.586081 8.414606 20.000687 11.423509 8.64781 20.071319 11.43709 8.595252 20.032342 avg. 11.5009946 8.5051202 20.0061148
stdev 0.1263371635
vs LB_MIN +1.99% -3.04% -0.21%
Should I read this as the energy benefits of the rb-tree solution is the negligible? It seems to be much smaller than the error margins.
From the power data on Juno, the final difference is quite small. But if we review the energy for big cluster and little cluster, we can see the benefit by reducing big core's energy and uses more LITTLE core for tasks.
From my previous experience, the power optimization patches can save power 10% on another b.L system for camera case, but it only can save CPU power 2.97% for video playback on Juno. This is caused by video playback still have no enough small tasks, for the case with many small tasks we can see more benefit.
Do you have suggestion for power testing case on Juno?
In the end it is real energy consumption that matters, shifting energy from big to little doesn't really matter if the sum is the same which seems to be the case for the Juno test.
You are suggesting that video doesn't have enough small tasks. Would it be possible to use rtapp to generate a bunch of small tasks of roughly similar size to those you see for the camera case?
Let me first gather the camera-case logs, if possible.
Does we have evidence that the util-rb-tree solution is any better than using LB_MIN?
I think the evidence is big core saving 3.04% energy than LB_MIN, this is not big difference on Juno. The difference is possible to enlarged for the scenario with many tasks (like camera case).
The energy sum difference is only 0.21%.
Yeah.
But yes, this should be verified on other SoCs for more confidence.
A synthetic test and a good theory to explain the numbers is also quite helpful in convincing people :)
Yes, agreed :) Will do this and try to use rt-app.
Thanks, Leo Yan
Hi Morten,
On Thu, Oct 27, 2016 at 05:14:58PM +0100, Morten Rasmussen wrote:
[...]
Tested video playback on Juno for LB_MIN vs rb tree:
LB_MIN Nrg:LITTLE Nrg:Big Nrg:Sum
11.3122 8.983429 20.295629 11.337446 8.174061 19.511507 11.256941 8.547895 19.804836 10.994329 9.633028 20.627357 11.483148 8.522364 20.005512 avg. 11.2768128 8.7721554 20.0489682
stdev 0.431777914
rb tree Nrg:LITTLE Nrg:Big Nrg:Sum
11.384301 8.412714 19.797015 11.673992 8.455219 20.129211 11.586081 8.414606 20.000687 11.423509 8.64781 20.071319 11.43709 8.595252 20.032342 avg. 11.5009946 8.5051202 20.0061148
stdev 0.1263371635
vs LB_MIN +1.99% -3.04% -0.21%
Should I read this as the energy benefits of the rb-tree solution is the negligible? It seems to be much smaller than the error margins.
From the power data on Juno, the final difference is quite small. But if we review the energy for big cluster and little cluster, we can see the benefit by reducing big core's energy and uses more LITTLE core for tasks.
From my previous experience, the power optimization patches can save power 10% on another b.L system for camera case, but it only can save CPU power 2.97% for video playback on Juno. This is caused by video playback still have no enough small tasks, for the case with many small tasks we can see more benefit.
Do you have suggestion for power testing case on Juno?
In the end it is real energy consumption that matters, shifting energy from big to little doesn't really matter if the sum is the same which seems to be the case for the Juno test.
You are suggesting that video doesn't have enough small tasks. Would it be possible to use rtapp to generate a bunch of small tasks of roughly similar size to those you see for the camera case?
Following up on your suggestion, I wrote an rt-app JSON script to generate a synthetic test for the camera use case and uploaded it to GitHub [1].
Based on this camera case, I ran several power tests on Juno. In the tests below I set the big cluster's scaling_min_freq to 600MHz, 1000MHz and 1200MHz separately; this deliberately enlarges the power-efficiency difference between the two clusters, so we can see how much power benefit the power optimization patches can bring.
Big cluster's scaling_min_freq = 600MHz:
               W/o power opt. (J)   W/t power opt. (J)   Comparison
  Big Nrg.     22.30                10.44                -53.18%
  LITTLE Nrg.  119.98               131.58               +9.67%
  Total Nrg.   142.28               142.02               -0.18%

Big cluster's scaling_min_freq = 1000MHz:
               W/o power opt. (J)   W/t power opt. (J)   Comparison
  Big Nrg.     28.53                12.85                -54.96%
  LITTLE Nrg.  120.47               132.06               +9.62%
  Total Nrg.   149                  144.91               -2.74%

Big cluster's scaling_min_freq = 1200MHz:
               W/o power opt. (J)   W/t power opt. (J)   Comparison
  Big Nrg.     38.39                16.58                -54.96%
  LITTLE Nrg.  120.01               131.35               +9.62%
  Total Nrg.   158.4                147.93               -6.6%
One thing to mention here: I reviewed the power modelling on Juno-r2 a bit; the tables below show its voltage/OPP mapping. For the LITTLE cluster, the 950MHz OPP raises the voltage to 1010mV, an 11.11% increase over the 909mV used at 800MHz. So the LITTLE core's 950MHz OPP has much worse power efficiency, roughly similar to the big core's at 600MHz. This is why, when the big cluster's minimum frequency is set to 600MHz, we see no benefit in the total power data (-0.18%) even though small tasks are placed on the LITTLE cluster. So with Juno's default configuration the power optimization patches cannot show much benefit, but we should see a benefit on systems where the two clusters have a bigger power-efficiency gap.
LITTLE core voltage levels:
  450MHz  -> 829mV
  800MHz  -> 909mV
  950MHz  -> 1010mV

Big core voltage levels:
  600MHz  -> 829mV
  1000MHz -> 909mV
  1200MHz -> 1010mV
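(For reference, the 11.11% figure above is (1010mV - 909mV) / 909mV ~= 11.1%; and since dynamic CPU power scales roughly with f * V^2, that voltage step makes the 950MHz OPP disproportionately expensive compared with 800MHz.)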
[1] https://github.com/ARM-software/workload-automation/pull/284/commits/d3c59e2...
Thanks, Leo Yan
Hi Leo,
On 26/10/16 18:28, Leo Yan wrote:
o This patch series is to evaluate if can use rb tree to track task load and util on rq; there have some concern for this method is: rb tree has O(log(N)) computation complexity, so this will introduce extra workload by rb tree's maintainence. For this concern using hackbench to do stress testing, hackbench will generate mass tasks for message sender and receiver, so there will have many enqueue and dequeue operations, so we can use hackbench to get to know if rb tree will introduce big workload or not (Thanks a lot for Chris suggestion for this).
Another concern is scheduler has provided LB_MIN feature, after enable feature LB_MIN the scheduler will avoid to migrate task with load < 16. Somehow this also can help to filter out big tasks for migration. So we need compare power data between this patch series with directly setting LB_MIN.
I have difficulty understanding the whole idea here. Basically, you're still doing classical load-balancing (lb), with the aim of bringing env->imbalance ((runnable) load based) down to 0. On a system like Hikey (SMP), any ordering of the tasks (by load or util) can potentially change how many tasks a dst cpu pulls: in the case of a list ordered from large to small load we potentially pull only one task (not necessarily the first one, because of 'task_h_load(p)/2 > env->imbalance', in case its load is smaller but close to env->imbalance/2). But how can this help to increase performance in a workload-agnostic way? Besides this periodic lb case, what's the benefit of potentially pulling the biggest task instead of an arbitrary one in active lb?
Now if we think about a big.LITTLE system and we are on DIE sd level, an ordered list (large to small) on little cpu and vice versa on big could potentially make sense. Not very scalable though because we're just exploiting the idea of _two_ items with an opposite property (big vs. LITTLE) and different order.
Another remark: where do you see the benefit of using util instead of load? In both cases you still do 'task_h_load(p)/2 > env->imbalance', 'env->imbalance -= task_h_load(p)' and 'if (env->imbalance <= 0) break', i.e. a load-based policy anyway?
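For context, the expressions quoted here sit in the per-task loop of detach_tasks(); roughly (paraphrased and trimmed, exact context varies between kernel versions):

	load = task_h_load(p);

	/* skip a task whose load alone is more than twice the remaining imbalance */
	if ((load / 2) > env->imbalance)
		goto next;

	detach_task(p, env);
	detached++;
	env->imbalance -= load;

	/* enough has been pulled, stop */
	if (env->imbalance <= 0)
		break;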
The LB_MIN thing (avoiding considering tiny tasks for pulling) is a related idea, but IMHO aren't we seeing results which are pretty much inside the noise-cloud here, for one particular system/workload combination?
Let's talk about it further in tomorrow's meeting.
-- Dietmar
o Testing result:
Tested hackbench on Hikey with CA53x8 CPUs with SMP load balance:
time sh -c 'for i in `seq 100`; do /data/hackbench -p -P > /dev/null; done' real user system baseline 6m00.57s 1m41.72s 34m38.18s rb tree 5m55.79s 1m33.68s 34m08.38s
For hackbench test case we can see with rb tree it even has better result than baseline kernel.
Tested video playback on Juno for LB_MIN vs rb tree:
LB_MIN Nrg:LITTLE Nrg:Big Nrg:Sum
11.3122 8.983429 20.295629 11.337446 8.174061 19.511507 11.256941 8.547895 19.804836 10.994329 9.633028 20.627357 11.483148 8.522364 20.005512 avg. 11.2768128 8.7721554 20.0489682
rb tree Nrg:LITTLE Nrg:Big Nrg:Sum
11.384301 8.412714 19.797015 11.673992 8.455219 20.129211 11.586081 8.414606 20.000687 11.423509 8.64781 20.071319 11.43709 8.595252 20.032342 avg. 11.5009946 8.5051202 20.0061148 vs LB_MIN +1.99% -3.04% -0.21%
o Known issues:
For patch 2, function detach_tasks() iterates rb tree for tasks, if there have one task has been detached then it calls rb_first() to fetch first node and it will iterate again from first node; it's better to use rb_next() but after change to use rb_next() will introduce panic.
Welcome any suggestion for better implementation for it.
Leo Yan (3): sched/fair: support to track biggest task on rq sched/fair: select biggest task for migration sched: remove unused rq::cfs_tasks
include/linux/sched.h | 1 + include/linux/sched/sysctl.h | 1 + kernel/sched/core.c | 2 - kernel/sched/fair.c | 123 ++++++++++++++++++++++++++++++++++++------- kernel/sched/sched.h | 5 +- kernel/sysctl.c | 7 +++ 6 files changed, 116 insertions(+), 23 deletions(-)
On Thu, Oct 27, 2016 at 08:37:05PM +0100, Dietmar Eggemann wrote:
Hi Leo,
On 26/10/16 18:28, Leo Yan wrote:
o This patch series is to evaluate if can use rb tree to track task load and util on rq; there have some concern for this method is: rb tree has O(log(N)) computation complexity, so this will introduce extra workload by rb tree's maintainence. For this concern using hackbench to do stress testing, hackbench will generate mass tasks for message sender and receiver, so there will have many enqueue and dequeue operations, so we can use hackbench to get to know if rb tree will introduce big workload or not (Thanks a lot for Chris suggestion for this).
Another concern is scheduler has provided LB_MIN feature, after enable feature LB_MIN the scheduler will avoid to migrate task with load < 16. Somehow this also can help to filter out big tasks for migration. So we need compare power data between this patch series with directly setting LB_MIN.
I have difficulties to understand the whole idea here. Basically, you're still doing classical load-balancing (lb) (with the aim of setting env->imbalance ((runnable) load based) to 0. On a system like Hikey (SMP) any order (load or util related) of the tasks can potentially change how many tasks a dst cpu might pull (in case of an ordered list (large to small load) we potentially pull only one task (doesn't have to be the first one because 'task_h_load(p)/2 > env->imbalance', in case its load is smaller but close to env->imbalance/2). But how can this help to increase performance in a workload-agnostic way?
On 4.4, the result is better than with the simple list. Vincent also suggested that I do a comparison on the mainline kernel; that result shows my rb-tree patches introduce a performance regression:
mainline kernel:
  real 2m23.701s   user 1m4.500s    sys 4m34.604s

mainline kernel + fork regression patch:
  real 2m24.377s   user 1m3.952s    sys 4m39.928s
  real 2m19.100s   user 0m48.776s   sys 3m33.440s

mainline with big task tracking:
  real 2m28.837s   user 1m16.388s   sys 5m26.864s
  real 2m28.501s   user 1m18.104s   sys 5m30.516s

mainline with big task tracking + fork regression patch:
  real 2m29.369s   user 1m16.432s   sys 5m27.668s
  real 2m28.888s   user 1m14.792s   sys 5m27.800s
fork regression patch: https://lkml.kernel.org/r/20161010173440.GA28945@linaro.org
Besides this periodic lb case, what's the benefit of potentially pulling the biggest task instead of an arbitrary one in active lb?
The original idea is to find a method to migrate only big tasks from LITTLE cores to big cores and avoid migrating small tasks to big cores, in order to save power.

So from the beginning I have thought the rb tree's potential benefit is for power (I also hope it can help a single big task be migrated to a big core more easily).
Now if we think about a big.LITTLE system and we are on DIE sd level, an ordered list (large to small) on little cpu and vice versa on big could potentially make sense. Not very scalable though because we're just exploiting the idea of _two_ items with an opposite property (big vs. LITTLE) and different order.
Good point. I only considered task migration from LITTLE cores to big cores. We should iterate tasks from small to large when migrating from a big core to a LITTLE core.
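A sketch of what that could look like (illustrative only; 'dst_is_little' is an invented flag, and with the tree ordered biggest-first, rb_last()/rb_prev() walk it from the smallest key upwards):

	struct rb_node *node;
	struct sched_entity *se;

	node = dst_is_little ? rb_last(&env->src_rq->seq_node)
			     : rb_first(&env->src_rq->seq_node);

	while (node) {
		se = rb_entry(node, struct sched_entity, seq_node);

		/* ... migration checks on task_of(se) ... */

		node = dst_is_little ? rb_prev(node) : rb_next(node);
	}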
Another remark, where do you see the benefit of using util instead of load? In both cases, you do 'task_h_load(p)/2 > env->imbalance), env->imbalance -= task_h_load(p) and if (env->imbalance <= 0) break' i.e. a load based policy any way?
Regarding using util instead of load: as Juri pointed out in a previous meeting, this may potentially introduce an SMP-nice issue. But in practice the load value reflects running time + runnable time, while the util value reflects only running time. So util will prefer the longer-running of two tasks even when the two have similar load values.
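For example (illustrative numbers only): a task that runs 40% of the time but spends another 40% waiting on a busy runqueue, and a task that runs 70% of the time with little queueing, end up with fairly similar load contributions (roughly 0.8 vs 0.7 of their weight) but clearly different util_avg (roughly 40% vs 70% of capacity).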
I don't change any policy for the migration conditions. But maybe this is a good hint for optimization :)
The LB_MIN thing (avoiding to consider tiny tasks for pulling) is a related thing but IMHO, aren't we seeing results which are pretty much inside the noise-cloud here for a particular system/workload combination?
Sorry, I'm not quite clear on your question. Do you mean we cannot get strong evidence from the test results to show which of LB_MIN or rb-tree is better?
Let's talk about it further in tomorrow's meeting.
Sure.
Thanks, Leo Yan
On 28 October 2016 at 10:13, Leo Yan leo.yan@linaro.org wrote:
On Thu, Oct 27, 2016 at 08:37:05PM +0100, Dietmar Eggemann wrote:
Hi Leo,
On 26/10/16 18:28, Leo Yan wrote:
o This patch series is to evaluate if can use rb tree to track task load and util on rq; there have some concern for this method is: rb tree has O(log(N)) computation complexity, so this will introduce extra workload by rb tree's maintainence. For this concern using hackbench to do stress testing, hackbench will generate mass tasks for message sender and receiver, so there will have many enqueue and dequeue operations, so we can use hackbench to get to know if rb tree will introduce big workload or not (Thanks a lot for Chris suggestion for this).
Another concern is scheduler has provided LB_MIN feature, after enable feature LB_MIN the scheduler will avoid to migrate task with load < 16. Somehow this also can help to filter out big tasks for migration. So we need compare power data between this patch series with directly setting LB_MIN.
I have difficulties to understand the whole idea here. Basically, you're still doing classical load-balancing (lb) (with the aim of setting env->imbalance ((runnable) load based) to 0. On a system like Hikey (SMP) any order (load or util related) of the tasks can potentially change how many tasks a dst cpu might pull (in case of an ordered list (large to small load) we potentially pull only one task (doesn't have to be the first one because 'task_h_load(p)/2 > env->imbalance', in case its load is smaller but close to env->imbalance/2). But how can this help to increase performance in a workload-agnostic way?
On 4.4, the result is better than simple list. Vincent also suggested me to do comparision on mainline kernel, the result shows my rt-tree patches introduce performance regression:
mainline kernel real 2m23.701s user 1m4.500s sys 4m34.604s
mainline kernel + fork regression patch real 2m24.377s user 1m3.952s sys 4m39.928s real 2m19.100s user 0m48.776s sys 3m33.440s
mainline with big task tracking: real 2m28.837s user 1m16.388s sys 5m26.864s real 2m28.501s user 1m18.104s sys 5m30.516s
That would be interesting to understand: where does the huge difference between the mainline results above and your 1st test with v4.4 come from? The 1st results on v4.4 were:

             real      user      system
  baseline   6m00.57s  1m41.72s  34m38.18s
  rb tree    5m55.79s  1m33.68s  34m08.38s
Is the difference linked to v4.4 vs mainline? A different version of hackbench? A different rootfs/distro? Something else?
mainline with big task tracking + fork regression patch real 2m29.369s user 1m16.432s sys 5m27.668s real 2m28.888s user 1m14.792s sys 5m27.800s
fork regression patch: https://lkml.kernel.org/r/20161010173440.GA28945@linaro.org
Besides this periodic lb case, what's the benefit of potentially pulling the biggest task instead of an arbitrary one in active lb?
The original idea is to figure out method to only migrate big task from LITTLE core to big core and avoid to migrate small tasks to big core for saving power.
So at beginning I think rb-tree's potential benefit is for power (actually I also hope this also can benefit for single big task can be easily migrated to big core).
Now if we think about a big.LITTLE system and we are on DIE sd level, an ordered list (large to small) on little cpu and vice versa on big could potentially make sense. Not very scalable though because we're just exploiting the idea of _two_ items with an opposite property (big vs. LITTLE) and different order.
Good point. I only considered the case for task migration from LITTLE core to big core. Should iterate tasks from small to large when migrate from big core to LITTLE core.
Another remark, where do you see the benefit of using util instead of load? In both cases, you do 'task_h_load(p)/2 > env->imbalance), env->imbalance -= task_h_load(p) and if (env->imbalance <= 0) break' i.e. a load based policy any way?
For using util instead of load, as Juri has pointed out in previous meeting, this may pontentially introduce smp nice issue. But in practice, the load value includes running time + runnable time, but util value only includes running time. So util will select a longer running task than another one task, these two tasks may have similiar load value.
I don't change any policy for migration conditions. But this maybe is a good hint for optimization :)
The LB_MIN thing (avoiding to consider tiny tasks for pulling) is a related thing but IMHO, aren't we seeing results which are pretty much inside the noise-cloud here for a particular system/workload combination?
Sorry I'm not quite clear for your question. Do you mean we cannot get strong evidence from testing result to show LB_MIN or rb-tree which is better?
Let's talk about it further in tomorrow's meeting.
Sure.
Thanks, Leo Yan
On Fri, Oct 28, 2016 at 10:19:41AM +0200, Vincent Guittot wrote:
On 28 October 2016 at 10:13, Leo Yan leo.yan@linaro.org wrote:
On Thu, Oct 27, 2016 at 08:37:05PM +0100, Dietmar Eggemann wrote:
Hi Leo,
On 26/10/16 18:28, Leo Yan wrote:
o This patch series is to evaluate if can use rb tree to track task load and util on rq; there have some concern for this method is: rb tree has O(log(N)) computation complexity, so this will introduce extra workload by rb tree's maintainence. For this concern using hackbench to do stress testing, hackbench will generate mass tasks for message sender and receiver, so there will have many enqueue and dequeue operations, so we can use hackbench to get to know if rb tree will introduce big workload or not (Thanks a lot for Chris suggestion for this).
Another concern is scheduler has provided LB_MIN feature, after enable feature LB_MIN the scheduler will avoid to migrate task with load < 16. Somehow this also can help to filter out big tasks for migration. So we need compare power data between this patch series with directly setting LB_MIN.
I have difficulties to understand the whole idea here. Basically, you're still doing classical load-balancing (lb) (with the aim of setting env->imbalance ((runnable) load based) to 0. On a system like Hikey (SMP) any order (load or util related) of the tasks can potentially change how many tasks a dst cpu might pull (in case of an ordered list (large to small load) we potentially pull only one task (doesn't have to be the first one because 'task_h_load(p)/2 > env->imbalance', in case its load is smaller but close to env->imbalance/2). But how can this help to increase performance in a workload-agnostic way?
On 4.4, the result is better than simple list. Vincent also suggested me to do comparision on mainline kernel, the result shows my rt-tree patches introduce performance regression:
mainline kernel real 2m23.701s user 1m4.500s sys 4m34.604s
mainline kernel + fork regression patch real 2m24.377s user 1m3.952s sys 4m39.928s real 2m19.100s user 0m48.776s sys 3m33.440s
mainline with big task tracking: real 2m28.837s user 1m16.388s sys 5m26.864s real 2m28.501s user 1m18.104s sys 5m30.516s
That would be interesting to understand where the huge difference between mainline above and your 1st test with v4.4 come from: 1st results on v4.4 were real user system baseline 6m00.57s 1m41.72s 34m38.18s rb tree 5m55.79s 1m33.68s 34m08.38s
Is the difference linked to v4.4 vs mainline ? different version of hackbench ? different version of rootfs/distro ? something else ?
Two things I think are quite different between v4.4 and the mainline kernel:
- The mainline kernel does not have CPUFreq enabled, so presumably it always runs at 1.2GHz; the v4.4 kernel has the "interactive" governor enabled;
- v4.4 has the EAS and WALT related code merged, but when I tested I disabled EAS with "echo NO_ENERGY_AWARE > sched_features" and used PELT signals rather than WALT.
Thanks, Leo Yan