Hi all,
While debugging my rb-tree related patches it is easy to trigger a panic
in the rb-tree code. The simplified pseudo code below demonstrates the
problematic flow:
detach_tasks()
	node = rb_first(&env->src_rq->seq_node);	-> 'node_prev'
	while (node) {
		se = rb_entry(node, struct sched_entity, seq_node);
		node = rb_next(&se->seq_node);		-> 'node_next'

		if (balanced)
			break;

		if (meet_conditions_for_migration)
			detach_task(se);	-> Other CPU acquires src_rq lock
						-> and removes 'node_next' first
		else
			continue;
	}
In this flow detach_task() has been modified by the WALT patches: inside
detach_task() the source rq lock is released and then the source and
destination rq locks are taken again in a specific order by
double_lock_balance(env->src_rq, env->dst_rq), so as to avoid deadlock.
But this gives other CPUs a chance to acquire the source rq lock and
remove 'node_next' from the rb tree, e.g. the corresponding task may be
dequeued on another CPU (say CPU_B).

detach_tasks() then continues the iteration with 'node_next'; if
'node_next' meets the conditions for detaching, it tries to remove
'node_next' from the rb tree, but 'node_next' has already been removed
by CPU_B, which finally leads to the panic. Please see the enclosed
kernel log.
So essentially it is unsafe to release and re-acquire the rq lock while
the scheduler is iterating the rq's lists/trees. However, this code is
deliberately written for WALT to update the source and destination rq
workload statistics. For now I can simply revert the
double_lock_balance()/double_unlock_balance() change when using only
PELT signals, but for WALT I would like some suggestions for a proper
fix. If we confirm this is a real issue, it should exist on both the
Android common kernel 3.18 and 4.4 branches.
/*
 * detach_task() -- detach the task for the migration specified in env
 */
static void detach_task(struct task_struct *p, struct lb_env *env)
{
	lockdep_assert_held(&env->src_rq->lock);

	deactivate_task(env->src_rq, p, 0);
	p->on_rq = TASK_ON_RQ_MIGRATING;
	double_lock_balance(env->src_rq, env->dst_rq);
	set_task_cpu(p, env->dst_cpu);
	double_unlock_balance(env->src_rq, env->dst_rq);
}
Thanks,
Leo Yan
Setting a negative boost value impacts both task placement and OPP
selection. For task placement, the scheduler uses boosted_task_util(),
which returns a smaller value for a negative boost, so the task has a
better chance of fitting a low capacity CPU; as a result placement is
biased towards low capacity CPUs (like the LITTLE cores on an ARM
big.LITTLE system). In the current code the wakeup path uses this method
to avoid migrating a task with a negative boost value to a big core, but
the load balance path has no such check, so it can still migrate tasks
with negative boost values to big cores.
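Just to illustrate the direction of the effect (this is not the actual
schedtune margin calculation; the numbers below are made up):

	long boost = schedtune_task_boost(p);	/* e.g. -10 (percent) */
	unsigned long util = task_util(p);	/* e.g. 300 */

	/* a negative boost shrinks the util used for placement decisions,
	 * roughly by boost percent: 300 + 300 * (-10) / 100 = 270
	 */
	unsigned long boosted = util + (long)util * boost / 100;

	/* the smaller 'boosted' value makes __task_fits() more likely to
	 * succeed on a low capacity (LITTLE) CPU, biasing placement there
	 */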
So this patch adds a check for tasks with a negative boost value in the
load balance path and avoids migrating such a task to a big CPU if it
still fits the low capacity CPU.
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
---
kernel/sched/fair.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77ca4df..c22d256 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6747,17 +6747,30 @@ static inline int migrate_degrades_locality(struct task_struct *p,
 static
 int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
-	int tsk_cache_hot;
+	int tsk_cache_hot, boost;
+	unsigned long cpu_rest_util;
 
 	lockdep_assert_held(&env->src_rq->lock);
 
 	/*
 	 * We do not migrate tasks that are:
-	 * 1) throttled_lb_pair, or
-	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
-	 * 3) running (obviously), or
-	 * 4) are cache-hot on their current CPU.
+	 * 1) task has negative boost value and task fits cpu, or
+	 * 2) throttled_lb_pair, or
+	 * 3) cannot be migrated to this CPU due to cpus_allowed, or
+	 * 4) running (obviously), or
+	 * 5) are cache-hot on their current CPU.
 	 */
+	if (energy_aware() &&
+	    capacity_orig_of(env->dst_cpu) > capacity_orig_of(env->src_cpu)) {
+
+		boost = schedtune_task_boost(p);
+		cpu_rest_util = cpu_util(env->src_cpu) - task_util(p);
+		cpu_rest_util = max(0UL, cpu_rest_util);
+
+		if (boost < 0 && __task_fits(p, env->src_cpu, cpu_rest_util))
+			return 0;
+	}
+
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;
--
1.9.1
o This patch series evaluates whether an rb tree can be used to track
  task load and util on the rq. One concern with this approach is that
  an rb tree has O(log(N)) computation complexity, so maintaining it
  introduces extra overhead. To measure this, hackbench is used as a
  stress test: it spawns many message sender and receiver tasks, which
  generates a large number of enqueue and dequeue operations, so it
  tells us whether the rb tree maintenance introduces a noticeable cost
  (thanks a lot to Chris for this suggestion).

  Another concern is that the scheduler already provides the LB_MIN
  feature; with LB_MIN enabled the scheduler avoids migrating tasks
  with load < 16, which to some extent also helps bias migration
  towards bigger tasks. So we need to compare power data between this
  patch series and simply enabling LB_MIN; the existing LB_MIN check is
  sketched below for reference.
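  For reference, this is roughly the existing LB_MIN filter in
  detach_tasks() in kernel/sched/fair.c (the exact context may differ
  slightly between the 3.18 and 4.4 kernels):

	load = task_h_load(p);

	/* LB_MIN: skip very small tasks unless load balancing has
	 * already failed on this sched domain
	 */
	if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
		goto next;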
o Testing result:
  Tested hackbench on Hikey (8x Cortex-A53 CPUs) with SMP load balance:

  time sh -c 'for i in `seq 100`; do /data/hackbench -p -P > /dev/null; done'

              real         user         system
  baseline    6m00.57s     1m41.72s     34m38.18s
  rb tree     5m55.79s     1m33.68s     34m08.38s

  For the hackbench test case the rb tree kernel even gives a slightly
  better result than the baseline kernel.
  Tested video playback on Juno, LB_MIN vs rb tree:

  LB_MIN       Nrg:LITTLE    Nrg:Big      Nrg:Sum
  -------------------------------------------------------
               11.3122       8.983429     20.295629
               11.337446     8.174061     19.511507
               11.256941     8.547895     19.804836
               10.994329     9.633028     20.627357
               11.483148     8.522364     20.005512
  avg.         11.2768128    8.7721554    20.0489682

  rb tree      Nrg:LITTLE    Nrg:Big      Nrg:Sum
  -------------------------------------------------------
               11.384301     8.412714     19.797015
               11.673992     8.455219     20.129211
               11.586081     8.414606     20.000687
               11.423509     8.64781      20.071319
               11.43709      8.595252     20.032342
  avg.         11.5009946    8.5051202    20.0061148

  vs LB_MIN    +1.99%        -3.04%       -0.21%
o Known issues:
  In patch 2, detach_tasks() iterates the rb tree for candidate tasks;
  after one task has been detached it calls rb_first() again and
  restarts the iteration from the first node. It would be nicer to use
  rb_next(), but switching to rb_next() introduces a panic.
  Any suggestion for a better implementation is welcome.
Leo Yan (3):
sched/fair: support to track biggest task on rq
sched/fair: select biggest task for migration
sched: remove unused rq::cfs_tasks
include/linux/sched.h | 1 +
include/linux/sched/sysctl.h | 1 +
kernel/sched/core.c | 2 -
kernel/sched/fair.c | 123 ++++++++++++++++++++++++++++++++++++-------
kernel/sched/sched.h | 5 +-
kernel/sysctl.c | 7 +++
6 files changed, 116 insertions(+), 23 deletions(-)
--
1.9.1