Hi all,
When I debug rb-tree related patches, it's easily to trigger panic for my rb-tree code, I try to use below simple pseudo code to demonstrate it:
detach_tasks() node = rb_first(&env->src_rq->seq_node); -> 'node_prev' while(node) { se = rb_entry(node, struct sched_entity, seq_node); node = rb_next(&se->seq_node); -> 'node_next'
if (balanced) break; if (meet_conditions_for_migration) detach_task(se); -> Other CPU acquires src_rq lock -> and remove 'node_next' firstly else continue;
}
In this flow the detach_task() has been modified by WALT patches, so in function detach_task() it releases lock for source rq in function double_lock_balance(env->src_rq, env->dst_rq) and then acquire source rq and destination rq lock in specific sequence so avoid recursive deadlock; But this gives other CPUs chance to acquire lock for souce rq and remove node_next from the rb tree, e.g. it is possible to dequeue the corresponding task on any other CPU (Like CPU_B).
Detach_tasks() will continue iteration for 'node_next', and 'node_next' can meet the condition to detach, so it try to remove 'node_next' from rb tree, but 'node_next' has been removed yet by CPU_B. So finally introduce panic. Please see enclosed kernel log.
So essentially it's unsafe to release and acquire again for rq lock when scheduler is iterating the lists/tree for the rq. But this code is delibrately written for WALT to update souce rq and destination rq statistics for workload. So currently I can simply revert double_lock_balance()/double_unlock_balance() for only using PELT signals, but for WALT I want to get some suggestion for the fixing, if we confirm this is a potential issue, this issue should exist both on Android common kernel 3.18 and 4.4.
/* * detach_task() -- detach the task for the migration specified in env */ static void detach_task(struct task_struct *p, struct lb_env *env) { lockdep_assert_held(&env->src_rq->lock);
deactivate_task(env->src_rq, p, 0); p->on_rq = TASK_ON_RQ_MIGRATING; double_lock_balance(env->src_rq, env->dst_rq); set_task_cpu(p, env->dst_cpu); double_unlock_balance(env->src_rq, env->dst_rq); }
Thanks, Leo Yan