On Tue, Feb 25, 2025 at 06:05:53PM +0000, Harshit Agarwal wrote:
Details
Let's look at the following scenario to understand this race.
- CPU A enters push_rt_task
  a) CPU A has chosen next_task = task p.
  b) CPU A calls find_lock_lowest_rq(task p, CPU Z’s rq).
  c) CPU A identifies CPU X as a destination CPU (X < Z).
  d) CPU A enters double_lock_balance(CPU Z’s rq, CPU X’s rq).
  e) Since X is lower than Z, CPU A unlocks CPU Z’s rq. Someone else has
     already locked CPU X’s rq, so CPU A must wait (see the lock-ordering
     sketch below).
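A minimal sketch of the lock ordering behind steps (d) and (e), assuming a
simplified model of the kernel's double_lock_balance()/rq_order_less()
helpers (the real code lives in kernel/sched/sched.h and differs in
detail). It only illustrates why CPU A has to drop CPU Z's rq lock before
it can take CPU X's, which opens the window for this race:

static int double_lock_balance_sketch(struct rq *this_rq, struct rq *busiest)
{
	/* Fast path: the second lock is free, nothing is dropped. */
	if (raw_spin_rq_trylock(busiest))
		return 0;

	if (rq_order_less(busiest, this_rq)) {
		/*
		 * busiest (CPU X) orders before this_rq (CPU Z), so to
		 * respect the lock ordering we must drop this_rq first.
		 * Anything validated about the task under this_rq's lock
		 * can go stale while we wait here for busiest's lock.
		 */
		raw_spin_rq_unlock(this_rq);
		raw_spin_rq_lock(busiest);
		raw_spin_rq_lock(this_rq);
		return 1;	/* locks were dropped: caller must revalidate */
	}

	raw_spin_rq_lock(busiest);
	return 0;
}

The nonzero return is what find_lock_lowest_rq keys off to decide whether
it needs to recheck the task; that recheck is what this patch strengthens.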
- At CPU Z
  a) The previous task has completed execution, so CPU Z enters schedule
     and locks its own rq after CPU A releases it.
  b) CPU Z dequeues the previous task and begins executing task p.
  c) CPU Z unlocks its rq.
  d) Task p yields the CPU (e.g. by doing IO or waiting to acquire a
     lock), which triggers the schedule function on CPU Z.
  e) CPU Z enters schedule again, locks its own rq, and dequeues task p.
  f) As part of the dequeue, it sets p.on_rq = 0 and unlocks its rq (see
     the sketch below).
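The effect of steps (e) and (f) can be condensed as follows. This is not
the actual __schedule() code, just the two state changes that matter for
the race: p comes off CPU Z's runqueue and p.on_rq is cleared, while
task_cpu(p) still reads Z until a later wakeup migrates it.

/*
 * Condensed effect of CPU Z blocking task p in __schedule();
 * not the real scheduler code, only the writes that matter here.
 */
static void block_task_sketch(struct rq *rq, struct task_struct *p)
{
	lockdep_assert_rq_held(rq);

	dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
	WRITE_ONCE(p->on_rq, 0);	/* step (f): no longer queued anywhere */
	/* task_cpu(p) still returns Z until try_to_wake_up() moves it. */
}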
- At CPU B
  a) CPU B enters try_to_wake_up with input task p.
  b) Since CPU Z dequeued task p, p.on_rq = 0, so CPU B updates
     p.state = WAKING.
  c) CPU B, via select_task_rq, determines CPU Y as the target CPU (see
     the sketch of this path below).
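A condensed sketch of the try_to_wake_up() path CPU B takes in this
scenario. The helper names match the upstream ones in kernel/sched/core.c,
but the flow is heavily simplified and omits the many other cases the real
function handles:

static int try_to_wake_up_sketch(struct task_struct *p)
{
	unsigned long flags;
	int cpu;

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	if (READ_ONCE(p->on_rq)) {
		/* Still queued somewhere; not the case in this scenario. */
		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
		return 0;
	}

	WRITE_ONCE(p->__state, TASK_WAKING);		/* step (b) */
	cpu = select_task_rq(p, p->wake_cpu, WF_TTWU);	/* step (c): CPU Y */
	set_task_cpu(p, cpu);				/* p.cpu = Y */
	ttwu_queue(p, cpu, WF_TTWU);			/* enqueue on Y, p.on_rq = 1 */
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
	return 1;
}

Note that nothing on this path takes CPU Z's rq lock, which is why holding
that lock alone cannot keep task p in place for CPU A.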
- The race
  a) CPU A acquires CPU X’s lock and relocks CPU Z.
  b) CPU A reads task p.cpu = Z and incorrectly concludes that task p is
     still on CPU Z.
  c) CPU A fails to notice that task p was dequeued from CPU Z while
     CPU A was waiting for locks in double_lock_balance. Had CPU A known
     that task p had been dequeued, find_lock_lowest_rq would have
     returned NULL, forcing push_rt_task to give up migrating task p.
  d) CPU B updates task p.cpu = Y and calls ttwu_queue.
  e) CPU B locks CPU Y’s rq, enqueues task p onto it, and sets
     task p.on_rq = 1.
  f) CPU B unlocks CPU Y’s rq, triggering memory synchronization.
  g) CPU A reads task p.on_rq = 1, cementing its assumption that task p
     has not migrated.
  h) CPU A decides to migrate p to CPU X (see the check sketched below
     that lets this happen).
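The reason step (c) goes unnoticed: the revalidation find_lock_lowest_rq()
does after double_lock_balance() reads fields of task p that are not
stabilized by the two rq locks CPU A holds. Paraphrasing the pre-fix check
(helper and field names as in recent kernel/sched/rt.c; details vary by
kernel version):

	if (double_lock_balance(rq, lowest_rq)) {
		/*
		 * The locks were dropped; recheck that p is still pushable
		 * from rq.  Problem: with the interleaving above, CPU A can
		 * read a stale p->cpu (still Z, so task_rq(task) == rq) and
		 * a fresh p->on_rq (1, from the enqueue on CPU Y), so every
		 * test below passes even though p has left CPU Z.
		 */
		if (unlikely(task_rq(task) != rq ||
			     !cpumask_test_cpu(lowest_rq->cpu, &task->cpus_mask) ||
			     task_on_cpu(rq, task) ||
			     !rt_task(task) ||
			     is_migration_disabled(task) ||
			     !task_on_rq_queued(task))) {
			double_unlock_balance(rq, lowest_rq);
			lowest_rq = NULL;
			break;
		}
	}

Only p's pi_lock or the rq lock of the CPU p is actually queued on would
make those reads stable, and CPU A holds neither for CPU Y.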
This leads to CPU A dequeuing task p from CPU Y’s runqueue and to various crashes down the line.
Solution
The solution here is fairly simple. After CPU A reacquires the locks (step a of "The race" above), the check in find_lock_lowest_rq is enhanced to make sure that the task is still at the head of the rq's pushable tasks list. If it is not, the task is no longer a suitable candidate for pushing, and find_lock_lowest_rq gives up as described above.
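Based on that description, a sketch of the strengthened check, assuming the
existing pick_next_pushable_task() helper in kernel/sched/rt.c (the actual
patch may differ in detail):

	if (double_lock_balance(rq, lowest_rq)) {
		/*
		 * The locks were dropped.  rq->rt.pushable_tasks is only
		 * modified under rq's lock, so if p was dequeued from rq
		 * (CPU Z) in the meantime it is no longer the head of that
		 * list, no matter what p->cpu or p->on_rq currently read.
		 */
		if (unlikely(task != pick_next_pushable_task(rq) ||
			     is_migration_disabled(task) ||
			     !cpumask_test_cpu(lowest_rq->cpu,
					       &task->cpus_mask))) {
			double_unlock_balance(rq, lowest_rq);
			lowest_rq = NULL;
			break;
		}
	}

The key property is that the new test only consults state protected by the
lock CPU A already holds (CPU Z's rq lock), so it cannot be fooled by a
concurrent wakeup that never takes that lock.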
Testing
The fix was tested on a cluster of 3 nodes where panics due to this race were hit every couple of days. A fix similar to this one was deployed on that cluster and remained stable for more than 30 days.
Co-developed-by: Jon Kohler <jon@nutanix.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
Co-developed-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
Signed-off-by: Gauri Patwardhan <gauri.patwardhan@nutanix.com>
Co-developed-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
Signed-off-by: Rahul Chunduru <rahul.chunduru@nutanix.com>
Signed-off-by: Harshit Agarwal <harshit@nutanix.com>
Tested-by: Will Ton <william.ton@nutanix.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: stable@vger.kernel.org
Thanks, I've picked this up to land after -rc1.