On Mar 5, 2025, at 2:43 AM, Juri Lelli <juri.lelli@redhat.com> wrote:
>> Another one is dl_task_offline_migration(), which gets the task from dl_task_timer(), which in turn gets it from sched_dl_entity. I haven’t gone through the deadline code thoroughly, but I think this race shouldn’t exist for the offline task (2nd) case. If that is true, then the fix could be to check in push_dl_task() whether the task returned by find_lock_later_rq() is still at the head of the queue.
> I believe that won't work, as dl_task_offline_migration() gets called when the replenishment timer for a task fires (to unthrottle it) and finds that the old rq the task was running on has been offlined in the meantime. The task is still throttled at this point, so it is enqueued neither in the dl_rq nor in the pushable task list/tree, and the check you are adding won't work, I am afraid. Maybe we can use dl_se->dl_throttled to distinguish this case.
Thanks Juri. I sent the fix; please take a look:

https://lore.kernel.org/lkml/20250307204255.60640-1-harshit@nutanix.com/

Instead of changing find_lock_later_rq(), I added the handling in the caller, i.e. push_dl_task(), since that is the one affected by the race. I think we don’t need to handle the other case at all, as the race does not apply to the offline migration case. Let me know if this sounds fine or if I am missing something.
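To illustrate the idea of re-checking the head of the pushable queue in the caller, here is a rough userspace model in C. It is only a sketch and not the patch at the link above, nor real kernel code: struct model_rq, struct model_task, model_find_lock_later_rq(), model_push_dl_task() and the race hook are all hypothetical stand-ins, the pushable tree is reduced to a small array, and locking is not modelled at all. The hook merely simulates the window in which the task can be moved by another CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
struct model_task {
	int id;
};

#define MAX_PUSHABLE 8

struct model_rq {
	/* Index 0 models the head of the pushable list (earliest deadline). */
	struct model_task *pushable[MAX_PUSHABLE];
	size_t nr_pushable;
};

/* Return the task at the head of the pushable list, or NULL if it is empty. */
static struct model_task *pick_next_pushable(struct model_rq *rq)
{
	return rq->nr_pushable ? rq->pushable[0] : NULL;
}

/* Remove p from rq's pushable list, keeping the remaining order. */
static void dequeue_pushable(struct model_rq *rq, struct model_task *p)
{
	for (size_t i = 0; i < rq->nr_pushable; i++) {
		if (rq->pushable[i] != p)
			continue;
		for (size_t j = i + 1; j < rq->nr_pushable; j++)
			rq->pushable[j - 1] = rq->pushable[j];
		rq->nr_pushable--;
		return;
	}
}

/* Simulates the race window: something may happen to p while we look for a target. */
typedef void (*race_hook_t)(struct model_rq *rq, struct model_task *p);

/* Example hook: another CPU pulls the task away in the meantime. */
static void model_steal(struct model_rq *rq, struct model_task *p)
{
	dequeue_pushable(rq, p);
}

/* Models find_lock_later_rq(): runs the hook, then returns the target rq. */
static struct model_rq *model_find_lock_later_rq(struct model_rq *rq,
						 struct model_task *p,
						 struct model_rq *later,
						 race_hook_t hook)
{
	if (hook)
		hook(rq, p);
	return later;
}

enum push_result { PUSH_NONE, PUSH_RETRY, PUSH_DONE };

/* Models the caller: push only if the task is still at the head of the queue. */
static enum push_result model_push_dl_task(struct model_rq *rq,
					   struct model_rq *later,
					   race_hook_t hook)
{
	struct model_task *next = pick_next_pushable(rq);
	struct model_rq *later_rq;

	if (!next)
		return PUSH_NONE;

	later_rq = model_find_lock_later_rq(rq, next, later, hook);
	if (!later_rq)
		return PUSH_NONE;

	/* The recheck: the task may have been moved while we searched. */
	if (pick_next_pushable(rq) != next)
		return PUSH_RETRY;

	dequeue_pushable(rq, next);
	assert(later_rq->nr_pushable < MAX_PUSHABLE);
	later_rq->pushable[later_rq->nr_pushable++] = next;
	return PUSH_DONE;
}
```

Calling model_push_dl_task() with a NULL hook moves the head task to the target rq; calling it with model_steal makes the recheck fail, so nothing is pushed and the caller is told to retry. A throttled task, as Juri notes, would never be in this list in the first place, which is why the same check cannot cover the offline migration path.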
Regards, Harshit