On 09/03/2023 04.12, Christian König wrote:
> On 08.03.23 at 20:05, Asahi Lina wrote:
>> [SNIP]
>> Well it's not the better way, it's the only way that works.
>
> I have to admit that my bet on your intentions was wrong, but even that use case doesn't work correctly.
>
> See, when your callback returns false it is perfectly possible that all hw fences are signaled between returning that information and processing it.
>
> The result would be that the scheduler goes to sleep and never wakes up again.
That can't happen: the scheduler will just go into another iteration of the drm_sched main loop, since an entity is still available.
Rather, there is probably the opposite bug in this patch: the can_run_job() logic should be moved into the wait_event_interruptible() condition check. Otherwise I think it can end up busy-looping, since the condition itself can be true even while the can_run_job() check blocks it.
But there is no risk of it going to sleep and never waking up, because job completions wake up the waitqueue by definition, and that happens after the driver-side queues are popped. If this problem could happen, the existing hw_submission_limit logic would be broken in the same way; the two are logically equivalent in how they work.
Basically, if properly done in wait_event_interruptible, it is exactly the logic of that macro that prevents this race condition and makes everything work at all. Without it, drm_sched would be completely broken.
> As I said, we exercised those ideas before, and yes, this approach came up before as well, and no, it doesn't work.
It can never deadlock with this patch as it stands (though it could busy-loop), and if properly moved into the wait_event_interruptible() condition, it would also never busy-loop and would work entirely as intended. The actual API change is sound.
I don't know why you're trying so hard to convince everyone that this approach is fundamentally broken... It might be a bad idea for other reasons, it might encourage incorrect usage, it might not be the best option; there are plenty of arguments you could make. But you keep arguing that it simply can't work at all, for some reason. Why? I already said I'm happy dropping it in favor of the fences...
> Well because it is broken.
>
> When you move the check into the wait_event_interruptible condition then who is going to call wait_event_interruptible when the condition changes?
I think you mean wake_up_interruptible(). That would be drm_sched_job_done(), on the fence callback when a job completes, which, as I keep saying, is the same logic used for hw_rq_count/hw_submission_limit tracking.
Please think about it for a second, it's really not that complicated to see why it works:
- Driver pops off completed commands      <-- can_run_job() condition satisfied
- Driver signals fence
  - drm_sched_job_done_cb()
    - drm_sched_job_done()
      - atomic_dec(&sched->hw_rq_count);  <-- hw_submission_limit satisfied
      - ...
      - wake_up_interruptible(&sched->wake_up_worker);
        ^- happens after both conditions are potentially satisfied
It really is completely equivalent to just making the hw_rq_count logic customizable by the driver. The actual flow is the same. As long as the driver guarantees it satisfies the can_run_job() condition before signaling the completion fence that triggered that change, it works fine.
> As I said, this idea came up before and was rejected multiple times.
Maybe it was a different idea, or maybe it was rejected for other reasons, or maybe it was wrongly rejected for being broken when it isn't ^^
~~ Lina