On 08.03.23 at 18:32, Asahi Lina wrote:
[SNIP] Yes but... none of this cleans up jobs that are already submitted by the scheduler and in its pending list, with registered completion callbacks, which were already popped off of the entities.
*That* is the problem this patch fixes!
Ah! Yes that makes more sense now.
We could add a warning when users of this API don't do this correctly, but cleaning up incorrect API use is clearly something we don't want here.
It is the job of the Rust abstractions to make incorrect API use that leads to memory unsafety impossible. So even if you don't want that in C, it's my job to do that for Rust... and right now, I just can't because drm_sched doesn't provide an API that can be safely wrapped without weird bits of babysitting functionality on top (like tracking jobs outside or awkwardly making jobs hold a reference to the scheduler and defer dropping it to another thread).
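For readers following along, the "jobs hold a reference to the scheduler and defer dropping it to another thread" workaround can be sketched in plain userspace Rust. This is only a toy model under stated assumptions: ToyScheduler, ToyJob, complete() and run_demo() are hypothetical names invented for illustration, and nothing here is the drm_sched API or the actual Rust abstraction under discussion.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;

// Toy stand-in for a scheduler (hypothetical, NOT drm_gpu_scheduler).
struct ToyScheduler {
    freed: Arc<AtomicBool>,
}

impl Drop for ToyScheduler {
    fn drop(&mut self) {
        // Stands in for the teardown that must not run from the job
        // completion path.
        self.freed.store(true, Ordering::SeqCst);
    }
}

// Each in-flight job keeps the scheduler alive through a backreference.
struct ToyJob {
    sched: Arc<ToyScheduler>,
}

impl ToyJob {
    // Called when the "hardware" finishes the job. The backreference is
    // not dropped here; it is handed to a separate cleanup thread.
    fn complete(self, reaper: &mpsc::Sender<Arc<ToyScheduler>>) {
        reaper.send(self.sched).expect("reaper thread is gone");
    }
}

// Returns true if the scheduler outlived its owner while a job was in
// flight and was freed only after that job completed.
fn run_demo() -> bool {
    let freed = Arc::new(AtomicBool::new(false));
    let sched = Arc::new(ToyScheduler { freed: freed.clone() });
    let (tx, rx) = mpsc::channel::<Arc<ToyScheduler>>();

    // Cleanup thread: the last reference is always dropped here.
    let reaper = thread::spawn(move || {
        for s in rx {
            drop(s);
        }
    });

    let job = ToyJob { sched: sched.clone() };
    drop(sched); // the owner lets go while the job is still in flight
    let alive_while_in_flight = !freed.load(Ordering::SeqCst);

    job.complete(&tx);
    drop(tx); // no more senders, so the reaper loop ends
    reaper.join().unwrap();

    alive_while_in_flight && freed.load(Ordering::SeqCst)
}

fn main() {
    assert!(run_demo());
}
```

The extra thread and channel exist only to keep the final drop out of the completion path, which is the "babysitting functionality" the paragraph above objects to.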
Yeah, that was discussed before but rejected.
The argument was that the upper layer needs to wait for the hw to become idle before the scheduler can be destroyed anyway.
Right now, it is not possible to create a safe Rust abstraction for drm_sched without doing something like duplicating all job tracking in the abstraction, or the above backreference + deferred cleanup mess, or something equally silly. So let's just fix the C side please ^^
Nope, as far as I can see this is just not correctly tearing down the objects in the right order.
There's no API to clean up in-flight jobs in a drm_sched at all. Destroying an entity won't do it. So there is no reasonable way to do this at all...
Yes, this was removed.
So you are trying to do something which is not supposed to work in the first place.
I need to make things that aren't supposed to work impossible to do in the first place, or at least fail gracefully instead of just oopsing like drm_sched does today...
If you're convinced there's a way to do this, can you tell me exactly what code sequence I need to run to safely shut down a scheduler assuming all entities are already destroyed? You can't ask me for a list of pending jobs (the scheduler knows this, it doesn't make any sense to duplicate that outside), and you can't ask me to just not do this until all jobs complete execution (because then we either end up with the messy deadlock situation I described if I take a reference, or more duplicative in-flight job count tracking and blocking in the free path of the Rust abstraction, which doesn't make any sense either).
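The other option mentioned above, in-flight job count tracking with blocking in the free path, might look roughly like the following userspace sketch. It is a toy model only: InFlight, ToyScheduler, JobToken, submit() and run_demo() are hypothetical names made up for illustration, and the "hardware" is just a thread that sleeps.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

// Shared counter of submitted-but-unfinished jobs (hypothetical).
struct InFlight {
    count: Mutex<usize>,
    idle: Condvar,
}

// Toy stand-in for a scheduler wrapper (NOT the drm_sched API).
struct ToyScheduler {
    in_flight: Arc<InFlight>,
}

// Held for as long as a job is in flight.
struct JobToken {
    in_flight: Arc<InFlight>,
}

impl ToyScheduler {
    fn new() -> Self {
        let in_flight = Arc::new(InFlight {
            count: Mutex::new(0),
            idle: Condvar::new(),
        });
        ToyScheduler { in_flight }
    }

    // Count the job before handing it to the "hardware".
    fn submit(&self) -> JobToken {
        *self.in_flight.count.lock().unwrap() += 1;
        JobToken { in_flight: self.in_flight.clone() }
    }
}

impl Drop for JobToken {
    fn drop(&mut self) {
        let mut n = self.in_flight.count.lock().unwrap();
        *n -= 1;
        if *n == 0 {
            self.in_flight.idle.notify_all();
        }
    }
}

impl Drop for ToyScheduler {
    // The free path blocks until every in-flight job has completed.
    fn drop(&mut self) {
        let mut n = self.in_flight.count.lock().unwrap();
        while *n > 0 {
            n = self.in_flight.idle.wait(n).unwrap();
        }
    }
}

// Returns the number of jobs still in flight once the scheduler is gone.
fn run_demo() -> usize {
    let sched = ToyScheduler::new();
    let probe = sched.in_flight.clone();
    let jobs: Vec<JobToken> = (0..3).map(|_| sched.submit()).collect();

    // Pretend hardware: completes the jobs one by one.
    let hw = thread::spawn(move || {
        for job in jobs {
            thread::sleep(Duration::from_millis(10));
            drop(job);
        }
    });

    drop(sched); // blocks here until all three jobs have completed
    let remaining = *probe.count.lock().unwrap();
    hw.join().unwrap();
    remaining
}

fn main() {
    assert_eq!(run_demo(), 0);
}
```

Note that the counter duplicates information the scheduler already has in its pending list, and that dropping the scheduler can now stall for as long as the slowest job takes, which is the cost described above.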
Good question. We don't have anybody upstream who uses the scheduler lifetime like this.
Essentially the job list in the scheduler is something we wanted to remove because it causes tons of race conditions during hw recovery.
When you tear down the firmware queue, how do you handle jobs that were already submitted to it?
Regards, Christian.
~~ Lina