On Tue, 2025-09-23 at 14:33 +0200, Christian König wrote:
On 23.09.25 14:08, Philipp Stanner wrote:
On Mon, 2025-09-22 at 22:50 +0200, Jules Maselbas wrote:
On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
On 22.09.25 17:30, Philipp Stanner wrote:
On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
In FIFO mode (which is the default), both drm_sched_entity_push_job() and drm_sched_rq_update_fifo(), where the latter is called from the former, are currently taking and releasing the same entity->rq_lock.
We can avoid that design inelegance, and also get a minuscule efficiency improvement on the submit-from-idle path, by introducing a new drm_sched_rq_update_fifo_locked() helper and pulling the lock taking up to its callers.
v2: * Remove drm_sched_rq_update_fifo() altogether. (Christian)
v3: * Improved commit message. (Philipp)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Luben Tuikov <ltuikov89@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <pstanner@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241016122013.7857-2-tursulin...
(cherry picked from commit d42a254633c773921884a19e8a1a0f53a31150c3)
Signed-off-by: Jules Maselbas <jmaselbas@zdiv.net>
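For readers following along, a rough sketch of the resulting pattern (simplified for illustration, not the actual diff; the FIFO update logic is elided and submit_ts merely stands in for the job's submit timestamp):

    /* The FIFO update now assumes the caller holds entity->rq_lock. */
    void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
                                         ktime_t ts)
    {
            lockdep_assert_held(&entity->rq_lock);
            /* ... update the entity's position in the FIFO rb-tree ... */
    }

    /* In drm_sched_entity_push_job(), roughly: the lock is taken once
     * instead of being dropped and immediately re-acquired. */
    spin_lock(&entity->rq_lock);
    /* ... existing submit-from-idle logic ... */
    drm_sched_rq_update_fifo_locked(entity, submit_ts);
    spin_unlock(&entity->rq_lock);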
Am I interpreting this mail correctly: you want to get this patch into stable?
Why? It doesn't fix a bug.
Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
Yes, patch #3 fixes a freeze in amdgpu.
We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
I initially modified patch #3 to use .rq_lock instead of .lock, but I didn't feel very confident with that modification. Should I send a new version with a modified patch #3? If so, how should the change be reflected in the commit message? (I initially asked on #kernelnewbies but ended up pulling in the two other patches.)
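The mechanical part of that adjustment would look roughly like this (illustrative only; the actual context lines of patch #3 are omitted, the point being that the older stable kernels still name the entity lock rq_lock, it was only renamed to lock later):

    -	spin_lock(&entity->lock);
    +	spin_lock(&entity->rq_lock);
     	/* ... */
    -	spin_unlock(&entity->lock);
    +	spin_unlock(&entity->rq_lock);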
You know, folks, situations like this are why we want to strongly discourage accessing another API's struct members directly. There is no API contract for them.
And a proper API function rarely changes its interface; if it does, it's easy for the contributor to find where drivers need to be adjusted. If we were all following that rule, you wouldn't even have to bother with patches #1 and #2.
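As a toy illustration (not actual scheduler code): a driver reading the scheduler's internals directly breaks silently when a field is renamed or its locking changes, while an accessor keeps the contract in one place:

    /* Fragile: depends on drm_sched internals and their locking rules. */
    if (entity->stopped)
            return -ENODEV;

    /* Robust: a helper owned by the scheduler, e.g. the
     * drm_sched_entity_is_stopped() proposed below. */
    if (drm_sched_entity_is_stopped(entity))
            return -ENODEV;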
That said, I see two proper solutions for your problem:
A. amdgpu is the one stopping the entities anyway, isn't it? It knows which entities it has killed, so that information could be stored in struct amdgpu_vm.
No, it's the scheduler which decides when entities are stopped.
The scheduler sets the stopped flag, but that effectively only happens when you either flush() or fini() the entity, or if you run into that drm_sched_fini() race.
Otherwise we would need to re-invent the flush logic for every driver again.
Let's ask differently: Does amdgpu check here whether drm_sched_entity_fini() or drm_sched_entity_flush() have been called on those entities already?
B. Add an API: drm_sched_entity_is_stopped(). There's also drm_sched_entity_is_idle(), but I guess that won't serve your purpose?
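An untested sketch of what that helper could look like (assuming entity->stopped stays protected by the entity lock; on older kernels that lock is named rq_lock):

    /* Sketch only: report whether the entity was marked stopped.
     * Takes the entity lock so the read is coherent with
     * drm_sched_entity_kill() setting the flag under the same lock. */
    static bool drm_sched_entity_is_stopped(struct drm_sched_entity *entity)
    {
            bool stopped;

            spin_lock(&entity->lock);
            stopped = entity->stopped;
            spin_unlock(&entity->lock);

            return stopped;
    }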
drm_sched_entity_is_stopped() should do it. drm_sched_entity_is_idle() is something different and should potentially not even be exported to drivers in the first place.
Fine by me.
And btw, while we're at it: @Christian: Danilo and I recently asked whether entities can still outlive their scheduler in amdgpu?
That should have been fixed by now. This happened only on hot-unplug and that was re-designed quite a bit.
That seems to be the reason why that race "fix" in drm_sched_fini() was added; it's the only other place that can mark an entity as stopped, besides the proper place, drm_sched_entity_kill().
That is potentially still good to have.
That's why we left it in for now and just added a FIXME: there's not really any benefit in potentially blowing up drivers by removing it. (Well, technically, blowing up drivers like that would reveal significant lifetime and, thus, design issues. But it wouldn't be "nice".)
Still, it's a clear sign of (undocumented…) scheduler lifetimes being violated :(
P.
Regards, Christian.
P.
Best, Jules