On Tue, 2025-09-23 at 14:33 +0200, Christian König wrote:
On 23.09.25 14:08, Philipp Stanner wrote:
On Mon, 2025-09-22 at 22:50 +0200, Jules Maselbas wrote:
On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
On 22.09.25 17:30, Philipp Stanner wrote:
On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
In FIFO mode (which is the default), both drm_sched_entity_push_job() and drm_sched_rq_update_fifo(), where the latter is called from the former, are currently taking and releasing the same entity->rq_lock.
We can avoid that design inelegance, and also get a minuscule efficiency improvement on the submit-from-idle path, by introducing a new drm_sched_rq_update_fifo_locked() helper and pulling the lock taking up to its callers.
v2: * Remove drm_sched_rq_update_fifo() altogether. (Christian)
v3: * Improved commit message. (Philipp)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Luben Tuikov <ltuikov89@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <pstanner@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241016122013.7857-2-tursulin...
(cherry picked from commit d42a254633c773921884a19e8a1a0f53a31150c3)
Signed-off-by: Jules Maselbas <jmaselbas@zdiv.net>
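For readers following along, a rough sketch of the resulting pattern (simplified for illustration, not the actual diff; the FIFO update logic is elided and submit_ts merely stands in for the job's submit timestamp):

    /* The FIFO update now assumes the caller holds entity->rq_lock. */
    void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
                                         ktime_t ts)
    {
            lockdep_assert_held(&entity->rq_lock);
            /* ... update the entity's position in the FIFO rb-tree ... */
    }

    /* In drm_sched_entity_push_job(), roughly: the lock is taken once
     * instead of being dropped and immediately re-acquired. */
    spin_lock(&entity->rq_lock);
    /* ... existing submit-from-idle logic ... */
    drm_sched_rq_update_fifo_locked(entity, submit_ts);
    spin_unlock(&entity->rq_lock);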
Am I interpreting this mail correctly: you want to get this patch into stable?
Why? It doesn't fix a bug.
Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
Yes, patch #3 fixes a freeze in amdgpu.
We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
I initially modified patch #3 to use .rq_lock instead of .lock, but I didn't feel very confident with that modification. Should I send a new version with a modified patch #3? If so, how should the change be reflected in the commit message? (I initially asked on #kernelnewbies but ended up pulling in the two other patches.)
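The mechanical part of that adjustment would look roughly like this (illustrative only; the actual context lines of patch #3 are omitted, the point being that the older stable kernels still name the entity lock rq_lock, it was only renamed to lock later):

    -	spin_lock(&entity->lock);
    +	spin_lock(&entity->rq_lock);
     	/* ... */
    -	spin_unlock(&entity->lock);
    +	spin_unlock(&entity->rq_lock);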
You know, folks, situations like this are why we want to strongly discourage accessing another API's struct members directly. There is no API contract for them.
And a proper API function rarely changes its interface; if it does, it's easy for the contributor to find where drivers need to be adjusted. If we were all following that rule, you wouldn't even have to bother with patches #1 and #2.
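As a toy illustration (not actual scheduler code): a driver reading the scheduler's internals directly breaks silently when a field is renamed or its locking changes, while an accessor keeps the contract in one place:

    /* Fragile: depends on drm_sched internals and their locking rules. */
    if (entity->stopped)
            return -ENODEV;

    /* Robust: a helper owned by the scheduler, e.g. the
     * drm_sched_entity_is_stopped() proposed below. */
    if (drm_sched_entity_is_stopped(entity))
            return -ENODEV;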
That said, I see two proper solutions for your problem:
A. amdgpu is the one stopping the entities anyway, isn't it? It knows which entities it has killed, so that information could be stored in struct amdgpu_vm.
No, it's the scheduler which decides when entities are stopped.
The scheduler sets the stopped flag, but that effectively only happens when you either flush() or fini() the entity, or if you run into that drm_sched_fini() race.
Otherwise we would need to re-invent the flush logic for every driver again.
Let's ask differently: Does amdgpu check here whether drm_sched_entity_fini() or drm_sched_entity_flush() have been called on those entities already?
B. Add an API: drm_sched_entity_is_stopped(). There's also drm_sched_entity_is_idle(), but I guess that won't serve your purpose?
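An untested sketch of what that helper could look like (assuming entity->stopped stays protected by the entity lock; on older kernels that lock is named rq_lock):

    /* Sketch only: report whether the entity was marked stopped.
     * Takes the entity lock so the read is coherent with
     * drm_sched_entity_kill() setting the flag under the same lock. */
    static bool drm_sched_entity_is_stopped(struct drm_sched_entity *entity)
    {
            bool stopped;

            spin_lock(&entity->lock);
            stopped = entity->stopped;
            spin_unlock(&entity->lock);

            return stopped;
    }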
drm_sched_entity_is_stopped() should do it. drm_sched_entity_is_idle() is something different and should potentially not even be exported to drivers in the first place.
Fine by me.
And btw, while we're at it: @Christian: Danilo and I recently asked whether entities can still outlive their scheduler in amdgpu?
That should have been fixed by now. This happened only on hot-unplug and that was re-designed quite a bit.
That seems to be the reason why that race "fix" in drm_sched_fini() was added; it's the only other place that can mark an entity as stopped, besides the proper place, drm_sched_entity_kill().
That is potentially still good to have.
That's why we left it in for now and just added a FIXME: there's not really any benefit in potentially blowing up drivers by removing it. (Well, technically, blowing up drivers like that would reveal significant lifetime and, thus, design issues. But it wouldn't be "nice".)
Still, it's a clear sign of (undocumented…) scheduler lifetimes being violated :(
P.
Regards, Christian.
P.
Best, Jules