From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
In FIFO mode (which is the default), both drm_sched_entity_push_job() and drm_sched_rq_update_fifo(), the latter called from the former, currently take and release the same entity->rq_lock.

We can avoid that design inelegance, and also gain a minuscule efficiency improvement on the submit-from-idle path, by introducing a new drm_sched_rq_update_fifo_locked() helper and pulling the lock taking up into its callers.
v2: * Remove drm_sched_rq_update_fifo() altogether. (Christian)
v3: * Improved commit message. (Philipp)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Luben Tuikov <ltuikov89@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <pstanner@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241016122013.7857-2-tursulin...
(cherry picked from commit d42a254633c773921884a19e8a1a0f53a31150c3)
Signed-off-by: Jules Maselbas <jmaselbas@zdiv.net>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 13 +++++++++----
 drivers/gpu/drm/scheduler/sched_main.c   |  6 +++---
 include/drm/gpu_scheduler.h              |  2 +-
 3 files changed, 13 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 3e75fc1f6607..9dbae7b08bc9 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -505,8 +505,12 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)
 		struct drm_sched_job *next;

 		next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
-		if (next)
-			drm_sched_rq_update_fifo(entity, next->submit_ts);
+		if (next) {
+			spin_lock(&entity->rq_lock);
+			drm_sched_rq_update_fifo_locked(entity,
+							next->submit_ts);
+			spin_unlock(&entity->rq_lock);
+		}
 	}

 	/* Jobs and entities might have different lifecycles. Since we're
@@ -606,10 +610,11 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 		sched = rq->sched;

 		drm_sched_rq_add_entity(rq, entity);
-		spin_unlock(&entity->rq_lock);

 		if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
-			drm_sched_rq_update_fifo(entity, submit_ts);
+			drm_sched_rq_update_fifo_locked(entity, submit_ts);
+
+		spin_unlock(&entity->rq_lock);

 		drm_sched_wakeup(sched);
 	}
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 416590ea0dc3..3609d5a8fecd 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -169,14 +169,15 @@ static inline void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *enti
 	}
 }

-void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
+void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, ktime_t ts)
 {
 	/*
 	 * Both locks need to be grabbed, one to protect from entity->rq change
 	 * for entity from within concurrent drm_sched_entity_select_rq and the
 	 * other to update the rb tree structure.
 	 */
-	spin_lock(&entity->rq_lock);
+	lockdep_assert_held(&entity->rq_lock);
+
 	spin_lock(&entity->rq->lock);

 	drm_sched_rq_remove_fifo_locked(entity);
@@ -187,7 +188,6 @@ void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts)
 			       drm_sched_entity_compare_before);

 	spin_unlock(&entity->rq->lock);
-	spin_unlock(&entity->rq_lock);
 }

 /**
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 9c437a057e5d..346a3c261b43 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -593,7 +593,7 @@ void drm_sched_rq_add_entity(struct drm_sched_rq *rq,
 void drm_sched_rq_remove_entity(struct drm_sched_rq *rq,
 				struct drm_sched_entity *entity);

-void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts);
+void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, ktime_t ts);

 int drm_sched_entity_init(struct drm_sched_entity *entity,
 			  enum drm_sched_priority priority,
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
commit f93126f5d55920d1447ef00a3fbe6706f40f53de upstream.
When writing to a drm_sched_entity's run-queue, writers are protected through the lock drm_sched_entity.rq_lock. This naming, however, frequently collides with the separate internal lock of struct drm_sched_rq, resulting in uses like this:
	spin_lock(&entity->rq_lock);
	spin_lock(&entity->rq->lock);
Rename drm_sched_entity.rq_lock to improve readability. While at it, re-order that struct's members to make it more obvious what the lock protects.
v2: * Rename some rq_lock straddlers in kerneldoc, improve commit text. (Philipp)
Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Luben Tuikov <ltuikov89@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Philipp Stanner <pstanner@redhat.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
[pstanner: Fix typo in docstring]
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241016122013.7857-5-tursulin...
(cherry picked from commit f93126f5d55920d1447ef00a3fbe6706f40f53de)
Signed-off-by: Jules Maselbas <jmaselbas@zdiv.net>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 28 ++++++++++++------------
 drivers/gpu/drm/scheduler/sched_main.c   |  2 +-
 include/drm/gpu_scheduler.h              | 21 +++++++++---------
 3 files changed, 26 insertions(+), 25 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 9dbae7b08bc9..089e8ba0435b 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -105,7 +105,7 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
 	/* We start in an idle state. */
 	complete_all(&entity->entity_idle);

-	spin_lock_init(&entity->rq_lock);
+	spin_lock_init(&entity->lock);
 	spsc_queue_init(&entity->job_queue);

 	atomic_set(&entity->fence_seq, 0);
@@ -133,10 +133,10 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity,
 {
 	WARN_ON(!num_sched_list || !sched_list);

-	spin_lock(&entity->rq_lock);
+	spin_lock(&entity->lock);
 	entity->sched_list = sched_list;
 	entity->num_sched_list = num_sched_list;
-	spin_unlock(&entity->rq_lock);
+	spin_unlock(&entity->lock);
 }
 EXPORT_SYMBOL(drm_sched_entity_modify_sched);

@@ -245,10 +245,10 @@ static void drm_sched_entity_kill(struct drm_sched_entity *entity)
 	if (!entity->rq)
 		return;

-	spin_lock(&entity->rq_lock);
+	spin_lock(&entity->lock);
 	entity->stopped = true;
 	drm_sched_rq_remove_entity(entity->rq, entity);
-	spin_unlock(&entity->rq_lock);
+	spin_unlock(&entity->lock);

 	/* Make sure this entity is not used by the scheduler at the moment */
 	wait_for_completion(&entity->entity_idle);
@@ -394,9 +394,9 @@ static void drm_sched_entity_wakeup(struct dma_fence *f,
 void drm_sched_entity_set_priority(struct drm_sched_entity *entity,
 				   enum drm_sched_priority priority)
 {
-	spin_lock(&entity->rq_lock);
+	spin_lock(&entity->lock);
 	entity->priority = priority;
-	spin_unlock(&entity->rq_lock);
+	spin_unlock(&entity->lock);
 }
 EXPORT_SYMBOL(drm_sched_entity_set_priority);

@@ -506,10 +506,10 @@ struct drm_sched_job *drm_sched_entity_pop_job(struct drm_sched_entity *entity)

 		next = to_drm_sched_job(spsc_queue_peek(&entity->job_queue));
 		if (next) {
-			spin_lock(&entity->rq_lock);
+			spin_lock(&entity->lock);
 			drm_sched_rq_update_fifo_locked(entity,
 							next->submit_ts);
-			spin_unlock(&entity->rq_lock);
+			spin_unlock(&entity->lock);
 		}
 	}

@@ -550,14 +550,14 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
 	if (fence && !dma_fence_is_signaled(fence))
 		return;

-	spin_lock(&entity->rq_lock);
+	spin_lock(&entity->lock);
 	sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
 	rq = sched ? sched->sched_rq[entity->priority] : NULL;
 	if (rq != entity->rq) {
 		drm_sched_rq_remove_entity(entity->rq, entity);
 		entity->rq = rq;
 	}
-	spin_unlock(&entity->rq_lock);
+	spin_unlock(&entity->lock);

 	if (entity->num_sched_list == 1)
 		entity->sched_list = NULL;
@@ -598,9 +598,9 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 	struct drm_sched_rq *rq;

 	/* Add the entity to the run queue */
-	spin_lock(&entity->rq_lock);
+	spin_lock(&entity->lock);
 	if (entity->stopped) {
-		spin_unlock(&entity->rq_lock);
+		spin_unlock(&entity->lock);

 		DRM_ERROR("Trying to push to a killed entity\n");
 		return;
@@ -614,7 +614,7 @@ void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
 	if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
 		drm_sched_rq_update_fifo_locked(entity, submit_ts);

-	spin_unlock(&entity->rq_lock);
+	spin_unlock(&entity->lock);

 	drm_sched_wakeup(sched);
 }
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 3609d5a8fecd..d0c3a3d9ca4d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -176,7 +176,7 @@ void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity, ktime_t ts
 	 * for entity from within concurrent drm_sched_entity_select_rq and the
 	 * other to update the rb tree structure.
 	 */
-	lockdep_assert_held(&entity->rq_lock);
+	lockdep_assert_held(&entity->lock);

 	spin_lock(&entity->rq->lock);

diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 346a3c261b43..e78adc7a9195 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -96,14 +96,22 @@ struct drm_sched_entity {
 	 */
 	struct list_head list;

+	/**
+	 * @lock:
+	 *
+	 * Lock protecting the run-queue (@rq) to which this entity belongs,
+	 * @priority and the list of schedulers (@sched_list, @num_sched_list).
+	 */
+	spinlock_t lock;
+
 	/**
 	 * @rq:
 	 *
 	 * Runqueue on which this entity is currently scheduled.
 	 *
 	 * FIXME: Locking is very unclear for this. Writers are protected by
-	 * @rq_lock, but readers are generally lockless and seem to just race
-	 * with not even a READ_ONCE.
+	 * @lock, but readers are generally lockless and seem to just race with
+	 * not even a READ_ONCE.
 	 */
 	struct drm_sched_rq *rq;

@@ -136,17 +144,10 @@ struct drm_sched_entity {
 	 * @priority:
 	 *
 	 * Priority of the entity. This can be modified by calling
-	 * drm_sched_entity_set_priority(). Protected by &rq_lock.
+	 * drm_sched_entity_set_priority(). Protected by @lock.
 	 */
 	enum drm_sched_priority priority;

-	/**
-	 * @rq_lock:
-	 *
-	 * Lock to modify the runqueue to which this entity belongs.
-	 */
-	spinlock_t rq_lock;
-
 	/**
 	 * @job_queue: the list of jobs of this entity.
 	 */
From: Liu01 Tong <Tong.Liu01@amd.com>
commit aa5fc4362fac9351557eb27c745579159a2e4520 upstream.
During process kill, drm_sched_entity_flush() will kill the VM entities. Subsequent job submissions from this process will then fail, but the resources of those jobs are not released and their fences are never signalled, causing tasks to hang and time out.

Fix this by checking the entity status in amdgpu_vm_ready() and avoiding submitting jobs to a stopped entity.
v2: add amdgpu_vm_ready() check before amdgpu_vm_clear_freed() in function amdgpu_cs_vm_handling().
Fixes: 1f02f2044bda ("drm/amdgpu: Avoid extra evict-restore process.")
Signed-off-by: Liu01 Tong <Tong.Liu01@amd.com>
Signed-off-by: Lin.Cao <lincao12@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f101c13a8720c73e67f8f9d511fbbeda95bcedb1)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3a0dc1f487c30521a0fba0b935d0886b9b1984a5)
Signed-off-by: Jules Maselbas <jmaselbas@zdiv.net>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 15 +++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 5df21529b3b1..0316707647d8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1116,6 +1116,9 @@ static int amdgpu_cs_vm_handling(struct amdgpu_cs_parser *p)
 		}
 	}

+	if (!amdgpu_vm_ready(vm))
+		return -EINVAL;
+
 	r = amdgpu_vm_clear_freed(adev, vm, NULL);
 	if (r)
 		return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 37d53578825b..7cfae3fb3097 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -543,11 +543,10 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct amdgpu_vm *vm,
  * Check if all VM PDs/PTs are ready for updates
  *
  * Returns:
- * True if VM is not evicting.
+ * True if VM is not evicting and all VM entities are not stopped
  */
 bool amdgpu_vm_ready(struct amdgpu_vm *vm)
 {
-	bool empty;
 	bool ret;

 	amdgpu_vm_eviction_lock(vm);
@@ -555,10 +554,18 @@ bool amdgpu_vm_ready(struct amdgpu_vm *vm)
 	amdgpu_vm_eviction_unlock(vm);

 	spin_lock(&vm->status_lock);
-	empty = list_empty(&vm->evicted);
+	ret &= list_empty(&vm->evicted);
 	spin_unlock(&vm->status_lock);

-	return ret && empty;
+	spin_lock(&vm->immediate.lock);
+	ret &= !vm->immediate.stopped;
+	spin_unlock(&vm->immediate.lock);
+
+	spin_lock(&vm->delayed.lock);
+	ret &= !vm->delayed.stopped;
+	spin_unlock(&vm->delayed.lock);
+
+	return ret;
 }

 /**
On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
Am I interpreting this mail correctly: you want to get this patch into stable?
Why? It doesn't fix a bug.
P.
On 22.09.25 17:30, Philipp Stanner wrote:
On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
Am I interpreting this mail correctly: you want to get this patch into stable?
Why? It doesn't fix a bug.
Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
Regards, Christian.
On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
On 22.09.25 17:30, Philipp Stanner wrote:
On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
Am I interpreting this mail correctly: you want to get this patch into stable?
Why? It doesn't fix a bug.
Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
Yes, patch #3 fixes a freeze in amdgpu.
We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
I initially modified patch #3 to use .rq_lock instead of .lock, but I didn't feel very confident with this modification. Should I send a new version with a modified patch #3? If so, how should the change be reflected in the commit message? (I initially asked on #kernelnewbies but ended up pulling in the two other patches.)
Best, Jules
On Mon, 2025-09-22 at 22:50 +0200, Jules Maselbas wrote:
On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
On 22.09.25 17:30, Philipp Stanner wrote:
On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
Am I interpreting this mail correctly: you want to get this patch into stable?
Why? It doesn't fix a bug.
Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
Yes patch #3 fixes a freeze in amdgpu
We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
I initially modified patch #3 to use .rq_lock instead of .lock, but I didn't feel very confident with this modification. Should I send a new version with a modified patch #3? If so, how should the change be reflected in the commit message? (I initially asked on #kernelnewbies but ended up pulling in the two other patches.)
You know folks, situations like that are why we want to strongly discourage accessing another API's struct members directly. There is no API contract for them.
And a proper API function rarely changes its interface, and if it does, it's easy to find for the contributor where drivers need to be adjusted. If we were all following that rule, you wouldn't even have to bother with patches #1 and #2.
That said, I see two proper solutions for your problem:
A. amdgpu is the one stopping the entities anyways, isn't it? It knows which entities it has killed. So that information could be stored in struct amdgpu_vm. B. Add an API: drm_sched_entity_is_stopped(). There's also drm_sched_entity_is_idle(), but I guess that won't serve your purpose?
And btw, as we're at it: @Christian: Danilo and I recently asked about whether entities can still outlive their scheduler in amdgpu? That seems to be the reason why that race-"fix" in drm_sched_fini() was added, which is the only other place that can mark an entity as stopped, except for the proper place: drm_sched_entity_kill().
P.
On 23.09.25 14:08, Philipp Stanner wrote:
On Mon, 2025-09-22 at 22:50 +0200, Jules Maselbas wrote:
On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
On 22.09.25 17:30, Philipp Stanner wrote:
On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
You know folks, situations like that are why we want to strongly discourage accessing another API's struct members directly. There is no API contract for them.
And a proper API function rarely changes its interface, and if it does, it's easy to find for the contributor where drivers need to be adjusted. If we were all following that rule, you wouldn't even have to bother with patches #1 and #2.
That said, I see two proper solutions for your problem:
A. amdgpu is the one stopping the entities anyways, isn't it? It knows which entities it has killed. So that information could be stored in struct amdgpu_vm.
No, it's the scheduler which decides when entities are stopped.
Otherwise we would need to re-invent the flush logic for every driver again.
B. Add an API: drm_sched_entity_is_stopped(). There's also drm_sched_entity_is_idle(), but I guess that won't serve your purpose?
drm_sched_entity_is_stopped() should do it. drm_sched_entity_is_idle() is something different and should potentially even not be exported to drivers in the first place.
And btw, as we're at it: @Christian: Danilo and I recently asked about whether entities can still outlive their scheduler in amdgpu?
That should have been fixed by now. This happened only on hot-unplug and that was re-designed quite a bit.
That seems to be the reason why that race-"fix" in drm_sched_fini() was added, which is the only other place that can mark an entity as stopped, except for the proper place: drm_sched_entity_kill().
That is potentially still good to have.
Regards, Christian.
On Tue, 2025-09-23 at 14:33 +0200, Christian König wrote:
On 23.09.25 14:08, Philipp Stanner wrote:
On Mon, 2025-09-22 at 22:50 +0200, Jules Maselbas wrote:
On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
On 22.09.25 17:30, Philipp Stanner wrote:
On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
From: Tvrtko Ursulin tvrtko.ursulin@igalia.com
commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
In FIFO mode (which is the default), both drm_sched_entity_push_job() and drm_sched_rq_update_fifo(), where the latter calls the former, are currently taking and releasing the same entity->rq_lock.
We can avoid that design inelegance, and also have a miniscule efficiency improvement on the submit from idle path, by introducing a new drm_sched_rq_update_fifo_locked() helper and pulling up the lock taking to its callers.
v2: * Remove drm_sched_rq_update_fifo() altogether. (Christian)
v3: * Improved commit message. (Philipp)
Signed-off-by: Tvrtko Ursulin tvrtko.ursulin@igalia.com Cc: Christian König christian.koenig@amd.com Cc: Alex Deucher alexander.deucher@amd.com Cc: Luben Tuikov ltuikov89@gmail.com Cc: Matthew Brost matthew.brost@intel.com Cc: Philipp Stanner pstanner@redhat.com Reviewed-by: Christian König christian.koenig@amd.com Signed-off-by: Philipp Stanner pstanner@redhat.com Link: https://patchwork.freedesktop.org/patch/msgid/20241016122013.7857-2-tursulin... (cherry picked from commit d42a254633c773921884a19e8a1a0f53a31150c3) Signed-off-by: Jules Maselbas jmaselbas@zdiv.net
Am I interpreting this mail correctly: you want to get this patch into stable?
Why? It doesn't fix a bug.
Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
Yes patch #3 fixes a freeze in amdgpu
We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
I initially modified patch #3 to use .rq_lock instead of .lock, but I didn't feel very confident with this modification. Should I send a new version with a modified patch #3? If so, how should the change be reflected in the commit message? (I initially asked on #kernelnewbies but ended up pulling in the two other patches)
You know folks, situations like that are why we want to strongly discourage accessing another API's struct members directly. There is no API contract for them.
And a proper API function rarely changes its interface, and if it does, it's easy to find for the contributor where drivers need to be adjusted. If we were all following that rule, you wouldn't even have to bother with patches #1 and #2.
That said, I see two proper solutions for your problem:
A. amdgpu is the one stopping the entities anyways, isn't it? It knows which entities it has killed. So that information could be stored in struct amdgpu_vm.
No, it's the scheduler which decides when entities are stopped.
The scheduler sets the stopped-flag, but that effectively only happens when you either flush() or fini() the entity. OR if you run into that drm_sched_fini() race.
Otherwise we would need to re-invent the flush logic for every driver again.
Let's ask differently: Does amdgpu check here whether drm_sched_entity_fini() or drm_sched_entity_flush() have been called on those entities already?
B. Add an API: drm_sched_entity_is_stopped(). There's also drm_sched_entity_is_idle(), but I guess that won't serve your purpose?
drm_sched_entity_is_stopped() should do it. drm_sched_entity_is_idle() is something different and should potentially even not be exported to drivers in the first place.
Fine by me.
And btw, as we're at it: @Christian: Danilo and I recently asked about whether entities can still outlive their scheduler in amdgpu?
That should have been fixed by now. This happened only on hot-unplug and that was re-designed quite a bit.
That seems to be the reason why that race-"fix" in drm_sched_fini() was added, which is the only other place that can mark an entity as stopped, except for the proper place: drm_sched_entity_kill().
That is potentially still good to have.
That's why we left it for now and just added a FIXME, because there's not really any benefit in potentially blowing up drivers by removing it (well, technically blowing up drivers like that would reveal significant lifetime and, thus, design issues. But it wouldn't be "nice").
Still, it's a clear sign of (undocumented…) scheduler lifetimes being violated :(
P.
On 23.09.25 15:10, Philipp Stanner wrote:
On Tue, 2025-09-23 at 14:33 +0200, Christian König wrote:
On 23.09.25 14:08, Philipp Stanner wrote:
On Mon, 2025-09-22 at 22:50 +0200, Jules Maselbas wrote:
On Mon Sep 22, 2025 at 7:39 PM CEST, Christian König wrote:
On 22.09.25 17:30, Philipp Stanner wrote:
On Mon, 2025-09-22 at 15:09 +0200, Jules Maselbas wrote:
> From: Tvrtko Ursulin tvrtko.ursulin@igalia.com
>
> commit d42a254633c773921884a19e8a1a0f53a31150c3 upstream.
>
> In FIFO mode (which is the default), both drm_sched_entity_push_job() and
> drm_sched_rq_update_fifo(), where the latter calls the former, are
> currently taking and releasing the same entity->rq_lock.
>
> We can avoid that design inelegance, and also have a miniscule
> efficiency improvement on the submit from idle path, by introducing a new
> drm_sched_rq_update_fifo_locked() helper and pulling up the lock taking to
> its callers.
>
> v2:
> * Remove drm_sched_rq_update_fifo() altogether. (Christian)
>
> v3:
> * Improved commit message. (Philipp)
>
> Signed-off-by: Tvrtko Ursulin tvrtko.ursulin@igalia.com
> Cc: Christian König christian.koenig@amd.com
> Cc: Alex Deucher alexander.deucher@amd.com
> Cc: Luben Tuikov ltuikov89@gmail.com
> Cc: Matthew Brost matthew.brost@intel.com
> Cc: Philipp Stanner pstanner@redhat.com
> Reviewed-by: Christian König christian.koenig@amd.com
> Signed-off-by: Philipp Stanner pstanner@redhat.com
> Link: https://patchwork.freedesktop.org/patch/msgid/20241016122013.7857-2-tursulin...
> (cherry picked from commit d42a254633c773921884a19e8a1a0f53a31150c3)
> Signed-off-by: Jules Maselbas jmaselbas@zdiv.net
Am I interpreting this mail correctly: you want to get this patch into stable?
Why? It doesn't fix a bug.
Patch #3 in this series depends on the other two, but I agree that isn't a good idea.
Yes patch #3 fixes a freeze in amdgpu
We should just adjust patch #3 to apply on the older kernel as well instead of backporting patches #1 and #2.
I initially modified patch #3 to use .rq_lock instead of .lock, but I didn't feel very confident with this modification. Should I send a new version with a modified patch #3? If so, how should the change be reflected in the commit message? (I initially asked on #kernelnewbies but ended up pulling in the two other patches)
You know folks, situations like that are why we want to strongly discourage accessing another API's struct members directly. There is no API contract for them.
And a proper API function rarely changes its interface, and if it does, it's easy to find for the contributor where drivers need to be adjusted. If we were all following that rule, you wouldn't even have to bother with patches #1 and #2.
That said, I see two proper solutions for your problem:
A. amdgpu is the one stopping the entities anyways, isn't it? It knows which entities it has killed. So that information could be stored in struct amdgpu_vm.
No, it's the scheduler which decides when entities are stopped.
The scheduler sets the stopped-flag, but that effectively only happens when you either flush() or fini() the entity. OR if you run into that drm_sched_fini() race.
Otherwise we would need to re-invent the flush logic for every driver again.
Let's ask differently: Does amdgpu check here whether drm_sched_entity_fini() or drm_sched_entity_flush() have been called on those entities already?
No, and it should not.
drm_entity_flush() can be called many times (and even concurrently) on the same entity.
Only when the scheduler sees that it is called by the last submitter of jobs *and* that this submitter was terminated by a SIGKILL is the entity killed as well.
Background is that the file flush callback this is used with is basically called all the time. And, as we have now found once more, also because userspace forgot to set CLOEXEC.
B. Add an API: drm_sched_entity_is_stopped(). There's also drm_sched_entity_is_idle(), but I guess that won't serve your purpose?
drm_sched_entity_is_stopped() should do it. drm_sched_entity_is_idle() is something different and should potentially even not be exported to drivers in the first place.
Fine by me.
And btw, as we're at it: @Christian: Danilo and I recently asked about whether entities can still outlive their scheduler in amdgpu?
That should have been fixed by now. This happened only on hot-unplug and that was re-designed quite a bit.
That seems to be the reason why that race-"fix" in drm_sched_fini() was added, which is the only other place that can mark an entity as stopped, except for the proper place: drm_sched_entity_kill().
That is potentially still good to have.
That's why we left it for now and just added a FIXME, because there's not really any benefit in potentially blowing up drivers by removing it (well, technically blowing up drivers like that would reveal significant lifetime and, thus, design issues. But it wouldn't be "nice").
Still, it's a clear sign of (undocumented…) scheduler lifetimes being violated :(
Yeah, I know :(
Regards, Christian.
On Tue Sep 23, 2025 at 2:33 PM CEST, Christian König wrote:
On 23.09.25 14:08, Philipp Stanner wrote:
You know folks, situations like that are why we want to strongly discourage accessing another API's struct members directly. There is no API contract for them.
Indeed, please don't peek API internals. If you need additional functionality, please send a patch adding a supported API for the component instead.
Drivers messing with component internals makes it impossible to maintain them in the long term.
And a proper API function rarely changes its interface, and if it does, it's easy to find for the contributor where drivers need to be adjusted. If we were all following that rule, you wouldn't even have to bother with patches #1 and #2.
That said, I see two proper solutions for your problem:
A. amdgpu is the one stopping the entities anyways, isn't it? It knows which entities it has killed. So that information could be stored in struct amdgpu_vm.
No, it's the scheduler which decides when entities are stopped.
Can you please show me the code where the scheduler calls any of drm_sched_entity_fini(), drm_sched_entity_flush(), drm_sched_entity_destroy()?
Or are you referring to the broken hack in drm_sched_fini() (introduced by commit c61cdbdbffc1 ("drm/scheduler: Fix hang when sched_entity released")), where it is just ignored that we need to take the entity lock as well, because that would inconveniently lead to lock inversion?
spin_lock(&rq->lock);
list_for_each_entry(s_entity, &rq->entities, list)
	/*
	 * Prevents reinsertion and marks job_queue as idle,
	 * it will be removed from the rq in drm_sched_entity_fini()
	 * eventually
	 */
	s_entity->stopped = true;
spin_unlock(&rq->lock);
The patch description that introduced the hack says:
If scheduler is already stopped by the time sched_entity is released and entity's job_queue not empty I encountered a hang in drm_sched_entity_flush.
But this sounds to me as if amdgpu simply doesn't implement the correct shutdown ordering. Why don't nouveau, Xe and other drivers have this problem? Why do we need to solve it in the scheduler instead?
Maybe there are reasonable answers to that. And assuming there are, it still isn't a justification for building on top of a broken workaround. :(
On 24.09.25 13:00, Danilo Krummrich wrote:
On Tue Sep 23, 2025 at 2:33 PM CEST, Christian König wrote:
On 23.09.25 14:08, Philipp Stanner wrote:
You know folks, situations like that are why we want to strongly discourage accessing another API's struct members directly. There is no API contract for them.
Indeed, please don't peek API internals. If you need additional functionality, please send a patch adding a supported API for the component instead.
Drivers messing with component internals makes it impossible to maintain them in the long term.
And a proper API function rarely changes its interface, and if it does, it's easy to find for the contributor where drivers need to be adjusted. If we were all following that rule, you wouldn't even have to bother with patches #1 and #2.
That said, I see two proper solutions for your problem:
A. amdgpu is the one stopping the entities anyways, isn't it? It knows which entities it has killed. So that information could be stored in struct amdgpu_vm.
No, it's the scheduler which decides when entities are stopped.
Can you please show me the code where the scheduler calls any of drm_sched_entity_fini(), drm_sched_entity_flush(), drm_sched_entity_destroy()?
It's this code here in drm_sched_entity_flush() which decides if an entity should be killed or not:
/* For a killed process disallow further enqueueing of jobs. */
last_user = cmpxchg(&entity->last_user, current->group_leader, NULL);
if ((!last_user || last_user == current->group_leader) &&
    (current->flags & PF_EXITING) &&
    (current->exit_code == SIGKILL))
	drm_sched_entity_kill(entity);
Regards, Christian.
linux-stable-mirror@lists.linaro.org