Re: [RFC 1/4] drm/sched: Add locking to drm_sched_entity_modify_sched

9 Sep 2024

Am 09.09.24 um 11:44 schrieb Philipp Stanner:
...
On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
...
From: Tvrtko Ursulin tvrtko.ursulin@igalia.com
Without the locking amdgpu currently can race
amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority() races
through drm_sched_entity_modify_sched() with drm_sched_job_arm()".
The actual issue then seems to be drm_sched_job_arm() calling
drm_sched_entity_select_rq(). I would mention that, too.
...
leading to the
latter accesing potentially inconsitent entity->sched_list and
entity->num_sched_list pair.
The comment on drm_sched_entity_modify_sched() however says:
"""
  * Note that this must be called under the same common lock for
@entity as
  * drm_sched_job_arm() and drm_sched_entity_push_job(), or the driver
needs to
  * guarantee through some other means that this is never called while
new jobs
  * can be pushed to @entity.
"""
It is unclear if that is referring to this race or something else.
That comment is indeed a bit awkward. Both drm_sched_entity_push_job()
and drm_sched_job_arm() take rq_lock. But
drm_sched_entity_modify_sched() doesn't.
The comment was written in 981b04d968561. Interestingly, in
drm_sched_entity_push_job(), this "common lock" is mentioned with the
soft requirement word "should" and apparently is more about keeping
sequence numbers in order when inserting.
I tend to think that the issue discovered by you is unrelated to that
comment. But if no one can make sense of the comment, should it maybe
be removed? Confusing comment is arguably worse than no comment
Agree, we probably mixed up in 981b04d968561 that submission needs a 
common lock and that rq/priority needs to be protected by the rq_lock.
There is also the big FIXME in the drm_sched_entity documentation 
pointing out that this is most likely not implemented correctly.
I suggest to move the rq, priority and rq_lock fields together in the 
drm_sched_entity structure and document that rq_lock is protecting the two.
Then audit the code if all users of rq and priority actually hold the 
correct locks while reading and writing them.
Regards,
Christian.
...
P.
...
Signed-off-by: Tvrtko Ursulin tvrtko.ursulin@igalia.com
Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify
sched list")
Cc: Christian König christian.koenig@amd.com
Cc: Alex Deucher alexander.deucher@amd.com
Cc: Luben Tuikov ltuikov89@gmail.com
Cc: Matthew Brost matthew.brost@intel.com
Cc: David Airlie airlied@gmail.com
Cc: Daniel Vetter daniel@ffwll.ch
Cc: dri-devel@lists.freedesktop.org
Cc: stable@vger.kernel.org # v5.7+

drivers/gpu/drm/scheduler/sched_entity.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c
b/drivers/gpu/drm/scheduler/sched_entity.c
index 58c8161289fe..ae8be30472cd 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct
drm_sched_entity *entity,
  {
  	WARN_ON(!num_sched_list || !sched_list);

spin_lock(&entity->rq_lock);

entity->sched_list = sched_list;
  	entity->num_sched_list = num_sched_list;

spin_unlock(&entity->rq_lock);

}
  EXPORT_SYMBOL(drm_sched_entity_modify_sched);

    

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [RFC 1/4] drm/sched: Add locking to drm_sched_entity_modify_sched