On Mon, 2024-09-09 at 13:37 +0100, Tvrtko Ursulin wrote:
On 09/09/2024 13:18, Christian König wrote:
Am 09.09.24 um 14:13 schrieb Philipp Stanner:
On Mon, 2024-09-09 at 13:29 +0200, Christian König wrote:
Am 09.09.24 um 11:44 schrieb Philipp Stanner:
On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
From: Tvrtko Ursulin tvrtko.ursulin@igalia.com
Without the locking amdgpu currently can race amdgpu_ctx_set_entity_priority() and drm_sched_job_arm(),
I would explicitly say "amdgpu's amdgpu_ctx_set_entity_priority() races through drm_sched_entity_modify_sched() with drm_sched_job_arm()".
The actual issue then seems to be drm_sched_job_arm() calling drm_sched_entity_select_rq(). I would mention that, too.
leading to the latter accesing potentially inconsitent entity->sched_list and entity->num_sched_list pair.
The comment on drm_sched_entity_modify_sched() however says:
""" * Note that this must be called under the same common lock for @entity as * drm_sched_job_arm() and drm_sched_entity_push_job(), or the driver needs to * guarantee through some other means that this is never called while new jobs * can be pushed to @entity. """
It is unclear if that is referring to this race or something else.
That comment is indeed a bit awkward. Both drm_sched_entity_push_job() and drm_sched_job_arm() take rq_lock. But drm_sched_entity_modify_sched() doesn't.
The comment was written in 981b04d968561. Interestingly, in drm_sched_entity_push_job(), this "common lock" is mentioned with the soft requirement word "should" and apparently is more about keeping sequence numbers in order when inserting.
I tend to think that the issue discovered by you is unrelated to that comment. But if no one can make sense of the comment, should it maybe be removed? Confusing comment is arguably worse than no comment
Agree, we probably mixed up in 981b04d968561 that submission needs a common lock and that rq/priority needs to be protected by the rq_lock.
There is also the big FIXME in the drm_sched_entity documentation pointing out that this is most likely not implemented correctly.
I suggest to move the rq, priority and rq_lock fields together in the drm_sched_entity structure and document that rq_lock is protecting the two.
That could also be a great opportunity for improving the lock naming:
Well that comment made me laugh because I point out the same when the scheduler came out ~8years ago and nobody cared about it since then.
But yeah completely agree :)
Maybe, but we need to keep in sight the fact some of these fixes may be good to backport. In which case re-naming exercises are best left to follow.
My argument basically. It's good if fixes and other improvements are separated, in general, unless there is a practical / good reason not to.
Also..
void drm_sched_rq_update_fifo(struct drm_sched_entity *entity, ktime_t ts) { /* * Both locks need to be grabbed, one to protect from entity-
rq
change * for entity from within concurrent drm_sched_entity_select_rq and the * other to update the rb tree structure. */ spin_lock(&entity->rq_lock); spin_lock(&entity->rq->lock);
.. I agree this is quite unredable and my initial reaction was a similar ugh. However.. What names would you guys suggest and for what to make this better and not lessen the logic of naming each individually?
According to the documentation, drm_sched_rq.lock does not protect the entire runque, but "@lock: to modify the entities list."
So I would keep drm_sched_entity.rq_lock as it is, because it indeed protects the entire runqueue.
And drm_sched_rq.lock could be named "entities_lock" or "entities_list_lock" or something. That's debatable, but it should be something that highlights that this lock is not for locking the entire runque as the one in the entity apparently is.
Cheers, P.
Regards,
Tvrtko
[...]
P.
Then audit the code if all users of rq and priority actually hold the correct locks while reading and writing them.
Regards, Christian.
P.
Signed-off-by: Tvrtko Ursulin tvrtko.ursulin@igalia.com Fixes: b37aced31eb0 ("drm/scheduler: implement a function to modify sched list") Cc: Christian König christian.koenig@amd.com Cc: Alex Deucher alexander.deucher@amd.com Cc: Luben Tuikov ltuikov89@gmail.com Cc: Matthew Brost matthew.brost@intel.com Cc: David Airlie airlied@gmail.com Cc: Daniel Vetter daniel@ffwll.ch Cc: dri-devel@lists.freedesktop.org Cc: stable@vger.kernel.org # v5.7+
drivers/gpu/drm/scheduler/sched_entity.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c index 58c8161289fe..ae8be30472cd 100644 --- a/drivers/gpu/drm/scheduler/sched_entity.c +++ b/drivers/gpu/drm/scheduler/sched_entity.c @@ -133,8 +133,10 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, { WARN_ON(!num_sched_list || !sched_list); + spin_lock(&entity->rq_lock); entity->sched_list = sched_list; entity->num_sched_list = num_sched_list; + spin_unlock(&entity->rq_lock); } EXPORT_SYMBOL(drm_sched_entity_modify_sched);