 
            On 13/10/2025 15:39, Philipp Stanner wrote:
On Fri, 2025-10-03 at 10:26 +0100, Tvrtko Ursulin wrote:
Drm_sched_job_add_dependency() consumes the fence reference both on
s/D/d
I cannot find the source right now but I am 99% sure someone pulled it on me in the past and educated me sentences should always start with a capital letter, even when it is not a proper word. Because in the past I was always doing the same as you suggest. I don't mind TBH.
success and failure, so in the latter case the dma_fence_put() on the error path (xarray failed to expand) is a double free.
Interestingly this bug appears to have been present ever since ebd5f74255b9 ("drm/sched: Add dependency tracking"), since the code back then looked like this:
drm_sched_job_add_implicit_dependencies(): ... for (i = 0; i < fence_count; i++) { ret = drm_sched_job_add_dependency(job, fences[i]); if (ret) break; }
for (; i < fence_count; i++) dma_fence_put(fences[i]);
Which means for the failing 'i' the dma_fence_put was already a double free. Possibly there were no users at that time, or the test cases were insufficient to hit it.
The bug was then only noticed and fixed after 9c2ba265352a ("drm/scheduler: use new iterator in drm_sched_job_add_implicit_dependencies v2") landed, with its fixup of 4eaf02d6076c ("drm/scheduler: fix drm_sched_job_add_implicit_dependencies").
At that point it was a slightly different flavour of a double free, which 963d0b356935 ("drm/scheduler: fix drm_sched_job_add_implicit_dependencies harder") noticed and attempted to fix.
But it only moved the double free from happening inside the drm_sched_job_add_dependency(), when releasing the reference not yet obtained, to the caller, when releasing the reference already released by the former in the failure case.
That's certainly interesting, but is there a specific reason why you include all of that?
Yes, because bug was attempted to be fixed two times already. So it is IMO worth having a write up on the history.
The code is as is, and AFAICS it's just a bug stemming from original bugs present and then refactorings happening.
I would at least remove the old 'implicit_dependencies' function from the commit message. It's just confusing and makes one look for that in the current code or patch.It does say "back then" and it references the commit so shouldn't be
confusing.
The discussion is long for an additional reason that Fixes: targets not the original commit which did not get this right, but the last change which tried to fix it but did not. It sounded more logical to continue the chain on fixes but dunno. Could replace it with the original one just as well.
As such it is not easy to identify the right target for the fixes tag so lets keep it simple and just continue the chain.
We also drop the misleading comment about additional reference, since it is not additional but the only one from the point of view of dependency tracking.
IMO that comment is nonsense. It's useless, too, because I can *see* that a reference is being taken there, but not *why*.
Argh, these comments. See also my commit 72ebc18b34993
Anyways. Removing it is fine, but adding a better comment is better. See below.
Signed-off-by: Tvrtko Ursulin tvrtko.ursulin@igalia.com Fixes: 963d0b356935 ("drm/scheduler: fix drm_sched_job_add_implicit_dependencies harder") Reported-by: Dan Carpenter dan.carpenter@linaro.org
Is there an error report that could be included here with a Closes: tag?No, it did not come from any such system. Could perhaps put a link to
the report as Reference: https://lore.kernel.org/dri-devel/aNFbXq8OeYl3QSdm@stanley.mountain/
?
Cc: Christian König christian.koenig@amd.com Cc: Rob Clark robdclark@chromium.org Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Matthew Brost matthew.brost@intel.com Cc: Danilo Krummrich dakr@kernel.org Cc: Philipp Stanner phasta@kernel.org Cc: "Christian König" ckoenig.leichtzumerken@gmail.com Cc: dri-devel@lists.freedesktop.org Cc: stable@vger.kernel.org # v5.16+
drivers/gpu/drm/scheduler/sched_main.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 46119aacb809..aff34240f230 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -960,20 +960,16 @@ int drm_sched_job_add_resv_dependencies(struct drm_sched_job *job, { struct dma_resv_iter cursor; struct dma_fence *fence;
- int ret;
- int ret = 0;
dma_resv_assert_held(resv); dma_resv_for_each_fence(&cursor, resv, usage, fence) {
/* Make sure to grab an additional ref on the added fence */
dma_fence_get(fence);
ret = drm_sched_job_add_dependency(job, fence);
if (ret) {
dma_fence_put(fence);
return ret;
}
ret = drm_sched_job_add_dependency(job, dma_fence_get(fence));You still take a reference as before, but there is no comment anymore. Can you add one explaining why a new reference is taken here?
I guess it will be something like "This needs a new reference for the job", since you cannot rely on the one from resv.
I was mulling it over but then thought the kerneldoc for drm_sched_job_add_dependency() is good enough since it explains the function always consumes the reference and dma_resv_for_each_fence() does not hold one over the iteration. I can add it.
if (ret)
break;}
- return 0;
- return ret;
That's an unnecessarily enlargement of the git diff because of style, isn't it? Better keep the diff minimal here for git blame.
Given the relative size of the diff versus the function itself it did not look like a big deal to reduce to one return statement while touching it, but I can keep two returns just as well, no problem.
Regards,
Tvrtko