Hi,
On 30/04/25 17:51, Maíra Canal wrote:
When a CL/CSD job times out, we check if the GPU has made any progress since the last timeout. If so, instead of resetting the hardware, we skip the reset and let the timer get rearmed. This gives long-running jobs a chance to complete.
However, when `timedout_job()` is called, the job in question is removed from the pending list, which means it won't be automatically freed through `free_job()`. Consequently, when we skip the reset and keep the job running, the job won't be freed when it finally completes.
This situation leads to a memory leak, as exposed in [1] and [2].
Similarly to commit 704d3d60fec4 ("drm/etnaviv: don't block scheduler when GPU is still active"), this patch ensures the job is put back on the pending list when extending the timeout.
Cc: stable@vger.kernel.org # 6.0
Reported-by: Daivik Bhatia <dtgs1208@gmail.com>
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227 [1]
Closes: https://github.com/raspberrypi/linux/issues/6817 [2]
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
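To illustrate the requeue described in the commit message, here is a minimal userspace model in C. It is only a sketch, not the driver code: every `toy_*` name is hypothetical, the pending list is a plain singly linked list, and a single counter stands in for whatever progress registers the driver actually compares. Real kernel code would also need to take the scheduler's list lock around the requeue.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for a job and a scheduler; NOT the real DRM structures. */
struct toy_job {
	struct toy_job *next;
	bool on_pending_list;
};

struct toy_sched {
	struct toy_job *pending;	/* head of the pending list */
};

enum toy_stat { TOY_TIMER_REARMED, TOY_GPU_RESET };

/* Models the scheduler core removing the job before calling the handler. */
static void toy_unlink(struct toy_sched *s, struct toy_job *job)
{
	struct toy_job **p = &s->pending;

	while (*p && *p != job)
		p = &(*p)->next;
	if (*p) {
		*p = job->next;
		job->next = NULL;
		job->on_pending_list = false;
	}
}

/* Puts the job back at the head of the pending list. */
static void toy_requeue(struct toy_sched *s, struct toy_job *job)
{
	job->next = s->pending;
	s->pending = job;
	job->on_pending_list = true;
}

/*
 * Models the timeout handler. If the progress counter moved since the
 * previous timeout, skip the reset and requeue the job so that it can
 * still be freed on completion. Otherwise report that a reset is needed.
 */
static enum toy_stat toy_timedout_job(struct toy_sched *s, struct toy_job *job,
				      unsigned int *last_seen, unsigned int now)
{
	toy_unlink(s, job);	/* done by the core before the callback */

	if (*last_seen != now) {
		*last_seen = now;
		toy_requeue(s, job);	/* the step this patch adds */
		return TOY_TIMER_REARMED;
	}

	return TOY_GPU_RESET;
}
```

Without the `toy_requeue()` call in the progress branch, the job would stay off the pending list after the handler returns, which is the situation that leads to the leak.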
Hi,
While we typically strive to avoid exposing the scheduler's internals within the drivers, I'm proposing this fix as an interim solution. I'm aware that a comprehensive fix will need some adjustments on the DRM sched side, and I plan to address that soon.
However, it would be hard to justify backporting such patches to the stable branches, and this issue is affecting users at the moment. Therefore, I'd like to push this patch to drm-misc-fixes in order to address this leak as soon as possible, while working on a more generic solution.
Applied to misc/kernel.git (drm-misc-fixes).
Best Regards,
- Maíra