This series addresses GPU reset issues reported in [1], where running a long compute job would trigger repeated GPU resets, leading to a UI freeze.
Patches #1 and #2 prevent the same faulty job from being resubmitted in a loop, mitigating the first cause of the issue.
However, the issue isn't entirely solved. Even with only a single GPU reset, the UI still freezes on the Raspberry Pi 5, indicating a GPU hang. Patches #3 to #6 address this by properly configuring the V3D_SMS registers, which are required for power management and resets in V3D 7.1.
Patch #7 updates the DT maintainership, replacing Emma with the current v3d driver maintainer.
[1] https://github.com/raspberrypi/linux/issues/6660
Best Regards,
- Maíra
---
v1 -> v2:
- [1/6, 2/6, 5/6] Add Iago's R-b (Iago Toral)
- [3/6] Use V3D_GEN_* macros consistently throughout the driver (Phil Elwell)
- [3/6] Don't add Iago's R-b in 3/6 due to changes in the patch
- [4/6] Add per-compatible restrictions to enforce per-SoC register rules (Conor Dooley)
- [6/6] Add Emma's A-b, collected through IRC (Emma Anholt)
- [6/6] Add Rob's A-b (Rob Herring)
- Link to v1: https://lore.kernel.org/r/20250226-v3d-gpu-reset-fixes-v1-0-83a969fdd9c1@iga...

v2 -> v3:
- [3/7] Add Iago's R-b (Iago Toral)
- [4/7, 5/7] Separate the patches to ease the reviewing process
  -> Now, PATCH 4/7 only adds the per-compatible rules and PATCH 5/7 adds the SMS registers
- [4/7] `allOf` goes above `additionalProperties` (Krzysztof Kozlowski)
- [4/7, 5/7] Sync `reg` and `reg-names` items (Krzysztof Kozlowski)
- Link to v2: https://lore.kernel.org/r/20250308-v3d-gpu-reset-fixes-v2-0-2939c30f0cc4@iga...
---
Maíra Canal (7):
      drm/v3d: Don't run jobs that have errors flagged in its fence
      drm/v3d: Set job pointer to NULL when the job's fence has an error
      drm/v3d: Associate a V3D tech revision to all supported devices
      dt-bindings: gpu: v3d: Add per-compatible register restrictions
      dt-bindings: gpu: v3d: Add SMS register to BCM2712 compatible
      drm/v3d: Use V3D_SMS registers for power on/off and reset on V3D 7.x
      dt-bindings: gpu: Add V3D driver maintainer as DT maintainer

 .../devicetree/bindings/gpu/brcm,bcm-v3d.yaml      |  77 +++++++++++--
 drivers/gpu/drm/v3d/v3d_debugfs.c                  | 126 ++++++++++-----------
 drivers/gpu/drm/v3d/v3d_drv.c                      |  62 +++++++++-
 drivers/gpu/drm/v3d/v3d_drv.h                      |  22 +++-
 drivers/gpu/drm/v3d/v3d_gem.c                      |  27 ++++-
 drivers/gpu/drm/v3d/v3d_irq.c                      |   6 +-
 drivers/gpu/drm/v3d/v3d_perfmon.c                  |   4 +-
 drivers/gpu/drm/v3d/v3d_regs.h                     |  26 +++++
 drivers/gpu/drm/v3d/v3d_sched.c                    |  29 ++++-
 9 files changed, 281 insertions(+), 98 deletions(-)
---
base-commit: 9e75b6ef407fee5d4ed8021cd7ddd9d6a8f7b0e8
change-id: 20250224-v3d-gpu-reset-fixes-2d21fc70711d
The V3D driver still relies on `drm_sched_increase_karma()` and `drm_sched_resubmit_jobs()` for resubmissions when a timeout occurs. The function `drm_sched_increase_karma()` marks the job as guilty, while `drm_sched_resubmit_jobs()` sets an error (-ECANCELED) in the DMA fence of that guilty job.
Because of this, we must check whether the job’s DMA fence has been flagged with an error before executing the job. Otherwise, the same guilty job may be resubmitted indefinitely, causing repeated GPU resets.
This patch adds a check for an error on the job's fence to prevent running a guilty job that was previously flagged when the GPU timed out.
Note that the CPU and CACHE_CLEAN queues do not require this check, as their jobs are executed synchronously once the DRM scheduler starts them.
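For readers who want to see the idea outside the kernel, here is a minimal userspace sketch of the check described above. It is only an illustration under stated assumptions: `mock_fence`, `mock_job`, `mock_flag_guilty()`, `mock_job_run()` and `mock_example()` are hypothetical names invented for this example, not DRM scheduler or V3D APIs, and the real structures carry far more state.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the kernel structures. */
struct mock_fence {
	int error;	/* 0 when healthy, negative errno once flagged */
};

struct mock_job {
	struct mock_fence finished;	/* plays the role of the job's finished fence */
	int times_run;			/* how often the job reached the "hardware" */
};

/* Assumption: models the scheduler setting -ECANCELED on a guilty job's fence. */
static void mock_flag_guilty(struct mock_job *job)
{
	job->finished.error = -ECANCELED;
}

/*
 * Models a run_job callback: bail out before touching the hardware when
 * the fence already carries an error, so a guilty job is not run again.
 */
static struct mock_fence *mock_job_run(struct mock_job *job)
{
	if (job->finished.error)
		return NULL;

	job->times_run++;
	return &job->finished;
}

/* Walks through one timeout and one resubmission attempt. */
void mock_example(void)
{
	struct mock_job job = { .finished = { .error = 0 }, .times_run = 0 };

	assert(mock_job_run(&job) != NULL);	/* first submission runs */
	mock_flag_guilty(&job);			/* timeout: job is marked guilty */
	assert(mock_job_run(&job) == NULL);	/* resubmission is skipped */
	assert(job.times_run == 1);		/* the job only ran once */
}
```

Without the early return in `mock_job_run()`, the second call would run the job again, which is the loop of resets this patch is meant to break.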
Cc: stable@vger.kernel.org
Fixes: d223f98f0209 ("drm/v3d: Add support for compute shader dispatch.")
Fixes: 1584f16ca96e ("drm/v3d: Add support for submitting jobs to the TFU.")
Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>
Signed-off-by: Maíra Canal <mcanal@igalia.com>
---
 drivers/gpu/drm/v3d/v3d_sched.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 80466ce8c7df669280e556c0793490b79e75d2c7..c2010ecdb08f4ba3b54f7783ed33901552d0eba1 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -327,11 +327,15 @@ v3d_tfu_job_run(struct drm_sched_job *sched_job)
 	struct drm_device *dev = &v3d->drm;
 	struct dma_fence *fence;
 
+	if (unlikely(job->base.base.s_fence->finished.error))
+		return NULL;
+
+	v3d->tfu_job = job;
+
 	fence = v3d_fence_create(v3d, V3D_TFU);
 	if (IS_ERR(fence))
 		return NULL;
 
-	v3d->tfu_job = job;
 	if (job->base.irq_fence)
 		dma_fence_put(job->base.irq_fence);
 	job->base.irq_fence = dma_fence_get(fence);
@@ -369,6 +373,9 @@ v3d_csd_job_run(struct drm_sched_job *sched_job)
 	struct dma_fence *fence;
 	int i, csd_cfg0_reg;
 
+	if (unlikely(job->base.base.s_fence->finished.error))
+		return NULL;
+
 	v3d->csd_job = job;
 
 	v3d_invalidate_caches(v3d);
On Tue, Mar 11, 2025 at 03:13:42PM -0300, Maíra Canal wrote:
> This series addresses GPU reset issues reported in [1], where running a long compute job would trigger repeated GPU resets, leading to a UI freeze.
>
> Patches #1 and #2 prevent the same faulty job from being resubmitted in a loop, mitigating the first cause of the issue.
>
> However, the issue isn't entirely solved. Even with only a single GPU reset, the UI still freezes on the Raspberry Pi 5, indicating a GPU hang. Patches #3 to #6 address this by properly configuring the V3D_SMS registers, which are required for power management and resets in V3D 7.1.
Not sure how much it helps your case, but still leaving it here in case it turns out to be useful here. It's already in -next and trending 6.15 merge.
https://patchwork.freedesktop.org/series/138070/
Raag