As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/

Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
         struct dma_fence *fence = NULL;
         int r;
 
-        /* Ignore soft recovered fences here */
         r = drm_sched_entity_error(s_entity);
-        if (r && r != -ENODATA)
+        if (r)
                 goto error;
 
         if (!fence && job->gang_submit)
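For context (illustrative only, not part of this patch): once the error is forwarded, the submitting process can notice that its context is poisoned and recreate it instead of resubmitting the same hanging work until the hard reset kicks in. Below is a rough userspace-side sketch using libdrm's amdgpu wrappers; the exact errno returned for a poisoned context depends on the kernel path, so only the general shape matters here.

#include <stdint.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

/* Hedged sketch, not Mesa/RADV code: submit, and if the kernel rejects the
 * submission, ask whether this context was flagged guilty and recreate it. */
static int submit_or_recreate(amdgpu_device_handle dev,
                              amdgpu_context_handle *ctx,
                              struct amdgpu_cs_request *req)
{
        int r = amdgpu_cs_submit(*ctx, 0, req, 1);

        if (r) {
                uint64_t flags = 0;

                /* Query the reset/guilty state of this context. */
                amdgpu_cs_query_reset_state2(*ctx, &flags);
                if (flags & AMDGPU_CTX_QUERY2_FLAGS_GUILTY) {
                        /* Drop the poisoned context instead of looping on it. */
                        amdgpu_cs_ctx_free(*ctx);
                        return amdgpu_cs_ctx_create(dev, ctx);
                }
        }
        return r;
}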
Otherwise we are determining this timeout based on a time before we go into some unrelated spinlock, which is bad.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5505d646f43a..57c94901ed0a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -439,8 +439,6 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         if (unlikely(ring->adev->debug_disable_soft_recovery))
                 return false;
 
-        deadline = ktime_add_us(ktime_get(), 10000);
-
         if (amdgpu_sriov_vf(ring->adev) || !ring->funcs->soft_recovery || !fence)
                 return false;
 
@@ -450,6 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         spin_unlock_irqrestore(fence->lock, flags);
 
         atomic_inc(&ring->adev->gpu_reset_counter);
+        deadline = ktime_add_us(ktime_get(), 10000);
         while (!dma_fence_is_signaled(fence) &&
                ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                 ring->funcs->soft_recovery(ring, vmid);
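To make the intent concrete, here is a minimal self-contained sketch (simplified, not the driver code; kick() stands in for ring->funcs->soft_recovery()) of the pattern this patch moves to, where the timeout budget starts right before the retry loop rather than before the fence->lock section above.

#include <linux/dma-fence.h>
#include <linux/ktime.h>

static bool kick_until_deadline(struct dma_fence *fence,
                                void (*kick)(void *data), void *data,
                                u64 budget_us)
{
        /* Start the budget here, immediately before the loop. */
        ktime_t deadline = ktime_add_us(ktime_get(), budget_us);

        while (!dma_fence_is_signaled(fence) &&
               ktime_before(ktime_get(), deadline))
                kick(data);     /* e.g. ring->funcs->soft_recovery(ring, vmid) */

        return dma_fence_is_signaled(fence);
}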
On 07.03.24 at 20:04, Joshua Ashton wrote:
Otherwise we are determining this timeout based on a time before we go into some unrelated spinlock, which is bad.
Actually, I don't think that this is a good idea.
The spinlock is the fence processing lock, so when fence processing is blocking this with activity, it is perfectly valid and desirable that the timeout is decreased.
Regards, Christian.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5505d646f43a..57c94901ed0a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -439,8 +439,6 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         if (unlikely(ring->adev->debug_disable_soft_recovery))
                 return false;
 
-        deadline = ktime_add_us(ktime_get(), 10000);
-
         if (amdgpu_sriov_vf(ring->adev) || !ring->funcs->soft_recovery || !fence)
                 return false;
 
@@ -450,6 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         spin_unlock_irqrestore(fence->lock, flags);
 
         atomic_inc(&ring->adev->gpu_reset_counter);
+        deadline = ktime_add_us(ktime_get(), 10000);
         while (!dma_fence_is_signaled(fence) &&
                ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                 ring->funcs->soft_recovery(ring, vmid);
Results in much more reliable soft recovery on Steam Deck.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 57c94901ed0a..be99db0e077e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         spin_unlock_irqrestore(fence->lock, flags);
 
         atomic_inc(&ring->adev->gpu_reset_counter);
-        deadline = ktime_add_us(ktime_get(), 10000);
+        deadline = ktime_add_ms(ktime_get(), 500);
         while (!dma_fence_is_signaled(fence) &&
                ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                 ring->funcs->soft_recovery(ring, vmid);
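For scale (not part of the patch itself): this is not just a unit swap, it widens the soft-recovery window from 10ms to 500ms, a 50x increase, which is what the discussion below is about.

#include <linux/ktime.h>

/* Fragment for illustration only, assuming <linux/ktime.h>. */
static inline void soft_recovery_deadline_scale(void)
{
        ktime_t old_deadline = ktime_add_us(ktime_get(), 10000); /* 10,000us = 10ms */
        ktime_t new_deadline = ktime_add_ms(ktime_get(), 500);   /* 500ms = 500,000us */

        (void)old_deadline;
        (void)new_deadline;
}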
On 07.03.24 at 20:04, Joshua Ashton wrote:
Results in much more reliable soft recovery on Steam Deck.
Waiting 500ms for a locked-up shader is way too long, I think. We could increase the 10ms to something like 20ms, but I really wouldn't go much over that.
This here just kills shaders which are in an endless loop; when that takes longer than 10-20ms we really have a hardware problem which needs a full reset to resolve.
Regards, Christian.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 57c94901ed0a..be99db0e077e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         spin_unlock_irqrestore(fence->lock, flags);
 
         atomic_inc(&ring->adev->gpu_reset_counter);
-        deadline = ktime_add_us(ktime_get(), 10000);
+        deadline = ktime_add_ms(ktime_get(), 500);
         while (!dma_fence_is_signaled(fence) &&
                ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                 ring->funcs->soft_recovery(ring, vmid);
It definitely takes much longer than 10-20ms in some instances.
Some of these instances can even be shown in Friedrich's hang test suite -- specifically when there are a lot of page faults going on.
The work (or parts of the work) could also be pending and not in any wave yet, just hanging out in the ring. There may be a better solution to that, but I don't know it.
Raising it to .5s still makes sense to me.
- Joshie 🐸✨
On 3/8/24 08:29, Christian König wrote:
On 07.03.24 at 20:04, Joshua Ashton wrote:
Results in much more reliable soft recovery on Steam Deck.
Waiting 500ms for a locked-up shader is way too long, I think. We could increase the 10ms to something like 20ms, but I really wouldn't go much over that.
This here just kills shaders which are in an endless loop; when that takes longer than 10-20ms we really have a hardware problem which needs a full reset to resolve.
Regards, Christian.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 57c94901ed0a..be99db0e077e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         spin_unlock_irqrestore(fence->lock, flags);
 
         atomic_inc(&ring->adev->gpu_reset_counter);
-        deadline = ktime_add_us(ktime_get(), 10000);
+        deadline = ktime_add_ms(ktime_get(), 500);
         while (!dma_fence_is_signaled(fence) &&
                ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                 ring->funcs->soft_recovery(ring, vmid);
On 08.03.24 at 23:31, Joshua Ashton wrote:
It definitely takes much longer than 10-20ms in some instances.
Some of these instances can even be shown in Friedrich's hang test suite -- specifically when there are a lot of page faults going on.
Exactly, that's the part I want to avoid. The context-based recovery is there to break out of shaders with endless loops.
When there are page faults going on, I would rather recommend a hard reset of the GPU.
The work (or parts of the work) could also be pending and not in any wave yet, just hanging out in the ring. There may be a better solution to that, but I don't know it.
Yeah, but killing any of that should never take longer than what the original submission was supposed to take.
In other words, if we assume that we should have at least 20fps, then we should never go over 50ms. And even at this point we have already waited much longer than that for the shader to complete.
If you really want to raise it that high, I would rather say make it configurable.
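For illustration, such a knob could be as small as a module parameter. This is a hypothetical sketch only; the name and default below are made up, and amdgpu has no such parameter today.

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Hypothetical knob, not an existing amdgpu parameter. */
static uint soft_recovery_timeout_ms = 10;
module_param(soft_recovery_timeout_ms, uint, 0644);
MODULE_PARM_DESC(soft_recovery_timeout_ms,
                 "Soft recovery deadline in milliseconds (default 10)");

/* amdgpu_ring_soft_recovery() would then compute:
 *         deadline = ktime_add_ms(ktime_get(), soft_recovery_timeout_ms);
 */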
Regards, Christian.
Raising it to .5s still makes sense to me.
- Joshie 🐸✨
On 3/8/24 08:29, Christian König wrote:
On 07.03.24 at 20:04, Joshua Ashton wrote:
Results in much more reliable soft recovery on Steam Deck.
Waiting 500ms for a locked-up shader is way too long, I think. We could increase the 10ms to something like 20ms, but I really wouldn't go much over that.
This here just kills shaders which are in an endless loop; when that takes longer than 10-20ms we really have a hardware problem which needs a full reset to resolve.
Regards, Christian.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 57c94901ed0a..be99db0e077e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         spin_unlock_irqrestore(fence->lock, flags);
 
         atomic_inc(&ring->adev->gpu_reset_counter);
-        deadline = ktime_add_us(ktime_get(), 10000);
+        deadline = ktime_add_ms(ktime_get(), 500);
         while (!dma_fence_is_signaled(fence) &&
                ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                 ring->funcs->soft_recovery(ring, vmid);
On 07.03.24 at 20:04, Joshua Ashton wrote:
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
Marek, you have been in favor of this like forever. So I would like to ask you to put your Reviewed-by on it, and I will just push it into our internal kernel branch.
Regards, Christian.
1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/

Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
         struct dma_fence *fence = NULL;
         int r;
 
-        /* Ignore soft recovered fences here */
         r = drm_sched_entity_error(s_entity);
-        if (r && r != -ENODATA)
+        if (r)
                 goto error;
 
         if (!fence && job->gang_submit)
Reviewed-by: Marek Olšák marek.olsak@amd.com
Marek
On Fri, Mar 8, 2024 at 3:43 AM Christian König christian.koenig@amd.com wrote:
On 07.03.24 at 20:04, Joshua Ashton wrote:
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
Marek, you have been in favor of this like forever. So I would like to ask you to put your Reviewed-by on it, and I will just push it into our internal kernel branch.
Regards, Christian.
1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/

Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
         struct dma_fence *fence = NULL;
         int r;
 
-        /* Ignore soft recovered fences here */
         r = drm_sched_entity_error(s_entity);
-        if (r && r != -ENODATA)
+        if (r)
                 goto error;
 
         if (!fence && job->gang_submit)
Hi,
I just now came across an issue again where soft recovery fails to get reported to userspace properly, causing apps to submit hanging work in a loop (which ended up hanging the entire machine). It seems like this patch never made it into amd-staging-drm-next. Given that it has a Reviewed-by and everything, was this just an oversight, or are there some blockers to pushing it that I missed?
If not, I'd be grateful if the patch could get merged.
Thanks, Friedrich
On 08.03.24 09:33, Christian König wrote:
On 07.03.24 at 20:04, Joshua Ashton wrote:
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
Marek, you have been in favor of this like forever. So I would like to ask you to put your Reviewed-by on it, and I will just push it into our internal kernel branch.
Regards, Christian.
1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/

Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
         struct dma_fence *fence = NULL;
         int r;
 
-        /* Ignore soft recovered fences here */
         r = drm_sched_entity_error(s_entity);
-        if (r && r != -ENODATA)
+        if (r)
                 goto error;
 
         if (!fence && job->gang_submit)
On 01.08.24 at 17:17, Friedrich Vock wrote:
Hi,
I just now came across an issue again where soft recovery fails to get reported to userspace properly, causing apps to submit hanging work in a loop (which ended up hanging the entire machine). It seems like this patch never made it into amd-staging-drm-next. Given that it has a Reviewed-by and everything, was this just an oversight, or are there some blockers to pushing it that I missed?
If not, I'd be grateful if the patch could get merged.
Sorry, that was my fault; I had forgotten about it because Alex usually picks up stuff for amd-staging-drm-next.
Thanks for the reminder, just pushed it.
Regards, Christian.
Thanks, Friedrich
On 08.03.24 09:33, Christian König wrote:
On 07.03.24 at 20:04, Joshua Ashton wrote:
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
Marek, you have been in favor of this like forever. So I would like to ask you to put your Reviewed-by on it, and I will just push it into our internal kernel branch.
Regards, Christian.
1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
         struct dma_fence *fence = NULL;
         int r;
 
-        /* Ignore soft recovered fences here */
         r = drm_sched_entity_error(s_entity);
-        if (r && r != -ENODATA)
+        if (r)
                 goto error;
 
         if (!fence && job->gang_submit)