As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/

Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
         struct dma_fence *fence = NULL;
         int r;
 
-        /* Ignore soft recovered fences here */
         r = drm_sched_entity_error(s_entity);
-        if (r && r != -ENODATA)
+        if (r)
                 goto error;
 
         if (!fence && job->gang_submit)
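For context (illustrative only, not part of this patch): once the error is forwarded, the submitting process can notice that its context is poisoned and recreate it instead of resubmitting the same hanging work until the hard reset kicks in. Below is a rough userspace-side sketch using libdrm's amdgpu wrappers; the exact errno returned for a poisoned context depends on the kernel path, so only the general shape matters here.

#include <stdint.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

/* Hedged sketch, not Mesa/RADV code: submit, and if the kernel rejects the
 * submission, ask whether this context was flagged guilty and recreate it. */
static int submit_or_recreate(amdgpu_device_handle dev,
                              amdgpu_context_handle *ctx,
                              struct amdgpu_cs_request *req)
{
        int r = amdgpu_cs_submit(*ctx, 0, req, 1);

        if (r) {
                uint64_t flags = 0;

                /* Query the reset/guilty state of this context. */
                amdgpu_cs_query_reset_state2(*ctx, &flags);
                if (flags & AMDGPU_CTX_QUERY2_FLAGS_GUILTY) {
                        /* Drop the poisoned context instead of looping on it. */
                        amdgpu_cs_ctx_free(*ctx);
                        return amdgpu_cs_ctx_create(dev, ctx);
                }
        }
        return r;
}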
Otherwise we are determining this timeout based on a time before we go into some unrelated spinlock, which is bad.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5505d646f43a..57c94901ed0a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -439,8 +439,6 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         if (unlikely(ring->adev->debug_disable_soft_recovery))
                 return false;
 
-        deadline = ktime_add_us(ktime_get(), 10000);
-
         if (amdgpu_sriov_vf(ring->adev) || !ring->funcs->soft_recovery || !fence)
                 return false;
 
@@ -450,6 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         spin_unlock_irqrestore(fence->lock, flags);
 
         atomic_inc(&ring->adev->gpu_reset_counter);
+        deadline = ktime_add_us(ktime_get(), 10000);
         while (!dma_fence_is_signaled(fence) &&
                ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                 ring->funcs->soft_recovery(ring, vmid);
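To make the intent concrete, here is a minimal self-contained sketch (simplified, not the driver code; kick() stands in for ring->funcs->soft_recovery()) of the pattern this patch moves to, where the timeout budget starts right before the retry loop rather than before the fence->lock section above.

#include <linux/dma-fence.h>
#include <linux/ktime.h>

static bool kick_until_deadline(struct dma_fence *fence,
                                void (*kick)(void *data), void *data,
                                u64 budget_us)
{
        /* Start the budget here, immediately before the loop. */
        ktime_t deadline = ktime_add_us(ktime_get(), budget_us);

        while (!dma_fence_is_signaled(fence) &&
               ktime_before(ktime_get(), deadline))
                kick(data);     /* e.g. ring->funcs->soft_recovery(ring, vmid) */

        return dma_fence_is_signaled(fence);
}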
On 07.03.24 at 20:04, Joshua Ashton wrote:
Otherwise we are determining this timeout based on a time before we go into some unrelated spinlock, which is bad.
Actually, I don't think that this is a good idea.
The spinlock is the fence processing lock, so when fence processing is blocking this with activity, it is perfectly valid and desirable that the timeout is decreased.
Regards, Christian.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5505d646f43a..57c94901ed0a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -439,8 +439,6 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         if (unlikely(ring->adev->debug_disable_soft_recovery))
                 return false;
 
-        deadline = ktime_add_us(ktime_get(), 10000);
-
         if (amdgpu_sriov_vf(ring->adev) || !ring->funcs->soft_recovery || !fence)
                 return false;
 
@@ -450,6 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         spin_unlock_irqrestore(fence->lock, flags);
 
         atomic_inc(&ring->adev->gpu_reset_counter);
+        deadline = ktime_add_us(ktime_get(), 10000);
         while (!dma_fence_is_signaled(fence) &&
                ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                 ring->funcs->soft_recovery(ring, vmid);
Results in much more reliable soft recovery on Steam Deck.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 57c94901ed0a..be99db0e077e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         spin_unlock_irqrestore(fence->lock, flags);
 
         atomic_inc(&ring->adev->gpu_reset_counter);
-        deadline = ktime_add_us(ktime_get(), 10000);
+        deadline = ktime_add_ms(ktime_get(), 500);
         while (!dma_fence_is_signaled(fence) &&
                ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                 ring->funcs->soft_recovery(ring, vmid);
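For scale (not part of the patch itself): this is not just a unit swap, it widens the soft-recovery window from 10ms to 500ms, a 50x increase, which is what the discussion below is about.

#include <linux/ktime.h>

/* Fragment for illustration only, assuming <linux/ktime.h>. */
static inline void soft_recovery_deadline_scale(void)
{
        ktime_t old_deadline = ktime_add_us(ktime_get(), 10000); /* 10,000us = 10ms */
        ktime_t new_deadline = ktime_add_ms(ktime_get(), 500);   /* 500ms = 500,000us */

        (void)old_deadline;
        (void)new_deadline;
}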
On 07.03.24 at 20:04, Joshua Ashton wrote:
Results in much more reliable soft recovery on Steam Deck.
Waiting 500ms for a locked-up shader is way too long, I think. We could increase the 10ms to something like 20ms, but I really wouldn't go much over that.
This here just kills shaders which are in an endless loop; when that takes longer than 10-20ms we really have a hardware problem which needs a full reset to resolve.
Regards, Christian.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 57c94901ed0a..be99db0e077e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         spin_unlock_irqrestore(fence->lock, flags);
 
         atomic_inc(&ring->adev->gpu_reset_counter);
-        deadline = ktime_add_us(ktime_get(), 10000);
+        deadline = ktime_add_ms(ktime_get(), 500);
         while (!dma_fence_is_signaled(fence) &&
                ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                 ring->funcs->soft_recovery(ring, vmid);
It definitely takes much longer than 10-20ms in some instances.
Some of these instances can even be shown in Friedrich's hang test suite -- specifically when there are a lot of page faults going on.
The work (or parts of the work) could also be pending and not in any wave yet, just hanging out in the ring. There may be a better solution to that, but I don't know it.
Raising it to .5s still makes sense to me.
- Joshie 🐸✨
On 3/8/24 08:29, Christian König wrote:
On 07.03.24 at 20:04, Joshua Ashton wrote:
Results in much more reliable soft recovery on Steam Deck.
Waiting 500ms for a locked-up shader is way too long, I think. We could increase the 10ms to something like 20ms, but I really wouldn't go much over that.
This here just kills shaders which are in an endless loop; when that takes longer than 10-20ms we really have a hardware problem which needs a full reset to resolve.
Regards, Christian.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 57c94901ed0a..be99db0e077e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         spin_unlock_irqrestore(fence->lock, flags);
 
         atomic_inc(&ring->adev->gpu_reset_counter);
-        deadline = ktime_add_us(ktime_get(), 10000);
+        deadline = ktime_add_ms(ktime_get(), 500);
         while (!dma_fence_is_signaled(fence) &&
                ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                 ring->funcs->soft_recovery(ring, vmid);
On 08.03.24 at 23:31, Joshua Ashton wrote:
It definitely takes much longer than 10-20ms in some instances.
Some of these instances can even be shown in Friedrich's hang test suite -- specifically when there are a lot of page faults going on.
Exactly, that's the part I want to avoid. The context-based recovery is there to break out of shaders with endless loops.
When there are page faults going on, I would rather recommend a hard reset of the GPU.
The work (or parts of the work) could also be pending and not in any wave yet, just hanging out in the ring. There may be a better solution to that, but I don't know it.
Yeah, but killing any of that should never take longer than what the original submission was supposed to take.
In other words, if we assume that we should have at least 20fps, then we should never go over 50ms. And even at this point we have already waited much longer than that for the shader to complete.
If you really want to raise it that high, I would rather say make it configurable.
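For illustration, such a knob could be as small as a module parameter. This is a hypothetical sketch only; the name and default below are made up, and amdgpu has no such parameter today.

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Hypothetical knob, not an existing amdgpu parameter. */
static uint soft_recovery_timeout_ms = 10;
module_param(soft_recovery_timeout_ms, uint, 0644);
MODULE_PARM_DESC(soft_recovery_timeout_ms,
                 "Soft recovery deadline in milliseconds (default 10)");

/* amdgpu_ring_soft_recovery() would then compute:
 *         deadline = ktime_add_ms(ktime_get(), soft_recovery_timeout_ms);
 */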
Regards, Christian.
Raising it to .5s still makes sense to me.
- Joshie 🐸✨
On 3/8/24 08:29, Christian König wrote:
On 07.03.24 at 20:04, Joshua Ashton wrote:
Results in much more reliable soft recovery on Steam Deck.
Waiting 500ms for a locked-up shader is way too long, I think. We could increase the 10ms to something like 20ms, but I really wouldn't go much over that.
This here just kills shaders which are in an endless loop; when that takes longer than 10-20ms we really have a hardware problem which needs a full reset to resolve.
Regards, Christian.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 57c94901ed0a..be99db0e077e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,7 +448,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
         spin_unlock_irqrestore(fence->lock, flags);
 
         atomic_inc(&ring->adev->gpu_reset_counter);
-        deadline = ktime_add_us(ktime_get(), 10000);
+        deadline = ktime_add_ms(ktime_get(), 500);
         while (!dma_fence_is_signaled(fence) &&
                ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                 ring->funcs->soft_recovery(ring, vmid);
On 07.03.24 at 20:04, Joshua Ashton wrote:
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
Marek, you have been in favor of this like forever. So I would like to ask you to put your Reviewed-by on it, and I will just push it into our internal kernel branch.
Regards, Christian.
1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/

Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
         struct dma_fence *fence = NULL;
         int r;
 
-        /* Ignore soft recovered fences here */
         r = drm_sched_entity_error(s_entity);
-        if (r && r != -ENODATA)
+        if (r)
                 goto error;
 
         if (!fence && job->gang_submit)
Reviewed-by: Marek Olšák marek.olsak@amd.com
Marek
On Fri, Mar 8, 2024 at 3:43 AM Christian König christian.koenig@amd.com wrote:
On 07.03.24 at 20:04, Joshua Ashton wrote:
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
Marek, you have been in favor of this like forever. So I would like to ask you to put your Reviewed-by on it, and I will just push it into our internal kernel branch.
Regards, Christian.
1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/

Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
         struct dma_fence *fence = NULL;
         int r;
 
-        /* Ignore soft recovered fences here */
         r = drm_sched_entity_error(s_entity);
-        if (r && r != -ENODATA)
+        if (r)
                 goto error;
 
         if (!fence && job->gang_submit)
Hi,
I just now came across an issue again where soft recovery fails to get reported to userspace properly, causing apps to submit hanging work in a loop (which ended up hanging the entire machine). It seems like this patch never made it into amd-staging-drm-next. Given that it has a Reviewed-by and everything, was this just an oversight, or are there some blockers to pushing it that I missed?
If not, I'd be grateful if the patch could get merged.
Thanks, Friedrich
On 08.03.24 09:33, Christian König wrote:
On 07.03.24 at 20:04, Joshua Ashton wrote:
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
Marek, you have been in favor of this like forever. So I would like to ask you to put your Reviewed-by on it, and I will just push it into our internal kernel branch.
Regards, Christian.
1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/

Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
         struct dma_fence *fence = NULL;
         int r;
 
-        /* Ignore soft recovered fences here */
         r = drm_sched_entity_error(s_entity);
-        if (r && r != -ENODATA)
+        if (r)
                 goto error;
 
         if (!fence && job->gang_submit)
On 01.08.24 at 17:17, Friedrich Vock wrote:
Hi,
I just now came across an issue again where soft recovery fails to get reported to userspace properly, causing apps to submit hanging work in a loop (which ended up hanging the entire machine). It seems like this patch never made it into amd-staging-drm-next. Given that it has a Reviewed-by and everything, was this just an oversight, or are there some blockers to pushing it that I missed?
If not, I'd be grateful if the patch could get merged.
Sorry, that was my fault; I had forgotten about it because Alex usually picks up stuff for amd-staging-drm-next.
Thanks for the reminder, just pushed it.
Regards, Christian.
Thanks, Friedrich
On 08.03.24 09:33, Christian König wrote:
On 07.03.24 at 20:04, Joshua Ashton wrote:
As we discussed before[1], soft recovery should be forwarded to userspace, or we can get into a really bad state where apps will keep submitting hanging command buffers cascading us to a hard reset.
Marek, you have been in favor of this like forever. So I would like to ask you to put your Reviewed-by on it, and I will just push it into our internal kernel branch.
Regards, Christian.
1: https://lore.kernel.org/all/bf23d5ed-9a6b-43e7-84ee-8cbfd0d60f18@froggi.es/
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4b3000c21ef2..aebf59855e9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -262,9 +262,8 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
         struct dma_fence *fence = NULL;
         int r;
 
-        /* Ignore soft recovered fences here */
         r = drm_sched_entity_error(s_entity);
-        if (r && r != -ENODATA)
+        if (r)
                 goto error;
 
         if (!fence && job->gang_submit)