We will need this to change the karma in the future.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 9 ++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 3 +--
 3 files changed, 6 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 78476bc75b4e..c1af7ca25912 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -52,7 +52,7 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 	adev->job_hang = true;
 
 	if (amdgpu_gpu_recovery &&
-	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
+	    amdgpu_ring_soft_recovery(ring, job)) {
 		DRM_ERROR("ring %s timeout, but soft recovered\n",
 			  s_job->sched->name);
 		goto exit;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 45424ebf9681..25209ce54552 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -425,14 +425,13 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
  * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
  *
  * @ring: ring to try the recovery on
- * @vmid: VMID we try to get going again
- * @fence: timedout fence
+ * @job: the locked-up job
  *
  * Tries to get a ring proceeding again when it is stuck.
  */
-bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
-			       struct dma_fence *fence)
+bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
 {
+	struct dma_fence *fence = job->base.s_fence->parent;
 	unsigned long flags;
 	ktime_t deadline;
 
@@ -452,7 +451,7 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
 	atomic_inc(&ring->adev->gpu_reset_counter);
 	while (!dma_fence_is_signaled(fence) &&
 	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
-		ring->funcs->soft_recovery(ring, vmid);
+		ring->funcs->soft_recovery(ring, job->vmid);
 
 	return dma_fence_is_signaled(fence);
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index bbb53720a018..734df88f22d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -354,8 +354,7 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
 void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
 						uint32_t reg0, uint32_t val0,
 						uint32_t reg1, uint32_t val1);
-bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
-			       struct dma_fence *fence);
+bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job);
 
 static inline void amdgpu_ring_set_preempt_cond_exec(struct amdgpu_ring *ring,
 						      bool cond_exec)
We need to bump the karma of the drm_sched job in order for the context that we just recovered to get correct feedback that it is guilty of hanging.
Without this feedback, the application may keep pushing through the soft recoveries, continually hanging the system with jobs that timeout.
There is an accompanying Mesa/RADV patch here https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 to properly handle device loss state when VRAM is not lost.
With these, I was able to run Counter-Strike 2 and launch an application which can fault the GPU in a variety of ways, and still have Steam + Counter-Strike 2 + Gamescope (compositor) stay up and continue functioning on Steam Deck.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
 	dma_fence_set_error(fence, -ENODATA);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	if (job->vm)
+		drm_sched_increase_karma(&job->base);
 	atomic_inc(&ring->adev->gpu_reset_counter);
 	while (!dma_fence_is_signaled(fence) &&
 	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
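As a reference for how userspace would consume this guilt feedback (a minimal sketch, not the actual RADV change from the MR above; it assumes libdrm_amdgpu's amdgpu_cs_query_reset_state2() and the existing AMDGPU_CTX_QUERY2_* flags):

#include <stdbool.h>
#include <stdint.h>
#include <amdgpu.h>      /* libdrm_amdgpu */
#include <amdgpu_drm.h>  /* AMDGPU_CTX_QUERY2_* flags */

/* Sketch only: decide whether a context has to report VK_ERROR_DEVICE_LOST.
 * The real policy lives in the Mesa/RADV MR linked above. */
static bool context_should_report_device_lost(amdgpu_context_handle ctx)
{
	uint64_t flags = 0;

	if (amdgpu_cs_query_reset_state2(ctx, &flags))
		return true;	/* query failed, assume the worst */

	if (flags & AMDGPU_CTX_QUERY2_FLAGS_GUILTY)
		return true;	/* this context caused the hang */
	if (flags & AMDGPU_CTX_QUERY2_FLAGS_VRAMLOST)
		return true;	/* buffer contents may be gone */

	/* AMDGPU_CTX_QUERY2_FLAGS_RESET alone means some other context hung
	 * and was recovered; this one was not affected, so stay transparent. */
	return false;
}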
On 13.01.24 15:02, Joshua Ashton wrote:
We need to bump the karma of the drm_sched job in order for the context that we just recovered to get correct feedback that it is guilty of hanging.
Without this feedback, the application may keep pushing through the soft recoveries, continually hanging the system with jobs that timeout.
There is an accompanying Mesa/RADV patch here https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 to properly handle device loss state when VRAM is not lost.
With these, I was able to run Counter-Strike 2 and launch an application which can fault the GPU in a variety of ways, and still have Steam + Counter-Strike 2 + Gamescope (compositor) stay up and continue functioning on Steam Deck.
Signed-off-by: Joshua Ashton joshua@froggi.es
Tested-by: Friedrich Vock friedrich.vock@gmx.de
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
 	dma_fence_set_error(fence, -ENODATA);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	if (job->vm)
+		drm_sched_increase_karma(&job->base);
 	atomic_inc(&ring->adev->gpu_reset_counter);
 	while (!dma_fence_is_signaled(fence) &&
 	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
Hi Joshua,
Em 13/01/2024 11:02, Joshua Ashton escreveu:
We need to bump the karma of the drm_sched job in order for the context that we just recovered to get correct feedback that it is guilty of hanging.
Without this feedback, the application may keep pushing through the soft recoveries, continually hanging the system with jobs that timeout.
There is an accompanying Mesa/RADV patch here https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 to properly handle device loss state when VRAM is not lost.
With these, I was able to run Counter-Strike 2 and launch an application which can fault the GPU in a variety of ways, and still have Steam + Counter-Strike 2 + Gamescope (compositor) stay up and continue functioning on Steam Deck.
I sent a similar patch in the past, maybe you find the discussion interesting:
https://lore.kernel.org/lkml/20230424014324.218531-1-andrealmeid@igalia.com/
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
 	dma_fence_set_error(fence, -ENODATA);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	if (job->vm)
+		drm_sched_increase_karma(&job->base);
 	atomic_inc(&ring->adev->gpu_reset_counter);
 	while (!dma_fence_is_signaled(fence) &&
 	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
+Marek
On 1/13/24 21:35, André Almeida wrote:
Hi Joshua,
Em 13/01/2024 11:02, Joshua Ashton escreveu:
We need to bump the karma of the drm_sched job in order for the context that we just recovered to get correct feedback that it is guilty of hanging.
Without this feedback, the application may keep pushing through the soft recoveries, continually hanging the system with jobs that timeout.
There is an accompanying Mesa/RADV patch here https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 to properly handle device loss state when VRAM is not lost.
With these, I was able to run Counter-Strike 2 and launch an application which can fault the GPU in a variety of ways, and still have Steam + Counter-Strike 2 + Gamescope (compositor) stay up and continue functioning on Steam Deck.
I sent a similar patch in the past, maybe you find the discussion interesting:
https://lore.kernel.org/lkml/20230424014324.218531-1-andrealmeid@igalia.com/
Thanks, I had a peruse through that old thread.
Marek definitely had the right idea here, given he mentions: "That supposedly depends on the compositor. There may be compositors for very specific cases (e.g. Steam Deck)"
That is exactly what I work on and what I wrote this patch for, and it does basically the same thing as was proposed. :-)
For context though, I am less interested in Gamescope (the Steam Deck compositor) hanging (we don't have code that hangs, if we go down, it's likely Steam/CEF died with us anyway atm, can solve that battle some other day) and more about the applications run under it.
Marek is very right when he says applications that fault/hang will submit one IB after another that also fault/hang -- especially if they write to descriptors from the GPU (descriptor buffers), or use draw indirect or anything bindless or... That's basically functionally equivalent to DOSing a system if it is not prevented.
And that's exactly what I see even in a simple test app doing a fault -> hang every frame.
Right now, given that soft recovery never marks a context as guilty, every app I tested was never stopped from submitting garbage. That holds true for basically any app that hangs the GPU, and it makes soft recovery totally useless in my testing without this.
(That being said, without my patches, RADV treats *any* reset from the query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it was not guilty, so any faulting/hanging application causes every Vulkan app to die e_e. This is fixed in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )
- Joshie 🐸✨
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
 	dma_fence_set_error(fence, -ENODATA);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	if (job->vm)
+		drm_sched_increase_karma(&job->base);
 	atomic_inc(&ring->adev->gpu_reset_counter);
 	while (!dma_fence_is_signaled(fence) &&
 	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
Am 13.01.24 um 23:55 schrieb Joshua Ashton:
+Marek
On 1/13/24 21:35, André Almeida wrote:
Hi Joshua,
Em 13/01/2024 11:02, Joshua Ashton escreveu:
We need to bump the karma of the drm_sched job in order for the context that we just recovered to get correct feedback that it is guilty of hanging.
Without this feedback, the application may keep pushing through the soft recoveries, continually hanging the system with jobs that timeout.
There is an accompanying Mesa/RADV patch here https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 to properly handle device loss state when VRAM is not lost.
With these, I was able to run Counter-Strike 2 and launch an application which can fault the GPU in a variety of ways, and still have Steam + Counter-Strike 2 + Gamescope (compositor) stay up and continue functioning on Steam Deck.
I sent a similar patch in the past, maybe you find the discussion interesting:
https://lore.kernel.org/lkml/20230424014324.218531-1-andrealmeid@igalia.com/
Thanks, I had a peruse through that old thread.
Marek definitely had the right idea here, given he mentions: "That supposedly depends on the compositor. There may be compositors for very specific cases (e.g. Steam Deck)"
That is exactly what I work on and what I wrote this patch for, and it does basically the same thing as was proposed. :-)
For context though, I am less interested in Gamescope (the Steam Deck compositor) hanging (we don't have code that hangs, if we go down, it's likely Steam/CEF died with us anyway atm, can solve that battle some other day) and more about the applications run under it.
Marek is very right when he says applications that fault/hang will submit one IB after another that also fault/hang -- especially if they write to descriptors from the GPU (descriptor buffers), or use draw indirect or anything bindless or... That's basically functionally equivalent to DOSing a system if it is not prevented.
And that's exactly what I see even in a simple test app doing a fault -> hang every frame.
Right now, given that soft recovery never marks a context as guilty, every app I tested was never stopped from submitting garbage. That holds true for basically any app that hangs the GPU, and it makes soft recovery totally useless in my testing without this.
Yeah, the problem is that your patch wouldn't help with that. A testing app can still re-create the context for each submission and so crash the system over and over again.
The question here is really if we should handle soft recovered errors as fatal or not. Marek is in favor of that, Michel is against it.
Figure out what you want in userspace and I'm happy to implement it :)
(That being said, without my patches, RADV treats *any* reset from the query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it was not guilty, so any faulting/hanging application causes every Vulkan app to die e_e. This is fixed in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )
That is actually intended behavior. When something disrupted the GPU execution and the application is affected it is mandatory to forward this error to the application.
Regards, Christian.
- Joshie 🐸✨
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
 	dma_fence_set_error(fence, -ENODATA);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	if (job->vm)
+		drm_sched_increase_karma(&job->base);
 	atomic_inc(&ring->adev->gpu_reset_counter);
 	while (!dma_fence_is_signaled(fence) &&
 	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
On 1/15/24 09:47, Christian König wrote:
Am 13.01.24 um 23:55 schrieb Joshua Ashton:
+Marek
On 1/13/24 21:35, André Almeida wrote:
Hi Joshua,
Em 13/01/2024 11:02, Joshua Ashton escreveu:
We need to bump the karma of the drm_sched job in order for the context that we just recovered to get correct feedback that it is guilty of hanging.
Without this feedback, the application may keep pushing through the soft recoveries, continually hanging the system with jobs that timeout.
There is an accompanying Mesa/RADV patch here https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 to properly handle device loss state when VRAM is not lost.
With these, I was able to run Counter-Strike 2 and launch an application which can fault the GPU in a variety of ways, and still have Steam + Counter-Strike 2 + Gamescope (compositor) stay up and continue functioning on Steam Deck.
I sent a similar patch in the past, maybe you find the discussion interesting:
https://lore.kernel.org/lkml/20230424014324.218531-1-andrealmeid@igalia.com/
Thanks, I had a peruse through that old thread.
Marek definitely had the right idea here, given he mentions: "That supposedly depends on the compositor. There may be compositors for very specific cases (e.g. Steam Deck)"
That is exactly what I work on and what I wrote this patch for, and it does basically the same thing as was proposed. :-)
For context though, I am less interested in Gamescope (the Steam Deck compositor) hanging (we don't have code that hangs, if we go down, it's likely Steam/CEF died with us anyway atm, can solve that battle some other day) and more about the applications run under it.
Marek is very right when he says applications that fault/hang will submit one IB after another that also fault/hang -- especially if they write to descriptors from the GPU (descriptor buffers), or use draw indirect or anything bindless or... That's basically functionally equivalent to DOSing a system if it is not prevented.
And that's exactly what I see even in a simple test app doing a fault -> hang every frame.
Right now, given that soft recovery never marks a context as guilty, every app I tested was never stopped from submitting garbage. That holds true for basically any app that hangs the GPU, and it makes soft recovery totally useless in my testing without this.
Yeah, the problem is that your patch wouldn't help with that. A testing app can still re-create the context for each submission and so crash the system over and over again.
It is still definitely possible for an application to re-create its context and hang yet again -- however, that is not the problem I am trying to solve here.
The problem I am trying to solve is that applications do not even get marked guilty when triggering soft recovery right now.
The patch does help with that on SteamOS, as the type of applications we deal with that hang, just abort on VK_ERROR_DEVICE_LOST.
If a UI toolkit that handles DEVICE_LOST keeps doing this, then yes it would not help, but this patch is also a necessary step towards fixing that someday. (Eg. some policy where processes are killed for hanging too many times, etc)
The question here is really if we should handle soft recovered errors as fatal or not. Marek is in favor of that, Michel is against it.
Figure out what you want in userspace and I'm happy to implement it :)
(That being said, without my patches, RADV treats *any* reset from the query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it was not guilty, so any faulting/hanging application causes every Vulkan app to die e_e. This is fixed in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )
That is actually intended behavior. When something disrupted the GPU execution and the application is affected it is mandatory to forward this error to the application.
No. If said context was entirely unaffected, then it should be completely transparent to the application.
Consider the following:
- I have Counter-Strike 2 running - I have Gamescope running
I then go ahead and start HangApp that hangs the GPU.
Soft recovery happens and that clears out all the work for the specific VMID for HangApp's submissions and signals the submission fence.
In this instance, the Gamescope and Counter-Strike 2 ctxs are completely unaffected and don't need to report VK_ERROR_DEVICE_LOST as there was no impact to their work.
Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem with HangApp, FWIU the way that the clear-out works being vmid specific means that they would be unaffected, right?
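(My understanding, paraphrased from memory of the gfx9 callback, is that the soft-recovery hook only issues a VMID-scoped wave kill, roughly:)

/* Paraphrased from memory of gfx_v9_0_ring_soft_recovery() in gfx_v9_0.c;
 * field names may be slightly off, but the point is that the wave kill is
 * scoped to the hanging job's VMID, so work from other VMIDs keeps running. */
static void gfx_v9_0_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid)
{
	uint32_t value = 0;

	value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);	/* kill waves */
	value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);	/* broadcast */
	value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
	value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);
	WREG32_SOC15(GC, 0, mmSQ_CMD, value);
}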
- Joshie 🐸✨
Regards, Christian.
- Joshie 🐸✨
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
 	dma_fence_set_error(fence, -ENODATA);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	if (job->vm)
+		drm_sched_increase_karma(&job->base);
 	atomic_inc(&ring->adev->gpu_reset_counter);
 	while (!dma_fence_is_signaled(fence) &&
 	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
Am 15.01.24 um 12:54 schrieb Joshua Ashton:
[SNIP]
The question here is really if we should handle soft recovered errors as fatal or not. Marek is in favor of that, Michel is against it.
Figure out what you want in userspace and I'm happy to implement it :)
(That being said, without my patches, RADV treats *any* reset from the query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it was not guilty, so any faulting/hanging application causes every Vulkan app to die e_e. This is fixed in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )
That is actually intended behavior. When something disrupted the GPU execution and the application is affected it is mandatory to forward this error to the application.
No. If said context was entirely unaffected, then it should be completely transparent to the application.
Consider the following:
- I have Counter-Strike 2 running - I have Gamescope running
I then go ahead and start HangApp that hangs the GPU.
Soft recovery happens and that clears out all the work for the specific VMID for HangApp's submissions and signals the submission fence.
In this instance, the Gamescope and Counter-Strike 2 ctxs are completely unaffected and don't need to report VK_ERROR_DEVICE_LOST as there was no impact to their work.
Ok, that is something I totally agree on. But why would the Gamescope and Counter-Strike 2 app report VK_ERROR_DEVICE_LOST for a soft recovery in the first place?
IIRC a soft recovery doesn't increment the reset counter in any way. So they should never be affected.
Regards, Christian.
Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem with HangApp, FWIU the way that the clear-out works being vmid specific means that they would be unaffected, right?
- Joshie 🐸✨
Regards, Christian.
- Joshie 🐸✨
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
 	dma_fence_set_error(fence, -ENODATA);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	if (job->vm)
+		drm_sched_increase_karma(&job->base);
 	atomic_inc(&ring->adev->gpu_reset_counter);
 	while (!dma_fence_is_signaled(fence) &&
 	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
On 1/15/24 13:19, Christian König wrote:
Am 15.01.24 um 12:54 schrieb Joshua Ashton:
[SNIP]
The question here is really if we should handle soft recovered errors as fatal or not. Marek is in favor of that, Michel is against it.
Figure out what you want in userspace and I'm happy to implement it :)
(That being said, without my patches, RADV treats *any* reset from the query as VK_ERROR_DEVICE_LOST, even if there was no VRAM lost and it was not guilty, so any faulting/hanging application causes every Vulkan app to die e_e. This is fixed in https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 )
That is actually intended behavior. When something disrupted the GPU execution and the application is affected it is mandatory to forward this error to the application.
No. If said context was entirely unaffected, then it should be completely transparent to the application.
Consider the following:
- I have Counter-Strike 2 running - I have Gamescope running
I then go ahead and start HangApp that hangs the GPU.
Soft recovery happens and that clears out all the work for the specific VMID for HangApp's submissions and signals the submission fence.
In this instance, the Gamescope and Counter-Strike 2 ctxs are completely unaffected and don't need to report VK_ERROR_DEVICE_LOST as there was no impact to their work.
Ok, that is something I totally agree on. But why would the Gamescope and Counter-Strike 2 app report VK_ERROR_DEVICE_LOST for a soft recovery in the first place?
IIRC a soft recovery doesn't increment the reset counter in any way. So they should never be affected.
It does, and without assigning any guilt, amdgpu_ring_soft_recovery calls atomic_inc(&ring->adev->gpu_reset_counter).
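And that is why every other context then sees it: the query path just compares counters. Paraphrasing amdgpu_ctx_query2() from memory (the exact code may differ):

/* Paraphrased from amdgpu_ctx.c:amdgpu_ctx_query2(); sketch from memory.
 * Any bump of adev->gpu_reset_counter makes every context created before
 * the soft recovery report FLAGS_RESET, guilty or not. */
if (ctx->reset_counter != atomic_read(&adev->gpu_reset_counter))
	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_RESET;
if (ctx->vram_lost_counter != atomic_read(&adev->vram_lost_counter))
	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_VRAMLOST;
if (atomic_read(&ctx->guilty))
	out->state.flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;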
Regards, Christian.
Even if Gamescope or Counter-Strike 2 were occupying CUs in tandem with HangApp, FWIU the way that the clear-out works being vmid specific means that they would be unaffected, right?
- Joshie 🐸✨
Regards, Christian.
- Joshie 🐸✨
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
 	dma_fence_set_error(fence, -ENODATA);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	if (job->vm)
+		drm_sched_increase_karma(&job->base);
 	atomic_inc(&ring->adev->gpu_reset_counter);
 	while (!dma_fence_is_signaled(fence) &&
 	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
- Joshie 🐸✨
Am 13.01.24 um 15:02 schrieb Joshua Ashton:
We need to bump the karma of the drm_sched job in order for the context that we just recovered to get correct feedback that it is guilty of hanging.
Big NAK to that approach, the karma handling is completely deprecated.
When you want to signal execution errors please use the fence error code.
Without this feedback, the application may keep pushing through the soft recoveries, continually hanging the system with jobs that timeout.
Well, that is intentional behavior. Marek is voting for making soft recovered errors fatal as well, while Michel is voting for ignoring them instead.
I'm not really sure what to do. If you guys think that soft recovered hangs should be fatal as well then we can certainly do this.
Regards, Christian.
There is an accompanying Mesa/RADV patch here https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 to properly handle device loss state when VRAM is not lost.
With these, I was able to run Counter-Strike 2 and launch an application which can fault the GPU in a variety of ways, and still have Steam + Counter-Strike 2 + Gamescope (compositor) stay up and continue functioning on Steam Deck.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
 	dma_fence_set_error(fence, -ENODATA);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	if (job->vm)
+		drm_sched_increase_karma(&job->base);
 	atomic_inc(&ring->adev->gpu_reset_counter);
 	while (!dma_fence_is_signaled(fence) &&
 	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
On 1/15/24 09:40, Christian König wrote:
Am 13.01.24 um 15:02 schrieb Joshua Ashton:
We need to bump the karma of the drm_sched job in order for the context that we just recovered to get correct feedback that it is guilty of hanging.
Big NAK to that approach, the karma handling is completely deprecated.
When you want to signal execution errors please use the fence error code.
The fence error code does not result in ctx's being marked as guilty, only the karma path does.
See drm_sched_increase_karma.
Are you proposing that we instead mark contexts as guilty with the fence error ourselves here?
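To spell out the plumbing as I understand it (a rough sketch from memory, not verbatim kernel code):

/* amdgpu_ctx.c (sketch, arguments abbreviated): every entity of a context
 * is created with a pointer to the ctx-wide guilty flag. */
r = drm_sched_entity_init(&entity->entity, priority,
			  scheds, num_scheds, &ctx->guilty);

/* drm_sched_increase_karma() is the only thing that flips that flag: it bumps
 * the job's karma and does roughly atomic_set(entity->guilty, 1) for the
 * offending entity. dma_fence_set_error() on the hardware fence never touches
 * it, which is why soft recovery alone never reports FLAGS_GUILTY today. */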
Without this feedback, the application may keep pushing through the soft recoveries, continually hanging the system with jobs that timeout.
Well, that is intentional behavior. Marek is voting for making soft recovered errors fatal as well, while Michel is voting for ignoring them instead.
I'm not really sure what to do. If you guys think that soft recovered hangs should be fatal as well then we can certainly do this.
They have to be!
As Marek and I have pointed out, applications that hang or fault will just hang or fault again, especially when they use things like draw indirect, buffer device address, descriptor buffers, etc.
The majority of apps these days have a lot of side effects and persistence between frames and submissions.
- Joshie 🐸✨
Regards, Christian.
There is an accompanying Mesa/RADV patch here https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/27050 to properly handle device loss state when VRAM is not lost.
With these, I was able to run Counter-Strike 2 and launch an application which can fault the GPU in a variety of ways, and still have Steam + Counter-Strike 2 + Gamescope (compositor) stay up and continue functioning on Steam Deck.
Signed-off-by: Joshua Ashton joshua@froggi.es
Cc: Friedrich Vock friedrich.vock@gmx.de
Cc: Bas Nieuwenhuizen bas@basnieuwenhuizen.nl
Cc: Christian König christian.koenig@amd.com
Cc: André Almeida andrealmeid@igalia.com
Cc: stable@vger.kernel.org

 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 25209ce54552..e87cafb5b1c3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -448,6 +448,8 @@ bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, struct amdgpu_job *job)
 	dma_fence_set_error(fence, -ENODATA);
 	spin_unlock_irqrestore(fence->lock, flags);
 
+	if (job->vm)
+		drm_sched_increase_karma(&job->base);
 	atomic_inc(&ring->adev->gpu_reset_counter);
 	while (!dma_fence_is_signaled(fence) &&
 	       ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)