From: Jack Xiao <Jack.Xiao@amd.com>
wait memory room until enough before writing mes packets to avoid ring buffer overflow.
v2: squash in sched_hw_submission fix
Backport from 6.11.
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3571
Fixes: de3246254156 ("drm/amdgpu: cleanup MES11 command submission")
Fixes: fffe347e1478 ("drm/amdgpu: cleanup MES12 command submission")
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit 34e087e8920e635c62e2ed6a758b0cd27f836d13)
Cc: stable@vger.kernel.org # 6.10.x
(cherry picked from commit 11752c013f562a1124088a35bd314aa0e9f0e88f)
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c |  2 ++
 drivers/gpu/drm/amd/amdgpu/mes_v11_0.c   | 18 ++++++++++++++----
 2 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 06f0a6534a94..88ffb15e25cc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -212,6 +212,8 @@ int amdgpu_ring_init(struct amdgpu_device *adev, struct amdgpu_ring *ring,
 	 */
 	if (ring->funcs->type == AMDGPU_RING_TYPE_KIQ)
 		sched_hw_submission = max(sched_hw_submission, 256);
+	if (ring->funcs->type == AMDGPU_RING_TYPE_MES)
+		sched_hw_submission = 8;
 	else if (ring == &adev->sdma.instance[0].page)
 		sched_hw_submission = 256;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
index 32d4519541c6..e1a66d585f5e 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v11_0.c
@@ -163,7 +163,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
 	const char *op_str, *misc_op_str;
 	unsigned long flags;
 	u64 status_gpu_addr;
-	u32 status_offset;
+	u32 seq, status_offset;
 	u64 *status_ptr;
 	signed long r;
 	int ret;
@@ -191,6 +191,13 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
 	if (r)
 		goto error_unlock_free;
 
+	seq = ++ring->fence_drv.sync_seq;
+	r = amdgpu_fence_wait_polling(ring,
+				      seq - ring->fence_drv.num_fences_mask,
+				      timeout);
+	if (r < 1)
+		goto error_undo;
+
 	api_status = (struct MES_API_STATUS *)((char *)pkt + api_status_off);
 	api_status->api_completion_fence_addr = status_gpu_addr;
 	api_status->api_completion_fence_value = 1;
@@ -203,8 +210,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
 	mes_status_pkt.header.dwsize = API_FRAME_SIZE_IN_DWORDS;
 	mes_status_pkt.api_status.api_completion_fence_addr =
 		ring->fence_drv.gpu_addr;
-	mes_status_pkt.api_status.api_completion_fence_value =
-		++ring->fence_drv.sync_seq;
+	mes_status_pkt.api_status.api_completion_fence_value = seq;
 
 	amdgpu_ring_write_multiple(ring, &mes_status_pkt,
 				   sizeof(mes_status_pkt) / 4);
@@ -224,7 +230,7 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
 		dev_dbg(adev->dev, "MES msg=%d was emitted\n",
 			x_pkt->header.opcode);
 
-	r = amdgpu_fence_wait_polling(ring, ring->fence_drv.sync_seq, timeout);
+	r = amdgpu_fence_wait_polling(ring, seq, timeout);
 	if (r < 1 || !*status_ptr) {
 
 		if (misc_op_str)
@@ -247,6 +253,10 @@ static int mes_v11_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
 	amdgpu_device_wb_free(adev, status_offset);
 	return 0;
 
+error_undo:
+	dev_err(adev->dev, "MES ring buffer is full.\n");
+	amdgpu_ring_undo(ring);
+
 error_unlock_free:
 	spin_unlock_irqrestore(&mes->ring_lock, flags);
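The mes_v11_0.c hunks reserve a fence sequence number first and only write packets once the fence that is num_fences_mask entries older has signaled, which bounds the number of submissions in flight. A minimal user-space sketch of that bounding idea follows. Everything named toy_* is hypothetical and is not the amdgpu code. It also simplifies in two ways: it checks once instead of polling with a timeout, and it rolls the sequence counter back on failure, which the patch above does not do (the patch calls amdgpu_ring_undo() instead).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's fence bookkeeping. */
struct toy_fence_drv {
	uint32_t sync_seq;        /* last sequence number handed out */
	uint32_t signaled_seq;    /* last sequence number completed by the consumer */
	uint32_t num_fences_mask; /* upper bound on outstanding submissions */
};

/* Wrap-safe test: has sequence number `seq` already been signaled? */
static bool toy_seq_signaled(const struct toy_fence_drv *drv, uint32_t seq)
{
	return (int32_t)(drv->signaled_seq - seq) >= 0;
}

/*
 * Reserve the next sequence number, but refuse to proceed while the
 * submission that is num_fences_mask entries older is still pending.
 * Returns true and stores the reserved number in *out_seq on success.
 */
static bool toy_submit(struct toy_fence_drv *drv, uint32_t *out_seq)
{
	uint32_t seq = ++drv->sync_seq;

	if (!toy_seq_signaled(drv, seq - drv->num_fences_mask)) {
		--drv->sync_seq; /* simplification: give the number back */
		return false;    /* caller must not write packets now */
	}

	*out_seq = seq;
	return true;
}
```

With num_fences_mask set to 3 and nothing signaled yet, the first three calls to toy_submit() succeed and the fourth is refused until signaled_seq advances to 1.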
On Tue, Aug 27, 2024 at 10:10:25AM -0400, Alex Deucher wrote:
> From: Jack Xiao <Jack.Xiao@amd.com>
>
> wait memory room until enough before writing mes packets to avoid ring buffer overflow.
>
> v2: squash in sched_hw_submission fix
>
> Backport from 6.11.
>
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3571
> Fixes: de3246254156 ("drm/amdgpu: cleanup MES11 command submission")
> Fixes: fffe347e1478 ("drm/amdgpu: cleanup MES12 command submission")

These commits are in 6.11-rc1.

> Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
> Acked-by: Alex Deucher <alexander.deucher@amd.com>
> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> (cherry picked from commit 34e087e8920e635c62e2ed6a758b0cd27f836d13)
> Cc: stable@vger.kernel.org # 6.10.x

So why does this need to go to 6.10.y?
confused,
greg k-h
[Public]
> -----Original Message-----
> From: Greg KH <gregkh@linuxfoundation.org>
> Sent: Tuesday, August 27, 2024 10:21 AM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>
> Cc: stable@vger.kernel.org; sashal@kernel.org; Xiao, Jack <Jack.Xiao@amd.com>
> Subject: Re: [PATCH] drm/amdgpu/mes: fix mes ring buffer overflow
>
> On Tue, Aug 27, 2024 at 10:10:25AM -0400, Alex Deucher wrote:
> > From: Jack Xiao <Jack.Xiao@amd.com>
> >
> > wait memory room until enough before writing mes packets to avoid ring buffer overflow.
> >
> > v2: squash in sched_hw_submission fix
> >
> > Backport from 6.11.
> >
> > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3571
> > Fixes: de3246254156 ("drm/amdgpu: cleanup MES11 command submission")
> > Fixes: fffe347e1478 ("drm/amdgpu: cleanup MES12 command submission")
>
> These commits are in 6.11-rc1.

de3246254156 ("drm/amdgpu: cleanup MES11 command submission") was ported to 6.10 as well:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/driv...
So this fix is applicable there.

Alex

> > Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
> > Acked-by: Alex Deucher <alexander.deucher@amd.com>
> > Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
> > (cherry picked from commit 34e087e8920e635c62e2ed6a758b0cd27f836d13)
> > Cc: stable@vger.kernel.org # 6.10.x
>
> So why does this need to go to 6.10.y?
>
> confused,
>
> greg k-h
On Tue, Aug 27, 2024 at 03:01:54PM +0000, Deucher, Alexander wrote:
> [Public]
>
> > -----Original Message-----
> > From: Greg KH <gregkh@linuxfoundation.org>
> > Sent: Tuesday, August 27, 2024 10:21 AM
> > To: Deucher, Alexander <Alexander.Deucher@amd.com>
> > Cc: stable@vger.kernel.org; sashal@kernel.org; Xiao, Jack <Jack.Xiao@amd.com>
> > Subject: Re: [PATCH] drm/amdgpu/mes: fix mes ring buffer overflow
> >
> > On Tue, Aug 27, 2024 at 10:10:25AM -0400, Alex Deucher wrote:
> > > From: Jack Xiao <Jack.Xiao@amd.com>
> > >
> > > wait memory room until enough before writing mes packets to avoid ring buffer overflow.
> > >
> > > v2: squash in sched_hw_submission fix
> > >
> > > Backport from 6.11.
> > >
> > > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/3571
> > > Fixes: de3246254156 ("drm/amdgpu: cleanup MES11 command submission")
> > > Fixes: fffe347e1478 ("drm/amdgpu: cleanup MES12 command submission")
> >
> > These commits are in 6.11-rc1.
>
> de3246254156 ("drm/amdgpu: cleanup MES11 command submission") was ported to 6.10 as well:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/driv...
> So this fix is applicable there.

No, commit e356d321d024 ("drm/amdgpu: cleanup MES11 command submission") is in the 6.10 release, but commit de3246254156 ("drm/amdgpu: cleanup MES11 command submission") is in 6.11-rc1!
So how in the world are we supposed to know anything here?
See how broken this all is?
I give up.
If you all want any AMD patches applied to stable trees, manually send us a set of backported patches, AND be sure to get the git ids right.
I'll leave what I have right now in the queues, but after this round of -rc releases, all AMD patches with cc: stable are going to be automatically dropped and ignored. I NEED you all to manually send them to me now as this is just insane.
Time to go buy an Intel GPU card as there's no way this is going to work out well over time...
{sigh}
greg k-h
linux-stable-mirror@lists.linaro.org