This is a workaround for s3 resume hang for r7 340(amdgpu). When we test s3 with r7 340 on arm64 platform, graphics card will hang up, the error message are as follows: Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.599374][ 7] [ T291] amdgpu 0000:02:00.0: fb0: amdgpudrmfb frame buffer device Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.612869][ 7] [ T291] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22 Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.623392][ 7] [ T291] amdgpu 0000:02:00.0: amdgpu_device_ip_late_init failed Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.630696][ 7] [ T291] amdgpu 0000:02:00.0: Fatal error during GPU init Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.637477][ 7] [ T291] [drm] amdgpu: finishing device.
On the following hardware: lspci -nn -s 05:00.0 05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Oland [Radeon HD 8570 / R7 240/340 / Radeon 520 OEM] [1002:6611] (rev 87)
Signed-off-by: Zhenneng Li lizhenneng@kylinos.cn --- drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 3987ecb24ef4..1eced991b5b2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2903,6 +2903,8 @@ static void amdgpu_device_delayed_init_work_handler(struct work_struct *work) container_of(work, struct amdgpu_device, delayed_init_work.work); int r;
+ mdelay(1); + r = amdgpu_ib_ring_tests(adev); if (r) DRM_ERROR("ib ring test failed (%d).\n", r);
Am 29.03.22 um 05:05 schrieb Zhenneng Li:
This is a workaround for s3 resume hang for r7 340(amdgpu). When we test s3 with r7 340 on arm64 platform, graphics card will hang up, the error message are as follows: Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.599374][ 7] [ T291] amdgpu 0000:02:00.0: fb0: amdgpudrmfb frame buffer device Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.612869][ 7] [ T291] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22 Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.623392][ 7] [ T291] amdgpu 0000:02:00.0: amdgpu_device_ip_late_init failed Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.630696][ 7] [ T291] amdgpu 0000:02:00.0: Fatal error during GPU init Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.637477][ 7] [ T291] [drm] amdgpu: finishing device.
On the following hardware: lspci -nn -s 05:00.0 05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Oland [Radeon HD 8570 / R7 240/340 / Radeon 520 OEM] [1002:6611] (rev 87)
Well that's rather funny and certainly a NAK. To recap you are adding a delay to a delayed work handler. In other words you could delay the work handler in the first place :)
But this is not the reason why that here is a NAK. The more obvious problem is that we seem to have a race between the DPM code kicking in to save power after driver load and the asynchronous testing if userspace command submission works.
Adding the delay here works around that for the IB submission, but there can be other things going on in parallel which can fail as well.
Please rather open up a bug report instead.
Regards, Christian.
Signed-off-by: Zhenneng Li lizhenneng@kylinos.cn
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 3987ecb24ef4..1eced991b5b2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2903,6 +2903,8 @@ static void amdgpu_device_delayed_init_work_handler(struct work_struct *work) container_of(work, struct amdgpu_device, delayed_init_work.work); int r;
- mdelay(1);
- r = amdgpu_ib_ring_tests(adev); if (r) DRM_ERROR("ib ring test failed (%d).\n", r);
On Tue, Mar 29, 2022 at 08:20:24AM +0200, Christian König wrote:
Am 29.03.22 um 05:05 schrieb Zhenneng Li:
This is a workaround for s3 resume hang for r7 340(amdgpu). When we test s3 with r7 340 on arm64 platform, graphics card will hang up, the error message are as follows: Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.599374][ 7] [ T291] amdgpu 0000:02:00.0: fb0: amdgpudrmfb frame buffer device Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.612869][ 7] [ T291] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22 Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.623392][ 7] [ T291] amdgpu 0000:02:00.0: amdgpu_device_ip_late_init failed Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.630696][ 7] [ T291] amdgpu 0000:02:00.0: Fatal error during GPU init Mar 4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [ 1.637477][ 7] [ T291] [drm] amdgpu: finishing device.
On the following hardware: lspci -nn -s 05:00.0 05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Oland [Radeon HD 8570 / R7 240/340 / Radeon 520 OEM] [1002:6611] (rev 87)
Well that's rather funny and certainly a NAK. To recap you are adding a delay to a delayed work handler. In other words you could delay the work handler in the first place :)
But this is not the reason why that here is a NAK. The more obvious problem is that we seem to have a race between the DPM code kicking in to save power after driver load and the asynchronous testing if userspace command submission works.
Adding the delay here works around that for the IB submission, but there can be other things going on in parallel which can fail as well.
Yeah standard pattern for this is to refcount your dpm code (using power domains or runtime pm ideally or hand-rolled if you have to). And then grabbing a dpm reference before you launch that work, and dropping that when the work has finished.
That gives you a nice clean way to handle all these problems around "right now I'm really not ready to allow low power states" in a very clean fashion. arm-soc drivers go totally overboard on this with runtime pm on all the chip components, that's maybe a bit much but afaiui we could do it on big pci drivers with power domains too :-)
Also with power domains you get autosuspend delay timers for free and tunable in sysfs ...
Cheers, Daniel
Please rather open up a bug report instead.
Regards, Christian.
Signed-off-by: Zhenneng Li lizhenneng@kylinos.cn
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 3987ecb24ef4..1eced991b5b2 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -2903,6 +2903,8 @@ static void amdgpu_device_delayed_init_work_handler(struct work_struct *work) container_of(work, struct amdgpu_device, delayed_init_work.work); int r;
- mdelay(1);
- r = amdgpu_ib_ring_tests(adev); if (r) DRM_ERROR("ib ring test failed (%d).\n", r);
linaro-mm-sig@lists.linaro.org