Hi,
no, right at the first cold boot with the patched kernel the warning appeared:
May 2 21:50:27 xxx kernel: WARNING: CPU: 0 PID: 1 at drivers/iommu/amd/init.c:851 amd_iommu_enable_interrupts+0x312/0x3f0
May 2 21:50:27 xxx kernel: Modules linked in:
May 2 21:50:27 xxx kernel: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.17.5 #2
May 2 21:50:27 xxx kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C94/MAG B550M MORTAR (MS-7C94), BIOS 1.94 09/23/2021
May 2 21:50:27 xxx kernel: RIP: 0010:amd_iommu_enable_interrupts+0x312/0x3f0
May 2 21:50:27 xxx kernel: Code: ff ff 49 8b 7f 18 89 04 24 e8 2a ff f6 ff 8b 04 24 e9 7b fd ff ff 0f 0b 4d 8b 3f 49 81 ff 90 15 4c 9f 0f 85 35 fd ff ff eb 82 <0f> 0b 4d 8b 3f 49 81 ff 90 15 4c 9f 0f 85 21 fd ff ff e9 6b ff ff
May 2 21:50:27 xxx kernel: RSP: 0018:ffffb9ad4005fdd8 EFLAGS: 00010246
May 2 21:50:27 xxx kernel: RAX: 00000015be386e7c RBX: 0000000000000000 RCX: 0000000000000000
May 2 21:50:27 xxx kernel: RDX: 0000000000009e16 RSI: 0000000000009427 RDI: 00000015be37d066
May 2 21:50:27 xxx kernel: RBP: 0000000080000000 R08: ffffffffffffffff R09: 0000000000000000
May 2 21:50:27 xxx kernel: R10: 00000000000000d1 R11: 0000000000000000 R12: 000ffffffffffff8
May 2 21:50:27 xxx kernel: R13: 0800000000000000 R14: 0008000000000000 R15: ffff9a4600190000
May 2 21:50:27 xxx kernel: FS: 0000000000000000(0000) GS:ffff9a53f1e00000(0000) knlGS:0000000000000000
May 2 21:50:27 xxx kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 2 21:50:27 xxx kernel: CR2: ffff9a51c9c01000 CR3: 0000000cc960a000 CR4: 0000000000750ef0
May 2 21:50:27 xxx kernel: PKRU: 55555554
May 2 21:50:27 xxx kernel: Call Trace:
May 2 21:50:27 xxx kernel: <TASK>
May 2 21:50:27 xxx kernel: iommu_go_to_state+0x10e0/0x138d
May 2 21:50:27 xxx kernel: ? e820__memblock_setup+0x78/0x78
May 2 21:50:27 xxx kernel: amd_iommu_init+0xa/0x20
May 2 21:50:27 xxx kernel: pci_iommu_init+0x11/0x3a
May 2 21:50:27 xxx kernel: do_one_initcall+0x47/0x180
May 2 21:50:27 xxx kernel: kernel_init_freeable+0x162/0x1a7
May 2 21:50:27 xxx kernel: ? rest_init+0xc0/0xc0
May 2 21:50:27 xxx kernel: kernel_init+0x11/0x110
May 2 21:50:27 xxx kernel: ret_from_fork+0x22/0x30
May 2 21:50:27 xxx kernel: </TASK>
For a cold boot I switch the computer off for about 30 seconds and then switch it on again. I booted to a console and checked for warnings with `dmesg -l warn`. Then I tried to start X with `startx`, but the screen froze. Via ssh I issued `reboot`, i.e. a warm start. After that the warning did not appear, and I could start X and work normally.
In 'kern.log' I also found this:
May 2 21:53:27 xxx kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=16, emitted seq=17
May 2 21:53:27 xxx kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1787 thread Xorg:cs0 pid 1788
May 2 21:53:27 xxx kernel: amdgpu 0000:30:00.0: amdgpu: GPU reset begin!
May 2 21:53:27 xxx kernel: amdgpu 0000:30:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
May 2 21:53:27 xxx kernel: [drm] free PSP TMR buffer
May 2 21:53:27 xxx kernel: amdgpu 0000:30:00.0: amdgpu: MODE2 reset
May 2 21:53:27 xxx kernel: amdgpu 0000:30:00.0: amdgpu: GPU reset succeeded, trying to resume
May 2 21:53:27 xxx kernel: [drm] PCIE GART of 1024M enabled.
May 2 21:53:27 xxx kernel: [drm] PTB located at 0x000000F400900000
May 2 21:53:27 xxx kernel: [drm] PSP is resuming...
May 2 21:53:27 xxx kernel: [drm] reserve 0x400000 from 0xf4ff800000 for PSP TMR
May 2 21:53:27 xxx kernel: amdgpu 0000:30:00.0: amdgpu: RAS: optional ras ta ucode is not available
May 2 21:53:27 xxx kernel: amdgpu 0000:30:00.0: amdgpu: RAP: optional rap ta ucode is not available
May 2 21:53:27 xxx kernel: amdgpu 0000:30:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
May 2 21:53:27 xxx kernel: amdgpu 0000:30:00.0: amdgpu: SMU is resuming...
May 2 21:53:27 xxx kernel: amdgpu 0000:30:00.0: amdgpu: SMU is resumed successfully!
May 2 21:53:27 xxx kernel: [drm] DMUB hardware initialized: version=0x0101001F
May 2 21:53:28 xxx kernel: [drm] kiq ring mec 2 pipe 1 q 0
May 2 21:53:28 xxx kernel: amdgpu 0000:30:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
May 2 21:53:28 xxx kernel: [drm:amdgpu_gfx_enable_kcq.cold [amdgpu]] *ERROR* KCQ enable failed
May 2 21:53:28 xxx kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
May 2 21:53:28 xxx kernel: amdgpu 0000:30:00.0: amdgpu: GPU reset(2) failed
May 2 21:53:28 xxx kernel: amdgpu 0000:30:00.0: amdgpu: GPU reset end with ret = -110
May 2 21:53:38 xxx kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=17, emitted seq=17
May 2 21:53:38 xxx kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1787 thread Xorg:cs0 pid 1788
May 2 21:53:38 xxx kernel: amdgpu 0000:30:00.0: amdgpu: GPU reset begin!
Thanks for your help. Regards, Jörg.
JoergRoedel wrote on 02/05/2022 11:45:
[now with Vasant's correct email address]
Hi Jörg,
can you please try the attached patch? It should get rid of the WARNING on your system.
Suravee, Vasant, can you please test and review the patch and report whether the GA log functionality is still working?
Thanks,
Joerg
From 4fee768d5c23715eae31fed3b41cdf045e099aef Mon Sep 17 00:00:00 2001
From: Joerg Roedel <jroedel@suse.de>
Date: Mon, 2 May 2022 11:37:43 +0200
Subject: [PATCH] iommu/amd: Do not poll GA_LOG_RUNNING mask at boot
On some hardware it takes more than a second for the hardware to get the GA log into running state. This is too long to poll for in the AMD IOMMU driver code.
Instead, check whether initialization was successful before polling the log for the first time.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
 drivers/iommu/amd/amd_iommu_types.h |  3 +++
 drivers/iommu/amd/init.c            | 13 ++-----------
 drivers/iommu/amd/iommu.c           | 25 ++++++++++++++++++++++++-
 3 files changed, 29 insertions(+), 12 deletions(-)
<snip>
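For readers following along without the snipped diff, the idea in the commit message (do not busy-wait for the GA log at boot, check its state on first use instead) can be sketched in user-space C roughly as below. This is only an illustration under assumptions, not the code from the patch: `struct fake_iommu`, `FAKE_STATUS_GALOG_RUN`, and all `fake_*` functions are hypothetical names, and the "status register" is an ordinary struct member rather than an MMIO read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "log is running" bit; not the real register layout. */
#define FAKE_STATUS_GALOG_RUN (1u << 8)

struct fake_iommu {
    uint32_t status;      /* stands in for the hardware status register */
    bool ga_log_running;  /* set once the running bit has been observed */
    unsigned int polls;   /* number of polls that actually ran */
};

/* Boot path: request the log and return at once, with no wait loop. */
static void fake_ga_log_enable(struct fake_iommu *iommu)
{
    assert(iommu != NULL);
    iommu->ga_log_running = false;
    iommu->polls = 0;
}

/* Deferred check: look at the running bit only when the log is needed. */
static bool fake_ga_log_ready(struct fake_iommu *iommu)
{
    assert(iommu != NULL);
    if (!iommu->ga_log_running)
        iommu->ga_log_running =
            (iommu->status & FAKE_STATUS_GALOG_RUN) != 0;
    return iommu->ga_log_running;
}

/* Poll path: returns -1 while the log is not running yet, 0 otherwise. */
static int fake_poll_ga_log(struct fake_iommu *iommu)
{
    if (!fake_ga_log_ready(iommu))
        return -1;
    iommu->polls++;  /* a real driver would consume log entries here */
    return 0;
}
```

In this sketch a caller would run `fake_ga_log_enable()` once during initialization and then call `fake_poll_ga_log()` whenever it wants to consume the log. A poll made before the simulated hardware sets the running bit simply returns -1 instead of stalling the boot path, and a positive result is remembered so later polls skip the check.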