On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
From: Jinhui Guo <guojinhui.liam@bytedance.com>
Sent: Thursday, December 11, 2025 12:00 PM
Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system.
According to the commit msg it actually tries to fix the hard lockup with surprise removal. For safe removal the device is not removed before invalidation is done:
" For safe removal, device wouldn't be removed until the whole software handling process is done, it wouldn't trigger the hard lock up issue caused by too long ATS Invalidation timeout wait. "
Can you help articulate the problem, especially the part about 'link-down caused by faults'? What are those faults? How do they differ from the surprise removal described in the commit msg, such that pci_dev_is_disconnected() is not set?
Hi Kevin, sorry for the delayed reply.
A normal or surprise removal of a PCIe device on a hot-plug port normally triggers an interrupt from the PCIe switch.
We have, however, observed cases where no interrupt is generated when the device suddenly loses its link; the behaviour is identical to setting the Link Disable bit in the switch’s Link Control register (offset 10h). Exactly what goes wrong in the LTSSM between the PCIe switch and the endpoint remains unknown.
For example, if a VM fails to connect to the PCIe device,
'failed' for what reason?
"virsh destroy" is executed to release resources and isolate the fault, but a hard-lockup occurs while releasing the group fd.
Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 intel_pasid_tear_down_entry
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput
Although pci_device_is_present() is slower than pci_dev_is_disconnected(), it still takes only ~70 µs on a ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed and width increase.
Besides, devtlb_invalidation_with_pasid() is called only in the paths below, which are far less frequent than memory map/unmap.
- mm-struct release
- {attach,release}_dev
- set/remove PASID
- dirty-tracking setup
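The difference between the two checks can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions, not kernel code: struct mock_dev and the mock_* helpers are hypothetical stand-ins, the "disconnected" flag models the software state that pci_dev_is_disconnected() tests, and the all-ones value models what a config-space read typically returns when the link is down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI function; not a kernel structure. */
struct mock_dev {
	bool     marked_disconnected; /* software state, set by the hot-plug path */
	uint32_t vendor_dword;        /* value a config read at offset 0 returns */
	int      flushes_issued;      /* how many devTLB flushes were queued */
};

/* Cheap check: consults software state only, never touches the device. */
static bool mock_is_disconnected(const struct mock_dev *d)
{
	assert(d != NULL);
	return d->marked_disconnected;
}

/* Slower check: additionally looks at what config space reads back. */
static bool mock_is_present(const struct mock_dev *d)
{
	if (mock_is_disconnected(d))
		return false;
	/* All-ones models a read from a device whose link is down. */
	return d->vendor_dword != 0xffffffffu;
}

/* Queue a devTLB flush only if the device still answers. */
static bool mock_flush_dev_tlb(struct mock_dev *d)
{
	if (!mock_is_present(d))
		return false;
	d->flushes_issued++;
	return true;
}
```

In this model a device whose link dropped without a hot-plug interrupt has marked_disconnected == false but reads back all-ones, so only the slower check skips the flush.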
Surprise removal can happen at any time, e.g. after the check of pci_device_is_present(). In the end we need the logic in qi_check_fault() to check the presence when an ITE timeout error is received, to break the infinite loop. So in your case, even with that logic in place, you still observe the lockup (probably because the hardware ITE timeout is longer than the lockup detection on the CPU)?
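The idea of re-checking presence when a timeout is reported can be sketched as a userspace model. Everything here is an assumption made for illustration: struct mock_iommu, its fields, and mock_wait_for_completion() are hypothetical, the loop is not the real qi_submit_sync()/qi_check_fault() code, and -ETIMEDOUT is simply the error chosen for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of an invalidation wait; not a kernel structure. */
struct mock_iommu {
	int  polls_until_done; /* completion arrives at this poll; <0 = never */
	int  polls_per_ite;    /* hardware reports a timeout every N polls */
	bool dev_present;      /* result of a presence probe on the target */
	int  polls;            /* number of polls performed so far */
};

/*
 * Spin until the wait descriptor completes. Each time a timeout (ITE)
 * is reported, probe the device: give up if it is gone, otherwise keep
 * waiting. Note that a device reported present but never completing
 * would still spin forever, so the check narrows the window rather
 * than closing it.
 */
static int mock_wait_for_completion(struct mock_iommu *m)
{
	assert(m->polls_per_ite > 0);

	for (;;) {
		m->polls++;
		if (m->polls_until_done >= 0 && m->polls >= m->polls_until_done)
			return 0;
		if (m->polls % m->polls_per_ite == 0 && !m->dev_present)
			return -ETIMEDOUT;
	}
}
```

With polls_per_ite = 4 and an absent device, the model returns after the fourth poll. How long the real loop spins before that point depends on how long the hardware takes to report the timeout, which is the quantity being compared with the lockup-detection window here.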
Are you referring to the timeout added in patch https://lore.kernel.org/all/20240222090251.2849702-4-haifeng.zhao@linux.inte... ?
Our lockup-detection timeout is the default 10 s.
We see ITE-timeout messages in the kernel log. Yet the system still hard-locks—probably because, as you mentioned, the hardware ITE timeout is longer than the CPU’s lockup-detection window. I’ll reproduce the case and follow up with a deeper analysis.
kernel: [ 2402.642685][ T607] vfio-pci 0000:3f:00.0: Unable to change power state from D0 to D3hot, device inaccessible
kernel: [ 2403.441828][T49880] DMAR: VT-d detected Invalidation Time-out Error: SID 0
kernel: [ 2403.441830][ C0] DMAR: DRHD: handling fault status reg 40
kernel: [ 2403.441831][T49880] DMAR: QI HEAD: Invalidation Wait qw0 = 0x200000025, qw1 = 0x1003a07fc
kernel: [ 2403.441833][T49880] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x1003a07f8
kernel: [ 2403.441879][T49880] DMAR: Invalidation Time-out Error (ITE) cleared
kernel: [ 2423.643527][ C7] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
kernel: [ 2423.643551][ C7] rcu: 8-...0: (0 ticks this GP) idle=198c/1/0x4000000000000000 softirq=19450/19450 fqs=4403
kernel: [ 2423.643567][ C7] rcu: (detected by 7, t=21002 jiffies, g=238909, q=4932 ncpus=96)
kernel: [ 2423.643578][ C7] Sending NMI from CPU 7 to CPUs 8:
kernel: [ 2423.643581][ C8] NMI backtrace for cpu 8
kernel: [ 2423.643585][ C8] CPU: 8 UID: 0 PID: 49880 Comm: vfio_test Kdump: loaded Tainted: G S E 6.18.0 #5 PREEMPT(voluntary)
kernel: [ 2423.643588][ C8] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
kernel: [ 2423.643589][ C8] Hardware name: Inspur NF5468M5/YZMB-01130-105, BIOS 4.2.0 04/28/2021
kernel: [ 2423.643590][ C8] RIP: 0010:qi_submit_sync+0x6cf/0x8d0
kernel: [ 2423.643597][ C8] Code: 89 4c 24 50 89 70 34 48 c7 c7 f0 f5 4a a5 e8 48 15 89 ff 48 8b 4c 24 50 8b 54 24 58 49 8b 76 10 49 63 c7 48 8d 04 86 83 38 01 <75> 06 c7 00 03 00 00 00 41 81 c7 fe 00 00 00 44 89 f8 c1 f8 1f c1
kernel: [ 2423.643598][ C8] RSP: 0018:ffffb5a3bd0a7a30 EFLAGS: 00000097
kernel: [ 2423.643600][ C8] RAX: ffff9dac803a06bc RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 2423.643601][ C8] RDX: 00000000000000fe RSI: ffff9dac803a0400 RDI: ffff9ddb0081d480
kernel: [ 2423.643602][ C8] RBP: ffff9dac8037fe00 R08: 0000000000000000 R09: 0000000000000003
kernel: [ 2423.643603][ C8] R10: ffffb5a3bd0a78e0 R11: ffff9e0bbff3c068 R12: 0000000000000040
kernel: [ 2423.643605][ C8] R13: ffff9dac80314600 R14: ffff9dac8037fe00 R15: 00000000000000af
kernel: [ 2423.643606][ C8] FS: 0000000000000000(0000) GS:ffff9ddb5a262000(0000) knlGS:0000000000000000
kernel: [ 2423.643607][ C8] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 2423.643608][ C8] CR2: 000000002aee3000 CR3: 000000024a27b002 CR4: 00000000007726f0
kernel: [ 2423.643610][ C8] PKRU: 55555554
kernel: [ 2423.643611][ C8] Call Trace:
kernel: [ 2423.643613][ C8] <TASK>
kernel: [ 2423.643616][ C8] ? __pfx_domain_context_clear_one_cb+0x10/0x10
kernel: [ 2423.643620][ C8] qi_flush_dev_iotlb+0xd5/0xe0
kernel: [ 2423.643622][ C8] __context_flush_dev_iotlb.part.0+0x3c/0x80
kernel: [ 2423.643625][ C8] domain_context_clear_one_cb+0x16/0x20
kernel: [ 2423.643626][ C8] pci_for_each_dma_alias+0x3b/0x140
kernel: [ 2423.643631][ C8] device_block_translation+0x122/0x180
kernel: [ 2423.643634][ C8] blocking_domain_attach_dev+0x39/0x50
kernel: [ 2423.643636][ C8] __iommu_attach_device+0x1b/0x90
kernel: [ 2423.643639][ C8] __iommu_device_set_domain+0x5d/0xb0
kernel: [ 2423.643642][ C8] __iommu_group_set_domain_internal+0x60/0x110
kernel: [ 2423.643644][ C8] iommu_detach_group+0x3a/0x60
kernel: [ 2423.643650][ C8] vfio_iommu_type1_detach_group+0x106/0x610 [vfio_iommu_type1]
kernel: [ 2423.643654][ C8] ? __dentry_kill+0x12a/0x180
kernel: [ 2423.643660][ C8] ? __pm_runtime_idle+0x44/0xe0
kernel: [ 2423.643666][ C8] vfio_group_detach_container+0x4f/0x160 [vfio]
kernel: [ 2423.643672][ C8] vfio_group_fops_release+0x3e/0x80 [vfio]
kernel: [ 2423.643677][ C8] __fput+0xe6/0x2b0
kernel: [ 2423.643682][ C8] task_work_run+0x58/0x90
kernel: [ 2423.643688][ C8] do_exit+0x29b/0xa80
kernel: [ 2423.643694][ C8] do_group_exit+0x2c/0x80
kernel: [ 2423.643696][ C8] get_signal+0x8f9/0x900
kernel: [ 2423.643700][ C8] arch_do_signal_or_restart+0x29/0x210
kernel: [ 2423.643704][ C8] ? __schedule+0x582/0xe80
kernel: [ 2423.643708][ C8] exit_to_user_mode_loop+0x8e/0x4f0
kernel: [ 2423.643712][ C8] do_syscall_64+0x262/0x630
kernel: [ 2423.643717][ C8] entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: [ 2423.643720][ C8] RIP: 0033:0x7fde19078514
kernel: [ 2423.643722][ C8] Code: Unable to access opcode bytes at 0x7fde190784ea.
kernel: [ 2423.643723][ C8] RSP: 002b:00007ffd0e1dc7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
kernel: [ 2423.643724][ C8] RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007fde19078514
kernel: [ 2423.643726][ C8] RDX: 00007fde1916e8c0 RSI: 000055b217303260 RDI: 0000000000000000
kernel: [ 2423.643727][ C8] RBP: 00007ffd0e1dc8a0 R08: 00007fde19173500 R09: 0000000000000000
kernel: [ 2423.643728][ C8] R10: fffffffffffffbea R11: 0000000000000246 R12: 000055b1f8d8d0b0
kernel: [ 2423.643729][ C8] R13: 00007ffd0e1dc980 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2423.643731][ C8] </TASK>
kernel: [ 2424.375254][T81463] vfio-pci 0000:3f:00.0: Unable to change power state from D3cold to D0, device inaccessible
...
kernel: [ 2448.327929][ C8] watchdog: CPU8: Watchdog detected hard LOCKUP on cpu 8
kernel: [ 2448.327932][ C8] Modules linked in: vfio_pci(E) vfio_pci_core(E) vfio_iommu_type1(E) vfio(E) udp_diag(E) tcp_diag(E) inet_diag(E) binfmt_misc(E) ip_set_hash_net(E) nft_compat(E) x_tables(E) ip_set(E) msr(E) nf_tables(E) ...
kernel: [ 2448.327963][ C8] ib_core(E) hid_generic(E) usbhid(E) hid(E) ahci(E) libahci(E) xhci_pci(E) libata(E) nvme(E) xhci_hcd(E) i2c_i801(E) nvme_core(E) usbcore(E) scsi_mod(E) mlx5_core(E) i2c_smbus(E) lpc_ich(E) usb_common(E) scsi_common(E) wmi(E)
kernel: [ 2448.327972][ C8] CPU: 8 UID: 0 PID: 49880 Comm: vfio_test Kdump: loaded Tainted: G S EL 6.18.0 #5 PREEMPT(voluntary)
kernel: [ 2448.327975][ C8] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE, [L]=SOFTLOCKUP
kernel: [ 2448.327976][ C8] Hardware name: Inspur NF5468M5/YZMB-01130-105, BIOS 4.2.0 04/28/2021
kernel: [ 2448.327977][ C8] RIP: 0010:qi_submit_sync+0x6e7/0x8d0
kernel: [ 2448.327981][ C8] Code: 8b 54 24 58 49 8b 76 10 49 63 c7 48 8d 04 86 83 38 01 75 06 c7 00 03 00 00 00 41 81 c7 fe 00 00 00 44 89 f8 c1 f8 1f c1 e8 18 <41> 01 c7 45 0f b6 ff 41 29 c7 44 39 fa 75 cb 48 85 c9 0f 85 05 01
kernel: [ 2448.327983][ C8] RSP: 0018:ffffb5a3bd0a7a30 EFLAGS: 00000046
kernel: [ 2448.327984][ C8] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 2448.327985][ C8] RDX: 00000000000000fe RSI: ffff9dac803a0400 RDI: ffff9ddb0081d480
kernel: [ 2448.327986][ C8] RBP: ffff9dac8037fe00 R08: 0000000000000000 R09: 0000000000000003
kernel: [ 2448.327987][ C8] R10: ffffb5a3bd0a78e0 R11: ffff9e0bbff3c068 R12: 0000000000000040
kernel: [ 2448.327988][ C8] R13: ffff9dac80314600 R14: ffff9dac8037fe00 R15: 00000000000001b3
kernel: [ 2448.327989][ C8] FS: 0000000000000000(0000) GS:ffff9ddb5a262000(0000) knlGS:0000000000000000
kernel: [ 2448.327990][ C8] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 2448.327991][ C8] CR2: 000000002aee3000 CR3: 000000024a27b002 CR4: 00000000007726f0
kernel: [ 2448.327992][ C8] PKRU: 55555554
kernel: [ 2448.327993][ C8] Call Trace:
kernel: [ 2448.327995][ C8] <TASK>
kernel: [ 2448.327997][ C8] ? __pfx_domain_context_clear_one_cb+0x10/0x10
kernel: [ 2448.328000][ C8] qi_flush_dev_iotlb+0xd5/0xe0
kernel: [ 2448.328002][ C8] __context_flush_dev_iotlb.part.0+0x3c/0x80
kernel: [ 2448.328004][ C8] domain_context_clear_one_cb+0x16/0x20
kernel: [ 2448.328006][ C8] pci_for_each_dma_alias+0x3b/0x140
kernel: [ 2448.328010][ C8] device_block_translation+0x122/0x180
kernel: [ 2448.328012][ C8] blocking_domain_attach_dev+0x39/0x50
kernel: [ 2448.328014][ C8] __iommu_attach_device+0x1b/0x90
kernel: [ 2448.328017][ C8] __iommu_device_set_domain+0x5d/0xb0
kernel: [ 2448.328019][ C8] __iommu_group_set_domain_internal+0x60/0x110
kernel: [ 2448.328021][ C8] iommu_detach_group+0x3a/0x60
kernel: [ 2448.328023][ C8] vfio_iommu_type1_detach_group+0x106/0x610 [vfio_iommu_type1]
kernel: [ 2448.328026][ C8] ? __dentry_kill+0x12a/0x180
kernel: [ 2448.328030][ C8] ? __pm_runtime_idle+0x44/0xe0
kernel: [ 2448.328035][ C8] vfio_group_detach_container+0x4f/0x160 [vfio]
kernel: [ 2448.328041][ C8] vfio_group_fops_release+0x3e/0x80 [vfio]
kernel: [ 2448.328046][ C8] __fput+0xe6/0x2b0
kernel: [ 2448.328049][ C8] task_work_run+0x58/0x90
kernel: [ 2448.328053][ C8] do_exit+0x29b/0xa80
kernel: [ 2448.328057][ C8] do_group_exit+0x2c/0x80
kernel: [ 2448.328060][ C8] get_signal+0x8f9/0x900
kernel: [ 2448.328064][ C8] arch_do_signal_or_restart+0x29/0x210
kernel: [ 2448.328068][ C8] ? __schedule+0x582/0xe80
kernel: [ 2448.328070][ C8] exit_to_user_mode_loop+0x8e/0x4f0
kernel: [ 2448.328074][ C8] do_syscall_64+0x262/0x630
kernel: [ 2448.328076][ C8] entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: [ 2448.328078][ C8] RIP: 0033:0x7fde19078514
kernel: [ 2448.328080][ C8] Code: Unable to access opcode bytes at 0x7fde190784ea.
kernel: [ 2448.328081][ C8] RSP: 002b:00007ffd0e1dc7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
kernel: [ 2448.328082][ C8] RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007fde19078514
kernel: [ 2448.328083][ C8] RDX: 00007fde1916e8c0 RSI: 000055b217303260 RDI: 0000000000000000
kernel: [ 2448.328085][ C8] RBP: 00007ffd0e1dc8a0 R08: 00007fde19173500 R09: 0000000000000000
kernel: [ 2448.328085][ C8] R10: fffffffffffffbea R11: 0000000000000246 R12: 000055b1f8d8d0b0
kernel: [ 2448.328086][ C8] R13: 00007ffd0e1dc980 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2448.328088][ C8] </TASK>
kernel: [ 2450.245901][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 41s! [mongoosev3-agen:4727]
In any case this change cannot fully fix the lockup. It only reduces the likelihood, which should be made clear.
I agree with the above, but it's better to cover more corner cases.
Best Regards,
Jinhui