Hi all,
We hit hard-lockups when the Intel IOMMU waits indefinitely for an ATS invalidation that cannot complete, especially under GDR high-load conditions.
1. Hard lockup when a passthrough PCIe NIC with ATS enabled goes link-down while the Intel IOMMU is in non-scalable mode. Two scenarios exist: NIC link-down with an explicit link-down event, and link-down without any event.
a) NIC link-down with an explicit link-down event.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 domain_context_clear_one_cb
 pci_for_each_dma_alias
 device_block_translation
 blocking_domain_attach_dev
 iommu_deinit_device
 __iommu_group_remove_device
 iommu_release_device
 iommu_bus_notifier
 blocking_notifier_call_chain
 bus_notify
 device_del
 pci_remove_bus_device
 pci_stop_and_remove_bus_device
 pciehp_unconfigure_device
 pciehp_disable_slot
 pciehp_handle_presence_or_link_change
 pciehp_ist
b) NIC link-down without an event - hard lockup on VM destroy.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 domain_context_clear_one_cb
 pci_for_each_dma_alias
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput
2. Hard lockup when a passthrough PCIe NIC with ATS enabled goes link-down while the Intel IOMMU is in scalable mode; the NIC loses its link without an event and the host hard-locks on VM destroy.

Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 intel_pasid_tear_down_entry
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput
Fix both issues with two patches:
1. Skip dev-IOTLB flush for inaccessible devices in __context_flush_dev_iotlb() using pci_device_is_present().
2. Use pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to skip ATS invalidation in devtlb_invalidation_with_pasid().
Best Regards,
Jinhui
---
v1: https://lore.kernel.org/all/20251210171431.1589-1-guojinhui.liam@bytedance.c...
Changelog v1 -> v2 (suggested by Baolu Lu):
- Simplify the pci_device_is_present() check in __context_flush_dev_iotlb().
- Add Cc: stable@vger.kernel.org to both patches.
Jinhui Guo (2):
  iommu/vt-d: Skip dev-iotlb flush for inaccessible PCIe device without scalable mode
  iommu/vt-d: Flush dev-IOTLB only when PCIe device is accessible in scalable mode
 drivers/iommu/intel/pasid.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
PCIe endpoints with ATS enabled and passed through to userspace (e.g., QEMU, DPDK) can hard-lock the host when their link drops, either by surprise removal or by a link fault.
Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") adds pci_dev_is_disconnected() to devtlb_invalidation_with_pasid() so ATS invalidation is skipped only when the device is being safely removed, but it applies only when Intel IOMMU scalable mode is enabled.
With scalable mode disabled or unsupported, a system hard-lock occurs when a PCIe endpoint's link drops because the Intel IOMMU waits indefinitely for an ATS invalidation that cannot complete.
Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 domain_context_clear_one_cb
 pci_for_each_dma_alias
 device_block_translation
 blocking_domain_attach_dev
 iommu_deinit_device
 __iommu_group_remove_device
 iommu_release_device
 iommu_bus_notifier
 blocking_notifier_call_chain
 bus_notify
 device_del
 pci_remove_bus_device
 pci_stop_and_remove_bus_device
 pciehp_unconfigure_device
 pciehp_disable_slot
 pciehp_handle_presence_or_link_change
 pciehp_ist
Commit 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release") adds intel_pasid_teardown_sm_context() to intel_iommu_release_device(), which calls qi_flush_dev_iotlb() and can also hard-lock the system when a PCIe endpoint's link drops.
Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 intel_context_flush_no_pasid
 device_pasid_table_teardown
 pci_pasid_table_teardown
 pci_for_each_dma_alias
 intel_pasid_teardown_sm_context
 intel_iommu_release_device
 iommu_deinit_device
 __iommu_group_remove_device
 iommu_release_device
 iommu_bus_notifier
 blocking_notifier_call_chain
 bus_notify
 device_del
 pci_remove_bus_device
 pci_stop_and_remove_bus_device
 pciehp_unconfigure_device
 pciehp_disable_slot
 pciehp_handle_presence_or_link_change
 pciehp_ist
Sometimes the endpoint loses connection without a link-down event (e.g., due to a link fault); killing the process (virsh destroy) then hard-locks the host.
Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 __context_flush_dev_iotlb.part.0
 domain_context_clear_one_cb
 pci_for_each_dma_alias
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput
pci_dev_is_disconnected() only covers safe-removal paths; pci_device_is_present() tests accessibility by reading vendor/device IDs and internally calls pci_dev_is_disconnected(). On a ConnectX-5 (8 GT/s, x2) this costs ~70 µs.
Since __context_flush_dev_iotlb() is only called on {attach,release}_dev paths (not hot), add pci_device_is_present() there to skip inaccessible devices and avoid the hard-lock.
Fixes: 37764b952e1b ("iommu/vt-d: Global devTLB flush when present context entry changed")
Fixes: 81e921fd3216 ("iommu/vt-d: Fix NULL domain on device release")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
 drivers/iommu/intel/pasid.c | 9 +++++++++
 1 file changed, 9 insertions(+)
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 3e2255057079..a369690f5926 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -1102,6 +1102,15 @@ static void __context_flush_dev_iotlb(struct device_domain_info *info)
 	if (!info->ats_enabled)
 		return;
 
+	/*
+	 * Skip dev-IOTLB flush for inaccessible PCIe devices to prevent the
+	 * Intel IOMMU from waiting indefinitely for an ATS invalidation that
+	 * cannot complete.
+	 */
+	if (dev_is_pci(info->dev) &&
+	    !pci_device_is_present(to_pci_dev(info->dev)))
+		return;
+
 	qi_flush_dev_iotlb(info->iommu, PCI_DEVID(info->bus, info->devfn),
 			   info->pfsid, info->ats_qdep, 0, MAX_AGAW_PFN_WIDTH);
Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system.
For example, if a VM fails to connect to the PCIe device, "virsh destroy" is executed to release resources and isolate the fault, but a hard-lockup occurs while releasing the group fd.
Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 intel_pasid_tear_down_entry
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput
Although pci_device_is_present() is slower than pci_dev_is_disconnected(), it still takes only ~70 µs on a ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed and width increase.
Besides, devtlb_invalidation_with_pasid() is called only in the paths below, which are far less frequent than memory map/unmap.
1. mm-struct release
2. {attach,release}_dev
3. set/remove PASID
4. dirty-tracking setup
The gain in system stability far outweighs the negligible cost of using pci_device_is_present() instead of pci_dev_is_disconnected() to decide when to skip ATS invalidation, especially under GDR high-load conditions.
Fixes: 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
 drivers/iommu/intel/pasid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index a369690f5926..e64d445de964 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -218,7 +218,7 @@ devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 	if (!info || !info->ats_enabled)
 		return;
 
-	if (pci_dev_is_disconnected(to_pci_dev(dev)))
+	if (!pci_device_is_present(to_pci_dev(dev)))
 		return;
 
 	sid = PCI_DEVID(info->bus, info->devfn);
From: Jinhui Guo <guojinhui.liam@bytedance.com>
Sent: Thursday, December 11, 2025 12:00 PM
Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system.
According to the commit msg it actually tries to fix the hard lockup with surprise removal. For safe removal the device is not removed before invalidation is done:
" For safe removal, device wouldn't be removed until the whole software handling process is done, it wouldn't trigger the hard lock up issue caused by too long ATS Invalidation timeout wait. "
Can you help articulate the problem, especially the part "link-down caused by faults"? What are those faults? How are they different from the said surprise removal in the commit msg, such that pci_dev_is_disconnected() is not set?
For example, if a VM fails to connect to the PCIe device,
'failed' for what reason?
"virsh destroy" is executed to release resources and isolate the fault, but a hard-lockup occurs while releasing the group fd.
Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 intel_pasid_tear_down_entry
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput
Although pci_device_is_present() is slower than pci_dev_is_disconnected(), it still takes only ~70 µs on a ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed and width increase.
Besides, devtlb_invalidation_with_pasid() is called only in the paths below, which are far less frequent than memory map/unmap.
- mm-struct release
- {attach,release}_dev
- set/remove PASID
- dirty-tracking setup
surprise removal can happen at any time, e.g. after the check of pci_device_is_present(). In the end we need the logic in qi_check_fault() to check the presence upon an ITE timeout error received, to break the infinite loop. So in your case, even with that logic in place, you still observe the lockup (probably because the hardware ITE timeout is longer than the lockup detection on the CPU?).
In any case this change cannot 100% fix the lockup. It just reduces the possibility which should be made clear.
On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
From: Jinhui Guo <guojinhui.liam@bytedance.com>
Sent: Thursday, December 11, 2025 12:00 PM
Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system.
According to the commit msg it actually tries to fix the hard lockup with surprise removal. For safe removal the device is not removed before invalidation is done:
" For safe removal, device wouldn't be removed until the whole software handling process is done, it wouldn't trigger the hard lock up issue caused by too long ATS Invalidation timeout wait. "
Can you help articulate the problem, especially the part "link-down caused by faults"? What are those faults? How are they different from the said surprise removal in the commit msg, such that pci_dev_is_disconnected() is not set?
Hi Kevin, sorry for the delayed reply.
A normal or surprise removal of a PCIe device on a hot-plug port normally triggers an interrupt from the PCIe switch.
We have, however, observed cases where no interrupt is generated when the device suddenly loses its link; the behaviour is identical to setting the Link Disable bit in the switch’s Link Control register (offset 10h). Exactly what goes wrong in the LTSSM between the PCIe switch and the endpoint remains unknown.
For example, if a VM fails to connect to the PCIe device,
'failed' for what reason?
"virsh destroy" is executed to release resources and isolate the fault, but a hard-lockup occurs while releasing the group fd.
Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 intel_pasid_tear_down_entry
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput
Although pci_device_is_present() is slower than pci_dev_is_disconnected(), it still takes only ~70 µs on a ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed and width increase.
Besides, devtlb_invalidation_with_pasid() is called only in the paths below, which are far less frequent than memory map/unmap.
- mm-struct release
- {attach,release}_dev
- set/remove PASID
- dirty-tracking setup
surprise removal can happen at any time, e.g. after the check of pci_device_is_present(). In the end we need the logic in qi_check_fault() to check the presence upon an ITE timeout error received, to break the infinite loop. So in your case, even with that logic in place, you still observe the lockup (probably because the hardware ITE timeout is longer than the lockup detection on the CPU?).
Are you referring to the timeout added in patch https://lore.kernel.org/all/20240222090251.2849702-4-haifeng.zhao@linux.inte... ?
Our lockup-detection timeout is the default 10 s.
We see ITE-timeout messages in the kernel log. Yet the system still hard-locks, probably because, as you mentioned, the hardware ITE timeout is longer than the CPU's lockup-detection window. I'll reproduce the case and follow up with a deeper analysis.
kernel: [ 2402.642685][ T607] vfio-pci 0000:3f:00.0: Unable to change power state from D0 to D3hot, device inaccessible
kernel: [ 2403.441828][T49880] DMAR: VT-d detected Invalidation Time-out Error: SID 0
kernel: [ 2403.441830][ C0] DMAR: DRHD: handling fault status reg 40
kernel: [ 2403.441831][T49880] DMAR: QI HEAD: Invalidation Wait qw0 = 0x200000025, qw1 = 0x1003a07fc
kernel: [ 2403.441833][T49880] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x1003a07f8
kernel: [ 2403.441879][T49880] DMAR: Invalidation Time-out Error (ITE) cleared
kernel: [ 2423.643527][ C7] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
kernel: [ 2423.643551][ C7] rcu: 8-...0: (0 ticks this GP) idle=198c/1/0x4000000000000000 softirq=19450/19450 fqs=4403
kernel: [ 2423.643567][ C7] rcu: (detected by 7, t=21002 jiffies, g=238909, q=4932 ncpus=96)
kernel: [ 2423.643578][ C7] Sending NMI from CPU 7 to CPUs 8:
kernel: [ 2423.643581][ C8] NMI backtrace for cpu 8
kernel: [ 2423.643585][ C8] CPU: 8 UID: 0 PID: 49880 Comm: vfio_test Kdump: loaded Tainted: G S E 6.18.0 #5 PREEMPT(voluntary)
kernel: [ 2423.643588][ C8] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
kernel: [ 2423.643589][ C8] Hardware name: Inspur NF5468M5/YZMB-01130-105, BIOS 4.2.0 04/28/2021
kernel: [ 2423.643590][ C8] RIP: 0010:qi_submit_sync+0x6cf/0x8d0
kernel: [ 2423.643597][ C8] Code: 89 4c 24 50 89 70 34 48 c7 c7 f0 f5 4a a5 e8 48 15 89 ff 48 8b 4c 24 50 8b 54 24 58 49 8b 76 10 49 63 c7 48 8d 04 86 83 38 01 <75> 06 c7 00 03 00 00 00 41 81 c7 fe 00 00 00 44 89 f8 c1 f8 1f c1
kernel: [ 2423.643598][ C8] RSP: 0018:ffffb5a3bd0a7a30 EFLAGS: 00000097
kernel: [ 2423.643600][ C8] RAX: ffff9dac803a06bc RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 2423.643601][ C8] RDX: 00000000000000fe RSI: ffff9dac803a0400 RDI: ffff9ddb0081d480
kernel: [ 2423.643602][ C8] RBP: ffff9dac8037fe00 R08: 0000000000000000 R09: 0000000000000003
kernel: [ 2423.643603][ C8] R10: ffffb5a3bd0a78e0 R11: ffff9e0bbff3c068 R12: 0000000000000040
kernel: [ 2423.643605][ C8] R13: ffff9dac80314600 R14: ffff9dac8037fe00 R15: 00000000000000af
kernel: [ 2423.643606][ C8] FS: 0000000000000000(0000) GS:ffff9ddb5a262000(0000) knlGS:0000000000000000
kernel: [ 2423.643607][ C8] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 2423.643608][ C8] CR2: 000000002aee3000 CR3: 000000024a27b002 CR4: 00000000007726f0
kernel: [ 2423.643610][ C8] PKRU: 55555554
kernel: [ 2423.643611][ C8] Call Trace:
kernel: [ 2423.643613][ C8] <TASK>
kernel: [ 2423.643616][ C8] ? __pfx_domain_context_clear_one_cb+0x10/0x10
kernel: [ 2423.643620][ C8] qi_flush_dev_iotlb+0xd5/0xe0
kernel: [ 2423.643622][ C8] __context_flush_dev_iotlb.part.0+0x3c/0x80
kernel: [ 2423.643625][ C8] domain_context_clear_one_cb+0x16/0x20
kernel: [ 2423.643626][ C8] pci_for_each_dma_alias+0x3b/0x140
kernel: [ 2423.643631][ C8] device_block_translation+0x122/0x180
kernel: [ 2423.643634][ C8] blocking_domain_attach_dev+0x39/0x50
kernel: [ 2423.643636][ C8] __iommu_attach_device+0x1b/0x90
kernel: [ 2423.643639][ C8] __iommu_device_set_domain+0x5d/0xb0
kernel: [ 2423.643642][ C8] __iommu_group_set_domain_internal+0x60/0x110
kernel: [ 2423.643644][ C8] iommu_detach_group+0x3a/0x60
kernel: [ 2423.643650][ C8] vfio_iommu_type1_detach_group+0x106/0x610 [vfio_iommu_type1]
kernel: [ 2423.643654][ C8] ? __dentry_kill+0x12a/0x180
kernel: [ 2423.643660][ C8] ? __pm_runtime_idle+0x44/0xe0
kernel: [ 2423.643666][ C8] vfio_group_detach_container+0x4f/0x160 [vfio]
kernel: [ 2423.643672][ C8] vfio_group_fops_release+0x3e/0x80 [vfio]
kernel: [ 2423.643677][ C8] __fput+0xe6/0x2b0
kernel: [ 2423.643682][ C8] task_work_run+0x58/0x90
kernel: [ 2423.643688][ C8] do_exit+0x29b/0xa80
kernel: [ 2423.643694][ C8] do_group_exit+0x2c/0x80
kernel: [ 2423.643696][ C8] get_signal+0x8f9/0x900
kernel: [ 2423.643700][ C8] arch_do_signal_or_restart+0x29/0x210
kernel: [ 2423.643704][ C8] ? __schedule+0x582/0xe80
kernel: [ 2423.643708][ C8] exit_to_user_mode_loop+0x8e/0x4f0
kernel: [ 2423.643712][ C8] do_syscall_64+0x262/0x630
kernel: [ 2423.643717][ C8] entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: [ 2423.643720][ C8] RIP: 0033:0x7fde19078514
kernel: [ 2423.643722][ C8] Code: Unable to access opcode bytes at 0x7fde190784ea.
kernel: [ 2423.643723][ C8] RSP: 002b:00007ffd0e1dc7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
kernel: [ 2423.643724][ C8] RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007fde19078514
kernel: [ 2423.643726][ C8] RDX: 00007fde1916e8c0 RSI: 000055b217303260 RDI: 0000000000000000
kernel: [ 2423.643727][ C8] RBP: 00007ffd0e1dc8a0 R08: 00007fde19173500 R09: 0000000000000000
kernel: [ 2423.643728][ C8] R10: fffffffffffffbea R11: 0000000000000246 R12: 000055b1f8d8d0b0
kernel: [ 2423.643729][ C8] R13: 00007ffd0e1dc980 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2423.643731][ C8] </TASK>
kernel: [ 2424.375254][T81463] vfio-pci 0000:3f:00.0: Unable to change power state from D3cold to D0, device inaccessible
...
kernel: [ 2448.327929][ C8] watchdog: CPU8: Watchdog detected hard LOCKUP on cpu 8
kernel: [ 2448.327932][ C8] Modules linked in: vfio_pci(E) vfio_pci_core(E) vfio_iommu_type1(E) vfio(E) udp_diag(E) tcp_diag(E) inet_diag(E) binfmt_misc(E) ip_set_hash_net(E) nft_compat(E) x_tables(E) ip_set(E) msr(E) nf_tables(E) ...
kernel: [ 2448.327963][ C8] ib_core(E) hid_generic(E) usbhid(E) hid(E) ahci(E) libahci(E) xhci_pci(E) libata(E) nvme(E) xhci_hcd(E) i2c_i801(E) nvme_core(E) usbcore(E) scsi_mod(E) mlx5_core(E) i2c_smbus(E) lpc_ich(E) usb_common(E) scsi_common(E) wmi(E)
kernel: [ 2448.327972][ C8] CPU: 8 UID: 0 PID: 49880 Comm: vfio_test Kdump: loaded Tainted: G S EL 6.18.0 #5 PREEMPT(voluntary)
kernel: [ 2448.327975][ C8] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE, [L]=SOFTLOCKUP
kernel: [ 2448.327976][ C8] Hardware name: Inspur NF5468M5/YZMB-01130-105, BIOS 4.2.0 04/28/2021
kernel: [ 2448.327977][ C8] RIP: 0010:qi_submit_sync+0x6e7/0x8d0
kernel: [ 2448.327981][ C8] Code: 8b 54 24 58 49 8b 76 10 49 63 c7 48 8d 04 86 83 38 01 75 06 c7 00 03 00 00 00 41 81 c7 fe 00 00 00 44 89 f8 c1 f8 1f c1 e8 18 <41> 01 c7 45 0f b6 ff 41 29 c7 44 39 fa 75 cb 48 85 c9 0f 85 05 01
kernel: [ 2448.327983][ C8] RSP: 0018:ffffb5a3bd0a7a30 EFLAGS: 00000046
kernel: [ 2448.327984][ C8] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
kernel: [ 2448.327985][ C8] RDX: 00000000000000fe RSI: ffff9dac803a0400 RDI: ffff9ddb0081d480
kernel: [ 2448.327986][ C8] RBP: ffff9dac8037fe00 R08: 0000000000000000 R09: 0000000000000003
kernel: [ 2448.327987][ C8] R10: ffffb5a3bd0a78e0 R11: ffff9e0bbff3c068 R12: 0000000000000040
kernel: [ 2448.327988][ C8] R13: ffff9dac80314600 R14: ffff9dac8037fe00 R15: 00000000000001b3
kernel: [ 2448.327989][ C8] FS: 0000000000000000(0000) GS:ffff9ddb5a262000(0000) knlGS:0000000000000000
kernel: [ 2448.327990][ C8] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: [ 2448.327991][ C8] CR2: 000000002aee3000 CR3: 000000024a27b002 CR4: 00000000007726f0
kernel: [ 2448.327992][ C8] PKRU: 55555554
kernel: [ 2448.327993][ C8] Call Trace:
kernel: [ 2448.327995][ C8] <TASK>
kernel: [ 2448.327997][ C8] ? __pfx_domain_context_clear_one_cb+0x10/0x10
kernel: [ 2448.328000][ C8] qi_flush_dev_iotlb+0xd5/0xe0
kernel: [ 2448.328002][ C8] __context_flush_dev_iotlb.part.0+0x3c/0x80
kernel: [ 2448.328004][ C8] domain_context_clear_one_cb+0x16/0x20
kernel: [ 2448.328006][ C8] pci_for_each_dma_alias+0x3b/0x140
kernel: [ 2448.328010][ C8] device_block_translation+0x122/0x180
kernel: [ 2448.328012][ C8] blocking_domain_attach_dev+0x39/0x50
kernel: [ 2448.328014][ C8] __iommu_attach_device+0x1b/0x90
kernel: [ 2448.328017][ C8] __iommu_device_set_domain+0x5d/0xb0
kernel: [ 2448.328019][ C8] __iommu_group_set_domain_internal+0x60/0x110
kernel: [ 2448.328021][ C8] iommu_detach_group+0x3a/0x60
kernel: [ 2448.328023][ C8] vfio_iommu_type1_detach_group+0x106/0x610 [vfio_iommu_type1]
kernel: [ 2448.328026][ C8] ? __dentry_kill+0x12a/0x180
kernel: [ 2448.328030][ C8] ? __pm_runtime_idle+0x44/0xe0
kernel: [ 2448.328035][ C8] vfio_group_detach_container+0x4f/0x160 [vfio]
kernel: [ 2448.328041][ C8] vfio_group_fops_release+0x3e/0x80 [vfio]
kernel: [ 2448.328046][ C8] __fput+0xe6/0x2b0
kernel: [ 2448.328049][ C8] task_work_run+0x58/0x90
kernel: [ 2448.328053][ C8] do_exit+0x29b/0xa80
kernel: [ 2448.328057][ C8] do_group_exit+0x2c/0x80
kernel: [ 2448.328060][ C8] get_signal+0x8f9/0x900
kernel: [ 2448.328064][ C8] arch_do_signal_or_restart+0x29/0x210
kernel: [ 2448.328068][ C8] ? __schedule+0x582/0xe80
kernel: [ 2448.328070][ C8] exit_to_user_mode_loop+0x8e/0x4f0
kernel: [ 2448.328074][ C8] do_syscall_64+0x262/0x630
kernel: [ 2448.328076][ C8] entry_SYSCALL_64_after_hwframe+0x76/0x7e
kernel: [ 2448.328078][ C8] RIP: 0033:0x7fde19078514
kernel: [ 2448.328080][ C8] Code: Unable to access opcode bytes at 0x7fde190784ea.
kernel: [ 2448.328081][ C8] RSP: 002b:00007ffd0e1dc7e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000022
kernel: [ 2448.328082][ C8] RAX: fffffffffffffdfe RBX: 0000000000000000 RCX: 00007fde19078514
kernel: [ 2448.328083][ C8] RDX: 00007fde1916e8c0 RSI: 000055b217303260 RDI: 0000000000000000
kernel: [ 2448.328085][ C8] RBP: 00007ffd0e1dc8a0 R08: 00007fde19173500 R09: 0000000000000000
kernel: [ 2448.328085][ C8] R10: fffffffffffffbea R11: 0000000000000246 R12: 000055b1f8d8d0b0
kernel: [ 2448.328086][ C8] R13: 00007ffd0e1dc980 R14: 0000000000000000 R15: 0000000000000000
kernel: [ 2448.328088][ C8] </TASK>
kernel: [ 2450.245901][ C7] watchdog: BUG: soft lockup - CPU#7 stuck for 41s! [mongoosev3-agen:4727]
In any case this change cannot 100% fix the lockup. It just reduces the possibility which should be made clear.
I agree with the above, but it's better to cover more corner cases.
Best Regards,
Jinhui
On 12/22/25 19:19, Jinhui Guo wrote:
On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
From: Jinhui Guo <guojinhui.liam@bytedance.com>
Sent: Thursday, December 11, 2025 12:00 PM
Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system.
According to the commit msg it actually tries to fix the hard lockup with surprise removal. For safe removal the device is not removed before invalidation is done:
" For safe removal, device wouldn't be removed until the whole software handling process is done, it wouldn't trigger the hard lock up issue caused by too long ATS Invalidation timeout wait. "
Can you help articulate the problem, especially the part "link-down caused by faults"? What are those faults? How are they different from the said surprise removal in the commit msg, such that pci_dev_is_disconnected() is not set?
Hi Kevin, sorry for the delayed reply.
A normal or surprise removal of a PCIe device on a hot-plug port normally triggers an interrupt from the PCIe switch.
We have, however, observed cases where no interrupt is generated when the device suddenly loses its link; the behaviour is identical to setting the Link Disable bit in the switch’s Link Control register (offset 10h). Exactly what goes wrong in the LTSSM between the PCIe switch and the endpoint remains unknown.
In this scenario, the hardware has effectively vanished, yet the device driver remains bound and the IOMMU resources haven't been released. I’m just curious if this stale state could trigger issues in other places before the kernel fully realizes the device is gone? I’m not objecting to the fix. I'm just interested in whether this 'zombie' state creates risks elsewhere.
For example, if a VM fails to connect to the PCIe device,
'failed' for what reason?
"virsh destroy" is executed to release resources and isolate the fault, but a hard-lockup occurs while releasing the group fd.
Call Trace:
 qi_submit_sync
 qi_flush_dev_iotlb
 intel_pasid_tear_down_entry
 device_block_translation
 blocking_domain_attach_dev
 __iommu_attach_device
 __iommu_device_set_domain
 __iommu_group_set_domain_internal
 iommu_detach_group
 vfio_iommu_type1_detach_group
 vfio_group_detach_container
 vfio_group_fops_release
 __fput
Although pci_device_is_present() is slower than pci_dev_is_disconnected(), it still takes only ~70 µs on a ConnectX-5 (8 GT/s, x2) and becomes even faster as PCIe speed and width increase.
Besides, devtlb_invalidation_with_pasid() is called only in the paths below, which are far less frequent than memory map/unmap.
- mm-struct release
- {attach,release}_dev
- set/remove PASID
- dirty-tracking setup
surprise removal can happen at any time, e.g. after the check of pci_device_is_present(). In the end we need the logic in qi_check_fault() to check the presence upon an ITE timeout error received, to break the infinite loop. So in your case, even with that logic in place, you still observe the lockup (probably because the hardware ITE timeout is longer than the lockup detection on the CPU?).
Are you referring to the timeout added in patch https://lore.kernel.org/all/20240222090251.2849702-4-haifeng.zhao@linux.intel.com/ ?
This doesn't appear to be a deterministic solution, because ...
Our lockup-detection timeout is the default 10 s.
We see ITE-timeout messages in the kernel log. Yet the system still hard-locks, probably because, as you mentioned, the hardware ITE timeout is longer than the CPU's lockup-detection window. I'll reproduce the case and follow up with a deeper analysis.
... as you see, neither the PCI nor the VT-d specifications mandate a specific device-TLB invalidation timeout value for hardware implementations. Consequently, the ITE timeout value may exceed the CPU watchdog threshold, meaning a hard lockup will be detected before the ITE even occurs.
Thanks, baolu
On Tue, Dec 23, 2025 12:06:24 +0800, Baolu Lu wrote:
On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
From: Jinhui Guo <guojinhui.liam@bytedance.com>
Sent: Thursday, December 11, 2025 12:00 PM
Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system.
According to the commit msg it actually tries to fix the hard lockup with surprise removal. For safe removal the device is not removed before invalidation is done:
" For safe removal, device wouldn't be removed until the whole software handling process is done, it wouldn't trigger the hard lock up issue caused by too long ATS Invalidation timeout wait. "
Can you help articulate the problem, especially the part "link-down caused by faults"? What are those faults? How are they different from the said surprise removal in the commit msg, such that pci_dev_is_disconnected() is not set?
Hi Kevin, sorry for the delayed reply.
A normal or surprise removal of a PCIe device on a hot-plug port normally triggers an interrupt from the PCIe switch.
We have, however, observed cases where no interrupt is generated when the device suddenly loses its link; the behaviour is identical to setting the Link Disable bit in the switch’s Link Control register (offset 10h). Exactly what goes wrong in the LTSSM between the PCIe switch and the endpoint remains unknown.
In this scenario, the hardware has effectively vanished, yet the device driver remains bound and the IOMMU resources haven't been released. I’m just curious if this stale state could trigger issues in other places before the kernel fully realizes the device is gone? I’m not objecting to the fix. I'm just interested in whether this 'zombie' state creates risks elsewhere.
Hi, Baolu
In our scenario we see no other issues; a hard-lockup panic is triggered the moment the Mellanox Ethernet device vanishes. But we can analyze what happens when we access the Mellanox Ethernet device whose link is disabled. (If we check whether the PCIe endpoint device (Mellanox Ethernet) is present before issuing device-IOTLB invalidation to the Intel IOMMU, no other issues appear.)
According to the PCIe spec, Rev. 5.0 v1.0, Sec. 2.4.1, there are two kinds of TLPs: posted and non-posted. Non-posted TLPs require a completion TLP; posted TLPs do not.
- A Posted Request is a Memory Write Request or a Message Request.
- A Read Request is a Configuration Read Request, an I/O Read Request, or a Memory Read Request.
- An NPR (Non-Posted Request) with Data is a Configuration Write Request, an I/O Write Request, or an AtomicOp Request.
- A Non-Posted Request is a Read Request or an NPR with Data.
When the CPU issues a PCIe memory-write TLP (posted) via a MOV instruction, the instruction retires immediately after the packet reaches the Root Complex; no Data-Link ACK/NAK is required. A memory-read TLP (non-posted), however, stalls the core until the corresponding Completion TLP is received - if that Completion never arrives, the CPU hangs. (The CPU hangs if the LTSSM does not enter the Disabled state.)
However, if the LTSSM enters the Disabled state, the Root Port returns Completer-Abort (CA) for any non-posted TLP, so the request completes with status 0xFFFFFFFF without stalling.
I ran some tests on the machine after setting the Link Disable bit in the switch’s Link Control register (offset 10h).
- setpci -s 0000:3c:08.0 CAP_EXP+10.w=0x0010
+-[0000:3a]-+-00.0-[3b-3f]----00.0-[3c-3f]--+-00.0-[3d]----
|           |                               +-04.0-[3e]----
|           |                               \-08.0-[3f]----00.0  Mellanox Technologies MT27800 Family [ConnectX-5]
# lspci -vvv -s 0000:3f:00.0
3f:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
...
	Region 0: Memory at 3af804000000 (64-bit, prefetchable) [size=32M]
...
1) Issue a PCI config-space read request and it returns 0xFFFFFFFF.
# lspci -vvv -s 0000:3f:00.0
3f:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] (rev ff) (prog-if ff)
	!!! Unknown header type 7f
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core
2) Issuing a PCI memory read request through /dev/mem also returns 0xFFFFFFFF.
# ./devmem
Usage: ./devmem <phys_addr> <size> <offset> [value]
  phys_addr : physical base address of the BAR (hex or decimal)
  size      : mapping length in bytes (hex or decimal)
  offset    : register offset from BAR base (hex or decimal)
  value     : optional 32-bit value to write (hex or decimal)
Example: ./devmem 0x600000000 0x1000 0x0 0xDEADBEEF
# ./devmem 0x3af804000000 0x2000000 0x0
0x3af804000000 = 0xffffffff
Before the link was disabled, we could read 0x3af804000000 with devmem and obtain a valid result.
# ./devmem 0x3af804000000 0x2000000 0x0
0x3af804000000 = 0x10002300
Besides, after searching the kernel code, I found many EP drivers already check whether their endpoint is still present. There may be exception cases in some PCIe endpoint drivers, such as commit 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio pci device").
Best Regards,
Jinhui
+Bjorn for guidance.
quick context - previously the intel-iommu driver fixed a lockup issue in surprise removal by checking pci_dev_is_disconnected(). But Jinhui still observed the lockup in a setup where no interrupt is raised to the pci core upon surprise removal (so pci_dev_is_disconnected() is false), hence the suggestion to replace the check with pci_device_is_present() instead.

Bjorn, is it common practice to fix this directly/only in drivers, or should the pci core be notified, e.g. by simulating a late removal event? Searching the code suggests it's the former, but better to confirm with you before picking this fix...
From: Baolu Lu <baolu.lu@linux.intel.com>
Sent: Tuesday, December 23, 2025 12:06 PM
On 12/22/25 19:19, Jinhui Guo wrote:
On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
From: Jinhui Guo <guojinhui.liam@bytedance.com>
Sent: Thursday, December 11, 2025 12:00 PM
Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system.
According to the commit msg it actually tries to fix the hard lockup with surprise removal. For safe removal the device is not removed before invalidation is done:
" For safe removal, device wouldn't be removed until the whole software handling process is done, it wouldn't trigger the hard lock up issue caused by too long ATS Invalidation timeout wait. "
Can you help articulate the problem, especially the part "link-down caused by faults"? What are those faults? How are they different from the said surprise removal in the commit msg, such that pci_dev_is_disconnected() is not set?
Hi Kevin, sorry for the delayed reply.
A normal or surprise removal of a PCIe device on a hot-plug port normally triggers an interrupt from the PCIe switch.
We have, however, observed cases where no interrupt is generated when the device suddenly loses its link; the behaviour is identical to setting the Link Disable bit in the switch’s Link Control register (offset 10h). Exactly what goes wrong in the LTSSM between the PCIe switch and the endpoint remains unknown.
In this scenario, the hardware has effectively vanished, yet the device driver remains bound and the IOMMU resources haven't been released. I’m just curious if this stale state could trigger issues in other places before the kernel fully realizes the device is gone? I’m not objecting to the fix. I'm just interested in whether this 'zombie' state creates risks elsewhere.