On Tue, Dec 23, 2025 12:06:24 +0800, Baolu Lu wrote:
On Thu, Dec 18, 2025 08:04:20AM +0000, Tian, Kevin wrote:
From: Jinhui Guo <guojinhui.liam@bytedance.com>
Sent: Thursday, December 11, 2025 12:00 PM
Commit 4fc82cd907ac ("iommu/vt-d: Don't issue ATS Invalidation request when device is disconnected") relies on pci_dev_is_disconnected() to skip ATS invalidation for safely-removed devices, but it does not cover link-down caused by faults, which can still hard-lock the system.
According to the commit msg it actually tries to fix the hard lockup with surprise removal. For safe removal the device is not removed before invalidation is done:
" For safe removal, device wouldn't be removed until the whole software handling process is done, it wouldn't trigger the hard lock up issue caused by too long ATS Invalidation timeout wait. "
Can you help articulate the problem, especially the part about 'link-down caused by faults'? What are those faults? How are they different from the surprise removal described in the commit msg, such that pci_dev_is_disconnected() is not set?
Hi Kevin, sorry for the delayed reply.
A normal or surprise removal of a PCIe device on a hot-plug port normally triggers an interrupt from the PCIe switch.
We have, however, observed cases where no interrupt is generated when the device suddenly loses its link; the behaviour is identical to setting the Link Disable bit in the switch’s Link Control register (offset 10h). Exactly what goes wrong in the LTSSM between the PCIe switch and the endpoint remains unknown.
In this scenario, the hardware has effectively vanished, yet the device driver remains bound and the IOMMU resources haven't been released. I’m just curious if this stale state could trigger issues in other places before the kernel fully realizes the device is gone? I’m not objecting to the fix. I'm just interested in whether this 'zombie' state creates risks elsewhere.
Hi, Baolu
In our scenario we see no other issues; a hard-LOCKUP panic is triggered the moment the Mellanox Ethernet device vanishes. But we can analyze what happens when we access the Mellanox Ethernet device whose link is disabled. (If we check whether the PCIe endpoint device (Mellanox Ethernet) is present before issuing device-IOTLB invalidation to the Intel IOMMU, no other issues appear.)
According to the PCIe spec, Rev. 5.0 v1.0, Sec. 2.4.1, there are two kinds of TLPs: posted and non-posted. Non-posted TLPs require a completion TLP; posted TLPs do not.
- A Posted Request is a Memory Write Request or a Message Request.
- A Read Request is a Configuration Read Request, an I/O Read Request, or a Memory Read Request.
- An NPR (Non-Posted Request) with Data is a Configuration Write Request, an I/O Write Request, or an AtomicOp Request.
- A Non-Posted Request is a Read Request or an NPR with Data.
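The classification above can be written down as a small C sketch. The enum and function names are illustrative only and do not come from the kernel or the spec; the sketch just encodes which request kinds expect a Completion.

```c
#include <assert.h>
#include <stdbool.h>

/* Request kinds listed in PCIe Sec. 2.4.1 (enum names are illustrative). */
enum tlp_request {
    TLP_MEM_READ,
    TLP_MEM_WRITE,
    TLP_IO_READ,
    TLP_IO_WRITE,
    TLP_CFG_READ,
    TLP_CFG_WRITE,
    TLP_ATOMIC_OP,
    TLP_MESSAGE,
    TLP_REQUEST_COUNT
};

/* Posted Request: Memory Write or Message; no Completion is returned. */
static bool tlp_is_posted(enum tlp_request req)
{
    assert((int)req >= 0 && req < TLP_REQUEST_COUNT);
    return req == TLP_MEM_WRITE || req == TLP_MESSAGE;
}

/* Non-Posted Request: every Read Request and every NPR with Data. */
static bool tlp_needs_completion(enum tlp_request req)
{
    return !tlp_is_posted(req);
}
```

For example, tlp_needs_completion(TLP_MEM_READ) is true, which is why a CPU load from a BAR has to wait for a Completion while a store does not.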
When the CPU issues a PCIe memory-write TLP (posted) via a MOV instruction, the instruction retires immediately after the packet reaches the Root Complex; no Data-Link ACK/NAK is required. A memory-read TLP (non-posted), however, stalls the core until the corresponding Completion TLP is received - if that Completion never arrives, the CPU hangs. (The CPU hangs if the LTSSM does not enter the Disabled state.)
However, if the LTSSM enters the Disabled state, the link reports DL_Down and the Downstream Port above it completes any non-posted TLP with Unsupported Request (UR) status (PCIe spec, Sec. 2.9.1), so a read returns 0xFFFFFFFF without stalling.
I ran some tests on the machine after setting the Link Disable bit in the switch’s Link Control register (offset 10h).

# setpci -s 0000:3c:08.0 CAP_EXP+10.w=0x0010
+-[0000:3a]-+-00.0-[3b-3f]----00.0-[3c-3f]--+-00.0-[3d]----
|           |                               +-04.0-[3e]----
|           |                               \-08.0-[3f]----00.0  Mellanox Technologies MT27800 Family [ConnectX-5]
# lspci -vvv -s 0000:3f:00.0
3f:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
...
Region 0: Memory at 3af804000000 (64-bit, prefetchable) [size=32M]
...
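The setpci command above assigns 0x0010 to the whole 16-bit Link Control register, which sets Link Disable (bit 4) and clears every other bit. A minimal C sketch of what CAP_EXP+10 resolves to, and of a read-modify-write that would preserve the other bits, is shown here. It works on a copy of the config space in a buffer; the function names are hypothetical, while the register constants follow the values in the kernel's pci_regs.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_STATUS            0x06   /* Status register */
#define PCI_STATUS_CAP_LIST   0x10   /* capability list is present */
#define PCI_CAPABILITY_LIST   0x34   /* pointer to the first capability */
#define PCI_CAP_ID_EXP        0x10   /* PCI Express capability ID */
#define PCI_EXP_LNKCTL        0x10   /* Link Control, relative to the capability */
#define PCI_EXP_LNKCTL_LD     0x0010 /* Link Disable bit */

/* Return the config-space offset of the PCIe capability, or 0 if absent. */
static size_t find_pcie_cap(const uint8_t *cfg, size_t len)
{
    assert(cfg != NULL);
    if (len < 64 || !(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0;

    size_t pos = cfg[PCI_CAPABILITY_LIST] & 0xFC;
    /* The guard bounds the walk in case the list is corrupted or looped. */
    for (int guard = 0; guard < 48 && pos >= 0x40 && pos + 1 < len; guard++) {
        if (cfg[pos] == PCI_CAP_ID_EXP)
            return pos;
        pos = cfg[pos + 1] & 0xFC;
    }
    return 0;
}

/* New Link Control value with Link Disable set and all other bits kept. */
static uint16_t lnkctl_with_link_disabled(uint16_t lnkctl)
{
    return (uint16_t)(lnkctl | PCI_EXP_LNKCTL_LD);
}
```

The Link Control register then lives at find_pcie_cap(cfg, len) + PCI_EXP_LNKCTL, which is the address setpci computes for CAP_EXP+10.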
1) Issuing a PCI config-space read request returns 0xFFFFFFFF.

# lspci -vvv -s 0000:3f:00.0
3f:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: mlx5_core
Kernel modules: mlx5_core
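The "rev ff", "prog-if ff" and "header type 7f" values are what all-ones config reads look like once decoded: the Header Type byte reads 0xFF, and masking off the multi-function flag (bit 7) leaves 0x7F. A small C sketch of that decoding follows; it operates on a buffer holding the raw config bytes, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_VENDOR_ID    0x00  /* first dword: Vendor ID + Device ID */
#define CFG_HEADER_TYPE  0x0E  /* 8-bit Header Type register */

/* True if the first config dword reads as all ones (0xFFFFFFFF). */
static bool cfg_looks_absent(const uint8_t *cfg)
{
    assert(cfg != NULL);
    for (size_t i = 0; i < 4; i++) {
        if (cfg[CFG_VENDOR_ID + i] != 0xFF)
            return false;
    }
    return true;
}

/* Header layout field: bit 7 is the multi-function flag and is masked off. */
static uint8_t cfg_header_layout(const uint8_t *cfg)
{
    assert(cfg != NULL);
    return cfg[CFG_HEADER_TYPE] & 0x7F;
}
```

An all-ones Vendor ID is not a valid ID, so a check of this kind is a common way to notice that a device no longer responds, with the caveat that it only proves the read failed at that moment.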
2) Issuing a PCI memory read request through /dev/mem also returns 0xFFFFFFFF.

# ./devmem
Usage: ./devmem <phys_addr> <size> <offset> [value]
  phys_addr : physical base address of the BAR (hex or decimal)
  size      : mapping length in bytes (hex or decimal)
  offset    : register offset from BAR base (hex or decimal)
  value     : optional 32-bit value to write (hex or decimal)
Example: ./devmem 0x600000000 0x1000 0x0 0xDEADBEEF

# ./devmem 0x3af804000000 0x2000000 0x0
0x3af804000000 = 0xffffffff
Before the link was disabled, we could read 0x3af804000000 with devmem and obtain a valid result.

# ./devmem 0x3af804000000 0x2000000 0x0
0x3af804000000 = 0x10002300
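The source of the devmem tool is not part of this mail. For readers who want to repeat the read, a tool of this kind could be built around a helper roughly like the one below. This is only a sketch under assumptions: read32_mapped() is a hypothetical name, the base address must be page-aligned, and access to /dev/mem may be blocked by kernel configuration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map `size` bytes of `path` starting at the page-aligned `base` and read one
 * 32-bit value at `offset`. Returns 0 on success and -1 on failure.
 */
static int read32_mapped(const char *path, off_t base, size_t size,
                         size_t offset, uint32_t *out)
{
    assert(path != NULL && out != NULL);
    if (size < sizeof(uint32_t) || offset > size - sizeof(uint32_t) ||
        offset % sizeof(uint32_t) != 0)
        return -1;

    int fd = open(path, O_RDONLY | O_SYNC);
    if (fd < 0)
        return -1;

    void *map = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, base);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (map == MAP_FAILED)
        return -1;

    /* volatile keeps the compiler from eliding or splitting the 32-bit load */
    *out = *(volatile uint32_t *)((uint8_t *)map + offset);
    munmap(map, size);
    return 0;
}
```

With path set to "/dev/mem", base 0x3af804000000, size 0x2000000 and offset 0x0, the load targets the same BAR address as the runs above, so the CPU issues a non-posted Memory Read to the device.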
Besides, after searching the kernel code, I found that many EP drivers already check whether their endpoint is still present. Some PCIe endpoint drivers may be exceptions; see, for example, commit 43bb40c5b926 ("virtio_pci: Support surprise removal of virtio pci device").
Best Regards,
Jinhui