From: Hans de Goede hdegoede@redhat.com
commit 085a9f43433f30cbe8a1ade62d9d7827c3217f4d upstream.
Use down_read_nested() and down_write_nested() when taking the ctrl->reset_lock rw-sem, passing the number of PCIe hotplug controllers in the path to the PCI root bus as lock subclass parameter.
This fixes the following false-positive lockdep report when unplugging a Lenovo X1C8 from a Lenovo 2nd gen TB3 dock:
pcieport 0000:06:01.0: pciehp: Slot(1): Link Down pcieport 0000:06:01.0: pciehp: Slot(1): Card not present ============================================ WARNING: possible recursive locking detected 5.16.0-rc2+ #621 Not tainted -------------------------------------------- irq/124-pciehp/86 is trying to acquire lock: ffff8e5ac4299ef8 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_check_presence+0x23/0x80
but task is already holding lock: ffff8e5ac4298af8 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_ist+0xf3/0x180
other info that might help us debug this: Possible unsafe locking scenario:
CPU0 ---- lock(&ctrl->reset_lock); lock(&ctrl->reset_lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by irq/124-pciehp/86: #0: ffff8e5ac4298af8 (&ctrl->reset_lock){.+.+}-{3:3}, at: pciehp_ist+0xf3/0x180 #1: ffffffffa3b024e8 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pciehp_unconfigure_device+0x31/0x110 #2: ffff8e5ac1ee2248 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x1c/0x40
stack backtrace: CPU: 4 PID: 86 Comm: irq/124-pciehp Not tainted 5.16.0-rc2+ #621 Hardware name: LENOVO 20U90SIT19/20U90SIT19, BIOS N2WET30W (1.20 ) 08/26/2021 Call Trace: <TASK> dump_stack_lvl+0x59/0x73 __lock_acquire.cold+0xc5/0x2c6 lock_acquire+0xb5/0x2b0 down_read+0x3e/0x50 pciehp_check_presence+0x23/0x80 pciehp_runtime_resume+0x5c/0xa0 device_for_each_child+0x45/0x70 pcie_port_device_runtime_resume+0x20/0x30 pci_pm_runtime_resume+0xa7/0xc0 __rpm_callback+0x41/0x110 rpm_callback+0x59/0x70 rpm_resume+0x512/0x7b0 __pm_runtime_resume+0x4a/0x90 __device_release_driver+0x28/0x240 device_release_driver+0x26/0x40 pci_stop_bus_device+0x68/0x90 pci_stop_bus_device+0x2c/0x90 pci_stop_and_remove_bus_device+0xe/0x20 pciehp_unconfigure_device+0x6c/0x110 pciehp_disable_slot+0x5b/0xe0 pciehp_handle_presence_or_link_change+0xc3/0x2f0 pciehp_ist+0x179/0x180
This lockdep warning is triggered because with Thunderbolt, hotplug ports are nested. When removing multiple devices in a daisy-chain, each hotplug port's reset_lock may be acquired recursively. It's never the same lock, so the lockdep splat is a false positive.
Because locks at the same hierarchy level are never acquired recursively, a per-level lockdep class is sufficient to fix the lockdep warning.
The choice to use one lockdep subclass per pcie-hotplug controller in the path to the root-bus was made to conserve class keys because their number is limited and the complexity grows quadratically with number of keys according to Documentation/locking/lockdep-design.rst.
Link: https://lore.kernel.org/linux-pci/20190402021933.GA2966@mit.edu/ Link: https://lore.kernel.org/linux-pci/de684a28-9038-8fc6-27ca-3f6f2f6400d7@redha... Link: https://lore.kernel.org/r/20211217141709.379663-1-hdegoede@redhat.com Link: https://bugzilla.kernel.org/show_bug.cgi?id=208855 Reported-by: "Theodore Ts'o" tytso@mit.edu Signed-off-by: Hans de Goede hdegoede@redhat.com Signed-off-by: Bjorn Helgaas bhelgaas@google.com Reviewed-by: Lukas Wunner lukas@wunner.de Cc: stable@vger.kernel.org [lukas: backport to v5.4-stable] Signed-off-by: Lukas Wunner lukas@wunner.de --- drivers/pci/hotplug/pciehp.h | 3 +++ drivers/pci/hotplug/pciehp_core.c | 2 +- drivers/pci/hotplug/pciehp_hpc.c | 19 +++++++++++++++++-- 3 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/hotplug/pciehp.h b/drivers/pci/hotplug/pciehp.h index aa61d4c219d7..79a713f5dbf4 100644 --- a/drivers/pci/hotplug/pciehp.h +++ b/drivers/pci/hotplug/pciehp.h @@ -72,6 +72,8 @@ extern int pciehp_poll_time; * @reset_lock: prevents access to the Data Link Layer Link Active bit in the * Link Status register and to the Presence Detect State bit in the Slot * Status register during a slot reset which may cause them to flap + * @depth: Number of additional hotplug ports in the path to the root bus, + * used as lock subclass for @reset_lock * @ist_running: flag to keep user request waiting while IRQ thread is running * @request_result: result of last user request submitted to the IRQ thread * @requester: wait queue to wake up on completion of user request, @@ -102,6 +104,7 @@ struct controller {
struct hotplug_slot hotplug_slot; /* hotplug core interface */ struct rw_semaphore reset_lock; + unsigned int depth; unsigned int ist_running; int request_result; wait_queue_head_t requester; diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c index 312cc45c44c7..f5255045b149 100644 --- a/drivers/pci/hotplug/pciehp_core.c +++ b/drivers/pci/hotplug/pciehp_core.c @@ -165,7 +165,7 @@ static void pciehp_check_presence(struct controller *ctrl) { int occupied;
- down_read(&ctrl->reset_lock); + down_read_nested(&ctrl->reset_lock, ctrl->depth); mutex_lock(&ctrl->state_lock);
occupied = pciehp_card_present_or_link_active(ctrl); diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 13f3bc239c66..651664fe4058 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -674,7 +674,7 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) * Disable requests have higher priority than Presence Detect Changed * or Data Link Layer State Changed events. */ - down_read(&ctrl->reset_lock); + down_read_nested(&ctrl->reset_lock, ctrl->depth); if (events & DISABLE_SLOT) pciehp_handle_disable_request(ctrl); else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) @@ -808,7 +808,7 @@ int pciehp_reset_slot(struct hotplug_slot *hotplug_slot, int probe) if (probe) return 0;
- down_write(&ctrl->reset_lock); + down_write_nested(&ctrl->reset_lock, ctrl->depth);
if (!ATTN_BUTTN(ctrl)) { ctrl_mask |= PCI_EXP_SLTCTL_PDCE; @@ -864,6 +864,20 @@ static inline void dbg_ctrl(struct controller *ctrl)
#define FLAG(x, y) (((x) & (y)) ? '+' : '-')
+static inline int pcie_hotplug_depth(struct pci_dev *dev) +{ + struct pci_bus *bus = dev->bus; + int depth = 0; + + while (bus->parent) { + bus = bus->parent; + if (bus->self && bus->self->is_hotplug_bridge) + depth++; + } + + return depth; +} + struct controller *pcie_init(struct pcie_device *dev) { struct controller *ctrl; @@ -877,6 +891,7 @@ struct controller *pcie_init(struct pcie_device *dev) return NULL;
ctrl->pcie = dev; + ctrl->depth = pcie_hotplug_depth(dev->port); pcie_capability_read_dword(pdev, PCI_EXP_SLTCAP, &slot_cap);
if (pdev->hotplug_user_indicators)
commit f5eff5591b8f9c5effd25c92c758a127765f74c1 upstream.
In 2013, commits
2e35afaefe64 ("PCI: pciehp: Add reset_slot() method") 608c388122c7 ("PCI: Add slot reset option to pci_dev_reset()")
amended PCIe hotplug to mask Presence Detect Changed events during a Secondary Bus Reset. The reset thus no longer causes gratuitous slot bringdown and bringup.
However the commits neglected to serialize reset with code paths reading slot registers. For instance, a slot bringup due to an earlier hotplug event may see the Presence Detect State bit cleared during a concurrent Secondary Bus Reset.
In 2018, commit
5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset")
retrofitted the missing locking. It introduced a reset_lock which serializes a Secondary Bus Reset with other parts of pciehp.
Unfortunately the locking turns out to be overzealous: reset_lock is held for the entire enumeration and de-enumeration of hotplugged devices, including driver binding and unbinding.
Driver binding and unbinding acquires device_lock while the reset_lock of the ancestral hotplug port is held. A concurrent Secondary Bus Reset acquires the ancestral reset_lock while already holding the device_lock. The asymmetric locking order in the two code paths can lead to AB-BA deadlocks.
Michael Haeuptle reports such deadlocks on simultaneous hot-removal and vfio release (the latter implies a Secondary Bus Reset):
pciehp_ist() # down_read(reset_lock) pciehp_handle_presence_or_link_change() pciehp_disable_slot() __pciehp_disable_slot() remove_board() pciehp_unconfigure_device() pci_stop_and_remove_bus_device() pci_stop_bus_device() pci_stop_dev() device_release_driver() device_release_driver_internal() __device_driver_lock() # device_lock()
SYS_munmap() vfio_device_fops_release() vfio_device_group_close() vfio_device_close() vfio_device_last_close() vfio_pci_core_close_device() vfio_pci_core_disable() # device_lock() __pci_reset_function_locked() pci_reset_bus_function() pci_dev_reset_slot_function() pci_reset_hotplug_slot() pciehp_reset_slot() # down_write(reset_lock)
Ian May reports the same deadlock on simultaneous hot-removal and an AER-induced Secondary Bus Reset:
aer_recover_work_func() pcie_do_recovery() aer_root_reset() pci_bus_error_reset() pci_slot_reset() pci_slot_lock() # device_lock() pci_reset_hotplug_slot() pciehp_reset_slot() # down_write(reset_lock)
Fix by releasing the reset_lock during driver binding and unbinding, thereby splitting and shrinking the critical section.
Driver binding and unbinding is protected by the device_lock() and thus serialized with a Secondary Bus Reset. There's no need to additionally protect it with the reset_lock. However, pciehp does not bind and unbind devices directly, but rather invokes PCI core functions which also perform certain enumeration and de-enumeration steps.
The reset_lock's purpose is to protect slot registers, not enumeration and de-enumeration of hotplugged devices. That would arguably be the job of the PCI core, not the PCIe hotplug driver. After all, an AER-induced Secondary Bus Reset may as well happen during boot-time enumeration of the PCI hierarchy and there's no locking to prevent that either.
Exempting *de-enumeration* from the reset_lock is relatively harmless: A concurrent Secondary Bus Reset may foil config space accesses such as PME interrupt disablement. But if the device is physically gone, those accesses are pointless anyway. If the device is physically present and only logically removed through an Attention Button press or the sysfs "power" attribute, PME interrupts as well as DMA cannot come through because pciehp_unconfigure_device() disables INTx and Bus Master bits. That's still protected by the reset_lock in the present commit.
Exempting *enumeration* from the reset_lock also has limited impact: The exempted call to pci_bus_add_device() may perform device accesses through pcibios_bus_add_device() and pci_fixup_device() which are now no longer protected from a concurrent Secondary Bus Reset. Otherwise there should be no impact.
In essence, the present commit seeks to fix the AB-BA deadlocks while still retaining a best-effort reset protection for enumeration and de-enumeration of hotplugged devices -- until a general solution is implemented in the PCI core.
Link: https://lore.kernel.org/linux-pci/CS1PR8401MB0728FC6FDAB8A35C22BD90EC95F10@C... Link: https://lore.kernel.org/linux-pci/20200615143250.438252-1-ian.may@canonical.... Link: https://lore.kernel.org/linux-pci/ce878dab-c0c4-5bd0-a725-9805a075682d@amd.c... Link: https://lore.kernel.org/linux-pci/ed831249-384a-6d35-0831-70af191e9bce@huawe... Link: https://bugzilla.kernel.org/show_bug.cgi?id=215590 Fixes: 5b3f7b7d062b ("PCI: pciehp: Avoid slot access during reset") Link: https://lore.kernel.org/r/fef2b2e9edf245c049a8c5b94743c0f74ff5008a.168119190... Reported-by: Michael Haeuptle michael.haeuptle@hpe.com Reported-by: Ian May ian.may@canonical.com Reported-by: Andrey Grodzovsky andrey2805@gmail.com Reported-by: Rahul Kumar rahul.kumar1@amd.com Reported-by: Jialin Zhang zhangjialin11@huawei.com Tested-by: Anatoli Antonovitch Anatoli.Antonovitch@amd.com Signed-off-by: Lukas Wunner lukas@wunner.de Signed-off-by: Bjorn Helgaas bhelgaas@google.com Cc: stable@vger.kernel.org # v4.19+ Cc: Dan Stein dstein@hpe.com Cc: Ashok Raj ashok.raj@intel.com Cc: Alex Michon amichon@kalrayinc.com Cc: Xiongfeng Wang wangxiongfeng2@huawei.com Cc: Alex Williamson alex.williamson@redhat.com Cc: Mika Westerberg mika.westerberg@linux.intel.com Cc: Sathyanarayanan Kuppuswamy sathyanarayanan.kuppuswamy@linux.intel.com --- drivers/pci/hotplug/pciehp_pci.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+)
diff --git a/drivers/pci/hotplug/pciehp_pci.c b/drivers/pci/hotplug/pciehp_pci.c index d17f3bf36f70..ad12515a4a12 100644 --- a/drivers/pci/hotplug/pciehp_pci.c +++ b/drivers/pci/hotplug/pciehp_pci.c @@ -63,7 +63,14 @@ int pciehp_configure_device(struct controller *ctrl)
pci_assign_unassigned_bridge_resources(bridge); pcie_bus_configure_settings(parent); + + /* + * Release reset_lock during driver binding + * to avoid AB-BA deadlock with device_lock. + */ + up_read(&ctrl->reset_lock); pci_bus_add_devices(parent); + down_read_nested(&ctrl->reset_lock, ctrl->depth);
out: pci_unlock_rescan_remove(); @@ -104,7 +111,15 @@ void pciehp_unconfigure_device(struct controller *ctrl, bool presence) list_for_each_entry_safe_reverse(dev, temp, &parent->devices, bus_list) { pci_dev_get(dev); + + /* + * Release reset_lock during driver unbinding + * to avoid AB-BA deadlock with device_lock. + */ + up_read(&ctrl->reset_lock); pci_stop_and_remove_bus_device(dev); + down_read_nested(&ctrl->reset_lock, ctrl->depth); + /* * Ensure that no new Requests will be generated from * the device.
On Tue, May 09, 2023 at 12:41:09PM +0200, Lukas Wunner wrote:
From: Hans de Goede hdegoede@redhat.com
commit 085a9f43433f30cbe8a1ade62d9d7827c3217f4d upstream.
All now queued up, thanks.
greg k-h
linux-stable-mirror@lists.linaro.org