Commit ef0ff68351be ("driver core: Probe devices asynchronously instead of the driver") speeds up the loading of large numbers of device drivers by submitting asynchronous probe workers to an unbound workqueue and binding each worker to a CPU near the device’s NUMA node. These workers are not scheduled on isolated CPUs because their cpumask is restricted to housekeeping_cpumask(HK_TYPE_WQ) and housekeeping_cpumask(HK_TYPE_DOMAIN).
However, when PCI devices reside on the same NUMA node, all their drivers’ probe workers are bound to the same CPU within that node, yet the probes still run in parallel because pci_call_probe() invokes work_on_cpu(). That call was introduced by commit 873392ca514f ("PCI: work_on_cpu: use in drivers/pci/pci-driver.c"). work_on_cpu() queues a work item on system_percpu_wq so that the probe runs on the first CPU in the device’s NUMA node (chosen via cpumask_any_and() in pci_call_probe()).
1. The function __driver_attach() submits an asynchronous worker with callback __driver_attach_async_helper().
   __driver_attach()
    async_schedule_dev(__driver_attach_async_helper, dev)
     async_schedule_node(func, dev, dev_to_node(dev))
      async_schedule_node_domain(func, data, node, &async_dfl_domain)
       __async_schedule_node_domain(func, data, node, domain, entry)
        queue_work_node(node, async_wq, &entry->work)
2. The asynchronous probe worker ultimately calls work_on_cpu() in pci_call_probe(), binding the worker to the same CPU within the device’s NUMA node.
   __driver_attach_async_helper()
    driver_probe_device(drv, dev)
     __driver_probe_device(drv, dev)
      really_probe(dev, drv)
       call_driver_probe(dev, drv)
        dev->bus->probe(dev)
         pci_device_probe(dev)
          __pci_device_probe(drv, pci_dev)
           pci_call_probe(drv, pci_dev, id)
            cpu = cpumask_any_and(cpumask_of_node(node), wq_domain_mask)
            error = work_on_cpu(cpu, local_pci_probe, &ddi)
             schedule_work_on(cpu, &wfc.work);
              queue_work_on(cpu, system_percpu_wq, work)
To fix the issue, pci_call_probe() must not call work_on_cpu() when it is already running inside an asynchronous probe worker on the unbound workqueue. Because a driver can be probed asynchronously either by probe_type or by the kernel command line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is already executing in a workqueue worker, as is the case for the asynchronous probe worker, and should skip the extra work_on_cpu() call.
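The decision described above can be modelled in plain userspace C. This is only an illustrative sketch, not kernel code: struct probe_ctx, probe_runs_locally(), MODEL_PF_WQ_WORKER, and MODEL_MAX_NUMNODES are hypothetical stand-ins for the inputs and constants that pci_call_probe() consults, and the sketch assumes that setting cpu = nr_cpu_ids makes the probe run in the calling context instead of going through work_on_cpu().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for kernel definitions; the values are only for this model. */
#define MODEL_PF_WQ_WORKER 0x00000020u /* models PF_WQ_WORKER in current->flags */
#define MODEL_MAX_NUMNODES 64          /* hypothetical node limit */

/* Hypothetical snapshot of what pci_call_probe() inspects. */
struct probe_ctx {
	int node;                /* models dev_to_node(&dev->dev) */
	bool node_online;        /* models node_online(node) */
	bool physfn_is_probed;   /* models pci_physfn_is_probed(dev) */
	unsigned int task_flags; /* models current->flags */
};

/*
 * Returns true when the patched condition would select cpu = nr_cpu_ids,
 * i.e. the probe is assumed to run in the caller's context rather than
 * being handed to work_on_cpu().
 */
static bool probe_runs_locally(const struct probe_ctx *ctx)
{
	assert(ctx != NULL);

	return ctx->node < 0 || ctx->node >= MODEL_MAX_NUMNODES ||
	       !ctx->node_online || ctx->physfn_is_probed ||
	       (ctx->task_flags & MODEL_PF_WQ_WORKER) != 0;
}
```

In this model, a context with a valid online node and no worker flag still takes the work_on_cpu() path, while the same context with MODEL_PF_WQ_WORKER set probes locally in the worker that is already running.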
Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64 2.4 GHz processor shows a 35 % probe-time improvement with the patch:
Before (all on CPU 0):
  nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
  nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
  nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
After (spread across CPUs 1, 2, 5):
  nvme 0000:01:00.0: CPU: 5, COMM: kworker/u1025:5, probe cost: 34765890 ns
  nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, probe cost: 34696433 ns
  nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 33233323 ns
The improvement grows with more PCI devices because fewer probes contend for the same CPU.
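As a rough check on the measurements above, the per-device reduction in probe cost can be computed with a small helper. This is a sketch for the arithmetic only; reduction_permille() is a hypothetical name and is not part of the patch. On the logged values, the slowest probe (0000:01:00.0) improves by roughly 35 %, and the other two devices by roughly 30 %.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reduction in probe cost, in parts per thousand, truncated toward zero.
 * Both arguments are in nanoseconds, as in the probe cost log lines.
 */
static uint64_t reduction_permille(uint64_t before_ns, uint64_t after_ns)
{
	/* The model only covers improvements over a non-zero baseline. */
	assert(before_ns > 0 && after_ns <= before_ns);

	return (before_ns - after_ns) * 1000u / before_ns;
}
```

For example, reduction_permille(53372612, 34765890) covers the first device in the logs.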
Fixes: ef0ff68351be ("driver core: Probe devices asynchronously instead of the driver")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
 drivers/pci/pci-driver.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 7c2d9d596258..4bc47a84d330 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -366,9 +366,11 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	/*
 	 * Prevent nesting work_on_cpu() for the case where a Virtual Function
 	 * device is probed from work_on_cpu() of the Physical device.
+	 * Check PF_WQ_WORKER to prevent invoking work_on_cpu() in an asynchronous
+	 * probe worker when the driver allows asynchronous probing.
 	 */
 	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
-	    pci_physfn_is_probed(dev)) {
+	    pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
 		cpu = nr_cpu_ids;
 	} else {
 		cpumask_var_t wq_domain_mask;