Commit ef0ff68351be ("driver core: Probe devices asynchronously instead of
the driver") speeds up the loading of large numbers of device drivers by
submitting asynchronous probe workers to an unbound workqueue, queueing
each worker on a CPU close to the device's NUMA node. These workers are
not scheduled on isolated CPUs because their cpumask is restricted to
housekeeping_cpumask(HK_TYPE_WQ) and housekeeping_cpumask(HK_TYPE_DOMAIN).
However, when several PCI devices reside on the same NUMA node, the probe
work for all of their drivers is funneled onto the same CPU within that
node, so the probes no longer run in parallel, because pci_call_probe()
invokes work_on_cpu(). That call, introduced by commit 873392ca514f
("PCI: work_on_cpu: use in drivers/pci/pci-driver.c"), queues a worker on
system_percpu_wq to bind the probe thread to the first CPU of the
device's NUMA node, chosen via cpumask_any_and() in pci_call_probe().
1. The function __driver_attach() submits an asynchronous worker with
callback __driver_attach_async_helper().
__driver_attach()
async_schedule_dev(__driver_attach_async_helper, dev)
async_schedule_node(func, dev, dev_to_node(dev))
async_schedule_node_domain(func, data, node, &async_dfl_domain)
__async_schedule_node_domain(func, data, node, domain, entry)
queue_work_node(node, async_wq, &entry->work)
2. The asynchronous probe worker ultimately calls work_on_cpu() in
pci_call_probe(), which re-queues the probe onto the first CPU of the
device's NUMA node, the same CPU for every device on that node.
__driver_attach_async_helper()
driver_probe_device(drv, dev)
__driver_probe_device(drv, dev)
really_probe(dev, drv)
call_driver_probe(dev, drv)
dev->bus->probe(dev)
pci_device_probe(dev)
__pci_device_probe(drv, pci_dev)
pci_call_probe(drv, pci_dev, id)
cpu = cpumask_any_and(cpumask_of_node(node), wq_domain_mask)
error = work_on_cpu(cpu, local_pci_probe, &ddi)
schedule_work_on(cpu, &wfc.work);
queue_work_on(cpu, system_percpu_wq, work)
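For reference, a condensed sketch of the surrounding logic in
pci_call_probe() (mask setup and error handling elided; see the hunk
below for the exact code):

	int node = dev_to_node(&dev->dev);
	int cpu;

	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
	    pci_physfn_is_probed(dev))
		cpu = nr_cpu_ids;
	else
		/* first CPU of the node: identical for every device there */
		cpu = cpumask_any_and(cpumask_of_node(node), wq_domain_mask);

	if (cpu < nr_cpu_ids)
		error = work_on_cpu(cpu, local_pci_probe, &ddi);
	else
		error = local_pci_probe(&ddi);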
To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
already running inside an asynchronous probe worker, which executes on an
unbound workqueue. Because a driver can be probed asynchronously either
via its probe_type or via the kernel command line, we cannot rely on
PROBE_PREFER_ASYNCHRONOUS alone. Instead, test the PF_WQ_WORKER flag in
current->flags: if it is set, pci_call_probe() is already executing inside
a workqueue worker (here, the unbound asynchronous probe worker) and can
skip the extra work_on_cpu() call.
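For background on why this test is reliable: every workqueue worker
thread sets PF_WQ_WORKER on itself when it starts (abridged sketch of
worker_thread() in kernel/workqueue.c):

	/* worker_thread(), abridged: mark this task as a wq worker */
	current->flags |= PF_WQ_WORKER;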
Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
2.4 GHz processor shows a ~35% probe-time improvement with the patch
(the slowest probe drops from 53.4 ms to 34.8 ms):
Before (all on CPU 0):
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
After (spread across CPUs 1, 2, 5):
nvme 0000:01:00.0: CPU: 5, COMM: kworker/u1025:5, probe cost: 34765890 ns
nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, probe cost: 34696433 ns
nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 33233323 ns
The improvement grows with the number of PCI devices, because without
the patch every additional probe contends for the same CPU.
Fixes: ef0ff68351be ("driver core: Probe devices asynchronously instead of the driver")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
drivers/pci/pci-driver.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 7c2d9d596258..4bc47a84d330 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -366,9 +366,11 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
/*
* Prevent nesting work_on_cpu() for the case where a Virtual Function
* device is probed from work_on_cpu() of the Physical device.
+ * Check PF_WQ_WORKER to prevent invoking work_on_cpu() in an asynchronous
+ * probe worker when the driver allows asynchronous probing.
*/
if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
- pci_physfn_is_probed(dev)) {
+ pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
cpu = nr_cpu_ids;
} else {
cpumask_var_t wq_domain_mask;
--
2.20.1
The issue occurs when gfs2_freeze_lock_shared() fails in
gfs2_fill_super(). If !sb_rdonly(sb), the quotad and logd threads have
already been started; however, the error path for
gfs2_freeze_lock_shared() jumps to fail_per_node without stopping those
threads via gfs2_destroy_threads().

Introduce a fail_threads label that stops the threads if they were
started.
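For clarity, the failing path looked roughly like this before the patch
(condensed sketch; the threads are started earlier in gfs2_fill_super(),
see commit a28dc123fa66):

	/* quotad and logd are already running when !sb_rdonly(sb) */
	error = gfs2_freeze_lock_shared(sdp);
	if (error)
		goto fail_per_node;	/* threads are never stopped */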
Reported-by: syzbot+4cb0d0336db6bc6930e9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=4cb0d0336db6bc6930e9
Fixes: a28dc123fa66 ("gfs2: init system threads before freeze lock")
Cc: stable@vger.kernel.org
Signed-off-by: Ryota Sakamoto <sakamo.ryota@gmail.com>
---
Changes in v2:
- Fix commit message style (imperative mood) as suggested by Markus Elfring.
- Add parentheses to function name in subject as suggested by Markus Elfring.
- Link to v1: https://lore.kernel.org/r/20251230-fix-use-after-free-gfs2-v1-1-ef0e46db6ec…
---
fs/gfs2/ops_fstype.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index e7a88b717991ae3647c1da039636daef7005a7f0..4b5ac1a7050f1fd34e10be4100a2bc381f49c83d 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1269,21 +1269,23 @@ static int gfs2_fill_super(struct super_block *sb, struct fs_context *fc)
error = gfs2_freeze_lock_shared(sdp);
if (error)
- goto fail_per_node;
+ goto fail_threads;
if (!sb_rdonly(sb))
error = gfs2_make_fs_rw(sdp);
if (error) {
gfs2_freeze_unlock(sdp);
- gfs2_destroy_threads(sdp);
fs_err(sdp, "can't make FS RW: %d\n", error);
- goto fail_per_node;
+ goto fail_threads;
}
gfs2_glock_dq_uninit(&mount_gh);
gfs2_online_uevent(sdp);
return 0;
+fail_threads:
+ if (!sb_rdonly(sb))
+ gfs2_destroy_threads(sdp);
fail_per_node:
init_per_node(sdp, UNDO);
fail_inodes:
---
base-commit: 7839932417dd53bb09eb5a585a7a92781dfd7cb2
change-id: 20251230-fix-use-after-free-gfs2-66cfbe23baa8
Best regards,
--
Ryota Sakamoto <sakamo.ryota@gmail.com>
Fix a bug where an empty FDA (fd array) object with 0 fds would cause an
out-of-bounds error. The previous implementation used `skip == 0` to
mean "this is a pointer fixup", but 0 is also the correct skip length
for an empty FDA. If the FDA is at the end of the buffer, then this
results in an attempt to write 8 bytes out of bounds. This is caught and
results in an EINVAL error being returned to userspace.
The pattern of using `skip == 0` as a special value originates from the
C implementation of Binder. As part of fixing this bug, this pattern is
replaced with a Rust enum.
I considered the alternative of not pushing a fixup when the length is
zero, but I think it's cleaner to just get rid of the zero-is-special
handling.
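For illustration, the old convention looked roughly like this in C terms
(field names follow the Rust struct; this is a sketch, not the actual
binder.c code):

	struct pointer_fixup_entry {
		size_t skip;		/* 0 meant "pointer fixup" */
		u64 pointer_value;	/* written only when skip == 0 */
		size_t target_offset;
	};

	/* An FDA with 0 fds also has skip == 0, so it was mistaken for
	 * an 8-byte pointer write at target_offset, which can land past
	 * the end of the buffer when the FDA is the last object. */
	size_t fixup_len = (fixup->skip == 0) ? sizeof(u64) : fixup->skip;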
The root cause of this bug was diagnosed by Gemini CLI on first try. I
used the following prompt:
> There appears to be a bug in @drivers/android/binder/thread.rs where
> the Fixups oob bug is triggered with 316 304 316 324. This implies
> that we somehow ended up with a fixup where buffer A has a pointer to
> buffer B, but the pointer is located at an index in buffer A that is
> out of bounds. Please investigate the code to find the bug. You may
> compare with @drivers/android/binder.c that implements this correctly.
Cc: stable@vger.kernel.org
Reported-by: DeepChirp <DeepChirp@outlook.com>
Closes: https://github.com/waydroid/waydroid/issues/2157
Fixes: eafedbc7c050 ("rust_binder: add Rust Binder driver")
Tested-by: DeepChirp <DeepChirp@outlook.com>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
---
drivers/android/binder/thread.rs | 59 +++++++++++++++++++++++-----------------
1 file changed, 34 insertions(+), 25 deletions(-)
diff --git a/drivers/android/binder/thread.rs b/drivers/android/binder/thread.rs
index 1a8e6fdc0dc42369ee078e720aa02b2554fb7332..dcd47e10aeb8c748d04320fbbe15ad35201684b9 100644
--- a/drivers/android/binder/thread.rs
+++ b/drivers/android/binder/thread.rs
@@ -69,17 +69,24 @@ struct ScatterGatherEntry {
}
/// This entry specifies that a fixup should happen at `target_offset` of the
-/// buffer. If `skip` is nonzero, then the fixup is a `binder_fd_array_object`
-/// and is applied later. Otherwise if `skip` is zero, then the size of the
-/// fixup is `sizeof::<u64>()` and `pointer_value` is written to the buffer.
-struct PointerFixupEntry {
- /// The number of bytes to skip, or zero for a `binder_buffer_object` fixup.
- skip: usize,
- /// The translated pointer to write when `skip` is zero.
- pointer_value: u64,
- /// The offset at which the value should be written. The offset is relative
- /// to the original buffer.
- target_offset: usize,
+/// buffer.
+enum PointerFixupEntry {
+ /// A fixup for a `binder_buffer_object`.
+ Fixup {
+ /// The translated pointer to write.
+ pointer_value: u64,
+ /// The offset at which the value should be written. The offset is relative
+ /// to the original buffer.
+ target_offset: usize,
+ },
+ /// A skip for a `binder_fd_array_object`.
+ Skip {
+ /// The number of bytes to skip.
+ skip: usize,
+ /// The offset at which the skip should happen. The offset is relative
+ /// to the original buffer.
+ target_offset: usize,
+ },
}
/// Return type of `apply_and_validate_fixup_in_parent`.
@@ -762,8 +769,7 @@ fn translate_object(
parent_entry.fixup_min_offset = info.new_min_offset;
parent_entry.pointer_fixups.push(
- PointerFixupEntry {
- skip: 0,
+ PointerFixupEntry::Fixup {
pointer_value: buffer_ptr_in_user_space,
target_offset: info.target_offset,
},
@@ -807,9 +813,8 @@ fn translate_object(
parent_entry
.pointer_fixups
.push(
- PointerFixupEntry {
+ PointerFixupEntry::Skip {
skip: fds_len,
- pointer_value: 0,
target_offset: info.target_offset,
},
GFP_KERNEL,
@@ -871,17 +876,21 @@ fn apply_sg(&self, alloc: &mut Allocation, sg_state: &mut ScatterGatherState) ->
let mut reader =
UserSlice::new(UserPtr::from_addr(sg_entry.sender_uaddr), sg_entry.length).reader();
for fixup in &mut sg_entry.pointer_fixups {
- let fixup_len = if fixup.skip == 0 {
- size_of::<u64>()
- } else {
- fixup.skip
+ let (fixup_len, fixup_offset) = match fixup {
+ PointerFixupEntry::Fixup { target_offset, .. } => {
+ (size_of::<u64>(), *target_offset)
+ }
+ PointerFixupEntry::Skip {
+ skip,
+ target_offset,
+ } => (*skip, *target_offset),
};
- let target_offset_end = fixup.target_offset.checked_add(fixup_len).ok_or(EINVAL)?;
- if fixup.target_offset < end_of_previous_fixup || offset_end < target_offset_end {
+ let target_offset_end = fixup_offset.checked_add(fixup_len).ok_or(EINVAL)?;
+ if fixup_offset < end_of_previous_fixup || offset_end < target_offset_end {
pr_warn!(
"Fixups oob {} {} {} {}",
- fixup.target_offset,
+ fixup_offset,
end_of_previous_fixup,
offset_end,
target_offset_end
@@ -890,13 +899,13 @@ fn apply_sg(&self, alloc: &mut Allocation, sg_state: &mut ScatterGatherState) ->
}
let copy_off = end_of_previous_fixup;
- let copy_len = fixup.target_offset - end_of_previous_fixup;
+ let copy_len = fixup_offset - end_of_previous_fixup;
if let Err(err) = alloc.copy_into(&mut reader, copy_off, copy_len) {
pr_warn!("Failed copying into alloc: {:?}", err);
return Err(err.into());
}
- if fixup.skip == 0 {
- let res = alloc.write::<u64>(fixup.target_offset, &fixup.pointer_value);
+ if let PointerFixupEntry::Fixup { pointer_value, .. } = fixup {
+ let res = alloc.write::<u64>(fixup_offset, pointer_value);
if let Err(err) = res {
pr_warn!("Failed copying ptr into alloc: {:?}", err);
return Err(err.into());
---
base-commit: 8f0b4cce4481fb22653697cced8d0d04027cb1e8
change-id: 20251229-fda-zero-4e46e56be58d
Best regards,
--
Alice Ryhl <aliceryhl@google.com>
Hi,
I'm reporting a sequential I/O performance regression of up to 6%,
observed with vdbench on the 6.12.y kernel.

While running performance benchmarks on the v6.12.60 kernel, the
sequential I/O vdbench metrics show a 5-6% regression compared to
v6.12.48.
Bisect root cause commit
========================
- commit b39b62075ab4 ("cpuidle: menu: Remove iowait influence")
Things work fine again when the previously removed
performance-multiplier code is added back.
Test details
============
The system is connected to a number of disks in a disk array, using
multipathing and direct I/O as configured in the vdbench profile:
wd=wd1,sd=sd*,rdpct=0,seekpct=sequential,xfersize=128k
rd=128k64T,wd=wd1,iorate=max,elapsed=600,interval=1,warmup=300,threads=64
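In vdbench terms, this is a 100% sequential write workload (rdpct=0)
with 128 KiB transfers, run at the maximum I/O rate with 64 threads for
600 seconds after a 300-second warmup.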
Thanks,
Alok
6.18-stable review patch. If anyone has any objections, please let me know.
------------------
From: Andreas Gruenbacher <agruenba@redhat.com>
[ Upstream commit 64c10ed9274bc46416f502afea48b4ae11279669 ]
When a node tries to delete an inode, it first requests exclusive access
to the iopen glock. This triggers demote requests on all remote nodes
currently holding the iopen glock. To satisfy those requests, the
remote nodes evict the inode in question, or they poke the corresponding
inode glock to signal that the inode is still in active use.
This behavior doesn't depend on whether or not a filesystem is
read-only, so remove the incorrect read-only check.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/gfs2/glops.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 0c0a80b3bacab..0c68ab4432b08 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -630,8 +630,7 @@ static void iopen_go_callback(struct gfs2_glock *gl, bool remote)
struct gfs2_inode *ip = gl->gl_object;
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
- if (!remote || sb_rdonly(sdp->sd_vfs) ||
- test_bit(SDF_KILL, &sdp->sd_flags))
+ if (!remote || test_bit(SDF_KILL, &sdp->sd_flags))
return;
if (gl->gl_demote_state == LM_ST_UNLOCKED &&
--
2.51.0