Freezing the request queue from inside sysfs store callbacks may cause a deadlock in combination with the dm-multipath driver and the queue_if_no_path option. Additionally, freezing the request queue slows down system boot on systems where sysfs attributes are set synchronously.
Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue() calls from the store callbacks that do not strictly need queue freezing. This patch may cause a small delay in applying the new settings.
This patch affects the following sysfs attributes:
- io_poll_delay
- io_timeout
- nomerges
- read_ahead_kb
- rq_affinity
Here is an example of a deadlock triggered by running test srp/002:
task:multipathd
Call Trace:
 <TASK>
 __schedule+0x8c1/0x1bf0
 schedule+0xdd/0x270
 schedule_preempt_disabled+0x1c/0x30
 __mutex_lock+0xb89/0x1650
 mutex_lock_nested+0x1f/0x30
 dm_table_set_restrictions+0x823/0xdf0
 __bind+0x166/0x590
 dm_swap_table+0x2a7/0x490
 do_resume+0x1b1/0x610
 dev_suspend+0x55/0x1a0
 ctl_ioctl+0x3a5/0x7e0
 dm_ctl_ioctl+0x12/0x20
 __x64_sys_ioctl+0x127/0x1a0
 x64_sys_call+0xe2b/0x17d0
 do_syscall_64+0x96/0x3a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>

task:(udev-worker)
Call Trace:
 <TASK>
 __schedule+0x8c1/0x1bf0
 schedule+0xdd/0x270
 blk_mq_freeze_queue_wait+0xf2/0x140
 blk_mq_freeze_queue_nomemsave+0x23/0x30
 queue_ra_store+0x14e/0x290
 queue_attr_store+0x23e/0x2c0
 sysfs_kf_write+0xde/0x140
 kernfs_fop_write_iter+0x3b2/0x630
 vfs_write+0x4fd/0x1390
 ksys_write+0xfd/0x230
 __x64_sys_write+0x76/0xc0
 x64_sys_call+0x276/0x17d0
 do_syscall_64+0x96/0x3a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Cc: Martin Wilck <mwilck@suse.com>
Cc: Benjamin Marzinski <bmarzins@redhat.com>
Cc: stable@vger.kernel.org
Fixes: af2814149883 ("block: freeze the queue in queue_attr_store")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
Changes compared to v4:
- Use WRITE_ONCE() to update bdi->ra_pages.
- Move a data_race() annotation from queue_io_timeout_store() into
  blk_queue_rq_timeout().

Changes compared to v3:
- Added two data_race() annotations.

Changes compared to v2:
- Dropped the controversial patch "block: Restrict the duration of sysfs
  attribute changes".

Changes compared to v1:
- Added patch "block: Restrict the duration of sysfs attribute changes".
- Remove queue freezing from more sysfs callbacks.
 block/blk-settings.c |  9 ++++++++-
 block/blk-sysfs.c    | 30 ++++++++++++------------------
 2 files changed, 20 insertions(+), 19 deletions(-)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 78dfef117623..b5587cc535e5 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -23,7 +23,14 @@

 void blk_queue_rq_timeout(struct request_queue *q, unsigned int timeout)
 {
-	WRITE_ONCE(q->rq_timeout, timeout);
+	/*
+	 * Use WRITE_ONCE() to write q->rq_timeout once. Use data_race() to
+	 * suppress KCSAN race reports against the write below.
+	 */
+	data_race(({
+		WRITE_ONCE(q->rq_timeout, timeout);
+		0;
+	}));
 }
 EXPORT_SYMBOL_GPL(blk_queue_rq_timeout);

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 76c47fe9b8d6..99e78d907c1c 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -143,21 +143,26 @@ queue_ra_store(struct gendisk *disk, const char *page, size_t count)
 {
 	unsigned long ra_kb;
 	ssize_t ret;
-	unsigned int memflags;
 	struct request_queue *q = disk->queue;

 	ret = queue_var_store(&ra_kb, page, count);
 	if (ret < 0)
 		return ret;
 	/*
-	 * ->ra_pages is protected by ->limits_lock because it is usually
-	 * calculated from the queue limits by queue_limits_commit_update.
+	 * The ->ra_pages change below is protected by ->limits_lock because it
+	 * is usually calculated from the queue limits by
+	 * queue_limits_commit_update().
+	 *
+	 * bdi->ra_pages reads are not serialized against bdi->ra_pages writes.
+	 * Use WRITE_ONCE() to write bdi->ra_pages once. Use data_race() to
+	 * suppress KCSAN race reports against the write below.
 	 */
 	mutex_lock(&q->limits_lock);
-	memflags = blk_mq_freeze_queue(q);
-	disk->bdi->ra_pages = ra_kb >> (PAGE_SHIFT - 10);
+	data_race(({
+		WRITE_ONCE(disk->bdi->ra_pages, ra_kb >> (PAGE_SHIFT - 10));
+		0;
+	}));
 	mutex_unlock(&q->limits_lock);
-	blk_mq_unfreeze_queue(q, memflags);

 	return ret;
 }
@@ -375,21 +380,18 @@ static ssize_t queue_nomerges_store(struct gendisk *disk, const char *page,
 				    size_t count)
 {
 	unsigned long nm;
-	unsigned int memflags;
 	struct request_queue *q = disk->queue;
 	ssize_t ret = queue_var_store(&nm, page, count);

 	if (ret < 0)
 		return ret;

-	memflags = blk_mq_freeze_queue(q);
 	blk_queue_flag_clear(QUEUE_FLAG_NOMERGES, q);
 	blk_queue_flag_clear(QUEUE_FLAG_NOXMERGES, q);
 	if (nm == 2)
 		blk_queue_flag_set(QUEUE_FLAG_NOMERGES, q);
 	else if (nm)
 		blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
-	blk_mq_unfreeze_queue(q, memflags);

 	return ret;
 }
@@ -409,7 +411,6 @@ queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count)
 #ifdef CONFIG_SMP
 	struct request_queue *q = disk->queue;
 	unsigned long val;
-	unsigned int memflags;

 	ret = queue_var_store(&val, page, count);
 	if (ret < 0)
@@ -421,7 +422,6 @@ queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count)
 	 * are accessed individually using atomic test_bit operation. So we
 	 * don't grab any lock while updating these flags.
 	 */
-	memflags = blk_mq_freeze_queue(q);
 	if (val == 2) {
 		blk_queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
 		blk_queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
@@ -432,7 +432,6 @@ queue_rq_affinity_store(struct gendisk *disk, const char *page, size_t count)
 		blk_queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
 		blk_queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
 	}
-	blk_mq_unfreeze_queue(q, memflags);
 #endif
 	return ret;
 }
@@ -446,11 +445,9 @@ static ssize_t queue_poll_delay_store(struct gendisk *disk, const char *page,
 static ssize_t queue_poll_store(struct gendisk *disk, const char *page,
 				size_t count)
 {
-	unsigned int memflags;
 	ssize_t ret = count;
 	struct request_queue *q = disk->queue;

-	memflags = blk_mq_freeze_queue(q);
 	if (!(q->limits.features & BLK_FEAT_POLL)) {
 		ret = -EINVAL;
 		goto out;
@@ -459,7 +456,6 @@ static ssize_t queue_poll_store(struct gendisk *disk, const char *page,
 	pr_info_ratelimited("writes to the poll attribute are ignored.\n");
 	pr_info_ratelimited("please use driver specific parameters instead.\n");
 out:
-	blk_mq_unfreeze_queue(q, memflags);
 	return ret;
 }

@@ -472,7 +468,7 @@ static ssize_t queue_io_timeout_show(struct gendisk *disk, char *page)
 static ssize_t queue_io_timeout_store(struct gendisk *disk, const char *page,
 				      size_t count)
 {
-	unsigned int val, memflags;
+	unsigned int val;
 	int err;
 	struct request_queue *q = disk->queue;

@@ -480,9 +476,7 @@ static ssize_t queue_io_timeout_store(struct gendisk *disk, const char *page,
 	if (err || val == 0)
 		return -EINVAL;

-	memflags = blk_mq_freeze_queue(q);
 	blk_queue_rq_timeout(q, msecs_to_jiffies(val));
-	blk_mq_unfreeze_queue(q, memflags);

 	return count;
 }
On Mon, Nov 10, 2025 at 08:24:18AM -0800, Bart Van Assche wrote:
Freezing the request queue from inside sysfs store callbacks may cause a deadlock in combination with the dm-multipath driver and the queue_if_no_path option. Additionally, freezing the request queue slows down system boot on systems where sysfs attributes are set synchronously.
Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue() calls from the store callbacks that do not strictly need queue freezing. This patch may cause a small delay in applying the new settings.
This patch affects the following sysfs attributes:
- io_poll_delay
- io_timeout
- nomerges
- read_ahead_kb
- rq_affinity
I'd suggest adding a few words about why the freeze isn't needed, such as:
```
An intermediate value isn't possible, and either the old or the new value
is fine for the I/O fast path.
```
Otherwise this patch looks good to me.
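For reference, here is roughly why no intermediate value is possible for the flag-based attributes. The sketch below quotes blk_queue_flag_set()/blk_queue_flag_clear() from block/blk-core.c from memory, so treat it as illustrative rather than authoritative:

```
/*
 * Both helpers are thin wrappers around atomic bitops, so a concurrent
 * test_bit() reader observes each flag either fully set or fully
 * cleared, never a torn intermediate value.
 */
void blk_queue_flag_set(unsigned int flag, struct request_queue *q)
{
	set_bit(flag, &q->queue_flags);
}

void blk_queue_flag_clear(unsigned int flag, struct request_queue *q)
{
	clear_bit(flag, &q->queue_flags);
}
```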
Thanks,
Ming
On 11/10/25 9:54 PM, Bart Van Assche wrote:
Freezing the request queue from inside sysfs store callbacks may cause a deadlock in combination with the dm-multipath driver and the queue_if_no_path option. Additionally, freezing the request queue slows down system boot on systems where sysfs attributes are set synchronously.
Fix this by removing the blk_mq_freeze_queue() / blk_mq_unfreeze_queue() calls from the store callbacks that do not strictly need queue freezing. This patch may cause a small delay in applying the new settings.
This patch affects the following sysfs attributes:
- io_poll_delay
- io_timeout
- nomerges
- read_ahead_kb
- rq_affinity
I applied your patch to my Linux tree and ran some tests. As I suspected earlier, KCSAN reported the following races:
1. Running a q->rq_timeout update concurrently with I/O:
BUG: KCSAN: data-race in blk_add_timer+0x64/0x1cc
race at unknown origin, with read to 0xc000000142575fa8 of 4 bytes by task 5528 on cpu 49:
 blk_add_timer+0x64/0x1cc
 blk_mq_start_request+0xd0/0x534
 nvme_prep_rq.part.0+0x55c/0x1940 [nvme]
 nvme_queue_rqs+0x1e0/0x4c4 [nvme]
 blk_mq_dispatch_queue_requests+0x4f0/0x8c0
 blk_mq_flush_plug_list+0xb4/0x3cc
 __blk_flush_plug+0x294/0x358
 __submit_bio+0x308/0x3a4
 submit_bio_noacct_nocheck+0x5e4/0x92c
 submit_bio_noacct+0x3a4/0xec8
 submit_bio+0x70/0x2f0
 submit_bio_wait+0xc8/0x180
 __blkdev_direct_IO_simple+0x254/0x4a8
 blkdev_direct_IO+0x6d4/0x1000
 blkdev_write_iter+0x50c/0x658
 vfs_write+0x678/0x874
 sys_pwrite64+0x130/0x16c
 system_call_exception+0x1a0/0x500
 system_call_vectored_common+0x15c/0x2ec
value changed: 0x000005dc -> 0x00000bb8
Reported by Kernel Concurrency Sanitizer on:
CPU: 49 UID: 0 PID: 5528 Comm: fio Kdump: loaded Not tainted 6.18.0-rc4+ #65 VOLUNTARY
Hardware name: IBM,9105-22A Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RB1120_115) hv:phyp pSeries
2. Updating ->ra_pages while it is being read concurrently:
BUG: KCSAN: data-race in queue_ra_show / read_ahead_kb_store
write to 0xc00000000c107030 of 8 bytes by task 97376 on cpu 19:
 read_ahead_kb_store+0x84/0xcc
 dev_attr_store+0x70/0x9c
 sysfs_kf_write+0xbc/0x110
 kernfs_fop_write_iter+0x284/0x3b4
 vfs_write+0x678/0x874
 ksys_write+0xb0/0x1ac
 sys_write+0x68/0x84
 system_call_exception+0x1a0/0x500
 system_call_vectored_common+0x15c/0x2ec
read to 0xc00000000c107030 of 8 bytes by task 167534 on cpu 33:
 queue_ra_show+0x70/0xd4
 queue_attr_show+0x90/0x170
 sysfs_kf_seq_show+0x144/0x28c
 kernfs_seq_show+0xbc/0xe0
 seq_read_iter+0x384/0xb4c
 kernfs_fop_read_iter+0x308/0x418
 vfs_read+0x420/0x5ac
 ksys_read+0xb0/0x1b0
 sys_read+0x68/0x84
 system_call_exception+0x1a0/0x500
 system_call_vectored_common+0x15c/0x2ec
value changed: 0x0000000000000004 -> 0x0000000000000010
Reported by Kernel Concurrency Sanitizer on:
CPU: 33 UID: 0 PID: 167534 Comm: cat Kdump: loaded Not tainted 6.18.0-rc4+ #65 VOLUNTARY
Hardware name: IBM,9105-22A Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RB1120_115) hv:phyp pSeries
So from the above traces it seems clear that we need to mark both the writers and the readers to avoid these races.
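For example, here is a minimal sketch of the reader-side marking I have in mind, based on queue_ra_show() in block/blk-sysfs.c (simplified; the sysfs_emit() form is my shorthand for what that function does):

```
/*
 * Pairing a READ_ONCE() here with the WRITE_ONCE() in queue_ra_store()
 * marks both sides of the race as intentional for KCSAN.
 */
static ssize_t queue_ra_show(struct gendisk *disk, char *page)
{
	unsigned long ra_kb = READ_ONCE(disk->bdi->ra_pages) <<
			      (PAGE_SHIFT - 10);

	return sysfs_emit(page, "%lu\n", ra_kb);
}
```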
Thanks,
--Nilay
On 11/10/25 10:25 PM, Nilay Shroff wrote:
I applied your patch to my Linux tree and ran some tests. As I suspected earlier, KCSAN reported the following races:
[ ... ]
Thank you for running these tests. Unfortunately I could not trigger these KCSAN complaints in my own tests, even with KCSAN enabled in the kernel configuration.
So from the above traces it seems clear that we need to mark both the writers and the readers to avoid these races.
That would be an intrusive change. I don't think the kernel maintainers would agree with marking all rq_timeout and ra_pages reads with READ_ONCE(). Instead, I propose annotating the rq_timeout and ra_pages data members with __data_racy to suppress these KCSAN reports.
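Something like the following sketch (untested; member declarations and locations quoted from memory):

```
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@
-	unsigned int		rq_timeout;
+	unsigned int __data_racy rq_timeout;
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@
-	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
+	unsigned long __data_racy ra_pages;	/* max readahead in PAGE_SIZE units */
```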
Thanks,
Bart.
On 11/12/25 1:22 AM, Bart Van Assche wrote:
On 11/10/25 10:25 PM, Nilay Shroff wrote:
I applied your patch to my Linux tree and ran some tests. As I suspected earlier, KCSAN reported the following races:
[ ... ]
Thank you for running these tests. Unfortunately I could not trigger these KCSAN complaints in my own tests, even with KCSAN enabled in the kernel configuration.
So from the above traces it seems clear that we need to mark both the writers and the readers to avoid these races.
That would be an intrusive change. I don't think the kernel maintainers would agree with marking all rq_timeout and ra_pages reads with READ_ONCE(). Instead, I propose annotating the rq_timeout and ra_pages data members with __data_racy to suppress these KCSAN reports.
Yes, that should also work. I validated the use of __data_racy on my test kernel.
However, while compiling the kernel with __data_racy applied to q->rq_timeout, I encountered a build failure. After some investigation, I found that the failure occurred because my kernel configuration had CONFIG_DEBUG_INFO_BTF enabled: the build failed while generating BTF types from the vmlinux.unstripped binary.
My guess is that some compilation units have KCSAN disabled, in which case the preprocessor expands __data_racy to nothing, while in units where KCSAN is enabled, __data_racy expands to the volatile qualifier. As a result, the BTF resolver encountered two versions of struct request_queue: one in which q->rq_timeout was declared with the volatile keyword and one in which it was not. This type mismatch caused the resolver to fail during the BTF extraction phase.
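For reference, __data_racy is defined in include/linux/compiler_types.h roughly as follows (quoting from memory), which matches the two expansions described above:

```
/*
 * Expands to `volatile` only when the compilation unit is built with
 * KCSAN instrumentation; files opted out via KCSAN_SANITIZE := n
 * therefore see a different type for the annotated member.
 */
#ifdef __SANITIZE_THREAD__
# define __data_racy volatile
#else
# define __data_racy
#endif
```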
Yes, this is not related to the block layer and has to be fixed by the KCSAN/eBPF developers.
Thanks,
--Nilay