Every time I run test srp/002 the following deadlock is triggered:
task:multipathd
Call Trace:
 <TASK>
 __schedule+0x8c1/0x1bf0
 schedule+0xdd/0x270
 schedule_preempt_disabled+0x1c/0x30
 __mutex_lock+0xb89/0x1650
 mutex_lock_nested+0x1f/0x30
 dm_table_set_restrictions+0x823/0xdf0
 __bind+0x166/0x590
 dm_swap_table+0x2a7/0x490
 do_resume+0x1b1/0x610
 dev_suspend+0x55/0x1a0
 ctl_ioctl+0x3a5/0x7e0
 dm_ctl_ioctl+0x12/0x20
 __x64_sys_ioctl+0x127/0x1a0
 x64_sys_call+0xe2b/0x17d0
 do_syscall_64+0x96/0x3a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>

task:(udev-worker)
Call Trace:
 <TASK>
 __schedule+0x8c1/0x1bf0
 schedule+0xdd/0x270
 blk_mq_freeze_queue_wait+0xf2/0x140
 blk_mq_freeze_queue_nomemsave+0x23/0x30
 queue_ra_store+0x14e/0x290
 queue_attr_store+0x23e/0x2c0
 sysfs_kf_write+0xde/0x140
 kernfs_fop_write_iter+0x3b2/0x630
 vfs_write+0x4fd/0x1390
 ksys_write+0xfd/0x230
 __x64_sys_write+0x76/0xc0
 x64_sys_call+0x276/0x17d0
 do_syscall_64+0x96/0x3a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>
Fix this by removing the superfluous queue freezing/unfreezing code from queue_ra_store().
Cc: Nilay Shroff <nilay@linux.ibm.com>
Cc: stable@vger.kernel.org
Fixes: b07a889e8335 ("block: move q->sysfs_lock and queue-freeze under show/store method")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-sysfs.c | 3 ---
 1 file changed, 3 deletions(-)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index b2b9b89d6967..1f63b184c6e9 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -105,7 +105,6 @@ queue_ra_store(struct gendisk *disk, const char *page, size_t count)
 {
 	unsigned long ra_kb;
 	ssize_t ret;
-	unsigned int memflags;
 	struct request_queue *q = disk->queue;
 
 	ret = queue_var_store(&ra_kb, page, count);
@@ -116,10 +115,8 @@ queue_ra_store(struct gendisk *disk, const char *page, size_t count)
 	 * calculated from the queue limits by queue_limits_commit_update.
 	 */
 	mutex_lock(&q->limits_lock);
-	memflags = blk_mq_freeze_queue(q);
 	disk->bdi->ra_pages = ra_kb >> (PAGE_SHIFT - 10);
 	mutex_unlock(&q->limits_lock);
-	blk_mq_unfreeze_queue(q, memflags);
 
 	return ret;
 }
On 6/25/25 10:15 PM, Christoph Hellwig wrote:
On Wed, Jun 25, 2025 at 12:54:50PM -0700, Bart Van Assche wrote:
Fix this by removing the superfluous queue freezing/unfreezing code from queue_ra_store().
You'll need to explain why it is useless here.
If nobody objects I will add the following text to the patch description:
"Freezing the request queue from inside a block layer sysfs store callback function is essential when modifying parameters that affect how bios or requests are processed, e.g. parameters that affect bio_split_to_limit(). Freezing the request queue when modifying parameters that do not affect bio nor request processing is not necessary."
Thanks,
Bart.
On 6/26/25 1:24 AM, Bart Van Assche wrote:
Every time I run test srp/002 the following deadlock is triggered:
task:multipathd
Call Trace:
 <TASK>
 __schedule+0x8c1/0x1bf0
 schedule+0xdd/0x270
 schedule_preempt_disabled+0x1c/0x30
 __mutex_lock+0xb89/0x1650
 mutex_lock_nested+0x1f/0x30
 dm_table_set_restrictions+0x823/0xdf0
 __bind+0x166/0x590
 dm_swap_table+0x2a7/0x490
 do_resume+0x1b1/0x610
 dev_suspend+0x55/0x1a0
 ctl_ioctl+0x3a5/0x7e0
 dm_ctl_ioctl+0x12/0x20
 __x64_sys_ioctl+0x127/0x1a0
 x64_sys_call+0xe2b/0x17d0
 do_syscall_64+0x96/0x3a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>

task:(udev-worker)
Call Trace:
 <TASK>
 __schedule+0x8c1/0x1bf0
 schedule+0xdd/0x270
 blk_mq_freeze_queue_wait+0xf2/0x140
 blk_mq_freeze_queue_nomemsave+0x23/0x30
 queue_ra_store+0x14e/0x290
 queue_attr_store+0x23e/0x2c0
 sysfs_kf_write+0xde/0x140
 kernfs_fop_write_iter+0x3b2/0x630
 vfs_write+0x4fd/0x1390
 ksys_write+0xfd/0x230
 __x64_sys_write+0x76/0xc0
 x64_sys_call+0x276/0x17d0
 do_syscall_64+0x96/0x3a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53
 </TASK>
It seems that some other thread on your system acquired ->freeze_lock and never released it, and that prevents the udev-worker thread from making forward progress. As udev-worker cannot make progress, it in turn prevents multipathd from making progress (multipathd is pending on ->limits_lock, which has been acquired by udev-worker).
If you haven't enabled lockdep on your system, can you please configure lockdep and rerun the srp/002 test? I think you will encounter a lockdep splat, and that should help further investigate the deadlock.
Thanks, --Nilay
On 6/25/25 10:31 PM, Nilay Shroff wrote:
It seems that some other thread on your system acquired ->freeze_lock and never released it, and that prevents the udev-worker thread from making forward progress.
That's wrong. blk_mq_freeze_queue_wait() is waiting for q_usage_counter to drop to zero as the below output shows:
(gdb) list *(blk_mq_freeze_queue_wait+0xf2)
0xffffffff823ab0b2 is in blk_mq_freeze_queue_wait (block/blk-mq.c:190).
185	}
186	EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
187
188	void blk_mq_freeze_queue_wait(struct request_queue *q)
189	{
190		wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
191	}
192	EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
193
194	int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
If you haven't enabled lockdep on your system then can you please configure lockdep and rerun the srp/002 test?
Lockdep was enabled during the test and didn't complain.
This is my analysis of the deadlock:
* Multiple requests are pending:

  # (cd /sys/kernel/debug/block && grep -aH . */*/*/*list) | head
  dm-2/hctx0/cpu0/default_rq_list:0000000035c26c20 {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=137, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:000000005060461e {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=136, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:000000007cd295ec {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=135, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:00000000a4a8006b {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=134, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:000000001f93036f {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=140, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:00000000333baffb {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=173, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:000000002c050850 {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=141, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:000000000668dd8b {.op=WRITE, .cmd_flags=SYNC|META|PRIO, .rq_flags=IO_STAT, .state=idle, .tag=133, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:0000000079b67c9f {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=207, .internal_tag=-1}
  dm-2/hctx0/cpu107/default_rq_list:0000000036254afb {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=1384, .internal_tag=-1}
* queue_if_no_path is enabled for the multipath device dm-2:

  # ls -l /dev/mapper/mpatha
  lrwxrwxrwx 1 root root 7 Jun 26 08:50 /dev/mapper/mpatha -> ../dm-2
  # dmsetup table mpatha
  0 65536 multipath 1 queue_if_no_path 1 alua 1 1 service-time 0 1 2 8:32 1 1
* The block device 8:32 is being deleted:

  # grep '^8:32$' /sys/class/block/*/dev | wc -l
  0
* blk_mq_freeze_queue_nomemsave() waits for the pending requests to finish. Because the only path of the multipath device is being deleted and because queue_if_no_path is enabled, blk_mq_freeze_queue_nomemsave() hangs.
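Putting the two call traces together, the dependency cycle looks like this (a summary of the above, not actual code):

  udev-worker                          multipathd
  -----------                          ----------
  queue_ra_store()
    mutex_lock(&q->limits_lock)    <- acquired
    blk_mq_freeze_queue_nomemsave()
      waits for q_usage_counter == 0
                                       do_resume()
                                         dm_table_set_restrictions()
                                           mutex_lock(&q->limits_lock)
                                           <- blocked behind udev-worker

  The freeze never completes because queue_if_no_path keeps requests
  queued on dm-2 until a path is restored, and restoring a path requires
  multipathd to make progress.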
Bart.
On 6/26/25 9:32 PM, Bart Van Assche wrote:
On 6/25/25 10:31 PM, Nilay Shroff wrote:
It seems that some other thread on your system acquired ->freeze_lock and never released it, and that prevents the udev-worker thread from making forward progress.
That's wrong. blk_mq_freeze_queue_wait() is waiting for q_usage_counter to drop to zero as the below output shows:
(gdb) list *(blk_mq_freeze_queue_wait+0xf2)
0xffffffff823ab0b2 is in blk_mq_freeze_queue_wait (block/blk-mq.c:190).
185	}
186	EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
187
188	void blk_mq_freeze_queue_wait(struct request_queue *q)
189	{
190		wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
191	}
192	EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_wait);
193
194	int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,
If you haven't enabled lockdep on your system then can you please configure lockdep and rerun the srp/002 test?
Lockdep was enabled during the test and didn't complain.
This is my analysis of the deadlock:
- Multiple requests are pending:
  # (cd /sys/kernel/debug/block && grep -aH . */*/*/*list) | head
  dm-2/hctx0/cpu0/default_rq_list:0000000035c26c20 {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=137, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:000000005060461e {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=136, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:000000007cd295ec {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=135, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:00000000a4a8006b {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=134, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:000000001f93036f {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=140, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:00000000333baffb {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=173, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:000000002c050850 {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=141, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:000000000668dd8b {.op=WRITE, .cmd_flags=SYNC|META|PRIO, .rq_flags=IO_STAT, .state=idle, .tag=133, .internal_tag=-1}
  dm-2/hctx0/cpu0/default_rq_list:0000000079b67c9f {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=207, .internal_tag=-1}
  dm-2/hctx0/cpu107/default_rq_list:0000000036254afb {.op=READ, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT, .state=idle, .tag=1384, .internal_tag=-1}
- queue_if_no_path is enabled for the multipath device dm-2:
  # ls -l /dev/mapper/mpatha
  lrwxrwxrwx 1 root root 7 Jun 26 08:50 /dev/mapper/mpatha -> ../dm-2
  # dmsetup table mpatha
  0 65536 multipath 1 queue_if_no_path 1 alua 1 1 service-time 0 1 2 8:32 1 1
- The block device 8:32 is being deleted:
  # grep '^8:32$' /sys/class/block/*/dev | wc -l
  0
- blk_mq_freeze_queue_nomemsave() waits for the pending requests to finish. Because the only path of the multipath device is being deleted and because queue_if_no_path is enabled, blk_mq_freeze_queue_nomemsave() hangs.
Thanks! This makes sense now. But then we do have a few other limits (e.g. iostats_passthrough, iostats, write_cache, etc.) which are accessed in the I/O hotpath. So if we were to update those limits, we would acquire ->limits_lock and also freeze the queue. I wonder how those could be addressed?
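(To make the concern concrete: such a flag is consulted for every request, roughly like the following simplified fragment, so updating it must not race with in-flight submissions:)

	/* simplified illustration of the per-request hotpath read */
	if (blk_queue_io_stat(q))
		rq->rq_flags |= RQF_IO_STAT;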
Thanks, --Nilay
On 6/26/25 11:16 PM, Nilay Shroff wrote:
Thanks! This makes sense now. But then we do have a few other limits (e.g. iostats_passthrough, iostats, write_cache, etc.) which are accessed in the I/O hotpath. So if we were to update those limits, we would acquire ->limits_lock and also freeze the queue. I wonder how those could be addressed?
Is there any Linux distro that sets these sysfs attributes from a udev rule? If not, I don't think that we have to worry about these sysfs attributes.
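A rule of the following shape is what I mean (a purely hypothetical example, for illustration; I am not aware of a distro shipping one for these attributes):

  ACTION=="add|change", SUBSYSTEM=="block", ATTR{queue/iostats}="0"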
Thanks,
Bart.
On 6/27/25 8:40 PM, Bart Van Assche wrote:
On 6/26/25 11:16 PM, Nilay Shroff wrote:
Thanks! This makes sense now. But then we do have a few other limits (e.g. iostats_passthrough, iostats, write_cache, etc.) which are accessed in the I/O hotpath. So if we were to update those limits, we would acquire ->limits_lock and also freeze the queue. I wonder how those could be addressed?
Is there any Linux distro that sets these sysfs attributes from a udev rule? If not, I don't think that we have to worry about these sysfs attributes.
I think that's not only about distro udev rules setting queue limits. It's quite possible that some user applications programmatically update these queue limits at runtime, and in such cases the queue would still need to be frozen while the change is made. So even if no current distro sets these attributes via udev, that could change in the future, and we don't have control over that.
Looking at your earlier dmsetup command:

  # dmsetup table mpatha
  0 65536 multipath 1 queue_if_no_path 1 alua 1 1 service-time 0 1 2 8:32 1 1
In the above table, the option queue_if_no_path seems a bit odd (unless used with a timeout). Can't we set the module parameter queue_if_no_path_timeout_secs=<N> while loading dm-multipath, and thus avoid hanging the queue's I/O indefinitely when all paths of a multipath device are lost? IMO, queue_if_no_path without a timeout may make sense when we know that the paths will eventually recover and that applications should simply wait.
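For example (assuming the parameter is available in the kernel being tested):

  # modprobe dm_multipath queue_if_no_path_timeout_secs=60

The idea being that, after the timeout expires with no usable path, dm-mpath stops queueing and fails the pending I/O instead of waiting forever.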
Thanks, --Nilay
On 6/30/25 3:41 AM, Nilay Shroff wrote:
Looking at your earlier dmsetup command:

  # dmsetup table mpatha
  0 65536 multipath 1 queue_if_no_path 1 alua 1 1 service-time 0 1 2 8:32 1 1
In the above table, the option queue_if_no_path seems a bit odd (unless used with a timeout). Can't we set the module parameter queue_if_no_path_timeout_secs=<N> while loading dm-multipath, and thus avoid hanging the queue's I/O indefinitely when all paths of a multipath device are lost? IMO, queue_if_no_path without a timeout may make sense when we know that the paths will eventually recover and that applications should simply wait.
I refuse to modify the tests that trigger the deadlock because:

1. The deadlock is a REGRESSION. Regressions are not tolerated in the Linux kernel and should be fixed instead of arguing about whether or not the use case should be modified.

2. The test that triggers the deadlock is not new. It is almost ten years old, and the deadlock reported at the start of this email thread is the first deadlock in the block layer triggered by that test.

3. queue_if_no_path is widely used to avoid I/O errors if all paths are temporarily unavailable and if it is not known how long it will take to restore a path. queue_if_no_path can e.g. be used to prevent I/O errors if a technician mistakenly pulls the wrong cable(s) in a data center.

4. Unnecessary blk_mq_freeze_queue()/blk_mq_unfreeze_queue() pairs slow down the workflows that trigger these kernel function calls. Hence, if blk_mq_freeze_queue() and blk_mq_unfreeze_queue() are called unnecessarily, the calls to these functions should be removed.
Bart.
On 6/30/25 8:58 PM, Bart Van Assche wrote:
On 6/30/25 3:41 AM, Nilay Shroff wrote:
Looking at your earlier dmsetup command:

  # dmsetup table mpatha
  0 65536 multipath 1 queue_if_no_path 1 alua 1 1 service-time 0 1 2 8:32 1 1
In the above table, the option queue_if_no_path seems a bit odd (unless used with a timeout). Can't we set the module parameter queue_if_no_path_timeout_secs=<N> while loading dm-multipath, and thus avoid hanging the queue's I/O indefinitely when all paths of a multipath device are lost? IMO, queue_if_no_path without a timeout may make sense when we know that the paths will eventually recover and that applications should simply wait.
I refuse to modify the tests that trigger the deadlock because:
1. The deadlock is a REGRESSION. Regressions are not tolerated in the Linux kernel and should be fixed instead of arguing about whether or not the use case should be modified.

2. The test that triggers the deadlock is not new. It is almost ten years old, and the deadlock reported at the start of this email thread is the first deadlock in the block layer triggered by that test.

3. queue_if_no_path is widely used to avoid I/O errors if all paths are temporarily unavailable and if it is not known how long it will take to restore a path. queue_if_no_path can e.g. be used to prevent I/O errors if a technician mistakenly pulls the wrong cable(s) in a data center.

4. Unnecessary blk_mq_freeze_queue()/blk_mq_unfreeze_queue() pairs slow down the workflows that trigger these kernel function calls. Hence, if blk_mq_freeze_queue() and blk_mq_unfreeze_queue() are called unnecessarily, the calls to these functions should be removed.
Yes, agreed. Your points above make sense. As this is a REGRESSION, we must fix it. I am convinced that the readahead attribute is not used in the I/O hotpath, so in that sense your patch is valid; no issues with that. However, my only concern was about the other sysfs attributes that need a queue freeze, and that concern is legitimate since those attributes are used in the I/O hotpath. But I just saw that you proposed an idea for addressing those other sysfs attributes, so let's discuss that on that thread.

Meanwhile, I'll review your patch.
Thanks, --Nilay