This issue was found in 4.14 and is present in earlier kernels.
Please backport
f5bbbbe4d635 blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
530ca2c9bd69 blk-mq: Allow blocking queue tag iter callbacks
onto the stable branches that don't have these. The second is a fix for the first. Thank you.
4.19.y and later - commits already present
4.14.y - f5bbbbe4d635 doesn't patch cleanly but it's still straightforward, just drop the comment and code mentioning switching to 'none' in the trailing context
4.9.y - ditto
4.4.y - there was a refactoring of the code in commit 0bf6cd5b9531bcc29c0a5e504b6ce2984c6fd8d8 making this non-trivial
3.16.y - ditto
I am happy to try to produce clean patches, but it may be a day or so.
Regards, Giuliano.
On Wed, Apr 01, 2020 at 05:47:02PM +0000, Giuliano Procida wrote:
> This issue was found in 4.14 and is present in earlier kernels.
> Please backport
> f5bbbbe4d635 blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
> 530ca2c9bd69 blk-mq: Allow blocking queue tag iter callbacks
> onto the stable branches that don't have these. The second is a fix for the first. Thank you.
> 4.19.y and later - commits already present
> 4.14.y - f5bbbbe4d635 doesn't patch cleanly but it's still straightforward, just drop the comment and code mentioning switching to 'none' in the trailing context
> 4.9.y - ditto
> 4.4.y - there was a refactoring of the code in commit 0bf6cd5b9531bcc29c0a5e504b6ce2984c6fd8d8 making this non-trivial
> 3.16.y - ditto
> I am happy to try to produce clean patches, but it may be a day or so.
Clean patches would be good, as there are -rcs out right now so I can't do anything until they are released in a few days.
thanks,
greg k-h
On Wed, Apr 01, 2020 at 05:47:02PM +0000, Giuliano Procida wrote:
> This issue was found in 4.14 and is present in earlier kernels.
> Please backport
> f5bbbbe4d635 blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
> 530ca2c9bd69 blk-mq: Allow blocking queue tag iter callbacks
> onto the stable branches that don't have these. The second is a fix for the first. Thank you.
> 4.19.y and later - commits already present
> 4.14.y - f5bbbbe4d635 doesn't patch cleanly but it's still straightforward, just drop the comment and code mentioning switching to 'none' in the trailing context
> 4.9.y - ditto
> 4.4.y - there was a refactoring of the code in commit 0bf6cd5b9531bcc29c0a5e504b6ce2984c6fd8d8 making this non-trivial
> 3.16.y - ditto
> I am happy to try to produce clean patches, but it may be a day or so.
I have done this for 4.14.y and 4.9.y, can you please provide a backport for 4.4.y that I can queue up?
thanks,
greg k-h
Hi Greg.
I also have 4.14 and 4.9, I'll send them on for comparison.
I will try 4.4 but, as one call site doesn't exist and the other didn't have any locking to start with, I'd like to try to reproduce the issue first.
I should have some spare time for this soon.
Giuliano.
On Fri, 3 Apr 2020 at 10:20, Greg KH <greg@kroah.com> wrote:
> On Wed, Apr 01, 2020 at 05:47:02PM +0000, Giuliano Procida wrote:
> > This issue was found in 4.14 and is present in earlier kernels.
> > Please backport
> > f5bbbbe4d635 blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
> > 530ca2c9bd69 blk-mq: Allow blocking queue tag iter callbacks
> > onto the stable branches that don't have these. The second is a fix for the first. Thank you.
> > 4.19.y and later - commits already present
> > 4.14.y - f5bbbbe4d635 doesn't patch cleanly but it's still straightforward, just drop the comment and code mentioning switching to 'none' in the trailing context
> > 4.9.y - ditto
> > 4.4.y - there was a refactoring of the code in commit 0bf6cd5b9531bcc29c0a5e504b6ce2984c6fd8d8 making this non-trivial
> > 3.16.y - ditto
> > I am happy to try to produce clean patches, but it may be a day or so.
> I have done this for 4.14.y and 4.9.y, can you please provide a backport for 4.4.y that I can queue up?
> thanks,
> greg k-h
Hi Greg.
On Fri, 3 Apr 2020 at 23:30, Giuliano Procida <gprocida@google.com> wrote:
> Hi Greg.
> I also have 4.14 and 4.9, I'll send them on for comparison.
I've done this.
> I will try 4.4 but, as one call site doesn't exist and the other didn't have any locking to start with, I'd like to try to reproduce the issue first.
I have failed to build a bootable 4.4 kernel, which is surprising / embarrassing: my current toolchain (even after working around various known issues) produces kernels that either panic or triple-fault (apparently, as there's no log output, just a reboot) on my amd64 hardware. Running an old live distribution with a 4.4 kernel, I wasn't able to reproduce the issue these fixes apparently resolve, even after several hours of running.
I've also spent most of 2 days looking at unfamiliar code.
The code in 4.4 uses a timer instead of a workqueue for timeout callbacks. The callbacks also have blk_queue_enter/exit protection in 4.9 but not in 4.4. I'm guessing, but don't know, that the execution contexts are sufficiently similar between timers and workqueues that this protection should be back-ported to 4.4. This is relatively simple; it's bits of a couple of extra commits.
f5bbbbe4d635 adds to blk_mq_queue_tag_busy_iter an RCU-protected test to see if the blk_queue is held before doing any work. It also adds RCU synchronisation to code that manipulates the number of hardware queues. The follow-up 530ca2c9bd more sensibly just (possibly recursively) does try-to-enter/exit instead. 4.4 doesn't have code that manipulates the number of hardware queues. However, the blk_mq_queue_tag_busy_iter locking may be enough to protect against concurrent ioctl/procfs access.
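To make that concrete, the guard at the top of blk_mq_queue_tag_busy_iter changes roughly as follows (a paraphrase of the two upstream commits for illustration, not code lifted from any particular stable tree):

	/* f5bbbbe4d635: test under RCU; iterator callbacks must not block. */
	rcu_read_lock();
	if (percpu_ref_is_zero(&q->q_usage_counter)) {
		rcu_read_unlock();
		return;
	}
	/* ... iterate the hardware queues ... */
	rcu_read_unlock();

	/* 530ca2c9bd69: take a reference instead; callbacks may block. */
	if (!percpu_ref_tryget(&q->q_usage_counter))
		return;
	/* ... iterate the hardware queues ... */
	blk_queue_exit(q);	/* drops the reference taken above */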
To this end, I've put together patches for 4.4. They are completely untested. Once I've verified they actually compile I'll send them on.
Giuliano.
> I should have some spare time for this soon.
> Giuliano.
> On Fri, 3 Apr 2020 at 10:20, Greg KH <greg@kroah.com> wrote:
> > On Wed, Apr 01, 2020 at 05:47:02PM +0000, Giuliano Procida wrote:
> > > This issue was found in 4.14 and is present in earlier kernels.
> > > Please backport
> > > f5bbbbe4d635 blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
> > > 530ca2c9bd69 blk-mq: Allow blocking queue tag iter callbacks
> > > onto the stable branches that don't have these. The second is a fix for the first. Thank you.
> > > 4.19.y and later - commits already present
> > > 4.14.y - f5bbbbe4d635 doesn't patch cleanly but it's still straightforward, just drop the comment and code mentioning switching to 'none' in the trailing context
> > > 4.9.y - ditto
> > > 4.4.y - there was a refactoring of the code in commit 0bf6cd5b9531bcc29c0a5e504b6ce2984c6fd8d8 making this non-trivial
> > > 3.16.y - ditto
> > > I am happy to try to produce clean patches, but it may be a day or so.
> > I have done this for 4.14.y and 4.9.y, can you please provide a backport for 4.4.y that I can queue up?
> > thanks,
> > greg k-h
Here are the patches for linux-4.4.y. Untested.
Regards, Giuliano.
Giuliano Procida (4):
  block: more locking around delayed work
  blk-mq: Allow timeouts to run while queue is freezing
  blk-mq: sync things with blk_mq_queue_tag_busy_iter
  blk-mq: Allow blocking queue tag iter callbacks
 block/blk-mq-tag.c  |  7 ++++++-
 block/blk-mq.c      | 17 +++++++++++++++++
 block/blk-timeout.c |  3 +++
 3 files changed, 26 insertions(+), 1 deletion(-)
On Tue, Apr 07, 2020 at 05:55:35PM +0100, Giuliano Procida wrote:
> Here are the patches for linux-4.4.y. Untested.
Can you fix up patch 1/4 as noted, and test them before resending them?
thanks,
greg k-h
commit 287922eb0b186e2a5bf54fdd04b734c25c90035c upstream.
The upstream commit (block: defer timeouts to a workqueue) included various locking changes. The original commit message did not say anything about the extra locking. Perhaps this is only needed for workqueue callbacks and not timer callbacks. We assume it is needed here.
This patch includes the locking changes but leaves timeouts using a timer.
Both blk_mq_rq_timer and blk_rq_timed_out_timer will return without doing any work if they cannot acquire the queue (without waiting).
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq.c      | 4 ++++
 block/blk-timeout.c | 3 +++
 2 files changed, 7 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8649dbf06ce4..11a23bf73fd9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -628,6 +628,9 @@ static void blk_mq_rq_timer(unsigned long priv)
 	};
 	int i;
 
+	if (blk_queue_enter(q, GFP_NOWAIT))
+		return;
+
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
 	if (data.next_set) {
@@ -642,6 +645,7 @@ static void blk_mq_rq_timer(unsigned long priv)
 			blk_mq_tag_idle(hctx);
 		}
 	}
+	blk_queue_exit(q);
 }
 
 /*
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index aa40aa93381b..2bc03df554a6 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -134,6 +134,8 @@ void blk_rq_timed_out_timer(unsigned long data)
 	struct request *rq, *tmp;
 	int next_set = 0;
 
+	if (blk_queue_enter(q, GFP_NOWAIT))
+		return;
 	spin_lock_irqsave(q->queue_lock, flags);
 
 	list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
@@ -143,6 +145,7 @@ void blk_rq_timed_out_timer(unsigned long data)
 		mod_timer(&q->timeout, round_jiffies_up(next));
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
+	blk_queue_exit(q);
 }
 
 /**
On Tue, Apr 07, 2020 at 05:55:36PM +0100, Giuliano Procida wrote:
> commit 287922eb0b186e2a5bf54fdd04b734c25c90035c upstream.
> The upstream commit (block: defer timeouts to a workqueue) included various locking changes. The original commit message did not say anything about the extra locking. Perhaps this is only needed for workqueue callbacks and not timer callbacks. We assume it is needed here.
> This patch includes the locking changes but leaves timeouts using a timer.
> Both blk_mq_rq_timer and blk_rq_timed_out_timer will return without doing any work if they cannot acquire the queue (without waiting).
> Signed-off-by: Giuliano Procida <gprocida@google.com>
Don't write your own changelog text for something that is upstream, include the original and then add your own later on below, making it obvious what you are doing differently from the original commit.
And be sure to cc: all of the original people on that commit, and keep their s-o-b also.
thanks,
greg k-h
No problem.
Will do today.
Giuliano.
On Fri, 10 Apr 2020 at 10:03, Greg KH <gregkh@linuxfoundation.org> wrote:
> On Tue, Apr 07, 2020 at 05:55:36PM +0100, Giuliano Procida wrote:
> > commit 287922eb0b186e2a5bf54fdd04b734c25c90035c upstream.
> > The upstream commit (block: defer timeouts to a workqueue) included various locking changes. The original commit message did not say anything about the extra locking. Perhaps this is only needed for workqueue callbacks and not timer callbacks. We assume it is needed here.
> > This patch includes the locking changes but leaves timeouts using a timer.
> > Both blk_mq_rq_timer and blk_rq_timed_out_timer will return without doing any work if they cannot acquire the queue (without waiting).
> > Signed-off-by: Giuliano Procida <gprocida@google.com>
> Don't write your own changelog text for something that is upstream, include the original and then add your own later on below, making it obvious what you are doing differently from the original commit.
> And be sure to cc: all of the original people on that commit, and keep their s-o-b also.
> thanks,
> greg k-h
commit 71f79fb3179e69b0c1448a2101a866d871c66e7f upstream.
In case a submitted request gets stuck for some reason, the block layer can prevent request starvation by starting the scheduled timeout work. If this stuck request occurs at the same time another thread has started a queue freeze, the blk_mq_timeout_work will not be able to acquire the queue reference and will return silently, thus not issuing the timeout. But since the request is already holding a q_usage_counter reference and is unable to complete, it will never release its reference, preventing the queue from completing the freeze started by the first thread. This puts the request_queue in a hung state, forever waiting for the freeze completion.
This was observed while running IO to a NVMe device at the same time we toggled the CPU hotplug code. Eventually, once a request got stuck requiring a timeout during a queue freeze, we saw the CPU Hotplug notification code get stuck inside blk_mq_freeze_queue_wait, as shown in the trace below.
[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4
The fix is to allow the timeout work to execute in the window between dropping the initial refcount reference and the release of the last reference, which actually marks the freeze completion. This can be achieved with percpu_refcount_tryget, which does not require the counter to be alive. This way the timeout work can do its job and terminate a stuck request even during a freeze, returning its reference and avoiding the deadlock.
Allowing the timeout to run is just a part of the fix, since for some devices, we might get stuck again inside the device driver's timeout handler, should it attempt to allocate a new request in that path - which is a quite common action for Abort commands, which need to be sent after a timeout. In NVMe, for instance, we call blk_mq_alloc_request from inside the timeout handler, which will fail during a freeze, since it also tries to acquire a queue reference.
I considered a similar change to blk_mq_alloc_request as a generic solution for further device driver hangs, but we can't do that, since it would allow new requests to disturb the freeze process. I thought about creating a new function in the block layer to support unfreezable requests for these occasions, but after working on it for a while, I feel like this should be handled in a per-driver basis. I'm now experimenting with changes to the NVMe timeout path, but I'm open to suggestions of ways to make this generic.
Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11a23bf73fd9..d13e70d40df9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -628,7 +628,20 @@ static void blk_mq_rq_timer(unsigned long priv)
 	};
 	int i;
 
-	if (blk_queue_enter(q, GFP_NOWAIT))
+	/* A deadlock might occur if a request is stuck requiring a
+	 * timeout at the same time a queue freeze is waiting
+	 * completion, since the timeout code would not be able to
+	 * acquire the queue reference here.
+	 *
+	 * That's why we don't use blk_queue_enter here; instead, we use
+	 * percpu_ref_tryget directly, because we need to be able to
+	 * obtain a reference even in the short window between the queue
+	 * starting to freeze, by dropping the first reference in
+	 * blk_mq_freeze_queue_start, and the moment the last request is
+	 * consumed, marked by the instant q_usage_counter reaches
+	 * zero.
+	 */
+	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
commit f5bbbbe4d63577026f908a809f22f5fd5a90ea1f upstream.
The original commit was intended to prevent concurrent manipulation of nr_hw_queues and iteration over queues. The former doesn't happen in this older kernel version. However, the extra locking (which is buggy as it exists in this commit) may protect against other concurrent accesses such as queue removal.
The original commit message follows for completeness.
blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to account the inflight requests. It will access the queue_hw_ctx and nr_hw_queues w/o any protection. When updating nr_hw_queues and blk_mq_in_flight/rw occur concurrently, panic comes up.
Before update nr_hw_queues, the q will be frozen. So we could use q_usage_counter to avoid the race. percpu_ref_is_zero is used here so that we will not miss any in-flight request. The access to nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are under rcu critical section, __blk_mq_update_nr_hw_queues could use synchronize_rcu to ensure the zeroed q_usage_counter to be globally visible.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq-tag.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a07ca3488d96..bf356de30134 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -481,6 +481,14 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
+	/*
+	 * Avoid potential races with things like queue removal.
+	 */
+	rcu_read_lock();
+	if (percpu_ref_is_zero(&q->q_usage_counter)) {
+		rcu_read_unlock();
+		return;
+	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		struct blk_mq_tags *tags = hctx->tags;
@@ -497,7 +505,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		bt_for_each(hctx, &tags->bitmap_tags, tags->nr_reserved_tags, fn, priv,
 			false);
 	}
-
+	rcu_read_unlock();
 }
 
 static unsigned int bt_unused_tags(struct blk_mq_bitmap_tags *bt)
commit 530ca2c9bd6949c72c9b5cfc330cb3dbccaa3f5b upstream.
This change is a back-ported fix for the back-port of f5bbbbe4d6357 (a439abbd6e707232b1f399e6df1a85ace42e8f9f).
A recent commit runs tag iterator callbacks under the rcu read lock, but existing callbacks do not satisfy the non-blocking requirement. The commit intended to prevent an iterator from accessing a queue that's being modified. This patch fixes the original issue by taking a queue reference instead of reading it, which allows callbacks to make blocking calls.
Fixes: f5bbbbe4d6357 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
Acked-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq-tag.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index bf356de30134..c1c654319287 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -484,11 +484,8 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	/*
 	 * Avoid potential races with things like queue removal.
 	 */
-	rcu_read_lock();
-	if (percpu_ref_is_zero(&q->q_usage_counter)) {
-		rcu_read_unlock();
+	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
-	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		struct blk_mq_tags *tags = hctx->tags;
@@ -505,7 +502,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		bt_for_each(hctx, &tags->bitmap_tags, tags->nr_reserved_tags, fn, priv,
 			false);
 	}
-	rcu_read_unlock();
+	blk_queue_exit(q);
 }
 
 static unsigned int bt_unused_tags(struct blk_mq_bitmap_tags *bt)
Greg.
Going back further to 3.16 or 3.18 looks like a lot of work and I have low confidence that I would generate correct code.
There are changes like 3ef28e83ab15799742e55fd13243a5f678b04242 (from 4.3) which changed the locking from blk_mq_queue_enter to blk_queue_enter.
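Roughly speaking (this is from memory, as a sketch of the API difference rather than verified 3.16 code):

	/* Before 3ef28e83ab15 (e.g. 3.16.y), blk-mq had its own entry point: */
	static int blk_mq_queue_enter(struct request_queue *q);	/* q->mq_usage_counter */

	/* From 4.3 onwards, the generic reference these backports rely on: */
	int blk_queue_enter(struct request_queue *q, gfp_t gfp);	/* q->q_usage_counter */

The guards being backported take the generic q_usage_counter reference, which has no direct equivalent to hook into in 3.16.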
I'm going to stand down here.
Sorry about this. Giuliano.
On Tue, 7 Apr 2020 at 17:31, Giuliano Procida <gprocida@google.com> wrote:
> Hi Greg.
> On Fri, 3 Apr 2020 at 23:30, Giuliano Procida <gprocida@google.com> wrote:
> > Hi Greg.
> > I also have 4.14 and 4.9, I'll send them on for comparison.
> I've done this.
> > I will try 4.4 but, as one call site doesn't exist and the other didn't have any locking to start with, I'd like to try to reproduce the issue first.
> I have failed to build a bootable 4.4 kernel, which is surprising / embarrassing: my current toolchain (even after working around various known issues) produces kernels that either panic or triple-fault (apparently, as there's no log output, just a reboot) on my amd64 hardware. Running an old live distribution with a 4.4 kernel, I wasn't able to reproduce the issue these fixes apparently resolve, even after several hours of running.
> I've also spent most of 2 days looking at unfamiliar code.
> The code in 4.4 uses a timer instead of a workqueue for timeout callbacks. The callbacks also have blk_queue_enter/exit protection in 4.9 but not in 4.4. I'm guessing, but don't know, that the execution contexts are sufficiently similar between timers and workqueues that this protection should be back-ported to 4.4. This is relatively simple; it's bits of a couple of extra commits.
> f5bbbbe4d635 adds to blk_mq_queue_tag_busy_iter an RCU-protected test to see if the blk_queue is held before doing any work. It also adds RCU synchronisation to code that manipulates the number of hardware queues. The follow-up 530ca2c9bd more sensibly just (possibly recursively) does try-to-enter/exit instead. 4.4 doesn't have code that manipulates the number of hardware queues. However, the blk_mq_queue_tag_busy_iter locking may be enough to protect against concurrent ioctl/procfs access.
> To this end, I've put together patches for 4.4. They are completely untested. Once I've verified they actually compile I'll send them on.
> Giuliano.
> > I should have some spare time for this soon.
> > Giuliano.
> > On Fri, 3 Apr 2020 at 10:20, Greg KH <greg@kroah.com> wrote:
> > > On Wed, Apr 01, 2020 at 05:47:02PM +0000, Giuliano Procida wrote:
> > > > This issue was found in 4.14 and is present in earlier kernels.
> > > > Please backport
> > > > f5bbbbe4d635 blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
> > > > 530ca2c9bd69 blk-mq: Allow blocking queue tag iter callbacks
> > > > onto the stable branches that don't have these. The second is a fix for the first. Thank you.
> > > > 4.19.y and later - commits already present
> > > > 4.14.y - f5bbbbe4d635 doesn't patch cleanly but it's still straightforward, just drop the comment and code mentioning switching to 'none' in the trailing context
> > > > 4.9.y - ditto
> > > > 4.4.y - there was a refactoring of the code in commit 0bf6cd5b9531bcc29c0a5e504b6ce2984c6fd8d8 making this non-trivial
> > > > 3.16.y - ditto
> > > > I am happy to try to produce clean patches, but it may be a day or so.
> > > I have done this for 4.14.y and 4.9.y, can you please provide a backport for 4.4.y that I can queue up?
> > > thanks,
> > > greg k-h
v2: Updated commit messages following feedback from gregkh.
Here are the patches for linux-4.4.y.
There are two further patches compared to the linux-4.9.y series, and the differences after back-porting are non-trivial.
The code compiles without warnings. However, I have no suitable hardware or virtual machine to test this on.
Regards, Giuliano.
Giuliano Procida (4):
  block: more locking around delayed work
  blk-mq: Allow timeouts to run while queue is freezing
  blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
  blk-mq: Allow blocking queue tag iter callbacks
 block/blk-mq-tag.c  |  7 ++++++-
 block/blk-mq.c      | 17 +++++++++++++++++
 block/blk-timeout.c |  3 +++
 3 files changed, 26 insertions(+), 1 deletion(-)
On Wed, Apr 15, 2020 at 02:00:13PM +0100, Giuliano Procida wrote:
> v2: Updated commit messages following feedback from gregkh.
> Here are the patches for linux-4.4.y.
> There are two further patches compared to the linux-4.9.y series, and the differences after back-porting are non-trivial.
> The code compiles without warnings. However, I have no suitable hardware or virtual machine to test this on.
Sorry for the delay, now queued up.
greg k-h
commit 287922eb0b186e2a5bf54fdd04b734c25c90035c upstream.
block: defer timeouts to a workqueue
Timer context is not very useful for drivers to perform any meaningful abort action from. So instead of calling the driver from this useless context defer it to a workqueue as soon as possible.
Note that while a delayed_work item would seem the right thing here I didn't dare to use it due to the magic in blk_add_timer that pokes deep into timer internals. But maybe this encourages Tejun to add a sensible API for that to the workqueue API and we'll all be fine in the end :)
Contains a major update from Keith Busch:
"This patch removes synchronizing the timeout work so that the timer can start a freeze on its own queue. The timer enters the queue, so timer context can only start a freeze, but not wait for frozen."
NOTE: Back-ported to 4.4.y.
Only the locking changes from the upstream commit have been kept; these were not mentioned in the original commit message, which therefore does not really describe this change.
Timeout callbacks continue to be run via a timer. Both blk_mq_rq_timer and blk_rq_timed_out_timer will return without doing any work if they cannot acquire the queue (without waiting).
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq.c      | 4 ++++
 block/blk-timeout.c | 3 +++
 2 files changed, 7 insertions(+)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8649dbf06ce4..11a23bf73fd9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -628,6 +628,9 @@ static void blk_mq_rq_timer(unsigned long priv)
 	};
 	int i;
 
+	if (blk_queue_enter(q, GFP_NOWAIT))
+		return;
+
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
 	if (data.next_set) {
@@ -642,6 +645,7 @@ static void blk_mq_rq_timer(unsigned long priv)
 			blk_mq_tag_idle(hctx);
 		}
 	}
+	blk_queue_exit(q);
 }
 
 /*
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index aa40aa93381b..2bc03df554a6 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -134,6 +134,8 @@ void blk_rq_timed_out_timer(unsigned long data)
 	struct request *rq, *tmp;
 	int next_set = 0;
 
+	if (blk_queue_enter(q, GFP_NOWAIT))
+		return;
 	spin_lock_irqsave(q->queue_lock, flags);
 
 	list_for_each_entry_safe(rq, tmp, &q->timeout_list, timeout_list)
@@ -143,6 +145,7 @@ void blk_rq_timed_out_timer(unsigned long data)
 		mod_timer(&q->timeout, round_jiffies_up(next));
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
+	blk_queue_exit(q);
 }
 
 /**
commit 71f79fb3179e69b0c1448a2101a866d871c66e7f upstream.
In case a submitted request gets stuck for some reason, the block layer can prevent request starvation by starting the scheduled timeout work. If this stuck request occurs at the same time another thread has started a queue freeze, the blk_mq_timeout_work will not be able to acquire the queue reference and will return silently, thus not issuing the timeout. But since the request is already holding a q_usage_counter reference and is unable to complete, it will never release its reference, preventing the queue from completing the freeze started by the first thread. This puts the request_queue in a hung state, forever waiting for the freeze completion.
This was observed while running IO to a NVMe device at the same time we toggled the CPU hotplug code. Eventually, once a request got stuck requiring a timeout during a queue freeze, we saw the CPU Hotplug notification code get stuck inside blk_mq_freeze_queue_wait, as shown in the trace below.
[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4
The fix is to allow the timeout work to execute in the window between dropping the initial refcount reference and the release of the last reference, which actually marks the freeze completion. This can be achieved with percpu_refcount_tryget, which does not require the counter to be alive. This way the timeout work can do its job and terminate a stuck request even during a freeze, returning its reference and avoiding the deadlock.
Allowing the timeout to run is just a part of the fix, since for some devices, we might get stuck again inside the device driver's timeout handler, should it attempt to allocate a new request in that path - which is a quite common action for Abort commands, which need to be sent after a timeout. In NVMe, for instance, we call blk_mq_alloc_request from inside the timeout handler, which will fail during a freeze, since it also tries to acquire a queue reference.
I considered a similar change to blk_mq_alloc_request as a generic solution for further device driver hangs, but we can't do that, since it would allow new requests to disturb the freeze process. I thought about creating a new function in the block layer to support unfreezable requests for these occasions, but after working on it for a while, I feel like this should be handled in a per-driver basis. I'm now experimenting with changes to the NVMe timeout path, but I'm open to suggestions of ways to make this generic.
Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 11a23bf73fd9..d13e70d40df9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -628,7 +628,20 @@ static void blk_mq_rq_timer(unsigned long priv)
 	};
 	int i;
 
-	if (blk_queue_enter(q, GFP_NOWAIT))
+	/* A deadlock might occur if a request is stuck requiring a
+	 * timeout at the same time a queue freeze is waiting
+	 * completion, since the timeout code would not be able to
+	 * acquire the queue reference here.
+	 *
+	 * That's why we don't use blk_queue_enter here; instead, we use
+	 * percpu_ref_tryget directly, because we need to be able to
+	 * obtain a reference even in the short window between the queue
+	 * starting to freeze, by dropping the first reference in
+	 * blk_mq_freeze_queue_start, and the moment the last request is
+	 * consumed, marked by the instant q_usage_counter reaches
+	 * zero.
+	 */
+	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
commit f5bbbbe4d63577026f908a809f22f5fd5a90ea1f upstream.
For blk-mq, part_in_flight/rw will invoke blk_mq_in_flight/rw to account the inflight requests. It will access the queue_hw_ctx and nr_hw_queues w/o any protection. When updating nr_hw_queues and blk_mq_in_flight/rw occur concurrently, panic comes up.
Before update nr_hw_queues, the q will be frozen. So we could use q_usage_counter to avoid the race. percpu_ref_is_zero is used here so that we will not miss any in-flight request. The access to nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are under rcu critical section, __blk_mq_update_nr_hw_queues could use synchronize_rcu to ensure the zeroed q_usage_counter to be globally visible.
NOTE: Back-ported to 4.4.y.
The upstream commit was intended to prevent concurrent manipulation of nr_hw_queues and iteration over queues. The former doesn't happen in 4.4.y (as __blk_mq_update_nr_hw_queues doesn't exist). The extra locking is also buggy in this commit, but that is fixed in a follow-up.
It may still protect against other concurrent accesses, such as queue removal, by synchronising RCU locking around q_usage_counter.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq-tag.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a07ca3488d96..bf356de30134 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -481,6 +481,14 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
+	/*
+	 * Avoid potential races with things like queue removal.
+	 */
+	rcu_read_lock();
+	if (percpu_ref_is_zero(&q->q_usage_counter)) {
+		rcu_read_unlock();
+		return;
+	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		struct blk_mq_tags *tags = hctx->tags;
@@ -497,7 +505,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		bt_for_each(hctx, &tags->bitmap_tags, tags->nr_reserved_tags, fn, priv,
 			false);
 	}
-
+	rcu_read_unlock();
 }
 
 static unsigned int bt_unused_tags(struct blk_mq_bitmap_tags *bt)
commit 530ca2c9bd6949c72c9b5cfc330cb3dbccaa3f5b upstream.
A recent commit runs tag iterator callbacks under the rcu read lock, but existing callbacks do not satisfy the non-blocking requirement. The commit intended to prevent an iterator from accessing a queue that's being modified. This patch fixes the original issue by taking a queue reference instead of reading it, which allows callbacks to make blocking calls.
Fixes: f5bbbbe4d6357 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
Acked-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Giuliano Procida <gprocida@google.com>
---
 block/blk-mq-tag.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index bf356de30134..c1c654319287 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -484,11 +484,8 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 	/*
 	 * Avoid potential races with things like queue removal.
 	 */
-	rcu_read_lock();
-	if (percpu_ref_is_zero(&q->q_usage_counter)) {
-		rcu_read_unlock();
+	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
-	}
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		struct blk_mq_tags *tags = hctx->tags;
@@ -505,7 +502,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		bt_for_each(hctx, &tags->bitmap_tags, tags->nr_reserved_tags, fn, priv,
 			false);
 	}
-	rcu_read_unlock();
+	blk_queue_exit(q);
 }
 
 static unsigned int bt_unused_tags(struct blk_mq_bitmap_tags *bt)