Hi Greg,
Please consider this patchset, which includes block/SCSI multiqueue performance enhancements and bugfixes.
We've run multiple benchmarks and various tests for over one week, and the results look good.
These patches are also included in Oracle UEK5.
They're mostly simple cherry-picks; only 2 patches needed minor adjustments.
They apply cleanly on 4.14.57.
Jens Axboe (3):
  Revert "blk-mq: don't handle TAG_SHARED in restart"
  blk-mq: fix issue with shared tag queue re-running
  blk-mq: only run the hardware queue if IO is pending
Jianchao Wang (1):
  blk-mq: put the driver tag of nxt rq before first one is requeued
Ming Lei (19):
  blk-mq-sched: move actual dispatching into one helper
  blk-mq: introduce .get_budget and .put_budget in blk_mq_ops
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq-sched: improve dispatching from sw queue
  scsi: allow passing in null rq to scsi_prep_state_check()
  scsi: implement .get_budget and .put_budget for blk-mq
  SCSI: don't get target/host busy_count in scsi_mq_get_budget()
  blk-mq: don't handle TAG_SHARED in restart
  blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE
  blk-mq: don't handle failure in .get_budget
  blk-flush: don't run queue for requests bypassing flush
  block: pass 'run_queue' to blk_mq_request_bypass_insert
  blk-flush: use blk_mq_request_bypass_insert()
  blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ
  blk-mq: move blk_mq_put_driver_tag*() into blk-mq.h
  blk-mq: don't allocate driver tag upfront for flush rq
  blk-mq: put driver tag if dispatch budget can't be got
  blk-mq: quiesce queue during switching io sched and updating nr_requests
  scsi: core: run queue if SCSI device queue isn't ready and queue is idle
 block/blk-core.c        |   2 +-
 block/blk-flush.c       |  37 +++++--
 block/blk-mq-debugfs.c  |   1 -
 block/blk-mq-sched.c    | 203 ++++++++++++++++++++++-------------
 block/blk-mq.c          | 278 +++++++++++++++++++++++++++---------------------
 block/blk-mq.h          |  58 +++++++++-
 block/elevator.c        |   2 +
 drivers/scsi/scsi_lib.c |  53 ++++++---
 include/linux/blk-mq.h  |  20 +++-
 include/linux/sbitmap.h |  64 ++++++++---
 10 files changed, 475 insertions(+), 243 deletions(-)
From: Ming Lei ming.lei@redhat.com
commit caf8eb0d604a0eaeb8111eb4d36853a6d08eebe7 upstream
So that it becomes easy to support dispatching from the sw queue in the following patch.
No functional change.
Reviewed-by: Bart Van Assche bart.vanassche@wdc.com
Reviewed-by: Omar Sandoval osandov@fb.com
Suggested-by: Christoph Hellwig hch@lst.de # for simplifying dispatch logic
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 block/blk-mq-sched.c | 43 ++++++++++++++++++++++++-------------------
 1 file changed, 24 insertions(+), 19 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index eca011fdfa0e..be29ba849408 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -89,12 +89,26 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx) return false; }
+static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) +{ + struct request_queue *q = hctx->queue; + struct elevator_queue *e = q->elevator; + LIST_HEAD(rq_list); + + do { + struct request *rq = e->type->ops.mq.dispatch_request(hctx); + + if (!rq) + break; + list_add(&rq->queuelist, &rq_list); + } while (blk_mq_dispatch_rq_list(q, &rq_list)); +} + void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) { struct request_queue *q = hctx->queue; struct elevator_queue *e = q->elevator; const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request; - bool do_sched_dispatch = true; LIST_HEAD(rq_list);
/* RCU or SRCU read lock is needed before checking quiesced flag */ @@ -122,30 +136,21 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) * scheduler, we can no longer merge or sort them. So it's best to * leave them there for as long as we can. Mark the hw queue as * needing a restart in that case. + * + * We want to dispatch from the scheduler if there was nothing + * on the dispatch list or we were able to dispatch from the + * dispatch list. */ if (!list_empty(&rq_list)) { blk_mq_sched_mark_restart_hctx(hctx); - do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list); - } else if (!has_sched_dispatch) { + if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch) + blk_mq_do_dispatch_sched(hctx); + } else if (has_sched_dispatch) { + blk_mq_do_dispatch_sched(hctx); + } else { blk_mq_flush_busy_ctxs(hctx, &rq_list); blk_mq_dispatch_rq_list(q, &rq_list); } - - /* - * We want to dispatch from the scheduler if there was nothing - * on the dispatch list or we were able to dispatch from the - * dispatch list. - */ - if (do_sched_dispatch && has_sched_dispatch) { - do { - struct request *rq; - - rq = e->type->ops.mq.dispatch_request(hctx); - if (!rq) - break; - list_add(&rq->queuelist, &rq_list); - } while (blk_mq_dispatch_rq_list(q, &rq_list)); - } }
bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
On 7/23/18 7:24 AM, Jack Wang wrote:
> From: Ming Lei ming.lei@redhat.com
> commit caf8eb0d604a0eaeb8111eb4d36853a6d08eebe7 upstream
> So that it becomes easy to support dispatching from the sw queue in the following patch.
> No functional change.
What is this huge series of 23 patches?!
On Mon, Jul 23, 2018 at 08:31:31AM -0600, Jens Axboe wrote:
> On 7/23/18 7:24 AM, Jack Wang wrote:
>> From: Ming Lei ming.lei@redhat.com
>> commit caf8eb0d604a0eaeb8111eb4d36853a6d08eebe7 upstream
>> So that it becomes easy to support dispatching from the sw queue in the following patch.
>> No functional change.
> What is this huge series of 23 patches?!
I already complained that you were not cc:ed on the 00/23 email :)
greg k-h
From: Ming Lei ming.lei@redhat.com
commit de1482974080ec9ef414bf048b2646b246b63f6e upstream
For SCSI devices, there is often a per-request-queue depth, which needs to be respected before queuing one request.
Currently blk-mq always dequeues the request first, then calls .queue_rq() to dispatch the request to the LLD. One obvious issue with this approach is that I/O merging may not be successful, because when the per-request-queue depth can't be respected, .queue_rq() has to return BLK_STS_RESOURCE, and the request then has to stay on the hctx->dispatch list. This means it never gets a chance to be merged with other IO.
This patch introduces .get_budget and .put_budget callbacks in blk_mq_ops, so we can try to reserve budget before dequeuing a request. If the budget for queueing I/O can't be satisfied, we don't need to dequeue the request at all. Hence the request can be left in the IO scheduler queue, giving it more merging opportunities.
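As an illustration only (the my_dev structure, the inflight counter and the my_* names below are made up for this sketch, and a later patch in this series changes .get_budget() to return bool), a driver could wire up the callbacks roughly like this:

#include <linux/blk-mq.h>
#include <linux/atomic.h>

/* Hypothetical per-queue state, just for this sketch. */
struct my_dev {
	atomic_t inflight;	/* budget units currently held */
	int queue_depth;	/* per-request-queue depth to respect */
};

static blk_status_t my_get_budget(struct blk_mq_hw_ctx *hctx)
{
	struct my_dev *dev = hctx->queue->queuedata;

	/* Reserve one unit of budget before a request is dequeued. */
	if (atomic_inc_return(&dev->inflight) > dev->queue_depth) {
		atomic_dec(&dev->inflight);
		return BLK_STS_RESOURCE;
	}
	return BLK_STS_OK;
}

static void my_put_budget(struct blk_mq_hw_ctx *hctx)
{
	struct my_dev *dev = hctx->queue->queuedata;

	/* Drop the reservation if the request never reached .queue_rq(). */
	atomic_dec(&dev->inflight);
}

/* The .queue_rq handler is assumed to exist; it is not shown here. */
static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd);

static const struct blk_mq_ops my_mq_ops = {
	.queue_rq	= my_queue_rq,
	.get_budget	= my_get_budget,
	.put_budget	= my_put_budget,
};

SCSI's real implementation of these callbacks follows later in this series.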
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 block/blk-mq-sched.c   | 55 +++++++++++++++++++++++++++++++++++++++++---------
 block/blk-mq-sched.h   |  2 +-
 block/blk-mq.c         | 43 ++++++++++++++++++++++++++++++-----
 block/blk-mq.h         | 20 +++++++++++++++++-
 include/linux/blk-mq.h | 11 ++++++++++
 5 files changed, 114 insertions(+), 17 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index be29ba849408..8e525e66a0d9 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -89,31 +89,57 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx) return false; }
-static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) +/* return true if hctx need to run again */ +static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) { struct request_queue *q = hctx->queue; struct elevator_queue *e = q->elevator; LIST_HEAD(rq_list);
do { - struct request *rq = e->type->ops.mq.dispatch_request(hctx); + struct request *rq; + blk_status_t ret;
- if (!rq) + if (e->type->ops.mq.has_work && + !e->type->ops.mq.has_work(hctx)) break; + + ret = blk_mq_get_dispatch_budget(hctx); + if (ret == BLK_STS_RESOURCE) + return true; + + rq = e->type->ops.mq.dispatch_request(hctx); + if (!rq) { + blk_mq_put_dispatch_budget(hctx); + break; + } else if (ret != BLK_STS_OK) { + blk_mq_end_request(rq, ret); + continue; + } + + /* + * Now this rq owns the budget which has to be released + * if this rq won't be queued to driver via .queue_rq() + * in blk_mq_dispatch_rq_list(). + */ list_add(&rq->queuelist, &rq_list); - } while (blk_mq_dispatch_rq_list(q, &rq_list)); + } while (blk_mq_dispatch_rq_list(q, &rq_list, true)); + + return false; }
-void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) +/* return true if hw queue need to be run again */ +bool blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) { struct request_queue *q = hctx->queue; struct elevator_queue *e = q->elevator; const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request; LIST_HEAD(rq_list); + bool run_queue = false;
/* RCU or SRCU read lock is needed before checking quiesced flag */ if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))) - return; + return false;
hctx->run++;
@@ -143,14 +169,23 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) */ if (!list_empty(&rq_list)) { blk_mq_sched_mark_restart_hctx(hctx); - if (blk_mq_dispatch_rq_list(q, &rq_list) && has_sched_dispatch) - blk_mq_do_dispatch_sched(hctx); + if (blk_mq_dispatch_rq_list(q, &rq_list, false) && + has_sched_dispatch) + run_queue = blk_mq_do_dispatch_sched(hctx); } else if (has_sched_dispatch) { - blk_mq_do_dispatch_sched(hctx); + run_queue = blk_mq_do_dispatch_sched(hctx); } else { blk_mq_flush_busy_ctxs(hctx, &rq_list); - blk_mq_dispatch_rq_list(q, &rq_list); + blk_mq_dispatch_rq_list(q, &rq_list, false); } + + if (run_queue && !blk_mq_sched_needs_restart(hctx) && + !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) { + blk_mq_sched_mark_restart_hctx(hctx); + return true; + } + + return false; }
bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio, diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h index ba1d1418a96d..1ccfb8027cfc 100644 --- a/block/blk-mq-sched.h +++ b/block/blk-mq-sched.h @@ -23,7 +23,7 @@ void blk_mq_sched_insert_requests(struct request_queue *q, struct blk_mq_ctx *ctx, struct list_head *list, bool run_queue_async);
-void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx); +bool blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e); void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e); diff --git a/block/blk-mq.c b/block/blk-mq.c index 49979c095f31..ad5ae7192a64 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1037,7 +1037,8 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx) return true; }
-bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list) +bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, + bool got_budget) { struct blk_mq_hw_ctx *hctx; struct request *rq; @@ -1046,6 +1047,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list) if (list_empty(list)) return false;
+ WARN_ON(!list_is_singular(list) && got_budget); + /* * Now process all the entries, sending them to the driver. */ @@ -1063,16 +1066,30 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list) * The initial allocation attempt failed, so we need to * rerun the hardware queue when a tag is freed. */ - if (!blk_mq_dispatch_wait_add(hctx)) + if (!blk_mq_dispatch_wait_add(hctx)) { + if (got_budget) + blk_mq_put_dispatch_budget(hctx); break; + }
/* * It's possible that a tag was freed in the window * between the allocation failure and adding the * hardware queue to the wait queue. */ - if (!blk_mq_get_driver_tag(rq, &hctx, false)) + if (!blk_mq_get_driver_tag(rq, &hctx, false)) { + if (got_budget) + blk_mq_put_dispatch_budget(hctx); + break; + } + } + + if (!got_budget) { + ret = blk_mq_get_dispatch_budget(hctx); + if (ret == BLK_STS_RESOURCE) break; + if (ret != BLK_STS_OK) + goto fail_rq; }
list_del_init(&rq->queuelist); @@ -1100,6 +1117,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list) break; }
+ fail_rq: if (unlikely(ret != BLK_STS_OK)) { errors++; blk_mq_end_request(rq, BLK_STS_IOERR); @@ -1158,6 +1176,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list) static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx) { int srcu_idx; + bool run_queue;
/* * We should be running this queue from one of the CPUs that @@ -1192,15 +1211,18 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
if (!(hctx->flags & BLK_MQ_F_BLOCKING)) { rcu_read_lock(); - blk_mq_sched_dispatch_requests(hctx); + run_queue = blk_mq_sched_dispatch_requests(hctx); rcu_read_unlock(); } else { might_sleep();
srcu_idx = srcu_read_lock(hctx->queue_rq_srcu); - blk_mq_sched_dispatch_requests(hctx); + run_queue = blk_mq_sched_dispatch_requests(hctx); srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx); } + + if (run_queue) + blk_mq_run_hw_queue(hctx, true); }
/* @@ -1593,6 +1615,13 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, if (!blk_mq_get_driver_tag(rq, NULL, false)) goto insert;
+ ret = blk_mq_get_dispatch_budget(hctx); + if (ret == BLK_STS_RESOURCE) { + blk_mq_put_driver_tag(rq); + goto insert; + } else if (ret != BLK_STS_OK) + goto fail_rq; + new_cookie = request_to_qc_t(hctx, rq);
/* @@ -1609,6 +1638,7 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, __blk_mq_requeue_request(rq); goto insert; default: + fail_rq: *cookie = BLK_QC_T_NONE; blk_mq_end_request(rq, ret); return; @@ -2615,6 +2645,9 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set) if (!set->ops->queue_rq) return -EINVAL;
+ if (!set->ops->get_budget ^ !set->ops->put_budget) + return -EINVAL; + if (set->queue_depth > BLK_MQ_MAX_DEPTH) { pr_info("blk-mq: reduced tag depth to %u\n", BLK_MQ_MAX_DEPTH); diff --git a/block/blk-mq.h b/block/blk-mq.h index 877237e09083..0c8ae91a7b5b 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -31,7 +31,7 @@ void blk_mq_freeze_queue(struct request_queue *q); void blk_mq_free_queue(struct request_queue *q); int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr); void blk_mq_wake_waiters(struct request_queue *q); -bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *); +bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool); void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list); bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx); bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx, @@ -140,4 +140,22 @@ void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part, void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part, unsigned int inflight[2]);
+static inline void blk_mq_put_dispatch_budget(struct blk_mq_hw_ctx *hctx) +{ + struct request_queue *q = hctx->queue; + + if (q->mq_ops->put_budget) + q->mq_ops->put_budget(hctx); +} + +static inline blk_status_t blk_mq_get_dispatch_budget( + struct blk_mq_hw_ctx *hctx) +{ + struct request_queue *q = hctx->queue; + + if (q->mq_ops->get_budget) + return q->mq_ops->get_budget(hctx); + return BLK_STS_OK; +} + #endif diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 994cbb0f7ffc..b7da1ded07fc 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -91,6 +91,8 @@ struct blk_mq_queue_data {
typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *, const struct blk_mq_queue_data *); +typedef blk_status_t (get_budget_fn)(struct blk_mq_hw_ctx *); +typedef void (put_budget_fn)(struct blk_mq_hw_ctx *); typedef enum blk_eh_timer_return (timeout_fn)(struct request *, bool); typedef int (init_hctx_fn)(struct blk_mq_hw_ctx *, void *, unsigned int); typedef void (exit_hctx_fn)(struct blk_mq_hw_ctx *, unsigned int); @@ -113,6 +115,15 @@ struct blk_mq_ops { queue_rq_fn *queue_rq;
/* + * Reserve budget before queue request, once .queue_rq is + * run, it is driver's responsibility to release the + * reserved budget. Also we have to handle failure case + * of .get_budget for avoiding I/O deadlock. + */ + get_budget_fn *get_budget; + put_budget_fn *put_budget; + + /* * Called on request timeout */ timeout_fn *timeout;
From: Ming Lei ming.lei@redhat.com
commit 7930d0a00ff5dbcc80f793d1a7a6b8de4e591f1a upstream
For blk-mq, we need to be able to iterate software queues starting from any queue in a round robin fashion, so introduce this helper.
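As a usage sketch (only __sbitmap_for_each_set() and the sb_for_each_fn signature come from this patch; the callback and the first_set structure are made up for illustration), iteration can start at an arbitrary bit and wraps around to cover the whole bitmap:

#include <linux/kernel.h>
#include <linux/sbitmap.h>

struct first_set {
	unsigned int nr;
	bool found;
};

/* Record the first set bit we see and stop iterating. */
static bool find_first_cb(struct sbitmap *sb, unsigned int bitnr, void *data)
{
	struct first_set *fs = data;

	fs->nr = bitnr;
	fs->found = true;
	return false;	/* returning false breaks out of the iteration */
}

/*
 * Find the first set bit at or after 'start', wrapping around to the
 * beginning of the bitmap, which is the behaviour blk-mq's round-robin
 * dispatch from software queues relies on.
 */
static unsigned int find_set_from(struct sbitmap *sb, unsigned int start)
{
	struct first_set fs = { .found = false };

	__sbitmap_for_each_set(sb, start, find_first_cb, &fs);

	return fs.found ? fs.nr : UINT_MAX;
}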
Reviewed-by: Omar Sandoval osandov@fb.com
Cc: Omar Sandoval osandov@fb.com
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 include/linux/sbitmap.h | 64 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 47 insertions(+), 17 deletions(-)
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h index a1904aadbc45..0dcc60e820de 100644 --- a/include/linux/sbitmap.h +++ b/include/linux/sbitmap.h @@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb); */ bool sbitmap_any_bit_clear(const struct sbitmap *sb);
+#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift) +#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U)) + typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
/** - * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap. + * __sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap. + * @start: Where to start the iteration. * @sb: Bitmap to iterate over. * @fn: Callback. Should return true to continue or false to break early. * @data: Pointer to pass to callback. @@ -222,35 +226,61 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *); * This is inline even though it's non-trivial so that the function calls to the * callback will hopefully get optimized away. */ -static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn, - void *data) +static inline void __sbitmap_for_each_set(struct sbitmap *sb, + unsigned int start, + sb_for_each_fn fn, void *data) { - unsigned int i; + unsigned int index; + unsigned int nr; + unsigned int scanned = 0;
- for (i = 0; i < sb->map_nr; i++) { - struct sbitmap_word *word = &sb->map[i]; - unsigned int off, nr; + if (start >= sb->depth) + start = 0; + index = SB_NR_TO_INDEX(sb, start); + nr = SB_NR_TO_BIT(sb, start);
- if (!word->word) - continue; + while (scanned < sb->depth) { + struct sbitmap_word *word = &sb->map[index]; + unsigned int depth = min_t(unsigned int, word->depth - nr, + sb->depth - scanned);
- nr = 0; - off = i << sb->shift; + scanned += depth; + if (!word->word) + goto next; + + /* + * On the first iteration of the outer loop, we need to add the + * bit offset back to the size of the word for find_next_bit(). + * On all other iterations, nr is zero, so this is a noop. + */ + depth += nr; while (1) { - nr = find_next_bit(&word->word, word->depth, nr); - if (nr >= word->depth) + nr = find_next_bit(&word->word, depth, nr); + if (nr >= depth) break; - - if (!fn(sb, off + nr, data)) + if (!fn(sb, (index << sb->shift) + nr, data)) return;
nr++; } +next: + nr = 0; + if (++index >= sb->map_nr) + index = 0; } }
-#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift) -#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U)) +/** + * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap. + * @sb: Bitmap to iterate over. + * @fn: Callback. Should return true to continue or false to break early. + * @data: Pointer to pass to callback. + */ +static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn, + void *data) +{ + __sbitmap_for_each_set(sb, 0, fn, data); +}
static inline unsigned long *__sbitmap_word(struct sbitmap *sb, unsigned int bitnr)
From: Ming Lei ming.lei@redhat.com
commit b347689ffbca745ac457ee27400ce1affd571c6f upstream
SCSI devices use a host-wide tagset, and the shared driver tag space is often quite big. However, there is also a queue depth for each LUN (.cmd_per_lun), which is often small; for example, on both lpfc and qla2xxx, .cmd_per_lun is just 3.
So lots of requests may stay in the sw queue, and we always flush all of those belonging to the same hw queue and dispatch them all to the driver. Unfortunately it is easy to hit a busy queue because of the small .cmd_per_lun. Once these requests are flushed out, they have to stay in hctx->dispatch, no bio merge can happen on them, and sequential IO performance is harmed.
This patch introduces blk_mq_dequeue_from_ctx for dequeuing a request from a sw queue, so that we can dispatch requests in the scheduler's way. We can then avoid dequeuing too many requests from the sw queue, since we don't flush ->dispatch completely.
This patch improves dispatching from sw queue by using the .get_budget and .put_budget callbacks.
Reviewed-by: Omar Sandoval osandov@fb.com
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 block/blk-mq-sched.c   | 74 ++++++++++++++++++++++++++++++++++++++++++++++++--
 block/blk-mq.c         | 39 ++++++++++++++++++++++++++
 block/blk-mq.h         |  2 ++
 include/linux/blk-mq.h |  2 ++
 4 files changed, 114 insertions(+), 3 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 8e525e66a0d9..df8581bb0a37 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -128,6 +128,61 @@ static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) return false; }
+static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx, + struct blk_mq_ctx *ctx) +{ + unsigned idx = ctx->index_hw; + + if (++idx == hctx->nr_ctx) + idx = 0; + + return hctx->ctxs[idx]; +} + +/* return true if hctx need to run again */ +static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx) +{ + struct request_queue *q = hctx->queue; + LIST_HEAD(rq_list); + struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from); + + do { + struct request *rq; + blk_status_t ret; + + if (!sbitmap_any_bit_set(&hctx->ctx_map)) + break; + + ret = blk_mq_get_dispatch_budget(hctx); + if (ret == BLK_STS_RESOURCE) + return true; + + rq = blk_mq_dequeue_from_ctx(hctx, ctx); + if (!rq) { + blk_mq_put_dispatch_budget(hctx); + break; + } else if (ret != BLK_STS_OK) { + blk_mq_end_request(rq, ret); + continue; + } + + /* + * Now this rq owns the budget which has to be released + * if this rq won't be queued to driver via .queue_rq() + * in blk_mq_dispatch_rq_list(). + */ + list_add(&rq->queuelist, &rq_list); + + /* round robin for fair dispatch */ + ctx = blk_mq_next_ctx(hctx, rq->mq_ctx); + + } while (blk_mq_dispatch_rq_list(q, &rq_list, true)); + + WRITE_ONCE(hctx->dispatch_from, ctx); + + return false; +} + /* return true if hw queue need to be run again */ bool blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) { @@ -169,11 +224,24 @@ bool blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) */ if (!list_empty(&rq_list)) { blk_mq_sched_mark_restart_hctx(hctx); - if (blk_mq_dispatch_rq_list(q, &rq_list, false) && - has_sched_dispatch) - run_queue = blk_mq_do_dispatch_sched(hctx); + if (blk_mq_dispatch_rq_list(q, &rq_list, false)) { + if (has_sched_dispatch) + run_queue = blk_mq_do_dispatch_sched(hctx); + else + run_queue = blk_mq_do_dispatch_ctx(hctx); + } } else if (has_sched_dispatch) { run_queue = blk_mq_do_dispatch_sched(hctx); + } else if (q->mq_ops->get_budget) { + /* + * If we need to get budget before queuing request, we + * dequeue request one by one from sw queue for avoiding + * to mess up I/O merge when dispatch runs out of resource. + * + * TODO: get more budgets, and dequeue more requests in + * one time. + */ + run_queue = blk_mq_do_dispatch_ctx(hctx); } else { blk_mq_flush_busy_ctxs(hctx, &rq_list); blk_mq_dispatch_rq_list(q, &rq_list, false); diff --git a/block/blk-mq.c b/block/blk-mq.c index ad5ae7192a64..3bc6de8b82c0 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -903,6 +903,45 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list) } EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
+struct dispatch_rq_data { + struct blk_mq_hw_ctx *hctx; + struct request *rq; +}; + +static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, + void *data) +{ + struct dispatch_rq_data *dispatch_data = data; + struct blk_mq_hw_ctx *hctx = dispatch_data->hctx; + struct blk_mq_ctx *ctx = hctx->ctxs[bitnr]; + + spin_lock(&ctx->lock); + if (unlikely(!list_empty(&ctx->rq_list))) { + dispatch_data->rq = list_entry_rq(ctx->rq_list.next); + list_del_init(&dispatch_data->rq->queuelist); + if (list_empty(&ctx->rq_list)) + sbitmap_clear_bit(sb, bitnr); + } + spin_unlock(&ctx->lock); + + return !dispatch_data->rq; +} + +struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx, + struct blk_mq_ctx *start) +{ + unsigned off = start ? start->index_hw : 0; + struct dispatch_rq_data data = { + .hctx = hctx, + .rq = NULL, + }; + + __sbitmap_for_each_set(&hctx->ctx_map, off, + dispatch_rq_from_ctx, &data); + + return data.rq; +} + static inline unsigned int queued_to_index(unsigned int queued) { if (!queued) diff --git a/block/blk-mq.h b/block/blk-mq.h index 0c8ae91a7b5b..ef0f7c979f64 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -36,6 +36,8 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list); bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx); bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx, bool wait); +struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx, + struct blk_mq_ctx *start);
/* * Internal helpers for allocating/freeing the request map diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index b7da1ded07fc..03439f10464f 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -31,6 +31,8 @@ struct blk_mq_hw_ctx {
struct sbitmap ctx_map;
+ struct blk_mq_ctx *dispatch_from; + struct blk_mq_ctx **ctxs; unsigned int nr_ctx;
From: Ming Lei ming.lei@redhat.com
commit aeec77629a4ac6f8c248f3a82e80d4170a881f22 upstream
In the following patch, we will implement scsi_get_budget(), which needs to call scsi_prep_state_check() when the rq isn't dequeued yet.
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 drivers/scsi/scsi_lib.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index bfd8f12d4e9a..0ecc74421b1f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1331,7 +1331,7 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 		/*
 		 * If the devices is blocked we defer normal commands.
 		 */
-		if (!(req->rq_flags & RQF_PREEMPT))
+		if (req && !(req->rq_flags & RQF_PREEMPT))
 			ret = BLKPREP_DEFER;
 		break;
 	default:
@@ -1340,7 +1340,7 @@ scsi_prep_state_check(struct scsi_device *sdev, struct request *req)
 		 * special commands. In particular any user initiated
 		 * command is not allowed.
 		 */
-		if (!(req->rq_flags & RQF_PREEMPT))
+		if (req && !(req->rq_flags & RQF_PREEMPT))
 			ret = BLKPREP_KILL;
 		break;
 	}
From: Ming Lei ming.lei@redhat.com
commit 0df21c86bdbfd17dec9ab898312af9bfb74d5d86 upstream
We need to tell blk-mq to reserve resources before queuing one request, so implement these two callbacks. Then blk-mq can avoid dequeuing requests too early, and IO merging can be improved a lot.
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
[jwang: fix conflict in scsi_lib.c]
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 drivers/scsi/scsi_lib.c | 75 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 52 insertions(+), 23 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 0ecc74421b1f..c3f9d6d7a724 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1970,25 +1970,32 @@ static void scsi_mq_done(struct scsi_cmnd *cmd) blk_mq_complete_request(cmd->request); }
-static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, - const struct blk_mq_queue_data *bd) +static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx) { - struct request *req = bd->rq; - struct request_queue *q = req->q; + struct request_queue *q = hctx->queue; + struct scsi_device *sdev = q->queuedata; + struct Scsi_Host *shost = sdev->host; + + scsi_dec_host_busy(shost); + if (scsi_target(sdev)->can_queue > 0) + atomic_dec(&scsi_target(sdev)->target_busy); + atomic_dec(&sdev->device_busy); + put_device(&sdev->sdev_gendev); +} + +static blk_status_t scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx) +{ + struct request_queue *q = hctx->queue; struct scsi_device *sdev = q->queuedata; struct Scsi_Host *shost = sdev->host; - struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req); blk_status_t ret; - int reason;
- ret = prep_to_mq(scsi_prep_state_check(sdev, req)); - if (ret != BLK_STS_OK) - goto out; + ret = prep_to_mq(scsi_prep_state_check(sdev, NULL)); + if (ret == BLK_STS_RESOURCE || ret != BLK_STS_OK) + return ret;
- ret = BLK_STS_RESOURCE; if (!get_device(&sdev->sdev_gendev)) goto out; - if (!scsi_dev_queue_ready(q, sdev)) goto out_put_device; if (!scsi_target_queue_ready(shost, sdev)) @@ -1996,10 +2003,38 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, if (!scsi_host_queue_ready(q, shost, sdev)) goto out_dec_target_busy;
+ return BLK_STS_OK; + +out_dec_target_busy: + if (scsi_target(sdev)->can_queue > 0) + atomic_dec(&scsi_target(sdev)->target_busy); +out_dec_device_busy: + atomic_dec(&sdev->device_busy); +out_put_device: + put_device(&sdev->sdev_gendev); +out: + return BLK_STS_RESOURCE; +} + +static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, + const struct blk_mq_queue_data *bd) +{ + struct request *req = bd->rq; + struct request_queue *q = req->q; + struct scsi_device *sdev = q->queuedata; + struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req); + blk_status_t ret; + int reason; + + ret = prep_to_mq(scsi_prep_state_check(sdev, req)); + if (ret != BLK_STS_OK) + goto out_put_budget; + + ret = BLK_STS_RESOURCE; if (!(req->rq_flags & RQF_DONTPREP)) { ret = prep_to_mq(scsi_mq_prep_fn(req)); if (ret != BLK_STS_OK) - goto out_dec_host_busy; + goto out_put_budget; req->rq_flags |= RQF_DONTPREP; } else { blk_mq_start_request(req); @@ -2017,21 +2052,13 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, if (reason) { scsi_set_blocked(cmd, reason); ret = BLK_STS_RESOURCE; - goto out_dec_host_busy; + goto out_put_budget; }
return BLK_STS_OK;
-out_dec_host_busy: - scsi_dec_host_busy(shost); -out_dec_target_busy: - if (scsi_target(sdev)->can_queue > 0) - atomic_dec(&scsi_target(sdev)->target_busy); -out_dec_device_busy: - atomic_dec(&sdev->device_busy); -out_put_device: - put_device(&sdev->sdev_gendev); -out: +out_put_budget: + scsi_mq_put_budget(hctx); switch (ret) { case BLK_STS_OK: break; @@ -2237,6 +2264,8 @@ struct request_queue *scsi_old_alloc_queue(struct scsi_device *sdev) }
static const struct blk_mq_ops scsi_mq_ops = { + .get_budget = scsi_mq_get_budget, + .put_budget = scsi_mq_put_budget, .queue_rq = scsi_queue_rq, .complete = scsi_softirq_done, .timeout = scsi_timeout,
From: Ming Lei ming.lei@redhat.com
commit 826a70a08b1210bbfdbda812ab43eb986e25b5c2 upstream
It is very expensive to atomic_inc/atomic_dec the host-wide counter host->busy_count, and it should have been avoided via blk-mq's mechanism of getting a driver tag, which uses the more efficient sbitmap queue.
Also, we don't check atomic_read(&sdev->device_busy) in scsi_mq_get_budget() and don't run the queue if the counter becomes zero, so an IO hang may be caused if all requests complete just before the current SCSI device is added to shost->starved_list.
Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
Reported-by: Bart Van Assche bart.vanassche@wdc.com
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
[jwang: fix conflict in scsi_lib.c]
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 drivers/scsi/scsi_lib.c | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index c3f9d6d7a724..51265d2cfbb4 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1974,11 +1974,7 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx) { struct request_queue *q = hctx->queue; struct scsi_device *sdev = q->queuedata; - struct Scsi_Host *shost = sdev->host;
- scsi_dec_host_busy(shost); - if (scsi_target(sdev)->can_queue > 0) - atomic_dec(&scsi_target(sdev)->target_busy); atomic_dec(&sdev->device_busy); put_device(&sdev->sdev_gendev); } @@ -1987,7 +1983,6 @@ static blk_status_t scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx) { struct request_queue *q = hctx->queue; struct scsi_device *sdev = q->queuedata; - struct Scsi_Host *shost = sdev->host; blk_status_t ret;
ret = prep_to_mq(scsi_prep_state_check(sdev, NULL)); @@ -1998,18 +1993,9 @@ static blk_status_t scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx) goto out; if (!scsi_dev_queue_ready(q, sdev)) goto out_put_device; - if (!scsi_target_queue_ready(shost, sdev)) - goto out_dec_device_busy; - if (!scsi_host_queue_ready(q, shost, sdev)) - goto out_dec_target_busy;
return BLK_STS_OK;
-out_dec_target_busy: - if (scsi_target(sdev)->can_queue > 0) - atomic_dec(&scsi_target(sdev)->target_busy); -out_dec_device_busy: - atomic_dec(&sdev->device_busy); out_put_device: put_device(&sdev->sdev_gendev); out: @@ -2022,6 +2008,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req = bd->rq; struct request_queue *q = req->q; struct scsi_device *sdev = q->queuedata; + struct Scsi_Host *shost = sdev->host; struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req); blk_status_t ret; int reason; @@ -2031,10 +2018,15 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, goto out_put_budget;
ret = BLK_STS_RESOURCE; + if (!scsi_target_queue_ready(shost, sdev)) + goto out_put_budget; + if (!scsi_host_queue_ready(q, shost, sdev)) + goto out_dec_target_busy; + if (!(req->rq_flags & RQF_DONTPREP)) { ret = prep_to_mq(scsi_mq_prep_fn(req)); if (ret != BLK_STS_OK) - goto out_put_budget; + goto out_dec_host_busy; req->rq_flags |= RQF_DONTPREP; } else { blk_mq_start_request(req); @@ -2052,11 +2044,16 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, if (reason) { scsi_set_blocked(cmd, reason); ret = BLK_STS_RESOURCE; - goto out_put_budget; + goto out_dec_host_busy; }
return BLK_STS_OK;
+out_dec_host_busy: + scsi_dec_host_busy(shost); +out_dec_target_busy: + if (scsi_target(sdev)->can_queue > 0) + atomic_dec(&scsi_target(sdev)->target_busy); out_put_budget: scsi_mq_put_budget(hctx); switch (ret) {
From: Ming Lei ming.lei@redhat.com
commit 358a3a6bccb74da9d63a26b2dd5f09f1e9970e0b upstream
Now restart is used in the following cases, and TAG_SHARED is for SCSI only.
1) .get_budget() returns BLK_STS_RESOURCE
- if the resource at target/host level isn't satisfied, this SCSI device will be added to shost->starved_list, and the whole queue will be rerun (via SCSI's built-in RESTART) in scsi_end_request() after any request initiated from this host/target is completed. Not to mention that host-level resources can't be an issue for blk-mq at all.
- the same is true if the resource at the queue level isn't satisfied.
- if there is no outstanding request on this queue, then SCSI's RESTART can't work (blk-mq's can't work either), the queue will be run after SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by SCSI's RESTART when this request is finished.
2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
- if there is no in-progress request on this queue, the queue will be run after SCSI_QUEUE_DELAY.
- otherwise, SCSI's RESTART covers the rerun.
3) blk_mq_get_driver_tag() fails
- BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver tag allocation.
In short, SCSI's built-in RESTART is enough to cover the queue rerun, and we don't need to pay special attention to TAG_SHARED wrt. restart.
In my test on scsi_debug (8 LUNs), this patch improves IOPS by 20% ~ 30% when running I/O on these 8 LUNs concurrently.
Also, Roman Pen reported that the current RESTART is very expensive, especially when there are lots of LUNs attached to one host; in his test, RESTART cut IOPS in half.
Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2
Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 block/blk-mq-sched.c | 78 +++-------------------------------------------------
 1 file changed, 4 insertions(+), 74 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index df8581bb0a37..daab27feb653 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -68,25 +68,17 @@ static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx) set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state); }
-static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx) +void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) { if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) - return false; - - if (hctx->flags & BLK_MQ_F_TAG_SHARED) { - struct request_queue *q = hctx->queue; + return;
- if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) - atomic_dec(&q->shared_hctx_restart); - } else - clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state); + clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
if (blk_mq_hctx_has_pending(hctx)) { blk_mq_run_hw_queue(hctx, true); - return true; + return; } - - return false; }
/* return true if hctx need to run again */ @@ -385,68 +377,6 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, return true; }
-/** - * list_for_each_entry_rcu_rr - iterate in a round-robin fashion over rcu list - * @pos: loop cursor. - * @skip: the list element that will not be examined. Iteration starts at - * @skip->next. - * @head: head of the list to examine. This list must have at least one - * element, namely @skip. - * @member: name of the list_head structure within typeof(*pos). - */ -#define list_for_each_entry_rcu_rr(pos, skip, head, member) \ - for ((pos) = (skip); \ - (pos = (pos)->member.next != (head) ? list_entry_rcu( \ - (pos)->member.next, typeof(*pos), member) : \ - list_entry_rcu((pos)->member.next->next, typeof(*pos), member)), \ - (pos) != (skip); ) - -/* - * Called after a driver tag has been freed to check whether a hctx needs to - * be restarted. Restarts @hctx if its tag set is not shared. Restarts hardware - * queues in a round-robin fashion if the tag set of @hctx is shared with other - * hardware queues. - */ -void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx) -{ - struct blk_mq_tags *const tags = hctx->tags; - struct blk_mq_tag_set *const set = hctx->queue->tag_set; - struct request_queue *const queue = hctx->queue, *q; - struct blk_mq_hw_ctx *hctx2; - unsigned int i, j; - - if (set->flags & BLK_MQ_F_TAG_SHARED) { - /* - * If this is 0, then we know that no hardware queues - * have RESTART marked. We're done. - */ - if (!atomic_read(&queue->shared_hctx_restart)) - return; - - rcu_read_lock(); - list_for_each_entry_rcu_rr(q, queue, &set->tag_list, - tag_set_list) { - queue_for_each_hw_ctx(q, hctx2, i) - if (hctx2->tags == tags && - blk_mq_sched_restart_hctx(hctx2)) - goto done; - } - j = hctx->queue_num + 1; - for (i = 0; i < queue->nr_hw_queues; i++, j++) { - if (j == queue->nr_hw_queues) - j = 0; - hctx2 = queue->queue_hw_ctx[j]; - if (hctx2->tags == tags && - blk_mq_sched_restart_hctx(hctx2)) - break; - } -done: - rcu_read_unlock(); - } else { - blk_mq_sched_restart_hctx(hctx); - } -} - /* * Add flush/fua to the queue. If we fail getting a driver tag, then * punt to the requeue list. Requeue will re-invoke us from a context
From: Ming Lei ming.lei@redhat.com
commit 1f460b63d4b37f504d8d0affc2cd492eb005ea97 upstream
SCSI restarts its queue in scsi_end_request() automatically, so we don't need to handle this case in blk-mq.
Especially since no request will be dequeued in this case, we needn't worry about an IO hang caused by restart vs. dispatch.
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 block/blk-mq-sched.c | 45 ++++++++++++++++++++-------------------------
 block/blk-mq-sched.h |  2 +-
 block/blk-mq.c       |  8 ++------
 3 files changed, 23 insertions(+), 32 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index daab27feb653..7775f6b12fa9 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -81,8 +81,12 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) } }
-/* return true if hctx need to run again */ -static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) +/* + * Only SCSI implements .get_budget and .put_budget, and SCSI restarts + * its queue by itself in its completion handler, so we don't need to + * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE. + */ +static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) { struct request_queue *q = hctx->queue; struct elevator_queue *e = q->elevator; @@ -98,7 +102,7 @@ static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
ret = blk_mq_get_dispatch_budget(hctx); if (ret == BLK_STS_RESOURCE) - return true; + break;
rq = e->type->ops.mq.dispatch_request(hctx); if (!rq) { @@ -116,8 +120,6 @@ static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx) */ list_add(&rq->queuelist, &rq_list); } while (blk_mq_dispatch_rq_list(q, &rq_list, true)); - - return false; }
static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx, @@ -131,8 +133,12 @@ static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx, return hctx->ctxs[idx]; }
-/* return true if hctx need to run again */ -static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx) +/* + * Only SCSI implements .get_budget and .put_budget, and SCSI restarts + * its queue by itself in its completion handler, so we don't need to + * restart queue if .get_budget() returns BLK_STS_NO_RESOURCE. + */ +static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx) { struct request_queue *q = hctx->queue; LIST_HEAD(rq_list); @@ -147,7 +153,7 @@ static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
ret = blk_mq_get_dispatch_budget(hctx); if (ret == BLK_STS_RESOURCE) - return true; + break;
rq = blk_mq_dequeue_from_ctx(hctx, ctx); if (!rq) { @@ -171,22 +177,19 @@ static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx) } while (blk_mq_dispatch_rq_list(q, &rq_list, true));
WRITE_ONCE(hctx->dispatch_from, ctx); - - return false; }
/* return true if hw queue need to be run again */ -bool blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) +void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) { struct request_queue *q = hctx->queue; struct elevator_queue *e = q->elevator; const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request; LIST_HEAD(rq_list); - bool run_queue = false;
/* RCU or SRCU read lock is needed before checking quiesced flag */ if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))) - return false; + return;
hctx->run++;
@@ -218,12 +221,12 @@ bool blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) blk_mq_sched_mark_restart_hctx(hctx); if (blk_mq_dispatch_rq_list(q, &rq_list, false)) { if (has_sched_dispatch) - run_queue = blk_mq_do_dispatch_sched(hctx); + blk_mq_do_dispatch_sched(hctx); else - run_queue = blk_mq_do_dispatch_ctx(hctx); + blk_mq_do_dispatch_ctx(hctx); } } else if (has_sched_dispatch) { - run_queue = blk_mq_do_dispatch_sched(hctx); + blk_mq_do_dispatch_sched(hctx); } else if (q->mq_ops->get_budget) { /* * If we need to get budget before queuing request, we @@ -233,19 +236,11 @@ bool blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx) * TODO: get more budgets, and dequeue more requests in * one time. */ - run_queue = blk_mq_do_dispatch_ctx(hctx); + blk_mq_do_dispatch_ctx(hctx); } else { blk_mq_flush_busy_ctxs(hctx, &rq_list); blk_mq_dispatch_rq_list(q, &rq_list, false); } - - if (run_queue && !blk_mq_sched_needs_restart(hctx) && - !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) { - blk_mq_sched_mark_restart_hctx(hctx); - return true; - } - - return false; }
bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio, diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h index 1ccfb8027cfc..ba1d1418a96d 100644 --- a/block/blk-mq-sched.h +++ b/block/blk-mq-sched.h @@ -23,7 +23,7 @@ void blk_mq_sched_insert_requests(struct request_queue *q, struct blk_mq_ctx *ctx, struct list_head *list, bool run_queue_async);
-bool blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx); +void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx);
int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e); void blk_mq_exit_sched(struct request_queue *q, struct elevator_queue *e); diff --git a/block/blk-mq.c b/block/blk-mq.c index 3bc6de8b82c0..b9dac21f35f6 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1215,7 +1215,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx) { int srcu_idx; - bool run_queue;
/* * We should be running this queue from one of the CPUs that @@ -1250,18 +1249,15 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
if (!(hctx->flags & BLK_MQ_F_BLOCKING)) { rcu_read_lock(); - run_queue = blk_mq_sched_dispatch_requests(hctx); + blk_mq_sched_dispatch_requests(hctx); rcu_read_unlock(); } else { might_sleep();
srcu_idx = srcu_read_lock(hctx->queue_rq_srcu); - run_queue = blk_mq_sched_dispatch_requests(hctx); + blk_mq_sched_dispatch_requests(hctx); srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx); } - - if (run_queue) - blk_mq_run_hw_queue(hctx, true); }
/*
From: Ming Lei ming.lei@redhat.com
commit 88022d7201e96b43f1754b0358fc6bcd8dbdcde1 upstream
It is enough to just check if we can get the budget via .get_budget(), and we don't need to deal with device state changes in .get_budget().
For SCSI, one issue to be fixed is that we have to call scsi_mq_uninit_cmd() to free allocated resources if the SCSI device fails to handle the request, and it isn't enough to simply call blk_mq_end_request() to do that if this request is marked as RQF_DONTPREP.
Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq")
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 block/blk-mq-sched.c    | 14 ++------------
 block/blk-mq.c          | 17 ++++-------------
 block/blk-mq.h          |  5 ++---
 drivers/scsi/scsi_lib.c | 11 +++--------
 include/linux/blk-mq.h  |  2 +-
 5 files changed, 12 insertions(+), 37 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 7775f6b12fa9..13a27d4d1671 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -94,23 +94,18 @@ static void blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
do { struct request *rq; - blk_status_t ret;
if (e->type->ops.mq.has_work && !e->type->ops.mq.has_work(hctx)) break;
- ret = blk_mq_get_dispatch_budget(hctx); - if (ret == BLK_STS_RESOURCE) + if (!blk_mq_get_dispatch_budget(hctx)) break;
rq = e->type->ops.mq.dispatch_request(hctx); if (!rq) { blk_mq_put_dispatch_budget(hctx); break; - } else if (ret != BLK_STS_OK) { - blk_mq_end_request(rq, ret); - continue; }
/* @@ -146,22 +141,17 @@ static void blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
do { struct request *rq; - blk_status_t ret;
if (!sbitmap_any_bit_set(&hctx->ctx_map)) break;
- ret = blk_mq_get_dispatch_budget(hctx); - if (ret == BLK_STS_RESOURCE) + if (!blk_mq_get_dispatch_budget(hctx)) break;
rq = blk_mq_dequeue_from_ctx(hctx, ctx); if (!rq) { blk_mq_put_dispatch_budget(hctx); break; - } else if (ret != BLK_STS_OK) { - blk_mq_end_request(rq, ret); - continue; }
/* diff --git a/block/blk-mq.c b/block/blk-mq.c index b9dac21f35f6..94cf2c91f767 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1123,13 +1123,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, } }
- if (!got_budget) { - ret = blk_mq_get_dispatch_budget(hctx); - if (ret == BLK_STS_RESOURCE) - break; - if (ret != BLK_STS_OK) - goto fail_rq; - } + if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) + break;
list_del_init(&rq->queuelist);
@@ -1156,7 +1151,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, break; }
- fail_rq: if (unlikely(ret != BLK_STS_OK)) { errors++; blk_mq_end_request(rq, BLK_STS_IOERR); @@ -1650,12 +1644,10 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, if (!blk_mq_get_driver_tag(rq, NULL, false)) goto insert;
- ret = blk_mq_get_dispatch_budget(hctx); - if (ret == BLK_STS_RESOURCE) { + if (!blk_mq_get_dispatch_budget(hctx)) { blk_mq_put_driver_tag(rq); goto insert; - } else if (ret != BLK_STS_OK) - goto fail_rq; + }
new_cookie = request_to_qc_t(hctx, rq);
@@ -1673,7 +1665,6 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, __blk_mq_requeue_request(rq); goto insert; default: - fail_rq: *cookie = BLK_QC_T_NONE; blk_mq_end_request(rq, ret); return; diff --git a/block/blk-mq.h b/block/blk-mq.h index ef0f7c979f64..93dde7054a28 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -150,14 +150,13 @@ static inline void blk_mq_put_dispatch_budget(struct blk_mq_hw_ctx *hctx) q->mq_ops->put_budget(hctx); }
-static inline blk_status_t blk_mq_get_dispatch_budget( - struct blk_mq_hw_ctx *hctx) +static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx) { struct request_queue *q = hctx->queue;
if (q->mq_ops->get_budget) return q->mq_ops->get_budget(hctx); - return BLK_STS_OK; + return true; }
#endif diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 51265d2cfbb4..9e242583ab83 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1979,27 +1979,22 @@ static void scsi_mq_put_budget(struct blk_mq_hw_ctx *hctx) put_device(&sdev->sdev_gendev); }
-static blk_status_t scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx) +static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx) { struct request_queue *q = hctx->queue; struct scsi_device *sdev = q->queuedata; - blk_status_t ret; - - ret = prep_to_mq(scsi_prep_state_check(sdev, NULL)); - if (ret == BLK_STS_RESOURCE || ret != BLK_STS_OK) - return ret;
if (!get_device(&sdev->sdev_gendev)) goto out; if (!scsi_dev_queue_ready(q, sdev)) goto out_put_device;
- return BLK_STS_OK; + return true;
out_put_device: put_device(&sdev->sdev_gendev); out: - return BLK_STS_RESOURCE; + return false; }
static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 03439f10464f..4564bd216431 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -93,7 +93,7 @@ struct blk_mq_queue_data {
typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *, const struct blk_mq_queue_data *); -typedef blk_status_t (get_budget_fn)(struct blk_mq_hw_ctx *); +typedef bool (get_budget_fn)(struct blk_mq_hw_ctx *); typedef void (put_budget_fn)(struct blk_mq_hw_ctx *); typedef enum blk_eh_timer_return (timeout_fn)(struct request *, bool); typedef int (init_hctx_fn)(struct blk_mq_hw_ctx *, void *, unsigned int);
From: Jianchao Wang jianchao.w.wang@oracle.com
commit 6d6f167ce74158903e7fc20dfbecf89c71aa1c00 upstream
When freeing the driver tag of the next rq with an I/O scheduler configured, we get the first entry of the list. However, this can race with requeue of a request, and we end up getting the wrong request from the head of the list. Free the driver tag of the next rq before the failed one is requeued in the failure branch of the queue_rq callback.
Signed-off-by: Jianchao Wang jianchao.w.wang@oracle.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 block/blk-mq.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c index 94cf2c91f767..8bef73f9151d 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1080,7 +1080,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, bool got_budget) { struct blk_mq_hw_ctx *hctx; - struct request *rq; + struct request *rq, *nxt; int errors, queued;
if (list_empty(list)) @@ -1137,14 +1137,20 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, if (list_empty(list)) bd.last = true; else { - struct request *nxt; - nxt = list_first_entry(list, struct request, queuelist); bd.last = !blk_mq_get_driver_tag(nxt, NULL, false); }
ret = q->mq_ops->queue_rq(hctx, &bd); if (ret == BLK_STS_RESOURCE) { + /* + * If an I/O scheduler has been configured and we got a + * driver tag for the next request already, free it again. + */ + if (!list_empty(list)) { + nxt = list_first_entry(list, struct request, queuelist); + blk_mq_put_driver_tag(nxt); + } blk_mq_put_driver_tag_hctx(hctx, rq); list_add(&rq->queuelist, list); __blk_mq_requeue_request(rq); @@ -1167,13 +1173,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, * that is where we will continue on next queue run. */ if (!list_empty(list)) { - /* - * If an I/O scheduler has been configured and we got a driver - * tag for the next request already, free it again. - */ - rq = list_first_entry(list, struct request, queuelist); - blk_mq_put_driver_tag(rq); - spin_lock(&hctx->lock); list_splice_init(list, &hctx->dispatch); spin_unlock(&hctx->lock);
From: Ming Lei ming.lei@redhat.com
commit 9c71c83c857e7a84a5be5a56ea88da7098f51db8 upstream
blk_insert_flush() should only insert the request, since a run queue always follows it.
In the case of bypassing flush, we don't need to run the queue either, because every blk_insert_flush() is followed by one run queue.
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 block/blk-flush.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 4938bec8cfef..81bd1a843043 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -463,7 +463,7 @@ void blk_insert_flush(struct request *rq)
 	if ((policy & REQ_FSEQ_DATA) &&
 	    !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
 		if (q->mq_ops)
-			blk_mq_sched_insert_request(rq, false, true, false, false);
+			blk_mq_sched_insert_request(rq, false, false, false, false);
 		else
 			list_add_tail(&rq->queuelist, &q->queue_head);
 		return;
From: Ming Lei ming.lei@redhat.com
commit b0850297c749ea79a5717d597931366b3d7f4b09 upstream
Block flush needs this function without running the queue, so add a parameter controlling whether we run it or not.
Signed-off-by: Ming Lei ming.lei@redhat.com
Signed-off-by: Jens Axboe axboe@kernel.dk
Signed-off-by: Jack Wang jinpu.wang@profitbricks.com
---
 block/blk-core.c | 2 +-
 block/blk-mq.c   | 5 +++--
 block/blk-mq.h   | 2 +-
 3 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c index 68bae6338ad4..297e1745f099 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2364,7 +2364,7 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request * * bypass a potential scheduler on the bottom device for * insert. */ - blk_mq_request_bypass_insert(rq); + blk_mq_request_bypass_insert(rq, true); return BLK_STS_OK; }
diff --git a/block/blk-mq.c b/block/blk-mq.c index 8bef73f9151d..1f515346f94e 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1496,7 +1496,7 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, * Should only be used carefully, when the caller knows we want to * bypass a potential IO scheduler on the target device. */ -void blk_mq_request_bypass_insert(struct request *rq) +void blk_mq_request_bypass_insert(struct request *rq, bool run_queue) { struct blk_mq_ctx *ctx = rq->mq_ctx; struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu); @@ -1505,7 +1505,8 @@ void blk_mq_request_bypass_insert(struct request *rq) list_add_tail(&rq->queuelist, &hctx->dispatch); spin_unlock(&hctx->lock);
- blk_mq_run_hw_queue(hctx, false); + if (run_queue) + blk_mq_run_hw_queue(hctx, false); }
void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx, diff --git a/block/blk-mq.h b/block/blk-mq.h index 93dde7054a28..1819db23b038 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -57,7 +57,7 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags, */ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, bool at_head); -void blk_mq_request_bypass_insert(struct request *rq); +void blk_mq_request_bypass_insert(struct request *rq, bool run_queue); void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx, struct list_head *list);
From: Ming Lei ming.lei@redhat.com
commit 598906f814280762157629ba8833bf5cb11def74 upstream
In the following patch, we will use RQF_FLUSH_SEQ to decide:
1) if the flag isn't set, the flush rq needs to be inserted via blk_insert_flush()
2) otherwise, the flush rq needs to be dispatched directly, since it is in the flush machinery now.
So we use blk_mq_request_bypass_insert() for requests bypassing the flush machinery, just like the legacy path did.
Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Jack Wang jinpu.wang@profitbricks.com --- block/blk-flush.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c index 81bd1a843043..a9773d2075ac 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -463,7 +463,7 @@ void blk_insert_flush(struct request *rq) if ((policy & REQ_FSEQ_DATA) && !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) { if (q->mq_ops) - blk_mq_sched_insert_request(rq, false, false, false, false); + blk_mq_request_bypass_insert(rq, false); else list_add_tail(&rq->queuelist, &q->queue_head); return;
From: Ming Lei ming.lei@redhat.com
commit a6a252e6491443c1c18eab7e254daee63d4a7a04 upstream
In the case of an IO scheduler, we always pre-allocate one driver tag before calling blk_insert_flush(), and the flush request is marked as RQF_FLUSH_SEQ once it is in the flush machinery.
So if RQF_FLUSH_SEQ isn't set, we call blk_insert_flush() to handle the request; otherwise the flush request is dispatched to the ->dispatch list directly.
This is a preparation patch for not preallocating a driver tag for flush requests, and for not treating flush requests as a special case. This is similar to what the legacy path does.
Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Jack Wang jinpu.wang@profitbricks.com --- block/blk-mq-sched.c | 29 ++++++++++++++++------------- 1 file changed, 16 insertions(+), 13 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 13a27d4d1671..e7094f44afaf 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -345,21 +345,23 @@ void blk_mq_sched_request_inserted(struct request *rq) EXPORT_SYMBOL_GPL(blk_mq_sched_request_inserted);
static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, + bool has_sched, struct request *rq) { - if (rq->tag == -1) { + /* dispatch flush rq directly */ + if (rq->rq_flags & RQF_FLUSH_SEQ) { + spin_lock(&hctx->lock); + list_add(&rq->queuelist, &hctx->dispatch); + spin_unlock(&hctx->lock); + return true; + } + + if (has_sched) { rq->rq_flags |= RQF_SORTED; - return false; + WARN_ON(rq->tag != -1); }
- /* - * If we already have a real request tag, send directly to - * the dispatch list. - */ - spin_lock(&hctx->lock); - list_add(&rq->queuelist, &hctx->dispatch); - spin_unlock(&hctx->lock); - return true; + return false; }
/* @@ -385,12 +387,13 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head, struct blk_mq_ctx *ctx = rq->mq_ctx; struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);
- if (rq->tag == -1 && op_is_flush(rq->cmd_flags)) { + /* flush rq in flush machinery need to be dispatched directly */ + if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) { blk_mq_sched_insert_flush(hctx, rq, can_block); return; }
- if (e && blk_mq_sched_bypass_insert(hctx, rq)) + if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) goto run;
if (e && e->type->ops.mq.insert_requests) { @@ -428,7 +431,7 @@ void blk_mq_sched_insert_requests(struct request_queue *q, list_for_each_entry_safe(rq, next, list, queuelist) { if (WARN_ON_ONCE(rq->tag != -1)) { list_del_init(&rq->queuelist); - blk_mq_sched_bypass_insert(hctx, rq); + blk_mq_sched_bypass_insert(hctx, true, rq); } } }
From: Ming Lei ming.lei@redhat.com
commit 244c65a3ccaa06fd15cc940315606674d3108b2f upstream
We need this helper to put the driver tag for the flush rq, since the following patch will stop sharing the tag in the flush request sequence when an I/O scheduler is applied.
Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Jack Wang jinpu.wang@profitbricks.com --- block/blk-mq.c | 32 -------------------------------- block/blk-mq.h | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 33 insertions(+), 32 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c index 1f515346f94e..aff2ada523c3 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -982,38 +982,6 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx, return rq->tag != -1; }
-static void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx, - struct request *rq) -{ - blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag); - rq->tag = -1; - - if (rq->rq_flags & RQF_MQ_INFLIGHT) { - rq->rq_flags &= ~RQF_MQ_INFLIGHT; - atomic_dec(&hctx->nr_active); - } -} - -static void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx, - struct request *rq) -{ - if (rq->tag == -1 || rq->internal_tag == -1) - return; - - __blk_mq_put_driver_tag(hctx, rq); -} - -static void blk_mq_put_driver_tag(struct request *rq) -{ - struct blk_mq_hw_ctx *hctx; - - if (rq->tag == -1 || rq->internal_tag == -1) - return; - - hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu); - __blk_mq_put_driver_tag(hctx, rq); -} - /* * If we fail getting a driver tag because all the driver tags are already * assigned and on the dispatch list, BUT the first entry does not have a diff --git a/block/blk-mq.h b/block/blk-mq.h index 1819db23b038..a5c2b46951e4 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -3,6 +3,7 @@ #define INT_BLK_MQ_H
#include "blk-stat.h" +#include "blk-mq-tag.h"
struct blk_mq_tag_set;
@@ -159,4 +160,36 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx) return true; }
+static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx, + struct request *rq) +{ + blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag); + rq->tag = -1; + + if (rq->rq_flags & RQF_MQ_INFLIGHT) { + rq->rq_flags &= ~RQF_MQ_INFLIGHT; + atomic_dec(&hctx->nr_active); + } +} + +static inline void blk_mq_put_driver_tag_hctx(struct blk_mq_hw_ctx *hctx, + struct request *rq) +{ + if (rq->tag == -1 || rq->internal_tag == -1) + return; + + __blk_mq_put_driver_tag(hctx, rq); +} + +static inline void blk_mq_put_driver_tag(struct request *rq) +{ + struct blk_mq_hw_ctx *hctx; + + if (rq->tag == -1 || rq->internal_tag == -1) + return; + + hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu); + __blk_mq_put_driver_tag(hctx, rq); +} + #endif
From: Ming Lei ming.lei@redhat.com
commit 923218f6166a84688973acdc39094f3bee1e9ad4 upstream
The idea behind it is simple:
1) for the none scheduler, the driver tag has to be borrowed for the flush rq, otherwise we may run out of tags and cause an IO hang. And get/put driver tag is actually a noop for none, so reordering tags isn't necessary at all.
2) for a real I/O scheduler, we need not allocate a driver tag upfront for the flush rq. It works just fine to follow the same approach as for normal requests: allocate the driver tag for each rq just before calling ->queue_rq().
One driver-visible change is that the driver tag isn't shared in the flush request sequence. That won't be a problem, since we always do that in the legacy path.
Then the flush rq need not be treated specially w.r.t. get/put driver tag. This cleans up the code: for instance, reorder_tags_to_front() can be removed, and we needn't worry about request ordering in the dispatch list to avoid I/O deadlock.
Also we have to put the driver tag before requeueing.
Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Jack Wang jinpu.wang@profitbricks.com --- block/blk-flush.c | 35 ++++++++++++++++++++++++++--------- block/blk-mq-sched.c | 42 +++++------------------------------------- block/blk-mq.c | 41 ++++++----------------------------------- 3 files changed, 37 insertions(+), 81 deletions(-)
diff --git a/block/blk-flush.c b/block/blk-flush.c index a9773d2075ac..f17170675917 100644 --- a/block/blk-flush.c +++ b/block/blk-flush.c @@ -231,8 +231,13 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error) /* release the tag's ownership to the req cloned from */ spin_lock_irqsave(&fq->mq_flush_lock, flags); hctx = blk_mq_map_queue(q, flush_rq->mq_ctx->cpu); - blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq); - flush_rq->tag = -1; + if (!q->elevator) { + blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq); + flush_rq->tag = -1; + } else { + blk_mq_put_driver_tag_hctx(hctx, flush_rq); + flush_rq->internal_tag = -1; + } }
running = &fq->flush_queue[fq->flush_running_idx]; @@ -318,19 +323,26 @@ static bool blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq) blk_rq_init(q, flush_rq);
/* - * Borrow tag from the first request since they can't - * be in flight at the same time. And acquire the tag's - * ownership for flush req. + * In case of none scheduler, borrow tag from the first request + * since they can't be in flight at the same time. And acquire + * the tag's ownership for flush req. + * + * In case of IO scheduler, flush rq need to borrow scheduler tag + * just for cheating put/get driver tag. */ if (q->mq_ops) { struct blk_mq_hw_ctx *hctx;
flush_rq->mq_ctx = first_rq->mq_ctx; - flush_rq->tag = first_rq->tag; - fq->orig_rq = first_rq;
- hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu); - blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq); + if (!q->elevator) { + fq->orig_rq = first_rq; + flush_rq->tag = first_rq->tag; + hctx = blk_mq_map_queue(q, first_rq->mq_ctx->cpu); + blk_mq_tag_set_rq(hctx, first_rq->tag, flush_rq); + } else { + flush_rq->internal_tag = first_rq->internal_tag; + } }
flush_rq->cmd_flags = REQ_OP_FLUSH | REQ_PREFLUSH; @@ -394,6 +406,11 @@ static void mq_flush_data_end_io(struct request *rq, blk_status_t error)
hctx = blk_mq_map_queue(q, ctx->cpu);
+ if (q->elevator) { + WARN_ON(rq->tag < 0); + blk_mq_put_driver_tag_hctx(hctx, rq); + } + /* * After populating an empty queue, kick it to avoid stall. Read * the comment in flush_end_io(). diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index e7094f44afaf..01a43fed6b8c 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -356,29 +356,12 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, return true; }
- if (has_sched) { + if (has_sched) rq->rq_flags |= RQF_SORTED; - WARN_ON(rq->tag != -1); - }
return false; }
-/* - * Add flush/fua to the queue. If we fail getting a driver tag, then - * punt to the requeue list. Requeue will re-invoke us from a context - * that's safe to block from. - */ -static void blk_mq_sched_insert_flush(struct blk_mq_hw_ctx *hctx, - struct request *rq, bool can_block) -{ - if (blk_mq_get_driver_tag(rq, &hctx, can_block)) { - blk_insert_flush(rq); - blk_mq_run_hw_queue(hctx, true); - } else - blk_mq_add_to_requeue_list(rq, false, true); -} - void blk_mq_sched_insert_request(struct request *rq, bool at_head, bool run_queue, bool async, bool can_block) { @@ -389,10 +372,12 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
/* flush rq in flush machinery need to be dispatched directly */ if (!(rq->rq_flags & RQF_FLUSH_SEQ) && op_is_flush(rq->cmd_flags)) { - blk_mq_sched_insert_flush(hctx, rq, can_block); - return; + blk_insert_flush(rq); + goto run; }
+ WARN_ON(e && (rq->tag != -1)); + if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) goto run;
@@ -419,23 +404,6 @@ void blk_mq_sched_insert_requests(struct request_queue *q, struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu); struct elevator_queue *e = hctx->queue->elevator;
- if (e) { - struct request *rq, *next; - - /* - * We bypass requests that already have a driver tag assigned, - * which should only be flushes. Flushes are only ever inserted - * as single requests, so we shouldn't ever hit the - * WARN_ON_ONCE() below (but let's handle it just in case). - */ - list_for_each_entry_safe(rq, next, list, queuelist) { - if (WARN_ON_ONCE(rq->tag != -1)) { - list_del_init(&rq->queuelist); - blk_mq_sched_bypass_insert(hctx, true, rq); - } - } - } - if (e && e->type->ops.mq.insert_requests) e->type->ops.mq.insert_requests(hctx, list, false); else diff --git a/block/blk-mq.c b/block/blk-mq.c index aff2ada523c3..b490a1c2acc8 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -655,6 +655,8 @@ static void __blk_mq_requeue_request(struct request *rq) { struct request_queue *q = rq->q;
+ blk_mq_put_driver_tag(rq); + trace_block_rq_requeue(q, rq); wbt_requeue(q->rq_wb, &rq->issue_stat);
@@ -982,30 +984,6 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx, return rq->tag != -1; }
-/* - * If we fail getting a driver tag because all the driver tags are already - * assigned and on the dispatch list, BUT the first entry does not have a - * tag, then we could deadlock. For that case, move entries with assigned - * driver tags to the front, leaving the set of tagged requests in the - * same order, and the untagged set in the same order. - */ -static bool reorder_tags_to_front(struct list_head *list) -{ - struct request *rq, *tmp, *first = NULL; - - list_for_each_entry_safe_reverse(rq, tmp, list, queuelist) { - if (rq == first) - break; - if (rq->tag != -1) { - list_move(&rq->queuelist, list); - if (!first) - first = rq; - } - } - - return first != NULL; -} - static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, int flags, void *key) { @@ -1066,9 +1044,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
rq = list_first_entry(list, struct request, queuelist); if (!blk_mq_get_driver_tag(rq, &hctx, false)) { - if (!queued && reorder_tags_to_front(list)) - continue; - /* * The initial allocation attempt failed, so we need to * rerun the hardware queue when a tag is freed. @@ -1119,7 +1094,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, nxt = list_first_entry(list, struct request, queuelist); blk_mq_put_driver_tag(nxt); } - blk_mq_put_driver_tag_hctx(hctx, rq); list_add(&rq->queuelist, list); __blk_mq_requeue_request(rq); break; @@ -1706,13 +1680,10 @@ static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) if (unlikely(is_flush_fua)) { blk_mq_put_ctx(data.ctx); blk_mq_bio_to_request(rq, bio); - if (q->elevator) { - blk_mq_sched_insert_request(rq, false, true, true, - true); - } else { - blk_insert_flush(rq); - blk_mq_run_hw_queue(data.hctx, true); - } + + /* bypass scheduler for flush rq */ + blk_insert_flush(rq); + blk_mq_run_hw_queue(data.hctx, true); } else if (plug && q->nr_hw_queues == 1) { struct request *last = NULL;
From: Jens Axboe axboe@kernel.dk
commit 05b79413946d8b2b58999ea1ae844b6fc3c54f61 upstream
This reverts commit 358a3a6bccb74da9d63a26b2dd5f09f1e9970e0b.
We have cases that aren't covered 100% in the drivers, so for now we have to retain the shared tag restart loops.
Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Jack Wang jinpu.wang@profitbricks.com --- block/blk-mq-sched.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 74 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 01a43fed6b8c..6f4bdb8209f7 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -68,17 +68,25 @@ static void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx) set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state); }
-void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx) +static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx) { if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) - return; + return false;
- clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state); + if (hctx->flags & BLK_MQ_F_TAG_SHARED) { + struct request_queue *q = hctx->queue; + + if (test_and_clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) + atomic_dec(&q->shared_hctx_restart); + } else + clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
if (blk_mq_hctx_has_pending(hctx)) { blk_mq_run_hw_queue(hctx, true); - return; + return true; } + + return false; }
/* @@ -362,6 +370,68 @@ static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, return false; }
+/** + * list_for_each_entry_rcu_rr - iterate in a round-robin fashion over rcu list + * @pos: loop cursor. + * @skip: the list element that will not be examined. Iteration starts at + * @skip->next. + * @head: head of the list to examine. This list must have at least one + * element, namely @skip. + * @member: name of the list_head structure within typeof(*pos). + */ +#define list_for_each_entry_rcu_rr(pos, skip, head, member) \ + for ((pos) = (skip); \ + (pos = (pos)->member.next != (head) ? list_entry_rcu( \ + (pos)->member.next, typeof(*pos), member) : \ + list_entry_rcu((pos)->member.next->next, typeof(*pos), member)), \ + (pos) != (skip); ) + +/* + * Called after a driver tag has been freed to check whether a hctx needs to + * be restarted. Restarts @hctx if its tag set is not shared. Restarts hardware + * queues in a round-robin fashion if the tag set of @hctx is shared with other + * hardware queues. + */ +void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx) +{ + struct blk_mq_tags *const tags = hctx->tags; + struct blk_mq_tag_set *const set = hctx->queue->tag_set; + struct request_queue *const queue = hctx->queue, *q; + struct blk_mq_hw_ctx *hctx2; + unsigned int i, j; + + if (set->flags & BLK_MQ_F_TAG_SHARED) { + /* + * If this is 0, then we know that no hardware queues + * have RESTART marked. We're done. + */ + if (!atomic_read(&queue->shared_hctx_restart)) + return; + + rcu_read_lock(); + list_for_each_entry_rcu_rr(q, queue, &set->tag_list, + tag_set_list) { + queue_for_each_hw_ctx(q, hctx2, i) + if (hctx2->tags == tags && + blk_mq_sched_restart_hctx(hctx2)) + goto done; + } + j = hctx->queue_num + 1; + for (i = 0; i < queue->nr_hw_queues; i++, j++) { + if (j == queue->nr_hw_queues) + j = 0; + hctx2 = queue->queue_hw_ctx[j]; + if (hctx2->tags == tags && + blk_mq_sched_restart_hctx(hctx2)) + break; + } +done: + rcu_read_unlock(); + } else { + blk_mq_sched_restart_hctx(hctx); + } +} + void blk_mq_sched_insert_request(struct request *rq, bool at_head, bool run_queue, bool async, bool can_block) {
From: Ming Lei ming.lei@redhat.com
commit 0c6af1ccd5fd9ac640aef01c8de0043837451a04 upstream
We have to put the driver tag if the dispatch budget can't be obtained, otherwise it might cause an IO deadlock, especially when the number of tags is very small.
Fixes: de1482974080(blk-mq: introduce .get_budget and .put_budget in blk_mq_ops) Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Jack Wang jinpu.wang@profitbricks.com --- block/blk-mq.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c index b490a1c2acc8..657cfde50cda 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1066,8 +1066,10 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, } }
- if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) + if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) { + blk_mq_put_driver_tag(rq); break; + }
list_del_init(&rq->queuelist);
From: Jens Axboe axboe@kernel.dk
commit eb619fdb2d4cb8b3d3419e9113921e87e7daf557 upstream
This patch attempts to make the case of hctx re-running on driver tag failure more robust. Without this patch, it's pretty easy to trigger a stall condition with shared tags. An example is using null_blk like this:
modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 hw_queue_depth=1
which sets up 4 devices, sharing the same tag set with a depth of 1. Running a fio job ala:
[global]
bs=4k
rw=randread
norandommap
direct=1
ioengine=libaio
iodepth=4

[nullb0]
filename=/dev/nullb0
[nullb1]
filename=/dev/nullb1
[nullb2]
filename=/dev/nullb2
[nullb3]
filename=/dev/nullb3
will inevitably end with one or more threads being stuck waiting for a scheduler tag. That IO is then stuck forever, until someone else triggers a run of the queue.
Ensure that we always re-run the hardware queue, if the driver tag we were waiting for got freed before we added our leftover request entries back on the dispatch list.
Reviewed-by: Bart Van Assche bart.vanassche@wdc.com Tested-by: Bart Van Assche bart.vanassche@wdc.com Reviewed-by: Ming Lei ming.lei@redhat.com Reviewed-by: Omar Sandoval osandov@fb.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Jack Wang jinpu.wang@profitbricks.com --- block/blk-mq-debugfs.c | 1 - block/blk-mq.c | 85 ++++++++++++++++++++++++++++---------------------- include/linux/blk-mq.h | 5 ++- 3 files changed, 50 insertions(+), 41 deletions(-)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c index d95439154556..fe880e194cdb 100644 --- a/block/blk-mq-debugfs.c +++ b/block/blk-mq-debugfs.c @@ -180,7 +180,6 @@ static const char *const hctx_state_name[] = { HCTX_STATE_NAME(STOPPED), HCTX_STATE_NAME(TAG_ACTIVE), HCTX_STATE_NAME(SCHED_RESTART), - HCTX_STATE_NAME(TAG_WAITING), HCTX_STATE_NAME(START_ON_RUN), }; #undef HCTX_STATE_NAME diff --git a/block/blk-mq.c b/block/blk-mq.c index 657cfde50cda..7b45f20ae7c1 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -984,49 +984,64 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx, return rq->tag != -1; }
-static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, int flags, - void *key) +static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, + int flags, void *key) { struct blk_mq_hw_ctx *hctx;
hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
- list_del(&wait->entry); - clear_bit_unlock(BLK_MQ_S_TAG_WAITING, &hctx->state); + list_del_init(&wait->entry); blk_mq_run_hw_queue(hctx, true); return 1; }
-static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx) +static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx **hctx, + struct request *rq) { + struct blk_mq_hw_ctx *this_hctx = *hctx; + wait_queue_entry_t *wait = &this_hctx->dispatch_wait; struct sbq_wait_state *ws;
+ if (!list_empty_careful(&wait->entry)) + return false; + + spin_lock(&this_hctx->lock); + if (!list_empty(&wait->entry)) { + spin_unlock(&this_hctx->lock); + return false; + } + + ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx); + add_wait_queue(&ws->wait, wait); + /* - * The TAG_WAITING bit serves as a lock protecting hctx->dispatch_wait. - * The thread which wins the race to grab this bit adds the hardware - * queue to the wait queue. + * It's possible that a tag was freed in the window between the + * allocation failure and adding the hardware queue to the wait + * queue. */ - if (test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) || - test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, &hctx->state)) + if (!blk_mq_get_driver_tag(rq, hctx, false)) { + spin_unlock(&this_hctx->lock); return false; - - init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake); - ws = bt_wait_ptr(&hctx->tags->bitmap_tags, hctx); + }
/* - * As soon as this returns, it's no longer safe to fiddle with - * hctx->dispatch_wait, since a completion can wake up the wait queue - * and unlock the bit. + * We got a tag, remove ourselves from the wait queue to ensure + * someone else gets the wakeup. */ - add_wait_queue(&ws->wait, &hctx->dispatch_wait); + spin_lock_irq(&ws->wait.lock); + list_del_init(&wait->entry); + spin_unlock_irq(&ws->wait.lock); + spin_unlock(&this_hctx->lock); return true; }
bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, - bool got_budget) + bool got_budget) { struct blk_mq_hw_ctx *hctx; struct request *rq, *nxt; + bool no_tag = false; int errors, queued;
if (list_empty(list)) @@ -1046,22 +1061,15 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, if (!blk_mq_get_driver_tag(rq, &hctx, false)) { /* * The initial allocation attempt failed, so we need to - * rerun the hardware queue when a tag is freed. + * rerun the hardware queue when a tag is freed. The + * waitqueue takes care of that. If the queue is run + * before we add this entry back on the dispatch list, + * we'll re-run it below. */ - if (!blk_mq_dispatch_wait_add(hctx)) { - if (got_budget) - blk_mq_put_dispatch_budget(hctx); - break; - } - - /* - * It's possible that a tag was freed in the window - * between the allocation failure and adding the - * hardware queue to the wait queue. - */ - if (!blk_mq_get_driver_tag(rq, &hctx, false)) { + if (!blk_mq_dispatch_wait_add(&hctx, rq)) { if (got_budget) blk_mq_put_dispatch_budget(hctx); + no_tag = true; break; } } @@ -1126,10 +1134,10 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, * it is no longer set that means that it was cleared by another * thread and hence that a queue rerun is needed. * - * If TAG_WAITING is set that means that an I/O scheduler has - * been configured and another thread is waiting for a driver - * tag. To guarantee fairness, do not rerun this hardware queue - * but let the other thread grab the driver tag. + * If 'no_tag' is set, that means that we failed getting + * a driver tag with an I/O scheduler attached. If our dispatch + * waitqueue is no longer active, ensure that we run the queue + * AFTER adding our entries back to the list. * * If no I/O scheduler has been configured it is possible that * the hardware queue got stopped and restarted before requests @@ -1141,8 +1149,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list, * returning BLK_STS_RESOURCE. Two exceptions are scsi-mq * and dm-rq. */ - if (!blk_mq_sched_needs_restart(hctx) && - !test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state)) + if (!blk_mq_sched_needs_restart(hctx) || + (no_tag && list_empty_careful(&hctx->dispatch_wait.entry))) blk_mq_run_hw_queue(hctx, true); }
@@ -2029,6 +2037,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
hctx->nr_ctx = 0;
+ init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake); + INIT_LIST_HEAD(&hctx->dispatch_wait.entry); + if (set->ops->init_hctx && set->ops->init_hctx(hctx, set->driver_data, hctx_idx)) goto free_bitmap; diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index 4564bd216431..cc51f940aec4 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -36,7 +36,7 @@ struct blk_mq_hw_ctx { struct blk_mq_ctx **ctxs; unsigned int nr_ctx;
- wait_queue_entry_t dispatch_wait; + wait_queue_entry_t dispatch_wait; atomic_t wait_index;
struct blk_mq_tags *tags; @@ -182,8 +182,7 @@ enum { BLK_MQ_S_STOPPED = 0, BLK_MQ_S_TAG_ACTIVE = 1, BLK_MQ_S_SCHED_RESTART = 2, - BLK_MQ_S_TAG_WAITING = 3, - BLK_MQ_S_START_ON_RUN = 4, + BLK_MQ_S_START_ON_RUN = 3,
BLK_MQ_MAX_DEPTH = 10240,
From: Jens Axboe axboe@kernel.dk
commit 79f720a751cad613620d0237e3b44f89f4a69181 upstream
Currently we are inconsistent in when we decide to run the queue. Using blk_mq_run_hw_queues() we check if the hctx has pending IO before running it, but we don't do that from the individual queue run function, blk_mq_run_hw_queue(). This results in a lot of extra and pointless queue runs, potentially, on flush requests and (much worse) on tag starvation situations. This is observable just looking at top output, with lots of kworkers active. For the !async runs, it just adds to the CPU overhead of blk-mq.
Move the has-pending check into the run function instead of having callers do it.
Reviewed-by: Christoph Hellwig hch@lst.de Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Jack Wang jinpu.wang@profitbricks.com --- block/blk-mq-sched.c | 7 +------ block/blk-mq.c | 18 +++++++++++------- block/blk-mq.h | 2 -- include/linux/blk-mq.h | 2 +- 4 files changed, 13 insertions(+), 16 deletions(-)
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index 6f4bdb8209f7..c117bd8fd1f6 100644 --- a/block/blk-mq-sched.c +++ b/block/blk-mq-sched.c @@ -81,12 +81,7 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx) } else clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
- if (blk_mq_hctx_has_pending(hctx)) { - blk_mq_run_hw_queue(hctx, true); - return true; - } - - return false; + return blk_mq_run_hw_queue(hctx, true); }
/* diff --git a/block/blk-mq.c b/block/blk-mq.c index 7b45f20ae7c1..b3f15d9c7c5a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -60,10 +60,10 @@ static int blk_mq_poll_stats_bkt(const struct request *rq) /* * Check if any of the ctx's have pending work in this hardware queue */ -bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx) +static bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx) { - return sbitmap_any_bit_set(&hctx->ctx_map) || - !list_empty_careful(&hctx->dispatch) || + return !list_empty_careful(&hctx->dispatch) || + sbitmap_any_bit_set(&hctx->ctx_map) || blk_mq_sched_has_work(hctx); }
@@ -1261,9 +1261,14 @@ void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs) } EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
-void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) +bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) { - __blk_mq_delay_run_hw_queue(hctx, async, 0); + if (blk_mq_hctx_has_pending(hctx)) { + __blk_mq_delay_run_hw_queue(hctx, async, 0); + return true; + } + + return false; } EXPORT_SYMBOL(blk_mq_run_hw_queue);
@@ -1273,8 +1278,7 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async) int i;
queue_for_each_hw_ctx(q, hctx, i) { - if (!blk_mq_hctx_has_pending(hctx) || - blk_mq_hctx_stopped(hctx)) + if (blk_mq_hctx_stopped(hctx)) continue;
blk_mq_run_hw_queue(hctx, async); diff --git a/block/blk-mq.h b/block/blk-mq.h index a5c2b46951e4..1b849a163e9d 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -27,14 +27,12 @@ struct blk_mq_ctx { struct kobject kobj; } ____cacheline_aligned_in_smp;
-void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async); void blk_mq_freeze_queue(struct request_queue *q); void blk_mq_free_queue(struct request_queue *q); int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr); void blk_mq_wake_waiters(struct request_queue *q); bool blk_mq_dispatch_rq_list(struct request_queue *, struct list_head *, bool); void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list); -bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx); bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx, bool wait); struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx, diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index cc51f940aec4..d0162ea97078 100644 --- a/include/linux/blk-mq.h +++ b/include/linux/blk-mq.h @@ -261,7 +261,7 @@ void blk_mq_start_stopped_hw_queues(struct request_queue *q, bool async); void blk_mq_quiesce_queue(struct request_queue *q); void blk_mq_unquiesce_queue(struct request_queue *q); void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs); -void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async); +bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async); void blk_mq_run_hw_queues(struct request_queue *q, bool async); void blk_mq_delay_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs); void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
From: Ming Lei ming.lei@redhat.com
commit 24f5a90f0d13a97b51aa79f468143fafea4246bb upstream
Dispatch may still be in progress after the queue is frozen, so we have to quiesce the queue before switching the IO scheduler and updating nr_requests.
Also, when switching io schedulers, blk_mq_run_hw_queue() may still be called somewhere (such as from nvme_reset_work()), and the io scheduler's per-hctx data may not be set up yet, which can cause an oops even inside blk_mq_hctx_has_pending(); for example, it can be run just between:
        ret = e->ops.mq.init_sched(q, e);
AND
        ret = e->ops.mq.init_hctx(hctx, i)
inside blk_mq_init_sched().
This basically reverts commit 7a148c2fcff8330 ("block: don't call blk_mq_quiesce_queue() after queue is frozen"), and makes sure blk_mq_hctx_has_pending() won't be called if the queue is quiesced.
Reviewed-by: Christoph Hellwig hch@lst.de Fixes: 7a148c2fcff83309(block: don't call blk_mq_quiesce_queue() after queue is frozen) Reported-by: Yi Zhang yi.zhang@redhat.com Tested-by: Yi Zhang yi.zhang@redhat.com Signed-off-by: Ming Lei ming.lei@redhat.com Signed-off-by: Jens Axboe axboe@kernel.dk Signed-off-by: Jack Wang jinpu.wang@profitbricks.com --- block/blk-mq.c | 27 ++++++++++++++++++++++++++- block/elevator.c | 2 ++ 2 files changed, 28 insertions(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c index b3f15d9c7c5a..7d396b77e5e8 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1263,7 +1263,30 @@ EXPORT_SYMBOL(blk_mq_delay_run_hw_queue);
bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async) { - if (blk_mq_hctx_has_pending(hctx)) { + int srcu_idx; + bool need_run; + + /* + * When queue is quiesced, we may be switching io scheduler, or + * updating nr_hw_queues, or other things, and we can't run queue + * any more, even __blk_mq_hctx_has_pending() can't be called safely. + * + * And queue will be rerun in blk_mq_unquiesce_queue() if it is + * quiesced. + */ + if (!(hctx->flags & BLK_MQ_F_BLOCKING)) { + rcu_read_lock(); + need_run = !blk_queue_quiesced(hctx->queue) && + blk_mq_hctx_has_pending(hctx); + rcu_read_unlock(); + } else { + srcu_idx = srcu_read_lock(hctx->queue_rq_srcu); + need_run = !blk_queue_quiesced(hctx->queue) && + blk_mq_hctx_has_pending(hctx); + srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx); + } + + if (need_run) { __blk_mq_delay_run_hw_queue(hctx, async, 0); return true; } @@ -2710,6 +2733,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr) return -EINVAL;
blk_mq_freeze_queue(q); + blk_mq_quiesce_queue(q);
ret = 0; queue_for_each_hw_ctx(q, hctx, i) { @@ -2734,6 +2758,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr) if (!ret) q->nr_requests = nr;
+ blk_mq_unquiesce_queue(q); blk_mq_unfreeze_queue(q);
return ret; diff --git a/block/elevator.c b/block/elevator.c index 153926a90901..229eb4b5604c 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -951,6 +951,7 @@ static int elevator_switch_mq(struct request_queue *q, int ret;
blk_mq_freeze_queue(q); + blk_mq_quiesce_queue(q);
if (q->elevator) { if (q->elevator->registered) @@ -977,6 +978,7 @@ static int elevator_switch_mq(struct request_queue *q, blk_add_trace_msg(q, "elv switch: none");
out: + blk_mq_unquiesce_queue(q); blk_mq_unfreeze_queue(q); return ret; }
From: Ming Lei ming.lei@redhat.com
commit 7e70aa789d4a0c89dbfbd2c8a974a4df717475ec upstream
Before commit 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq"), we ran the queue after 3ms if the queue was idle and the SCSI device queue wasn't ready, which was done when handling BLK_STS_RESOURCE. After commit 0df21c86bdbf, the queue won't be run any more in this situation.
An IO hang is observed when a timeout happens, and this patch fixes the IO hang by running the queue after a delay in scsi_dev_queue_ready(), just like non-mq. The issue can be triggered by the following script [1].
There is another issue which can be covered by running the idle queue: when .get_budget() is called on a request coming from hctx->dispatch_list, and one request just completes during .get_budget(), we can't depend on SCSI's restart to make progress any more. This patch fixes that race too.
With this patch, we basically recover the previous behaviour (before commit 0df21c86bdbf) of handling an idle queue when running out of resources.
[1] script for test/verify SCSI timeout

rmmod scsi_debug
modprobe scsi_debug max_queue=1

DEVICE=`ls -d /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/* | head -1 | xargs basename`
DISK_DIR=`ls -d /sys/block/$DEVICE/device/scsi_disk/*`

echo "using scsi device $DEVICE"
echo "-1" >/sys/bus/pseudo/drivers/scsi_debug/every_nth
echo "temporary write through" >$DISK_DIR/cache_type
echo "128" >/sys/bus/pseudo/drivers/scsi_debug/opts
echo none > /sys/block/$DEVICE/queue/scheduler
dd if=/dev/$DEVICE of=/dev/null bs=1M iflag=direct count=1 &
sleep 5
echo "0" >/sys/bus/pseudo/drivers/scsi_debug/opts
wait
echo "SUCCESS"
Fixes: 0df21c86bdbf ("scsi: implement .get_budget and .put_budget for blk-mq") Signed-off-by: Ming Lei ming.lei@redhat.com Tested-by: Holger Hoffstätte holger@applied-asynchrony.com Reviewed-by: Bart Van Assche bart.vanassche@wdc.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Jack Wang jinpu.wang@profitbricks.com --- drivers/scsi/scsi_lib.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 9e242583ab83..175981058396 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1994,6 +1994,8 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx) out_put_device: put_device(&sdev->sdev_gendev); out: + if (atomic_read(&sdev->device_busy) == 0 && !scsi_device_blocked(sdev)) + blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY); return false; }
On Mon, Jul 23, 2018 at 03:24:22PM +0200, Jack Wang wrote:
Hi Greg, Please consider this patchset, which include block/scsi multiqueue performance enhancement and bugfix.
What exactly is the performance enhancement? How can you measure it? How did you measure it?
Which one(s) are the bugfixes in this series? Why not just submit those separately as they should be "obvious fixes", right?
And why didn't you cc: the subsystem maintainer and mailing list here? I need their ack before I could take these.
thanks,
greg k-h
Hi Greg,
Thanks for quick reply. Please see reply inline.
Greg KH gregkh@linuxfoundation.org 于2018年7月23日周一 下午3:34写道:
On Mon, Jul 23, 2018 at 03:24:22PM +0200, Jack Wang wrote:
Hi Greg,
Please consider this patchset, which include block/scsi multiqueue performance enhancement and bugfix.
What exactly is the performance enhancement? How can you measure it? How did you measure it?
I'm testing on SRP/IBNBD using fio; I've seen +10% with mixed IO load, and a 50% improvement on small IO (bs=512) with the patchset.
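For reference, a fio invocation of roughly this shape exercises the small-block random-read case mentioned above; the target device path, iodepth, job count, and runtime below are illustrative assumptions, not the exact parameters used in this report:

fio --name=small-randread --filename=/dev/ibnbd0 --direct=1 --ioengine=libaio \
    --rw=randread --bs=512 --iodepth=32 --numjobs=4 --runtime=60 --time_based \
    --group_reporting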
Which one(s) are the bugfixes in this series? Why not just submit those separately as they should be "obvious fixes", right?
We had a queue stall problem on shared hctx restart; my colleague Roman reported it upstream: https://lkml.org/lkml/2017/10/18/263. The patch wasn't taken upstream, and a bigger patchset from Ming Lei was accepted instead, with a revert and further bugfixes later on.
The following patches are meant to improve performance on scsi-mq, and also contain the attempted fix for the queue stall problem, 358a3a6bccb7 ("blk-mq: don't handle TAG_SHARED in restart"): blk-mq-sched: move actual dispatching into one helper blk-mq: introduce .get_budget and .put_budget in blk_mq_ops sbitmap: introduce __sbitmap_for_each_set() blk-mq-sched: improve dispatching from sw queue scsi: allow passing in null rq to scsi_prep_state_check() scsi: implement .get_budget and .put_budget for blk-mq SCSI: don't get target/host busy_count in scsi_mq_get_budget() blk-mq: don't handle TAG_SHARED in restart blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE blk-mq: don't handle failure in .get_budget
The following are independent fixes: blk-mq: quiesce queue during switching io sched and updating nr_requests blk-mq: fix issue with shared tag queue re-running
These are cleanups for flush request handling: blk-flush: don't run queue for requests bypassing flush block: pass 'run_queue' to blk_mq_request_bypass_insert blk-flush: use blk_mq_request_bypass_insert() blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ blk-mq: move blk_mq_put_driver_tag*() into blk-mq.h blk-mq: don't allocate driver tag upfront for flush rq
The rest are follow-up fixes.
And why didn't you cc: the subsystem maintainer and mailing list here? I need their ack before I could take these.
Sorry, my mistake; now adding Jens, linux-block, and linux-scsi in this reply.
Sorry for confusion.
thanks,
greg k-h
Regards, Jack Wang
On 7/23/18 9:00 AM, Jack Wang wrote:
Hi Greg,
Thanks for quick reply. Please see reply inline.
Greg KH gregkh@linuxfoundation.org 于2018年7月23日周一 下午3:34写道:
On Mon, Jul 23, 2018 at 03:24:22PM +0200, Jack Wang wrote:
Hi Greg,
Please consider this patchset, which include block/scsi multiqueue performance enhancement and bugfix.
What exactly is the performance enhancement? How can you measure it? How did you measure it?
I'm testing on SRP/IBNBD using fio; I've seen +10% with mixed IO load, and a 50% improvement on small IO (bs=512) with the patchset.
Big nak on backporting this huge series, it's a lot of core changes. It's way beyond the scope of a stable fix backport.
Jens Axboe axboe@kernel.dk 于2018年7月23日周一 下午5:05写道:
On 7/23/18 9:00 AM, Jack Wang wrote:
Hi Greg,
Thanks for quick reply. Please see reply inline.
Greg KH gregkh@linuxfoundation.org 于2018年7月23日周一 下午3:34写道:
On Mon, Jul 23, 2018 at 03:24:22PM +0200, Jack Wang wrote:
Hi Greg,
Please consider this patchset, which include block/scsi multiqueue performance enhancement and bugfix.
What exactly is the performance enhancement? How can you measure it? How did you measure it?
I'm testing on SRP/IBNBD using fio; I've seen +10% with mixed IO load, and a 50% improvement on small IO (bs=512) with the patchset.
Big nak on backporting this huge series, it's a lot of core changes. It's way beyond the scope of a stable fix backport.
-- Jens Axboe
OK, could you shed light on how we could fix the queue stall problem on 4.14? My colleague Roman reported it upstream: https://lkml.org/lkml/2017/10/18/263
It's still there on latest 4.14.
Thanks, Jack
On 7/23/18 9:28 AM, Jack Wang wrote:
Jens Axboe axboe@kernel.dk 于2018年7月23日周一 下午5:05写道:
On 7/23/18 9:00 AM, Jack Wang wrote:
Hi Greg,
Thanks for quick reply. Please see reply inline.
Greg KH gregkh@linuxfoundation.org 于2018年7月23日周一 下午3:34写道:
On Mon, Jul 23, 2018 at 03:24:22PM +0200, Jack Wang wrote:
Hi Greg,
Please consider this patchset, which include block/scsi multiqueue performance enhancement and bugfix.
What exactly is the performance enhancement? How can you measure it? How did you measure it?
I'm testing on SRP/IBNBD using fio; I've seen +10% with mixed IO load, and a 50% improvement on small IO (bs=512) with the patchset.
Big nak on backporting this huge series, it's a lot of core changes. It's way beyond the scope of a stable fix backport.
-- Jens Axboe
OK, could you shed light on how we could fix the queue stall problem on 4.14? My colleague Roman reported it upstream: https://lkml.org/lkml/2017/10/18/263
It's still there on latest 4.14.
The proposed patch is a helluva lot simpler than doing a 23 patch selective backport.
Jens Axboe axboe@kernel.dk 于2018年7月23日周一 下午5:31写道:
On 7/23/18 9:28 AM, Jack Wang wrote:
Jens Axboe axboe@kernel.dk 于2018年7月23日周一 下午5:05写道:
On 7/23/18 9:00 AM, Jack Wang wrote:
Hi Greg,
Thanks for quick reply. Please see reply inline.
Greg KH gregkh@linuxfoundation.org 于2018年7月23日周一 下午3:34写道:
On Mon, Jul 23, 2018 at 03:24:22PM +0200, Jack Wang wrote:
Hi Greg,
Please consider this patchset, which include block/scsi multiqueue performance enhancement and bugfix.
What exactly is the performance enhancement? How can you measure it? How did you measure it?
I'm testing on SRP/IBNBD using fio; I've seen +10% with mixed IO load, and a 50% improvement on small IO (bs=512) with the patchset.
Big nak on backporting this huge series, it's a lot of core changes. It's way beyond the scope of a stable fix backport.
-- Jens Axboe
OK, could you shed light on how we could fix the queue stall problem on 4.14? My colleague Roman reported it upstream: https://lkml.org/lkml/2017/10/18/263
It's still there on latest 4.14.
The proposed patch is a helluva lot simpler than doing a 23 patch selective backport.
-- Jens Axboe
Do you think it makes sense to port Roman's patch to upstream 4.14, so others would also benefit from it?
On the other hand, the stable tree only takes patches accepted into Linus' tree.
Thanks, Jack