On 07/23/2018 11:50 PM, Bart Van Assche wrote:
The patch below fixes queue stalling when shared hctx marked for restart (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero. The root cause is that hctxs are shared between queues, but 'shared_hctx_restart'
The blk_mq_hw_ctx structure is also per request_queue Please refer to blk_mq_init_allocated_queue -> blk_mq_realloc_hw_ctxs
belongs to the particular queue, which in fact may not need to be restarted, thus we return from blk_mq_sched_restart() and leave shared hctx of another queue never restarted.
The fix is to make shared_hctx_restart counter belong not to the queue, but to tags, thereby counter will reflect real number of shared hctx needed to be restarted.
During tests 1 hctx (set->nr_hw_queues) was used and all stalled requests were noticed in dd->fifo_list of mq-deadline scheduler.
Seeming possible sequence of events:
Request A of queue A is inserted into dd->fifo_list of the scheduler.
Request B of queue A bypasses scheduler and goes directly to hctx->dispatch.
Request C of queue B is inserted.
blk_mq_sched_dispatch_requests() is invoked, since hctx->dispatch is not empty (request B is in the list) hctx is only marked for for next restart and request A is left in a list (see comment "So it's best to leave them there for as long as we can. Mark the hw queue as needing a restart in that case." in blk-mq-sched.c)
Eventually request B is completed/freed and blk_mq_sched_restart() is called, but by chance hctx from queue B is chosen for restart and request C gets a chance to be dispatched.
Request C is just inserted into queue B. If there is no mark restart there, it will not be chosen. blk_mq_sched_restart_hctx
if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) return false;
If blk_mq_sched_restart choose queue B, one of its hctx must have SCHED_RESTART flag, and q->shared_hctx_restart must not be zero.
- Eventually request C is completed/freed and blk_mq_sched_restart() is called, but shared_hctx_restart for queue B is zero and we return without attempt to restart hctx from queue A, thus request A is stuck forever.
But stalling queue is not the only one problem with blk_mq_sched_restart(). My tests show that those loops thru all queues and hctxs can be very costly, even with shared_hctx_restart counter, which aims to fix performance issue.
Currently, SCHED_RESTART is always set when there is reqs on hctx->dispatch list in blk_mq_sched_dispatch_requests. And no driver tag case is the main reason for hctx->dispatch is not empty when there is io scheduler. Therefore, most of time, blk_mq_sched_restart will be invoked further for no driver tag case.
For non-share-tag case, it will wakeup the hctx. But for shared-tag case, it is unnecessary, because the sbitmap_queue wakeup hook will work and hctx_may_queue will avoid starvation of other ones.
Therefore, the costly loop through the queues and hctxs is unnecessary most of time.
Thanks Jianchao