This reverts commit e70c301faece15b618e54b613b1fd6ece3dd05b4.
Commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") reversed the order in which requests are stored in the blk_plug list. This has a significant impact on merging bios with requests already on the plug list. The impact was reported in [1] and is easily reproducible with a 4k randwrite fio benchmark on an NVMe SSD without any filesystem on the disk.
My benchmark is:
fio --time_based --name=benchmark --size=50G --rw=randwrite \
 --runtime=60 --filename="/dev/nvme1n1" --ioengine=psync \
 --randrepeat=0 --iodepth=1 --fsync=64 --invalidate=1 \
 --verify=0 --verify_fatal=0 --blocksize=4k --numjobs=4 \
 --group_reporting
Run on a 1.9 TiB SSD (180K max IOPS) attached to an i3.16xlarge AWS EC2 instance.
Kernel        | fio B/W (MiB/sec) | I/O size (iostat)
--------------+-------------------+------------------
6.15.1        | 362               | 2KiB
6.15.1+revert | 660 (+82%)        | 4KiB
--------------+-------------------+------------------
I ran iostat while the fio benchmark was running and saw that the I/O size on the disk is 2KiB without this revert, while it is 4KiB with the revert. In the bad case the write bandwidth is capped at around 362 MiB/sec, which is roughly 2KiB * 180K IOPS, so we are hitting the SSD's 180K IOPS limit. After the revert the I/O size is doubled to 4KiB, hence the bandwidth almost doubles as we no longer hit the disk's IOPS limit.
I did some tracing using bpftrace & bcc and was able to conclude that the reason behind the I/O size discrepancy is that this fio benchmark submits each 4KiB I/O as 2 contiguous 2KiB bios.

In the good case each pair of bios is merged into a single 4KiB request that is then submitted to the disk, while in the bad case the 2KiB bios are submitted to the disk unmerged because blk_attempt_plug_merge() fails to merge them, as seen below.
**Without the revert**
[12:12:28]
r::blk_attempt_plug_merge():int:$retval
        COUNT      EVENT
        5618       $retval = 1
        176578     $retval = 0
**With the revert**
[12:11:43]
r::blk_attempt_plug_merge():int:$retval
        COUNT      EVENT
        146684     $retval = 0
        146686     $retval = 1
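For reference, equivalent per-return-value counts can be collected with a bpftrace one-liner along these lines (an illustration only, not necessarily the exact command that produced the output above):

bpftrace -e 'kretprobe:blk_attempt_plug_merge { @retval[retval] = count(); }'

Here $retval = 1 means the bio was merged with a request already on the plug list, and $retval = 0 means no merge happened.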
In blk_attempt_plug_merge() we iterate through the plug list from head to tail, looking for a request with which the most recently submitted bio can be merged.

With commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") the most recent request sits at the tail, so blk_attempt_plug_merge() fails because it tries to merge the bio with the request at the head of the plug list. We don't iterate across the whole plug list because we exit the loop as soon as blk_attempt_bio_merge() fails to merge.
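A simplified sketch of that walk (illustrative only, condensed from the behaviour described above rather than quoted verbatim from the kernel source):

	rq_list_for_each(&plug->mq_list, rq) {
		if (rq->q == q) {
			if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
			    BIO_MERGE_OK)
				return true;	/* merged with this request */
			break;			/* same queue, merge failed: stop */
		}
		if (!plug->multiple_queues)
			break;			/* single queue: nothing further to try */
	}
	return false;

With tail insertion the request the new bio is contiguous with is the last one the walk would reach, but the walk stops at the first same-queue request, i.e. the oldest one.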
Commit bc490f81731 ("block: change plugging to use a singly linked list") changed the plug list to a singly linked list, so there is no way to iterate the list from tail to head, which would be the only way to mitigate the impact on bio merging while keeping commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug").
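For reference, a rough sketch of the plug list shape in current kernels (simplified from the rq_list definitions, shown only to illustrate why a tail-to-head walk is not possible):

	struct rq_list {
		struct request *head;
		struct request *tail;	/* enables O(1) append, not reverse traversal */
	};

Requests are chained only forward through rq_next, so the list can be walked from head to tail but there is no back pointer to walk it the other way.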
Given that moving the plug list to a singly linked list was done mainly for performance reasons, let's revert commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") for now to mitigate the reported performance regression.
[1] https://lore.kernel.org/lkml/202412122112.ca47bcec-lkp@intel.com/
Cc: stable@vger.kernel.org # 6.12
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Hagar Hemdan <hagarhem@amazon.com>
Reported-and-bisected-by: Shaoying Xu <shaoyi@amazon.com>
Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
---
 block/blk-mq.c             | 4 ++--
 drivers/block/virtio_blk.c | 2 +-
 drivers/nvme/host/pci.c    | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c2697db59109..28965cac19fb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1394,7 +1394,7 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
 	 */
 	if (!plug->has_elevator && (rq->rq_flags & RQF_SCHED_TAGS))
 		plug->has_elevator = true;
-	rq_list_add_tail(&plug->mq_list, rq);
+	rq_list_add_head(&plug->mq_list, rq);
 	plug->rq_count++;
 }
 
@@ -2846,7 +2846,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
 			rq_list_add_tail(&requeue_list, rq);
 			continue;
 		}
-		list_add_tail(&rq->queuelist, &list);
+		list_add(&rq->queuelist, &list);
 		depth++;
 	} while (!rq_list_empty(&plug->mq_list));
 
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 7cffea01d868..7992a171f905 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -513,7 +513,7 @@ static void virtio_queue_rqs(struct rq_list *rqlist)
 		vq = this_vq;
 
 		if (virtblk_prep_rq_batch(req))
-			rq_list_add_tail(&submit_list, req);
+			rq_list_add_head(&submit_list, req); /* reverse order */
 		else
 			rq_list_add_tail(&requeue_list, req);
 	}
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f1dd804151b1..5f7da42f9dac 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1026,7 +1026,7 @@ static void nvme_queue_rqs(struct rq_list *rqlist)
 		nvmeq = req->mq_hctx->driver_data;
 
 		if (nvme_prep_rq_batch(nvmeq, req))
-			rq_list_add_tail(&submit_list, req);
+			rq_list_add_head(&submit_list, req); /* reverse order */
 		else
 			rq_list_add_tail(&requeue_list, req);
 	}