This reverts commit e70c301faece15b618e54b613b1fd6ece3dd05b4.
Commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") reversed the order in which requests are stored in the blk_plug list, which has a significant impact on merging bios with requests already on the plug list. This impact was reported in [1] and can easily be reproduced with a 4k randwrite fio benchmark on an NVMe-based SSD with no filesystem on the disk.
My benchmark is:
fio --time_based --name=benchmark --size=50G --rw=randwrite \
    --runtime=60 --filename="/dev/nvme1n1" --ioengine=psync \
    --randrepeat=0 --iodepth=1 --fsync=64 --invalidate=1 \
    --verify=0 --verify_fatal=0 --blocksize=4k --numjobs=4 \
    --group_reporting
Run on a 1.9 TiB SSD (180K max IOPS) attached to an i3.16xlarge AWS EC2 instance.
Kernel        | fio BW (MiB/sec)    | I/O size (iostat)
--------------+---------------------+--------------------
6.15.1        | 362                 | 2KiB
6.15.1+revert | 660 (+82%)          | 4KiB
--------------+---------------------+--------------------
I ran iostat while the fio benchmark was running and saw that the I/O size on the disk is 2 KiB without this revert and 4 KiB with it. In the bad case the write bandwidth is capped at around 362 MiB/sec, which is almost 2 KiB * 180K IOPS (~352 MiB/sec), so we are hitting the SSD's 180K IOPS limit. With the revert the I/O size doubles to 4 KiB and the bandwidth almost doubles with it, since we no longer hit the disk IOPS limit.
I did some tracing with bpftrace & bcc and was able to conclude that the reason behind the I/O size discrepancy is that this fio benchmark submits each 4 KiB I/O as 2 contiguous 2 KiB bios.
In the good case each pair of bios is merged into a 4 KiB request that is then submitted to the disk, while in the bad case the 2 KiB bios are submitted to the disk unmerged because blk_attempt_plug_merge() fails to merge them, as seen below.
**Without the revert**
[12:12:28]
r::blk_attempt_plug_merge():int:$retval
        COUNT      EVENT
        5618       $retval = 1
        176578     $retval = 0
**With the revert**
[12:11:43]
r::blk_attempt_plug_merge():int:$retval
        COUNT      EVENT
        146684     $retval = 0
        146686     $retval = 1
In blk_attempt_plug_merge() we iterate through the plug list from head to tail looking for a request with which we can merge the most recently submitted bio.
With commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") the most recent request sits at the tail, so blk_attempt_plug_merge() fails because it tries to merge the bio with the plug list head. We don't iterate across the whole plug list because we exit the loop as soon as a merge attempt in blk_attempt_bio_merge() fails.
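For reference, here is a simplified sketch of that scan, paraphrased from blk_attempt_plug_merge() (not the exact upstream code):

	rq_list_for_each(&plug->mq_list, rq) {
		if (rq->q == q) {
			/* The first same-queue candidate decides it: either
			 * the bio merges here, or we stop trying altogether. */
			if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
			    BIO_MERGE_OK)
				return true;
			break;
		}
		/* Only keep walking the list if the plug spans multiple queues. */
		if (!plug->multiple_queues)
			break;
	}
	return false;

Since the walk starts at the head of plug->mq_list, the request a new contiguous bio is most likely to merge with, i.e. the most recently added one, is never reached when it sits at the tail.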
In commit bc490f81731 ("block: change plugging to use a singly linked list") the plug list was changed to a singly linked list, so there is no way to iterate the list from tail to head, which would be the only way to mitigate the impact on bio merging while keeping commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug").
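To illustrate the constraint, here is a sketch of my understanding of the rq_list helpers (not the exact kernel definitions): the list only carries forward rq_next links plus head/tail pointers, so tail-to-head traversal is simply not possible.

	struct rq_list {
		struct request *head;
		struct request *tail;
	};

	/* Newest request becomes the head, right where the merge scan starts. */
	static inline void rq_list_add_head(struct rq_list *rl, struct request *rq)
	{
		rq->rq_next = rl->head;
		rl->head = rq;
		if (!rl->tail)
			rl->tail = rq;
	}

	/* Newest request goes behind the tail; with only forward links it can
	 * be reached solely by walking the whole list from the head. */
	static inline void rq_list_add_tail(struct rq_list *rl, struct request *rq)
	{
		rq->rq_next = NULL;
		if (rl->tail)
			rl->tail->rq_next = rq;
		else
			rl->head = rq;
		rl->tail = rq;
	}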
Given that moving the plug list to a singly linked list was done mainly for performance reasons, let's revert commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") for now to mitigate the reported performance regression.
[1] https://lore.kernel.org/lkml/202412122112.ca47bcec-lkp@intel.com/
Cc: stable@vger.kernel.org # 6.12
Reported-by: kernel test robot <oliver.sang@intel.com>
Reported-by: Hagar Hemdan <hagarhem@amazon.com>
Reported-and-bisected-by: Shaoying Xu <shaoyi@amazon.com>
Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
---
 block/blk-mq.c             | 4 ++--
 drivers/block/virtio_blk.c | 2 +-
 drivers/nvme/host/pci.c    | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c2697db59109..28965cac19fb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1394,7 +1394,7 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
 	 */
 	if (!plug->has_elevator && (rq->rq_flags & RQF_SCHED_TAGS))
 		plug->has_elevator = true;
-	rq_list_add_tail(&plug->mq_list, rq);
+	rq_list_add_head(&plug->mq_list, rq);
 	plug->rq_count++;
 }
 
@@ -2846,7 +2846,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
 			rq_list_add_tail(&requeue_list, rq);
 			continue;
 		}
-		list_add_tail(&rq->queuelist, &list);
+		list_add(&rq->queuelist, &list);
 		depth++;
 	} while (!rq_list_empty(&plug->mq_list));

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 7cffea01d868..7992a171f905 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -513,7 +513,7 @@ static void virtio_queue_rqs(struct rq_list *rqlist)
 		vq = this_vq;
 
 		if (virtblk_prep_rq_batch(req))
-			rq_list_add_tail(&submit_list, req);
+			rq_list_add_head(&submit_list, req); /* reverse order */
 		else
 			rq_list_add_tail(&requeue_list, req);
 	}

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f1dd804151b1..5f7da42f9dac 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1026,7 +1026,7 @@ static void nvme_queue_rqs(struct rq_list *rqlist)
 		nvmeq = req->mq_hctx->driver_data;
 
 		if (nvme_prep_rq_batch(nvmeq, req))
-			rq_list_add_tail(&submit_list, req);
+			rq_list_add_head(&submit_list, req); /* reverse order */
 		else
 			rq_list_add_tail(&requeue_list, req);
 	}
On 6/11/25 6:14 AM, Hazem Mohamed Abuelfotoh wrote:
> This reverts commit e70c301faece15b618e54b613b1fd6ece3dd05b4.
>
> Commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") reversed the order in which requests are stored in the blk_plug list, which has a significant impact on merging bios with requests already on the plug list. This impact was reported in [1] and can easily be reproduced with a 4k randwrite fio benchmark on an NVMe-based SSD with no filesystem on the disk.
Rather than revert this commit, why not just attempt a tail merge? Something ala this, totally untested.
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3af1d284add5..708ded67d52a 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -998,6 +998,10 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 	if (!plug || rq_list_empty(&plug->mq_list))
 		return false;
 
+	rq = plug->mq_list.tail;
+	if (rq->q == q)
+		return blk_attempt_bio_merge(q, rq, bio, nr_segs, false) == BIO_MERGE_OK;
+
 	rq_list_for_each(&plug->mq_list, rq) {
 		if (rq->q == q) {
 			if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
On 11/06/2025 13:56, Jens Axboe wrote:
> On 6/11/25 6:14 AM, Hazem Mohamed Abuelfotoh wrote:
>> This reverts commit e70c301faece15b618e54b613b1fd6ece3dd05b4.
>>
>> Commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") reversed the order in which requests are stored in the blk_plug list, which has a significant impact on merging bios with requests already on the plug list. This impact was reported in [1] and can easily be reproduced with a 4k randwrite fio benchmark on an NVMe-based SSD with no filesystem on the disk.
>
> Rather than revert this commit, why not just attempt a tail merge? Something ala this, totally untested.
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 3af1d284add5..708ded67d52a 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -998,6 +998,10 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
>  	if (!plug || rq_list_empty(&plug->mq_list))
>  		return false;
>  
> +	rq = plug->mq_list.tail;
> +	if (rq->q == q)
> +		return blk_attempt_bio_merge(q, rq, bio, nr_segs, false) == BIO_MERGE_OK;
> +
>  	rq_list_for_each(&plug->mq_list, rq) {
>  		if (rq->q == q) {
>  			if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
>
> --
> Jens Axboe
I thought about that solution before submitting the revert, and I believe it will help with the case we are discussing here. But what about the case where we have RAID disks, for which we do need to iterate the plug list? In that case we would iterate the plug list from head to tail while the most recent requests (where merging will most likely happen) are closer to the tail, so there will be additional overhead from iterating through the whole plug list. I believe the revert will also be better for this specific case, unless I am missing something here :)
On Wed, Jun 11, 2025 at 12:14:54PM +0000, Hazem Mohamed Abuelfotoh wrote:
> This reverts commit e70c301faece15b618e54b613b1fd6ece3dd05b4.
>
> Commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") reversed the order in which requests are stored in the blk_plug list, which has a significant impact on merging bios with requests already on the plug list. This impact was reported in [1] and can easily be reproduced with a 4k randwrite fio benchmark on an NVMe-based SSD with no filesystem on the disk.
>
> My benchmark is:
>
> fio --time_based --name=benchmark --size=50G --rw=randwrite \
>     --runtime=60 --filename="/dev/nvme1n1" --ioengine=psync \
>     --randrepeat=0 --iodepth=1 --fsync=64 --invalidate=1 \
>     --verify=0 --verify_fatal=0 --blocksize=4k --numjobs=4 \
>     --group_reporting
>
> Run on a 1.9 TiB SSD (180K max IOPS) attached to an i3.16xlarge AWS EC2 instance.
>
> Kernel        | fio BW (MiB/sec)    | I/O size (iostat)
> --------------+---------------------+--------------------
> 6.15.1        | 362                 | 2KiB
> 6.15.1+revert | 660 (+82%)          | 4KiB
> --------------+---------------------+--------------------
I just ran one quick test in my test VM, but can't reproduce it.
Also, I'm curious: why does writeback produce so many 2KiB bios?
Thanks,
Ming
On 11/06/2025 16:10, Ming Lei wrote:
> On Wed, Jun 11, 2025 at 12:14:54PM +0000, Hazem Mohamed Abuelfotoh wrote:
>> This reverts commit e70c301faece15b618e54b613b1fd6ece3dd05b4.
>>
>> Commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") reversed the order in which requests are stored in the blk_plug list, which has a significant impact on merging bios with requests already on the plug list. This impact was reported in [1] and can easily be reproduced with a 4k randwrite fio benchmark on an NVMe-based SSD with no filesystem on the disk.
>>
>> My benchmark is:
>>
>> fio --time_based --name=benchmark --size=50G --rw=randwrite \
>>     --runtime=60 --filename="/dev/nvme1n1" --ioengine=psync \
>>     --randrepeat=0 --iodepth=1 --fsync=64 --invalidate=1 \
>>     --verify=0 --verify_fatal=0 --blocksize=4k --numjobs=4 \
>>     --group_reporting
>>
>> Run on a 1.9 TiB SSD (180K max IOPS) attached to an i3.16xlarge AWS EC2 instance.
>>
>> Kernel        | fio BW (MiB/sec)    | I/O size (iostat)
>> --------------+---------------------+--------------------
>> 6.15.1        | 362                 | 2KiB
>> 6.15.1+revert | 660 (+82%)          | 4KiB
>> --------------+---------------------+--------------------
>
> I just ran one quick test in my test VM, but can't reproduce it.
Possibly you aren't hitting the disk IOPS limit because you are using a more powerful SSD? In this case I am using an i3.16xlarge EC2 instance running AL2023, or maybe fio behaves differently across distributions. On AL2023 we have fio-3.32.
> Also, I'm curious: why does writeback produce so many 2KiB bios?
Good question. Unfortunately I don't have a good answer for why we get 2KiB bios even though I am specifying 4k as the I/O size in fio; this is something we probably should explore further.
Hazem
On 6/11/25 9:10 AM, Ming Lei wrote:
> On Wed, Jun 11, 2025 at 12:14:54PM +0000, Hazem Mohamed Abuelfotoh wrote:
>> This reverts commit e70c301faece15b618e54b613b1fd6ece3dd05b4.
>>
>> Commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") reversed the order in which requests are stored in the blk_plug list, which has a significant impact on merging bios with requests already on the plug list. This impact was reported in [1] and can easily be reproduced with a 4k randwrite fio benchmark on an NVMe-based SSD with no filesystem on the disk.
>>
>> My benchmark is:
>>
>> fio --time_based --name=benchmark --size=50G --rw=randwrite \
>>     --runtime=60 --filename="/dev/nvme1n1" --ioengine=psync \
>>     --randrepeat=0 --iodepth=1 --fsync=64 --invalidate=1 \
>>     --verify=0 --verify_fatal=0 --blocksize=4k --numjobs=4 \
>>     --group_reporting
>>
>> Run on a 1.9 TiB SSD (180K max IOPS) attached to an i3.16xlarge AWS EC2 instance.
>>
>> Kernel        | fio BW (MiB/sec)    | I/O size (iostat)
>> --------------+---------------------+--------------------
>> 6.15.1        | 362                 | 2KiB
>> 6.15.1+revert | 660 (+82%)          | 4KiB
>> --------------+---------------------+--------------------
>
> I just ran one quick test in my test VM, but can't reproduce it.
>
> Also, I'm curious: why does writeback produce so many 2KiB bios?
I was pondering that too; it sounds like a misconfiguration of sorts. But even without that, in a quick synthetic test here locally I do see a lot of missed merges that are solved with the alternative patch I sent out. I strongly suspect it'll fix this issue too.
On 6/11/25 5:14 AM, Hazem Mohamed Abuelfotoh wrote:
> Given that moving the plug list to a singly linked list was done mainly for performance reasons, let's revert commit e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug") for now to mitigate the reported performance regression.
Reverting that commit would break zoned storage support and hence is not an option.
Bart.