Commit cd4a4ae4683d ("block: don't use blocking queue entered for recursive bio submits") introduced the flag BIO_QUEUE_ENTERED in order split bios bypass the blocking queue entering routine and use the live non-blocking version. It was a result of an extensive discussion in a linux-block thread[0], and the purpose of this change was to prevent a hung task waiting on a reference to drop.
Happens that md raid0 split bios all the time, and more important, it changes their underlying device to the raid member. After the change introduced by this flag's usage, we experience various crashes if a raid0 member is removed during a large write. This happens because the bio reaches the live queue entering function when the queue of the raid0 member is dying.
A simple reproducer of this behavior is presented below: a) Build kernel v5.1-rc7 with CONFIG_BLK_DEV_THROTTLING=y.
b) Create a raid0 md array with 2 NVMe devices as members, and mount it with an ext4 filesystem.
c) Run the following oneliner (supposing the raid0 is mounted in /mnt): (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3; echo 1 > /sys/block/nvme0n1/device/device/remove (whereas nvme0n1 is the 2nd array member)
This will trigger the following warning/oops:
------------[ cut here ]------------ no blkg associated for bio on block-device: nvme0n1 WARNING: CPU: 9 PID: 184 at ./include/linux/blk-cgroup.h:785 generic_make_request_checks+0x4dd/0x690 [...] BUG: unable to handle kernel NULL pointer dereference at 0000000000000155 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI RIP: 0010:blk_throtl_bio+0x45/0x970 [...] Call Trace: generic_make_request_checks+0x1bf/0x690 generic_make_request+0x64/0x3f0 raid0_make_request+0x184/0x620 [raid0] ? raid0_make_request+0x184/0x620 [raid0] ? blk_queue_split+0x384/0x6d0 md_handle_request+0x126/0x1a0 md_make_request+0x7b/0x180 generic_make_request+0x19e/0x3f0 submit_bio+0x73/0x140 [...]
This patch changes raid0 driver to fallback to the "old" blocking queue entering procedure, by clearing the BIO_QUEUE_ENTERED from raid0 bios. This prevents the crashes and restores the regular behavior of raid0 arrays when a member is removed during a large write.
[0] https://marc.info/?l=linux-block&m=152638475806811
Cc: Jens Axboe axboe@kernel.dk Cc: Ming Lei ming.lei@redhat.com Cc: Tetsuo Handa penguin-kernel@I-love.SAKURA.ne.jp Cc: stable@vger.kernel.org # v4.18 Fixes: cd4a4ae4683d ("block: don't use blocking queue entered for recursive bio submits") Signed-off-by: Guilherme G. Piccoli gpiccoli@canonical.com --- drivers/md/raid0.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index f3fb5bb8c82a..d5bdc79e0835 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -547,6 +547,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio) trace_block_bio_remap(bdev_get_queue(rdev->bdev), discard_bio, disk_devt(mddev->gendisk), bio->bi_iter.bi_sector); + bio_clear_flag(bio, BIO_QUEUE_ENTERED); generic_make_request(discard_bio); } bio_endio(bio); @@ -602,6 +603,7 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio) disk_devt(mddev->gendisk), bio_sector); mddev_check_writesame(mddev, bio); mddev_check_write_zeroes(mddev, bio); + bio_clear_flag(bio, BIO_QUEUE_ENTERED); generic_make_request(bio); return true; }