Commit 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash") introduced a NULL pointer dereference in generic_make_request(). The patch sets q to NULL and enter_succeeded to false; right after, there is an 'if (enter_succeeded)' check that is not taken, and the 'else' branch then dereferences q in blk_queue_dying(q).

This patch just moves the 'q = NULL' assignment to a point where it won't trigger the oops; the semantics of the NULLification remain untouched.
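For reference, here is a condensed sketch of the affected loop in generic_make_request() as of v5.2-rc1, i.e. before this fix, abridged from the code the diff below touches (unrelated lines elided):

    do {
        bool enter_succeeded = true;

        if (unlikely(q != bio->bi_disk->queue)) {
            if (q)
                blk_queue_exit(q);
            q = bio->bi_disk->queue;
            if (blk_queue_enter(q, flags) < 0) {
                enter_succeeded = false;
                q = NULL;    /* q is nullified here... */
            }
        }

        if (enter_succeeded) {
            /* dispatch bio to q->make_request_fn() */
        } else {
            /* ...and dereferenced here when entering failed */
            if (unlikely(!blk_queue_dying(q) &&
                         (bio->bi_opf & REQ_NOWAIT)))
                bio_wouldblock_error(bio);
            else
                bio_io_error(bio);
        }
        bio = bio_list_pop(&bio_list_on_stack[0]);
    } while (bio);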
A simple test case/reproducer is as follows:
a) Build kernel v5.2-rc1 with CONFIG_BLK_CGROUP=n.
b) Create a raid0 md array with 2 NVMe devices as members, and mount it with an ext4 filesystem.
c) Run the following oneliner (supposing the raid0 is mounted in /mnt):
(dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3;
echo 1 > /sys/block/nvme0n1/device/device/remove
(where nvme0n1 is the 2nd array member)
This will trigger the following oops:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
RIP: 0010:generic_make_request+0x32b/0x400
Call Trace:
 submit_bio+0x73/0x140
 ext4_io_submit+0x4d/0x60
 ext4_writepages+0x626/0xe90
 do_writepages+0x4b/0xe0
[...]
This patch has no functional changes and preserves the md/raid0 behavior on member removal as it was before kernel v4.17.
Cc: stable@vger.kernel.org # v4.17
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Eric Ren <renzhengeek@gmail.com>
Fixes: 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash")
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
---
Changes V1->V2:
* Implemented Ming's suggestion (drop {} from if) - thanks Ming!
* Rebased to v5.2-rc1
* Added Reviewed-by/Tested-by tags
Also, Ming mentioned a new patch series[0] that will refactor the legacy IO path, so the bug probably won't happen anymore. Even in this case, I consider this patch important, especially aiming at the stable releases, in which backporting small bugfixes is much simpler than backporting more complex patch sets.
[0] https://lore.kernel.org/linux-block/20190515030310.20393-1-ming.lei@redhat.c...
 block/blk-core.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 419d600e6637..e887915c7804 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1054,10 +1054,8 @@ blk_qc_t generic_make_request(struct bio *bio)
             flags = 0;
             if (bio->bi_opf & REQ_NOWAIT)
                 flags = BLK_MQ_REQ_NOWAIT;
-            if (blk_queue_enter(q, flags) < 0) {
+            if (blk_queue_enter(q, flags) < 0)
                 enter_succeeded = false;
-                q = NULL;
-            }
         }

         if (enter_succeeded) {
@@ -1088,6 +1086,7 @@ blk_qc_t generic_make_request(struct bio *bio)
                 bio_wouldblock_error(bio);
             else
                 bio_io_error(bio);
+            q = NULL;
         }
         bio = bio_list_pop(&bio_list_on_stack[0]);
     } while (bio);
Commit cd4a4ae4683d ("block: don't use blocking queue entered for recursive bio submits") introduced the flag BIO_QUEUE_ENTERED in order to let split bios bypass the blocking queue entering routine and use the live non-blocking version. It was the result of an extensive discussion in a linux-block thread[0], and the purpose of this change was to prevent a hung task waiting on a reference to drop.
It happens that md/raid0 splits bios all the time and, more important, changes their underlying device to the raid member. After the change introduced by this flag's usage, we experience various crashes if a raid0 member is removed during a large write. This happens because the bio reaches the live queue entering function when the queue of the raid0 member is dying.
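To make the mechanism concrete, here is a condensed sketch of the two pieces involved, paraphrased from the v5.2-rc1 sources (comments abbreviated, surrounding code elided; treat it as illustrative rather than the literal code):

    /* block/blk-merge.c, blk_queue_split(), condensed */
    if (split) {
        /*
         * We are recursing into generic_make_request() and already
         * hold a reference to this queue, so mark the bio as having
         * already entered it.
         */
        bio_set_flag(*bio, BIO_QUEUE_ENTERED);

        bio_chain(split, *bio);
        generic_make_request(*bio);
        *bio = split;
    }

    /* block/blk-core.c, top of generic_make_request(), condensed */
    if (bio_flagged(bio, BIO_QUEUE_ENTERED))
        blk_queue_enter_live(q);    /* takes a ref, no dying check */
    else if (blk_queue_enter(q, flags) < 0) {
        if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
            bio_wouldblock_error(bio);
        else
            bio_io_error(bio);
        return ret;
    }

md/raid0 remaps the bio's bi_disk to the member device before resubmitting it, so the flag set at the md level makes generic_make_request() take the live path on the member's queue, one the bio never actually entered and which may already be dying; hence the oops shown below.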
A simple reproducer of this behavior is presented below:
a) Build kernel v5.2-rc1 with CONFIG_BLK_DEV_THROTTLING=y.
b) Create a raid0 md array with 2 NVMe devices as members, and mount it with an ext4 filesystem.
c) Run the following oneliner (supposing the raid0 is mounted in /mnt):
(dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3;
echo 1 > /sys/block/nvme0n1/device/device/remove
(where nvme0n1 is the 2nd array member)
This will trigger the following warning/oops:
------------[ cut here ]------------
no blkg associated for bio on block-device: nvme0n1
WARNING: CPU: 9 PID: 184 at ./include/linux/blk-cgroup.h:785 generic_make_request_checks+0x4dd/0x690
[...]
BUG: unable to handle kernel NULL pointer dereference at 0000000000000155
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
RIP: 0010:blk_throtl_bio+0x45/0x970
[...]
Call Trace:
 generic_make_request_checks+0x1bf/0x690
 generic_make_request+0x64/0x3f0
 raid0_make_request+0x184/0x620 [raid0]
 ? raid0_make_request+0x184/0x620 [raid0]
 ? blk_queue_split+0x384/0x6d0
 md_handle_request+0x126/0x1a0
 md_make_request+0x7b/0x180
 generic_make_request+0x19e/0x3f0
 submit_bio+0x73/0x140
[...]
This patch changes the raid0 driver to fall back to the "old" blocking queue entering procedure by clearing BIO_QUEUE_ENTERED from raid0 bios. This prevents the crashes and restores the regular behavior of raid0 arrays when a member is removed during a large write.
[0] https://marc.info/?l=linux-block&m=152638475806811
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Song Liu <liu.song.a23@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable@vger.kernel.org # v4.18
Fixes: cd4a4ae4683d ("block: don't use blocking queue entered for recursive bio submits")
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
---
No changes from V1, only rebased to v5.2-rc1. Also, notice that if [1] gets merged before this patch, the BIO_QUEUE_ENTERED flag will change to BIO_SPLITTED, so the (easy) conflict will need to be resolved.
[1] https://lore.kernel.org/linux-block/20190515030310.20393-4-ming.lei@redhat.c...
 drivers/md/raid0.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index f3fb5bb8c82a..d5bdc79e0835 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -547,6 +547,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
         trace_block_bio_remap(bdev_get_queue(rdev->bdev),
             discard_bio, disk_devt(mddev->gendisk),
             bio->bi_iter.bi_sector);
+        bio_clear_flag(bio, BIO_QUEUE_ENTERED);
         generic_make_request(discard_bio);
     }
     bio_endio(bio);
@@ -602,6 +603,7 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
             disk_devt(mddev->gendisk), bio_sector);
     mddev_check_writesame(mddev, bio);
     mddev_check_write_zeroes(mddev, bio);
+    bio_clear_flag(bio, BIO_QUEUE_ENTERED);
     generic_make_request(bio);
     return true;
 }
On Mon, May 20, 2019 at 07:09:11PM -0300, Guilherme G. Piccoli wrote:
> No changes from V1, only rebased to v5.2-rc1. Also, notice that if [1]
> gets merged before this patch, the BIO_QUEUE_ENTERED flag will change
> to BIO_SPLITTED, so the (easy) conflict will need to be resolved.
>
> [1] https://lore.kernel.org/linux-block/20190515030310.20393-4-ming.lei@redhat.c...
Actually - that series should also fix your problem and avoid the need for both patches in this series. Can you please test it?
On Tue, May 21, 2019 at 9:56 AM Christoph Hellwig hch@infradead.org wrote:
>> [1] https://lore.kernel.org/linux-block/20190515030310.20393-4-ming.lei@redhat.c...
>
> Actually - that series should also fix your problem and avoid the need
> for both patches in this series. Can you please test it?
Hi Christoph, thanks for looking into this. You're right, this series fixes both issues. The problem I see though is that it relies on legacy IO path removal - for v5.0 and beyond, all fine. But backporting that to v4.17-v4.20 stable series will be quite painful.
My fixes are mostly "oneliners". If we could get both approaches upstream, that'd be perfect! Cheers,
Guilherme
On Tue, May 21, 2019 at 11:10:05AM -0300, Guilherme Piccoli wrote:
> Hi Christoph, thanks for looking into this. You're right, this series
> fixes both issues. The problem I see though is that it relies on legacy
> IO path removal - for v5.0 and beyond, all fine. But backporting that
> to v4.17-v4.20 stable series will be quite painful.
>
> My fixes are mostly "oneliners". If we could get both approaches
> upstream, that'd be perfect!
But they basically just fix code that otherwise gets removed. And the way this patch uses the ENTERED flag from the md code looks slightly sketchy to me. Maybe we want them as stable-only patches.
On Tue, May 21, 2019 at 2:23 PM Christoph Hellwig hch@infradead.org wrote:
> On Tue, May 21, 2019 at 11:10:05AM -0300, Guilherme Piccoli wrote:
>> Hi Christoph, thanks for looking into this. You're right, this series
>> fixes both issues. The problem I see though is that it relies on legacy
>> IO path removal - for v5.0 and beyond, all fine. But backporting that
>> to v4.17-v4.20 stable series will be quite painful.
>>
>> My fixes are mostly "oneliners". If we could get both approaches
>> upstream, that'd be perfect!
>
> But they basically just fix code that otherwise gets removed. And the
> way this patch uses the ENTERED flag from the md code looks slightly
> sketchy to me. Maybe we want them as stable-only patches.
OK, it makes sense to me, if that is a possibility. The first one is clearly a small and non-intrusive fix. The 2nd indeed is more invasive heh
Please let me know how to proceed to have that added at least in stable trees; this would help the distro side of the world a lot hehe
Cheers,
Guilherme
On Mon, May 20, 2019 at 3:10 PM Guilherme G. Piccoli gpiccoli@canonical.com wrote:
> Commit 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently
> with device removal triggers a crash") introduced a NULL pointer
> dereference in generic_make_request(). The patch sets q to NULL and
> enter_succeeded to false; right after, there is an 'if (enter_succeeded)'
> check that is not taken, and the 'else' branch then dereferences q in
> blk_queue_dying(q).
>
> This patch just moves the 'q = NULL' assignment to a point where it
> won't trigger the oops; the semantics of the NULLification remain
> untouched.
>
> A simple test case/reproducer is as follows:
> a) Build kernel v5.2-rc1 with CONFIG_BLK_CGROUP=n.
>
> b) Create a raid0 md array with 2 NVMe devices as members, and mount
> it with an ext4 filesystem.
>
> c) Run the following oneliner (supposing the raid0 is mounted in /mnt):
> (dd of=/mnt/tmp if=/dev/zero bs=1M count=999 &); sleep 0.3;
> echo 1 > /sys/block/nvme0n1/device/device/remove
> (where nvme0n1 is the 2nd array member)
>
> This will trigger the following oops:
>
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
> PGD 0 P4D 0
> Oops: 0000 [#1] SMP PTI
> RIP: 0010:generic_make_request+0x32b/0x400
> Call Trace:
>  submit_bio+0x73/0x140
>  ext4_io_submit+0x4d/0x60
>  ext4_writepages+0x626/0xe90
>  do_writepages+0x4b/0xe0
> [...]
>
> This patch has no functional changes and preserves the md/raid0
> behavior on member removal as it was before kernel v4.17.
>
> Cc: stable@vger.kernel.org # v4.17
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Tested-by: Eric Ren <renzhengeek@gmail.com>
> Fixes: 37f9579f4c31 ("blk-mq: Avoid that submitting a bio concurrently with device removal triggers a crash")
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Applied both patches! Thanks for the fix!
> Changes V1->V2:
> - Implemented Ming's suggestion (drop {} from if) - thanks Ming!
> - Rebased to v5.2-rc1
> - Added Reviewed-by/Tested-by tags
>
> Also, Ming mentioned a new patch series[0] that will refactor the
> legacy IO path, so the bug probably won't happen anymore. Even in this
> case, I consider this patch important, especially aiming at the stable
> releases, in which backporting small bugfixes is much simpler than
> backporting more complex patch sets.
>
> [0] https://lore.kernel.org/linux-block/20190515030310.20393-1-ming.lei@redhat.c...
>
>  block/blk-core.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 419d600e6637..e887915c7804 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1054,10 +1054,8 @@ blk_qc_t generic_make_request(struct bio *bio)
>              flags = 0;
>              if (bio->bi_opf & REQ_NOWAIT)
>                  flags = BLK_MQ_REQ_NOWAIT;
> -            if (blk_queue_enter(q, flags) < 0) {
> +            if (blk_queue_enter(q, flags) < 0)
>                  enter_succeeded = false;
> -                q = NULL;
> -            }
>          }
>
>          if (enter_succeeded) {
> @@ -1088,6 +1086,7 @@ blk_qc_t generic_make_request(struct bio *bio)
>                  bio_wouldblock_error(bio);
>              else
>                  bio_io_error(bio);
> +            q = NULL;
>          }
>          bio = bio_list_pop(&bio_list_on_stack[0]);
>      } while (bio);
> --
> 2.21.0
On Tue, May 21, 2019 at 2:59 AM Song Liu liu.song.a23@gmail.com wrote:
> Applied both patches! Thanks for the fix!
Thanks Song!
On 21/05/2019 02:59, Song Liu wrote:
> Applied both patches! Thanks for the fix!
Hi Song, sorry for the annoyance, but the situation of both patches is a bit confusing to me heheh
You mention you've applied both patches - I couldn't find your tree. Also, Christoph noticed Ming's series fixes both issues and suggested dropping both my patches in favor of Ming's clean-up, or at least making them -stable only.
So, what is the current status of the patches? Can we have them on -stable trees at least? If so, how should I proceed?
Thanks in advance for the clarification! Cheers,
Guilherme
On Thu, May 23, 2019 at 7:36 AM Guilherme G. Piccoli gpiccoli@canonical.com wrote:
> On 21/05/2019 02:59, Song Liu wrote:
>> Applied both patches! Thanks for the fix!
>
> Hi Song, sorry for the annoyance, but the situation of both patches is
> a bit confusing to me heheh
>
> You mention you've applied both patches - I couldn't find your tree.
> Also, Christoph noticed Ming's series fixes both issues and suggested
> dropping both my patches in favor of Ming's clean-up, or at least
> making them -stable only.
>
> So, what is the current status of the patches? Can we have them on
> -stable trees at least? If so, how should I proceed?
>
> Thanks in advance for the clarification! Cheers,
>
> Guilherme
Sorry for the confusion and delay. I will send patches to stable@.
Song
On Thu, May 23, 2019 at 2:06 PM Song Liu liu.song.a23@gmail.com wrote:
> Sorry for the confusion and delay. I will send patches to stable@.
>
> Song
Hi Song, no problem at all! Thanks a lot for the clarification. Cheers,
Guilherme