> This is for pre-6.4 kernels, as scrub code goes through a huge rework.
>
> [BUG]
> Even before the scrub rework, if we have some corrupted metadata failed
> to be repaired during replace, we still continue replace and let it
> finish just as there is nothing wrong:
>
> BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
> BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
> BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
> BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
> BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
> BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
> BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
> BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
>
> This can lead to unexpected problems for the result fs.
>
> [CAUSE]
> Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
>
> But unlike scrub, dev-replace doesn't really bother to check the scrub
> progress, which records all the errors found during replace.
>
> And even if we checks the progress, we can not really determine which
> errors are minor, which are critical just by the plain numbers.
> (remember we don't treat metadata/data checksum error differently).
>
> This behavior is there from the very beginning.
>
> [FIX]
> Instead of continue the replace, just error out if we hit an unrepaired
> metadata sector.
>
> Now the dev-replace would be rejected with -EIO, to inform the user.
> Although it also means, the fs has some metadata error which can not be
> repaired, the user would be super upset anyway.
If one sector is bad in metadata how much secondary data is damaged?
As someone who recently had to remove, and currently replace a disk.
it is upsetting, if it stopped if we stopped because 0,01% of data is
unrepairable, if we can save the other 99,99%. Can we have it
continue, print an error message to standard out, and a way for the
user to delete or copy it (with som option like -force-delete or
--force-copy) with the error to the new disk?
"Metadata at block 5578752 is damaged and unrepaired. Skipping. Read
`man btrfs-replace` for more info. "
At the end if possible, list files affected by the damaged metadata blocks.
In man answer:
How can the user know what files are connected to the metadata?
How can a user decide what to do with the damaged metadata?
At minimum, can there be some useful info to the info to the error output? like
"Replace has stopped, due to reading unrepaired metadata block, was
working on block 5578752, se `dmesg` for more details. (\s Sorry but
you are currently f..k)"
>
> The new dmesg would look like this:
>
> BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
> BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
> BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
> BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
> BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
> BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
> BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
> BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
>
> CC: stable(a)vger.kernel.org
> Signed-off-by: Qu Wenruo <wqu(a)suse.com>
> ---
> I'm not sure how should we merge this patch.
>
> The misc-next is already merging the new scrub code, but the problem is
> there for all old kernels thus we need such fixes.
>
> Maybe we can merge this fix before the scrub rework, then the rework,
> and finally the better fix using reworked interface?
> ---
> fs/btrfs/scrub.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ef4046a2572c..71f64b9bcd9f 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -195,6 +195,7 @@ struct scrub_ctx {
> struct mutex wr_lock;
> struct btrfs_device *wr_tgtdev;
> bool flush_all_writes;
> + bool has_meta_failed;
>
> /*
> * statistics
> @@ -1380,6 +1381,8 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
> btrfs_err_rl_in_rcu(fs_info,
> "unable to fixup (regular) error at logical %llu on dev %s",
> logical, btrfs_dev_name(dev));
> + if (is_metadata)
> + sctx->has_meta_failed = true;
> }
>
> out:
> @@ -3838,6 +3841,12 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
>
> blk_finish_plug(&plug);
>
> + /*
> + * If we have metadata unable to be repaired, we should error
> + * out the dev-replace.
> + */
> + if (sctx->is_dev_replace && sctx->has_meta_failed && ret >= 0)
> + ret = -EIO;
> if (sctx->is_dev_replace && ret >= 0) {
> int ret2;
>
--
Torstein Eide
Torsteine(a)gmail.com
I sent an email to you yesterday but since I did not get a response,
I thought probably you did not receive it so I decided to send it
again and hopefully I will get a response this time around.
I am a secret admirer and would like to explore the opportunity to
learn more about each other. I know it is strange to contact you
this way and I hope you can forgive me. I am a shy person and
this is the only way I know I could get your attention. I just want
to know what you think and my intention is not to offend you.
I hope we can be friends if that is what you want, although I wish
to be more than just a friend. I know you have a few questions to
ask and I hope I can satisfy some of your curiosity with a few
answers.
I believe in the saying that 'to the world you are just one person,
but to someone special you are the world'. All I want is love,
romantic care and attention from a special companion which I am
hoping would be you.
I hope this message will be the beginning of a long term
communication between us, simply send a reply to this message, it
will make me happy.
Hugs and kisses,
Marion.
The patch titled
Subject: relayfs: fix out-of-bounds access in relay_file_read
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
relayfs-fix-out-of-bounds-access-in-relay_file_read.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Zhang Zhengming <zhang.zhengming(a)h3c.com>
Subject: relayfs: fix out-of-bounds access in relay_file_read
Date: Wed, 19 Apr 2023 12:02:03 +0800
There is a crash in relay_file_read, as the var from
point to the end of last subbuf.
The oops looks something like:
pc : __arch_copy_to_user+0x180/0x310
lr : relay_file_read+0x20c/0x2c8
Call trace:
__arch_copy_to_user+0x180/0x310
full_proxy_read+0x68/0x98
vfs_read+0xb0/0x1d0
ksys_read+0x6c/0xf0
__arm64_sys_read+0x20/0x28
el0_svc_common.constprop.3+0x84/0x108
do_el0_svc+0x74/0x90
el0_svc+0x1c/0x28
el0_sync_handler+0x88/0xb0
el0_sync+0x148/0x180
We get the condition by analyzing the vmcore:
1). The last produced byte and last consumed byte
both at the end of the last subbuf
2). A softirq calls function(e.g __blk_add_trace)
to write relay buffer occurs when an program is calling
relay_file_read_avail().
relay_file_read
relay_file_read_avail
relay_file_read_consume(buf, 0, 0);
//interrupted by softirq who will write subbuf
....
return 1;
//read_start point to the end of the last subbuf
read_start = relay_file_read_start_pos
//avail is equal to subsize
avail = relay_file_read_subbuf_avail
//from points to an invalid memory address
from = buf->start + read_start
//system is crashed
copy_to_user(buffer, from, avail)
Link: https://lkml.kernel.org/r/20230419040203.37676-1-zhang.zhengming@h3c.com
Fixes: 341a7213e5c1 ("kernel/relay.c: fix read_pos error when multiple readers")
Signed-off-by: Zhang Zhengming <zhang.zhengming(a)h3c.com>
Reviewed-by: Zhao Lei <zhao_lei1(a)hoperun.com>
Reviewed-by: Zhou Kete <zhou.kete(a)h3c.com>
Cc: Pengcheng Yang <yangpc(a)wangsu.com>
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
kernel/relay.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- a/kernel/relay.c~relayfs-fix-out-of-bounds-access-in-relay_file_read
+++ a/kernel/relay.c
@@ -989,7 +989,8 @@ static size_t relay_file_read_start_pos(
size_t subbuf_size = buf->chan->subbuf_size;
size_t n_subbufs = buf->chan->n_subbufs;
size_t consumed = buf->subbufs_consumed % n_subbufs;
- size_t read_pos = consumed * subbuf_size + buf->bytes_consumed;
+ size_t read_pos = (consumed * subbuf_size + buf->bytes_consumed)
+ % (n_subbufs * subbuf_size);
read_subbuf = read_pos / subbuf_size;
padding = buf->padding[read_subbuf];
_
Patches currently in -mm which might be from zhang.zhengming(a)h3c.com are
relayfs-fix-out-of-bounds-access-in-relay_file_read.patch
The patch titled
Subject: maple_tree: make maple state reusable after mas_empty_area_rev()
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
maple_tree-make-maple-state-reusable-after-mas_empty_area_rev.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: "Liam R. Howlett" <Liam.Howlett(a)oracle.com>
Subject: maple_tree: make maple state reusable after mas_empty_area_rev()
Date: Fri, 14 Apr 2023 10:57:26 -0400
Stop using maple state min/max for the range by passing through pointers
for those values. This will allow the maple state to be reused without
resetting.
Also add some logic to fail out early on searching with invalid
arguments.
Link: https://lkml.kernel.org/r/20230414145728.4067069-1-Liam.Howlett@oracle.com
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <Liam.Howlett(a)oracle.com>
Reported-by: Rick Edgecombe <rick.p.edgecombe(a)intel.com>
Cc: Peng Zhang <zhangpeng.00(a)bytedance.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
lib/maple_tree.c | 27 +++++++++++++--------------
1 file changed, 13 insertions(+), 14 deletions(-)
--- a/lib/maple_tree.c~maple_tree-make-maple-state-reusable-after-mas_empty_area_rev
+++ a/lib/maple_tree.c
@@ -4965,7 +4965,8 @@ not_found:
* Return: True if found in a leaf, false otherwise.
*
*/
-static bool mas_rev_awalk(struct ma_state *mas, unsigned long size)
+static bool mas_rev_awalk(struct ma_state *mas, unsigned long size,
+ unsigned long *gap_min, unsigned long *gap_max)
{
enum maple_type type = mte_node_type(mas->node);
struct maple_node *node = mas_mn(mas);
@@ -5030,8 +5031,8 @@ static bool mas_rev_awalk(struct ma_stat
if (unlikely(ma_is_leaf(type))) {
mas->offset = offset;
- mas->min = min;
- mas->max = min + gap - 1;
+ *gap_min = min;
+ *gap_max = min + gap - 1;
return true;
}
@@ -5309,6 +5310,9 @@ int mas_empty_area(struct ma_state *mas,
unsigned long *pivots;
enum maple_type mt;
+ if (min >= max)
+ return -EINVAL;
+
if (mas_is_start(mas))
mas_start(mas);
else if (mas->offset >= 2)
@@ -5363,6 +5367,9 @@ int mas_empty_area_rev(struct ma_state *
{
struct maple_enode *last = mas->node;
+ if (min >= max)
+ return -EINVAL;
+
if (mas_is_start(mas)) {
mas_start(mas);
mas->offset = mas_data_end(mas);
@@ -5382,7 +5389,7 @@ int mas_empty_area_rev(struct ma_state *
mas->index = min;
mas->last = max;
- while (!mas_rev_awalk(mas, size)) {
+ while (!mas_rev_awalk(mas, size, &min, &max)) {
if (last == mas->node) {
if (!mas_rewind_node(mas))
return -EBUSY;
@@ -5397,17 +5404,9 @@ int mas_empty_area_rev(struct ma_state *
if (unlikely(mas->offset == MAPLE_NODE_SLOTS))
return -EBUSY;
- /*
- * mas_rev_awalk() has set mas->min and mas->max to the gap values. If
- * the maximum is outside the window we are searching, then use the last
- * location in the search.
- * mas->max and mas->min is the range of the gap.
- * mas->index and mas->last are currently set to the search range.
- */
-
/* Trim the upper limit to the max. */
- if (mas->max <= mas->last)
- mas->last = mas->max;
+ if (max <= mas->last)
+ mas->last = max;
mas->index = mas->last - size + 1;
return 0;
_
Patches currently in -mm which might be from Liam.Howlett(a)oracle.com are
maple_tree-make-maple-state-reusable-after-mas_empty_area_rev.patch