On Tue, Jan 30, 2024 at 6:41 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
Hi, Blazej!
On 2024/01/31 0:26, Blazej Kucman wrote:
Hi,
On Fri, 26 Jan 2024 08:46:10 -0700 Dan Moulding <dan@danm.net> wrote:
That's a good suggestion, so I switched it to use XFS. The hang still reproduces, so this is probably a different problem from the known ext4 one.
Our daily mdadm/md tests have also detected a problem with symptoms identical to those described in this thread.
The issue was detected with IMSM metadata, but it also reproduces with native metadata. NVMe disks behind a VMD controller were used.
Scenario:
1. Create raid10:
   mdadm --create /dev/md/r10d4s128-15_A --level=10 --chunk=128 --raid-devices=4 /dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1 --size=7864320 --run
2. Create a filesystem:
   mkfs.ext4 /dev/md/r10d4s128-15_A
3. Set one raid member faulty:
   mdadm --set-faulty /dev/md/r10d4s128-15_A /dev/nvme3n1
4. Stop raid devices:
   mdadm -Ss
Expected result: the raid stops without kernel hangs or errors.
Actual result: the "mdadm -Ss" command hangs and a hung_task is reported in the OS.
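For convenience, the steps can be collected into a small script (a rough sketch only; the NVMe device names are from our setup and need to be adjusted to the local configuration):

#!/bin/sh
# Rough reproducer sketch; device names are illustrative and must be
# adjusted to the local configuration.
MD=/dev/md/r10d4s128-15_A
DISKS="/dev/nvme6n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme0n1"

# 1. Create raid10
mdadm --create $MD --level=10 --chunk=128 --raid-devices=4 \
      $DISKS --size=7864320 --run

# 2. Create a filesystem
mkfs.ext4 $MD

# 3. Set one raid member faulty
mdadm --set-faulty $MD /dev/nvme3n1

# 4. Stop raid devices; on an affected kernel this command does not
#    return and the hung-task watchdog reports the blocked mdadm task
#    in dmesg.
mdadm -Ss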
Can you test the following patch?
Thanks! Kuai
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e3a56a958b47..a8db84c200fe 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -578,8 +578,12 @@ static void submit_flushes(struct work_struct *ws)
 			rcu_read_lock();
 		}
 	rcu_read_unlock();
-	if (atomic_dec_and_test(&mddev->flush_pending))
+	if (atomic_dec_and_test(&mddev->flush_pending)) {
+		/* The pair is percpu_ref_get() from md_flush_request() */
+		percpu_ref_put(&mddev->active_io);
+
 		queue_work(md_wq, &mddev->flush_work);
+	}
 }
 
 static void md_submit_flush_data(struct work_struct *ws)
This fixes the issue in my tests. Please submit the official patch. Also, we should add a test in mdadm/tests to cover this case.
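Something along these lines could serve as a starting point for the test (just a sketch, assuming the usual $dev0..$dev3 loop devices and the mdadm/check helpers provided by the mdadm test harness; the exact file name and conventions should follow the existing tests):

# Regression test sketch: create a raid10, make a filesystem, fail one
# member, then stop the array. On a kernel without the fix, the final
# "mdadm -S" hangs because the flush path misses a percpu_ref_put on
# active_io, so the harness needs a timeout to turn a hang into a
# test failure.

mdadm -CR $md0 -l10 -n4 -c128 $dev0 $dev1 $dev2 $dev3
check wait

mkfs.ext4 $md0
mdadm $md0 --fail $dev1

mdadm -S $md0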
Thanks, Song