Hi,
while trying to back up a Dell R7525 system running Debian bookworm/testing using LVM snapshots, I noticed that the system sometimes (but not always) 'freezes' when creating the snapshot.
First I thought this was related to LVM, so I started a thread at
https://listman.redhat.com/archives/linux-lvm/2022-July/026228.html (continued at https://listman.redhat.com/archives/linux-lvm/2022-August/thread.html#26229).
Long story short:
I was even able to reproduce the hang with plain fsfreeze -- see the last strace lines:
[...] 14471 1659449870.984635 openat(AT_FDCWD, "/var/lib/machines", O_RDONLY) = 3
14471 1659449870.984658 newfstatat(3, "", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_EMPTY_PATH) = 0
14471 1659449870.984678 ioctl(3, FIFREEZE
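The strace ends exactly at the FIFREEZE ioctl that never returns. For reference, fsfreeze(8) essentially boils down to opening the mount point and issuing that ioctl; a minimal stand-alone version of the call (the path below is just an example) looks roughly like this:

#include <fcntl.h>
#include <linux/fs.h>      /* FIFREEZE, FITHAW */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/var/lib/machines";
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* this is the call that hangs on the affected kernels */
	if (ioctl(fd, FIFREEZE, 0) < 0)
		perror("FIFREEZE");
	else if (ioctl(fd, FITHAW, 0) < 0)
		perror("FITHAW");
	close(fd);
	return 0;
}

(It needs CAP_SYS_ADMIN, i.e. run it as root against a mounted filesystem.)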
so I started to bisect the kernel and found the following bad commit:
md: add support for REQ_NOWAIT
commit 021a24460dc2 ("block: add QUEUE_FLAG_NOWAIT") added support for checking whether a given bdev supports handling of REQ_NOWAIT or not. Since then commit 6abc49468eea ("dm: add support for REQ_NOWAIT and enable it for linear target") added support for REQ_NOWAIT for dm. This uses a similar approach to incorporate REQ_NOWAIT for md based bios.
This patch was tested using t/io_uring tool within FIO. A nvme drive was partitioned into 2 partitions and a simple raid 0 configuration /dev/md0 was created.
md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0]
      937423872 blocks super 1.2 512k chunks
Before patch:
$ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100
Running top while the above runs:
$ ps -eL | grep $(pidof io_uring)
38396 38396 pts/2    00:00:00 io_uring
38396 38397 pts/2    00:00:15 io_uring
38396 38398 pts/2    00:00:13 iou-wrk-38397
We can see iou-wrk-38397 io worker thread created which gets created when io_uring sees that the underlying device (/dev/md0 in this case) doesn't support nowait.
After patch:
$ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100
Running top while the above runs:
$ ps -eL | grep $(pidof io_uring)
38341 38341 pts/2    00:10:22 io_uring
38341 38342 pts/2    00:10:37 io_uring
After running this patch, we don't see any io worker thread being created which indicated that io_uring saw that the underlying device does support nowait. This is the exact behaviour noticed on a dm device which also supports nowait.
For all the other raid personalities except raid0, we would need to train pieces which involves make_request fn in order for them to correctly handle REQ_NOWAIT.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f51d46d0e7cb5b8494aa534d276a9d8915a2443d
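For context, the contract this commit opts md into is roughly the following -- a simplified sketch, not the exact upstream code: a driver that sets QUEUE_FLAG_NOWAIT promises that a REQ_NOWAIT bio never sleeps in its submission path and is instead completed with BLK_STS_AGAIN when it would have to block, and io_uring relies on that instead of punting every request to an iou-wrk worker thread.

/* simplified sketch of the REQ_NOWAIT contract (not the actual md code) */
#include <linux/bio.h>
#include <linux/blkdev.h>

static void example_advertise_nowait(struct request_queue *q)
{
	/* tell upper layers (e.g. io_uring) that REQ_NOWAIT bios are handled */
	blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);
}

/* 'would_block' is just a stand-in for "this submission would have to sleep" */
static void example_submit(struct bio *bio, bool would_block)
{
	/* a nowait bio must never sleep; fail it with -EAGAIN semantics instead */
	if ((bio->bi_opf & REQ_NOWAIT) && would_block) {
		bio_wouldblock_error(bio);	/* completes the bio with BLK_STS_AGAIN */
		return;
	}
	/* ... normal (possibly blocking) submission path ... */
}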
After reverting this commit (and the follow-up commit 0f9650bd838efe5c52f7e5f40c3204ad59f1964d), v5.18.15 and v5.19 worked for me again.
At this point I still wonder why I experienced the same problem even after I removed one nvme device from the mdraid array and tested it separately -- so maybe there is another nowait/REQ_NOWAIT problem somewhere. During the bisect I only tested against the mdraid array.
#regzbot introduced: f51d46d0e7cb5b8494aa534d276a9d8915a2443d
#regzbot link: https://listman.redhat.com/archives/linux-lvm/2022-July/026228.html
#regzbot link: https://listman.redhat.com/archives/linux-lvm/2022-August/thread.html#26229
Hi,
any news on this? Is there anything else you need from me, or anything I can help with?
Thanks.
Hi, this is your Linux kernel regression tracker. Top-posting for once, to make this easily accessible to everyone.
[CCing Jens, as the top-level maintainer who in this case also reviewed the patch that causes this regression.]
Vishal, Song, what up here? Could you please look into this and at least comment on the issue, as it's a regression that was reported more than 10 days ago already. Ideally at this point it would be good if the regression was fixed already, as explained by "Prioritize work on fixing regressions" here: https://docs.kernel.org/process/handling-regressions.html#prioritize-work-on...
Ciao, Thorsten
On 11.08.22 14:34, Thomas Deutschmann wrote:
Hi,
any news on this? Is there anything else you need from me, or anything I can help with?
Thanks.
-- Regards, Thomas

-----Original Message-----
From: Thomas Deutschmann whissi@whissi.de
Sent: Wednesday, August 3, 2022 4:35 PM
To: vverma@digitalocean.com; song@kernel.org
Cc: stable@vger.kernel.org; regressions@lists.linux.dev
Subject: [REGRESSION] v5.17-rc1+: FIFREEZE ioctl system call hangs

[...]
Just saw this. I’m trying to understand whether this happens only on md array or individual nvme drives (without any raid) too? The commit you pointed added REQ_NOWAIT for md based arrays, but if it is happening on individual nvme drives then that could point to something with REQ_NOWAIT I think.
On Aug 15, 2022, at 3:58 AM, Thorsten Leemhuis regressions@leemhuis.info wrote:
Hi, this is your Linux kernel regression tracker. Top-posting for once, to make this easily accessible to everyone.
[CCing Jens, as the top-level maintainer who in this case also reviewed the patch that causes this regression.]
Vishal, Song, what up here? Could you please look into this and at least comment on the issue, as it's a regression that was reported more than 10 days ago already. Ideally at this point it would be good if the regression was fixed already, as explained by "Prioritize work on fixing regressions" here: https://docs.kernel.org/process/handling-regressions.html#prioritize-work-on...
Ciao, Thorsten
On 11.08.22 14:34, Thomas Deutschmann wrote:
Hi,
any news on this? Is there anything else you need from me, or anything I can help with?
Thanks.
-- Regards, Thomas

-----Original Message-----
From: Thomas Deutschmann whissi@whissi.de
Sent: Wednesday, August 3, 2022 4:35 PM
To: vverma@digitalocean.com; song@kernel.org
Cc: stable@vger.kernel.org; regressions@lists.linux.dev
Subject: [REGRESSION] v5.17-rc1+: FIFREEZE ioctl system call hangs

[...]
On Mon, Aug 15, 2022 at 8:46 AM Vishal Verma vverma@digitalocean.com wrote:
Just saw this. I’m trying to understand whether this happens only on md array or individual nvme drives (without any raid) too? The commit you pointed added REQ_NOWAIT for md based arrays, but if it is happening on individual nvme drives then that could point to something with REQ_NOWAIT I think.
Agreed with this analysis.
On Aug 15, 2022, at 3:58 AM, Thorsten Leemhuis regressions@leemhuis.info wrote:
Hi, this is your Linux kernel regression tracker. Top-posting for once, to make this easily accessible to everyone.
[CCing Jens, as the top-level maintainer who in this case also reviewed the patch that causes this regression.]
Vishal, Song, what up here? Could you please look into this and at least comment on the issue, as it's a regression that was reported more than 10 days ago already. Ideally at this point it would be good if the regression was fixed already, as explained by "Prioritize work on fixing regressions" here: https://docs.kernel.org/process/handling-regressions.html#prioritize-work-on...
I am sorry for the delay.
[...]
Hi,
any news on this? Is there anything else you need from me, or anything I can help with?
Thanks.
-- Regards, Thomas

-----Original Message-----
From: Thomas Deutschmann whissi@whissi.de
Sent: Wednesday, August 3, 2022 4:35 PM
To: vverma@digitalocean.com; song@kernel.org
Cc: stable@vger.kernel.org; regressions@lists.linux.dev
Subject: [REGRESSION] v5.17-rc1+: FIFREEZE ioctl system call hangs

[...]
so I started to bisect the kernel and found the following bad commit:
I am not able to reproduce this on 5.19+ kernel. I have:
[root@eth50-1 ~]# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sr0      11:0    1 1024M  0 rom
vda     253:0    0   32G  0 disk
├─vda1  253:1    0    2G  0 part  /boot
└─vda2  253:2    0   30G  0 part  /
nvme0n1 259:0    0    4G  0 disk
└─md0     9:0    0   12G  0 raid5 /root/mnt
nvme2n1 259:1    0    4G  0 disk
└─md0     9:0    0   12G  0 raid5 /root/mnt
nvme3n1 259:2    0    4G  0 disk
└─md0     9:0    0   12G  0 raid5 /root/mnt
nvme1n1 259:3    0    4G  0 disk
└─md0     9:0    0   12G  0 raid5 /root/mnt
[root@eth50-1 ~]# for x in {1..100} ; do fsfreeze --unfreeze /root/mnt ; fsfreeze --freeze /root/mnt ; done
Did I miss something?
Thanks, Song
[...]
Hi,
On 2022-08-17 08:19, Song Liu wrote:
On Mon, Aug 15, 2022 at 8:46 AM Vishal Verma vverma@digitalocean.com wrote:
Just saw this. I’m trying to understand whether this happens only on md array or individual nvme drives (without any raid) too? The commit you pointed added REQ_NOWAIT for md based arrays, but if it is happening on individual nvme drives then that could point to something with REQ_NOWAIT I think.
Agreed with this analysis.
I bisected again, this time I tested against the single nvme device.
I did it 2 times, and always ended up with
git bisect start
# good: [8bb7eca972ad531c9b149c0a51ab43a417385813] Linux 5.15
git bisect good 8bb7eca972ad531c9b149c0a51ab43a417385813
# bad: [df0cc57e057f18e44dac8e6c18aba47ab53202f9] Linux 5.16
git bisect bad df0cc57e057f18e44dac8e6c18aba47ab53202f9
# good: [2219b0ceefe835b92a8a74a73fe964aa052742a2] Merge tag 'soc-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect good 2219b0ceefe835b92a8a74a73fe964aa052742a2
# good: [206825f50f908771934e1fba2bfc2e1f1138b36a] Merge tag 'mtd/for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
git bisect good 206825f50f908771934e1fba2bfc2e1f1138b36a
# bad: [4e1fddc98d2585ddd4792b5e44433dcee7ece001] tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows
git bisect bad 4e1fddc98d2585ddd4792b5e44433dcee7ece001
# good: [dbf49896187fd58c577fa1574a338e4f3672b4b2] Merge branch 'akpm' (patches from Andrew)
git bisect good dbf49896187fd58c577fa1574a338e4f3672b4b2
# good: [0ecca62beb12eeb13965ed602905c8bf53ac93d0] Merge tag 'ceph-for-5.16-rc1' of git://github.com/ceph/ceph-client
git bisect good 0ecca62beb12eeb13965ed602905c8bf53ac93d0
# bad: [7d5775d49e4a488bc8a07e5abb2b71a4c28aadbb] Merge tag 'printk-for-5.16-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
git bisect bad 7d5775d49e4a488bc8a07e5abb2b71a4c28aadbb
# good: [35c8fad4a703fdfa009ed274f80bb64b49314cde] Merge tag 'perf-tools-for-v5.16-2021-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
git bisect good 35c8fad4a703fdfa009ed274f80bb64b49314cde
# good: [6ea45c57dc176dde529ab5d7c4b3f20e52a2bd82] Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
git bisect good 6ea45c57dc176dde529ab5d7c4b3f20e52a2bd82
# bad: [fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf] Linux 5.16-rc1
git bisect bad fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf
# good: [475c3f599582a34e189f047ed3fb7e90a295ea5b] sh: fix READ/WRITE redefinition warnings
git bisect good 475c3f599582a34e189f047ed3fb7e90a295ea5b
# good: [c3b68c27f58a07130382f3fa6320c3652ad76f15] Merge tag 'for-5.16/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
git bisect good c3b68c27f58a07130382f3fa6320c3652ad76f15
# good: [4a6b35b3b3f28df81fea931dc77c4c229cbdb5b2] xfs: sync xfs_btree_split macros with userspace libxfs
git bisect good 4a6b35b3b3f28df81fea931dc77c4c229cbdb5b2
# good: [dee2b702bcf067d7b6b62c18bdd060ff0810a800] kconfig: Add support for -Wimplicit-fallthrough
git bisect good dee2b702bcf067d7b6b62c18bdd060ff0810a800
# first bad commit: [fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf] Linux 5.16-rc1
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
...but this doesn't make any sense, right?
However, I cannot reproduce with the commit before: dee2b702bcf0 didn't freeze during my 10 test runs. But with fa55b7dcdc (or any later commit), the system freezes on _every_ test run?!
I checked out 1bd297988b75, which never failed before, changed the Makefile to PATCHLEVEL=16 and EXTRAVERSION=-rc1, and guess what: it's now failing, too.
So it sounds like some code changes its behavior when the kernel version (KV) is >= 5.16-rc1. Is that possible?
Anyway, I tested v5.10 (with PATCHLEVEL=16 and EXTRAVERSION=-rc1 set), which worked, so I started another bisect session in which I set the KV of every build to 5.16-rc1.
I'll post my findings when this session is completed.
I am not able to reproduce this on 5.19+ kernel. I have:
[...]
Did I miss something?
Well, your reproducer won't trigger the problem. As written in my initial mail, executing `fsfreeze --freeze ...` directly after boot doesn't fail for me either. The device/array must have seen some I/O before the hang can be triggered.
To be more precise:
During my current bisect session (where I set KV to 5.16-rc1 for all kernels), I noticed that my 'reproducer' failed:
To trigger the problem, it is not enough to create random I/O, for example by copying some files.
I am using mysqld (MariaDB 10.6.8) and restoring ~20 GB of SQL dumps -- somehow this triggers the problem reliably. mysqld is using O_DIRECT (https://mariadb.com/kb/en/innodb-system-variables/#innodb_flush_method) -- maybe direct I/O is the trigger.
This process usually takes ~620s on my test system where I am experiencing the problem. After the import I called `fsfreeze --freeze ...` against the mount point used by mysqld. When this command did not return (i.e. fsfreeze was hanging), I marked the revision as bad.
Since setting the KV of all kernels to "5.16-rc1" I have noticed that the import process itself sometimes 'freezes' -- mysqld is still running and responsive (which is not the case when fsfreeze hangs) and `SHOW PROCESSLIST` shows the running imports with a still-increasing time counter, but no data is read or written anymore. Oddly, the fsfreeze command still works when this happens. Anyway, I marked revisions showing this behavior as bad, too.
I'll post my results when I have finished this bisect session.
On 2022-08-17 08:53, Thomas Deutschmann wrote:
I'll post my results when I have finished this bisect session.
I bisected the kernel with KV set to "5.16-rc1":
git bisect start
# good: [2c85ebc57b3e1817b6ce1a6b703928e113a90442] Linux 5.10
git bisect good 2c85ebc57b3e1817b6ce1a6b703928e113a90442
# bad: [8bb7eca972ad531c9b149c0a51ab43a417385813] Linux 5.15
git bisect bad 8bb7eca972ad531c9b149c0a51ab43a417385813
# bad: [6bdf2fbc48f104a84606f6165aa8a20d9a7d9074] Merge tag 'nvme-5.13-2021-05-13' of git://git.infradead.org/nvme into block-5.13
git bisect bad 6bdf2fbc48f104a84606f6165aa8a20d9a7d9074
# good: [02f9fc286e039d0bef7284fb1200ee755b525bde] Merge tag 'pm-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good 02f9fc286e039d0bef7284fb1200ee755b525bde
# bad: [f351f4b63dac127079bbd77da64b2a61c09d522d] usb: xhci-mtk: fix oops when unbind driver
git bisect bad f351f4b63dac127079bbd77da64b2a61c09d522d
# good: [28b9aaac4cc5a11485b6f70656e4e9ead590cf5b] Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
git bisect good 28b9aaac4cc5a11485b6f70656e4e9ead590cf5b
# good: [cf64c2a905e0dabcc473ca70baf275fb3a61fac4] Merge branch 'work.sparc32' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect good cf64c2a905e0dabcc473ca70baf275fb3a61fac4
# bad: [ea6be461cbedefaa881711a43f2842aabbd12fd4] Merge tag 'acpi-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad ea6be461cbedefaa881711a43f2842aabbd12fd4
# good: [1c9077cdecd027714736e70704da432ee2b946bb] Merge tag 'nfs-for-5.12-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
git bisect good 1c9077cdecd027714736e70704da432ee2b946bb
# good: [efba6d3a7c4bb59f0750609fae0f9644d82304b6] Merge tag 'for-5.12/io_uring-2021-02-25' of git://git.kernel.dk/linux-block
git bisect good efba6d3a7c4bb59f0750609fae0f9644d82304b6
# bad: [0b311e34d5033fdcca4c9b5f2d9165b3604704d3] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect bad 0b311e34d5033fdcca4c9b5f2d9165b3604704d3
# good: [5ceabb6078b80a8544ba86d6ee523ad755ae6d5e] Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
git bisect good 5ceabb6078b80a8544ba86d6ee523ad755ae6d5e
# bad: [3ab6608e66b16159c3a3c2d7015b9c11cd3396c1] Merge tag 'block-5.12-2021-02-27' of git://git.kernel.dk/linux-block
git bisect bad 3ab6608e66b16159c3a3c2d7015b9c11cd3396c1
# bad: [e941894eae31b52f0fd9bdb3ce20620afa152f45] io-wq: make buffered file write hashed work map per-ctx
git bisect bad e941894eae31b52f0fd9bdb3ce20620afa152f45
# good: [4379bf8bd70b5de6bba7d53015b0c36c57a634ee] io_uring: remove io_identity
git bisect good 4379bf8bd70b5de6bba7d53015b0c36c57a634ee
# good: [1c0aa1fae1acb77c5f9917adb0e4cb4500b9f3a6] io_uring: flag new native workers with IORING_FEAT_NATIVE_WORKERS
git bisect good 1c0aa1fae1acb77c5f9917adb0e4cb4500b9f3a6
# good: [0100e6bbdbb79404e56939313662b42737026574] arch: ensure parisc/powerpc handle PF_IO_WORKER in copy_thread()
git bisect good 0100e6bbdbb79404e56939313662b42737026574
# good: [8b3e78b5955abb98863832453f5c74eca8f53c3a] io-wq: fix races around manager/worker creation and task exit
git bisect good 8b3e78b5955abb98863832453f5c74eca8f53c3a
# good: [eb2de9418d56b5e6ebf27bad51dbce3e22ee109b] io-wq: fix race around io_worker grabbing
git bisect good eb2de9418d56b5e6ebf27bad51dbce3e22ee109b
# first bad commit: [e941894eae31b52f0fd9bdb3ce20620afa152f45] io-wq: make buffered file write hashed work map per-ctx
From e941894eae31b52f0fd9bdb3ce20620afa152f45
From: Jens Axboe
Date: Fri, 19 Feb 2021 12:33:30 -0700
Subject: io-wq: make buffered file write hashed work map per-ctx
Before the io-wq thread change, we maintained a hash work map and lock per-node per-ring. That wasn't ideal, as we really wanted it to be per ring. But now that we have per-task workers, the hash map ends up being just per-task. That'll work just fine for the normal case of having one task use a ring, but if you share the ring between tasks, then it's considerably worse than it was before.
Make the hash map per ctx instead, which provides full per-ctx buffered write serialization on hashed writes.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
But I think this result is misleading.
As mentioned, the problem I experienced during this bisect session was different (not the FIFREEZE ioctl hang). It sounds more like the already-fixed regressions caused by the commit above, i.e.
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=...
I will do another round with 2b7196a219bf (good) <-> 5.18 (bad).
On 2022-08-17 20:29, Thomas Deutschmann wrote:
I will do another round with 2b7196a219bf (good) <-> 5.18 (bad).
...and this one also ended up in
first bad commit: [fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf] Linux 5.16-rc1
Now I built vanilla 5.18.18, and fsfreeze hangs in the FIFREEZE ioctl system call after running my reproducer, which generates I/O load.
=> So it looks like the bug is still present, right?
When I now just edit the Makefile and set KV < 5.16-rc1, i.e.
diff --git a/Makefile b/Makefile
index 23162e2bdf14..0f344944d828 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 VERSION = 5
-PATCHLEVEL = 18
-SUBLEVEL = 18
+PATCHLEVEL = 15
+SUBLEVEL = 0
 EXTRAVERSION =
 NAME = Superb Owl
then I can no longer reproduce the problem.
Of course,
diff --git a/Makefile b/Makefile
index 23162e2bdf14..0f344944d828 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 VERSION = 5
-PATCHLEVEL = 18
-SUBLEVEL = 18
+PATCHLEVEL = 15
+SUBLEVEL = 99
 EXTRAVERSION =
 NAME = Superb Owl
will freeze again.
For me it looks like the kernel is taking a different code path depending on the KV, but I don't know how to proceed. Any idea how to continue debugging this?
On Thu, Aug 18, 2022 at 7:46 PM Thomas Deutschmann whissi@whissi.de wrote:
On 2022-08-17 20:29, Thomas Deutschmann wrote:
I will do another round with 2b7196a219bf (good) <-> 5.18 (bad).
[...]
For me it looks like the kernel is taking a different code path depending on the KV, but I don't know how to proceed. Any idea how to continue debugging this?
Hmm.. does the user space use different logic based on the kernel version?
I still cannot reproduce the issue. Have you tried to reproduce the issue without mysqld? Something with fio will be great.
Thanks, Song
On 2022-08-20 03:04, Song Liu wrote:
Hmm.. does the user space use different logic based on the kernel version?
I still cannot reproduce the issue. Have you tried to reproduce the issue without mysqld? Something with fio will be great.
No, I spent the last day trying various fio options but was not able to reproduce the problem that way yet.
I managed to reduce the required mysql I/O -- I can now reproduce after importing a ~150 MB SQL dump instead of 20 GB.
It's also interesting: just hard-killing mysqld, which will cause a recovery on the next start, is already enough to trigger the problem.
I filed ticket with MariaDB to get some input from them, maybe they have an idea for another reproducer: https://jira.mariadb.org/browse/MDEV-29349
Hi,
I can now reproduce using fio:
I looked around in the MariaDB issue tracker and found https://jira.mariadb.org/browse/MDEV-26674 which led me to https://github.com/MariaDB/server/commit/de7db5517de11a58d57d2a41d0bc6f38b6f... -- it's a conditional based on $KV, and I hit that kernel regression during one of my bisect attempts (see https://lore.kernel.org/all/701f3fc0-2f0c-a32c-0d41-b489a9a59b99@whissi.de/).
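That conditional also explains the strange KV-dependent behavior I saw earlier: MariaDB decides at runtime, based on the kernel version reported by uname(), whether to use its io_uring-based native AIO at all. Roughly like the following sketch -- the version range below is only a placeholder, the real check in the linked MariaDB commit is more fine-grained and apparently also sublevel-aware (which would explain why SUBLEVEL = 0 vs. 99 made a difference):

#include <stdio.h>
#include <sys/utsname.h>

/* illustrative runtime kernel-version gate (placeholder range, not MariaDB's real one) */
static int io_uring_is_trusted(void)
{
	struct utsname u;
	int major = 0, minor = 0;

	if (uname(&u) || sscanf(u.release, "%d.%d", &major, &minor) != 2)
		return 0;
	/* hypothetical: fall back to libaio on kernels known to misbehave */
	if (major == 5 && minor >= 11 && minor < 16)
		return 0;
	return 1;
}

int main(void)
{
	printf("native AIO via io_uring: %s\n",
	       io_uring_is_trusted() ? "enabled" : "disabled");
	return 0;
}

Depending on where the reported version falls relative to that range, the server silently switches between io_uring and libaio, which is why simply editing PATCHLEVEL/SUBLEVEL changed whether a kernel looked "good" or "bad".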
Setting innodb_use_native_aio=OFF prevents the problem.
This helped me find https://github.com/axboe/fio/issues/1195, so I now have a working fio reproducer.
$ cat reproducer.fio
[global]
direct=1
thread=1
norandommap=1
group_reporting=1
time_based=1
ioengine=io_uring
rw=randwrite
bs=4096
runtime=20
numjobs=1
fixedbufs=1
hipri=1
registerfiles=1
sqthread_poll=1

[filename0]
directory=/srv/machines/fio
size=200M
iodepth=1
cpus_allowed=20
...now call fio like "fio reproducer.fio". After one successful fio run, fsfreeze will already hang for me.
On Mon, Aug 22, 2022 at 9:30 AM Thomas Deutschmann whissi@whissi.de wrote:
Hi,
I can now reproduce using fio:
[...]
...now call fio like "fio reproducer.fio". After one successful fio run, fsfreeze will already hang for me.
Hmm.. I still cannot repro the hang in my test. I have:
[root@eth50-1 ~]# mount | grep mnt
/dev/md0 on /root/mnt type ext4 (rw,relatime,stripe=384)
[root@eth50-1 ~]# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sr0      11:0    1 1024M  0 rom
vda     253:0    0   32G  0 disk
├─vda1  253:1    0    2G  0 part  /boot
└─vda2  253:2    0   30G  0 part  /
nvme0n1 259:0    0    4G  0 disk
└─md0     9:0    0   12G  0 raid5 /root/mnt
nvme2n1 259:1    0    4G  0 disk
└─md0     9:0    0   12G  0 raid5 /root/mnt
nvme3n1 259:2    0    4G  0 disk
└─md0     9:0    0   12G  0 raid5 /root/mnt
nvme1n1 259:3    0    4G  0 disk
└─md0     9:0    0   12G  0 raid5 /root/mnt
[root@eth50-1 ~]# history
  381  fio iou/repro.fio
  382  fsfreeze --freeze /root/mnt
  383  fsfreeze --unfreeze /root/mnt
  384  fio iou/repro.fio
  385  fsfreeze --freeze /root/mnt
  386  fsfreeze --unfreeze /root/mnt
^^^^^^^^^^^^^^ all works fine.
Did I miss something?
Thanks, Song
On 2022-08-22 23:52, Song Liu wrote:
Hmm.. I still cannot repro the hang in my test. I have:
[...]
Did I miss something?
No :(
I am currently not testing against the mdraid but this shouldn't matter.
However, it looks like you don't test on bare metal, do you?
I tried to test on VMware Workstation 16 myself but VMware's nvme implementation is currently broken (https://github.com/vmware/open-vm-tools/issues/579).
On Mon, Aug 22, 2022 at 3:44 PM Thomas Deutschmann whissi@whissi.de wrote:
On 2022-08-22 23:52, Song Liu wrote:
Hmm.. I still cannot repro the hang in my test. I have:
[...]
Did I miss something?
No :(
I am currently not testing against the mdraid but this shouldn't matter.
However, it looks like you don't test on bare metal, do you?
I tried to test on VMware Workstation 16 myself but VMware's nvme implementation is currently broken (https://github.com/vmware/open-vm-tools/issues/579).
I am testing with QEMU emulator version 6.2.0. I can also test with bare metal.
Thanks, Song
On Mon, Aug 22, 2022 at 3:59 PM Song Liu song@kernel.org wrote:
On Mon, Aug 22, 2022 at 3:44 PM Thomas Deutschmann whissi@whissi.de wrote:
[...]
I am testing with QEMU emulator version 6.2.0. I can also test with bare metal.
OK, now I got a repro with bare metal: nvme+xfs.
This is a 5.19-based kernel; the stack is:
[ 867.091579] INFO: task fsfreeze:49972 blocked for more than 122 seconds.
[ 867.104969]       Tainted: G S                5.19.0-0_fbk0_rc1_gc225658be66e #1
[ 867.119750] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 867.135381] task:fsfreeze        state:D stack:    0 pid:49972 ppid: 22571 flags:0x00004000
[ 867.135388] Call Trace:
[ 867.135390]  <TASK>
[ 867.135394]  __schedule+0x3d7/0x700
[ 867.135404]  schedule+0x39/0x90
[ 867.135409]  percpu_down_write+0x234/0x270
[ 867.135414]  freeze_super+0x8a/0x160
[ 867.135422]  do_vfs_ioctl+0x8b5/0x920
[ 867.135430]  __x64_sys_ioctl+0x52/0xb0
[ 867.135435]  do_syscall_64+0x3d/0x90
[ 867.135441]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 867.135447] RIP: 0033:0x7f034f23fcdb
[ 867.135453] RSP: 002b:00007ffe2bdfebf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 867.135457] RAX: ffffffffffffffda RBX: 0000000000000066 RCX: 00007f034f23fcdb
[ 867.135460] RDX: 0000000000000000 RSI: 00000000c0045877 RDI: 0000000000000003
[ 867.135463] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000000000000
[ 867.135466] R10: 0000000000001000 R11: 0000000000000246 R12: 00007ffe2bdff334
[ 867.135469] R13: 00005650ff68dc40 R14: ffffffff00000000 R15: 00005650ff68c0f5
[ 867.135474]  </TASK>
I am not very familiar with this code, so I will need more time to look into it.
Thomas, have you tried to bisect with the fio repro?
Thanks, Song
On 2022-08-23 03:37, Song Liu wrote:
Thomas, have you tried to bisect with the fio repro?
Yes, just finished:
d32d3d0b47f7e34560ae3c55ddfcf68694813501 is the first bad commit
commit d32d3d0b47f7e34560ae3c55ddfcf68694813501
Author: Christoph Hellwig
Date:   Mon Jun 14 13:17:34 2021 +0200
    nvme-multipath: set QUEUE_FLAG_NOWAIT

    The nvme multipathing code just dispatches bios to one of the blk-mq based paths and never blocks on its own, so set QUEUE_FLAG_NOWAIT to support REQ_NOWAIT bios.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
So another NOWAIT issue -- similar to the bad commit which is causing the mdraid issue I already found (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...).
Reverting the commit, i.e. deleting
blk_queue_flag_set(QUEUE_FLAG_NOWAIT, head->disk->queue);
fixes the problem for me. Well, sort of. Looks like this will disable io_uring. fio reproducer fails with
$ fio reproducer.fio
filename0: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=1
fio-3.30
Starting 1 thread
fio: io_u error on file /srv/machines/fio/filename0.0.0: Operation not supported: write offset=12648448, buflen=4096
fio: pid=1585, err=95/file:io_u.c:1846, func=io_u error, error=Operation not supported
My MariaDB reproducer also doesn't trigger the problem anymore, but probably for the same reason -- it cannot use io_uring anymore.
On Mon, Aug 22, 2022 at 8:15 PM Thomas Deutschmann whissi@whissi.de wrote:
On 2022-08-23 03:37, Song Liu wrote:
Thomas, have you tried to bisect with the fio repro?
Yes, just finished:
[...]
So another NOWAIT issue -- similar to the bad commit which is causing the mdraid issue I already found (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...).
Reverting the commit, i.e. deleting
blk_queue_flag_set(QUEUE_FLAG_NOWAIT, head->disk->queue);
fixes the problem for me. Well, sort of. Looks like this will disable io_uring. fio reproducer fails with
My system doesn't have multipath enabled. I guess bisect will point to something else here.
I am afraid we won't get more information from bisect.
Thanks, Song
On Tue, Aug 23, 2022 at 10:13 AM Song Liu song@kernel.org wrote:
On Mon, Aug 22, 2022 at 8:15 PM Thomas Deutschmann whissi@whissi.de wrote:
[...]
My system doesn't have multipath enabled. I guess bisect will point to something else here.
I am afraid we won't get more information from bisect.
OK, I am able to pinpoint the issue, and Jens found the proper fix for it (see below, also available in [1]). It survived 100 runs of the repro fio job.
Thomas, please give it a try.
Thanks, Song
diff --git c/fs/io_uring.c w/fs/io_uring.c
index 3f8a79a4affa..72a39f5ec5a5 100644
--- c/fs/io_uring.c
+++ w/fs/io_uring.c
@@ -4551,7 +4551,12 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 copy_iov:
 	iov_iter_restore(&s->iter, &s->iter_state);
 	ret = io_setup_async_rw(req, iovec, s, false);
-	return ret ?: -EAGAIN;
+	if (!ret) {
+		if (kiocb->ki_flags & IOCB_WRITE)
+			kiocb_end_write(req);
+		return -EAGAIN;
+	}
+	return 0;
 }
 out_free:
 	/* it's reportedly faster than delegating the null check to kfree() */
[1] https://lore.kernel.org/stable/a603cfc5-9ba5-20c3-3fec-2c4eec4350f7@kernel.d...
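For completeness, the reason a leaked write reference shows up as a hanging FIFREEZE: for regular files the write path takes a superblock write reference before issuing, and freeze_super() (see the call trace earlier in this thread) sits in percpu_down_write() until every such reference has been released. The -EAGAIN punt fixed above skipped that release. A rough sketch of the pairing involved -- kernel context, not the actual fs/io_uring.c code:

#include <linux/fs.h>

/* rough sketch of the reference counting involved (not the real io_uring code) */
static void example_write_begin(struct kiocb *kiocb, struct file *file)
{
	/* every write to a regular file takes this reference;
	 * fsfreeze/FIFREEZE cannot finish until it is dropped again */
	sb_start_write(file_inode(file)->i_sb);
	kiocb->ki_flags |= IOCB_WRITE;
}

static void example_write_end(struct kiocb *kiocb, struct file *file)
{
	/* must run on every exit path -- including the -EAGAIN retry punt
	 * that the patch above fixes -- or freeze_super() waits forever */
	if (kiocb->ki_flags & IOCB_WRITE)
		sb_end_write(file_inode(file)->i_sb);
}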
On 8/25/22 10:47 AM, Song Liu wrote:
On Tue, Aug 23, 2022 at 10:13 AM Song Liu song@kernel.org wrote:
[...]
OK, I am able to pinpoint the issue, and Jens found the proper fix for it (see below, also available in [1]). It survived 100 runs of the repro fio job.
Thomas, please give it a try.
Thanks, Song
diff --git c/fs/io_uring.c w/fs/io_uring.c
index 3f8a79a4affa..72a39f5ec5a5 100644
--- c/fs/io_uring.c
+++ w/fs/io_uring.c
@@ -4551,7 +4551,12 @@ static int io_write(struct io_kiocb *req, unsigned int issue_flags)
 copy_iov:
 	iov_iter_restore(&s->iter, &s->iter_state);
 	ret = io_setup_async_rw(req, iovec, s, false);
-	return ret ?: -EAGAIN;
+	if (!ret) {
+		if (kiocb->ki_flags & IOCB_WRITE)
+			kiocb_end_write(req);
+		return -EAGAIN;
+	}
+	return 0;
This should be 'return ret;' for that last line. I had to double check the ones I did, but they did get it right. But I did a double take when I saw this one :-)
It'll work fine for testing as we won't hit errors here unless we run out of memory, so...
On Thu, Aug 25, 2022 at 12:12 PM Jens Axboe axboe@kernel.dk wrote:
[...]
This should be 'return ret;' for that last line. I had to double check the ones I did, but they did get it right. But I did a double take when I saw this one :-)
Ah, right... "ret ?: -EAGAIN" is a lot of information..
Song
It'll work fine for testing as we won't hit errors here unless we run out of memory, so...
-- Jens Axboe
Hello,
The patch looks good to me -- I cannot reproduce the problem anymore:
I tested for 10 hours against the single NVMe drive and for 10 hours against the mdraid array, so the patch addresses both problems.
Thank you very much!
TWIMC: this mail is primarily sent for documentation purposes and for regzbot, my Linux kernel regression tracking bot. These mails usually contain '#forregzbot' in the subject, to make them easy to spot and filter.
#regzbot fixed-by: e053aaf4da56cbf0afb33a0fda4a62188e2c0637
On 15.08.22 12:58, Thorsten Leemhuis wrote:
[...]