commit 968f19c5b1b7d5595423b0ac0020cc18dfed8cb5 upstream.
[BUG]
It is a long known bug that VM image on btrfs can lead to data csum
mismatch, if the qemu is using direct-io for the image (this is commonly
known as cache mode 'none').
[CAUSE]
Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
fs is allowed to dirty/modify the folio even if the folio is under
writeback (as long as the address space doesn't have AS_STABLE_WRITES
flag inherited from the block device).
This is a valid optimization to improve the concurrency, and since these
filesystems have no extra checksum on data, the content change is not a
problem at all.
Bu the final write into the image file is handled by btrfs, which needs
the content not to be modified during writeback, or the checksum will
not match the data (checksum is calculated before submitting the bio).
So EXT4/XFS/NTRFS assume they can modify the folio under writeback, but
btrfs requires no modification, this leads to the false csum mismatch.
This is only a controlled example, there are even cases where
multi-thread programs can submit a direct IO write, then another thread
modifies the direct IO buffer for whatever reason.
For such cases, btrfs has no sane way to detect such cases and leads to
false data csum mismatch.
[FIX]
I have considered the following ideas to solve the problem:
- Make direct IO to always skip data checksum
This not only requires a new incompatible flag, as it breaks the
current per-inode NODATASUM flag.
But also requires extra handling for no csum found cases.
And this also reduces our checksum protection.
- Let hardware handle all the checksum
AKA, just nodatasum mount option.
That requires trust for hardware (which is not that trustful in a lot
of cases), and it's not generic at all.
- Always fallback to buffered write if the inode requires checksum
This was suggested by Christoph, and is the solution utilized by this
patch.
The cost is obvious, the extra buffer copying into page cache, thus it
reduces the performance.
But at least it's still user configurable, if the end user still wants
the zero-copy performance, just set NODATASUM flag for the inode
(which is a common practice for VM images on btrfs).
Since we cannot trust user space programs to keep the buffer
consistent during direct IO, we have no choice but always falling back
to buffered IO. At least by this, we avoid the more deadly false data
checksum mismatch error.
CC: stable(a)vger.kernel.org # 6.12+
Suggested-by: Christoph Hellwig <hch(a)infradead.org>
Reviewed-by: Filipe Manana <fdmanana(a)suse.com>
Signed-off-by: Qu Wenruo <wqu(a)suse.com>
Reviewed-by: David Sterba <dsterba(a)suse.com>
Signed-off-by: David Sterba <dsterba(a)suse.com>
---
fs/btrfs/direct-io.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 8567af46e16f..eacbb74bf6bc 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -855,6 +855,22 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
goto buffered;
}
+ /*
+ * We can't control the folios being passed in, applications can write
+ * to them while a direct IO write is in progress. This means the
+ * content might change after we calculated the data checksum.
+ * Therefore we can end up storing a checksum that doesn't match the
+ * persisted data.
+ *
+ * To be extra safe and avoid false data checksum mismatch, if the
+ * inode requires data checksum, just fallback to buffered IO.
+ * For buffered IO we have full control of page cache and can ensure
+ * no one is modifying the content during writeback.
+ */
+ if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
+ btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
+ goto buffered;
+ }
/*
* The iov_iter can be mapped to the same file range we are writing to.
--
2.49.0
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.1.y
git checkout FETCH_HEAD
git cherry-pick -x 687b2bae0efff9b25e071737d6af5004e6e35af5
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025051212-antirust-outshoot-07f7@gregkh' --subject-prefix 'PATCH 6.1.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 687b2bae0efff9b25e071737d6af5004e6e35af5 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Wed, 7 May 2025 07:34:24 -0600
Subject: [PATCH] io_uring: ensure deferred completions are flushed for
multishot
Multishot normally uses io_req_post_cqe() to post completions, but when
stopping it, it may finish up with a deferred completion. This is fine,
except if another multishot event triggers before the deferred completions
get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
as new multishot completions get posted before the deferred ones are
flushed. This can cause confusion on the application side, if strict
ordering is required for the use case.
When multishot posting via io_req_post_cqe(), flush any pending deferred
completions first, if any.
Cc: stable(a)vger.kernel.org # 6.1+
Reported-by: Norman Maurer <norman_maurer(a)apple.com>
Reported-by: Christian Mazakas <christian.mazakas(a)gmail.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 769814d71153..541e65a1eebf 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
struct io_ring_ctx *ctx = req->ctx;
bool posted;
+ /*
+ * If multishot has already posted deferred completions, ensure that
+ * those are flushed first before posting this one. If not, CQEs
+ * could get reordered.
+ */
+ if (!wq_list_empty(&ctx->submit_state.compl_reqs))
+ __io_submit_flush_completions(ctx);
+
lockdep_assert(!io_wq_current_is_worker());
lockdep_assert_held(&ctx->uring_lock);
Hi stable(a)vger.kernel.org,
Could you please apply:
1. Commit a7208610761ae ("Bluetooth: btmtk: Remove resetting mt7921
before downloading the fw") to v6.12.x (it's already in
v6.14).
2. Commit 33634e2ab7c6 ("Bluetooth: btmtk: Remove the resetting step
before downloading the fw") to v6.12.x and v6.14.x.
These fixes address an issue with some audio interfaces failing to
initialise during boot on kernels 6.11+. As noted in my original
analysis below, the MediaTek Bluetooth controller reset increases the
device setup time from ~200ms to ~20s and can interfere with other USB
devices on the bus.
Thanks,
Geoffrey.
On Fri, Mar 28, 2025 at 08:44:49AM +1030, Geoffrey D. Bennett wrote:
> Hi all,
>
> Sorry, I see that an identical patch has already been applied to
> bluetooth-next
> https://lore.kernel.org/linux-bluetooth/20250315022730.11071-1-hao.qin@medi…
>
> While I'm glad the issue is being addressed, my original patch
> https://lore.kernel.org/linux-bluetooth/Z8ybV04CVUfVAykH@m.b4.vu/
> contained useful context and tags that didn't make it into the final
> commit.
>
> For getting this fix into current kernel releases 6.12/6.13/6.14, I
> think the patch needs the "Cc: stable(a)vger.kernel.org" tag that was in
> my original submission but missing from Hao's. Since this is causing
> significant issues for users on kernels 6.11+ (audio interfaces
> failing to work), it's important this gets backported.
>
> Hao, is this something you can do? I think the instructions at
> https://www.kernel.org/doc/html/v6.14/process/stable-kernel-rules.html#opti…
> need to be followed, but I've not done this before.
>
> Thanks,
> Geoffrey.
>
> On Fri, Mar 28, 2025 at 07:22:06AM +1030, Geoffrey D. Bennett wrote:
> > This reverts commit ccfc8948d7e4d93cab341a99774b24586717d89a.
> >
> > The MediaTek Bluetooth controller reset that was added increases the
> > Bluetooth device setup time from ~200ms to ~20s and interferes with
> > other devices on the bus.
> >
> > Three users (with Focusrite Scarlett 2nd Gen 6i6 and 3rd Gen Solo and
> > 4i4 audio interfaces) reported that since 6.11 (which added this
> > commit) their audio interface fails to initialise if connected during
> > boot. Two of the users confirmed they have an MT7922.
> >
> > Errors like this are observed in dmesg for the audio interface:
> >
> > usb 3-4: parse_audio_format_rates_v2v3(): unable to find clock source (clock -110)
> > usb 3-4: uac_clock_source_is_valid(): cannot get clock validity for id 41
> > usb 3-4: clock source 41 is not valid, cannot use
> >
> > The problem only occurs when both devices and kernel modules are
> > present and loaded during system boot, so it can be worked around by
> > connecting the audio interface after booting.
> >
> > Fixes: ccfc8948d7e4 ("Bluetooth: btusb: mediatek: reset the controller before downloading the fw")
> > Closes: https://github.com/geoffreybennett/linux-fcp/issues/24
> > Bisected-by: Benedikt Ziemons <ben(a)rs485.network>
> > Tested-by: Benedikt Ziemons <ben(a)rs485.network>
> > Cc: stable(a)vger.kernel.org
> > Signed-off-by: Geoffrey D. Bennett <g(a)b4.vu>
> > ---
> > Changelog:
> >
> > v1 -> v2:
> >
> > - Updated commit message with additional information.
> > - No change to this patch's diff.
> > - Dropped alternate patch that only reverted for 0x7922.
> > - Chris, Sean, Hao agreed to reverting the change:
> > https://lore.kernel.org/linux-bluetooth/2025031352-octopus-quadrant-f7ca@gr…
> >
> > drivers/bluetooth/btmtk.c | 10 ----------
> > 1 file changed, 10 deletions(-)
> >
> > diff --git a/drivers/bluetooth/btmtk.c b/drivers/bluetooth/btmtk.c
> > index 68846c5bd4f7..4390fd571dbd 100644
> > --- a/drivers/bluetooth/btmtk.c
> > +++ b/drivers/bluetooth/btmtk.c
> > @@ -1330,13 +1330,6 @@ int btmtk_usb_setup(struct hci_dev *hdev)
> > break;
> > case 0x7922:
> > case 0x7925:
> > - /* Reset the device to ensure it's in the initial state before
> > - * downloading the firmware to ensure.
> > - */
> > -
> > - if (!test_bit(BTMTK_FIRMWARE_LOADED, &btmtk_data->flags))
> > - btmtk_usb_subsys_reset(hdev, dev_id);
> > - fallthrough;
> > case 0x7961:
> > btmtk_fw_get_filename(fw_bin_name, sizeof(fw_bin_name), dev_id,
> > fw_version, fw_flavor);
> > @@ -1345,12 +1338,9 @@ int btmtk_usb_setup(struct hci_dev *hdev)
> > btmtk_usb_hci_wmt_sync);
> > if (err < 0) {
> > bt_dev_err(hdev, "Failed to set up firmware (%d)", err);
> > - clear_bit(BTMTK_FIRMWARE_LOADED, &btmtk_data->flags);
> > return err;
> > }
> >
> > - set_bit(BTMTK_FIRMWARE_LOADED, &btmtk_data->flags);
> > -
> > /* It's Device EndPoint Reset Option Register */
> > err = btmtk_usb_uhw_reg_write(hdev, MTK_EP_RST_OPT,
> > MTK_EP_RST_IN_OUT_OPT);
> > --
> > 2.45.0
> >
> >
> >
The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x 687b2bae0efff9b25e071737d6af5004e6e35af5
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025051212-safeness-barista-b429@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 687b2bae0efff9b25e071737d6af5004e6e35af5 Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Wed, 7 May 2025 07:34:24 -0600
Subject: [PATCH] io_uring: ensure deferred completions are flushed for
multishot
Multishot normally uses io_req_post_cqe() to post completions, but when
stopping it, it may finish up with a deferred completion. This is fine,
except if another multishot event triggers before the deferred completions
get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
as new multishot completions get posted before the deferred ones are
flushed. This can cause confusion on the application side, if strict
ordering is required for the use case.
When multishot posting via io_req_post_cqe(), flush any pending deferred
completions first, if any.
Cc: stable(a)vger.kernel.org # 6.1+
Reported-by: Norman Maurer <norman_maurer(a)apple.com>
Reported-by: Christian Mazakas <christian.mazakas(a)gmail.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 769814d71153..541e65a1eebf 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -848,6 +848,14 @@ bool io_req_post_cqe(struct io_kiocb *req, s32 res, u32 cflags)
struct io_ring_ctx *ctx = req->ctx;
bool posted;
+ /*
+ * If multishot has already posted deferred completions, ensure that
+ * those are flushed first before posting this one. If not, CQEs
+ * could get reordered.
+ */
+ if (!wq_list_empty(&ctx->submit_state.compl_reqs))
+ __io_submit_flush_completions(ctx);
+
lockdep_assert(!io_wq_current_is_worker());
lockdep_assert_held(&ctx->uring_lock);