Re: [PATCH] btrfs: fix data overwriting bug during buffered write when block size < page size

19 Feb 2025

On Tue, Feb 18, 2025 at 11:10 PM Qu Wenruo wqu@suse.com wrote:
...
[BUG]
When running generic/417 with a btrfs whose block size < page size
generic/417 -> generic/418
...
(subpage cases), it always fails.
And the following minimal reproducer is more than enough to trigger it
reliably:
workload()
{
        mkfs.btrfs -s 4k -f $dev > /dev/null
        dmesg -C
        mount $dev $mnt
        $fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest
        ret=$?
        umount $mnt
        stop_trace
        if [ $ret -ne 0 ]; then
                fail
        fi
}
for (( i = 0; i < 1024; i++)); do
        echo "=== $i/$runtime ==="
        workload
done
[CAUSE]
With extra trace printk added to the following functions:

btrfs_buffered_write()

Which folio is touched
The file offset (start) where the buffered write is at
How many bytes are copied
The content of the write (the first 2 bytes)


submit_one_sector()

Which folio is touched
The position inside the folio


pagecache_isize_extended()

The parameters of the function itself
The parameters of the folio_zero_range()



Which are enough to show the problem:
22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101
  22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101
  22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101
  22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101
  22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192
  22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000
  22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101
The tool dio-invalidate-cache will start 3 threads, each doing a buffered
write with 0x01 at 4096 * i (i is 0, 1 ,2), do a fsync, then do a direct read,
and compare the read buffer with the write buffer.
Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into
the page cache.
But at submit_one_sector(), at file offset 4096, the content is zeroed
out, mostly by pagecache_isize_extended().
The race happens like this:
 Thread A is writing into range [4K, 8K).
 Thread B is writing into range [8K, 12k).
           Thread A              |         Thread B

-------------------------------------+------------------------------------
btrfs_buffered_write()               | btrfs_buffered_write()
|- old_isize = 4K;                   | |- old_isize = 4096;
|- btrfs_inode_lock()                | |
|- write into folio range [4K, 8K)   | |
|- pagecache_isize_extended()        | |
|  extend isize from 4096 to 8192    | |
|  no folio_zero_range() called      | |
|- btrfs_inode_lock()                | |
                                     | |- btrfs_inode_lock()
                                     | |- write into folio range [8K, 12K)
                                     | |- pagecache_isize_extended()
                                     | |  calling folio_zero_range(4K, 8K)
                                     | |  This is caused by the old_isize is
                                     | |  grabbed too early, without any
                                     | |  inode lock.
                                     | |- btrfs_inode_unlock()
The @old_isize is grabbed without inode lock, causing race between two
buffered write threads and making pagecache_isize_extended() to zero
range which is still containing cached data.
And this is only affecting subpage btrfs, because for regular blocksize
== page size case, the function pagecache_isize_extended() will do
nothing if the block size >= page size.
[FIX]
Grab the old isize with inode lock hold.
hold -> held
Or easier to understand:
Grab the old i_size while holding the inode lock.
...
This means each buffered write thread will have a stable view of the
old inode size, thus avoid the above race.
Cc: stable@vger.kernel.org
Signed-off-by: Qu Wenruo wqu@suse.com
Fixes: 5e8b9ef30392 ("btrfs: move pos increment and pagecache
extension to btrfs_buffered_write")
That is the commit that introduced the race, before it we were
grabbing i_size and doing the pagecache_isize_extended() call while
holding the inode lock.
...

fs/btrfs/file.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index fd90855fe717..896dc03689d6 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1090,7 +1090,7 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
        u64 lockend;
        size_t num_written = 0;
        ssize_t ret;

  loff_t old_isize = i_size_read(inode);




  loff_t old_isize;
  unsigned int ilock_flags = 0;
  const bool nowait = (iocb->ki_flags & IOCB_NOWAIT);
  unsigned int bdp_flags = (nowait ? BDP_ASYNC : 0);



@@ -1103,6 +1103,13 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i)
        if (ret < 0)
                return ret;

  /*


   * We can only trust the isize with inode lock hold, or it can race with



Same here, hold -> held
With those things addressed:
Reviewed-by: Filipe Manana fdmanana@suse.com
Thanks.
...

   * other buffered writes and cause incorrect call of


   * pagecache_isize_extended() to overwrite existing data.


   */


  old_isize = i_size_read(inode);


  ret = generic_write_checks(iocb, i);
  if (ret <= 0)
          goto out;



--
2.48.1

    

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH] btrfs: fix data overwriting bug during buffered write when block size < page size