On Tue, Feb 18, 2025 at 11:10 PM Qu Wenruo wqu@suse.com wrote:
[BUG] When running generic/417 with a btrfs whose block size < page size
generic/417 -> generic/418
(subpage cases), it always fails.
And the following minimal reproducer is more than enough to trigger it reliably:
workload() { mkfs.btrfs -s 4k -f $dev > /dev/null dmesg -C mount $dev $mnt $fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest ret=$? umount $mnt stop_trace if [ $ret -ne 0 ]; then fail fi }
for (( i = 0; i < 1024; i++)); do echo "=== $i/$runtime ===" workload done
[CAUSE] With extra trace printk added to the following functions:
btrfs_buffered_write()
- Which folio is touched
- The file offset (start) where the buffered write is at
- How many bytes are copied
- The content of the write (the first 2 bytes)
submit_one_sector()
- Which folio is touched
- The position inside the folio
pagecache_isize_extended()
- The parameters of the function itself
- The parameters of the folio_zero_range()
Which are enough to show the problem:
22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101 22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101 22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101 22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101 22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192 22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000 22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101
The tool dio-invalidate-cache will start 3 threads, each doing a buffered write with 0x01 at 4096 * i (i is 0, 1 ,2), do a fsync, then do a direct read, and compare the read buffer with the write buffer.
Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into the page cache.
But at submit_one_sector(), at file offset 4096, the content is zeroed out, mostly by pagecache_isize_extended().
The race happens like this: Thread A is writing into range [4K, 8K). Thread B is writing into range [8K, 12k).
Thread A | Thread B
-------------------------------------+------------------------------------ btrfs_buffered_write() | btrfs_buffered_write() |- old_isize = 4K; | |- old_isize = 4096; |- btrfs_inode_lock() | | |- write into folio range [4K, 8K) | | |- pagecache_isize_extended() | | | extend isize from 4096 to 8192 | | | no folio_zero_range() called | | |- btrfs_inode_lock() | | | |- btrfs_inode_lock() | |- write into folio range [8K, 12K) | |- pagecache_isize_extended() | | calling folio_zero_range(4K, 8K) | | This is caused by the old_isize is | | grabbed too early, without any | | inode lock. | |- btrfs_inode_unlock()
The @old_isize is grabbed without inode lock, causing race between two buffered write threads and making pagecache_isize_extended() to zero range which is still containing cached data.
And this is only affecting subpage btrfs, because for regular blocksize == page size case, the function pagecache_isize_extended() will do nothing if the block size >= page size.
[FIX] Grab the old isize with inode lock hold.
hold -> held
Or easier to understand:
Grab the old i_size while holding the inode lock.
This means each buffered write thread will have a stable view of the old inode size, thus avoid the above race.
Cc: stable@vger.kernel.org Signed-off-by: Qu Wenruo wqu@suse.com
Fixes: 5e8b9ef30392 ("btrfs: move pos increment and pagecache extension to btrfs_buffered_write")
That is the commit that introduced the race, before it we were grabbing i_size and doing the pagecache_isize_extended() call while holding the inode lock.
fs/btrfs/file.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index fd90855fe717..896dc03689d6 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1090,7 +1090,7 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i) u64 lockend; size_t num_written = 0; ssize_t ret;
loff_t old_isize = i_size_read(inode);
loff_t old_isize; unsigned int ilock_flags = 0; const bool nowait = (iocb->ki_flags & IOCB_NOWAIT); unsigned int bdp_flags = (nowait ? BDP_ASYNC : 0);
@@ -1103,6 +1103,13 @@ ssize_t btrfs_buffered_write(struct kiocb *iocb, struct iov_iter *i) if (ret < 0) return ret;
/*
* We can only trust the isize with inode lock hold, or it can race with
Same here, hold -> held
With those things addressed:
Reviewed-by: Filipe Manana fdmanana@suse.com
Thanks.
* other buffered writes and cause incorrect call of
* pagecache_isize_extended() to overwrite existing data.
*/
old_isize = i_size_read(inode);
ret = generic_write_checks(iocb, i); if (ret <= 0) goto out;
-- 2.48.1