On Tue, Feb 16, 2021 at 04:15:46PM +0100, David Sterba wrote:
On Tue, Feb 16, 2021 at 03:50:36PM +0100, Greg KH wrote:
On Tue, Feb 16, 2021 at 02:40:31PM +0000, fdmanana@kernel.org wrote:
From: Filipe Manana fdmanana@suse.com
Whenever we attempt to do a non-aligned direct IO write with O_DSYNC, we end up triggering an assertion and crashing. Example reproducer:
$ cat test.sh #!/bin/bash
DEV=/dev/sdj MNT=/mnt/sdj
mkfs.btrfs -f $DEV > /dev/null mount $DEV $MNT
# Do a direct IO write with O_DSYNC into a non-aligned range... xfs_io -f -d -s -c "pwrite -S 0xab -b 64K 1111 64K" $MNT/foobar
umount $MNT
When running the reproducer an assertion fails and produces the following trace:
[ 2418.403134] assertion failed: !current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA, in fs/btrfs/space-info.c:1467 [ 2418.403745] ------------[ cut here ]------------ [ 2418.404306] kernel BUG at fs/btrfs/ctree.h:3286! [ 2418.404862] invalid opcode: 0000 [#2] PREEMPT SMP DEBUG_PAGEALLOC PTI [ 2418.405451] CPU: 1 PID: 64705 Comm: xfs_io Tainted: G D 5.10.15-btrfs-next-87 #1 [ 2418.406026] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 2418.407228] RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs] [ 2418.407835] Code: e6 48 c7 (...) [ 2418.409078] RSP: 0018:ffffb06080d13c98 EFLAGS: 00010246 [ 2418.409696] RAX: 000000000000006c RBX: ffff994c1debbf08 RCX: 0000000000000000 [ 2418.410302] RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff [ 2418.410904] RBP: ffff994c21770000 R08: 0000000000000000 R09: 0000000000000000 [ 2418.411504] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000010000 [ 2418.412111] R13: ffff994c22198400 R14: ffff994c21770000 R15: 0000000000000000 [ 2418.412713] FS: 00007f54fd7aff00(0000) GS:ffff994d35200000(0000) knlGS:0000000000000000 [ 2418.413326] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2418.413933] CR2: 000056549596d000 CR3: 000000010b928003 CR4: 0000000000370ee0 [ 2418.414528] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 2418.415109] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 2418.415669] Call Trace: [ 2418.416254] btrfs_reserve_data_bytes.cold+0x22/0x22 [btrfs] [ 2418.416812] btrfs_check_data_free_space+0x4c/0xa0 [btrfs] [ 2418.417380] btrfs_buffered_write+0x1b0/0x7f0 [btrfs] [ 2418.418315] btrfs_file_write_iter+0x2a9/0x770 [btrfs] [ 2418.418920] new_sync_write+0x11f/0x1c0 [ 2418.419430] vfs_write+0x2bb/0x3b0 [ 2418.419972] __x64_sys_pwrite64+0x90/0xc0 [ 2418.420486] do_syscall_64+0x33/0x80 [ 2418.420979] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 2418.421486] RIP: 0033:0x7f54fda0b986 [ 2418.421981] Code: 48 c7 c0 (...) [ 2418.423019] RSP: 002b:00007ffc40569c38 EFLAGS: 00000246 ORIG_RAX: 0000000000000012 [ 2418.423547] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f54fda0b986 [ 2418.424075] RDX: 0000000000010000 RSI: 000056549595e000 RDI: 0000000000000003 [ 2418.424596] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000400 [ 2418.425119] R10: 0000000000000400 R11: 0000000000000246 R12: 00000000ffffffff [ 2418.425644] R13: 0000000000000400 R14: 0000000000010000 R15: 0000000000000000 [ 2418.426148] Modules linked in: btrfs blake2b_generic (...) [ 2418.429540] ---[ end trace ef2aeb44dc0afa34 ]---
At btrfs_file_write_iter() we set current->journal_info to BTRFS_DIO_SYNC_STUB;
We then call __btrfs_direct_write(), which calls btrfs_direct_IO();
We can't do the direct IO write because it starts at a non-aligned offset (1111). So at btrfs_direct_IO() we return -EINVAL (coming from check_direct_IO() which does the alignment check), but we leave current->journal_info set to BTRFS_DIO_SYNC_STUB - we only clear it at btrfs_dio_iomap_begin(), because we assume we always get there;
Then at __btrfs_direct_write() we see that the attempt to do the direct IO write was not successful, 0 bytes written, so we fallback to a buffered write by calling btrfs_buffered_write();
There we call btrfs_check_data_free_space() which in turn calls btrfs_alloc_data_chunk_ondemand() and that calls btrfs_reserve_data_bytes() with flush == BTRFS_RESERVE_FLUSH_DATA;
Then at btrfs_reserve_data_bytes() we have current->journal_info set to BTRFS_DIO_SYNC_STUB, therefore not NULL, and flush has the value BTRFS_RESERVE_FLUSH_DATA, triggering the second assertion:
int btrfs_reserve_data_bytes(struct btrfs_fs_info *fs_info, u64 bytes, enum btrfs_reserve_flush_enum flush) { struct btrfs_space_info *data_sinfo = fs_info->data_sinfo; int ret;
ASSERT(flush == BTRFS_RESERVE_FLUSH_DATA || flush == BTRFS_RESERVE_FLUSH_FREE_SPACE_INODE); ASSERT(!current->journal_info || flush != BTRFS_RESERVE_FLUSH_DATA);
(...)
So fix that by setting the journal to NULL whenever check_direct_IO() returns a failure.
This bug only affects 5.10 kernels, and the regression was introduced in 5.10-rc1 by commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround"). The bug does not exist in 5.11 kernels due to commit ecfdc08b8cc65d ("btrfs: remove dio iomap DSYNC workaround"), which depends on a large patchset that went into the merge window for 5.11. So this is a fix only for 5.10.x stable kernels, as there are people hitting this bug.
Fixes: 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround") CC: stable@vger.kernel.org # 5.10 (and only 5.10) CC: David Sterba dsterba@suse.cz Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1181605 Signed-off-by: Filipe Manana fdmanana@suse.com
fs/btrfs/inode.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-)
As this is a one-off patch, I need the btrfs maintainers to ack this and really justify why we can't take the larger patch or patch series here instead, as that is almost always the correct thing to do instead.
Acked-by: David Sterba dsterba@suse.com
The full backport would be patches
ecfdc08b8cc6 btrfs: remove dio iomap DSYNC workaround a42fa643169d btrfs: call iomap_dio_complete() without inode_lock 502756b38093 btrfs: remove btrfs_inode::dio_sem e9adabb9712e btrfs: use shared lock for direct writes within EOF c35237063340 btrfs: push inode locking and unlocking into buffered/direct write a14b78ad06ab btrfs: introduce btrfs_inode_lock()/unlock() b8d8e1fd570a btrfs: introduce btrfs_write_check()
and maybe more.
$ git diff b8d8e1fd570a^..ecfdc08b8cc6 | diffstat btrfs_inode.h | 10 - ctree.h | 8 + file.c | 338 +++++++++++++++++++++++++++------------------------------- inode.c | 96 +++++++--------- transaction.h | 1 5 files changed, 213 insertions(+), 240 deletions(-)
That seems too much for a backport, the fix Filipe implemented is simpler and IMO qualifies as the exceptional stable-only patch.
Why is that too much? For 7 patches that's a small overall diffstat. And you match identically what is upstream in Linus's tree. That means over time, backporting fixing is much easier, and understanding the code for everyone is simpler.
It's almost always better to track what is in Linus's tree than to do one-off patches as 95% of the time we do one-off patches they are buggy and cause problems as no one else is running them.
So how about sending the above backported series instead please.
thanks,
greg k-h