[BUG] During development of a minor feature (make sure all btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes triggered new works after btrfs_stop_all_workers().
It turns out that it can even happen without any code modification, just using RAID5 for metadata and the same workload from generic/388 is going to trigger the use-after-free.
[CAUSE] If btrfs hits an error, the fs is marked as error, no new transaction is allowed thus metadata is in a frozen state.
But there are some metadata modification before that error, and they are still in the btree inode page cache.
Since there will be no real transaction commitment, all those dirty folios are just kept as is in the page cache, and they can not be invalidated by invalidate_inode_pages2() call inside close_ctree(), because they are dirty.
And finally after btrfs_stop_all_workers(), we call iput() on btree inode, which triggers writeback of those dirty metadata.
And if the fs is using RAID56 metadata, this will trigger RMW and queue new works into rmw_workers, which is already stopped, causing warning from queue_work() and use-after-free.
[FIX] Add a special handling for write_one_eb(), that if the fs is already in an error state immediately mark the bbio as failure, instead of really submitting them into the device.
This means for test case like generic/388, at iput() those dirty folios will just be discarded without triggering IO.
CC: stable@vger.kernel.org Signed-off-by: Qu Wenruo wqu@suse.com --- fs/btrfs/disk-io.c | 12 ++++++++++++ fs/btrfs/extent_io.c | 11 +++++++++++ 2 files changed, 23 insertions(+)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 0aa7e5d1b05f..8b0fc2df85f1 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4402,11 +4402,23 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
btrfs_put_block_group_cache(fs_info);
+ /* + * If the fs is already in trans aborted case, also trigger writeback of all + * dirty metadata folios. + * Those folios will not reach disk but dicarded directly. + * This is to make sure no dirty folios before iput(), or iput() will + * trigger writeback again, and may even cause extra works queued + * into workqueue. + */ + if (unlikely(BTRFS_FS_ERROR(fs_info))) + filemap_write_and_wait(fs_info->btree_inode->i_mapping); + /* * we must make sure there is not any read request to * submit after we stopping all workers. */ invalidate_inode_pages2(fs_info->btree_inode->i_mapping); + btrfs_stop_all_workers(fs_info);
/* We shouldn't have any transaction open at this point */ diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 870584dde575..a8a53409bb3f 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2246,6 +2246,17 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb, wbc_account_cgroup_owner(wbc, folio, range_len); folio_unlock(folio); } + /* + * If the fs is already in error status, do not submit any writeback + * but immediately finish it. + * This is to avoid iput() triggering dirty folio writeback for + * transaction aborted fses, which can cause extra works into + * already stopped workqueues. + */ + if (unlikely(BTRFS_FS_ERROR(fs_info))) { + btrfs_bio_end_io(bbio, errno_to_blk_status(-EROFS)); + return; + } btrfs_submit_bbio(bbio, 0); }