On Thu, 2024-08-22 at 19:41 +0800, Julian Sun wrote:
Hi, Josef. Thanks for your review and comments. Yeah, you are correct, sorry for the mistake.
I'm thinking if is there any misunderstanding in the following scenario
umount thread: btrfs-cleaner thread: btrfs_run_delayed_iputs() ->run_delayed_iput_locked() ->iput(inode) ->spin_lock(inode) // inode->i_count dec to 0 ->iput_final() ->__inode_add_lru() // mapping_shrinkable() returned false // so iput_final() returned with i_count=0 // note here: the inode still btrfs_kill_super() // in the sb list and I_FREEING ->generic_shutdown_super() // was not set ->evict_inodes() ->spin_unlock(inode) ->spin_lock(inode) // i_count is 0 and // continue running... // I_FREEING was not set ->__btrfs_run_defrag_inode() // so the inode was added to ->btrfs_iget() // dispost list // note the inode is still // in the sb list // and I_WILL_FREE|I_FREEING // was not set ->spin_unlock() ->find_inode() ->spin_lock(inode) ->__iget(); ->spin_unlock(); ->dispose_list() ->evict() ->iput() // note: i_lock was not ->spin_lock(inode) // hold here. ->iput_final() ->spin_unlock() ->evict()
The understanding here is not correct, I will try to analyze it more carefully.
Now, we have two threads simultaneously evicting the same inode, which led to a bug.
The key points here are:
- btrfs-cleaner: iput_final() called __inode_add_lru() and then
return because some reason with i_count=0, and I_FREEING was not set, and most important, i_lock was unlocked, which gives evict_inodes() a perfect chance to put the inode into dispose list. 2. umount thread: The inode (which i_count is 0 and I_FREEING was not set) was added to dispose list. 3. btrfs-cleaner: Before umount thread set I_FREEING flag, btrfs_iget() finded the inode and inc i_count 4. umount thread called evict() and btrfs-cleaner called iput(), which leads to bug.
The above behavior can be confirmed by the debugging log based on commit d30d0e49da71:
diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c index 407ccec3e57e..cc50dd1e8beb 100644 --- a/fs/btrfs/defrag.c +++ b/fs/btrfs/defrag.c @@ -278,6 +278,8 @@ static int __btrfs_run_defrag_inode(struct btrfs_fs_info *fs_info, ret = btrfs_defrag_file(inode, NULL, &range, defrag->transid, BTRFS_DEFRAG_BATCH); sb_end_write(fs_info->sb); + if (inode->i_state & (1<<19)) + pr_info("bug! double evict! crash will happen\n"); iput(inode); if (ret < 0) diff --git a/fs/inode.c b/fs/inode.c index 3a41f83a4ba5..641a828bc2b7 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -461,8 +461,14 @@ static void __inode_add_lru(struct inode *inode, bool rotate) return; if (!(inode->i_sb->s_flags & SB_ACTIVE)) return; - if (!mapping_shrinkable(&inode->i_data)) + if (!mapping_shrinkable(&inode->i_data)) { + if (inode->i_ino == 261) { + pr_info("set inode 0x%px dbg_flag casue mapping_shrinkable\n", inode); + dump_stack(); + WRITE_ONCE(inode->i_state, inode-
i_state|(1<<19));
+ } return; + } if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode-
i_lru))
this_cpu_inc(nr_unused); @@ -623,7 +629,10 @@ void clear_inode(struct inode *inode) xa_unlock_irq(&inode->i_data.i_pages); BUG_ON(!list_empty(&inode->i_data.i_private_list)); BUG_ON(!(inode->i_state & I_FREEING)); - BUG_ON(inode->i_state & I_CLEAR); + if (inode->i_state & I_CLEAR) { + pr_info("inode 0x%px state 0x%lx triggered bug\n", inode, inode->i_state); + BUG(); + } BUG_ON(!list_empty(&inode->i_wb_list)); /* don't need i_lock here, no concurrent mods to i_state */ inode->i_state = I_FREEING | I_CLEAR; @@ -1754,7 +1763,10 @@ void iput(struct inode *inode) { if (!inode) return; - BUG_ON(inode->i_state & I_CLEAR); + if (inode->i_state & I_CLEAR) { + pr_info("inode 0x%px state 0x%lx triggered a bug\n", inode, inode->i_state); + BUG(); + } retry: if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) { if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) { @@ -1764,6 +1776,8 @@ void iput(struct inode *inode) mark_inode_dirty_sync(inode); goto retry; } + if (inode->i_ino == 261 && (inode->i_state &(1<<19))) + pr_info("state is 0x%lx, bug may be triggered\n", inode->i_state); iput_final(inode); } }
The (1<<19) bit was used to debugging.
Here is the log when bug was triggered:
[ 33.028024][ T2451] set inode 0xffff88801fb87738 dbg_flag casue mapping_shrinkable [ 33.028478][ T2451] CPU: 0 PID: 2451 Comm: btrfs-cleaner Not tainted 6.10.0-rc2-00225-gf0deb01d52b3-dirty #132 [ 33.029072][ T2451] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 33.029675][ T2451] Call Trace: [ 33.029873][ T2451] <TASK> [ 33.030046][ T2451] dump_stack_lvl+0x16c/0x1f0 [ 33.030330][ T2451] __inode_add_lru.part.0+0x1ea/0x280 [ 33.030647][ T2451] iput+0x7c7/0x920 [ 33.030876][ T2451] run_delayed_iput_locked+0x136/0x1e0 [ 33.031199][ T2451] btrfs_run_delayed_iputs+0x8e/0x120 [ 33.031516][ T2451] cleaner_kthread+0x2d3/0x480 [ 33.031801][ T2451] ? __pfx_cleaner_kthread+0x10/0x10 [ 33.032113][ T2451] kthread+0x2c1/0x3a0 [ 33.032354][ T2451] ? _raw_spin_unlock_irq+0x23/0x50 [ 33.032659][ T2451] ? __pfx_kthread+0x10/0x10 [ 33.032929][ T2451] ret_from_fork+0x45/0x80 [ 33.033198][ T2451] ? __pfx_kthread+0x10/0x10 [ 33.033469][ T2451] ret_from_fork_asm+0x1a/0x30 [ 33.033756][ T2451] </TASK> [ 33.033978][ T2451] bug! double evict! crash will happen [ 33.034168][ T2332] BTRFS info (device loop0): last unmount of filesystem c9fe44da-de57-406a-8241-57ec7d4412cf [ 33.034295][ T2451] inode 0xffff88801fb87738 state 0x60 triggered a bug [ 33.034332][ T2451] ------------[ cut here ]------------ [ 33.035578][ T2451] kernel BUG at fs/inode.c:1768! [ 33.035873][ T2451] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI [ 33.036286][ T2451] CPU: 0 PID: 2451 Comm: btrfs-cleaner Not tainted 6.10.0-rc2-00225-gf0deb01d52b3-dirty #132 [ 33.036870][ T2451] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 [ 33.037464][ T2451] RIP: 0010:iput+0x896/0x920 // logs ... [ 33.042882][ T2451] PKRU: 55555554 [ 33.043092][ T2451] Call Trace: [ 33.043288][ T2451] <TASK> [ 33.044190][ T2451] ? iput+0x896/0x920 [ 33.047321][ T2451] ? btrfs_run_defrag_inodes+0xb49/0xe60 [ 33.047653][ T2451] btrfs_run_defrag_inodes+0xa40/0xe60 [ 33.047975][ T2451] ? __pfx_btrfs_run_defrag_inodes+0x10/0x10 [ 33.048327][ T2451] ? __pfx___mutex_unlock_slowpath+0x10/0x10 [ 33.048678][ T2451] ? __pfx_do_raw_spin_lock+0x10/0x10 [ 33.048996][ T2451] ? _raw_spin_unlock+0x28/0x50 [ 33.049285][ T2451] ? btrfs_clean_one_deleted_snapshot+0x2b2/0x420 [ 33.049660][ T2451] cleaner_kthread+0x2ee/0x480 [ 33.049941][ T2451] ? __pfx_cleaner_kthread+0x10/0x10 [ 33.050252][ T2451] kthread+0x2c1/0x3a0 [ 33.050492][ T2451] ? _raw_spin_unlock_irq+0x23/0x50 [ 33.050796][ T2451] ? __pfx_kthread+0x10/0x10 [ 33.051067][ T2451] ret_from_fork+0x45/0x80 [ 33.051333][ T2451] ? __pfx_kthread+0x10/0x10 [ 33.051603][ T2451] ret_from_fork_asm+0x1a/0x30 [ 33.051887][ T2451] </TASK> [ 33.052066][ T2451] Modules linked in: [ 33.052316][ T2451] ---[ end trace 0000000000000000 ]--- [ 33.052634][ T2451] RIP: 0010:iput+0x896/0x920
Please note inode 0xffff88801fb87738, which was set debug flag by __inode_add_lru() in the run_delayed_iput_locked() code path. This indicates that the inode was not actually evicted. Then, please note the double evict log, which shows that this inode which was set debug flag and was not be evicted is still on the sb list and was successfully acquired by btrfs_iget(). Next, take note of the unmount log, indicating that there was a concurrent unmount thread which has already called evict_inodes() at that time. Finally, observe where the BUG was triggered: the state of inode 0xffff88801fb87738 had been set to 0x60 by the unmount thread calling inode_clear(), and at the same time, btrfs-cleaner called iput(), ultimately leading to the BUG at the beginning check of iput(). Although the point of trigger differs from the one reported by syzbot, there is no doubt that they are the same bug.
You could apply the above diff and try to reproduce the bug using the reproducer[1].
If there is anything I'm missing or more info was needed, please let me know.
Thanks.