Re: [PATCH] ceph: fix deadlock bugs by making iput() calls asynchronous

17 Sep 2025

On Wed, Sep 17, 2025 at 02:44:04PM +0200, Max Kellermann wrote:
...
The iput() function is a dangerous one - if the reference counter goes
to zero, the function may block for a long time due to:

inode_wait_for_writeback() waits until writeback on this inode
completes

the filesystem-specific "evict_inode" callback can do similar
things; e.g. all netfs-based filesystems will call
netfs_wait_for_outstanding_io() which is similar to
inode_wait_for_writeback()


Therefore, callers must carefully evaluate the context they're in and
check whether invoking iput() is a good idea at all.
Most of the time, this is not a problem because the dcache holds
references to all inodes, and the dcache is usually the one to release
the last reference.  But this assumption is fragile.  For example,
under (memcg) memory pressure, the dcache shrinker is more likely to
release inode references, moving the inode eviction to contexts where
that was extremely unlikely to occur.
Our production servers "found" at least two deadlock bugs in the Ceph
filesystem that were caused by this iput() behavior:

Writeback may lead to iput() calls in Ceph (e.g. from
ceph_put_wrbuffer_cap_refs()) which deadlocks in
inode_wait_for_writeback().  Waiting for writeback completion from
within writeback will obviously never be able to make any progress.
This leads to blocked kworkers like this:
INFO: task kworker/u777:6:1270802 blocked for more than 122 seconds.
       Not tainted 6.16.7-i1-es #773
 task:kworker/u777:6  state:D stack:0 pid:1270802 tgid:1270802 ppid:2
       task_flags:0x4208060 flags:0x00004000
 Workqueue: writeback wb_workfn (flush-ceph-3)
 Call Trace:
  <TASK>
  __schedule+0x4ea/0x17d0
  schedule+0x1c/0xc0
  inode_wait_for_writeback+0x71/0xb0
  evict+0xcf/0x200
  ceph_put_wrbuffer_cap_refs+0xdd/0x220
  ceph_invalidate_folio+0x97/0xc0
  ceph_writepages_start+0x127b/0x14d0
  do_writepages+0xba/0x150
  __writeback_single_inode+0x34/0x290
  writeback_sb_inodes+0x203/0x470
  __writeback_inodes_wb+0x4c/0xe0
  wb_writeback+0x189/0x2b0
  wb_workfn+0x30b/0x3d0
  process_one_work+0x143/0x2b0
  worker_thread+0x30a/0x450

In the Ceph messenger thread (net/ceph/messenger*.c), any iput()
call may invoke ceph_evict_inode() which will deadlock in
netfs_wait_for_outstanding_io(); since this blocks the messenger
thread, completions from the Ceph servers will not ever be received
and handled.


It looks like these deadlock bugs have been in the Ceph filesystem
code since forever (therefore no "Fixes" tag in this patch).  There
may be various ways to solve this:

make iput() asynchronous and defer the actual eviction like fput()
(may add overhead)

make iput() only asynchronous if I_SYNC is set (doesn't solve random
things happening inside the "evict_inode" callback)

add iput_deferred() to make this asynchronous behavior/overhead
optional and explicit

refactor Ceph to avoid iput() calls from within writeback and
messenger (if that is even possible)

add a Ceph-specific workaround


- wait for Josef to finish his inode refcount rework patchset that
  gets rid of this whole "writeback doesn't hold an inode reference"
  problem that is the root cause of this the deadlock.
All that adding a whacky async iput work around does right now is
make it harder for Josef to land the patchset that makes this
problem go away entirely....
...

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index f67025465de0..385d5261632d 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2191,6 +2191,48 @@ void ceph_queue_inode_work(struct inode *inode, int work_bit)
   }
 }
 
+/**


Queue an asynchronous iput() call in a worker thread.  Use this



instead of iput() in contexts where evicting the inode is unsafe.



For example, inode eviction may cause deadlocks in



inode_wait_for_writeback() (when called from within writeback) or



in netfs_wait_for_outstanding_io() (when called from within the



Ceph messenger).







@n: how many references to put


*/

+void ceph_iput_n_async(struct inode *inode, int n)
+{

if (unlikely(!inode))
return;



if (likely(atomic_sub_return(n, &inode->i_count) > 0))
/* somebody else is holding another reference -


 * nothing left to do for us


 */


return;



doutc(ceph_inode_to_fs_client(inode)->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));

/* the reference counter is now 0, i.e. nobody else is holding
* a reference to this inode; restore it to 1 and donate it to


* ceph_inode_work() which will call iput() at the end


*/


atomic_set(&inode->i_count, 1);

If you must do this, please have a look at how
btrfs_add_delayed_iput() handles detecting the last inode reference
and punting it to an async queue without needing to drop the last
reference at all.
-Dave.
-- 
Dave Chinner
david@fromorbit.com


    

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH] ceph: fix deadlock bugs by making iput() calls asynchronous