The iput() function is a dangerous one - if the reference counter goes to zero, the function may block for a long time due to:
- inode_wait_for_writeback() waits until writeback on this inode completes
- the filesystem-specific "evict_inode" callback can do similar things; e.g. all netfs-based filesystems will call netfs_wait_for_outstanding_io() which is similar to inode_wait_for_writeback()
Therefore, callers must carefully evaluate the context they're in and check whether invoking iput() is a good idea at all.
Most of the time, this is not a problem because the dcache holds references to all inodes, and the dcache is usually the one to release the last reference. But this assumption is fragile. For example, under (memcg) memory pressure, the dcache shrinker is more likely to release inode references, moving the inode eviction to contexts where that was extremely unlikely to occur.
Our production servers "found" at least two deadlock bugs in the Ceph filesystem that were caused by this iput() behavior:
1. Writeback may lead to iput() calls in Ceph (e.g. from ceph_put_wrbuffer_cap_refs()) which deadlocks in inode_wait_for_writeback(). Waiting for writeback completion from within writeback will obviously never be able to make any progress. This leads to blocked kworkers like this:
INFO: task kworker/u777:6:1270802 blocked for more than 122 seconds.
      Not tainted 6.16.7-i1-es #773
task:kworker/u777:6  state:D stack:0 pid:1270802 tgid:1270802 ppid:2 task_flags:0x4208060 flags:0x00004000
Workqueue: writeback wb_workfn (flush-ceph-3)
Call Trace:
 <TASK>
 __schedule+0x4ea/0x17d0
 schedule+0x1c/0xc0
 inode_wait_for_writeback+0x71/0xb0
 evict+0xcf/0x200
 ceph_put_wrbuffer_cap_refs+0xdd/0x220
 ceph_invalidate_folio+0x97/0xc0
 ceph_writepages_start+0x127b/0x14d0
 do_writepages+0xba/0x150
 __writeback_single_inode+0x34/0x290
 writeback_sb_inodes+0x203/0x470
 __writeback_inodes_wb+0x4c/0xe0
 wb_writeback+0x189/0x2b0
 wb_workfn+0x30b/0x3d0
 process_one_work+0x143/0x2b0
 worker_thread+0x30a/0x450
2. In the Ceph messenger thread (net/ceph/messenger*.c), any iput() call may invoke ceph_evict_inode() which will deadlock in netfs_wait_for_outstanding_io(); since this blocks the messenger thread, completions from the Ceph servers will never be received and handled.
It looks like these deadlock bugs have been in the Ceph filesystem code since forever (therefore no "Fixes" tag in this patch). There may be various ways to solve this:
- make iput() asynchronous and defer the actual eviction like fput() (may add overhead)
- make iput() only asynchronous if I_SYNC is set (doesn't solve random things happening inside the "evict_inode" callback)
- add iput_deferred() to make this asynchronous behavior/overhead optional and explicit
- refactor Ceph to avoid iput() calls from within writeback and messenger (if that is even possible)
- add a Ceph-specific workaround
After advice from Mateusz Guzik, I decided on the last option: a Ceph-specific workaround. The implementation is simple because it piggybacks on the existing work_struct for ceph_queue_inode_work() - ceph_inode_work() calls iput() at the end, which means we can donate the last reference to it.
Since Ceph has a few places that call iput() in a loop, it seemed simple enough to pass a counter and use atomic_sub() instead of atomic_dec().
This patch adds ceph_iput_n_async() and converts lots of iput() calls to it - at least those that may come through writeback and the messenger.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Cc: stable@vger.kernel.org
---
 fs/ceph/addr.c       |  2 +-
 fs/ceph/caps.c       | 21 ++++++++++-----------
 fs/ceph/dir.c        |  2 +-
 fs/ceph/inode.c      | 42 ++++++++++++++++++++++++++++++++++++++++++
 fs/ceph/mds_client.c | 32 ++++++++++++++++----------------
 fs/ceph/quota.c      |  4 ++--
 fs/ceph/snap.c       | 10 +++++-----
 fs/ceph/super.h      |  7 +++++++
 8 files changed, 84 insertions(+), 36 deletions(-)
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 322ed268f14a..fc497c91530e 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -265,7 +265,7 @@ static void finish_netfs_read(struct ceph_osd_request *req)
 	subreq->error = err;
 	trace_netfs_sreq(subreq, netfs_sreq_trace_io_progress);
 	netfs_read_subreq_terminated(subreq);
-	iput(req->r_inode);
+	ceph_iput_async(req->r_inode);
 	ceph_dec_osd_stopping_blocker(fsc->mdsc);
 }
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index b1a8ff612c41..bd88b5287a2b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1771,7 +1771,7 @@ void ceph_flush_snaps(struct ceph_inode_info *ci,
 	spin_unlock(&mdsc->snap_flush_lock);
 
 	if (need_put)
-		iput(inode);
+		ceph_iput_async(inode);
 }
 
 /*
@@ -3318,8 +3318,8 @@ static void __ceph_put_cap_refs(struct ceph_inode_info *ci, int had,
 	}
 	if (wake)
 		wake_up_all(&ci->i_cap_wq);
-	while (put-- > 0)
-		iput(inode);
+	if (put > 0)
+		ceph_iput_n_async(inode, put);
 }
 
 void ceph_put_cap_refs(struct ceph_inode_info *ci, int had)
@@ -3418,9 +3418,8 @@ void ceph_put_wrbuffer_cap_refs(struct ceph_inode_info *ci, int nr,
 	}
 	if (complete_capsnap)
 		wake_up_all(&ci->i_cap_wq);
-	while (put-- > 0) {
-		iput(inode);
-	}
+	if (put > 0)
+		ceph_iput_n_async(inode, put);
 }
 
 /*
@@ -3917,7 +3916,7 @@ static void handle_cap_flush_ack(struct inode *inode, u64 flush_tid,
 	if (wake_mdsc)
 		wake_up_all(&mdsc->cap_flushing_wq);
 	if (drop)
-		iput(inode);
+		ceph_iput_async(inode);
 }
 
 void __ceph_remove_capsnap(struct inode *inode, struct ceph_cap_snap *capsnap,
@@ -4008,7 +4007,7 @@ static void handle_cap_flushsnap_ack(struct inode *inode, u64 flush_tid,
 			wake_up_all(&ci->i_cap_wq);
 		if (wake_mdsc)
 			wake_up_all(&mdsc->cap_flushing_wq);
-		iput(inode);
+		ceph_iput_async(inode);
 	}
 }
 
@@ -4557,7 +4556,7 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 done:
 	mutex_unlock(&session->s_mutex);
 done_unlocked:
-	iput(inode);
+	ceph_iput_async(inode);
 out:
 	ceph_dec_mds_stopping_blocker(mdsc);
 
@@ -4636,7 +4635,7 @@ unsigned long ceph_check_delayed_caps(struct ceph_mds_client *mdsc)
 		doutc(cl, "on %p %llx.%llx\n", inode, ceph_vinop(inode));
 		ceph_check_caps(ci, 0);
-		iput(inode);
+		ceph_iput_async(inode);
 		spin_lock(&mdsc->cap_delay_lock);
 	}
 
@@ -4675,7 +4674,7 @@ static void flush_dirty_session_caps(struct ceph_mds_session *s)
 		spin_unlock(&mdsc->cap_dirty_lock);
 		ceph_wait_on_async_create(inode);
 		ceph_check_caps(ci, CHECK_CAPS_FLUSH);
-		iput(inode);
+		ceph_iput_async(inode);
 		spin_lock(&mdsc->cap_dirty_lock);
 	}
 	spin_unlock(&mdsc->cap_dirty_lock);
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 32973c62c1a2..ec73ed52a227 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1290,7 +1290,7 @@ static void ceph_async_unlink_cb(struct ceph_mds_client *mdsc,
 		ceph_mdsc_free_path_info(&path_info);
 	}
 out:
-	iput(req->r_old_inode);
+	ceph_iput_async(req->r_old_inode);
 	ceph_mdsc_release_dir_caps(req);
 }
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index f67025465de0..385d5261632d 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -2191,6 +2191,48 @@ void ceph_queue_inode_work(struct inode *inode, int work_bit)
 	}
 }
 
+/**
+ * Queue an asynchronous iput() call in a worker thread. Use this
+ * instead of iput() in contexts where evicting the inode is unsafe.
+ * For example, inode eviction may cause deadlocks in
+ * inode_wait_for_writeback() (when called from within writeback) or
+ * in netfs_wait_for_outstanding_io() (when called from within the
+ * Ceph messenger).
+ *
+ * @n: how many references to put
+ */
+void ceph_iput_n_async(struct inode *inode, int n)
+{
+	if (unlikely(!inode))
+		return;
+
+	if (likely(atomic_sub_return(n, &inode->i_count) > 0))
+		/* somebody else is holding another reference -
+		 * nothing left to do for us
+		 */
+		return;
+
+	doutc(ceph_inode_to_fs_client(inode)->client, "%p %llx.%llx\n",
+	      inode, ceph_vinop(inode));
+
+	/* the reference counter is now 0, i.e. nobody else is holding
+	 * a reference to this inode; restore it to 1 and donate it to
+	 * ceph_inode_work() which will call iput() at the end
+	 */
+	atomic_set(&inode->i_count, 1);
+
+	/* simply queue a ceph_inode_work() without setting
+	 * i_work_mask bit; other than putting the reference, there is
+	 * nothing to do
+	 */
+	WARN_ON_ONCE(!queue_work(ceph_inode_to_fs_client(inode)->inode_wq,
+				 &ceph_inode(inode)->i_work));
+
+	/* note: queue_work() cannot fail; if i_work were already
+	 * queued, then it would be holding another reference, but no
+	 * such reference exists
+	 */
+}
+
 static void ceph_do_invalidate_pages(struct inode *inode)
 {
 	struct ceph_client *cl = ceph_inode_to_client(inode);
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 3bc72b47fe4d..d7fce1ad8073 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1097,14 +1097,14 @@ void ceph_mdsc_release_request(struct kref *kref)
 		ceph_msg_put(req->r_reply);
 	if (req->r_inode) {
 		ceph_put_cap_refs(ceph_inode(req->r_inode), CEPH_CAP_PIN);
-		iput(req->r_inode);
+		ceph_iput_async(req->r_inode);
 	}
 	if (req->r_parent) {
 		ceph_put_cap_refs(ceph_inode(req->r_parent), CEPH_CAP_PIN);
-		iput(req->r_parent);
+		ceph_iput_async(req->r_parent);
 	}
-	iput(req->r_target_inode);
-	iput(req->r_new_inode);
+	ceph_iput_async(req->r_target_inode);
+	ceph_iput_async(req->r_new_inode);
 	if (req->r_dentry)
 		dput(req->r_dentry);
 	if (req->r_old_dentry)
@@ -1118,7 +1118,7 @@ void ceph_mdsc_release_request(struct kref *kref)
 		 */
 		ceph_put_cap_refs(ceph_inode(req->r_old_dentry_dir),
 				  CEPH_CAP_PIN);
-		iput(req->r_old_dentry_dir);
+		ceph_iput_async(req->r_old_dentry_dir);
 	}
 	kfree(req->r_path1);
 	kfree(req->r_path2);
@@ -1240,7 +1240,7 @@ static void __unregister_request(struct ceph_mds_client *mdsc,
 	}
 
 	if (req->r_unsafe_dir) {
-		iput(req->r_unsafe_dir);
+		ceph_iput_async(req->r_unsafe_dir);
 		req->r_unsafe_dir = NULL;
 	}
 
@@ -1413,7 +1413,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc,
 		cap = rb_entry(rb_first(&ci->i_caps), struct ceph_cap, ci_node);
 		if (!cap) {
 			spin_unlock(&ci->i_ceph_lock);
-			iput(inode);
+			ceph_iput_async(inode);
 			goto random;
 		}
 		mds = cap->session->s_mds;
@@ -1422,7 +1422,7 @@ static int __choose_mds(struct ceph_mds_client *mdsc,
 		      cap == ci->i_auth_cap ? "auth " : "", cap);
 		spin_unlock(&ci->i_ceph_lock);
 out:
-		iput(inode);
+		ceph_iput_async(inode);
 		return mds;
 
 random:
@@ -1841,7 +1841,7 @@ int ceph_iterate_session_caps(struct ceph_mds_session *session,
 		spin_unlock(&session->s_cap_lock);
 
 		if (last_inode) {
-			iput(last_inode);
+			ceph_iput_async(last_inode);
 			last_inode = NULL;
 		}
 		if (old_cap) {
@@ -1874,7 +1874,7 @@ int ceph_iterate_session_caps(struct ceph_mds_session *session,
 	session->s_cap_iterator = NULL;
 	spin_unlock(&session->s_cap_lock);
 
-	iput(last_inode);
+	ceph_iput_async(last_inode);
 	if (old_cap)
 		ceph_put_cap(session->s_mdsc, old_cap);
 
@@ -1903,8 +1903,8 @@ static int remove_session_caps_cb(struct inode *inode, int mds, void *arg)
 		wake_up_all(&ci->i_cap_wq);
 	if (invalidate)
 		ceph_queue_invalidate(inode);
-	while (iputs--)
-		iput(inode);
+	if (iputs > 0)
+		ceph_iput_n_async(inode, iputs);
 	return 0;
 }
 
@@ -1944,7 +1944,7 @@ static void remove_session_caps(struct ceph_mds_session *session)
 			spin_unlock(&session->s_cap_lock);
 
 			inode = ceph_find_inode(sb, vino);
-			iput(inode);
+			ceph_iput_async(inode);
 
 			spin_lock(&session->s_cap_lock);
 		}
@@ -2512,7 +2512,7 @@ static void ceph_cap_unlink_work(struct work_struct *work)
 			doutc(cl, "on %p %llx.%llx\n", inode, ceph_vinop(inode));
 			ceph_check_caps(ci, CHECK_CAPS_FLUSH);
-			iput(inode);
+			ceph_iput_async(inode);
 			spin_lock(&mdsc->cap_delay_lock);
 		}
 	}
@@ -3933,7 +3933,7 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
 		    !req->r_reply_info.has_create_ino) {
 			/* This should never happen on an async create */
 			WARN_ON_ONCE(req->r_deleg_ino);
-			iput(in);
+			ceph_iput_async(in);
 			in = NULL;
 		}
 
@@ -5313,7 +5313,7 @@ static void handle_lease(struct ceph_mds_client *mdsc,
 
 out:
 	mutex_unlock(&session->s_mutex);
-	iput(inode);
+	ceph_iput_async(inode);
 
 	ceph_dec_mds_stopping_blocker(mdsc);
 	return;
diff --git a/fs/ceph/quota.c b/fs/ceph/quota.c
index d90eda19bcc4..bba00f8926e6 100644
--- a/fs/ceph/quota.c
+++ b/fs/ceph/quota.c
@@ -76,7 +76,7 @@ void ceph_handle_quota(struct ceph_mds_client *mdsc,
 			       le64_to_cpu(h->max_files));
 	spin_unlock(&ci->i_ceph_lock);
 
-	iput(inode);
+	ceph_iput_async(inode);
 out:
 	ceph_dec_mds_stopping_blocker(mdsc);
 }
@@ -190,7 +190,7 @@ void ceph_cleanup_quotarealms_inodes(struct ceph_mds_client *mdsc)
 		node = rb_first(&mdsc->quotarealms_inodes);
 		qri = rb_entry(node, struct ceph_quotarealm_inode, node);
 		rb_erase(node, &mdsc->quotarealms_inodes);
-		iput(qri->inode);
+		ceph_iput_async(qri->inode);
 		kfree(qri);
 	}
 	mutex_unlock(&mdsc->quotarealms_inodes_mutex);
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index c65f2b202b2b..19f097e79b3c 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -735,7 +735,7 @@ static void queue_realm_cap_snaps(struct ceph_mds_client *mdsc,
 		if (!inode)
 			continue;
 		spin_unlock(&realm->inodes_with_caps_lock);
-		iput(lastinode);
+		ceph_iput_async(lastinode);
 		lastinode = inode;
 
 		/*
@@ -762,7 +762,7 @@ static void queue_realm_cap_snaps(struct ceph_mds_client *mdsc,
 		spin_lock(&realm->inodes_with_caps_lock);
 	}
 	spin_unlock(&realm->inodes_with_caps_lock);
-	iput(lastinode);
+	ceph_iput_async(lastinode);
 
 	if (capsnap)
 		kmem_cache_free(ceph_cap_snap_cachep, capsnap);
@@ -955,7 +955,7 @@ static void flush_snaps(struct ceph_mds_client *mdsc)
 		ihold(inode);
 		spin_unlock(&mdsc->snap_flush_lock);
 		ceph_flush_snaps(ci, &session);
-		iput(inode);
+		ceph_iput_async(inode);
 		spin_lock(&mdsc->snap_flush_lock);
 	}
 	spin_unlock(&mdsc->snap_flush_lock);
@@ -1116,12 +1116,12 @@ void ceph_handle_snap(struct ceph_mds_client *mdsc,
 			ceph_get_snap_realm(mdsc, realm);
 			ceph_change_snap_realm(inode, realm);
 			spin_unlock(&ci->i_ceph_lock);
-			iput(inode);
+			ceph_iput_async(inode);
 			continue;
 
 skip_inode:
 			spin_unlock(&ci->i_ceph_lock);
-			iput(inode);
+			ceph_iput_async(inode);
 	}
 
 	/* we may have taken some of the old realm's children. */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index cf176aab0f82..15c09b6c94aa 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -1085,6 +1085,13 @@ static inline void ceph_queue_flush_snaps(struct inode *inode)
 	ceph_queue_inode_work(inode, CEPH_I_WORK_FLUSH_SNAPS);
 }
 
+void ceph_iput_n_async(struct inode *inode, int n);
+
+static inline void ceph_iput_async(struct inode *inode)
+{
+	ceph_iput_n_async(inode, 1);
+}
+
 extern int ceph_try_to_choose_auth_mds(struct inode *inode, int mask);
 extern int __ceph_do_getattr(struct inode *inode, struct page *locked_page,
 			     int mask, bool force);
On Wed, Sep 17, 2025 at 2:44 PM Max Kellermann max.kellermann@ionos.com wrote:
I don't know about ceph internals, so no comment on that front.
+/**
+ * Queue an asynchronous iput() call in a worker thread. Use this
+ * instead of iput() in contexts where evicting the inode is unsafe.
+ * For example, inode eviction may cause deadlocks in
+ * inode_wait_for_writeback() (when called from within writeback) or
+ * in netfs_wait_for_outstanding_io() (when called from within the
+ * Ceph messenger).
+ *
+ * @n: how many references to put
+ */
+void ceph_iput_n_async(struct inode *inode, int n)
+{
+	if (unlikely(!inode))
+		return;
+
+	if (likely(atomic_sub_return(n, &inode->i_count) > 0))
+		/* somebody else is holding another reference -
+		 * nothing left to do for us
+		 */
+		return;
+
+	doutc(ceph_inode_to_fs_client(inode)->client, "%p %llx.%llx\n",
+	      inode, ceph_vinop(inode));
+
+	/* the reference counter is now 0, i.e. nobody else is holding
+	 * a reference to this inode; restore it to 1 and donate it to
+	 * ceph_inode_work() which will call iput() at the end
+	 */
+	atomic_set(&inode->i_count, 1);
That loop over iput() indeed asks for a variant which grabs an explicit count to subtract.
However, you cannot legally transition to 0 without ->i_lock held. By API contract the ->drop_inode routine needs to be called when you get here and other CPUs are prevented from refing the inode.
While it is true nobody *refs* the inode, it is still hanging out on the superblock list where it can get picked up by forced unmount and on the inode hash where it can get picked up by lookup. With a refcount of 0, ->i_lock not held and no flags added, from their POV this is a legally cached inode they can do whatever they want with.
So that force setting of refcount to 1 might be a use-after-free if this raced against another iput or it might be losing a reference picked up by someone else.
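To illustrate the race (hypothetical interleaving, not real code):

	CPU0 (ceph_iput_n_async)          CPU1 (lookup / forced unmount)
	atomic_sub_return(n) -> 0
	                                  sees i_count == 0 and no I_FREEING,
	                                  treats it as a cached inode:
	                                  __iget() -> i_count = 1
	atomic_set(&inode->i_count, 1)    <- CPU1's reference is silently lost
	...worker runs iput()/evict()     use-after-free for CPU1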
If you got the idea of bringing one ref back from iput() in the stock kernel:
	if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
		if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) {
			atomic_inc(&inode->i_count);
Note this guy still makes sure to take the lock first. As a side note this weird deref to 0 + ref back to 1 business is going away [1].
I don't know what's the handy Linux way to subtract an arbitrary amount as long as the counter doesn't hit zero; I guess worst case one can just write a cmpxchg loop by hand.
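For instance, a hand-rolled loop might look like this (untested sketch; the helper name is made up, no such primitive exists in the tree):

	#include <linux/atomic.h>

	/* subtract @n, unless that would take the counter to zero or below */
	static bool atomic_sub_unless_last(atomic_t *v, int n)
	{
		int old = atomic_read(v);

		do {
			if (old - n <= 0)
				return false;	/* caller must take the slow path */
		} while (!atomic_try_cmpxchg(v, &old, old - n));

		return true;
	}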
Given that this is a reliability fix I would forego optimizations of the sort.
Does the patch convert literally all iput calls within ceph into the async variant? I would be worried that mandatory deferral of literally all final iputs may be a regression from perf standpoint.
I see you are mentioning another deadlock, perhaps being in danger of deadlocking is something you could track with a flag within ceph (just like it happens for writeback)? Then the local iput variant could check on both. I have no idea if this is feasible at all for the netfs thing.
No matter what though, it looks like the way forward concerning ->i_count is to make sure it does not drop to 0 within the new primitive.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6....
+	/* simply queue a ceph_inode_work() without setting
+	 * i_work_mask bit; other than putting the reference, there is
+	 * nothing to do
+	 */
+	WARN_ON_ONCE(!queue_work(ceph_inode_to_fs_client(inode)->inode_wq,
+				 &ceph_inode(inode)->i_work));
+
+	/* note: queue_work() cannot fail; if i_work were already
+	 * queued, then it would be holding another reference, but no
+	 * such reference exists
+	 */
+}
On Wed, Sep 17, 2025 at 3:13 PM Mateusz Guzik mjguzik@gmail.com wrote:
[...]
No matter what though, it looks like the way forward concerning ->i_count is to make sure it does not drop to 0 within the new primitive.
That is to say, the async routine should start with:

	if (atomic_add_unless(&inode->i_count, -1, 1))
		return;

	/* defer to iput here */

This is copy-pasted, no credit needed for anyone.
As you can see there is some work going on concerning these routines; I would wager that loop over iput in writeback will go away in mainline after the dust settles ;)
On Wed, Sep 17, 2025 at 3:14 PM Mateusz Guzik mjguzik@gmail.com wrote:
Given that this is a reliability fix I would forego optimizations of the sort.
Thanks, good catch - I guess I was trying to be too clever. I'll remove the "n" parameter and just do atomic_add_unless(), like btrfs does. That makes sure 0 is never hit by my own code.
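Roughly like this (untested sketch; the real follow-up patch may differ):

	void ceph_iput_async(struct inode *inode)
	{
		if (unlikely(!inode))
			return;

		/* fast path: drop one reference unless it is the last one */
		if (likely(atomic_add_unless(&inode->i_count, -1, 1)))
			return;

		/* apparently the last reference: keep holding it and let
		 * ceph_inode_work() do the final iput() from task context
		 */
		WARN_ON_ONCE(!queue_work(ceph_inode_to_fs_client(inode)->inode_wq,
					 &ceph_inode(inode)->i_work));
	}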
On Wed, Sep 17, 2025 at 3:14 PM Mateusz Guzik mjguzik@gmail.com wrote:
Does the patch convert literally all iput calls within ceph into the async variant? I would be worried that mandatory deferral of literally all final iputs may be a regression from perf standpoint.
(Forgot to reply to this part) No, I changed just the ones that are called from Writeback+Messenger.
I don't think this affects performance at all. It almost never happens that the last reference gets dropped by somebody other than dcache (which only happens under memory pressure). It was very difficult to reproduce this bug:
- "echo 2 >drop_caches" in a loop
- a kernel patch that adds msleep() to several functions
- another kernel patch that allows me to disconnect the Ceph server via ioctl
The latter was to free inode references that are held by Ceph caps. For this deadlock to occur, all references other than writeback/messenger must be gone already. (It did happen on our production servers, crashing all of them a few days ago causing a major service outage, but apparently in all these years we're the first ones to observe this deadlock bug.)
(I don't know the iput() ordering on umount/shutdown - that might be worth a closer look.)
On Wed, Sep 17, 2025 at 3:39 PM Max Kellermann max.kellermann@ionos.com wrote:
On Wed, Sep 17, 2025 at 3:14 PM Mateusz Guzik mjguzik@gmail.com wrote:
Does the patch convert literally all iput calls within ceph into the async variant? I would be worried that mandatory deferral of literally all final iputs may be a regression from perf standpoint.
OK, in that case I have no further commentary.
I don't think this affects performance at all. It almost never happens that the last reference gets dropped by somebody other than dcache (which only happens under memory pressure).
Well, only changing the problematic consumers as opposed to *everyone* should be the end of it.
(Forgot to reply to this part) No, I changed just the ones that are called from Writeback+Messenger.
I don't think this affects performance at all. It almost never happens that the last reference gets dropped by somebody other than dcache (which only happens under memory pressure). It was very difficult to reproduce this bug:
- "echo 2 >drop_caches" in a loop
- a kernel patch that adds msleep() to several functions
- another kernel patch that allows me to disconnect the Ceph server via ioctl
The latter was to free inode references that are held by Ceph caps. For this deadlock to occur, all references other than writeback/messenger must be gone already. (It did happen on our production servers, crashing all of them a few days ago causing a major service outage, but apparently in all these years we're the first ones to observe this deadlock bug.)
This makes sense to me.
The VFS layer is hopefully going to get significantly better assert coverage, so I expect this kind of trouble will be reported on without having to actually run into it. Presumably including yet-to-be-discovered deadlocks. ;)
On Wed, Sep 17, 2025 at 02:44:04PM +0200, Max Kellermann wrote:
[...]
It looks like these deadlock bugs have been in the Ceph filesystem code since forever (therefore no "Fixes" tag in this patch). There may be various ways to solve this:
- make iput() asynchronous and defer the actual eviction like fput() (may add overhead)
- make iput() only asynchronous if I_SYNC is set (doesn't solve random things happening inside the "evict_inode" callback)
- add iput_deferred() to make this asynchronous behavior/overhead optional and explicit
- refactor Ceph to avoid iput() calls from within writeback and messenger (if that is even possible)
- add a Ceph-specific workaround
- wait for Josef to finish his inode refcount rework patchset that gets rid of this whole "writeback doesn't hold an inode reference" problem that is the root cause of this deadlock.
All that adding a whacky async iput workaround does right now is make it harder for Josef to land the patchset that makes this problem go away entirely....
[...]
+	if (likely(atomic_sub_return(n, &inode->i_count) > 0))
[...]
+	/* the reference counter is now 0, i.e. nobody else is holding
+	 * a reference to this inode; restore it to 1 and donate it to
+	 * ceph_inode_work() which will call iput() at the end
+	 */
+	atomic_set(&inode->i_count, 1);
If you must do this, please have a look at how btrfs_add_delayed_iput() handles detecting the last inode reference and punting it to an async queue without needing to drop the last reference at all.
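For reference, that pattern is roughly (paraphrased from fs/btrfs/inode.c; details differ between kernel versions):

	void btrfs_add_delayed_iput(struct btrfs_inode *inode)
	{
		struct btrfs_fs_info *fs_info = inode->root->fs_info;
		unsigned long flags;

		/* drop one reference unless we would become the last holder */
		if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1))
			return;

		/* last reference: park the inode on a list; the cleaner
		 * thread performs the final iput() from a safe context
		 */
		atomic_inc(&fs_info->nr_delayed_iputs);
		spin_lock_irqsave(&fs_info->delayed_iput_lock, flags);
		list_add_tail(&inode->delayed_iput, &fs_info->delayed_iputs);
		spin_unlock_irqrestore(&fs_info->delayed_iput_lock, flags);
		if (!test_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags))
			wake_up_process(fs_info->cleaner_kthread);
	}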
-Dave.
On Thu, Sep 18, 2025 at 12:51 AM Dave Chinner david@fromorbit.com wrote:
- wait for Josef to finish his inode refcount rework patchset that gets rid of this whole "writeback doesn't hold an inode reference" problem that is the root cause of this deadlock.
All that adding a whacky async iput workaround does right now is make it harder for Josef to land the patchset that makes this problem go away entirely....
Per Max this is a problem present on older kernels as well, something of this sort is needed to cover it regardless of what happens in mainline.
As for mainline, I don't believe Josef's patchset addresses the problem.
The newly added refcount now taken by writeback et al only gates the inode getting freed, it does not gate almost any of iput/evict processing. As in with the patchset writeback does not hold a real reference.
So ceph can still iput from writeback and find itself waiting in inode_wait_for_writeback, unless the filesystem can be converted to use the weaker refcounts and iobj_put instead (but that's not something I would be betting on).
On Thu, Sep 18, 2025 at 1:08 AM Mateusz Guzik mjguzik@gmail.com wrote:
On Thu, Sep 18, 2025 at 12:51 AM Dave Chinner david@fromorbit.com wrote:
- wait for Josef to finish his inode refcount rework patchset that gets rid of this whole "writeback doesn't hold an inode reference" problem that is the root cause of this deadlock.
All that adding a whacky async iput workaround does right now is make it harder for Josef to land the patchset that makes this problem go away entirely....
Per Max this is a problem present on older kernels as well, something of this sort is needed to cover it regardless of what happens in mainline.
As for mainline, I don't believe Josef's patchset addresses the problem.
The newly added refcount now taken by writeback et al only gates the inode getting freed, it does not gate almost any of iput/evict processing. As in with the patchset writeback does not hold a real reference.
So ceph can still iput from writeback and find itself waiting in inode_wait_for_writeback, unless the filesystem can be converted to use the weaker refcounts and iobj_put instead (but that's not something I would be betting on).
To further elaborate, an extra count which only gates the struct being freed has limited usefulness. Notably it does not help filesystems which need the inode to be valid for use the entire time as evict() is only stalled *after* ->evict_inode(), which might have destroyed the vital parts.
Or to put it differently, the patchset tries to fit btrfs's needs which don't necessarily line up with other filesystems. For example it may be ceph needs the full reference in writeback, then the new ref is of no use here. But for the sake of argument let's say ceph will get away with the lighter ref instead. Even then it is only a matter of time before a different filesystem shows up which can't do it, needs an async iput, and then its developers are looking at "whacky work arounds".
A genuinely generic async iput defers the entire iput, not an arbitrary chunk of it after the inode is partway through processing. But then any form of extra refcounting is of no significance.
To that end a non-whacky mechanism to defer iput would be most welcome, presumably provided by the vfs layer itself. Per remarks by Al elsewhere, care needs to be taken to make sure all inodes are sorted out before the super block gets destroyed.
This suggests expanding the super_block to track all of the deferred iputs and drain them early in sb destruction. The current struct inode on LP64 has 2 * 4 byte holes and llist linkage is only 8 bytes, so this can be added without growing the struct above stock kernel.
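Something along these lines, hypothetically (none of these fields or helpers exist today; all names are made up):

	void iput_deferred(struct inode *inode)
	{
		struct super_block *sb = inode->i_sb;

		/* drop one reference unless it is the last one */
		if (atomic_add_unless(&inode->i_count, -1, 1))
			return;

		/* last reference: hand it to a per-sb drain; llist_add()
		 * returns true if the list was empty before
		 */
		if (llist_add(&inode->i_deferred_iput, &sb->s_deferred_iputs))
			schedule_work(&sb->s_deferred_iput_work);
	}

	static void sb_drain_deferred_iputs(struct work_struct *work)
	{
		struct super_block *sb = container_of(work, struct super_block,
						      s_deferred_iput_work);
		struct llist_node *list = llist_del_all(&sb->s_deferred_iputs);
		struct inode *inode, *next;

		llist_for_each_entry_safe(inode, next, list, i_deferred_iput)
			iput(inode);
	}

The same drain would also have to run (and be waited for) early in sb destruction, as noted above.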
I would argue it would be good if the work could be deferred to task_work if possible (fput-style). Waiting for these should be easy enough, but arguably the thread which is supposed to get to them can be stalled elsewhere indefinitely, so perhaps this bit is a no-go.
On Thu, Sep 18, 2025 at 02:04:52AM +0200, Mateusz Guzik wrote:
On Thu, Sep 18, 2025 at 1:08 AM Mateusz Guzik mjguzik@gmail.com wrote:
On Thu, Sep 18, 2025 at 12:51 AM Dave Chinner david@fromorbit.com wrote:
- wait for Josef to finish his inode refcount rework patchset that gets rid of this whole "writeback doesn't hold an inode reference" problem that is the root cause of this deadlock.
All that adding a whacky async iput workaround does right now is make it harder for Josef to land the patchset that makes this problem go away entirely....
Per Max this is a problem present on older kernels as well, something of this sort is needed to cover it regardless of what happens in mainline.
As for mainline, I don't believe Josef's patchset addresses the problem.
The newly added refcount now taken by writeback et al only gates the inode getting freed, it does not gate almost any of iput/evict processing. As in with the patchset writeback does not hold a real reference.
So ceph can still iput from writeback and find itself waiting in inode_wait_for_writeback, unless the filesystem can be converted to use the weaker refcounts and iobj_put instead (but that's not something I would be betting on).
To further elaborate, an extra count which only gates the struct being freed has limited usefulness. Notably it does not help filesystems which need the inode to be valid for use the entire time as evict() is only stalled *after* ->evict_inode(), which might have destroyed the vital parts.
Not sure I follow you - ->evict_inode() comes after waiting for writeback and other VFS operations to complete. There's nothing in the VFS eviction code that actually blocks after ->evict_inode() is called.
Or to put it differently, the patchset tries to fit btrfs's needs which don't necessarily line up with other filesystems.
Yes, that much is obvious, I think it will be difficult to use it to replace any of XFS's custom asynchronous inode cleanup code at this point in time...
For example it may be ceph needs the full reference in writeback,
IMO, we should always hold a full reference in writeback, because doing so addresses the underlying eviction vs writeback race condition that exists. i.e. we currently handle the lack of reference counts in writeback by blocking on I_SYNC in eviction to prevent a UAF.
If we have an active reference for writeback in the first place then eviction doesn't need to block on writeback because, by definition, we cannot be performing writeback and eviction at the same time....
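Conceptually (illustrative pseudo-code of that invariant, not an actual patch):

	static void writeback_one_inode(struct inode *inode)
	{
		/* igrab() fails once eviction has started, so writeback
		 * and eviction become mutually exclusive
		 */
		if (!igrab(inode))
			return;

		/* ... set I_SYNC, write the pages, clear I_SYNC ... */

		iput(inode);	/* possibly the final put, but I_SYNC is
				 * already clear, so evict() has nothing
				 * left to wait for */
	}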
then the new ref is of no use here. But for the sake of argument let's say ceph will get away with the lighter ref instead. Even then it is only a matter of time before a different filesystem shows up which can't do it, needs an async iput, and then its developers are looking at "whacky work arounds".
Right, that's because we haven't addressed the underlying problem.
That is, we need to hold the right references at the VFS level such that filesystems can't drop the last reference to the inode whilst high level VFS inode operations (such as writeback) are in progress on that inode.
Done properly, eviction can then be done asynchronously for all inodes because we've guaranteed there are no active or pending active users of the inode....
.... and at that point, all the custom async inode garbage collection stuff that XFS does goes away because it is native VFS functionality :)
The actual generic async iput is the actual async iput, not an arbitrary chunk of it after the inode is partway through processing. But then any form of extra refcounting is of no significance.
To that end a non-whacky mechanism to defer iput would be most welcome, presumably provided by the vfs layer itself. Per remarks by Al elsewhere, care needs to be taken to make sure all inodes are sorted out before the super block gets destroyed.
Yes, but first we have to get the reference counting right, such that inode eviction only occurs after we've guaranteed there are no other active users of the inode. Page cache residency and dirty inodes are still in active use, we should account for them that way.
This suggests expanding the super_block to track all of the deferred iputs and drain them early in sb destruction. The current struct inode on LP64 has 2 * 4 byte holes and llist linkage is only 8 bytes, so this can be added without growing the struct above stock kernel.
Right, we already do this lockless llist based async garbage collection under ->destroy_inode with XFS.
I would argue it would be good if the work could be deferred to task_work if possible (fput-style). Waiting for these should be easy enough, but arguably the thread which is supposed to get to them can be stalled elsewhere indefinitely, so perhaps this bit is a no-go.
If the reference counting is right, nothing except a new lookup on the inode should stall on an inode queued for eviction...
-Dave.
On Thu, Sep 18, 2025 at 01:08:29AM +0200, Mateusz Guzik wrote:
On Thu, Sep 18, 2025 at 12:51 AM Dave Chinner david@fromorbit.com wrote:
- wait for Josef to finish his inode refcount rework patchset that gets rid of this whole "writeback doesn't hold an inode reference" problem that is the root cause of this deadlock.
All that adding a whacky async iput workaround does right now is make it harder for Josef to land the patchset that makes this problem go away entirely....
Per Max this is a problem present on older kernels as well, something of this sort is needed to cover it regardless of what happens in mainline.
As for mainline, I don't believe Josef's patchset addresses the problem.
The newly added refcount now taken by writeback et al only gates the inode getting freed, it does not gate almost any of iput/evict processing. As in with the patchset writeback does not hold a real reference.
Hmmmm. That patchset holds a real active reference when it is on the LRU list.
I thought the new pinned inode list also did the same, but you are right in that it only holds an object reference. i.e. whilst the inode is pinned, iput_final won't move such inodes to the LRU and so they don't get a new active reference whilst they are waiting for writeback/page cache reclaim.
That in itself is probably OK, but it means that writeback really needs to take an active reference to the inode itself while it is flushing (i.e. whilst it has I_SYNC is set). This prevents the fs writeback code from dropping the last active reference and trying to evict the inode whilst writeback is active on the inode...
Indeed, comments in the patchset imply writeback takes an active reference and so on completion will put inodes back on the correct list, but that does not appear to be the behaviour that has been implemented:
"When we call inode_add_lru(), if the inode falls into one of these categories, we will add it to the cached inode list and hold an i_obj_count reference. If the inode does not fall into one of these categories it will be moved to the normal LRU, which is already holds an i_obj_count reference.
The dirty case we will delete it from the LRU if it is on one, and then the iput after the writeout will make sure it's placed onto the correct list at that point."
It's the "iput after the writeout" that implies writeback should be holding an active reference, as is done with the LRU a couple of patches later in the series.
So ceph can still iput from writeback and find itself waiting in inode_wait_for_writeback, unless the filesystem can be converted to use the weaker refcounts and iobj_put instead (but that's not something I would be betting on).
See above; I think the ref counting needs to be the other way around: writeback needs to take an active ref to prevent eviction from a nested iput() call from the filesystem...
On Thu, Sep 18, 2025 at 12:51 AM Dave Chinner david@fromorbit.com wrote:
- wait for Josef to finish his inode refcount rework patchset that gets rid of this whole "writeback doesn't hold an inode reference" problem that is the root cause of this deadlock.
No, it is necessary to have a minimal fix that is eligible for stable backports.
Of course, my patch is a kludge; this problem is much larger and a general, global solution should be preferred. But my patch is minimal, easy to understand, doesn't add overhead and piggybacks on an existing Ceph feature (per-inode work) that is considered mature and stable. It can be backported easily to 6.12 and 6.6 (with minor conflicts due to renamed netfs functions in adjacent lines).