On Wed, Sep 17, 2025 at 2:44 PM Max Kellermann max.kellermann@ionos.com wrote:
I don't know about ceph internals, so no comment on that front.
+/**
- Queue an asynchronous iput() call in a worker thread. Use this
- instead of iput() in contexts where evicting the inode is unsafe.
- For example, inode eviction may cause deadlocks in
- inode_wait_for_writeback() (when called from within writeback) or
- in netfs_wait_for_outstanding_io() (when called from within the
- Ceph messenger).
- @n: how many references to put
- */
+void ceph_iput_n_async(struct inode *inode, int n) +{
if (unlikely(!inode))
return;
if (likely(atomic_sub_return(n, &inode->i_count) > 0))
/* somebody else is holding another reference -
* nothing left to do for us
*/
return;
doutc(ceph_inode_to_fs_client(inode)->client, "%p %llx.%llx\n", inode, ceph_vinop(inode));
/* the reference counter is now 0, i.e. nobody else is holding
* a reference to this inode; restore it to 1 and donate it to
* ceph_inode_work() which will call iput() at the end
*/
atomic_set(&inode->i_count, 1);
That loop over iput() indeed asks for a variant which grabs an explicit count to subtract.
However, you cannot legally transition to 0 without ->i_lock held. By API contract the ->drop_inode routine needs to be called when you get here and other CPUs are prevented from refing the inode.
While it is true nobody *refs* the inode, it is still hanging out on the superblock list where it can get picked up by forced unmount and on the inode hash where it can get picked up by lookup. With a refcount of 0, ->i_lock not held and no flags added, from their POV this is a legally cached inode they can do whatever they want with.
So that force setting of refcount to 1 might be a use-after-free if this raced against another iput or it might be losing a reference picked up by someone else.
If you got the idea to bring back one frem from iput() in the stock kernel:
if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) { if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) { atomic_inc(&inode->i_count);
Note this guy still makes sure to take the lock first. As a side note this weird deref to 0 + ref back to 1 business is going away [1].
I don't know what's the handy Linux way to sub an arbitrary amount as long as the target is not x, I guess worst case one can just write a cmpxchg loop by hand.
Given that this is a reliability fix I would forego optimizations of the sort.
Does the patch convert literally all iput calls within ceph into the async variant? I would be worried that mandatory deferral of literally all final iputs may be a regression from perf standpoint.
I see you are mentioning another deadlock, perhaps being in danger of deadlocking is something you could track with a flag within ceph (just like it happens for writeback)? Then the local iput variant could check on both. I have no idea if this is feasible at all for the netfs thing.
No matter what though, it looks like the way forward concerning ->i_count is to make sure it does not drop to 0 within the new primitive.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs-6....
/* simply queue a ceph_inode_work() without setting
* i_work_mask bit; other than putting the reference, there is
* nothing to do
*/
WARN_ON_ONCE(!queue_work(ceph_inode_to_fs_client(inode)->inode_wq,
&ceph_inode(inode)->i_work));
/* note: queue_work() cannot fail; it i_work were already
* queued, then it would be holding another reference, but no
* such reference exists
*/
+}