On Thu, Sep 18, 2025 at 01:08:29AM +0200, Mateusz Guzik wrote:
On Thu, Sep 18, 2025 at 12:51 AM Dave Chinner david@fromorbit.com wrote:
- wait for Josef to finish his inode refcount rework patchset that gets rid of this whole "writeback doesn't hold an inode reference" problem that is the root cause of this the deadlock.
All that adding a whacky async iput work around does right now is make it harder for Josef to land the patchset that makes this problem go away entirely....
Per Max this is a problem present on older kernels as well, something of this sort is needed to cover it regardless of what happens in mainline.
As for mainline, I don't believe Josef's patchset addresses the problem.
The newly added refcount now taken by writeback et al only gates the inode getting freed, it does not gate almost any of iput/evict processing. As in with the patchset writeback does not hold a real reference.
Hmmmm. That patchset holds a real active reference when it is on the LRU list.
I thought the new pinned inode list also did the same, but you are right in that it only holds an object reference. i.e. whilst the inode is pinned, iput_final won't move such inodes to the LRU and so they don't get a new active reference whilst they are waiting for writeback/page cache reclaim.
That in itself is probably OK, but it means that writeback really needs to take an active reference to the inode itself while it is flushing (i.e. whilst it has I_SYNC is set). This prevents the fs writeback code from dropping the last active reference and trying to evict the inode whilst writeback is active on the inode...
Indeed, comments in the patchset imply writeback takes an active reference and so on completion will put inodes back on the correct list, but that does not appear to be the behaviour that has been implemented:
"When we call inode_add_lru(), if the inode falls into one of these categories, we will add it to the cached inode list and hold an i_obj_count reference. If the inode does not fall into one of these categories it will be moved to the normal LRU, which is already holds an i_obj_count reference.
The dirty case we will delete it from the LRU if it is on one, and then the iput after the writeout will make sure it's placed onto the correct list at that point."
It's the "iput after the writeout" that implies writeback should be holding an active reference, as is done with the LRU a couple of patches later in the series.
So ceph can still iput from writeback and find itself waiting in inode_wait_for_writeback, unless the filesystem can be converted to use the weaker refcounts and iobj_put instead (but that's not something I would be betting on).
See above; I think the ref counting needs to be the other way around: writeback needs to take an active ref to prevent eviction from a nested iput() call from the filesystem...