On Fri, Jul 11, 2025 at 5:10 PM David Howells <dhowells@redhat.com> wrote:
The netfs copy-to-cache that is used by Ceph with local caching sets up a new request to write the data that was just read out to the cache. The request is started and then left to look after itself whilst the app continues. The request gets notified by the backing fs upon completion of the async DIO write, but then tries to wake up the app because NETFS_RREQ_OFFLOAD_COLLECTION isn't set. The app isn't waiting there, though, and so the request just hangs.
Fix this by setting NETFS_RREQ_OFFLOAD_COLLECTION, which causes the notification from the backing filesystem to put the collection onto a work queue instead.
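For anyone following the thread, here is a rough userspace model of the behaviour described above. It is not the kernel code and not the patch: `toy_request`, `toy_write_done()`, `toy_copy_to_cache()` and the `RREQ_OFFLOAD_COLLECTION` constant are hypothetical stand-ins, and the work queue is modelled by simply running collection inline. It only illustrates why a request that nobody waits on never gets collected unless the offload flag is set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the NETFS_RREQ_OFFLOAD_COLLECTION flag bit. */
enum { RREQ_OFFLOAD_COLLECTION = 1u << 0 };

/* Which path the completion notification took. */
enum completion_path { PATH_WAKE_WAITER, PATH_QUEUE_WORK };

/* Toy request: just enough state to show the hang. */
struct toy_request {
    unsigned int flags;
    bool has_waiter;   /* is some thread blocked waiting on this request? */
    bool collected;    /* has result collection run? */
};

/* Models the notification from the backing fs when the async write completes. */
static enum completion_path toy_write_done(struct toy_request *rreq)
{
    if (rreq->flags & RREQ_OFFLOAD_COLLECTION) {
        /* Assumption: stand in for the work queue by collecting inline. */
        rreq->collected = true;
        return PATH_QUEUE_WORK;
    }
    /* Without the flag, collection relies on a waiter being woken. */
    if (rreq->has_waiter)
        rreq->collected = true;
    return PATH_WAKE_WAITER;
}

/* Models a fire-and-forget copy-to-cache request: nobody waits on it. */
static struct toy_request toy_copy_to_cache(bool offload)
{
    struct toy_request rreq = { .flags = 0, .has_waiter = false, .collected = false };

    if (offload)
        rreq.flags |= RREQ_OFFLOAD_COLLECTION;
    return rreq;
}
```

In this model, a request created with `toy_copy_to_cache(false)` takes the wake-the-waiter path and is never collected, while one created with `toy_copy_to_cache(true)` is collected via the work-queue path.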
Thanks David, you can add me as Tested-by if you want.
I can't test the other patch for the next two weeks (vacation). When I'm back, I'll install both fixes on some heavily loaded production machines - our clusters always shake out the worst in every piece of code they run!