On Fri, Apr 02, 2021 at 01:08:41PM -0700, Sagi Grimberg wrote:
The below patches caused a regression in a multipath setup:

Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
These patches on their own are correct because they fixed a controller reset regression.
When we reset/teardown a controller, we must freeze and quiesce the namespaces' request queues to make sure that we safely stop inflight I/O submissions. Freeze is mandatory because if our hctx map changed between reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze the queue, and if the queue still has pending submissions (which are still quiesced) it will hang. This is what the above patches fixed.
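For reference, after those patches the reset path behaves roughly as below. This is a simplified sketch, not the exact upstream code: teardown_io_queues/reconnect_io_queues are placeholders for the per-transport helpers, and error handling is omitted.

static void teardown_io_queues(struct nvme_ctrl *ctrl)
{
	nvme_start_freeze(ctrl);	/* new submitters block in blk_queue_enter() */
	nvme_stop_queues(ctrl);		/* quiesce: no hctx keeps running */
	/* ... stop the transport queues and cancel inflight requests ... */
}

static void reconnect_io_queues(struct nvme_ctrl *ctrl)
{
	/* ... re-create and connect the I/O queues ... */
	blk_mq_update_nr_hw_queues(ctrl->tagset, ctrl->queue_count - 1);
	nvme_wait_freeze(ctrl);
	nvme_unfreeze(ctrl);		/* submissions only resume here */
}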
However, by freezing the namespaces' request queues and only unfreezing them when we successfully reconnect, inflight submissions that are running concurrently can now block while holding the nshead srcu until either we successfully reconnect or ctrl_loss_tmo expires (or the user explicitly disconnects).
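Concretely, the v5.4-era submission path looks roughly like this (nvme_ns_head_make_request from drivers/nvme/host/multipath.c, trimmed down to the part that matters here):

static blk_qc_t nvme_ns_head_make_request(struct request_queue *q,
		struct bio *bio)
{
	struct nvme_ns_head *head = q->queuedata;
	blk_qc_t ret = BLK_QC_T_NONE;
	struct nvme_ns *ns;
	int srcu_idx;

	srcu_idx = srcu_read_lock(&head->srcu);
	ns = nvme_find_path(head);
	if (ns) {
		bio->bi_disk = ns->disk;
		/*
		 * direct_make_request() ends up in blk_queue_enter(), which
		 * sleeps as long as ns->queue is frozen -- i.e. until we
		 * reconnect, hit ctrl_loss_tmo or the user disconnects --
		 * all while holding the srcu read lock taken above.
		 */
		ret = direct_make_request(bio);
	}
	/* ... failover/requeue handling elided ... */
	srcu_read_unlock(&head->srcu, srcu_idx);
	return ret;
}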
This caused a deadlock [1] when a different controller (a different path on the same subsystem) became live (i.e. optimized/non-optimized). This is because nvme_mpath_set_live needs to synchronize the nshead srcu before requeueing I/O, in order to make sure that current_path is visible to future (re)submissions. However, the srcu read lock is held by a submission blocked on a frozen request queue, so we have a deadlock.
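The blocked side is nvme_mpath_set_live, again trimmed down to the relevant part:

static void nvme_mpath_set_live(struct nvme_ns *ns)
{
	struct nvme_ns_head *head = ns->head;

	if (!head->disk)
		return;
	/* ... add the head disk / update the current path ... */

	/*
	 * Deadlock: this waits for the srcu reader above, which itself
	 * cannot make progress until this path's queues are unfrozen.
	 */
	synchronize_srcu(&ns->head->srcu);
	kblockd_schedule_work(&ns->head->requeue_work);
}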
In recent kernels (v5.9+) direct_make_request was replaced by submit_bio_noacct, which does not have this issue because the bio_list will be active when nvme-mpath calls submit_bio_noacct on the bottom device (it was populated when submit_bio was invoked on the nvme-mpath device), so the bio is simply queued and the submission returns instead of blocking under the srcu.
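For comparison, the relevant part of submit_bio_noacct in v5.9+ (block/blk-core.c, trimmed):

blk_qc_t submit_bio_noacct(struct bio *bio)
{
	/*
	 * If a ->submit_bio method (here nvme-mpath's) is already active,
	 * just queue the bio and return; it is only dispatched after that
	 * method returns, i.e. after nvme-mpath dropped the srcu lock, so
	 * nothing sleeps on the frozen queue while holding the srcu.
	 */
	if (current->bio_list) {
		bio_list_add(&current->bio_list[0], bio);
		return BLK_QC_T_NONE;
	}
	/* ... otherwise dispatch, possibly sleeping in blk_queue_enter() ... */
	return __submit_bio_noacct(bio);
}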
Hence, we need to fix all the kernels that predate the introduction of submit_bio_noacct.
Why can we not just add submit_bio_noacct to the 5.4 kernel to correct this? What commit id is that?
thanks,
greg k-h