On Fri, May 11, 2018 at 2:56 AM, Alex G. <mr.nuke.me@gmail.com> wrote:
On 05/10/2018 11:01 AM, Keith Busch wrote:
AER handling expects that a successful return from slot_reset means the driver has made the device functional again. The nvme driver had been using an asynchronous reset to recover the device, so the device may still be initializing after control is returned to the AER handler. This creates problems for subsequent event handling and causes the initialization to fail.
This patch fixes that by syncing the controller reset before returning to the AER driver, and reporting the true state of the reset.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199657
Reported-by: Alex Gagniuc <mr.nuke.me@gmail.com>
Tested-by: Alex Gagniuc <mr.nuke.me@gmail.com>
Sponsored-by: DellEMC
You know I had to add that plug somewhere :p
Cc: Sinan Kaya <okaya@codeaurora.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Keith Busch <keith.busch@intel.com>
 drivers/nvme/host/pci.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b542dce45927..2e221796257a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2681,8 +2681,15 @@ static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
 	dev_info(dev->ctrl.device, "restart after slot reset\n");
 	pci_restore_state(pdev);
-	nvme_reset_ctrl(&dev->ctrl);
-	return PCI_ERS_RESULT_RECOVERED;
+	nvme_reset_ctrl_sync(&dev->ctrl);
This does wonders when nvme_reset_ctrl_sync() returns in a timely manner. I was also able to get the nvme drive into a state where nvme_reset_ctrl_sync() does not return. Then we end up stuck holding the device lock in report_slot_reset, which, as you may imagine, is not a great thing.
I think this step is a move in the right direction, but we still have problems.
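If I/Os issued from nvme_reset_work() time out, nvme_reset_ctrl_sync() may never return, though I'm not sure that is what you are hitting.
You can find where it hangs by looking for tasks in the D (uninterruptible sleep) state with 'ps -ax | grep D', then running 'cat /proc/$PID/stack' on each one.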
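
If grepping the ps output is too noisy, here is a rough user-space sketch of the same check in C. It is not part of the patch, and the helper names parse_task_state() and print_blocked_tasks() are made up for illustration. It assumes the usual /proc/<pid>/stat layout, where the state letter is the first field after the parenthesised command name, and it only looks at the top-level /proc/<pid> entries, not at /proc/<pid>/task.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the state letter from one /proc/<pid>/stat line, or '\0' if the
 * line does not look like a stat line. The command name sits in
 * parentheses and may itself contain spaces or ')', so look for the
 * last ')' rather than splitting on whitespace.
 */
static char parse_task_state(const char *stat_line)
{
	const char *close = strrchr(stat_line, ')');

	if (!close || close[1] != ' ' || close[2] == '\0')
		return '\0';
	return close[2];
}

/*
 * Hypothetical helper: print one line per task currently in the D state,
 * naming the stack file to read next. Returns the number of tasks found,
 * or -1 if /proc cannot be opened.
 */
static int print_blocked_tasks(FILE *out)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	int found = 0;

	if (!proc)
		return -1;

	while ((de = readdir(proc)) != NULL) {
		char path[288], line[512];
		FILE *f;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;	/* not a PID directory */

		snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;	/* task exited while we were scanning */

		if (fgets(line, sizeof(line), f) &&
		    parse_task_state(line) == 'D') {
			fprintf(out, "pid %s is in D state, see /proc/%s/stack\n",
				de->d_name, de->d_name);
			found++;
		}
		fclose(f);
	}
	closedir(proc);
	return found;
}
```

Calling print_blocked_tasks(stdout) from a small main() should list the candidates. Reading /proc/$PID/stack for each of them typically needs root, and a task that is only briefly in D state may be gone by the time you look, so a task that shows up on repeated runs is the interesting one.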