The patch below does not apply to the 6.5-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to stable@vger.kernel.org.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.5.y git checkout FETCH_HEAD git cherry-pick -x 8b4d9469d0b0e553208ee6f62f2807111fde18b9 # <resolve conflicts, build, test, etc.> git commit -s git send-email --to 'stable@vger.kernel.org' --in-reply-to '2023100443-freewill-shanty-0f05@gregkh' --subject-prefix 'PATCH 6.5.y' HEAD^..
Possible dependencies:
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8b4d9469d0b0e553208ee6f62f2807111fde18b9 Mon Sep 17 00:00:00 2001 From: Damien Le Moal dlemoal@kernel.org Date: Tue, 5 Sep 2023 09:06:23 +0900 Subject: [PATCH] ata: libata-scsi: Fix delayed scsi_rescan_device() execution
Commit 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after device resume") modified ata_scsi_dev_rescan() to check the scsi device "is_suspended" power field to ensure that the scsi device associated with an ATA device is fully resumed when scsi_rescan_device() is executed. However, this fix is problematic as: 1) It relies on a PM internal field that should not be used without PM device locking protection. 2) The check for is_suspended and the call to scsi_rescan_device() are not atomic and a suspend PM event may be triggered between them, casuing scsi_rescan_device() to be called on a suspended device and in that function blocking while holding the scsi device lock. This would deadlock a following resume operation. These problems can trigger PM deadlocks on resume, especially with resume operations triggered quickly after or during suspend operations. E.g., a simple bash script like:
for (( i=0; i<10; i++ )); do echo "+2 > /sys/class/rtc/rtc0/wakealarm echo mem > /sys/power/state done
that triggers a resume 2 seconds after starting suspending a system can quickly lead to a PM deadlock preventing the system from correctly resuming.
Fix this by replacing the check on is_suspended with a check on the return value given by scsi_rescan_device() as that function will fail if called against a suspended device. Also make sure rescan tasks already scheduled are first cancelled before suspending an ata port.
Fixes: 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after device resume") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal dlemoal@kernel.org Reviewed-by: Hannes Reinecke hare@suse.de Reviewed-by: Niklas Cassel niklas.cassel@wdc.com Tested-by: Geert Uytterhoeven geert+renesas@glider.be Reviewed-by: Martin K. Petersen martin.petersen@oracle.com
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index a0bc01606b30..092372334e92 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -5168,11 +5168,27 @@ static const unsigned int ata_port_suspend_ehi = ATA_EHI_QUIET
static void ata_port_suspend(struct ata_port *ap, pm_message_t mesg) { + /* + * We are about to suspend the port, so we do not care about + * scsi_rescan_device() calls scheduled by previous resume operations. + * The next resume will schedule the rescan again. So cancel any rescan + * that is not done yet. + */ + cancel_delayed_work_sync(&ap->scsi_rescan_task); + ata_port_request_pm(ap, mesg, 0, ata_port_suspend_ehi, false); }
static void ata_port_suspend_async(struct ata_port *ap, pm_message_t mesg) { + /* + * We are about to suspend the port, so we do not care about + * scsi_rescan_device() calls scheduled by previous resume operations. + * The next resume will schedule the rescan again. So cancel any rescan + * that is not done yet. + */ + cancel_delayed_work_sync(&ap->scsi_rescan_task); + ata_port_request_pm(ap, mesg, 0, ata_port_suspend_ehi, true); }
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index a0e58d22d222..6850cac803c1 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -4756,7 +4756,7 @@ void ata_scsi_dev_rescan(struct work_struct *work) struct ata_link *link; struct ata_device *dev; unsigned long flags; - bool delay_rescan = false; + int ret = 0;
mutex_lock(&ap->scsi_scan_mutex); spin_lock_irqsave(ap->lock, flags); @@ -4765,37 +4765,34 @@ void ata_scsi_dev_rescan(struct work_struct *work) ata_for_each_dev(dev, link, ENABLED) { struct scsi_device *sdev = dev->sdev;
+ /* + * If the port was suspended before this was scheduled, + * bail out. + */ + if (ap->pflags & ATA_PFLAG_SUSPENDED) + goto unlock; + if (!sdev) continue; if (scsi_device_get(sdev)) continue;
- /* - * If the rescan work was scheduled because of a resume - * event, the port is already fully resumed, but the - * SCSI device may not yet be fully resumed. In such - * case, executing scsi_rescan_device() may cause a - * deadlock with the PM code on device_lock(). Prevent - * this by giving up and retrying rescan after a short - * delay. - */ - delay_rescan = sdev->sdev_gendev.power.is_suspended; - if (delay_rescan) { - scsi_device_put(sdev); - break; - } - spin_unlock_irqrestore(ap->lock, flags); - scsi_rescan_device(sdev); + ret = scsi_rescan_device(sdev); scsi_device_put(sdev); spin_lock_irqsave(ap->lock, flags); + + if (ret) + goto unlock; } }
+unlock: spin_unlock_irqrestore(ap->lock, flags); mutex_unlock(&ap->scsi_scan_mutex);
- if (delay_rescan) + /* Reschedule with a delay if scsi_rescan_device() returned an error */ + if (ret) schedule_delayed_work(&ap->scsi_rescan_task, msecs_to_jiffies(5)); }
On 10/4/23 23:32, gregkh@linuxfoundation.org wrote:
The patch below does not apply to the 6.5-stable tree. If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to stable@vger.kernel.org.
To reproduce the conflict and resubmit, you may use the following commands:
git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.5.y git checkout FETCH_HEAD git cherry-pick -x 8b4d9469d0b0e553208ee6f62f2807111fde18b9 # <resolve conflicts, build, test, etc.> git commit -s git send-email --to 'stable@vger.kernel.org' --in-reply-to '2023100443-freewill-shanty-0f05@gregkh' --subject-prefix 'PATCH 6.5.y' HEAD^..
Possible dependencies:
commit ff48b37802e5c134e2dfc4d091f10b2eb5065a72
scsi: Do not attempt to rescan suspended devices
is needed to backport this one.
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 8b4d9469d0b0e553208ee6f62f2807111fde18b9 Mon Sep 17 00:00:00 2001 From: Damien Le Moal dlemoal@kernel.org Date: Tue, 5 Sep 2023 09:06:23 +0900 Subject: [PATCH] ata: libata-scsi: Fix delayed scsi_rescan_device() execution
Commit 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after device resume") modified ata_scsi_dev_rescan() to check the scsi device "is_suspended" power field to ensure that the scsi device associated with an ATA device is fully resumed when scsi_rescan_device() is executed. However, this fix is problematic as:
- It relies on a PM internal field that should not be used without PM device locking protection.
- The check for is_suspended and the call to scsi_rescan_device() are not atomic and a suspend PM event may be triggered between them, casuing scsi_rescan_device() to be called on a suspended device and in that function blocking while holding the scsi device lock. This would deadlock a following resume operation.
These problems can trigger PM deadlocks on resume, especially with resume operations triggered quickly after or during suspend operations. E.g., a simple bash script like:
for (( i=0; i<10; i++ )); do echo "+2 > /sys/class/rtc/rtc0/wakealarm echo mem > /sys/power/state done
that triggers a resume 2 seconds after starting suspending a system can quickly lead to a PM deadlock preventing the system from correctly resuming.
Fix this by replacing the check on is_suspended with a check on the return value given by scsi_rescan_device() as that function will fail if called against a suspended device. Also make sure rescan tasks already scheduled are first cancelled before suspending an ata port.
Fixes: 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after device resume") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal dlemoal@kernel.org Reviewed-by: Hannes Reinecke hare@suse.de Reviewed-by: Niklas Cassel niklas.cassel@wdc.com Tested-by: Geert Uytterhoeven geert+renesas@glider.be Reviewed-by: Martin K. Petersen martin.petersen@oracle.com
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index a0bc01606b30..092372334e92 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -5168,11 +5168,27 @@ static const unsigned int ata_port_suspend_ehi = ATA_EHI_QUIET static void ata_port_suspend(struct ata_port *ap, pm_message_t mesg) {
- /*
* We are about to suspend the port, so we do not care about
* scsi_rescan_device() calls scheduled by previous resume operations.
* The next resume will schedule the rescan again. So cancel any rescan
* that is not done yet.
*/
- cancel_delayed_work_sync(&ap->scsi_rescan_task);
- ata_port_request_pm(ap, mesg, 0, ata_port_suspend_ehi, false);
} static void ata_port_suspend_async(struct ata_port *ap, pm_message_t mesg) {
- /*
* We are about to suspend the port, so we do not care about
* scsi_rescan_device() calls scheduled by previous resume operations.
* The next resume will schedule the rescan again. So cancel any rescan
* that is not done yet.
*/
- cancel_delayed_work_sync(&ap->scsi_rescan_task);
- ata_port_request_pm(ap, mesg, 0, ata_port_suspend_ehi, true);
} diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index a0e58d22d222..6850cac803c1 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -4756,7 +4756,7 @@ void ata_scsi_dev_rescan(struct work_struct *work) struct ata_link *link; struct ata_device *dev; unsigned long flags;
- bool delay_rescan = false;
- int ret = 0;
mutex_lock(&ap->scsi_scan_mutex); spin_lock_irqsave(ap->lock, flags); @@ -4765,37 +4765,34 @@ void ata_scsi_dev_rescan(struct work_struct *work) ata_for_each_dev(dev, link, ENABLED) { struct scsi_device *sdev = dev->sdev;
/*
* If the port was suspended before this was scheduled,
* bail out.
*/
if (ap->pflags & ATA_PFLAG_SUSPENDED)
goto unlock;
if (!sdev) continue; if (scsi_device_get(sdev)) continue;
/*
* If the rescan work was scheduled because of a resume
* event, the port is already fully resumed, but the
* SCSI device may not yet be fully resumed. In such
* case, executing scsi_rescan_device() may cause a
* deadlock with the PM code on device_lock(). Prevent
* this by giving up and retrying rescan after a short
* delay.
*/
delay_rescan = sdev->sdev_gendev.power.is_suspended;
if (delay_rescan) {
scsi_device_put(sdev);
break;
}
spin_unlock_irqrestore(ap->lock, flags);
scsi_rescan_device(sdev);
ret = scsi_rescan_device(sdev); scsi_device_put(sdev); spin_lock_irqsave(ap->lock, flags);
if (ret)
} }goto unlock;
+unlock: spin_unlock_irqrestore(ap->lock, flags); mutex_unlock(&ap->scsi_scan_mutex);
- if (delay_rescan)
- /* Reschedule with a delay if scsi_rescan_device() returned an error */
- if (ret) schedule_delayed_work(&ap->scsi_rescan_task, msecs_to_jiffies(5));
}
Commit 8b4d9469d0b0e553208ee6f62f2807111fde18b9 upstream.
Commit 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after device resume") modified ata_scsi_dev_rescan() to check the scsi device "is_suspended" power field to ensure that the scsi device associated with an ATA device is fully resumed when scsi_rescan_device() is executed. However, this fix is problematic as: 1) It relies on a PM internal field that should not be used without PM device locking protection. 2) The check for is_suspended and the call to scsi_rescan_device() are not atomic and a suspend PM event may be triggered between them, casuing scsi_rescan_device() to be called on a suspended device and in that function blocking while holding the scsi device lock. This would deadlock a following resume operation. These problems can trigger PM deadlocks on resume, especially with resume operations triggered quickly after or during suspend operations. E.g., a simple bash script like:
for (( i=0; i<10; i++ )); do echo "+2 > /sys/class/rtc/rtc0/wakealarm echo mem > /sys/power/state done
that triggers a resume 2 seconds after starting suspending a system can quickly lead to a PM deadlock preventing the system from correctly resuming.
Fix this by replacing the check on is_suspended with a check on the return value given by scsi_rescan_device() as that function will fail if called against a suspended device. Also make sure rescan tasks already scheduled are first cancelled before suspending an ata port.
Fixes: 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan after device resume") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal dlemoal@kernel.org Reviewed-by: Hannes Reinecke hare@suse.de Reviewed-by: Niklas Cassel niklas.cassel@wdc.com Tested-by: Geert Uytterhoeven geert+renesas@glider.be Reviewed-by: Martin K. Petersen martin.petersen@oracle.com (cherry picked from commit 8b4d9469d0b0e553208ee6f62f2807111fde18b9) --- drivers/ata/libata-core.c | 16 ++++++++++++++++ drivers/ata/libata-scsi.c | 33 +++++++++++++++------------------ 2 files changed, 31 insertions(+), 18 deletions(-)
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index 79d02eb4e479..397a06bcc0cc 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -5245,11 +5245,27 @@ static const unsigned int ata_port_suspend_ehi = ATA_EHI_QUIET
static void ata_port_suspend(struct ata_port *ap, pm_message_t mesg) { + /* + * We are about to suspend the port, so we do not care about + * scsi_rescan_device() calls scheduled by previous resume operations. + * The next resume will schedule the rescan again. So cancel any rescan + * that is not done yet. + */ + cancel_delayed_work_sync(&ap->scsi_rescan_task); + ata_port_request_pm(ap, mesg, 0, ata_port_suspend_ehi, false); }
static void ata_port_suspend_async(struct ata_port *ap, pm_message_t mesg) { + /* + * We are about to suspend the port, so we do not care about + * scsi_rescan_device() calls scheduled by previous resume operations. + * The next resume will schedule the rescan again. So cancel any rescan + * that is not done yet. + */ + cancel_delayed_work_sync(&ap->scsi_rescan_task); + ata_port_request_pm(ap, mesg, 0, ata_port_suspend_ehi, true); }
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index c6ece32de8e3..90001d265336 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -4861,7 +4861,7 @@ void ata_scsi_dev_rescan(struct work_struct *work) struct ata_link *link; struct ata_device *dev; unsigned long flags; - bool delay_rescan = false; + int ret = 0;
mutex_lock(&ap->scsi_scan_mutex); spin_lock_irqsave(ap->lock, flags); @@ -4870,37 +4870,34 @@ void ata_scsi_dev_rescan(struct work_struct *work) ata_for_each_dev(dev, link, ENABLED) { struct scsi_device *sdev = dev->sdev;
+ /* + * If the port was suspended before this was scheduled, + * bail out. + */ + if (ap->pflags & ATA_PFLAG_SUSPENDED) + goto unlock; + if (!sdev) continue; if (scsi_device_get(sdev)) continue;
- /* - * If the rescan work was scheduled because of a resume - * event, the port is already fully resumed, but the - * SCSI device may not yet be fully resumed. In such - * case, executing scsi_rescan_device() may cause a - * deadlock with the PM code on device_lock(). Prevent - * this by giving up and retrying rescan after a short - * delay. - */ - delay_rescan = sdev->sdev_gendev.power.is_suspended; - if (delay_rescan) { - scsi_device_put(sdev); - break; - } - spin_unlock_irqrestore(ap->lock, flags); - scsi_rescan_device(&(sdev->sdev_gendev)); + ret = scsi_rescan_device(&(sdev->sdev_gendev)); scsi_device_put(sdev); spin_lock_irqsave(ap->lock, flags); + + if (ret) + goto unlock; } }
+unlock: spin_unlock_irqrestore(ap->lock, flags); mutex_unlock(&ap->scsi_scan_mutex);
- if (delay_rescan) + /* Reschedule with a delay if scsi_rescan_device() returned an error */ + if (ret) schedule_delayed_work(&ap->scsi_rescan_task, msecs_to_jiffies(5)); }
linux-stable-mirror@lists.linaro.org