From: Damien Le Moal dlemoal@kernel.org
[ Upstream commit 15592a11d5a5c8411ac8494ec49736b658f6fbff ]
With the ATA error model, an NCQ command failure always triggers an abort (termination) of all NCQ commands queued on the device. In such case, the SAT or the host must handle the failed command according to the command sense data and immediately retry all other NCQ commands that were aborted due to the failed NCQ command.
For SAS HBAs controlled by the mpt3sas driver, NCQ command aborts are not handled by the HBA SAT and sent back to the host, with an ioc log information equal to 0x31080000 (IOC_LOGINFO_PREFIX_PL with the PL code PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR). The function _scsih_io_done() always forces a retry of commands terminated with the status MPI2_IOCSTATUS_SCSI_IOC_TERMINATED using the SCSI result DID_SOFT_ERROR, regardless of the log_info for the command. This correctly forces the retry of collateral NCQ abort commands, but with the retry counter for the command being incremented. If a command to an ATA device is subject to too many retries due to other NCQ commands failing (e.g. read commands trying to access unreadable sectors), the collateral NCQ abort commands may be terminated with an error as they run out of retries. This violates the SAT specification and causes hard-to-debug command errors.
Solve this issue by modifying the handling of the MPI2_IOCSTATUS_SCSI_IOC_TERMINATED status to check if a command is for an ATA device and if the command loginfo indicates an NCQ collateral abort. If that is the case, force the command retry using the SCSI result DID_IMM_RETRY to avoid incrementing the command retry count.
Signed-off-by: Damien Le Moal dlemoal@kernel.org Link: https://lore.kernel.org/r/20250606052747.742998-3-dlemoal@kernel.org Tested-by: Yafang Shao laoar.shao@gmail.com Signed-off-by: Martin K. Petersen martin.petersen@oracle.com Signed-off-by: Sasha Levin sashal@kernel.org ---
LLM Generated explanations, may be completely bogus:
Based on my analysis of the commit and examination of the kernel repository, here is my assessment:
**Backport Status: YES**
This commit should be backported to stable kernel trees for the following reasons:
## Bug Fix Analysis
1. **Fixes a Significant Bug**: The commit fixes a violation of the SAT (SCSI ATA Translation) specification that causes incorrect error handling for NCQ (Native Command Queuing) commands on SATA devices. When one NCQ command fails, all other NCQ commands get aborted as collateral damage, which is expected ATA behavior. However, the driver was incorrectly incrementing retry counters for these collateral aborts, eventually causing them to fail with errors.
2. **Real User Impact**: The commit message explicitly states this causes "hard-to-debug command errors" and includes a "Tested-by" tag from another developer (Yafang Shao), indicating this was a real problem affecting users in production.
3. **Small and Contained Fix**: The change is minimal - it only adds: - A new constant definition `IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR` (0x31080000) - A small conditional block (lines 5817-5829 in the new code) that checks for this specific log_info value and uses `DID_IMM_RETRY` instead of `DID_SOFT_ERROR`
4. **Low Risk of Regression**: - The fix is highly targeted - it only affects the specific case where `log_info == IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR` - It uses an existing SCSI mechanism (`DID_IMM_RETRY`) that's already well-established in the kernel (used in 30+ places across SCSI drivers) - The change preserves all existing behavior for other error conditions
5. **Fixes Specification Compliance**: The bug violates the SAT specification, which is important for interoperability and correct SATA device operation through SAS HBAs.
## Code Analysis
The key change is in the `MPI2_IOCSTATUS_SCSI_IOC_TERMINATED` case handling:
```c +if (log_info == IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR) { + /* + * This is a ATA NCQ command aborted due to another NCQ + * command failure. We must retry this command + * immediately but without incrementing its retry + * counter. + */ + WARN_ON_ONCE(xfer_cnt != 0); + scmd->result = DID_IMM_RETRY << 16; + break; +} ```
This specifically handles the NCQ collateral abort case (identified by the log_info value 0x31080000) by using `DID_IMM_RETRY` which retries the command without incrementing the retry counter, instead of the default `DID_SOFT_ERROR` which would increment the counter.
## Stable Tree Criteria Met
- ✓ Fixes a real bug that affects users - ✓ No new features added - ✓ Minimal architectural changes - ✓ Low risk of regression - ✓ Confined to a specific subsystem (mpt3sas driver) - ✓ Clear problem and solution
The commit meets all the criteria for stable tree backporting as it's a targeted bug fix that resolves a specification compliance issue with minimal risk.
drivers/scsi/mpt3sas/mpt3sas_scsih.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+)
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c index 508861e88d9f..d7d8244dfedc 100644 --- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c +++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c @@ -195,6 +195,14 @@ struct sense_info { #define MPT3SAS_PORT_ENABLE_COMPLETE (0xFFFD) #define MPT3SAS_ABRT_TASK_SET (0xFFFE) #define MPT3SAS_REMOVE_UNRESPONDING_DEVICES (0xFFFF) + +/* + * SAS Log info code for a NCQ collateral abort after an NCQ error: + * IOC_LOGINFO_PREFIX_PL | PL_LOGINFO_CODE_SATA_NCQ_FAIL_ALL_CMDS_AFTR_ERR + * See: drivers/message/fusion/lsi/mpi_log_sas.h + */ +#define IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR 0x31080000 + /** * struct fw_event_work - firmware event struct * @list: link list framework @@ -5814,6 +5822,17 @@ _scsih_io_done(struct MPT3SAS_ADAPTER *ioc, u16 smid, u8 msix_index, u32 reply) scmd->result = DID_TRANSPORT_DISRUPTED << 16; goto out; } + if (log_info == IOC_LOGINFO_SATA_NCQ_FAIL_AFTER_ERR) { + /* + * This is a ATA NCQ command aborted due to another NCQ + * command failure. We must retry this command + * immediately but without incrementing its retry + * counter. + */ + WARN_ON_ONCE(xfer_cnt != 0); + scmd->result = DID_IMM_RETRY << 16; + break; + } if (log_info == 0x31110630) { if (scmd->retries > 2) { scmd->result = DID_NO_CONNECT << 16;