From: Suganath Prabu S suganath-prabu.subramani@broadcom.com
[ Upstream commit 4588e65cfd66fc8bbd9969ea730db39b60a36a30 ]
Avoid scanning SAS/SATA devices in channel 1 when SAS transport is enabled, as the SAS/SATA devices are exposed through channel 0.
Signed-off-by: Suganath Prabu S suganath-prabu.subramani@broadcom.com
Signed-off-by: Ranjan Kumar ranjan.kumar@broadcom.com
Link: https://lore.kernel.org/stable/20251120071955.463475-1-suganath-prabu.subram...
Link: https://patch.msgid.link/20251120071955.463475-1-suganath-prabu.subramani@br...
Signed-off-by: Martin K. Petersen martin.petersen@oracle.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
LLM Generated explanations, may be completely bogus:
### DEPENDENCY VERIFICATION
The `non_stl` field and `sas_transport_enabled` were both introduced in commit c4723e68a0d81 ("scsi: mpi3mr: Enable STL on HBAs where multipath is disabled") from August 2022 (Linux 6.0 cycle).
This means the patch is only applicable to stable kernels 6.1.y and later (not 5.15.y).
### SUMMARY
**Bug being fixed**: Duplicate SAS/SATA device entries appearing in both channel 0 (SAS transport) and channel 1 (generic SCSI) when SAS transport is enabled, causing user confusion and management issues.
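The gating that removes the duplicates can be modelled in user space. The sketch below is only an illustration under stated assumptions: `model_ioc`, `model_tgtdev`, `model_update_tgtdev()` and `model_target_alloc_ok()` are hypothetical stand-ins, not driver symbols, and the model ignores the other places where the real driver sets `non_stl`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct model_ioc {
	bool sas_transport_enabled;
};

struct model_tgtdev {
	bool is_hidden;
	bool non_stl;	/* true: device may be exposed on the generic SCSI channel */
};

/* Mirrors the hunk added to mpi3mr_update_tgtdev(). */
void model_update_tgtdev(const struct model_ioc *ioc, struct model_tgtdev *tgt)
{
	assert(ioc != NULL && tgt != NULL);
	if (!ioc->sas_transport_enabled)
		tgt->non_stl = true;
}

/* Mirrors the extended condition in mpi3mr_target_alloc(). */
bool model_target_alloc_ok(const struct model_tgtdev *tgt)
{
	return tgt != NULL && !tgt->is_hidden && tgt->non_stl;
}
```

With SAS transport enabled, a SAS/SATA target keeps `non_stl == false` in this model, so the generic-channel allocation is refused and the device is reachable only through the SAS transport channel.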
**Stable kernel criteria assessment**:
- ✅ **Obviously correct**: Simple conditional checks with clear logic
- ✅ **Fixes a real bug**: Duplicate device enumeration is a real user-visible issue
- ✅ **Small and contained**: Only 4 lines of actual code change
- ✅ **No new features**: Just corrects existing device enumeration logic
- ✅ **Tested**: Merged through maintainer tree with proper sign-offs
- ✅ **Intentional stable submission**: Link to stable mailing list present

**Risk vs Benefit**:
- **Risk**: Very low - simple conditional check, worst case is device visibility issue
- **Benefit**: Fixes confusing duplicate device entries for MPI3MR users with SAS transport

**Concerns**:
1. Requires commit c4723e68a0d81 to be present (6.1.y and later only)
2. Version bump in header should be stripped for stable backport
The explicit submission to the stable mailing list, the small surgical nature of the fix, and the clear bug it addresses make this a valid stable backport candidate for kernels 6.1.y and newer.
**YES**
 drivers/scsi/mpi3mr/mpi3mr.h    | 4 ++--
 drivers/scsi/mpi3mr/mpi3mr_os.c | 4 +++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/mpi3mr/mpi3mr.h b/drivers/scsi/mpi3mr/mpi3mr.h
index 6742684e2990a..31d68c151b207 100644
--- a/drivers/scsi/mpi3mr/mpi3mr.h
+++ b/drivers/scsi/mpi3mr/mpi3mr.h
@@ -56,8 +56,8 @@ extern struct list_head mrioc_list;
 extern int prot_mask;
 extern atomic64_t event_counter;
 
-#define MPI3MR_DRIVER_VERSION	"8.15.0.5.50"
-#define MPI3MR_DRIVER_RELDATE	"12-August-2025"
+#define MPI3MR_DRIVER_VERSION	"8.15.0.5.51"
+#define MPI3MR_DRIVER_RELDATE	"18-November-2025"
 
 #define MPI3MR_DRIVER_NAME	"mpi3mr"
 #define MPI3MR_DRIVER_LICENSE	"GPL"
diff --git a/drivers/scsi/mpi3mr/mpi3mr_os.c b/drivers/scsi/mpi3mr/mpi3mr_os.c
index b88633e1efe27..d4ca878d08869 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_os.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_os.c
@@ -1184,6 +1184,8 @@ static void mpi3mr_update_tgtdev(struct mpi3mr_ioc *mrioc,
 		if (is_added == true)
 			tgtdev->io_throttle_enabled =
 			    (flags & MPI3_DEVICE0_FLAGS_IO_THROTTLING_REQUIRED) ? 1 : 0;
+		if (!mrioc->sas_transport_enabled)
+			tgtdev->non_stl = 1;
 
 		switch (flags & MPI3_DEVICE0_FLAGS_MAX_WRITE_SAME_MASK) {
 		case MPI3_DEVICE0_FLAGS_MAX_WRITE_SAME_256_LB:
@@ -4844,7 +4846,7 @@ static int mpi3mr_target_alloc(struct scsi_target *starget)
 	spin_lock_irqsave(&mrioc->tgtdev_lock, flags);
 	if (starget->channel == mrioc->scsi_device_channel) {
 		tgt_dev = __mpi3mr_get_tgtdev_by_perst_id(mrioc, starget->id);
-		if (tgt_dev && !tgt_dev->is_hidden) {
+		if (tgt_dev && !tgt_dev->is_hidden && tgt_dev->non_stl) {
 			scsi_tgt_priv_data->starget = starget;
 			scsi_tgt_priv_data->dev_handle = tgt_dev->dev_handle;
 			scsi_tgt_priv_data->perst_id = tgt_dev->perst_id;
From: Xingui Yang yangxingui@huawei.com
[ Upstream commit 278712d20bc8ec29d1ad6ef9bdae9000ef2c220c ]
This reverts commit ab2068a6fb84751836a84c26ca72b3beb349619d.
When probing an expander-attached SATA device, libsas/libata issues a hard reset in sas_probe_sata() -> ata_sas_async_probe(), and a broadcast event is received after the disk probe fails. The reverted commit causes the probe to be re-executed on the disk when that event arrives, so a faulty disk may enter an indefinite probe loop.
Therefore, revert this commit, although it can fix some temporary issues with disk probe failure.
Signed-off-by: Xingui Yang yangxingui@huawei.com
Reviewed-by: Jason Yan yanaijie@huawei.com
Reviewed-by: John Garry john.g.garry@oracle.com
Link: https://patch.msgid.link/20251202065627.140361-1-yangxingui@huawei.com
Signed-off-by: Martin K. Petersen martin.petersen@oracle.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
LLM Generated explanations, may be completely bogus:
## Analysis Summary
### 1. Bug Description

This revert fixes an **infinite probe loop** regression introduced by commit `ab2068a6fb84`. The original commit cleared the expander phy attached address when device probe failed, intending to prevent "flutter" detection. However, this creates an infinite loop:

1. Expander-attached SATA device probe fails
2. Hard reset issued in `sas_probe_sata()` → `ata_sas_async_probe()`
3. Broadcast event received after probe failure
4. Because address was cleared, device appears "new"
5. Probe re-executed → goto step 1
**Result**: Faulty disks get stuck in indefinite probe loops.
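The loop can be illustrated with a small user-space model. This is a sketch under assumptions, not libsas code: `model_phy`, `model_count_probes()` and the disk address are hypothetical, and the real revalidation logic is more involved than a single address comparison. The model counts how often a disk whose probe always fails is probed across a bounded number of broadcast events.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_ADDR_SIZE 8

/* Hypothetical stand-in for the expander phy state kept by libsas. */
struct model_phy {
	unsigned char attached_addr[MODEL_ADDR_SIZE];
};

/*
 * Count probe attempts over `events` broadcast events for a disk whose
 * probe always fails. clear_on_fail models the reverted behaviour of
 * zeroing the recorded attached address after a failed probe.
 */
int model_count_probes(int events, bool clear_on_fail)
{
	/* Arbitrary illustrative address, not taken from the patch. */
	static const unsigned char disk_addr[MODEL_ADDR_SIZE] = {
		0x50, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01
	};
	struct model_phy phy;
	int probes = 0;
	int i;

	assert(events >= 0);
	memset(&phy, 0, sizeof(phy));

	for (i = 0; i < events; i++) {
		/* Same address as recorded: treated as flutter, no new probe. */
		if (memcmp(phy.attached_addr, disk_addr, MODEL_ADDR_SIZE) == 0)
			continue;

		/* Address differs: the device looks new, record it and probe. */
		memcpy(phy.attached_addr, disk_addr, MODEL_ADDR_SIZE);
		probes++;

		/* The probe fails; the reverted commit then cleared the address. */
		if (clear_on_fail)
			memset(phy.attached_addr, 0, MODEL_ADDR_SIZE);
	}
	return probes;
}
```

In this model every event triggers a new probe when the address is cleared, while without clearing the disk is probed once. In the scenario the commit describes, each failed probe also produces the next broadcast event, which is why the loop does not terminate on its own.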
### 2. Stable Kernel Rules Assessment

| Criterion | Assessment |
|-----------|------------|
| Fixes real bug | ✅ YES - infinite loop (severe) |
| Obviously correct | ✅ YES - pure revert, restores known-working behavior |
| Small and contained | ✅ YES - removes ~14 lines from one function in one file |
| No new features | ✅ YES - removes code, adds nothing |
| Tested | ✅ YES - Reviewed-by from Jason Yan and John Garry |

### 3. Affected Stable Trees

- Original problematic commit first appeared in **v6.10-rc7**
- Present in: v6.10.x, v6.11.x, v6.12.x, v6.13.x, v6.14.x, v6.15.x, v6.16.x stable trees
- **NOT** in v6.1.x or v6.6.x LTS trees (they don't need this fix)

### 4. Risk Assessment

**Very Low Risk**:
- Pure revert restoring pre-v6.10 behavior that worked for years
- No complex logic changes - simply removes the problematic code
- Isolated to `sas_fail_probe()` function in `sas_internal.h`
- Same author (Xingui Yang) who wrote the original is now reverting it
- Multiple reviewers approved (John Garry from Oracle is a SCSI maintainer)

### 5. User Impact

- **High impact** for users with SAS expanders and faulty/failing SATA drives
- Enterprise storage systems commonly use SAS expanders
- Infinite loops during device probe can render systems unusable
- The trade-off (accepting "flutter" detection vs. infinite loops) clearly favors the revert

### 6. Dependencies

None - this is a self-contained revert that requires no other commits.
## Conclusion
This is a textbook case for stable backporting:
- Fixes a **severe regression** (infinite loop)
- **Pure revert** with minimal risk
- **Well-reviewed** by subsystem experts
- Clearly affects real users with SAS/SATA storage configurations
- Original author acknowledges the bug is worse than the original issue it tried to fix
The commit should be backported to all stable trees v6.10.x and newer that contain the original problematic commit.
**YES**
 drivers/scsi/libsas/sas_internal.h | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/drivers/scsi/libsas/sas_internal.h b/drivers/scsi/libsas/sas_internal.h
index 6706f2be8d274..da5408c701cdd 100644
--- a/drivers/scsi/libsas/sas_internal.h
+++ b/drivers/scsi/libsas/sas_internal.h
@@ -145,20 +145,6 @@ static inline void sas_fail_probe(struct domain_device *dev, const char *func, i
 		func, dev->parent ? "exp-attached" :
 		"direct-attached",
 		SAS_ADDR(dev->sas_addr), err);
-
-	/*
-	 * If the device probe failed, the expander phy attached address
-	 * needs to be reset so that the phy will not be treated as flutter
-	 * in the next revalidation
-	 */
-	if (dev->parent && !dev_is_expander(dev->dev_type)) {
-		struct sas_phy *phy = dev->phy;
-		struct domain_device *parent = dev->parent;
-		struct ex_phy *ex_phy = &parent->ex_dev.ex_phy[phy->number];
-
-		memset(ex_phy->attached_sas_addr, 0, SAS_ADDR_SIZE);
-	}
-
 	sas_unregister_dev(dev->port, dev);
 }
 
From: Brian Kao powenkao@google.com
[ Upstream commit b4bb6daf4ac4d4560044ecdd81e93aa2f6acbb06 ]
When a W-LUN resume fails, its parent devices in the SCSI hierarchy, including the scsi_target, may be runtime suspended. Subsequently, the error handler in ufshcd_recover_pm_error() fails to set the W-LUN device back to active because the parent target is not active. This results in the following errors:
google-ufshcd 3c2d0000.ufs: ufshcd_err_handler started; HBA state eh_fatal; ...
ufs_device_wlun 0:0:0:49488: START_STOP failed for power mode: 1, result 40000
ufs_device_wlun 0:0:0:49488: ufshcd_wl_runtime_resume failed: -5
...
ufs_device_wlun 0:0:0:49488: runtime PM trying to activate child device 0:0:0:49488 but parent (target0:0:0) is not active
Address this by:
1. Ensuring the W-LUN's parent scsi_target is runtime resumed before attempting to set the W-LUN to active within ufshcd_recover_pm_error().
2. Explicitly checking for power.runtime_error on the HBA and W-LUN devices before calling pm_runtime_set_active() to clear the error state.
3. Adding pm_runtime_get_sync(hba->dev) in ufshcd_err_handling_prepare() to ensure the HBA itself is active during error recovery, even if a child device resume failed.
These changes ensure the device power states are managed correctly during error recovery.
Signed-off-by: Brian Kao powenkao@google.com
Tested-by: Brian Kao powenkao@google.com
Reviewed-by: Bart Van Assche bvanassche@acm.org
Link: https://patch.msgid.link/20251112063214.1195761-1-powenkao@google.com
Signed-off-by: Martin K. Petersen martin.petersen@oracle.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
LLM Generated explanations, may be completely bogus:
## Analysis Summary
### Problem Description

This commit fixes a bug where UFS error recovery fails after a W-LUN (Well Known Logical Unit Number) resume error. When a W-LUN resume fails:

1. The parent devices in the SCSI hierarchy (including `scsi_target`) remain runtime suspended
2. The error handler's `ufshcd_recover_pm_error()` attempts to set the W-LUN device active
3. The PM core rejects this because the parent target is not active, producing the error:
   ```
   runtime PM trying to activate child device 0:0:0:49488 but parent (target0:0:0) is not active
   ```
### Technical Analysis of the Fix
**Original (buggy) code:**
```c
ret = pm_runtime_set_active(&hba->ufs_device_wlun->sdev_gendev);
if (ret)
	ret = pm_runtime_set_active(hba->dev);
```
This incorrectly tries to activate a child device while its parent is suspended - violating the PM hierarchy requirement.
**The fix does three things:**
1. **Adds `pm_runtime_get_sync(hba->dev)` in `ufshcd_err_handling_prepare()`** - Ensures HBA is active during error recovery even if a child device resume failed
2. **Fixes PM hierarchy in `ufshcd_recover_pm_error()`** - Explicitly resumes the parent `scsi_target` before trying to activate the child W-LUN
3. **Checks `power.runtime_error`** - Only clears error state on devices that actually have an error, rather than relying on return values
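The parent-before-child rule can be sketched as a user-space model. Everything named `model_*` below is hypothetical and greatly simplified: the hierarchy is collapsed to hba → starget → wlun, usage counts are not modelled, and the matching `pm_runtime_put_sync()` is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_rpm_status { MODEL_RPM_SUSPENDED, MODEL_RPM_ACTIVE };

/* Hypothetical stand-in for struct device and its runtime PM fields. */
struct model_dev {
	struct model_dev *parent;
	enum model_rpm_status status;
	int runtime_error;
};

/* Simplified pm_runtime_set_active(): refuses while the parent is not active. */
int model_set_active(struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->parent && dev->parent->status != MODEL_RPM_ACTIVE)
		return -1;	/* the kernel logs "parent ... is not active" here */
	dev->status = MODEL_RPM_ACTIVE;
	dev->runtime_error = 0;
	return 0;
}

/* Simplified resume: bring up every ancestor first, then the device itself. */
void model_resume(struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->parent)
		model_resume(dev->parent);
	dev->status = MODEL_RPM_ACTIVE;
}

/* Follows the order used by the patched ufshcd_recover_pm_error(). */
bool model_recover_pm_error(struct model_dev *hba, struct model_dev *starget,
			    struct model_dev *wlun)
{
	bool resume_sdev_queues = false;

	if (hba->runtime_error) {
		model_set_active(hba);
		resume_sdev_queues = true;
	}
	if (wlun->runtime_error) {
		model_resume(starget);	/* stands in for pm_runtime_get_sync() */
		model_set_active(wlun);
		resume_sdev_queues = true;
	}
	return resume_sdev_queues;
}
```

In this model a direct `model_set_active()` on the W-LUN fails while the target is suspended, which corresponds to the log message quoted in the commit; after the parent is resumed the same call succeeds and the error is cleared.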
### Stable Tree Criteria Assessment
| Criterion | Assessment |
|-----------|------------|
| Fixes real bug | ✅ YES - Error handling completely fails without this |
| Obviously correct | ✅ YES - Follows standard PM parent-before-child rules |
| Small and contained | ✅ YES - ~35 lines in one file, error handling path only |
| No new features | ✅ YES - Pure bugfix |
| Tested | ✅ YES - Has Tested-by and Reviewed-by: Bart Van Assche |
| Exists in stable | ✅ YES - `ufshcd_recover_pm_error()` introduced in v5.10 |
| Dependencies | ✅ NONE - Uses standard, long-existing PM APIs |

### User Impact

- **Affected systems**: Android devices, embedded systems with UFS storage
- **Severity**: Error handler failure can lead to I/O failures and potential data loss
- **Frequency**: Occurs when W-LUN resume fails and error handler tries to recover

### Risk Assessment

- **Low regression risk**: Changes are in error handling code path only
- **Well-understood fix**: Standard PM hierarchy handling pattern
- **Good review**: Bart Van Assche (experienced SCSI/block maintainer) reviewed it
### Conclusion
This is a clear-cut bug fix for UFS error handling. The bug causes the error handler to fail completely when a W-LUN resume error occurs, which can leave the storage subsystem in an unrecoverable state. The fix correctly implements PM hierarchy requirements (parent must be active before child), is well-tested, has expert review, and uses standard APIs that exist in all stable trees since v5.10. The changes are contained to the error handling path with minimal regression risk.
**YES**
 drivers/ufs/core/ufshcd.c | 36 ++++++++++++++++++++++++++++--------
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index d6a060a724618..ce52c3bafbe8f 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -6498,6 +6498,11 @@ static void ufshcd_clk_scaling_suspend(struct ufs_hba *hba, bool suspend)
 
 static void ufshcd_err_handling_prepare(struct ufs_hba *hba)
 {
+	/*
+	 * A WLUN resume failure could potentially lead to the HBA being
+	 * runtime suspended, so take an extra reference on hba->dev.
+	 */
+	pm_runtime_get_sync(hba->dev);
 	ufshcd_rpm_get_sync(hba);
 	if (pm_runtime_status_suspended(&hba->ufs_device_wlun->sdev_gendev) ||
 	    hba->is_sys_suspended) {
@@ -6537,6 +6542,7 @@ static void ufshcd_err_handling_unprepare(struct ufs_hba *hba)
 	if (ufshcd_is_clkscaling_supported(hba))
 		ufshcd_clk_scaling_suspend(hba, false);
 	ufshcd_rpm_put(hba);
+	pm_runtime_put(hba->dev);
 }
 
 static inline bool ufshcd_err_handling_should_stop(struct ufs_hba *hba)
@@ -6551,28 +6557,42 @@ static inline bool ufshcd_err_handling_should_stop(struct ufs_hba *hba)
 #ifdef CONFIG_PM
 static void ufshcd_recover_pm_error(struct ufs_hba *hba)
 {
+	struct scsi_target *starget = hba->ufs_device_wlun->sdev_target;
 	struct Scsi_Host *shost = hba->host;
 	struct scsi_device *sdev;
 	struct request_queue *q;
-	int ret;
+	bool resume_sdev_queues = false;
 
 	hba->is_sys_suspended = false;
+
 	/*
-	 * Set RPM status of wlun device to RPM_ACTIVE,
-	 * this also clears its runtime error.
+	 * Ensure the parent's error status is cleared before proceeding
+	 * to the child, as the parent must be active to activate the child.
 	 */
-	ret = pm_runtime_set_active(&hba->ufs_device_wlun->sdev_gendev);
+	if (hba->dev->power.runtime_error) {
+		/* hba->dev has no functional parent thus simplily set RPM_ACTIVE */
+		pm_runtime_set_active(hba->dev);
+		resume_sdev_queues = true;
+	}
+
+	if (hba->ufs_device_wlun->sdev_gendev.power.runtime_error) {
+		/*
+		 * starget, parent of wlun, might be suspended if wlun resume failed.
+		 * Make sure parent is resumed before set child (wlun) active.
+		 */
+		pm_runtime_get_sync(&starget->dev);
+		pm_runtime_set_active(&hba->ufs_device_wlun->sdev_gendev);
+		pm_runtime_put_sync(&starget->dev);
+		resume_sdev_queues = true;
+	}
 
-	/* hba device might have a runtime error otherwise */
-	if (ret)
-		ret = pm_runtime_set_active(hba->dev);
 	/*
 	 * If wlun device had runtime error, we also need to resume those
 	 * consumer scsi devices in case any of them has failed to be
 	 * resumed due to supplier runtime resume failure. This is to unblock
 	 * blk_queue_enter in case there are bios waiting inside it.
 	 */
-	if (!ret) {
+	if (resume_sdev_queues) {
 		shost_for_each_device(sdev, shost) {
 			q = sdev->request_queue;
 			if (q->dev && (q->rpm_status == RPM_SUSPENDED ||
From: Wen Xiong wenxiong@linux.ibm.com
[ Upstream commit 6ac3484fb13b2fc7f31cfc7f56093e7d0ce646a5 ]
A dynamic remove/add storage adapter test hits EEH on PowerPC:
EEH: [c00000000004f75c] __eeh_send_failure_event+0x7c/0x160
EEH: [c000000000048444] eeh_dev_check_failure.part.0+0x254/0x650
EEH: [c008000001650678] eeh_readl+0x60/0x90 [ipr]
EEH: [c00800000166746c] ipr_cancel_op+0x2b8/0x524 [ipr]
EEH: [c008000001656524] ipr_eh_abort+0x6c/0x130 [ipr]
EEH: [c000000000ab0d20] scmd_eh_abort_handler+0x140/0x440
EEH: [c00000000017e558] process_one_work+0x298/0x590
EEH: [c00000000017eef8] worker_thread+0xa8/0x620
EEH: [c00000000018be34] kthread+0x124/0x130
EEH: [c00000000000cd64] ret_from_kernel_thread+0x5c/0x64

A PCIe bus trace reveals that one MSI-X vector is cleared to 0 by the irqbalance daemon. With the irqbalance daemon disabled, the issue does not occur.
With debug enabled in ipr driver:
[ 44.103071] ipr: Entering __ipr_remove
[ 44.103083] ipr: Entering ipr_initiate_ioa_bringdown
[ 44.103091] ipr: Entering ipr_reset_shutdown_ioa
[ 44.103099] ipr: Leaving ipr_reset_shutdown_ioa
[ 44.103105] ipr: Leaving ipr_initiate_ioa_bringdown
[ 44.149918] ipr: Entering ipr_reset_ucode_download
[ 44.149935] ipr: Entering ipr_reset_alert
[ 44.150032] ipr: Entering ipr_reset_start_timer
[ 44.150038] ipr: Leaving ipr_reset_alert
[ 44.244343] scsi 1:2:3:0: alua: Detached
[ 44.254300] ipr: Entering ipr_reset_start_bist
[ 44.254320] ipr: Entering ipr_reset_start_timer
[ 44.254325] ipr: Leaving ipr_reset_start_bist
[ 44.364329] scsi 1:2:4:0: alua: Detached
[ 45.134341] scsi 1:2:5:0: alua: Detached
[ 45.860949] ipr: Entering ipr_reset_shutdown_ioa
[ 45.860962] ipr: Leaving ipr_reset_shutdown_ioa
[ 45.860966] ipr: Entering ipr_reset_alert
[ 45.861028] ipr: Entering ipr_reset_start_timer
[ 45.861035] ipr: Leaving ipr_reset_alert
[ 45.964302] ipr: Entering ipr_reset_start_bist
[ 45.964309] ipr: Entering ipr_reset_start_timer
[ 45.964313] ipr: Leaving ipr_reset_start_bist
[ 46.264301] ipr: Entering ipr_reset_bist_done
[ 46.264309] ipr: Leaving ipr_reset_bist_done

During adapter reset, the ipr device driver blocks config space access but cannot block MMIO access to the MSI-X entries. This leaves a very small window: the irqbalance daemon can kick in during adapter reset, before the ipr driver calls pci_restore_state(pdev) to restore the MSI-X table.
irqbalance daemon reads back all 0 for that MSI-X vector in __pci_read_msi_msg().
irqbalance daemon:
msi_domain_set_affinity()
 ->irq_chip_set_affinity_parent()
 ->xive_irq_set_affinity()
 ->irq_chip_compose_msi_msg()
 ->pseries_msi_compose_msg()
 ->__pci_read_msi_msg(): read all 0 since didn't call pci_restore_state
 ->irq_chip_write_msi_msg()
 -> pci_write_msg_msi(): write 0 to the msix vector entry
When ipr driver calls pci_restore_state(pdev) in ipr_reset_restore_cfg_space(), the MSI-X vector entry has been cleared by irqbalance daemon in pci_write_msg_msix().
pci_restore_state() ->__pci_restore_msix_state()
Below is the MSI-X table for ipr adapter after irqbalance daemon kicked in during adapter reset:
Dump MSIx table: index=0 address_lo=c800 address_hi=10000000 msg_data=0
Dump MSIx table: index=1 address_lo=c810 address_hi=10000000 msg_data=0
Dump MSIx table: index=2 address_lo=c820 address_hi=10000000 msg_data=0
Dump MSIx table: index=3 address_lo=c830 address_hi=10000000 msg_data=0
Dump MSIx table: index=4 address_lo=c840 address_hi=10000000 msg_data=0
Dump MSIx table: index=5 address_lo=c850 address_hi=10000000 msg_data=0
Dump MSIx table: index=6 address_lo=c860 address_hi=10000000 msg_data=0
Dump MSIx table: index=7 address_lo=c870 address_hi=10000000 msg_data=0
Dump MSIx table: index=8 address_lo=0 address_hi=0 msg_data=0 ---------> Hit EEH since msix vector of index=8 are 0
Dump MSIx table: index=9 address_lo=c890 address_hi=10000000 msg_data=0
Dump MSIx table: index=10 address_lo=c8a0 address_hi=10000000 msg_data=0
Dump MSIx table: index=11 address_lo=c8b0 address_hi=10000000 msg_data=0
Dump MSIx table: index=12 address_lo=c8c0 address_hi=10000000 msg_data=0
Dump MSIx table: index=13 address_lo=c8d0 address_hi=10000000 msg_data=0
Dump MSIx table: index=14 address_lo=c8e0 address_hi=10000000 msg_data=0
Dump MSIx table: index=15 address_lo=c8f0 address_hi=10000000 msg_data=0

[ 46.264312] ipr: Entering ipr_reset_restore_cfg_space
[ 46.267439] ipr: Entering ipr_fail_all_ops
[ 46.267447] ipr: Leaving ipr_fail_all_ops
[ 46.267451] ipr: Leaving ipr_reset_restore_cfg_space
[ 46.267454] ipr: Entering ipr_ioa_bringdown_done
[ 46.267458] ipr: Leaving ipr_ioa_bringdown_done
[ 46.267467] ipr: Entering ipr_worker_thread
[ 46.267470] ipr: Leaving ipr_worker_thread
IRQ balancing is not required during adapter reset.
Enable "IRQ_NO_BALANCING" flag before starting adapter reset and disable it after calling pci_restore_state(). The irqbalance daemon is disabled for this short period of time (~2s).
Co-developed-by: Kyle Mahlkuch Kyle.Mahlkuch@ibm.com
Signed-off-by: Kyle Mahlkuch Kyle.Mahlkuch@ibm.com
Signed-off-by: Wen Xiong wenxiong@linux.ibm.com
Link: https://patch.msgid.link/20251028142427.3969819-2-wenxiong@linux.ibm.com
Signed-off-by: Martin K. Petersen martin.petersen@oracle.com
Signed-off-by: Sasha Levin sashal@kernel.org
---
LLM Generated explanations, may be completely bogus:
The APIs are well-established with 124 callers for `irq_set_status_flags` - this approach is standard practice in the kernel.
---
## Summary
**Why this should be backported:**
1. **Fixes a real, serious bug**: EEH events on PowerPC are significant hardware errors that can cause system instability and require recovery
2. **Clear root cause and fix**: Race condition between irqbalance daemon and adapter reset is clearly identified and surgically fixed
3. **Small, self-contained change**: ~30 lines added to a single driver file
4. **Uses standard kernel APIs**: `irq_set_status_flags()` and `irq_clear_status_flags()` are used by 124+ other callers - well-tested mechanism
5. **Low regression risk**: Changes only affect the ipr driver during adapter reset, no impact on other subsystems
6. **Critical for affected users**: Systems running IBM Power with ipr storage adapters and irqbalance enabled will hit this during hot-plug operations
7. **No dependencies**: All required APIs have been in the kernel for many years

**Risk Assessment**: LOW
- The fix is narrowly scoped to the reset path
- Uses battle-tested IRQ infrastructure
- Only affects ipr driver users during specific operations
**Benefit**: HIGH for affected users - prevents EEH crashes during adapter reset operations
This commit meets all stable kernel criteria: it fixes a real bug, is obviously correct, is small and contained, and introduces no new features. The mechanism used (temporarily disabling IRQ balancing during critical operations) is a standard pattern used throughout the kernel.
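The race and the effect of the flag can be sketched with a small user-space model. This is an illustration under assumptions rather than kernel code: `model_msix_vector` and the `model_*` functions are hypothetical, one vector stands in for the whole table, and the address used in the example is illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one MSI-X vector. */
struct model_msix_vector {
	uint32_t hw_address_lo;		/* entry in the device's MSI-X table */
	uint32_t saved_address_lo;	/* kernel's cached copy of the message */
	bool no_balancing;		/* models the IRQ_NO_BALANCING flag */
};

/* Adapter reset wipes the device-side table entry. */
void model_adapter_reset(struct model_msix_vector *v)
{
	assert(v != NULL);
	v->hw_address_lo = 0;
}

/*
 * Affinity change requested by irqbalance: the message is read back from
 * the device and rewritten, which also updates the cached copy.
 * Refused while the no-balancing flag is set.
 */
bool model_set_affinity(struct model_msix_vector *v)
{
	uint32_t msg;

	assert(v != NULL);
	if (v->no_balancing)
		return false;
	msg = v->hw_address_lo;		/* reads 0 during the reset window */
	v->hw_address_lo = msg;
	v->saved_address_lo = msg;
	return true;
}

/* Restoring state rewrites the table entry from the cached copy. */
void model_restore_state(struct model_msix_vector *v)
{
	assert(v != NULL);
	v->hw_address_lo = v->saved_address_lo;
}
```

If `model_set_affinity()` runs between `model_adapter_reset()` and `model_restore_state()`, the cached copy becomes 0 and the restore writes 0 back, which matches the zeroed entry in the dump above. With `no_balancing` set for the duration of the reset, the cached copy survives and the restore brings back the original value.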
**YES**
 drivers/scsi/ipr.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c
index 44214884deaf5..d62bb7d0e4164 100644
--- a/drivers/scsi/ipr.c
+++ b/drivers/scsi/ipr.c
@@ -61,8 +61,8 @@
 #include <linux/hdreg.h>
 #include <linux/reboot.h>
 #include <linux/stringify.h>
+#include <linux/irq.h>
 #include <asm/io.h>
-#include <asm/irq.h>
 #include <asm/processor.h>
 #include <scsi/scsi.h>
 #include <scsi/scsi_host.h>
@@ -7843,6 +7843,30 @@ static int ipr_dump_mailbox_wait(struct ipr_cmnd *ipr_cmd)
 	return IPR_RC_JOB_RETURN;
 }
 
+/**
+ * ipr_set_affinity_nobalance
+ * @ioa_cfg:	ipr_ioa_cfg struct for an ipr device
+ * @flag:	bool
+ *	true: ensable "IRQ_NO_BALANCING" bit for msix interrupt
+ *	false: disable "IRQ_NO_BALANCING" bit for msix interrupt
+ * Description: This function will be called to disable/enable
+ *	"IRQ_NO_BALANCING" to avoid irqbalance daemon
+ *	kicking in during adapter reset.
+ **/
+static void ipr_set_affinity_nobalance(struct ipr_ioa_cfg *ioa_cfg, bool flag)
+{
+	int irq, i;
+
+	for (i = 0; i < ioa_cfg->nvectors; i++) {
+		irq = pci_irq_vector(ioa_cfg->pdev, i);
+
+		if (flag)
+			irq_set_status_flags(irq, IRQ_NO_BALANCING);
+		else
+			irq_clear_status_flags(irq, IRQ_NO_BALANCING);
+	}
+}
+
 /**
  * ipr_reset_restore_cfg_space - Restore PCI config space.
  * @ipr_cmd:	ipr command struct
@@ -7867,6 +7891,7 @@ static int ipr_reset_restore_cfg_space(struct ipr_cmnd *ipr_cmd)
 		return IPR_RC_JOB_CONTINUE;
 	}
 
+	ipr_set_affinity_nobalance(ioa_cfg, false);
 	ipr_fail_all_ops(ioa_cfg);
 
 	if (ioa_cfg->sis64) {
@@ -7946,6 +7971,7 @@ static int ipr_reset_start_bist(struct ipr_cmnd *ipr_cmd)
 		rc = pci_write_config_byte(ioa_cfg->pdev, PCI_BIST, PCI_BIST_START);
 
 	if (rc == PCIBIOS_SUCCESSFUL) {
+		ipr_set_affinity_nobalance(ioa_cfg, true);
 		ipr_cmd->job_step = ipr_reset_bist_done;
 		ipr_reset_start_timer(ipr_cmd, IPR_WAIT_FOR_BIST_TIMEOUT);
 		rc = IPR_RC_JOB_RETURN;
linux-stable-mirror@lists.linaro.org