From: Marc Zyngier <maz@kernel.org>
[ Upstream commit 3cc8f625e4c6a0e9f936da6b94166e62e387fe1d ]
Since changing the affinity of an MSI really is about changing the target address and that it isn't possible to mask an individual MSI, it is completely possible for an interrupt to race with itself, usually resulting in a lost interrupt.
Paper over the design blunder by informing the core code of this sad state of affairs.
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Lorenzo Pieralisi <lpieralisi@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20250708173404.1278635-11-maz@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
**Backport Status: YES**
This commit should be backported to stable kernel trees. Here's my detailed analysis:
## Key Evidence for Backporting:
1. **Fixes a Real Bug - Lost Interrupts**: The commit addresses a race condition where MSI interrupts can be lost when CPU affinity changes occur. The commit message explicitly states this results in "a lost interrupt" which is a serious issue that can cause system malfunction or hangs.
2. **Minimal and Contained Change**: The fix consists of a single line addition:
   ```c
   irqd_set_resend_when_in_progress(irq_get_irq_data(virq));
   ```
   This is an extremely small, targeted fix that only affects the xgene-msi driver's interrupt allocation path.
3. **Part of a Broader Fix Pattern**: This commit follows the same pattern established for other interrupt controllers:
   - Commit 9c15eeb5362c4 introduced this mechanism for fasteoi handlers in 2023
   - Commit 2c9e7f857400 extended it to handle_simple_irq() flow
   - This commit applies the same proven workaround to the xgene-msi driver
4. **Addresses Hardware Design Limitation**: The commit message describes this as papering over a "design blunder": the hardware cannot mask an individual MSI, and changing affinity means changing the target address, so the race is unavoidable in hardware. This is not a new feature but a workaround for a limitation of existing hardware.
5. **History of Race Conditions in This Driver**: The xgene-msi driver has had race condition issues before (commit a93c00e5f975 fixed a different race in 2021), indicating this subsystem needs these types of fixes for stability.
6. **No Architecture Changes**: The fix uses existing kernel infrastructure (irqd_set_resend_when_in_progress) without introducing new APIs or changing kernel architecture.
7. **Low Risk of Regression**: The change only affects the specific xgene-msi driver and only adds a resend mechanism when interrupts are already in progress - it doesn't change the normal interrupt handling path.
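The race and the effect of the flag can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct toy_irq`, `toy_irq_enter()`, `toy_irq_exit()` and `simulate_race()` are hypothetical names invented for this illustration, and the logic is a loose approximation of the "mark pending and replay" behaviour that the resend-when-in-progress flag requests, not the kernel's actual genirq code.

```c
#include <stdbool.h>

/* Hypothetical toy state; NOT the kernel's struct irq_desc or irq_data. */
struct toy_irq {
	bool in_progress;             /* a handler is currently running */
	bool resend_when_in_progress; /* models the flag set by the patch */
	bool pending;                 /* an interrupt arrived while busy */
	int handled;                  /* number of completed handler runs */
	int lost;                     /* number of interrupts dropped */
};

/* An interrupt arrives. Returns true if the handler may start now. */
static bool toy_irq_enter(struct toy_irq *irq)
{
	if (irq->in_progress) {
		if (irq->resend_when_in_progress)
			irq->pending = true; /* remember it for a later replay */
		else
			irq->lost++;         /* silently dropped */
		return false;
	}
	irq->in_progress = true;
	return true;
}

/* The handler finishes; replay one remembered interrupt, if any. */
static void toy_irq_exit(struct toy_irq *irq)
{
	irq->handled++;
	irq->in_progress = false;
	if (irq->pending) {
		irq->pending = false;
		if (toy_irq_enter(irq))
			toy_irq_exit(irq);
	}
}

/*
 * The same MSI fires a second time (as if on another CPU, right after an
 * affinity change) while the first instance is still being handled.
 */
static struct toy_irq simulate_race(bool resend)
{
	struct toy_irq irq = { .resend_when_in_progress = resend };
	bool first = toy_irq_enter(&irq);
	bool second = toy_irq_enter(&irq);

	if (second)
		toy_irq_exit(&irq);
	if (first)
		toy_irq_exit(&irq);
	return irq;
}
```

In this model, `simulate_race(false)` ends with one handled and one lost interrupt, while `simulate_race(true)` ends with two handled and none lost, which is the behaviour the one-line change is meant to obtain from the core code.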
## Stable Tree Criteria Met:
- ✅ Fixes a real bug (lost interrupts)
- ✅ Small change (1 line)
- ✅ Not a new feature
- ✅ Isolated to specific driver
- ✅ Uses established kernel mechanisms
- ✅ Addresses hardware limitation that affects deployed systems
The fix prevents interrupt loss during CPU affinity changes on X-Gene MSI controllers, which is exactly the type of bug fix that stable kernels should receive to maintain system reliability.
 drivers/pci/controller/pci-xgene-msi.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/pci/controller/pci-xgene-msi.c b/drivers/pci/controller/pci-xgene-msi.c
index b05ec8b0bb93..50647fa14e69 100644
--- a/drivers/pci/controller/pci-xgene-msi.c
+++ b/drivers/pci/controller/pci-xgene-msi.c
@@ -200,6 +200,7 @@ static int xgene_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
 	irq_domain_set_info(domain, virq, msi_irq,
 			    &xgene_msi_bottom_irq_chip, domain->host_data,
 			    handle_simple_irq, NULL, NULL);
+	irqd_set_resend_when_in_progress(irq_get_irq_data(virq));
 
 	return 0;
 }