On 10/04/2025 4:36 pm, Robin Murphy wrote:
On 09/04/2025 4:56 pm, Naresh Kamboju wrote:
On Wed, 2 Apr 2025 at 21:04, Robin Murphy robin.murphy@arm.com wrote:
On 31/03/2025 5:03 am, Naresh Kamboju wrote:
Regression on arm64 Juno-r2 devices: the SSD detection tests fail on both Linux next and Linux mainline.
First seen on the v6.14-7245-g5c2a430e8599
Good: v6.14
Bad: v6.14-7422-gacb4f33713b9
Sorry, I can't seem to reproduce this on my end: today's mainline and acb4f33713b9 with my config, and even acb4f33713b9 with the linked LKFT config, all work OK on my Juno r2 (using a SATA SSD and PCIe networking). The only thing which stands out in your log is that PCI seems to give up probing and assigning resources beyond the switch downstream ports (so SATA and ethernet are never discovered), whereas on mine it does[2]. However that all happens before the first IOMMU instance probes (which conveniently is the PCIe one), so it's hard to imagine how that could have an effect anyway...
The only obvious difference is that I'm using EDK2 rather than U-Boot, so that's done all the PCIe configuration once already, but it doesn't seem like that's significant - looking back at a random older log[1], the on-board endpoints were still being picked up right after reconfiguring the switch, well before the IOMMU comes into the picture.
Since it is still an issue on mainline and next, I bisected it and reverted the patch below. With the revert the kernel warns at boot time, but it does find the SSD drive:
[bcb81ac6ae3c2ef95b44e7b54c3c9522364a245c] iommu: Get DT/ACPI parsing into the proper probe path
pcieport 0000:00:00.0: late IOMMU probe at driver bind, something fishy here!
WARNING: at drivers/iommu/iommu.c:559 __iommu_probe_device
I see boot warnings [1]. I am happy to test debug patches if you have any.
Seeing the warning after reverting the commit which introduced the warning mostly just means the conflict resolution in the revert wasn't right (there were some subsequent fixups...)
Anyway, I have now managed to get my Juno booting with the same antique version of U-Boot and finally reproduce the issue. It seems to be somehow connected to bus->dma_configure() being called in the device_add() notifier (even though the rest of the IOMMU setup doesn't run at that point since the driver hasn't registered yet), but how and why that prevents the buses behind the switch downstream ports being probed, and why *that* only happens when the switch isn't already configured, remains a mystery so far. I'm still digging...
OK, I found it, but I'm still not sure what exactly to make of it - it's the pci_request_acs() in of_iommu_configure(), now being called early enough to actually have an effect. Booting with EDK2 already using PCI prior to Linux, here's what I get for `sudo lspci -vv | grep ACSCtl` with 6.15-rc1:
ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans- ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans- ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans- ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans- ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans- ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans-
whereas with the 6.14 behaviour they are all '-'. I don't have a working root filesystem with the U-Boot setup, but if I boot it with "pci=config_acs=000000@pci:0:0" then the kernel does assign the bridge windows and discover the ethernet/SATA endpoints again. I can spend some time getting NFS working next week, but if you're able to get lspci output off a machine in the "broken" state easily that would be handy to compare.
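For anyone comparing lspci dumps from the "working" and "broken" states, a minimal Python sketch of how an ACSCtl line maps to an ACS Control register value, and of how a config_acs flag string would rewrite it, might look like the following. The helper names are hypothetical; the bit order follows the PCIe ACS Control register layout, and the flag-string handling (most significant bit first, '1' force on, '0' force off, 'x' unchanged) is my reading of the kernel parameter documentation, so treat it as an assumption.

```python
import re

# ACS Control register bit positions, keyed by the names lspci prints.
ACS_BITS = {
    "SrcValid": 0,
    "TransBlk": 1,
    "ReqRedir": 2,
    "CmpltRedir": 3,
    "UpstreamFwd": 4,
    "EgressCtrl": 5,
    "DirectTrans": 6,
}


def parse_acsctl(line):
    """Convert one 'ACSCtl: Name+ Name- ...' line into a register value."""
    fields = line.split("ACSCtl:", 1)[1]
    value = 0
    for name, sign in re.findall(r"(\w+)([+-])", fields):
        if sign == "+":
            value |= 1 << ACS_BITS[name]
    return value


def apply_config_acs(flags, value):
    """Apply a config_acs flag string to a register value.

    Assumed semantics: the rightmost character is bit 0; '1' forces the
    bit on, '0' forces it off, 'x' leaves it unchanged. Bits beyond the
    length of the string are left alone.
    """
    for bit, ch in enumerate(reversed(flags)):
        if ch == "1":
            value |= 1 << bit
        elif ch == "0":
            value &= ~(1 << bit)
    return value


line = ("ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmpltRedir+ "
        "UpstreamFwd+ EgressCtrl- DirectTrans-")
before = parse_acsctl(line)
after = apply_config_acs("000000", before)
print(hex(before), hex(after))
```

Under those assumptions, the line quoted above decodes to 0x1d, and "000000" clears the low six bits while leaving DirectTrans untouched.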
So at this point it would seem to be something about how Linux configures ACS when doing it from scratch. What I don't really know is where to go from there. I do know Juno's possibly a bit odd in that the switch supports ACS, but neither the root port nor the endpoints either side of it do. Could this be tickling some subtle bug in the PCI layer, and what is EDK2 doing that makes it not happen?
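The 6.15-rc1 values quoted earlier are consistent with the kernel's standard ACS enable path switching on Source Validation, Request Redirect, Completion Redirect and Upstream Forwarding wherever a port advertises them. A rough Python model of that policy is sketched below. It is an assumption about what pci_std_enable_acs() does rather than a copy of it: Translation Blocking, which I believe the kernel only enables in special cases, is left out, and the capability values are made up for illustration rather than read from a Juno.

```python
# ACS capability/control bits (same layout in both registers).
PCI_ACS_SV = 0x01  # Source Validation
PCI_ACS_TB = 0x02  # Translation Blocking
PCI_ACS_RR = 0x04  # P2P Request Redirect
PCI_ACS_CR = 0x08  # P2P Completion Redirect
PCI_ACS_UF = 0x10  # Upstream Forwarding
PCI_ACS_EC = 0x20  # P2P Egress Control
PCI_ACS_DT = 0x40  # Direct Translated P2P

# Features the modelled policy asks for when ACS has been requested.
WANTED = PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF


def std_enable_acs(cap, ctrl=0):
    """Return the control value after enabling every wanted, supported bit."""
    return ctrl | (cap & WANTED)


# Hypothetical capability values: a switch port that implements every
# ACS feature, and a port with no ACS support at all.
switch_port_cap = 0x7F
no_acs_cap = 0x00

print(hex(std_enable_acs(switch_port_cap)))  # wanted bits only
print(hex(std_enable_acs(no_acs_cap)))       # nothing to enable
```

In this model only the switch ports end up with a non-zero ACSCtl, which matches the shape of the topology described above: ACS on the switch, but not on the root port or the endpoints.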
Thanks, Robin.