On Sat, Jul 05, 2025 at 08:30:46PM +0530, Bandhan Pramanik wrote:
Hello,
The dmesg log (the older one) is present here:
[1]: https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d83...
The newer dmesg log starts at the first line, so nothing in it was overwritten by the ring buffer. It was captured with pci=noaer, so it doesn't have the error recorded: https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d83...
Please check the older dmesg: the quoted line was taken from it verbatim, and any additional details are there as well.
Bandhan
On Sat, Jul 5, 2025 at 7:20 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
On Sat, Jul 05, 2025 at 01:00:23AM +0530, Bandhan Pramanik wrote:
Hi everyone,
Here after a week. I did my research.
I talked to some folks on IRC and the glaring issue was basically this:
[ 1146.810055] pcieport 0000:00:1c.0: AER: Uncorrectable (Fatal) error message received from 0000:01:00.0
From [1]:
[ 1146.810055] pcieport 0000:00:1c.0: AER: Uncorrectable (Fatal) error message received from 0000:01:00.0
[ 1146.810069] ath10k_pci 0000:01:00.0: AER: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
[ 1146.813130] ath10k_pci 0000:01:00.0: AER: can't recover (no error_detected callback)
[ 1146.948066] pcieport 0000:00:1c.0: AER: Root Port link has been reset (0)
[ 1146.948112] pcieport 0000:00:1c.0: AER: device recovery failed
[ 1146.949480] ath10k_pci 0000:01:00.0: failed to wake target for read32 at 0x0003a028: -110
I think Linux is not doing a very good job of extracting error information. I think is_error_source() read PCI_ERR_UNCOR_STATUS from 01:00.0 and saw an error logged, but aer_get_device_error_info() declined to read PCI_ERR_UNCOR_STATUS again because we thought the link was unusable, so aer_print_error() didn't have any info to print, hence the "Inaccessible" message.
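To make the suspected path concrete, here is a minimal userspace sketch of the gating condition that the patch below removes from aer_get_device_error_info(). It is only a model under stated assumptions: the enum names, rereads_uncor_status() and uncor_status_seen() are hypothetical helpers invented for illustration, not kernel APIs, and the correctable-error branch and the mask check are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's PCIe port types and AER severities. */
enum port_type { PORT_ENDPOINT, PORT_ROOT_PORT, PORT_RC_EC, PORT_DOWNSTREAM };
enum severity { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL };

/*
 * Condition quoted in the patch: the uncorrectable status register is only
 * re-read for these port types, or when the error is non-fatal.
 */
static bool rereads_uncor_status(enum port_type type, enum severity sev)
{
	return type == PORT_ROOT_PORT || type == PORT_RC_EC ||
	       type == PORT_DOWNSTREAM || sev == SEV_NONFATAL;
}

/*
 * Simplified model of what the print step gets to see for an uncorrectable
 * error. hw_status is whatever the device has latched; 0 is returned when
 * the register is never read, which leaves nothing to decode.
 */
static uint32_t uncor_status_seen(enum port_type type, enum severity sev,
				  uint32_t hw_status)
{
	uint32_t status = 0;	/* reset first, as the kernel function does */

	if (rereads_uncor_status(type, sev))
		status = hw_status;
	return status;
}
```

With a made-up latched value of 0x4000, an endpoint reporting a fatal error yields 0 in this model, while the same endpoint with a non-fatal error, or a Root Port with a fatal one, yields 0x4000. Replacing the condition with a plain `else`, as the patch does, corresponds to rereads_uncor_status() always returning true.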
Are you able to rebuild a kernel with the patch below? This is based on v6.16-rc1 and likely wouldn't apply cleanly to your v6.14 kernel. But if you are able to build v6.16-rc1 with this patch, or adapt it to v6.14, I'd be interested in the output.
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 70ac66188367..99acb1e1946e 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -990,6 +990,8 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
 	if ((PCI_BUS_NUM(e_info->id) != 0) &&
 	    !(dev->bus->bus_flags & PCI_BUS_FLAGS_NO_AERSID)) {
 		/* Device ID match? */
+		pci_info(dev, "%s: bus_flags %#x e_info->id %#04x\n",
+			 __func__, dev->bus->bus_flags, e_info->id);
 		if (e_info->id == pci_dev_id(dev))
 			return true;
 
@@ -1025,6 +1027,10 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, &status);
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
 	}
+	pci_info(dev, "%s: %s STATUS %#010x MASK %#010x\n",
+		 __func__,
+		 e_info->severity == AER_CORRECTABLE ? "COR" : "UNCOR",
+		 status, mask);
 	if (status & ~mask)
 		return true;
 
@@ -1368,6 +1374,8 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
 	aer = dev->aer_cap;
 	type = pci_pcie_type(dev);
 
+	pci_info(dev, "%s: type %#x cap %#04x\n", __func__, type, aer);
+
 	/* Must reset in this function */
 	info->status = 0;
 	info->tlp_header_valid = 0;
@@ -1383,16 +1391,14 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
 				      &info->mask);
 		if (!(info->status & ~info->mask))
 			return 0;
-	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
-		   type == PCI_EXP_TYPE_RC_EC ||
-		   type == PCI_EXP_TYPE_DOWNSTREAM ||
-		   info->severity == AER_NONFATAL) {
-
+	} else {
 		/* Link is still healthy for IO reads */
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
 				      &info->status);
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK,
 				      &info->mask);
+		pci_info(dev, "%s: UNCOR STATUS %#010x MASK %#010x\n",
+			 __func__, info->status, info->mask);
 		if (!(info->status & ~info->mask))
 			return 0;
 
@@ -1471,6 +1477,8 @@ static void aer_isr_one_error(struct pci_dev *root,
 {
 	u32 status = e_src->status;
 
+	pci_info(root, "%s: ROOT_STATUS %#010x ROOT_ERR_SRC %#010x\n",
+		 __func__, e_src->status, e_src->id);
 	pci_rootport_aer_stats_incr(root, e_src);
 
 	/*