Hello,
The following is the original thread, where a bug was reported to the linux-wireless and ath10k mailing lists. The specific bug has been detailed clearly here.
https://lore.kernel.org/linux-wireless/690B1DB2-C9DC-4FAD-8063-4CED659B1701@...
There is also a Bugzilla report by me, which was opened later: https://bugzilla.kernel.org/show_bug.cgi?id=220264
As stated, it is highly encouraged to check out all the logs, especially the line of IRQ #16 in /proc/interrupts.
Here is where all the logs are: https://gist.github.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d832a16180 (these logs are taken from an Arch liveboot)
On my daily driver, I found these on my IRQ #16:
16: 173210 0 0 0 IR-IO-APIC 16-fasteoi i2c_designware.0, idma64.0, i801_smbus
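To see at a glance which drivers share an interrupt line, the handler names at the end of the matching /proc/interrupts row can be pulled out. The following is a minimal sketch; the function name `irq_sharers` and its optional file argument are illustrative assumptions, not an existing tool.

```shell
# Print the handlers registered on one IRQ line.
# $1: IRQ number; $2: file in /proc/interrupts format (defaults to the live file).
irq_sharers() {
    awk -v irq="$1:" '
        $1 == irq {
            # Per-CPU counters are purely numeric; skip them, then skip the
            # interrupt controller name and the pin/trigger field.
            i = 2
            while (i <= NF && $i ~ /^[0-9]+$/) i++
            i += 2
            out = ""
            for (; i <= NF; i++) out = out (out == "" ? "" : " ") $i
            print out
        }
    ' "${2:-/proc/interrupts}"
}
```

On a running system, `irq_sharers 16` reads /proc/interrupts directly. For the row quoted above it prints `i2c_designware.0, idma64.0, i801_smbus`.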
The fixes stated on the Reddit post for this Wi-Fi card didn't quite work. (But git-cloning the firmware files did give me some more time to have stable internet)
This time, I had to go for the GRUB kernel parameters.
Right now, I'm using "irqpoll" to curb the errors caused. "intel_iommu=off" did not work, and the Wi-Fi was constantly crashing even then. Did not try out "pci=noaer" this time.
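Kernel parameters only take effect after a reboot, so it helps to confirm that the running kernel actually booted with them. A minimal sketch follows; the helper name `has_kparam` is hypothetical. On Fedora, `sudo grubby --update-kernel=ALL --args="irqpoll"` is one way to add the parameter persistently, which is an assumption about the setup, since the message only says GRUB was used.

```shell
# Succeed if a kernel parameter is present on the kernel command line.
# $1: parameter, e.g. irqpoll or pci=noaer
# $2: file to read (defaults to /proc/cmdline)
has_kparam() {
    # Put one parameter per line, then require an exact whole-line match.
    tr -s ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

For example, `has_kparam irqpoll && echo "irqpoll is active"` checks the current boot.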
If it's of any concern, there is a very weird error in Chromium-based browsers which has only happened after I started using irqpoll. When I Google something, the background of the individual result boxes shows as pure black, while the surrounding space is the usual greyish-blackish, like we see in Dark Mode. Here is a picture of the exact thing I'm experiencing: https://files.catbox.moe/mjew6g.png
If you notice anything in my logs/bug reports, please let me know. (Because it seems like Wi-Fi errors are just a red herring, there are some ACPI or PCIe-related errors in the computers of this model - just a naive speculation, though.)
Thanking you, Bandhan Pramanik
[+cc Jeff, ath10k maintainer]
On Thu, Jun 26, 2025 at 12:47:49AM +0530, Bandhan Pramanik wrote:
Hello,
The following is the original thread, where a bug was reported to the linux-wireless and ath10k mailing lists. The specific bug has been detailed clearly here.
https://lore.kernel.org/linux-wireless/690B1DB2-C9DC-4FAD-8063-4CED659B1701@...
There is also a Bugzilla report by me, which was opened later: https://bugzilla.kernel.org/show_bug.cgi?id=220264
As stated, it is highly encouraged to check out all the logs, especially the line of IRQ #16 in /proc/interrupts.
Here is where all the logs are: https://gist.github.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d832a16180 (these logs are taken from an Arch liveboot)
On my daily driver, I found these on my IRQ #16:
16: 173210 0 0 0 IR-IO-APIC 16-fasteoi i2c_designware.0, idma64.0, i801_smbus
The fixes stated on the Reddit post for this Wi-Fi card didn't quite work. (But git-cloning the firmware files did give me some more time to have stable internet)
This time, I had to go for the GRUB kernel parameters.
Right now, I'm using "irqpoll" to curb the errors caused. "intel_iommu=off" did not work, and the Wi-Fi was constantly crashing even then. Did not try out "pci=noaer" this time.
If it's of any concern, there is a very weird error in Chromium-based browsers which has only happened after I started using irqpoll. When I Google something, the background of the individual result boxes shows as pure black, while the surrounding space is the usual greyish-blackish, like we see in Dark Mode. Here is a picture of the exact thing I'm experiencing: https://files.catbox.moe/mjew6g.png
If you notice anything in my logs/bug reports, please let me know. (Because it seems like Wi-Fi errors are just a red herring, there are some ACPI or PCIe-related errors in the computers of this model - just a naive speculation, though.)
Your dmesg log is incomplete, and we would need to see the entire thing. It should start with something like this:
Linux version 6.8.0-60-generic (buildd@lcy02-amd64-054) (x86_64-linux-gnu-gcc-13 (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0, GNU ld (GNU Binutils for Ubuntu) 2.42) #63-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 15 19:04:15 UTC 2025 (Ubuntu 6.8.0-60.63-generic 6.8.12)
Your lspci output doesn't include the necessary PCI details; collect it with "sudo lspci -vv".
We should pick the most serious problem and focus on that instead of trying to solve everything at once.
It sounds like the ath10k issue might be the biggest problem? If "options ath10k_core skip_otp=y" is a workaround for this problem, it looks like some ath10k firmware thing, probably unrelated to the PCI core.
Bjorn
Please ignore the last email (I haven't replied to everyone). Also, here's the actual updated dmesg (the previous one was the old one): https://gist.github.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d832a16180/raw...
On Thu, Jun 26, 2025 at 4:16 AM Bandhan Pramanik bandhanpramanik06.foss@gmail.com wrote:
Hello Bjorn,
First of all, thanks a LOT for replying.
I have included the files in my previous GitHub Gist. Sharing the raw files for easier analysis.
lspci -vv: https://gist.github.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d832a16180/raw...
dmesg: https://gist.github.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d832a16180/raw...
On a different note, I had to use pci=noaer, so that the ring buffer wouldn't get cleared that fast.
Regarding the ath10k thing, none of the fixes worked this time. Only irqpoll worked. I don't know if it's because of a disparity b/w GNOME and KDE (because my daily driver is Fedora 42), but I'm 300% sure that it's not just the Wi-Fi that's the issue here. It's most probably a lot of issues here, and the harder issues to fix are usually the ones closer to the hardware.
Anyway, if you get something, please let me know.
Bandhan
Hello everyone,
I think I found it. I used irqpoll and I didn't experience any hiccups with my mouse performance. But the Wi-Fi was still malfunctioning.
To linux-pci and linux-acpi:
It's an ath10k problem, sure, but there's something definitely problematic happening if, in the normal state, these Wi-Fi bugs hamper the touchpad movement.
To ath10k and linux-wireless:
I tried out "options ath10k_core rawmode=0" along with "skip_otp=y", and the Wi-Fi seems to work perfectly as of now. It might be the fix, or it might not. But I think there's something more important to ask: are there any good resources or documentation on what the different key-value pairs mean? For example, what documentation do people use to arrive at "rawmode=0" or "skip_otp=y"?
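On the question of where such key-value pairs come from: `modinfo -p ath10k_core` prints the parameters a module accepts with a one-line description of each, and the values in effect for a loaded module appear as files under /sys/module/<module>/parameters/. A minimal sketch for dumping the current values follows; the function name `show_module_params` and its second argument are hypothetical.

```shell
# Print "name=value" for every parameter a loaded module exposes in sysfs.
# $1: module name; $2: sysfs module root (defaults to /sys/module)
show_module_params() {
    dir="${2:-/sys/module}/$1/parameters"
    [ -d "$dir" ] || { echo "no parameters directory for $1" >&2; return 1; }
    for f in "$dir"/*; do
        # Some parameters are not readable; skip those.
        [ -r "$f" ] && printf '%s=%s\n' "${f##*/}" "$(cat "$f")"
    done
    return 0
}
```

Running `show_module_params ath10k_core` after loading the driver shows whether an "options" line in /etc/modprobe.d/ was actually picked up.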
Bandhan
Just a small update: it's not the fix. Back to square 1.
Hi everyone,
Here after a week. I did my research.
I talked to some folks on IRC and the glaring issue was basically this:
[ 1146.810055] pcieport 0000:00:1c.0: AER: Uncorrectable (Fatal) error message received from 0000:01:00.0
This basically means that the root port (that 1c thing written with colons) of PCIe is the main problem here.
One particular note: this issue can be reproduced on the models of this same laptop. Therefore, this happens in most if not all of the laptops of the same model.
For starters, the root port basically manages the communication between the CPU and the device. Now, this root port itself is reporting fatal errors.
This is not a Wi-Fi error, but something deeper.
Any tips on what to do?
Bandhan
On Sat, Jul 05, 2025 at 01:00:23AM +0530, Bandhan Pramanik wrote:
Hi everyone,
Here after a week. I did my research.
I talked to some folks on IRC and the glaring issue was basically this:
[ 1146.810055] pcieport 0000:00:1c.0: AER: Uncorrectable (Fatal) error message received from 0000:01:00.0
Where is the complete dmesg log from which this is extracted?
This basically means that the root port (that 1c thing written with colons) of PCIe is the main problem here.
One particular note: this issue can be reproduced on the models of this same laptop. Therefore, this happens in most if not all of the laptops of the same model.
For starters, the root port basically manages the communication between the CPU and the device. Now, this root port itself is reporting fatal errors.
This is not a Wi-Fi error, but something deeper.
Devices that support AER have extra log registers to capture details about an error. A device that detects an error sends a PCIe Error Message upstream to a Root Port. The Root Port generates an interrupt, which is handled by the aer driver. In this case, the 01:00.0 device detected an error and sent an ERR_FATAL message upstream, and the 00:1c.0 Root Port received it and generated an interrupt. The ERR_FATAL message doesn't contain any details about the error itself, so the aer driver looks for the AER registers in the 01:00.0 device and logs those details to the dmesg log. Normally there would be a few lines after the one you quoted that would include those details.
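The reporting Root Port and the device that sent the error message can be pulled out of such log lines mechanically. A minimal sketch, assuming the message format quoted in this thread; the function name `aer_sources` is hypothetical.

```shell
# Read kernel log text on stdin and print "<root port> <source device>" for
# each "error message received from" line emitted by the AER driver.
aer_sources() {
    sed -n 's/.*pcieport \([0-9a-f:.]*\): AER: .* error message received from \([0-9a-f:.]*\).*/\1 \2/p'
}
```

For example, `dmesg | aer_sources | sort | uniq -c` counts how often each device reported an error through each Root Port.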
Bjorn
Hello,
The dmesg log (the older one) is present here: https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d83...
The newer dmesg log includes the first line and is not overwritten by the ring buffer (used pci=noaer in this case): https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d83... (The newer one doesn't have the error recorded).
Please check the older dmesg: the quoted line was taken from it verbatim, and it includes the additional details.
Bandhan
On Sat, Jul 05, 2025 at 08:30:46PM +0530, Bandhan Pramanik wrote:
Hello,
The dmesg log (the older one) is present here:
[1]:
https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d83...
The newer dmesg log includes the first line and is not overwritten by the ring buffer (used pci=noaer in this case): https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d83... (The newer one doesn't have the error recorded).
You should check out the older dmesg, the quoted line was taken from there verbatim, including any additional details.
Bandhan
On Sat, Jul 5, 2025 at 7:20 PM Bjorn Helgaas helgaas@kernel.org wrote:
On Sat, Jul 05, 2025 at 01:00:23AM +0530, Bandhan Pramanik wrote:
Hi everyone,
Here after a week. I did my research.
I talked to some folks on IRC and the glaring issue was basically this:
[ 1146.810055] pcieport 0000:00:1c.0: AER: Uncorrectable (Fatal) error message received from 0000:01:00.0
From [1]:
[ 1146.810055] pcieport 0000:00:1c.0: AER: Uncorrectable (Fatal) error message received from 0000:01:00.0
[ 1146.810069] ath10k_pci 0000:01:00.0: AER: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
[ 1146.813130] ath10k_pci 0000:01:00.0: AER: can't recover (no error_detected callback)
[ 1146.948066] pcieport 0000:00:1c.0: AER: Root Port link has been reset (0)
[ 1146.948112] pcieport 0000:00:1c.0: AER: device recovery failed
[ 1146.949480] ath10k_pci 0000:01:00.0: failed to wake target for read32 at 0x0003a028: -110
I think Linux is not doing a very good job of extracting error information. I think is_error_source() read PCI_ERR_UNCOR_STATUS from 01:00.0 and saw an error logged, but aer_get_device_error_info() declined to read PCI_ERR_UNCOR_STATUS again because we thought the link was unusable, so aer_print_error() didn't have any info to print, hence the "Inaccessible" message.
Are you able to rebuild a kernel with the patch below? This is based on v6.16-rc1 and likely wouldn't apply cleanly to your v6.14 kernel. But if you are able to build v6.16-rc1 with this patch, or adapt it to v6.14, I'd be interested in the output.
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 70ac66188367..99acb1e1946e 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -990,6 +990,8 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
 	if ((PCI_BUS_NUM(e_info->id) != 0) &&
 	    !(dev->bus->bus_flags & PCI_BUS_FLAGS_NO_AERSID)) {
 		/* Device ID match? */
+		pci_info(dev, "%s: bus_flags %#x e_info->id %#04x\n",
+			 __func__, dev->bus->bus_flags, e_info->id);
 		if (e_info->id == pci_dev_id(dev))
 			return true;

@@ -1025,6 +1027,10 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, &status);
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
 	}
+	pci_info(dev, "%s: %s STATUS %#010x MASK %#010x\n",
+		 __func__,
+		 e_info->severity == AER_CORRECTABLE ? "COR" : "UNCOR",
+		 status, mask);
 	if (status & ~mask)
 		return true;

@@ -1368,6 +1374,8 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
 	aer = dev->aer_cap;
 	type = pci_pcie_type(dev);

+	pci_info(dev, "%s: type %#x cap %#04x\n", __func__, type, aer);
+
 	/* Must reset in this function */
 	info->status = 0;
 	info->tlp_header_valid = 0;
@@ -1383,16 +1391,14 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
 			&info->mask);
 		if (!(info->status & ~info->mask))
 			return 0;
-	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
-		   type == PCI_EXP_TYPE_RC_EC ||
-		   type == PCI_EXP_TYPE_DOWNSTREAM ||
-		   info->severity == AER_NONFATAL) {
-
+	} else {
 		/* Link is still healthy for IO reads */
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
 			&info->status);
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK,
 			&info->mask);
+		pci_info(dev, "%s: UNCOR STATUS %#010x MASK %#010x\n",
+			 __func__, info->status, info->mask);
 		if (!(info->status & ~info->mask))
 			return 0;

@@ -1471,6 +1477,8 @@ static void aer_isr_one_error(struct pci_dev *root,
 {
 	u32 status = e_src->status;

+	pci_info(root, "%s: ROOT_STATUS %#010x ROOT_ERR_SRC %#010x\n",
+		 __func__, e_src->status, e_src->id);
 	pci_rootport_aer_stats_incr(root, e_src);

 	/*
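The test that the patch instruments, `status & ~mask`, treats a device as an error source only when a status bit is set and not masked. To apply the same test by hand to a STATUS/MASK pair from the added pci_info() output, a small helper is enough. This is a sketch only; the function name `unmasked_errors` is hypothetical and any values fed to it are whatever the log shows.

```shell
# Print the error bits that are set in STATUS and not masked by MASK.
# $1: STATUS register value; $2: MASK register value (both 0x-prefixed hex)
unmasked_errors() {
    # Keep the result within the 32-bit register width.
    printf '0x%08x\n' $(( $1 & ~$2 & 0xffffffff ))
}
```

A result of 0x00000000 means every logged bit is masked, which is the case in which the kernel code above does not treat the device as the error source.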
Hi Bjorn,
I have downloaded 6.16-rc4, and I have a bootable pendrive having the arch iso, but I really don't know how to rebuild the kernel on a bootable drive.
Any tips on how to do that?
Bandhan
On Mon, Jul 07, 2025 at 04:31:22AM GMT, Bandhan Pramanik wrote:
Hi Bjorn,
I have downloaded 6.16-rc4, and I have a bootable pendrive having the arch iso, but I really don't know how to rebuild the kernel on a bootable drive.
You don't need to reinstall Arch to install a custom kernel. Refer to the Arch Linux wiki for how to build and install a custom kernel from source:
https://wiki.archlinux.org/title/Kernel/Arch_build_system
https://wiki.archlinux.org/title/Kernel/Traditional_compilation
- Mani
Hello,
I was actually a bit distracted by the things caused by the Automatic Partitioning of Fedora. I'll inform that in Fedora Bugzilla... anyway.
I realised that building the modules will take 8-9 hours, and I didn't have much success even then (not all the modules loaded properly; in particular, the firmware-N.bin files couldn't be found).
But I'll try to recompile the kernel, I'll just have to give it overnight time.
Bandhan
Ok, we did it. Could reproduce the errors properly.
Here are the journalctl logs:
Kernel level: https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d83...
User level: https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d83...
Just so you know, I have used v6.16-rc4.
Bandhan.
On Fri, Jul 11, 2025 at 12:36:12AM +0530, Bandhan Pramanik wrote:
Ok, we did it. Could reproduce the errors properly.
Here are the journalctl logs:
Kernel level: https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d83...
User level: https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d83...
Thanks. These logs look like the kernel doesn't include the patch I sent at https://lore.kernel.org/r/20250705195846.GA2011829@bhelgaas
Can you please try with that patch?
Just so you know, I have used v6.16-rc4.
Bandhan.
Hello,
I really couldn't find on the internet how to compile a single file now that I have compiled the whole kernel.
Any ways to do that?
On Fri, Jul 11, 2025 at 09:34:43PM +0530, Bandhan Pramanik wrote:
Hello,
I really couldn't find on the internet how to compile a single file now that I have compiled the whole kernel.
Any ways to do that?
If you apply the patch (cd to the linux/ directory, then "patch -p1 < email-file"), then run whatever "make" command you used before, it should rebuild that file and relink the whole kernel.
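Spelled out as commands, those steps might look like the sketch below. It assumes the kernel was built and installed with plain `make`, `make modules_install`, and `make install`; the function name `rebuild_with_patch` and the DRY_RUN and JOBS variables are illustrative, and the actual "make" invocation should be whatever was used for the first build.

```shell
# Apply a patch and rebuild incrementally. Run from the top of the kernel tree.
# $1: patch file (for example, the saved email)
# DRY_RUN=1 prints the commands instead of running them.
# JOBS sets the number of parallel make jobs (defaults to the CPU count).
rebuild_with_patch() {
    run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }
    run patch -p1 -i "$1" &&
    run make -j"${JOBS:-$(nproc)}" &&
    # The two install steps normally need root privileges.
    run make modules_install &&
    run make install
}
```

Because make tracks dependencies, only drivers/pci/pcie/aer.c and whatever depends on it are recompiled before the kernel is relinked, so the second build should be much faster than the first.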
Bjorn
Compiled the usual way: the bzImage compiled within 4-5 minutes (compared to 1 hour previously), and the modules compiled within 1 hour (compared to 8 hours previously). Also, the congestion strangely didn't happen. It was instead silently followed by "No Internet".
Didn't add the kernel-level journalctl because I'm sure that the normal journalctl includes the kernel-level stuff too: https://gist.githubusercontent.com/BandhanPramanik/ddb0cb23eca03ca2ea43a1d83...
Please let me know what you think of the logs.
Bandhan
I saw problems with Atheros on my Dell Inspiron, too.
These instructions helped me to reset the device without reboot:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1730331/comments/40
I used modified script based on the one above (run as root):
set -e
rmmod ath10k_pci 2> /dev/null || :
rmmod ath10k_core 2> /dev/null || :
rmmod ath 2> /dev/null || :
{ echo 1 > /sys/bus/pci/devices/0000:03:00.0/remove; } 2> /dev/null || :
sleep 2
echo 1 > /sys/bus/pci/rescan
Try both scripts, one of them should work.
If it still doesn't work, try running the original script and then hibernating. If that doesn't work either, run the script again.
I finally was able to solve my problem by replacing Wi-Fi adapter. :) Here is my new Wi-Fi adapter:
[ 7.136347] iwlwifi 0000:03:00.0: Detected Intel(R) Dual Band Wireless AC 3160, REV=0x164
-- Askar Safin
Hello Askar,
I appreciate your response. However, we're mainly trying to find out exactly 'why' this problem occurs. You might say that it's some kind of "Root Cause Analysis," so that this error goes away from the Inspiron laptops once and for all. If you keep on reading the messages in the other thread, you'll realise just how deep this error goes.
But still, thanks a lot for the response.
Bandhan