Hi there, hello,
Sometimes when I suspend (by closing the lid or, less often, by pressing Fn+F1, the sleep key combo) or power off my laptop (both by pressing the power button and by running "loginctl poweroff"), it enters a state where it doesn't respond to opening or closing the lid, to the power button, or to Ctrl+Alt+Del but, unlike in sleep mode, the fan keeps spinning and the "awake status" LED stays on. I checked /var/log/kern.log, but it didn't report a suspend at that moment at all: it went straight from [UFW BLOCK] to "microcode updated" on the forced reboot (marked with an arrow):
Apr 13 10:40:32 bong kernel: asus_wmi: Unknown key code 0xcf
Apr 13 10:44:05 bong kernel: [UFW BLOCK] IN=wlan0 OUT= MAC=/*confidential*/
Apr 13 10:47:45 bong kernel: [UFW BLOCK] IN=wlan0 OUT= MAC=/*confidential*/
Apr 13 10:47:46 bong kernel: ICMPv6: NA: /*router*/ advertised our address /*ipv6*/ on wlan0!
Apr 13 10:47:48 bong last message buffered 2 times
-> Apr 13 10:49:11 bong kernel: [UFW BLOCK] IN=wlan0 OUT= MAC=/*confidential*/
Apr 13 10:52:34 bong kernel: microcode: microcode updated early to revision 0xf0, date = 2021-11-12
Apr 13 10:52:34 bong kernel: Linux version 6.1.23-bong+ (acid@bong) (gcc (Gentoo Hardened 12.2.1_p20230121-r1 p10) 12.2.1 20230121, GNU ld (Gentoo 2.39 p5) 2.39.0) #1 SMP PREEMPT_DYNAMIC Tue Apr 11 15:21:57 EEST 2023
Apr 13 10:52:34 bong kernel: Command line: root=/dev/genston/root ro loglevel=4 rd.lvm.vg=genston rd.luks.uuid=97d10669-2da1-452d-a372-887e420b2ad4 rd.luks.allow-discards pci=nomsi initrd=\x5cinitramfs-6.1.23-bong+.img
Apr 13 10:52:34 bong kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Apr 13 10:52:34 bong kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Apr 13 10:52:34 bong kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Normally it starts like this (taken from dmesg to sync with elogind messages):
[ 7835.869228] elogind-daemon[2033]: Lid closed.
[ 7835.872875] elogind-daemon[2033]: Suspending...
[ 7835.873955] elogind-daemon[2033]: Suspending system...
[ 7835.873970] PM: suspend entry (deep)
[ 7835.902814] Filesystems sync: 0.028 seconds
[ 7835.920362] Freezing user space processes
[ 7835.923030] Freezing user space processes completed (elapsed 0.002 seconds)
[ 7835.923046] OOM killer disabled.
[ 7835.923049] Freezing remaining freezable tasks
[ 7835.924445] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[ 7835.924624] printk: Suspending console(s) (use no_console_suspend to debug)
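For anyone who wants to check a saved log for the same pattern, here is a minimal sketch. It assumes the kernel prints "PM: suspend entry" when a suspend starts and "PM: suspend exit" on resume, and it treats a "Linux version" line as the start of a new boot. The function name is made up for illustration. An empty result on a log that covers a hang would match what is described above: the kernel never logged the suspend at all, or the line never reached the disk.

```python
import re

SUSPEND_ENTRY = re.compile(r"PM: suspend entry")
SUSPEND_EXIT = re.compile(r"PM: suspend exit")
BOOT_MARKER = re.compile(r"Linux version ")


def find_unfinished_suspends(lines):
    """Return suspend-entry lines that were followed by a fresh boot
    without a matching suspend-exit line in between."""
    unfinished = []
    pending = None  # last suspend entry that has not been closed yet
    for line in lines:
        if SUSPEND_ENTRY.search(line):
            pending = line.rstrip("\n")
        elif SUSPEND_EXIT.search(line):
            pending = None
        elif BOOT_MARKER.search(line) and pending is not None:
            # A reboot happened while a suspend was still in progress.
            unfinished.append(pending)
            pending = None
    return unfinished


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_suspend.py kern.log
    with open(sys.argv[1], errors="replace") as log:
        for entry in find_unfinished_suspends(log):
            print(entry)
```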
The issue appeared when I was using pf-kernel with genpatches and updated from 6.1-pf2 to 6.1-pf3 (corresponding to vanilla versions 6.1.3 -> 6.1.6). I used that fork until 6.2-pf2, but since then (early March) moved to vanilla sources and started following the 6.1.y branch when it was declared LTS. And the issue was present on all of them.
The hang was last detected 3 days ago on 6.1.22 and today on 6.1.23.
I'd like to bisect it, but it could take ages for a couple of reasons:
1) I don't know the exact pattern it follows. One of the scenarios I've noticed was this one (sorry if it's too ridiculous):
   - put the laptop on the nearby couch and simultaneously close the lid; the loose charger jack might disconnect;
   - lay the mouse upside down (so it doesn't wake the laptop up when I reconnect the charger),
   but this doesn't trigger the bug 100% of the time and, as I said earlier, the laptop also misbehaves on shutdown.
2) The issue happens rarely, once every few days (sometimes up to a week apart; I didn't measure it precisely back then).
Hardware: https://tilde.cafe/u/acidbong/kernel/lspci (`lspci -vvnn`)
Config (latest vanilla): https://git.sr.ht/~acid-bong/kernel/tree/806e6639da610952798e1b5d8c0d700062f...
Built with KCFLAGS="-march=native"
Isolated cmdline: root=/dev/genston/root ro loglevel=4 rd.lvm.vg=genston rd.luks.uuid=97d10669-2da1-452d-a372-887e420b2ad4 rd.luks.allow-discards pci=nomsi initrd=\initramfs-6.1.23-bong+.img
# regzbot introduced v6.1.3..v6.1.6
--- Regards, ~acidbong
On 4/14/23 02:35, Acid Bong wrote:
> The issue appeared when I was using pf-kernel with genpatches and updated from 6.1-pf2 to 6.1-pf3 (corresponding to vanilla versions 6.1.3 -> 6.1.6). I used that fork until 6.2-pf2, but since then (early March) moved to vanilla sources and started following the 6.1.y branch when it was declared LTS. And the issue was present on all of them.
> The hang was last detected 3 days ago on 6.1.22 and today on 6.1.23.
Have you tried testing latest mainline to see if commits which are backported to 6.1.y cause your regression?
> # regzbot introduced v6.1.3..v6.1.6
Anyway, I'm adding this to regzbot:
#regzbot ^introduced v6.1.3..v6.1.6
#regzbot title Asus X541UAK hangs on suspend and poweroff
#regzbot ignore-activity
Thanks.
On 14.04.23 09:51, Bagas Sanjaya wrote:
> On 4/14/23 02:35, Acid Bong wrote:
>> The issue appeared when I was using pf-kernel with genpatches and updated from 6.1-pf2 to 6.1-pf3 (corresponding to vanilla versions 6.1.3 -> 6.1.6). I used that fork until 6.2-pf2, but since then (early March) moved to vanilla sources and started following the 6.1.y branch when it was declared LTS. And the issue was present on all of them.
>> The hang was last detected 3 days ago on 6.1.22 and today on 6.1.23.
> Have you tried testing latest mainline to see if commits which are backported to 6.1.y cause your regression?
Well, if it is something that started between v6.1.3 and v6.1.6, it must be a commit backported from mainline that causes the regression.
But yeah, testing mainline would be wise to differentiate between "this is something that is caused by a change in mainline" and "this is something stable specific and might be caused by a bad or incomplete backport".
It's not totally clear to me, but it seems 6.2 is affected as well? Well, then it's a mainline issue. Testing latest mainline nevertheless would be good to know if this maybe was fixed already.
But first something else: acidbong, why do you pass "pci=nomsi" to your kernel? Maybe that makes your machine run in an unusual configuration that directly or indirectly leads to your problem (which only worked by chance earlier).
>> # regzbot introduced v6.1.3..v6.1.6
> Anyway, I'm adding this to regzbot:
Well, the quoted string above already did that. But whatever, a...
> #regzbot ^introduced v6.1.3..v6.1.6
...should do no harm and this...
> #regzbot title Asus X541UAK hangs on suspend and poweroff
... has improved the title (which was derived from the subject beforehand) somewhat. :-D
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
Thorsten:
> why do you pass pci=nomsi
It's a workaround for another issue I've been facing for about 2 or 3 years, ever since I first tried out Linux (I started by booting Kubuntu and Mint live images). Without that workaround Kubuntu didn't boot for me: on kernel 5.8 it only reached the graphical installer and hung after the language selection menu; on 5.4 and 5.11 it didn't even reach the graphical session. With Mint it was more severe: the screen was flooded with PCIe errors, like so:
Apr 10 18:47:08 bong last message buffered 3 times
Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask000001/00002000
Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Apr 10 18:47:08 bong last message buffered 5 times
Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask000001/00002000
Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Apr 10 18:47:08 bong last message buffered 13 times
Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask000001/00002000
Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Apr 10 18:47:08 bong last message buffered 5 times
Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask000001/00002000
Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Apr 10 18:47:08 bong last message buffered 6 times
`pci=nomsi` also saved me during Debian installation: without it the live ISO just crashed mid-installation.
But it wasn't a complete cure for Debian- and Ubuntu-based distros: they still crashed even with this parameter (I don't know how exactly, but at least they didn't flood the screen with PCIe bus errors).
Since I moved to Manjaro, Void, Arch and now Gentoo (which bases its config on the Fedora one), PCIe errors have been my only trouble, and they are easily mitigated with `pci=nomsi`. Recently I discovered that without it one of the kernel threads (irq/124-aerdrv) causes high CPU load, so the parameter is doubly useful.
https://forums.linuxmint.com/viewtopic.php?p=2237628 I just googled and found a guy with a very similar model to mine (UVK instead of UAK) and the same issues, but `noaer` and `nomsi` work identically for me (I found `nomsi` in a different thread).
Since I've been building my own kernel for the last 3 months, I've disabled MSI in the kernel config, and with it a big part of the IOMMU support as well: https://git.sr.ht/~acid-bong/kernel/commit/cac5c09dec0bea919ca071a9b738108b0... but I did it _after_ I first experienced the issue I described at the head of the thread, hoping that it would save me from these hangs as well. It didn't.
I'm keeping it in the bootloader config for the cases when I boot a prebuilt Gentoo kernel, and I add it every time I boot an Arch or Void live USB for rescue purposes. It's not a constant issue though; it happens every other time.
---
Bagas:
> Have you tried testing latest mainline?
Just built it and will boot it in a moment. But we'll have to wait for a couple of days, since the hangs are unpredictable.
For the readers: here's a copy of the letter as it should've looked (it looks normal in the regressions archive, but wasn't parsed correctly in the stable and linux-acpi lore archives): https://tilde.cafe/u/acidbong/kernel/pci-nomsi.txt
On 4/14/23 16:07, Acid Bong wrote:
> Thorsten:
>> why do you pass pci=nomsi
> It's a workaround for another issue i've been facing for about 2 or 3 years, since when I first tried out Linux (started with loading Kubuntu and Mint live images). Without that workaround Kubuntu didn't boot for me - on kernel 5.8 it only reached the graphic installer part, but hung after language selection menu, on 5.4 and 5.11 - didn't even reach the graphic session. With Mint it was more severe - the screen was flooded with PCIe errors, like so:
> Apr 10 18:47:08 bong last message buffered 3 times
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask
Hardware issue? Or is this another kernel issue? If it is the latter, file a separate report (see Documentation/admin-guide/reporting-issues.rst for how to report kernel issues).
On 14.04.23 11:07, Acid Bong wrote:
> Thorsten:
>> why do you pass pci=nomsi
> It's a workaround for another issue i've been facing for about 2 or 3 years, since when I first tried out Linux (started with loading Kubuntu and Mint live images). Without that workaround Kubuntu didn't boot for me - on kernel 5.8 it only reached the graphic installer part, but hung after language selection menu, on 5.4 and 5.11 - didn't even reach the graphic session. With Mint it was more severe - the screen was flooded with PCIe errors, like so:
> Apr 10 18:47:08 bong last message buffered 3 times
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask000001/00002000
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Apr 10 18:47:08 bong last message buffered 5 times
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask000001/00002000
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Apr 10 18:47:08 bong last message buffered 13 times
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask000001/00002000
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Apr 10 18:47:08 bong last message buffered 5 times
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask000001/00002000
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> Apr 10 18:47:08 bong last message buffered 6 times
> `pci=nomsi` saved me also during Debian installation - without it the live ISO just crashed mid-installation.
> But it wasn't a complete cure for Debian- and Ubuntu-based distros, and they still crashed even with this parameter (I don't know how exactly, at least they didn't flood with PCIe bus errors).
> Since I moved to Manjaro, Void, Arch and now to Gentoo (which bases its config on the Fedora one), PCIe errors were my only trouble, which was easily mitigated with `pci=nomsi`. Recently I discovered that without it one of the kernel modules (irq/124-aerdrv) had high CPU load, so double useful.
> https://forums.linuxmint.com/viewtopic.php?p=2237628 Just googled and there's a guy with a very similar model as me (UVK instead of UAK) and same issues, but `noaer` and `nomsi` work identically for me (I found `nomsi` in a different thread).
> Since I'm building my own kernel for the last 3 months, I've disabled the MSI in kernel config - and with that, a big part of IOMMU part as well: https://git.sr.ht/~acid-bong/kernel/commit/cac5c09dec0bea919ca071a9b738108b0... but I did it _after_ I first experienced the issue I described in the thread head, hoping that it'll save me from these hangs as well. It didn't.
> I'm keeping it in the bootloader config for cases when I boot with a prebuilt Gentoo kernel, and add every time I'm booting with Arch or Void live USB for rescue purposes. It's not a constant issue tho, happens every other time.
>> Bagas: Have you tried testing latest mainline?
> Just built and will boot in a moment. But we'll gotta wait for a couple of days, since the hanging is unexpected.
This is not my area of expertise, but the pre-existing hardware config trouble the kernel apparently has makes this a problematic case, as whatever causes those problems might directly or indirectly cause the regression you see by chance -- and might be something that only happens on your machine.
Maybe we are lucky and some developer of the affected kernel code areas will see your report and have an idea what might cause the regression. But I'd say chances are slim. So unless we are lucky, we likely won't get any closer to a solution without a bisection.
But I wouldn't take that path; instead, in your place I would report and sort out the hardware config trouble, as the problem might vanish by solving that.
But as I said, this is not my area of expertise, so maybe it's bad advice.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
[+cc linux-pci; thread at https://lore.kernel.org/r/CRVU11I7JJWF.367PSO4YAQQEI@bong]
On Fri, Apr 14, 2023 at 12:07:42PM +0300, Acid Bong wrote:
>> Thorsten: why do you pass pci=nomsi
> It's a workaround for another issue i've been facing for about 2 or 3 years, since when I first tried out Linux (started with loading Kubuntu and Mint live images). Without that workaround Kubuntu didn't boot for me - on kernel 5.8 it only reached the graphic installer part, but hung after language selection menu, on 5.4 and 5.11 - didn't even reach the graphic session. With Mint it was more severe - the screen was flooded with PCIe errors, like so:
> Apr 10 18:47:08 bong last message buffered 3 times
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask
> Apr 10 18:47:08 bong last message buffered 5 times
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask
> Apr 10 18:47:08 bong last message buffered 13 times
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask
> Apr 10 18:47:08 bong last message buffered 5 times
> Apr 10 18:47:08 bong kernel: pcieport 0000:00:1c.5: device [8086:9d15] error status/mask
> Apr 10 18:47:08 bong last message buffered 6 times
> `pci=nomsi` saved me also during Debian installation - without it the live ISO just crashed mid-installation.
Likely "pci=nomsi" or "pci=noaer" are not related to the suspend/poweroff issue, but I'd really like to fix the AER problem anyway.
Can you collect the complete dmesg log and output of "sudo lspci -vv" and post them somewhere (https://bugzilla.kernel.org is a good place)?
Ideally the dmesg would be from the most recent kernel you have.
Bjorn
> Can you collect the complete dmesg log and output of "sudo lspci -vv" and post them somewhere (https://bugzilla.kernel.org is a good place)?
`lspci -vvnn` output is linked at the head of the thread. Append .txt to make it readable in the browser (I only realized that after the upload).
> Ideally the dmesg would be from the most recent kernel you have.
Speaking of that, a couple of questions:
1) Should I post them with or without pci=nomsi/noaer? The problem with dropping the parameter is that the errors flood the logs so fast that they reach 700M in 5-7 minutes and, when rotation is enabled (my parameters are the defaults, up to 10 copies of 10M each), all pre-flood data is lost almost instantly.
Also, I'm currently bisecting the kernel with MSI disabled in the config. But I'm keeping the parameter in the bootloader for cases when I'm using Gentoo's prebuilt kernel.
2) Can I delete the messages from ufw? They contain the MACs of my router, laptop and cellphone, and I don't really want to share them.
3) I'm not savvy with logs; how exactly should I share dmesg? `dmesg > file`? /var/log/syslog? I already know kern.log doesn't contain logind and some other messages that are present in dmesg.
4) Should we continue in this thread or rather start a new one?
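To put a number on "lost instantly", here is a rough back-of-the-envelope sketch. It only uses the figures from question 1 (700M in 5-7 minutes, 10 rotated copies of 10M each) and assumes the flood rate is constant; the function name is made up for illustration.

```python
def seconds_until_rotated_out(flood_mb, flood_minutes, copies, copy_mb):
    """Time it takes the flood to overwrite everything the rotation keeps."""
    rate_mb_per_s = flood_mb / (flood_minutes * 60)
    retained_mb = copies * copy_mb
    return retained_mb / rate_mb_per_s


for minutes in (5, 7):
    secs = seconds_until_rotated_out(700, minutes, copies=10, copy_mb=10)
    print(f"700M in {minutes} min: pre-flood data gone after ~{round(secs)} s")
```

Under those assumptions the early boot messages survive for about a minute or less, so they would have to be captured right after boot.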
On Tue, May 16, 2023 at 01:26:23PM +0300, Acid Bong wrote:
>> Can you collect the complete dmesg log and output of "sudo lspci -vv" and post them somewhere (https://bugzilla.kernel.org is a good place)?
> `lspci -vvnn` output is linked in the head of the thread. Append .txt to make it readable in the browser (I only understood it after the upload).
>> Ideally the dmesg would be from the most recent kernel you have.
> Speaking of that, a couple of questions:
> - Should I post them with or without pci=nomsi/noaer? The problem with disabling it is that it floods the logs so fast, that they reach 700M in 5-7 minutes, and, when rotation is enabled (my parameters are default, up to 10 copies 10M each), all pre-flood data is lost instantly.
You're seeing AER logging, and that's what I'm interested in, so if you could do one quick boot *without* "pci=nomsi" and "pci=noaer", that would be great. Then turn it off again so you don't drown in logs.
The snippet from [1] shows a few messages related to 00:1c.5, and it would be useful to know if there are errors related to other devices as well.
Something like "head -c500K /var/log/dmesg > file" should be plenty.
> Also I'm currently bisecting the kernel with MSI disabled in the config. But I'm keeping the parameter in the bootloader for cases when I'm using Gentoo's prebuilt kernel.
> - Can I delete messages by ufw? They contain MACs of my router, laptop and cellphone and I don't really wanna share them
Sure, delete those.
> - I'm not savvy in logs, how exactly should I share dmesg? `dmesg > file`? /var/log/syslog? I already know kern.log doesn't contain logind and some other messages that are present in dmesg
> - Should we continue in this thread or rather start a new one?
Good point, a new thread would probably be better.
Bjorn
[1] https://lore.kernel.org/all/CRWCUOAB4JKZ.3EKQN1TFFMVQL@bong/
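As a possible way to prepare such a log for posting, here is a minimal sketch that either drops the ufw lines entirely (what was agreed above) or blanks out anything that looks like a MAC address. It assumes ufw lines carry a "[UFW " tag as in the excerpts earlier in the thread, and that MACs appear as runs of six or more colon-separated hex octets. The function names and the sample addresses in the usage note are made up for illustration.

```python
import re

# Six or more colon-separated hex octets, not embedded in a longer
# hex/colon run (so PCI addresses and timestamps are left alone).
MAC_RUN = re.compile(
    r"(?<![0-9A-Fa-f:])[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5,}(?![0-9A-Fa-f:])"
)


def drop_ufw_lines(lines):
    """Remove firewall log lines altogether."""
    return [line for line in lines if "[UFW " not in line]


def redact_macs(line, placeholder="xx:xx:xx:xx:xx:xx"):
    """Replace every MAC-looking run in a line with a placeholder."""
    return MAC_RUN.sub(placeholder, line)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names): python scrub.py dmesg.txt > dmesg-public.txt
    with open(sys.argv[1], errors="replace") as log:
        for line in drop_ufw_lines(log):
            sys.stdout.write(redact_macs(line))
```

For example, `redact_macs("MAC=aa:bb:cc:dd:ee:ff:11:22:33:44:55:66:08:00 SRC=10.0.0.1")` leaves only the placeholder after `MAC=`. It is still worth skimming the result by hand before uploading, since a regex cannot know what else you consider private.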
So, I followed your advice and used the mainline sources (6.3-rc6). I even compiled two versions: one with my config (cf. the head letter) and one with the Arch Linux config (I'm using Gentoo, but it still fits well), both updated with `olddefconfig`, just to make sure that the problem is independent of the config.
Good news: I experienced the hang 3 times across both kernels yesterday.
Two of them were on the custom kernel, and they were of the rare kind: they occurred on shutdown. It proceeds normally (init stops the services, unmounts the filesystems, turns off the screen), but then there's no response and the LED and the fan are still on. Another couple of shutdowns went normally, so the issue is still irregular.
One happened later on the Arch-based one, after a suspend.
/var/log/kern.log showed nothing specific in all cases.
Bad news: it seems the fix hasn't arrived yet.
How do I proceed next?
--
P.S. On the `pci=nomsi` case: I don't consider it related to the issue we're discussing. To me it seems like a hardware issue that can be bypassed by reconfiguration.
On 17.04.23 09:37, Acid Bong wrote:
> So, I followed your advice and used the sources (6.3-rc6). Compiled even two versions: with my config (cf. head letter) and the Arch Linux one (I'm using Gentoo, but it still fits well), both updated with `olddefconfig`. Just to make sure that the problem is independent from the config.
> Good news: I experienced the hanging 3 times with both kernels yesterday.
> Two of them were on the custom kernel, and they were of the rare kind - they occured on shutdown. It goes normally, init disables the services, unmounts the filesystems, turns off the screen, but then - no response and the LED and the fan are still on. Another couple of shutdowns went normal, so the issue it still irregular.
> One happened later on the Arch-based one and after a suspend.
> /var/log/kern.log showed nothing specific in all cases.
> Bad news: it seems, the fix hasn't arrived yet.
> How do I proceed next?
Ideally you should still try to bisect this to find the change that causes your problems.
But I'm CCing the ACPI and PCI maintainers nevertheless, now that it's clear that it happens in vanilla mainline, too. *If* you are lucky they have an idea what might be wrong and can point you in a direction to narrow the cause down. But if you are unlucky, they will have no idea and just ignore this until you bisect the problem.
FWIW, Rafael, Bjorn: the thread starts here: https://lore.kernel.org/all/CRVU11I7JJWF.367PSO4YAQQEI@bong/
To quote some parts of it:
```
Sometimes when I suspend (by closing the lid, less often - by pressing Fn+F1 (sleep key combo)) or poweroff my laptop (both by pressing powerit button and running "loginctl poweroff"), it goes in such a state when it doesn't respond to opening/closing the lid, power button nor Ctrl+Alt+Del, but, unlike in sleep mode, the fan is rotating and the "awake status" LED is on.
[...]
The issue appeared when I was using pf-kernel with genpatches and updated from 6.1-pf2 to 6.1-pf3 (corresponding to vanilla versions 6.1.3 -> 6.1.6). I used that fork until 6.2-pf2, but since then (early March) moved to vanilla sources and started following the 6.1.y branch when it was declared LTS. And the issue was present on all of them.
```
> P.S. On the `pci=nomsi` case: I don't consider it being related to the issue we're discussing. For me it seems like a hardware issue that can be bypassed by reconfiguration.
I wouldn't be so sure about that.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
On Fri, Apr 14, 2023 at 02:51:47PM +0700, Bagas Sanjaya wrote:
> On 4/14/23 02:35, Acid Bong wrote:
>> The issue appeared when I was using pf-kernel with genpatches and updated from 6.1-pf2 to 6.1-pf3 (corresponding to vanilla versions 6.1.3 -> 6.1.6). I used that fork until 6.2-pf2, but since then (early March) moved to vanilla sources and started following the 6.1.y branch when it was declared LTS. And the issue was present on all of them.
>> The hang was last detected 3 days ago on 6.1.22 and today on 6.1.23.
> Have you tried testing latest mainline to see if commits which are backported to 6.1.y cause your regression?
#regzbot poke
Acid Bong, have you successfully bisected to find the culprit commit? How about swapping the hardware? I'm poking because the thread has looked stale for a while.
Thanks.
Hi there, and thank you for the reminder.
Bisecting, unfortunately, takes a long time: I'm only trying out the 7th commit, 15e7433e1dc2 (the previous 6 were marked as bad). The bug, as noted in the head letter, doesn't follow any (strict) pattern and takes a random amount of time to show up: some kernels hung the day after compilation, one took 5 days. I'm not excluding the possibility that I got the versions wrong and the bug first occurred on the update from 6.1-pf1 to 6.1-pf2 (6.1 and 6.1.3; could be unrelated, but I saw a bunch of commits related to i915 and Skylake).
I also checked my package manager log: no programs related to kernel compilation (glibc, gcc, archivers and such) were updated before I moved to the problematic version, nor for about two weeks after the upgrade (the first occurrence happened soon after it).
What exactly do you mean by "swapping the hardware"? I'm already sure it's not related to my storage, because a month ago I replaced my faulty HDD with an SSD, but the bug still remained. Unfortunately, I don't have spare PCs or resources to purchase new hardware.
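Because the hang is intermittent, the hard part of the bisection is deciding when a kernel may be called "good". A rough sketch of the arithmetic follows. It assumes, purely for illustration, that hangs on a bad kernel arrive at random with a constant average interval (a Poisson model); the 4-day average and the commit count in the usage lines are hypothetical numbers, not measurements.

```python
import math


def confidence_after(days_without_hang, mean_days_between_hangs):
    """Chance that a bad kernel would have hung at least once by now."""
    return 1.0 - math.exp(-days_without_hang / mean_days_between_hangs)


def days_to_wait(mean_days_between_hangs, confidence):
    """Hang-free days needed before marking a kernel 'good' at this confidence."""
    return -mean_days_between_hangs * math.log(1.0 - confidence)


def bisect_steps(n_commits):
    """Approximate number of kernels to build and test for a range of commits."""
    return math.ceil(math.log2(n_commits)) if n_commits > 1 else 0


if __name__ == "__main__":
    mean = 4.0  # hypothetical average interval between hangs, in days
    print(f"confidence after 7 quiet days: {confidence_after(7, mean):.0%}")
    print(f"days needed for 95%: {days_to_wait(mean, 0.95):.1f}")
    print(f"steps for 1000 commits (hypothetical): {bisect_steps(1000)}")
```

With those assumed numbers, a week without a hang still leaves a noticeable chance that the kernel is bad, so a single premature "good" can send the bisection into the wrong half of the range.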
On Tue, May 02, 2023 at 12:02:26AM +0300, Acid Bong wrote:
> Hi there, and thank you for the reminder.
> Bisecting, unfortunately, takes a long time: I'm only trying out the 7th commit, 15e7433e1dc2 (previous 6 marked as bad). The bug, as noted in the head, doesn't have any (strict) patterns and takes randomly long times: some kernels hung on the next day after compilation, one took 5 days. I'm not excluding a possibility that I might've got the versions wrong and the bug occured on the update from 6.1-pf1 to 6.1-pf2 (6.1 and 6.1.3; could be unrelated, but I saw a bunch of commits related to i915 and Skylake).
OK, please keep us updated on the bisection progress.
> What exactly do you mean by "swapping the hardware"? I'm already sure it's not related to my storage, because a month ago I replaced my faulty HDD with an SSD, but the bug still remained. Unfortunately, I don't have spare PCs or resources to purchase new hardware.
In the case of laptops, I mean buying a new laptop (maybe with hardware specs similar to your current one) and trying to reproduce the regression there.
Thanks.
Hi there, hello.
This seems to be my final update.
About a week ago I returned to using Gajim, which, as I remember from earlier, also seemed to be responsible for these hangs, and they got more frequent (I haven't updated any software for the last 2 months). I decided to move to kernel version 6.1.1, which I had earlier marked as "good", and my laptop hung last evening during shutdown. As always, nothing in the logs.
I tried to compile some versions from the 5.15.y branch, but either I had bad luck or those releases weren't properly compatible with GCC 12 yet: they (.48 and .78) emitted warnings, so I never used them (or I broke the repo, who knows).
Due to the fact that software does have an impact on this behaviour, and due to my health issues and potential conscription (cuz our army doesn't care about health), which will cut me off from my laptop for a long, long time, I give up on bisecting. I'll just update all my software (there's also a GCC upgrade in the repos) and hope for the best.
Sorry for the inconvenience and have a great day. Thank you very much.
On Fri, Jun 09, 2023 at 02:09:17PM +0300, Acid Bong wrote:
> Hi there, hello.
> This seems to be my final update.
> About a week ago I returned to using Gajim, which, as I remember from earlier, also seemed to be responsible for these hangings, and they got more frequent (I haven't updated any software for the last 2 months). I decided to move to the kernel version 6.1.1, which I earlier marked as "good", and my laptop hung last evening during the shutdown. As always, nothing in the logs.
> I tried to compile some versions from 5.15.y branch, but either I had a bad luck, or the commits weren't properly compatible with GCC 12 yet, but they (.48 and .78) emitted warnings, so I never used them (or I broke the repo, who knows).
> Due to the fact that software does have impact on this behaviour, and due to my health issues and potential conscription (cuz our army doesn't care about health), which will cut me from my laptop for a long-long time, I give up on bisecting. I'll just update all my software (there's also a GCC upgrade in the repos) and hope for the best.
> Sorry for inconvenience and have a great day. Thank you very much.
No inconvenience on our side; your help is invaluable, especially for intermittent problems like this one. They are really hard to find and debug, and I'm sorry that we didn't get this one resolved.
Bjorn
On 09.06.23 18:55, Bjorn Helgaas wrote:
> On Fri, Jun 09, 2023 at 02:09:17PM +0300, Acid Bong wrote:
>> Hi there, hello.
>> About a week ago I returned to using Gajim, which, as I remember from earlier, also seemed to be responsible for these hangings, and they got more frequent (I haven't updated any software for the last 2 months). I decided to move to the kernel version 6.1.1, which I earlier marked as "good", and my laptop hung last evening during the shutdown. As always, nothing in the logs.
>> I tried to compile some versions from 5.15.y branch, but either I had a bad luck, or the commits weren't properly compatible with GCC 12 yet, but they (.48 and .78) emitted warnings, so I never used them (or I broke the repo, who knows).
>> Due to the fact that software does have impact on this behaviour, and due to my health issues and potential conscription (cuz our army doesn't care about health), which will cut me from my laptop for a long-long time, I give up on bisecting. I'll just update all my software (there's also a GCC upgrade in the repos) and hope for the best.
>> Sorry for inconvenience and have a great day. Thank you very much.
> No inconvenience on our side; your help is invaluable, especially for intermittent problems like this one. They are really hard to find and debug, and I'm sorry that we didn't get this one resolved.
+1
Then let me remove this from the regression tracking, too.
#regzbot inconclusive: ignored, reporter for various real life reasons unfortunately will be unable to bisect/debug
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr That page also explains what to do if mails like this annoy you.
Hi there, hello. A little mid-update.
I bisected almost the whole range after 6.1.3, and the only untested commits left are unrelated to my hardware (AMD-specific stuff). I spent a week with a 6.1.1 kernel and didn't experience a single hang, which leads me to a couple of conclusions:
1) it's not a hardware issue after all, since certain versions don't produce the bug;
2) (this one's more of an assumption) I might've got the version range wrong.
I'm gonna try 6.1.2 and 6.1.3 as well (up to 7 more days for each) and, depending on the outcome, bisect in a different range (now I regret not doing it in the beginning).
At the moment the earliest *tested* commit is:
```
[15e7433e1dc202] arm64: dts: qcom: sc8280xp: fix UFS DMA coherency.
```
and it's marked as "bad".
Thank you for your patience.