On Wed, 21 Jun 2023 at 19:57, Jason A. Donenfeld Jason@zx2c4.com wrote:
+Ard - any ideas here?
On Wed, Jun 21, 2023 at 10:46 AM Linux regression tracking (Thorsten Leemhuis) regressions@leemhuis.info wrote:
[added Jason (who authored the culprit) to the list of recipients; moved net people and list to BCC, guess they are not much interested in this anymore then]
On 21.06.23 08:07, Sami Korkalainen wrote:
I bisected again. It seems I made some mistake last time, as I got a different result this time. That may be because these problematic kernels sometimes boot fine, as I said before.
Anyway, first bad commit (makes much more sense this time): e7b813b32a42a3a6281a4fd9ae7700a0257c1d50 efi: random: refresh non-volatile random seed when RNG is initialized
I confirmed that this is the code causing the issue by commenting it out (see the patch file). Without this code, the latest mainline boots fine.
Jason, in that case it seems this is something for you. For the initial report, see here:
https://lore.kernel.org/all/GQUnKz2al3yke5mB2i1kp3SzNHjK8vi6KJEh7rnLrOQ24Orl...
Quoting a part of it:
Linux 6.2 and newer are (mostly) unbootable on my old HP 6730b laptop; 6.1.30 still works fine. The weirdest thing is that newer kernels (like 6.3.4 and 6.4-rc3) may boot ok on the first try, but when rebooting, the very same version doesn't boot.
Sometimes, when trying to boot, I get this message repeated forever:
ACPI Error: No handler or method for GPE [XX], disabling event (20221020/evgpe-839)
On newer kernels, the date is 20230331 instead of 20221020. There is also some other error, but I can't read it as it gets overwritten by the other ACPI error, see image linked at the end.
And sometimes, the screen will just stay completely blank.
I tried booting with acpi=off, but it does not help.
Catching up with email after my vacation, apologies for the delay.
This ship seems to have sailed in the meantime, but I'll contribute some observations anyway.
The machine in question appears to be a Vista-era Windows laptop, and I am not surprised at all that the firmware is flaky. In those days, firmware testing was limited to boot testing Windows, and nobody bothered testing for EFI compliance beyond that (as it is not needed to get the Windows sticker).
However, the failure mode still strikes me as odd, and I'd be interested in finding out whether booting with efi=noruntime makes a difference at all, as that would prevent the SetVariable() call from taking place, without affecting anything else.
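For anyone running the efi=noruntime experiment, a small sketch like the following could confirm that the option really reached the booted kernel. It assumes that efi= takes a comma-separated option list, as described in the kernel's command-line parameter documentation; the function names are made up for illustration.

```python
from typing import Set


def efi_options(cmdline: str) -> Set[str]:
    """Collect every option passed via efi=... on a kernel command line."""
    options: Set[str] = set()
    for token in cmdline.split():
        if token.startswith("efi="):
            # efi= accepts a comma-separated list, e.g. efi=debug,noruntime
            options.update(opt for opt in token[len("efi="):].split(",") if opt)
    return options


def runtime_services_disabled(cmdline: str) -> bool:
    """True if the command line asks the kernel to skip EFI runtime services."""
    return "noruntime" in efi_options(cmdline)


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()
```

On the affected machine, `runtime_services_disabled(read_cmdline())` should return True after booting with efi=noruntime.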
Setting the variable from user space is ultimately a better choice, I think. The reason it was avoided here is so that we don't have to rely on user space to set limited permissions on the efivarfs file entry to keep the seed from being world readable (which is something, e.g., systemd does today for other 'sensitive' EFI variables, whatever that means). But given that this variable is in its own GUIDed namespace, we could easily fix that in efivarfs itself.
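To illustrate what setting the variable from user space with limited permissions might look like, here is a rough sketch. It relies on two properties of efivarfs: entries are named <Name>-<GUID>, and the file contents start with a 4-byte little-endian attribute word followed by the data. The variable name and GUID below are assumptions and should be checked against the kernel sources. The helper names are hypothetical, and the code has only been exercised against an ordinary directory, not real firmware.

```python
import os
import struct

# Attribute bits as defined by the UEFI specification.
EFI_VARIABLE_NON_VOLATILE = 0x1
EFI_VARIABLE_BOOTSERVICE_ACCESS = 0x2
EFI_VARIABLE_RUNTIME_ACCESS = 0x4

# Assumption: name and GUID of the seed variable; verify against the kernel.
SEED_NAME = "RandomSeed"
SEED_GUID = "1ce1e5bc-7ceb-42f2-81e5-8aadf180f57b"


def seed_path(efivars_dir: str, name: str = SEED_NAME, guid: str = SEED_GUID) -> str:
    """Build the efivarfs-style file name <Name>-<GUID> inside efivars_dir."""
    return os.path.join(efivars_dir, "{}-{}".format(name, guid))


def encode_variable(data: bytes, attributes: int) -> bytes:
    """Prefix the payload with the 32-bit little-endian attribute word."""
    return struct.pack("<I", attributes) + data


def write_seed(efivars_dir: str, seed: bytes) -> str:
    """Write the seed so that only the owner can read it (mode 0600)."""
    attributes = (EFI_VARIABLE_NON_VOLATILE
                  | EFI_VARIABLE_BOOTSERVICE_ACCESS
                  | EFI_VARIABLE_RUNTIME_ACCESS)
    path = seed_path(efivars_dir)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # Tighten the mode explicitly in case the file already existed.
        os.fchmod(fd, 0o600)
        # Attributes and data go out in one write() call.
        os.write(fd, encode_variable(seed, attributes))
    finally:
        os.close(fd)
    return path
```

On a real system the directory would be /sys/firmware/efi/efivars, and the permission handling is exactly the part that user space would have to get right unless efivarfs enforced it for this GUID.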