Hi, Andres,
On Tue, 2023-12-05 at 22:58 -0800, Andres Freund wrote:
Hi,
On 2023-12-01 08:31:48 +0000, Zhang, Rui wrote:
As a quick fix, I'm not going to fix the "potential issue" described above because we have not seen a real problem caused by this yet.
Can you please try the below patch to confirm if the problem is gone on your system? This patch falls back to the previous way as sent at https://lore.kernel.org/lkml/87pm4bp54z.ffs@tglx/T/
I've just spent a couple hours bisecting why upgrading to 6.7-rc4 left me with just a single CPU core on my dual socket workstation.
before:
[ 0.000000] Linux version 6.6.0-andres-00003-g31255e072b2e ...
...
[ 0.022960] ACPI: Using ACPI (MADT) for SMP configuration information
...
[ 0.022968] smpboot: Allowing 40 CPUs, 0 hotplug CPUs
...
[ 0.345921] smpboot: CPU0: Intel(R) Xeon(R) Gold 5215 CPU @ 2.50GHz (family: 0x6, model: 0x55, stepping: 0x7)
...
[ 0.347229] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9
[ 0.349082] .... node #1, CPUs: #10 #11 #12 #13 #14 #15 #16 #17 #18 #19
[ 0.003190] smpboot: CPU 10 Converting physical 0 to logical die 1
[ 0.361053] .... node #0, CPUs: #20 #21 #22 #23 #24 #25 #26 #27 #28 #29
[ 0.363990] .... node #1, CPUs: #30 #31 #32 #33 #34 #35 #36 #37 #38 #39
...
[ 0.370886] smp: Brought up 2 nodes, 40 CPUs
[ 0.370891] smpboot: Max logical packages: 2
[ 0.370896] smpboot: Total of 40 processors activated (200000.00 BogoMIPS)
[ 0.403905] node 0 deferred pages initialised in 32ms
[ 0.408865] node 1 deferred pages initialised in 37ms
after:
[ 0.000000] Linux version 6.6.0-andres-00004-gec9aedb2aa1a ...
...
[ 0.022935] ACPI: Using ACPI (MADT) for SMP configuration information
...
[ 0.022942] smpboot: Allowing 1 CPUs, 0 hotplug CPUs
...
[ 0.356424] smpboot: CPU0: Intel(R) Xeon(R) Gold 5215 CPU @ 2.50GHz (family: 0x6, model: 0x55, stepping: 0x7)
...
[ 0.357098] smp: Bringing up secondary CPUs ...
[ 0.357107] smp: Brought up 2 nodes, 1 CPU
[ 0.357108] smpboot: Max logical packages: 1
[ 0.357110] smpboot: Total of 1 processors activated (5000.00 BogoMIPS)
[ 0.726283] node 0 deferred pages initialised in 368ms
[ 0.774704] node 1 deferred pages initialised in 418ms
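For anyone wanting to check a pair of saved dmesg files for the same symptom, a minimal sketch follows. It is not part of the original report: the function name allowed_cpus and the regular expression are assumptions, derived only from the "smpboot: Allowing N CPUs" lines quoted above, and other kernel versions may word that line differently.

```python
import re

# Pattern for the smpboot summary line as it appears in the logs above,
# e.g. "smpboot: Allowing 40 CPUs, 0 hotplug CPUs".
# Assumption: the wording matches the quoted 6.6/6.7-rc kernels.
ALLOWING_RE = re.compile(r"smpboot: Allowing (\d+) CPUs, (\d+) hotplug CPUs")


def allowed_cpus(dmesg_text):
    """Return (allowed, hotplug) from the first matching line, or None."""
    match = ALLOWING_RE.search(dmesg_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The two lines quoted in the report, used here as sample input.
before = "[ 0.022968] smpboot: Allowing 40 CPUs, 0 hotplug CPUs"
after = "[ 0.022942] smpboot: Allowing 1 CPUs, 0 hotplug CPUs"

if allowed_cpus(before) != allowed_cpus(after):
    print("CPU count differs:", allowed_cpus(before), "vs", allowed_cpus(after))
```

In practice the two strings would be replaced by the contents of the attached dmesg files.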
There does seem to be something off with the ACPI data, when booting without the patch,
Which patch are you referring to? The original patch in this thread?
Does the second patch fix the problem? I mean the patch at https://lore.kernel.org/all/904ce2b870b8a7f34114f93adc7c8170420869d1.camel@i...
thanks, rui
I do see messages like:
[ 0.715228] APIC: NR_CPUS/possible_cpus limit of 40 reached. Processor 40/0x7f00 ignored.
[ 0.715231] ACPI: Unable to map lapic to logical cpu number
But other than that, the system has worked for a couple years.
It's obviously not good to regress from 2x10/20 cores/threads to a single core. I guess it's at least somewhat funny to imagine a 2 socket system with a single core...
It seems particularly worrying that this patch has apparently been selected for -stable: https://lore.kernel.org/all/20231122153212.852040-2-sashal@kernel.org/
Even if it didn't have these unintended consequences, it seems like a commit like this hardly is -stable material?
I've attached .config, dmesg of a boot with gec9aedb2aa1a and one with gec9aedb2aa1a^.
Greetings,
Andres Freund