Processors based on the Zen microarchitecture support IOPORT based deeper C-states. The ACPI idle driver reads acpi_gbl_FADT.xpm_timer_block.address in the IOPORT based C-state exit path. This read is described as a "Dummy wait op" and has been present since ACPI was introduced to Linux in Andy Grover's Mar 14, 2002 posting [1].
Old, circa 2002 chipsets have a bug which was elaborated by Andreas Mohr back in 2006 in commit b488f02156d3d ("ACPI: restore comment justifying 'extra' P_LVLx access") where the commit log claims: "this dummy read was about: STPCLK# doesn't get asserted in time on (some) chipsets, which is why we need to have a dummy I/O read to delay further instruction processing until the CPU is fully stopped."
This workaround is very painful on modern systems with a large number of cores. The "inl()" can take thousands of cycles. Sampling certain workloads with IBS on an AMD Zen3 system shows that a significant amount of time is spent in the dummy op, which incorrectly gets accounted as C-state residency. A large C-state residency value can prime the cpuidle governor to recommend a deeper C-state during the subsequent idle instances, starting a vicious cycle that leads to performance degradation on workloads that rapidly switch between busy and idle phases. (For the extent of the performance degradation, refer to link [2].)
The dummy wait is unnecessary on processors based on the Zen microarchitecture (AMD family 17h+ and HYGON). Skip it to prevent polluting the C-state residency information. Among the pre-family 17h AMD processors, there has been at least one report of an AMD Athlon on a VIA chipset (circa 2006) where this problem was seen (see [3] for the report by Andreas Mohr).
Modern Intel processors use MWAIT based C-states in the intel_idle driver and are not impacted by this code path. For older Intel processors that use the acpi_idle driver, Dave Hansen and Rafael J. Wysocki suggested regarding all Intel chipsets that use IOPORT based C-state management as being affected by this problem (see [4] for the proposed workaround).
For these reasons, mark all Intel processors and pre-family 17h AMD processors with X86_BUG_STPCLK. In the acpi_idle driver, restrict the dummy wait during IOPORT based C-state transitions to only these processors.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/co... [1]
Link: https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/ [2]
Link: https://lore.kernel.org/lkml/Yyy6l94G0O2B7Yh1@rhlx01.hs-esslingen.de/ [3]
Link: https://lore.kernel.org/lkml/88c17568-8694-940a-0f1f-9d345e8dcbdb@intel.com/ [4]
Suggested-by: Calvin Ong <calvin.ong@amd.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Pu Wen <puwen@hygon.cn>
Cc: stable@vger.kernel.org
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v1->v2:
o Introduce X86_BUG_STPCLK to mark chipsets as being affected by the
  STPCLK# signal assertion issue.
o Mark all Intel chipsets and pre fam-17h AMD chipsets as being affected
  by the X86_BUG_STPCLK.
o Skip dummy xpm_timer_block read in IOPORT based C-state exit path in
  ACPI processor_idle if chipset is not affected by X86_BUG_STPCLK.
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/kernel/cpu/amd.c          | 12 ++++++++++++
 arch/x86/kernel/cpu/intel.c        | 12 ++++++++++++
 drivers/acpi/processor_idle.c      |  8 ++++++++
 4 files changed, 33 insertions(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ef4775c6db01..fcd3617ed315 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -460,5 +460,6 @@
 #define X86_BUG_MMIO_UNKNOWN		X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
 #define X86_BUG_RETBLEED		X86_BUG(27) /* CPU is affected by RETBleed */
 #define X86_BUG_EIBRS_PBRSB		X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
+#define X86_BUG_STPCLK			X86_BUG(29) /* STPCLK# signal does not get asserted in time during IOPORT based C-state entry */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 48276c0e479d..8cb5887a53a3 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -988,6 +988,18 @@ static void init_amd(struct cpuinfo_x86 *c)
 	if (!cpu_has(c, X86_FEATURE_XENPV))
 		set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
 
+	/*
+	 * CPUs based on the Zen microarchitecture (Fam 17h onward) can
+	 * guarantee that STPCLK# signal is asserted in time after the
+	 * P_LVL2 read to freeze execution after an IOPORT based C-state
+	 * entry. Among the older AMD processors, there has been at least
+	 * one report of an AMD Athlon processor on a VIA chipset
+	 * (circa 2006) having this issue. Mark all these older AMD
+	 * processor families as being affected.
+	 */
+	if (c->x86 < 0x17)
+		set_cpu_bug(c, X86_BUG_STPCLK);
+
 	/*
 	 * Turn on the Instructions Retired free counter on machines not
 	 * susceptible to erratum #1054 "Instructions Retired Performance
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 2d7ea5480ec3..96fe1320c238 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -696,6 +696,18 @@ static void init_intel(struct cpuinfo_x86 *c)
 		((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
 		set_cpu_bug(c, X86_BUG_MONITOR);
 
+	/*
+	 * Intel chipsets prior to Nehalem used the ACPI processor_idle
+	 * driver for C-state management. Some of these processors that
+	 * used IOPORT based C-states could not guarantee that STPCLK#
+	 * signal gets asserted in time after P_LVL2 read to freeze
+	 * execution properly. Since a clear cut-off point is not known
+	 * as to when this bug was solved, mark all the chipsets as
+	 * being affected. Only the ones that use IOPORT based C-state
+	 * transitions via the acpi_idle driver will be impacted.
+	 */
+	set_cpu_bug(c, X86_BUG_STPCLK);
+
 #ifdef CONFIG_X86_64
 	if (c->x86 == 15)
 		c->x86_cache_alignment = c->x86_clflush_size * 2;
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 16a1663d02d4..493f9ccdb72d 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -528,6 +528,14 @@ static int acpi_idle_bm_check(void)
 static void wait_for_freeze(void)
 {
 #ifdef CONFIG_X86
+	/*
+	 * A dummy wait operation is only required for those chipsets
+	 * that cannot assert STPCLK# signal in time after P_LVL2 read.
+	 * If a chipset is not affected by this problem, skip it.
+	 */
+	if (!static_cpu_has_bug(X86_BUG_STPCLK))
+		return;
+
 	/* No delay is needed if we are in guest */
 	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		return;