Processors based on the Zen microarchitecture support IOPORT based deeper C-states. The ACPI idle driver reads the acpi_gbl_FADT.xpm_timer_block.address in the IOPORT based C-state exit path which is claimed to be a "Dummy wait op" and has been around since ACPI's introduction to Linux dating back to Andy Grover's Mar 14, 2002 posting [1].
Old, circa 2002 chipsets have a bug which was elaborated by Andreas Mohr back in 2006 in commit b488f02156d3d ("ACPI: restore comment justifying 'extra' P_LVLx access") where the commit log claims: "this dummy read was about: STPCLK# doesn't get asserted in time on (some) chipsets, which is why we need to have a dummy I/O read to delay further instruction processing until the CPU is fully stopped."
This workaround is very painful on modern systems with a large number of cores. The "inl()" can take thousands of cycles. Sampling certain workloads with IBS on AMD Zen3 system shows that a significant amount of time is spent in the dummy op, which incorrectly gets accounted as C-State residency. A large C-State residency value can prime the cpuidle governor to recommend a deeper C-State during the subsequent idle instances, starting a vicious cycle, leading to performance degradation on workloads that rapidly switch between busy and idle phases. (For the extent of the performance degradation refer link [2])
The dummy wait is unnecessary on processors based on the Zen microarchitecture (AMD family 17h+ and HYGON). Skip it to prevent polluting the C-state residency information. Among the pre-family 17h AMD processors, there has been at least one report of an AMD Athlon on a VIA chipset (circa 2006) where this this problem was seen (see [3] for report by Andreas Mohr).
Modern Intel processors use MWAIT based C-States in the intel_idle driver and are not impacted by this code path. For older Intel processors that use the acpi_idle driver, a workaround was suggested by Dave Hansen and Rafael J. Wysocki to regard all Intel chipsets using the IOPORT based C-state management as being affected by this problem (see [4] for workaround proposed).
For these reasons, mark all the Intel processors and pre-family 17h AMD processors with x86_BUG_STPCLK. In the acpi_idle driver, restrict the dummy wait during IOPORT based C-state transitions to only these processors.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/co... [1] Link: https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/ [2] Link: https://lore.kernel.org/lkml/Yyy6l94G0O2B7Yh1@rhlx01.hs-esslingen.de/ [3] Link: https://lore.kernel.org/lkml/88c17568-8694-940a-0f1f-9d345e8dcbdb@intel.com/ [4]
Suggested-by: Calvin Ong calvin.ong@amd.com Cc: Dave Hansen dave.hansen@linux.intel.com Cc: Len Brown lenb@kernel.org Cc: Rafael J. Wysocki rafael.j.wysocki@intel.com CC: Pu Wen puwen@hygon.cn Cc: stable@vger.kernel.org Signed-off-by: K Prateek Nayak kprateek.nayak@amd.com --- v1->v2: o Introduce X86_BUG_STPCLK to mark chipsets as being affected by the STPCLK# signal assertion issue. o Mark all Intel chipsets and pre fam-17h AMD chipsets as being affected by the X86_BUG_STPCLK. o Skip dummy xpm_timer_block read in IOPORT based C-state exit path in ACPI processor_idle if chipset is not affected by X86_BUG_STPCLK. --- arch/x86/include/asm/cpufeatures.h | 1 + arch/x86/kernel/cpu/amd.c | 12 ++++++++++++ arch/x86/kernel/cpu/intel.c | 12 ++++++++++++ drivers/acpi/processor_idle.c | 8 ++++++++ 4 files changed, 33 insertions(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index ef4775c6db01..fcd3617ed315 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -460,5 +460,6 @@ #define X86_BUG_MMIO_UNKNOWN X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */ #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */ #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */ +#define X86_BUG_STPCLK X86_BUG(29) /* STPCLK# signal does not get asserted in time during IOPORT based C-state entry */
#endif /* _ASM_X86_CPUFEATURES_H */ diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index 48276c0e479d..8cb5887a53a3 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -988,6 +988,18 @@ static void init_amd(struct cpuinfo_x86 *c) if (!cpu_has(c, X86_FEATURE_XENPV)) set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
+ /* + * CPUs based on the Zen microarchitecture (Fam 17h onward) can + * guarantee that STPCLK# signal is asserted in time after the + * P_LVL2 read to freeze execution after an IOPORT based C-state + * entry. Among the older AMD processors, there has been at least + * one report of an AMD Athlon processor on a VIA chipset + * (circa 2006) having this issue. Mark all these older AMD + * processor families as being affected. + */ + if (c->x86 < 0x17) + set_cpu_bug(c, X86_BUG_STPCLK); + /* * Turn on the Instructions Retired free counter on machines not * susceptible to erratum #1054 "Instructions Retired Performance diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 2d7ea5480ec3..96fe1320c238 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -696,6 +696,18 @@ static void init_intel(struct cpuinfo_x86 *c) ((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT))) set_cpu_bug(c, X86_BUG_MONITOR);
+ /* + * Intel chipsets prior to Nehalem used the ACPI processor_idle + * driver for C-state management. Some of these processors that + * used IOPORT based C-states could not guarantee that STPCLK# + * signal gets asserted in time after P_LVL2 read to freeze + * execution properly. Since a clear cut-off point is not known + * as to when this bug was solved, mark all the chipsets as + * being affected. Only the ones that use IOPORT based C-state + * transitions via the acpi_idle driver will be impacted. + */ + set_cpu_bug(c, X86_BUG_STPCLK); + #ifdef CONFIG_X86_64 if (c->x86 == 15) c->x86_cache_alignment = c->x86_clflush_size * 2; diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index 16a1663d02d4..493f9ccdb72d 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -528,6 +528,14 @@ static int acpi_idle_bm_check(void) static void wait_for_freeze(void) { #ifdef CONFIG_X86 + /* + * A dummy wait operation is only required for those chipsets + * that cannot assert STPCLK# signal in time after P_LVL2 read. + * If a chipset is not affected by this problem, skip it. + */ + if (!static_cpu_has_bug(X86_BUG_STPCLK)) + return; + /* No delay is needed if we are in guest */ if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) return;
On Fri, Sep 23, 2022 at 09:08:01PM +0530, K Prateek Nayak wrote:
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index ef4775c6db01..fcd3617ed315 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -460,5 +460,6 @@ #define X86_BUG_MMIO_UNKNOWN X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */ #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */ #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */ +#define X86_BUG_STPCLK X86_BUG(29) /* STPCLK# signal does not get asserted in time during IOPORT based C-state entry */
What for?
Did you not see this thread?
https://lore.kernel.org/r/78d13a19-2806-c8af-573e-7f2625edfab8@intel.com
Hello Boris,
On 9/23/2022 9:16 PM, Borislav Petkov wrote:
On Fri, Sep 23, 2022 at 09:08:01PM +0530, K Prateek Nayak wrote:
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index ef4775c6db01..fcd3617ed315 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -460,5 +460,6 @@ #define X86_BUG_MMIO_UNKNOWN X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */ #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */ #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */ +#define X86_BUG_STPCLK X86_BUG(29) /* STPCLK# signal does not get asserted in time during IOPORT based C-state entry */
What for?
Did you not see this thread?
https://lore.kernel.org/r/78d13a19-2806-c8af-573e-7f2625edfab8@intel.com
Yes, I did see the patch, but Andreas replied to Dave's patch attached in v1 saying that he had seen this issue with AMD Athlon on a VIA chipset back in 2006. (https://lore.kernel.org/lkml/Yyy6l94G0O2B7Yh1@rhlx01.hs-esslingen.de/) Therefore, Dave's patch is not sufficient.
Dave suggested the use of AMD vendor and family check, but Peter then pointed out that we would also need the Hygon vendor check.
Hence I worked on the v2 which addressed these with the help of X86_BUG_STPCLK so that we can avoid vendor specific checks inside the common acpi code. -- Thanks and Regards, Prateek
On Fri, Sep 23, 2022 at 09:55:02PM +0530, K Prateek Nayak wrote:
Yes, I did see the patch, but Andreas replied to Dave's patch attached in v1 saying that he had seen this issue with AMD Athlon on a VIA chipset back in 2006. (https://lore.kernel.org/lkml/Yyy6l94G0O2B7Yh1@rhlx01.hs-esslingen.de/) Therefore, Dave's patch is not sufficient.
Well, AFAICT, that box is old and dead. And I'm being told that the AMD machines which we care for and which are still alive should not need the dummy read.
Which means, we could try the simple fix first and see who complains.
Hello Boris,
On 9/23/2022 10:00 PM, Borislav Petkov wrote:
[..snip..]
Well, AFAICT, that box is old and dead. And I'm being told that the AMD machines which we care for and which are still alive should not need the dummy read.
Which means, we could try the simple fix first and see who complains.
That's a fair point. -- Thanks and Regards, Prateek
From: K Prateek Nayak
Sent: 23 September 2022 16:38
....
This workaround is very painful on modern systems with a large number of cores. The "inl()" can take thousands of cycles. Sampling certain workloads with IBS on AMD Zen3 system shows that a significant amount of time is spent in the dummy op, which incorrectly gets accounted as C-State residency. A large C-State residency value can prime the cpuidle governor to recommend a deeper C-State during the subsequent idle instances, starting a vicious cycle, leading to performance degradation on workloads that rapidly switch between busy and idle phases. (For the extent of the performance degradation refer link [2])
Isn't that a horrid bug itself? Sounds like it affects any code that is doing pio reads of hardware buffers. While they are slow they are necessary. IIRC any PCIe read into an Altera fpga takes about 128 cycles of the 125MHz clock. The Intel cpu I've checked will only execute one concurrent PCIe read for each cpu core - so the cpu soon stalls for thousands of clocks.
The dummy wait is unnecessary on processors based on the Zen microarchitecture (AMD family 17h+ and HYGON). Skip it to prevent polluting the C-state residency information. Among the pre-family 17h AMD processors, there has been at least one report of an AMD Athlon on a VIA chipset (circa 2006) where this this problem was seen (see [3] for report by Andreas Mohr).
Modern Intel processors use MWAIT based C-States in the intel_idle driver and are not impacted by this code path. For older Intel processors that use the acpi_idle driver, a workaround was suggested by Dave Hansen and Rafael J. Wysocki to regard all Intel chipsets using the IOPORT based C-state management as being affected by this problem (see [4] for workaround proposed).
Can you use a surrogate (maybe AVX support?) to exclude large groups on modern cpu?
Another possibility is that is the io address doesn't really matter are there any locations that have moved on-die and are now executed much faster than the ISA bus speed of older systems? Or do all the 'originally ISA' peripherals still run at ISA speeds?
David
- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, Sep 23, 2022 at 09:08:01PM +0530, K Prateek Nayak wrote:
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index ef4775c6db01..fcd3617ed315 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -460,5 +460,6 @@ #define X86_BUG_MMIO_UNKNOWN X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */ #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */ #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */ +#define X86_BUG_STPCLK X86_BUG(29) /* STPCLK# signal does not get asserted in time during IOPORT based C-state entry */ #endif /* _ASM_X86_CPUFEATURES_H */ diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index 48276c0e479d..8cb5887a53a3 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -988,6 +988,18 @@ static void init_amd(struct cpuinfo_x86 *c) if (!cpu_has(c, X86_FEATURE_XENPV)) set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
- /*
* CPUs based on the Zen microarchitecture (Fam 17h onward) can
* guarantee that STPCLK# signal is asserted in time after the
* P_LVL2 read to freeze execution after an IOPORT based C-state
* entry. Among the older AMD processors, there has been at least
* one report of an AMD Athlon processor on a VIA chipset
* (circa 2006) having this issue. Mark all these older AMD
* processor families as being affected.
*/
- if (c->x86 < 0x17)
set_cpu_bug(c, X86_BUG_STPCLK);
- /*
- Turn on the Instructions Retired free counter on machines not
- susceptible to erratum #1054 "Instructions Retired Performance
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 2d7ea5480ec3..96fe1320c238 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -696,6 +696,18 @@ static void init_intel(struct cpuinfo_x86 *c) ((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT))) set_cpu_bug(c, X86_BUG_MONITOR);
- /*
* Intel chipsets prior to Nehalem used the ACPI processor_idle
* driver for C-state management. Some of these processors that
* used IOPORT based C-states could not guarantee that STPCLK#
* signal gets asserted in time after P_LVL2 read to freeze
* execution properly. Since a clear cut-off point is not known
* as to when this bug was solved, mark all the chipsets as
* being affected. Only the ones that use IOPORT based C-state
* transitions via the acpi_idle driver will be impacted.
*/
- set_cpu_bug(c, X86_BUG_STPCLK);
#ifdef CONFIG_X86_64 if (c->x86 == 15) c->x86_cache_alignment = c->x86_clflush_size * 2;
Quiz time:
#define X86_VENDOR_INTEL 0 #define X86_VENDOR_CYRIX 1 #define X86_VENDOR_AMD 2 #define X86_VENDOR_UMC 3 #define X86_VENDOR_CENTAUR 5 #define X86_VENDOR_TRANSMETA 7 #define X86_VENDOR_NSC 8 #define X86_VENDOR_HYGON 9 #define X86_VENDOR_ZHAOXIN 10 #define X86_VENDOR_VORTEX 11 #define X86_VENDOR_NUM 12 #define X86_VENDOR_UNKNOWN 0xff
For how many of the above have you changed behaviour?
Not to mention that this is the gazillion-th time AMD has failed to change HYGON in lock-step. That's Zen too -- deal with it.
Hello Peter,
On 9/26/2022 5:37 PM, Peter Zijlstra wrote:
On Fri, Sep 23, 2022 at 09:08:01PM +0530, K Prateek Nayak wrote:
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h index ef4775c6db01..fcd3617ed315 100644 --- a/arch/x86/include/asm/cpufeatures.h +++ b/arch/x86/include/asm/cpufeatures.h @@ -460,5 +460,6 @@ #define X86_BUG_MMIO_UNKNOWN X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */ #define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */ #define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */ +#define X86_BUG_STPCLK X86_BUG(29) /* STPCLK# signal does not get asserted in time during IOPORT based C-state entry */ #endif /* _ASM_X86_CPUFEATURES_H */ diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c index 48276c0e479d..8cb5887a53a3 100644 --- a/arch/x86/kernel/cpu/amd.c +++ b/arch/x86/kernel/cpu/amd.c @@ -988,6 +988,18 @@ static void init_amd(struct cpuinfo_x86 *c) if (!cpu_has(c, X86_FEATURE_XENPV)) set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
- /*
* CPUs based on the Zen microarchitecture (Fam 17h onward) can
* guarantee that STPCLK# signal is asserted in time after the
* P_LVL2 read to freeze execution after an IOPORT based C-state
* entry. Among the older AMD processors, there has been at least
* one report of an AMD Athlon processor on a VIA chipset
* (circa 2006) having this issue. Mark all these older AMD
* processor families as being affected.
*/
- if (c->x86 < 0x17)
set_cpu_bug(c, X86_BUG_STPCLK);
- /*
- Turn on the Instructions Retired free counter on machines not
- susceptible to erratum #1054 "Instructions Retired Performance
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c index 2d7ea5480ec3..96fe1320c238 100644 --- a/arch/x86/kernel/cpu/intel.c +++ b/arch/x86/kernel/cpu/intel.c @@ -696,6 +696,18 @@ static void init_intel(struct cpuinfo_x86 *c) ((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT))) set_cpu_bug(c, X86_BUG_MONITOR);
- /*
* Intel chipsets prior to Nehalem used the ACPI processor_idle
* driver for C-state management. Some of these processors that
* used IOPORT based C-states could not guarantee that STPCLK#
* signal gets asserted in time after P_LVL2 read to freeze
* execution properly. Since a clear cut-off point is not known
* as to when this bug was solved, mark all the chipsets as
* being affected. Only the ones that use IOPORT based C-state
* transitions via the acpi_idle driver will be impacted.
*/
- set_cpu_bug(c, X86_BUG_STPCLK);
#ifdef CONFIG_X86_64 if (c->x86 == 15) c->x86_cache_alignment = c->x86_clflush_size * 2;
Quiz time:
#define X86_VENDOR_INTEL 0 #define X86_VENDOR_CYRIX 1 #define X86_VENDOR_AMD 2 #define X86_VENDOR_UMC 3 #define X86_VENDOR_CENTAUR 5 #define X86_VENDOR_TRANSMETA 7 #define X86_VENDOR_NSC 8 #define X86_VENDOR_HYGON 9 #define X86_VENDOR_ZHAOXIN 10 #define X86_VENDOR_VORTEX 11 #define X86_VENDOR_NUM 12 #define X86_VENDOR_UNKNOWN 0xff
For how many of the above have you changed behaviour?
The proposed logic does alter the behavior for x86 chipsets that depend on acpi_idle driver and have IOPORT based C-state. Based on what Rafael and Dave suggested, I have marked all Intel processors to be affected by this bug. In light of Andreas' report, I've also marked all the pre-family 17h AMD processors to be affected by this bug to avoid causing any regression.
It is hard to tell if any other vendor had this bug in their chipsets. Dave's patch does not make this consideration either and limits the dummy operation to only Intel chipsets using acpi_idle driver. (https://lore.kernel.org/all/78d13a19-2806-c8af-573e-7f2625edfab8@intel.com/) If folks reported a regression, I would have been happy to fix it for them.
Not to mention that this is the gazillion-th time AMD has failed to change HYGON in lock-step. That's Zen too -- deal with it.
Hygon is based on the Zen microarchitecture (equivalent to Fam 17h on AMD) and they too do not need the the dummy wait op to ensure correct behavior. Hence, they are not marked with x86_BUG_STPCLK.
In the patch description, I've called out:
"mark all the Intel processors and pre-family 17h AMD processors with x86_BUG_STPCLK. In the acpi_idle driver, restrict the dummy wait during IOPORT based C-state transitions to only these processors."
Both Hygon and AMD Fam 17h+, which are based on Zen microachitecture, are not affected by x86_BUG_STPCLK and hence skip the dummy wait op.
Did I miss something? -- Thanks and Regards, Prateek
Hi,
Am Mon, Sep 26, 2022 at 10:02:44PM +0530 schrieb K Prateek Nayak:
Hello Peter,
On 9/26/2022 5:37 PM, Peter Zijlstra wrote:
For how many of the above have you changed behaviour?
The proposed logic does alter the behavior for x86 chipsets that depend on acpi_idle driver and have IOPORT based C-state. Based on what Rafael and Dave suggested, I have marked all Intel processors to be affected by this bug. In light of Andreas' report, I've also marked all the pre-family 17h AMD processors to be affected by this bug to avoid causing any regression.
It is hard to tell if any other vendor had this bug in their chipsets. Dave's patch does not make this consideration either and limits the dummy operation to only Intel chipsets using acpi_idle driver. (https://lore.kernel.org/all/78d13a19-2806-c8af-573e-7f2625edfab8@intel.com/) If folks reported a regression, I would have been happy to fix it for them.
Despite certain, umm, controversies, the discussion/patch activities appear to be heading into a good direction ;)
This text somehow prompted me to think of whether STPCLK# [quirk] behaviour is a property of the CPU family, or the chipset, or actually a combination of it.
Given that [from recollection] VIA 8233/8235 spec PDFs do mention STPCLK#, possibly a chipset does have a say in it? (which obviously would then mean that the kernel's quirk state decision-making would have to be refined)
Minor reference (note 8237, not 8233): http://www.chipset-ic.com/datasheet/VT8237.pdf "STPCLK# is asserted by the VT8237R to the CPU to throttle the processor." (and many other STPCLK# mentions there)
Greetings
Andreas Mohr
linux-stable-mirror@lists.linaro.org