On 9/22/22 2:01 PM, Dave Hansen wrote:
> On 9/22/22 11:53, Rafael J. Wysocki wrote:
>> Acked-by: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
>>
>> or do you want me to pick this up?
>
> I'll just stick it in x86/urgent.
>
> It's modifying code in a x86 #ifdef. I'll call it a small enclave of
> sovereign x86 territory in ACPI land, just like an embassy. ;)
Can this be Cc'd to stable@vger.kernel.org? It applies cleanly as far
back as this v5.4 commit:
commit fa583f71a99c85e52781ed877c82c8757437b680
Author: Yin Fengwei <fengwei.yin(a)intel.com>
Date: Thu Oct 24 15:04:20 2019 +0800
ACPI: processor_idle: Skip dummy wait if kernel is in guest
Thanks,
Kim
From: Tianyu Lan <Tianyu.Lan(a)microsoft.com>
commit 82806744fd7dde603b64c151eeddaa4ee62193fd upstream.
swiotlb_find_slots() skips slots according to the IO TLB aligned mask
calculated from the min align mask and the original physical address
offset. This affects the max mapping size: the mapping size cannot
reach IO_TLB_SEGSIZE * IO_TLB_SIZE when the original offset is
non-zero. This causes a boot failure in Hyper-V Isolation VMs, where
swiotlb force is enabled. The SCSI layer uses the return value of
dma_max_mapping_size(), which ultimately calls
swiotlb_max_mapping_size(), to set the max segment size. The Hyper-V
storage driver sets the min align mask to 4k - 1, so the SCSI layer may
pass a 256k request buffer with a 0~4k offset, and the Hyper-V storage
driver then cannot get a swiotlb bounce buffer via the DMA API:
swiotlb_find_slots() cannot find a 256k bounce buffer with that offset.
Make swiotlb_max_mapping_size() take the min align mask into account.
Signed-off-by: Tianyu Lan <Tianyu.Lan(a)microsoft.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Rishabh Bhatnagar <risbhat(a)amazon.com>
---
kernel/dma/swiotlb.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 018f140aaaf4..a9849670bdb5 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -709,7 +709,18 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
size_t swiotlb_max_mapping_size(struct device *dev)
{
- return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
+ int min_align_mask = dma_get_min_align_mask(dev);
+ int min_align = 0;
+
+ /*
+ * swiotlb_find_slots() skips slots according to
+ * min align mask. This affects max mapping size.
+ * Take it into account here.
+ */
+ if (min_align_mask)
+ min_align = roundup(min_align_mask, IO_TLB_SIZE);
+
+ return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE - min_align;
}
bool is_swiotlb_active(struct device *dev)
--
2.37.1
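To make the size arithmetic in the changelog above concrete, here is a small
standalone sketch of the calculation, assuming the upstream constants
IO_TLB_SHIFT = 11 (IO_TLB_SIZE = 2 KiB) and IO_TLB_SEGSIZE = 128 together with
the 4k - 1 min align mask the Hyper-V storage driver sets; it is illustrative
only and not part of either backport:

#include <stdio.h>
#include <stddef.h>

/* Constants as defined in the upstream swiotlb code (assumed here). */
#define IO_TLB_SHIFT	11
#define IO_TLB_SIZE	(1UL << IO_TLB_SHIFT)	/* 2 KiB per slot */
#define IO_TLB_SEGSIZE	128			/* slots per segment */

/* Userspace stand-in for the kernel's roundup() macro. */
#define roundup(x, y)	((((x) + ((y) - 1)) / (y)) * (y))

int main(void)
{
	size_t min_align_mask = 4096 - 1;	/* Hyper-V storage driver value */
	size_t min_align = roundup(min_align_mask, IO_TLB_SIZE);
	size_t before = (size_t)IO_TLB_SIZE * IO_TLB_SEGSIZE;
	size_t after = before - min_align;

	/* Prints: before=262144 (256k), after=258048 (252k). */
	printf("before=%zu after=%zu\n", before, after);
	return 0;
}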
From: Tianyu Lan <Tianyu.Lan(a)microsoft.com>
commit 82806744fd7dde603b64c151eeddaa4ee62193fd upstream.
swiotlb_find_slots() skips slots according to the IO TLB aligned mask
calculated from the min align mask and the original physical address
offset. This affects the max mapping size: the mapping size cannot
reach IO_TLB_SEGSIZE * IO_TLB_SIZE when the original offset is
non-zero. This causes a boot failure in Hyper-V Isolation VMs, where
swiotlb force is enabled. The SCSI layer uses the return value of
dma_max_mapping_size(), which ultimately calls
swiotlb_max_mapping_size(), to set the max segment size. The Hyper-V
storage driver sets the min align mask to 4k - 1, so the SCSI layer may
pass a 256k request buffer with a 0~4k offset, and the Hyper-V storage
driver then cannot get a swiotlb bounce buffer via the DMA API:
swiotlb_find_slots() cannot find a 256k bounce buffer with that offset.
Make swiotlb_max_mapping_size() take the min align mask into account.
Signed-off-by: Tianyu Lan <Tianyu.Lan(a)microsoft.com>
Signed-off-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Rishabh Bhatnagar <risbhat(a)amazon.com>
---
kernel/dma/swiotlb.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 4a9831d01f0e..d897d161366a 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -734,7 +734,18 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size,
size_t swiotlb_max_mapping_size(struct device *dev)
{
- return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
+ int min_align_mask = dma_get_min_align_mask(dev);
+ int min_align = 0;
+
+ /*
+ * swiotlb_find_slots() skips slots according to
+ * min align mask. This affects max mapping size.
+ * Take it into account here.
+ */
+ if (min_align_mask)
+ min_align = roundup(min_align_mask, IO_TLB_SIZE);
+
+ return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE - min_align;
}
bool is_swiotlb_active(void)
--
2.37.1
Processors based on the Zen microarchitecture support IOPORT based deeper
C-states. The ACPI idle driver reads
acpi_gbl_FADT.xpm_timer_block.address in the IOPORT based C-state exit
path; this read is claimed to be a "Dummy wait op" and has been around
since ACPI's introduction to Linux, dating back to Andy Grover's
Mar 14, 2002 posting [1].
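For reference, the exit path in question boils down to roughly the following.
This is a simplified sketch pieced together from the description above and
from the wait_for_freeze() hunk further down; the function name and the
surrounding details are illustrative, not lifted from the driver:

	/* Sketch of an IOPORT based C-state entry followed by the dummy wait. */
	static void enter_io_cstate(struct acpi_processor_cx *cx)
	{
		/* P_LVLx read: ask the chipset to enter the C-state. */
		inb(cx->address);

		/*
		 * "Dummy wait op": read the ACPI PM timer port so that further
		 * instruction processing is delayed until STPCLK# has been
		 * asserted and the CPU is fully stopped. This inl() is the
		 * access that can take thousands of cycles on large systems.
		 */
		inl(acpi_gbl_FADT.xpm_timer_block.address);
	}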
Old, circa-2002 chipsets have a bug that Andreas Mohr elaborated on back
in 2006 in commit b488f02156d3d ("ACPI: restore comment justifying
'extra' P_LVLx access"), where the commit log claims:
"this dummy read was about: STPCLK# doesn't get asserted in time on
(some) chipsets, which is why we need to have a dummy I/O read to delay
further instruction processing until the CPU is fully stopped."
This workaround is very painful on modern systems with a large number of
cores. The "inl()" can take thousands of cycles. Sampling certain
workloads with IBS on an AMD Zen3 system shows that a significant amount of
time is spent in the dummy op, which incorrectly gets accounted as
C-State residency. A large C-State residency value can prime the cpuidle
governor to recommend a deeper C-State during the subsequent idle
instances, starting a vicious cycle, leading to performance degradation
on workloads that rapidly switch between busy and idle phases.
(For the extent of the performance degradation, see link [2].)
The dummy wait is unnecessary on processors based on the Zen
microarchitecture (AMD family 17h+ and HYGON). Skip it to prevent
polluting the C-state residency information. Among the pre-family 17h
AMD processors, there has been at least one report of an AMD Athlon on a
VIA chipset (circa 2006) where this problem was seen (see [3] for the
report by Andreas Mohr).
Modern Intel processors use MWAIT based C-states in the intel_idle driver
and are not impacted by this code path. For older Intel processors that
use the acpi_idle driver, Dave Hansen and Rafael J. Wysocki suggested a
workaround: regard all Intel chipsets that use IOPORT based C-state
management as affected by this problem (see [4] for the proposed
workaround).
For these reasons, mark all the Intel processors and pre-family 17h
AMD processors with X86_BUG_STPCLK. In the acpi_idle driver, restrict the
dummy wait during IOPORT based C-state transitions to only these
processors.
Link: https://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux-fullhistory.git/c… [1]
Link: https://lore.kernel.org/lkml/20220921063638.2489-1-kprateek.nayak@amd.com/ [2]
Link: https://lore.kernel.org/lkml/Yyy6l94G0O2B7Yh1@rhlx01.hs-esslingen.de/ [3]
Link: https://lore.kernel.org/lkml/88c17568-8694-940a-0f1f-9d345e8dcbdb@intel.com/ [4]
Suggested-by: Calvin Ong <calvin.ong(a)amd.com>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: Len Brown <lenb(a)kernel.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki(a)intel.com>
CC: Pu Wen <puwen(a)hygon.cn>
Cc: stable(a)vger.kernel.org
Signed-off-by: K Prateek Nayak <kprateek.nayak(a)amd.com>
---
v1->v2:
o Introduce X86_BUG_STPCLK to mark chipsets as being affected by the
STPCLK# signal assertion issue.
o Mark all Intel chipsets and pre-family 17h AMD chipsets as being affected
by X86_BUG_STPCLK.
o Skip dummy xpm_timer_block read in IOPORT based C-state exit path in
ACPI processor_idle if chipset is not affected by X86_BUG_STPCLK.
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/amd.c | 12 ++++++++++++
arch/x86/kernel/cpu/intel.c | 12 ++++++++++++
drivers/acpi/processor_idle.c | 8 ++++++++
4 files changed, 33 insertions(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index ef4775c6db01..fcd3617ed315 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -460,5 +460,6 @@
#define X86_BUG_MMIO_UNKNOWN X86_BUG(26) /* CPU is too old and its MMIO Stale Data status is unknown */
#define X86_BUG_RETBLEED X86_BUG(27) /* CPU is affected by RETBleed */
#define X86_BUG_EIBRS_PBRSB X86_BUG(28) /* EIBRS is vulnerable to Post Barrier RSB Predictions */
+#define X86_BUG_STPCLK X86_BUG(29) /* STPCLK# signal does not get asserted in time during IOPORT based C-state entry */
#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 48276c0e479d..8cb5887a53a3 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -988,6 +988,18 @@ static void init_amd(struct cpuinfo_x86 *c)
if (!cpu_has(c, X86_FEATURE_XENPV))
set_cpu_bug(c, X86_BUG_SYSRET_SS_ATTRS);
+ /*
+ * CPUs based on the Zen microarchitecture (Fam 17h onward) can
+ * guarantee that the STPCLK# signal is asserted in time after the
+ * P_LVL2 read to freeze execution after an IOPORT based C-state
+ * entry. Among the older AMD processors, there has been at least
+ * one report of an AMD Athlon processor on a VIA chipset
+ * (circa 2006) having this issue. Mark all these older AMD
+ * processor families as being affected.
+ */
+ if (c->x86 < 0x17)
+ set_cpu_bug(c, X86_BUG_STPCLK);
+
/*
* Turn on the Instructions Retired free counter on machines not
* susceptible to erratum #1054 "Instructions Retired Performance
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 2d7ea5480ec3..96fe1320c238 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -696,6 +696,18 @@ static void init_intel(struct cpuinfo_x86 *c)
((c->x86_model == INTEL_FAM6_ATOM_GOLDMONT)))
set_cpu_bug(c, X86_BUG_MONITOR);
+ /*
+ * Intel chipsets prior to Nehalem used the ACPI processor_idle
+ * driver for C-state management. Some of these processors that
+ * used IOPORT based C-states could not guarantee that the STPCLK#
+ * signal gets asserted in time after the P_LVL2 read to freeze
+ * execution properly. Since a clear cut-off point is not known
+ * as to when this bug was solved, mark all the chipsets as
+ * being affected. Only the ones that use IOPORT based C-state
+ * transitions via the acpi_idle driver will be impacted.
+ */
+ set_cpu_bug(c, X86_BUG_STPCLK);
+
#ifdef CONFIG_X86_64
if (c->x86 == 15)
c->x86_cache_alignment = c->x86_clflush_size * 2;
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 16a1663d02d4..493f9ccdb72d 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -528,6 +528,14 @@ static int acpi_idle_bm_check(void)
static void wait_for_freeze(void)
{
#ifdef CONFIG_X86
+ /*
+ * A dummy wait operation is only required for those chipsets
+ * that cannot assert the STPCLK# signal in time after a P_LVL2 read.
+ * If a chipset is not affected by this problem, skip it.
+ */
+ if (!static_cpu_has_bug(X86_BUG_STPCLK))
+ return;
+
/* No delay is needed if we are in guest */
if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
return;
--
2.25.1
The quilt patch titled
Subject: x86/uaccess: avoid check_object_size() in copy_from_user_nmi()
has been removed from the -mm tree. Its filename was
x86-uaccess-avoid-check_object_size-in-copy_from_user_nmi.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Kees Cook <keescook(a)chromium.org>
Subject: x86/uaccess: avoid check_object_size() in copy_from_user_nmi()
Date: Mon, 19 Sep 2022 13:16:48 -0700
The check_object_size() helper under CONFIG_HARDENED_USERCOPY is designed
to skip any checks where the length is known at compile time as a
reasonable heuristic to avoid "likely known-good" cases. However, it can
only do this when the copy_*_user() helpers are, themselves, inline too.
Using find_vmap_area() requires taking a spinlock. The
check_object_size() helper can call find_vmap_area() when the destination
is in vmap memory. If show_regs() is called in interrupt context, it will
attempt a call to copy_from_user_nmi(), which may call check_object_size()
and then find_vmap_area(). If something in normal context happens to be
in the middle of calling find_vmap_area() (with the spinlock held), the
interrupt handler will hang forever.
The copy_from_user_nmi() call is actually being called with a fixed-size
length, so check_object_size() should never have been called in the first
place. Given the narrow constraints, just replace the
__copy_from_user_inatomic() call with an open-coded version that calls
only into the sanitizers and not check_object_size(), followed by a call
to raw_copy_from_user().
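Based on that description and on the bracketed note just below, the upstream
version of the open-coded copy presumably reads roughly as follows. This is a
sketch inferred from the changelog, not the literal upstream hunk; the -mm
backport below drops the instrumentation call because
instrument_copy_from_user() is not in that tree:

	pagefault_disable();
	/*
	 * Call only into the sanitizer hook, not check_object_size(), so
	 * nothing that can take the vmap spinlock is reachable from NMI
	 * context.
	 */
	instrument_copy_from_user(to, from, n);
	ret = raw_copy_from_user(to, from, n);
	pagefault_enable();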
[akpm(a)linux-foundation.org: no instrument_copy_from_user() in my tree...]
Link: https://lkml.kernel.org/r/20220919201648.2250764-1-keescook@chromium.org
Link: https://lore.kernel.org/all/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9…
Fixes: 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns")
Signed-off-by: Kees Cook <keescook(a)chromium.org>
Reported-by: Yu Zhao <yuzhao(a)google.com>
Reported-by: Florian Lehner <dev(a)der-flo.net>
Suggested-by: Andrew Morton <akpm(a)linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz(a)infradead.org>
Tested-by: Florian Lehner <dev(a)der-flo.net>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Josh Poimboeuf <jpoimboe(a)kernel.org>
Cc: Dave Hansen <dave.hansen(a)linux.intel.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/x86/lib/usercopy.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/arch/x86/lib/usercopy.c~x86-uaccess-avoid-check_object_size-in-copy_from_user_nmi
+++ a/arch/x86/lib/usercopy.c
@@ -44,7 +44,7 @@ copy_from_user_nmi(void *to, const void
* called from other contexts.
*/
pagefault_disable();
- ret = __copy_from_user_inatomic(to, from, n);
+ ret = raw_copy_from_user(to, from, n);
pagefault_enable();
return ret;
_
Patches currently in -mm which might be from keescook(a)chromium.org are
The quilt patch titled
Subject: mm/page_isolation: fix isolate_single_pageblock() isolation behavior
has been removed from the -mm tree. Its filename was
mm-page_isolation-fix-isolate_single_pageblock-isolation-behavior.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Zi Yan <ziy(a)nvidia.com>
Subject: mm/page_isolation: fix isolate_single_pageblock() isolation behavior
Date: Tue, 13 Sep 2022 22:39:13 -0400
set_migratetype_isolate() does not allow isolating MIGRATE_CMA pageblocks
unless it is used for a CMA allocation. isolate_single_pageblock() did not
have the same behavior when used together with set_migratetype_isolate()
in start_isolate_page_range(). This allows alloc_contig_range() with a
migratetype other than MIGRATE_CMA, like MIGRATE_MOVABLE (used by
alloc_contig_pages()), to isolate the first and last pageblocks but fail
on the rest. The failure leads to the migratetype of the first and last
pageblocks being changed from MIGRATE_CMA to MIGRATE_MOVABLE, corrupting
the CMA region. This can happen during gigantic page allocations.
As Doug said here:
https://lore.kernel.org/linux-mm/a3363a52-883b-dcd1-b77f-f2bb378d6f2d@gmail…,
for gigantic page allocations the user would notice no difference, since
the allocation in the CMA region fails just as it did before. But it
might hurt the performance of device drivers that use CMA, since the
usable CMA region size decreases.
Fix it by passing the migratetype into isolate_single_pageblock(), so that
set_migratetype_isolate(), as used by isolate_single_pageblock(), will
prevent the isolation from happening.
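For context, the behavior the fix relies on is, in essence, the guard below
inside set_migratetype_isolate(); this is a paraphrased sketch with a made-up
helper name, not the literal mm/page_isolation.c code:

	/*
	 * Sketch: a CMA pageblock may only be isolated when the requested
	 * migratetype is MIGRATE_CMA, i.e. when the caller is doing a CMA
	 * allocation. Non-CMA callers such as alloc_contig_pages()
	 * (MIGRATE_MOVABLE) are refused (set_migratetype_isolate() returns
	 * -EBUSY), which now also covers the first and last pageblocks once
	 * the migratetype is passed through isolate_single_pageblock().
	 */
	static bool cma_isolation_allowed(struct page *page, int migratetype)
	{
		return !(is_migrate_cma_page(page) && !is_migrate_cma(migratetype));
	}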
Link: https://lkml.kernel.org/r/20220914023913.1855924-1-zi.yan@sent.com
Fixes: b2c9e2fbba32 ("mm: make alloc_contig_range work at pageblock granularity")
Signed-off-by: Zi Yan <ziy(a)nvidia.com>
Reported-by: Doug Berger <opendmb(a)gmail.com>
Cc: David Hildenbrand <david(a)redhat.com>
Cc: Doug Berger <opendmb(a)gmail.com>
Cc: Mike Kravetz <mike.kravetz(a)oracle.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_isolation.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)
--- a/mm/page_isolation.c~mm-page_isolation-fix-isolate_single_pageblock-isolation-behavior
+++ a/mm/page_isolation.c
@@ -288,6 +288,7 @@ __first_valid_page(unsigned long pfn, un
* @isolate_before: isolate the pageblock before the boundary_pfn
* @skip_isolation: the flag to skip the pageblock isolation in second
* isolate_single_pageblock()
+ * @migratetype: migrate type to set in error recovery.
*
* Free and in-use pages can be as big as MAX_ORDER-1 and contain more than one
* pageblock. When not all pageblocks within a page are isolated at the same
@@ -302,9 +303,9 @@ __first_valid_page(unsigned long pfn, un
* the in-use page then splitting the free page.
*/
static int isolate_single_pageblock(unsigned long boundary_pfn, int flags,
- gfp_t gfp_flags, bool isolate_before, bool skip_isolation)
+ gfp_t gfp_flags, bool isolate_before, bool skip_isolation,
+ int migratetype)
{
- unsigned char saved_mt;
unsigned long start_pfn;
unsigned long isolate_pageblock;
unsigned long pfn;
@@ -328,13 +329,13 @@ static int isolate_single_pageblock(unsi
start_pfn = max(ALIGN_DOWN(isolate_pageblock, MAX_ORDER_NR_PAGES),
zone->zone_start_pfn);
- saved_mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
+ if (skip_isolation) {
+ int mt = get_pageblock_migratetype(pfn_to_page(isolate_pageblock));
- if (skip_isolation)
- VM_BUG_ON(!is_migrate_isolate(saved_mt));
- else {
- ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt, flags,
- isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
+ VM_BUG_ON(!is_migrate_isolate(mt));
+ } else {
+ ret = set_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype,
+ flags, isolate_pageblock, isolate_pageblock + pageblock_nr_pages);
if (ret)
return ret;
@@ -475,7 +476,7 @@ static int isolate_single_pageblock(unsi
failed:
/* restore the original migratetype */
if (!skip_isolation)
- unset_migratetype_isolate(pfn_to_page(isolate_pageblock), saved_mt);
+ unset_migratetype_isolate(pfn_to_page(isolate_pageblock), migratetype);
return -EBUSY;
}
@@ -537,7 +538,8 @@ int start_isolate_page_range(unsigned lo
bool skip_isolation = false;
/* isolate [isolate_start, isolate_start + pageblock_nr_pages) pageblock */
- ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false, skip_isolation);
+ ret = isolate_single_pageblock(isolate_start, flags, gfp_flags, false,
+ skip_isolation, migratetype);
if (ret)
return ret;
@@ -545,7 +547,8 @@ int start_isolate_page_range(unsigned lo
skip_isolation = true;
/* isolate [isolate_end - pageblock_nr_pages, isolate_end) pageblock */
- ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true, skip_isolation);
+ ret = isolate_single_pageblock(isolate_end, flags, gfp_flags, true,
+ skip_isolation, migratetype);
if (ret) {
unset_migratetype_isolate(pfn_to_page(isolate_start), migratetype);
return ret;
_
Patches currently in -mm which might be from ziy(a)nvidia.com are