The quilt patch titled
Subject: mm,hwpoison: check mm when killing accessing process
has been removed from the -mm tree. Its filename was
mmhwpoison-check-mm-when-killing-accessing-process.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Shuai Xue <xueshuai(a)linux.alibaba.com>
Subject: mm,hwpoison: check mm when killing accessing process
Date: Wed, 14 Sep 2022 14:49:35 +0800
The GHES code calls memory_failure_queue() from IRQ context to queue work
onto a workqueue and schedule it on the current CPU. The work is then
processed by a kworker in memory_failure_work_func(), which calls
memory_failure().
When a page is already poisoned, commit a3f5d80ea401 ("mm,hwpoison: send
SIGBUS with error virutal address") makes memory_failure() call
kill_accessing_process(), which:
- holds the mmap lock of current->mm
- walks the page table to find the error virtual address
- and sends SIGBUS to the current process with the error info.
However, the kworker's mm is not valid, resulting in a null-pointer
dereference. So check the mm when killing the accessing process.
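For illustration only, a minimal userspace sketch (stub types, not kernel
code) of the guard this patch adds: a kworker-style task has no user mm,
so kill_accessing_process() must bail out before touching p->mm:
#include <stdio.h>
#include <stddef.h>

struct mm_stub { int dummy; };              /* stand-in for struct mm_struct */
struct task_stub { struct mm_stub *mm; };   /* stand-in for struct task_struct */

static int kill_accessing_process_stub(struct task_stub *p)
{
	if (!p->mm)          /* the added check: kworkers have no user mm */
		return -14;  /* -EFAULT */
	/* the real code then takes mmap_read_lock(p->mm) and walks page tables */
	return 0;
}

int main(void)
{
	struct mm_stub mm;
	struct task_stub kworker = { .mm = NULL };   /* kernel thread */
	struct task_stub user_task = { .mm = &mm };  /* user process  */

	printf("kworker   -> %d\n", kill_accessing_process_stub(&kworker));
	printf("user task -> %d\n", kill_accessing_process_stub(&user_task));
	return 0;
}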
[akpm(a)linux-foundation.org: remove unrelated whitespace alteration]
Link: https://lkml.kernel.org/r/20220914064935.7851-1-xueshuai@linux.alibaba.com
Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address")
Signed-off-by: Shuai Xue <xueshuai(a)linux.alibaba.com>
Reviewed-by: Miaohe Lin <linmiaohe(a)huawei.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Cc: Huang Ying <ying.huang(a)intel.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Bixuan Cui <cuibixuan(a)linux.alibaba.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory-failure.c | 3 +++
1 file changed, 3 insertions(+)
--- a/mm/memory-failure.c~mmhwpoison-check-mm-when-killing-accessing-process
+++ a/mm/memory-failure.c
@@ -745,6 +745,9 @@ static int kill_accessing_process(struct
};
priv.tk.tsk = p;
+ if (!p->mm)
+ return -EFAULT;
+
mmap_read_lock(p->mm);
ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwp_walk_ops,
(void *)&priv);
_
Patches currently in -mm which might be from xueshuai(a)linux.alibaba.com are
The quilt patch titled
Subject: mm/hugetlb: correct demote page offset logic
has been removed from the -mm tree. Its filename was
mm-hugetlb-correct-demote-page-offset-logic.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Doug Berger <opendmb(a)gmail.com>
Subject: mm/hugetlb: correct demote page offset logic
Date: Wed, 14 Sep 2022 12:09:17 -0700
With gigantic pages it may not be true that struct page structures are
contiguous across the entire gigantic page. The nth_page macro is used
here in place of direct pointer arithmetic to correct for this.
Mike said:
: This error could cause addressing exceptions. However, this is only
: possible in configurations where CONFIG_SPARSEMEM &&
: !CONFIG_SPARSEMEM_VMEMMAP. Such a configuration option is rare and
: unknown to be the default anywhere.
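As a rough illustration (userspace arithmetic only, assuming 4 KiB base
pages, 1 GiB gigantic pages, 2 MiB demote targets and 128 MiB sparsemem
sections, i.e. x86_64-like defaults), the demote loop's index quickly
exceeds one section's worth of struct pages, which is where "page + i"
and nth_page(page, i) can diverge:
#include <stdio.h>

int main(void)
{
	const unsigned long base_page   = 4UL << 10;    /* 4 KiB   */
	const unsigned long giga_page   = 1UL << 30;    /* 1 GiB   */
	const unsigned long target_page = 2UL << 20;    /* 2 MiB   */
	const unsigned long section     = 128UL << 20;  /* 128 MiB */

	unsigned long pages_per_giga    = giga_page / base_page;   /* 262144 */
	unsigned long stride            = target_page / base_page; /* 512    */
	unsigned long pages_per_section = section / base_page;     /* 32768  */

	printf("demote loop iterations : %lu\n", pages_per_giga / stride);
	printf("first i past section 0 : %lu (iteration %lu)\n",
	       pages_per_section, pages_per_section / stride);
	/*
	 * Once i >= pages_per_section, "page + i" may walk off the memmap
	 * chunk backing the first section, while nth_page(page, i)
	 * re-resolves the pfn and stays correct.
	 */
	return 0;
}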
Link: https://lkml.kernel.org/r/20220914190917.3517663-1-opendmb@gmail.com
Fixes: 8531fc6f52f5 ("hugetlb: add hugetlb demote page support")
Signed-off-by: Doug Berger <opendmb(a)gmail.com>
Reviewed-by: Mike Kravetz <mike.kravetz(a)oracle.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/hugetlb.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
--- a/mm/hugetlb.c~mm-hugetlb-correct-demote-page-offset-logic
+++ a/mm/hugetlb.c
@@ -3420,6 +3420,7 @@ static int demote_free_huge_page(struct
{
int i, nid = page_to_nid(page);
struct hstate *target_hstate;
+ struct page *subpage;
int rc = 0;
target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
@@ -3453,15 +3454,16 @@ static int demote_free_huge_page(struct
mutex_lock(&target_hstate->resize_lock);
for (i = 0; i < pages_per_huge_page(h);
i += pages_per_huge_page(target_hstate)) {
+ subpage = nth_page(page, i);
if (hstate_is_gigantic(target_hstate))
- prep_compound_gigantic_page_for_demote(page + i,
+ prep_compound_gigantic_page_for_demote(subpage,
target_hstate->order);
else
- prep_compound_page(page + i, target_hstate->order);
- set_page_private(page + i, 0);
- set_page_refcounted(page + i);
- prep_new_huge_page(target_hstate, page + i, nid);
- put_page(page + i);
+ prep_compound_page(subpage, target_hstate->order);
+ set_page_private(subpage, 0);
+ set_page_refcounted(subpage);
+ prep_new_huge_page(target_hstate, subpage, nid);
+ put_page(subpage);
}
mutex_unlock(&target_hstate->resize_lock);
_
Patches currently in -mm which might be from opendmb(a)gmail.com are
The quilt patch titled
Subject: mm: prevent page_frag_alloc() from corrupting the memory
has been removed from the -mm tree. Its filename was
mm-prevent-page_frag_alloc-from-corrupting-the-memory.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Maurizio Lombardi <mlombard(a)redhat.com>
Subject: mm: prevent page_frag_alloc() from corrupting the memory
Date: Fri, 15 Jul 2022 14:50:13 +0200
A number of drivers call page_frag_alloc() with a fragment's size >
PAGE_SIZE.
In low memory conditions, __page_frag_cache_refill() may fail the order-3
cache allocation and fall back to order 0; in this case, the cache
will be smaller than the fragment, causing memory corruption.
Prevent this from happening by checking if the newly allocated cache is
large enough for the fragment; if not, the allocation will fail and
page_frag_alloc() will return NULL.
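A small userspace illustration (assumed values: 4 KiB PAGE_SIZE, the usual
32 KiB order-3 frag cache, an 8 KiB fragment) of how the order-0 fallback
drives the offset negative, which is exactly what the added check catches:
#include <stdio.h>

int main(void)
{
	long page_size = 4096;
	long cache_max = 32768;   /* PAGE_FRAG_CACHE_MAX_SIZE (order 3) */
	long fragsz    = 8192;    /* a fragment larger than PAGE_SIZE   */

	/* normal case: the order-3 refill succeeds */
	long offset = cache_max - fragsz;
	printf("order-3 cache: offset = %ld\n", offset);   /* 24576, fine */

	/* low-memory fallback: the refill only got an order-0 page */
	offset = page_size - fragsz;
	printf("order-0 cache: offset = %ld\n", offset);   /* -4096 */
	if (offset < 0)
		printf("-> would write before the cache page; "
		       "patched code returns NULL instead\n");
	return 0;
}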
Link: https://lkml.kernel.org/r/20220715125013.247085-1-mlombard@redhat.com
Fixes: b63ae8ca096d ("mm/net: Rename and move page fragment handling from net/ to mm/")
Signed-off-by: Maurizio Lombardi <mlombard(a)redhat.com>
Reviewed-by: Alexander Duyck <alexanderduyck(a)fb.com>
Cc: Chen Lin <chen45464546(a)163.com>
Cc: Jakub Kicinski <kuba(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/page_alloc.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
--- a/mm/page_alloc.c~mm-prevent-page_frag_alloc-from-corrupting-the-memory
+++ a/mm/page_alloc.c
@@ -5740,6 +5740,18 @@ refill:
/* reset page count bias and offset to start of new frag */
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
offset = size - fragsz;
+ if (unlikely(offset < 0)) {
+ /*
+ * The caller is trying to allocate a fragment
+ * with fragsz > PAGE_SIZE but the cache isn't big
+ * enough to satisfy the request, this may
+ * happen in low memory conditions.
+ * We don't release the cache page because
+ * it could make memory pressure worse
+ * so we simply return NULL here.
+ */
+ return NULL;
+ }
}
nc->pagecnt_bias--;
_
Patches currently in -mm which might be from mlombard(a)redhat.com are
The quilt patch titled
Subject: mm: bring back update_mmu_cache() to finish_fault()
has been removed from the -mm tree. Its filename was
mm-bring-back-update_mmu_cache-to-finish_fault.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Sergei Antonov <saproj(a)gmail.com>
Subject: mm: bring back update_mmu_cache() to finish_fault()
Date: Thu, 8 Sep 2022 23:48:09 +0300
Running this test program on ARMv4 a few times (sometimes just once)
reproduces the bug.
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SIZE 0x1000 /* assumed here; the original report does not show the value */

int main()
{
	unsigned i;
	char paragon[SIZE];
	void *ptr;

	memset(paragon, 0xAA, SIZE);
	ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		   MAP_ANON | MAP_SHARED, -1, 0);
	if (ptr == MAP_FAILED)
		return 1;
	printf("ptr = %p\n", ptr);
	for (i = 0; i < 10000; i++) {
		memset(ptr, 0xAA, SIZE);
		if (memcmp(ptr, paragon, SIZE)) {
			printf("Unexpected bytes on iteration %u!!!\n", i);
			break;
		}
	}
	munmap(ptr, SIZE);
	return 0;
}
In the "ptr" buffer there appear runs of zero bytes which are aligned
by 16 and their lengths are multiple of 16.
Linux v5.11 does not have the bug, "git bisect" finds the first bad commit:
f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
Before the commit, update_mmu_cache() was called during a call to
filemap_map_pages() as well as in finish_fault(). After the commit,
finish_fault() lacks it.
Bring back update_mmu_cache() to finish_fault() to fix the bug.
Also call update_mmu_tlb() only when returning VM_FAULT_NOPAGE to more
closely reproduce the code of the alloc_set_pte() function that existed
before the commit.
On many platforms update_mmu_cache() is a no-op:
x86, see arch/x86/include/asm/pgtable
ARMv6+, see arch/arm/include/asm/tlbflush.h
So, it seems, few users ran into this bug.
Link: https://lkml.kernel.org/r/20220908204809.2012451-1-saproj@gmail.com
Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
Signed-off-by: Sergei Antonov <saproj(a)gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/memory.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
--- a/mm/memory.c~mm-bring-back-update_mmu_cache-to-finish_fault
+++ a/mm/memory.c
@@ -4386,14 +4386,20 @@ vm_fault_t finish_fault(struct vm_fault
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
- ret = 0;
+
/* Re-check under ptl */
- if (likely(!vmf_pte_changed(vmf)))
+ if (likely(!vmf_pte_changed(vmf))) {
do_set_pte(vmf, page, vmf->address);
- else
+
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, vmf->address, vmf->pte);
+
+ ret = 0;
+ } else {
+ update_mmu_tlb(vma, vmf->address, vmf->pte);
ret = VM_FAULT_NOPAGE;
+ }
- update_mmu_tlb(vma, vmf->address, vmf->pte);
pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
}
_
Patches currently in -mm which might be from saproj(a)gmail.com are
The quilt patch titled
Subject: mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
has been removed from the -mm tree. Its filename was
mm-huge_memory-use-pfn_to_online_page-in-split_huge_pages_all.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Subject: mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
Date: Thu, 8 Sep 2022 13:11:50 +0900
A NULL pointer dereference is triggered when calling THP split via debugfs
on a system with offlined memory blocks. With the debug option enabled, the
following kernel messages are printed out:
page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: unmovable page
page:000000007d7ab72e is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:1248!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
...
RIP: 0010:split_huge_pages_write+0xcf4/0xe30
This shows that page_to_nid() in page_zone() is unexpectedly called for an
offlined memmap.
Use pfn_to_online_page() to get the struct page in the PFN walker.
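For illustration, a userspace sketch with hypothetical stand-ins (not the
kernel API) of the walker pattern the patch switches to: the lookup returns
NULL for offline memmap and the walker simply skips those PFNs instead of
dereferencing an uninitialized struct page:
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_page { int nid; bool online; };

static struct fake_page memmap[8] = {
	{0, true}, {0, true}, {0, false}, {0, false},  /* pfns 2-3: offlined */
	{1, true}, {1, true}, {1, true},  {1, true},
};

/* analogue of pfn_to_online_page(): NULL when the memmap is not online */
static struct fake_page *fake_pfn_to_online_page(unsigned long pfn)
{
	if (pfn >= 8 || !memmap[pfn].online)
		return NULL;
	return &memmap[pfn];
}

int main(void)
{
	for (unsigned long pfn = 0; pfn < 8; pfn++) {
		struct fake_page *page = fake_pfn_to_online_page(pfn);

		if (!page) {
			printf("pfn %lu: offline, skipped\n", pfn);
			continue;
		}
		/* only now is it safe to look at page state, e.g. its node */
		printf("pfn %lu: node %d\n", pfn, page->nid);
	}
	return 0;
}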
Link: https://lkml.kernel.org/r/20220908041150.3430269-1-naoya.horiguchi@linux.dev
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
Signed-off-by: Naoya Horiguchi <naoya.horiguchi(a)nec.com>
Co-developed-by: David Hildenbrand <david(a)redhat.com>
Signed-off-by: David Hildenbrand <david(a)redhat.com>
Reviewed-by: Yang Shi <shy828301(a)gmail.com>
Acked-by: Michal Hocko <mhocko(a)suse.com>
Reviewed-by: Miaohe Lin <linmiaohe(a)huawei.com>
Reviewed-by: Oscar Salvador <osalvador(a)suse.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov(a)linux.intel.com>
Cc: Matthew Wilcox <willy(a)infradead.org>
Cc: Muchun Song <songmuchun(a)bytedance.com>
Cc: <stable(a)vger.kernel.org> [5.10+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/huge_memory.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
--- a/mm/huge_memory.c~mm-huge_memory-use-pfn_to_online_page-in-split_huge_pages_all
+++ a/mm/huge_memory.c
@@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
max_zone_pfn = zone_end_pfn(zone);
for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
int nr_pages;
- if (!pfn_valid(pfn))
- continue;
- page = pfn_to_page(pfn);
- if (!get_page_unless_zero(page))
+ page = pfn_to_online_page(pfn);
+ if (!page || !get_page_unless_zero(page))
continue;
if (zone != page_zone(page))
_
Patches currently in -mm which might be from naoya.horiguchi(a)nec.com are
mmhwpoisonhugetlbmemory_hotplug-hotremove-memory-section-with-hwpoisoned-hugepage.patch
mm-hwpoison-move-definitions-of-num_poisoned_pages_-to-memory-failurec.patch
mm-hwpoison-pass-pfn-to-num_poisoned_pages_.patch
mm-hwpoison-introduce-per-memory_block-hwpoison-counter.patch
The quilt patch titled
Subject: mm: fix madvise_pageout mishandling on non-LRU page
has been removed from the -mm tree. Its filename was
mm-fix-madivse_pageout-mishandling-on-non-lru-page.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Minchan Kim <minchan(a)kernel.org>
Subject: mm: fix madvise_pageout mishandling on non-LRU page
Date: Thu, 8 Sep 2022 08:12:04 -0700
MADV_PAGEOUT tries to isolate non-LRU pages and gets a warning from
isolate_lru_page below.
Fix it by checking PageLRU in advance.
------------[ cut here ]------------
trying to isolate tail page
WARNING: CPU: 0 PID: 6175 at mm/folio-compat.c:158 isolate_lru_page+0x130/0x140
Modules linked in:
CPU: 0 PID: 6175 Comm: syz-executor.0 Not tainted 5.18.12 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:isolate_lru_page+0x130/0x140
Link: https://lore.kernel.org/linux-mm/485f8c33.2471b.182d5726afb.Coremail.hantia…
Link: https://lkml.kernel.org/r/20220908151204.762596-1-minchan@kernel.org
Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
Signed-off-by: Minchan Kim <minchan(a)kernel.org>
Reported-by: hantianshuo <hantianshuo(a)iie.ac.cn>
Suggested-by: Yang Shi <shy828301(a)gmail.com>
Acked-by: Yang Shi <shy828301(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/madvise.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
--- a/mm/madvise.c~mm-fix-madivse_pageout-mishandling-on-non-lru-page
+++ a/mm/madvise.c
@@ -451,8 +451,11 @@ regular_page:
continue;
}
- /* Do not interfere with other mappings of this page */
- if (page_mapcount(page) != 1)
+ /*
+ * Do not interfere with other mappings of this page and
+ * non-LRU page.
+ */
+ if (!PageLRU(page) || page_mapcount(page) != 1)
continue;
VM_BUG_ON_PAGE(PageTransCompound(page), page);
_
Patches currently in -mm which might be from minchan(a)kernel.org are
The quilt patch titled
Subject: powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush
has been removed from the -mm tree. Its filename was
powerpc-64s-radix-dont-need-to-broadcast-ipi-for-radix-pmd-collapse-flush.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Yang Shi <shy828301(a)gmail.com>
Subject: powerpc/64s/radix: don't need to broadcast IPI for radix pmd collapse flush
Date: Wed, 7 Sep 2022 11:01:44 -0700
The IPI broadcast is used to serialize against fast-GUP, but fast-GUP will
move to using RCU instead of disabling local interrupts. Using an IPI is
the old-style way of serializing against fast-GUP, although it still works
as expected now.
Fast-GUP now fixes the potential race with THP collapse by checking whether
the PMD has changed, so the IPI broadcast in the radix pmd collapse flush is
no longer necessary. It is still needed for the hash TLB, however.
Link: https://lkml.kernel.org/r/20220907180144.555485-2-shy828301@gmail.com
Suggested-by: Aneesh Kumar K.V <aneesh.kumar(a)linux.ibm.com>
Signed-off-by: Yang Shi <shy828301(a)gmail.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Peter Xu <peterx(a)redhat.com>
Cc: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: John Hubbard <jhubbard(a)nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
arch/powerpc/mm/book3s64/radix_pgtable.c | 9 ---------
1 file changed, 9 deletions(-)
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c~powerpc-64s-radix-dont-need-to-broadcast-ipi-for-radix-pmd-collapse-flush
+++ a/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -937,15 +937,6 @@ pmd_t radix__pmdp_collapse_flush(struct
pmd = *pmdp;
pmd_clear(pmdp);
- /*
- * pmdp collapse_flush need to ensure that there are no parallel gup
- * walk after this call. This is needed so that we can have stable
- * page ref count when collapsing a page. We don't allow a collapse page
- * if we have gup taken on the page. We can ensure that by sending IPI
- * because gup walk happens with IRQ disabled.
- */
- serialize_against_pte_lookup(vma->vm_mm);
-
radix__flush_tlb_collapsed_pmd(vma->vm_mm, address);
return pmd;
_
Patches currently in -mm which might be from shy828301(a)gmail.com are
mm-madv_collapse-refetch-vm_end-after-reacquiring-mmap_lock.patch
The quilt patch titled
Subject: mm: gup: fix the fast GUP race against THP collapse
has been removed from the -mm tree. Its filename was
mm-gup-fix-the-fast-gup-race-against-thp-collapse.patch
This patch was dropped because it was merged into the mm-hotfixes-stable branch
of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
------------------------------------------------------
From: Yang Shi <shy828301(a)gmail.com>
Subject: mm: gup: fix the fast GUP race against THP collapse
Date: Wed, 7 Sep 2022 11:01:43 -0700
Since general RCU GUP fast was introduced in commit 2667f50e8b81 ("mm:
introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
sufficient to handle concurrent GUP-fast in all cases; it only handles
traditional IPI-based GUP-fast correctly. On architectures that send an
IPI broadcast on TLB flush, it works as expected. But on architectures
that do not use an IPI to broadcast the TLB flush, the following race is
possible:
             CPU A                                          CPU B
THP collapse                                     fast GUP
                                              gup_pmd_range() <-- see valid pmd
                                              gup_pte_range() <-- work on pte
pmdp_collapse_flush() <-- clear pmd and flush
__collapse_huge_page_isolate()
    check page pinned <-- before GUP bump refcount
                                              pin the page
                                              check PTE <-- no change
__collapse_huge_page_copy()
    copy data to huge page
ptep_clear()
install huge pmd for the huge page
                                              return the stale page
discard the stale page
The race can be fixed by checking whether the PMD has changed after taking
the page pin in fast GUP, just like what is already done for the PTE. If
the PMD has changed, there may be a parallel THP collapse in progress, so
GUP should back off.
Also update the stale comment about serializing against fast GUP in
khugepaged.
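A single-threaded userspace sketch (plain variables, not kernel code) of the
ordering fast GUP relies on after this fix: snapshot the pmd and pte, pin
the page, then re-read both and back off if either changed, as it would
after pmdp_collapse_flush(). Only GUP's side of the race is modelled; the
"collapse" is simulated inline:
#include <stdio.h>
#include <stdbool.h>

static unsigned long pmd_entry = 0x1000; /* pretend: points to a pte table */
static unsigned long pte_entry = 0x2000; /* pretend: maps the target page  */
static int page_refcount = 1;

static bool fastgup_pin(bool collapse_races_with_us)
{
	unsigned long pmd_snap = pmd_entry;
	unsigned long pte_snap = pte_entry;

	page_refcount++;                        /* (1) pin the page            */

	if (collapse_races_with_us)             /* THP collapse on another CPU */
		pmd_entry = 0;                  /* pmdp_collapse_flush()       */

	if (pmd_snap != pmd_entry || pte_snap != pte_entry) {
		page_refcount--;                /* (2) recheck failed: back off */
		return false;
	}
	return true;                            /* (2) recheck passed: pinned  */
}

int main(void)
{
	printf("no race  -> %s\n", fastgup_pin(false) ? "pinned" : "backed off");
	printf("collapse -> %s\n", fastgup_pin(true)  ? "pinned" : "backed off");
	return 0;
}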
Link: https://lkml.kernel.org/r/20220907180144.555485-1-shy828301@gmail.com
Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()")
Acked-by: David Hildenbrand <david(a)redhat.com>
Acked-by: Peter Xu <peterx(a)redhat.com>
Signed-off-by: Yang Shi <shy828301(a)gmail.com>
Reviewed-by: John Hubbard <jhubbard(a)nvidia.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar(a)linux.ibm.com>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Jason Gunthorpe <jgg(a)nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov(a)linux.intel.com>
Cc: Michael Ellerman <mpe(a)ellerman.id.au>
Cc: Nicholas Piggin <npiggin(a)gmail.com>
Cc: Christophe Leroy <christophe.leroy(a)csgroup.eu>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/gup.c | 34 ++++++++++++++++++++++++++++------
mm/khugepaged.c | 10 ++++++----
2 files changed, 34 insertions(+), 10 deletions(-)
--- a/mm/gup.c~mm-gup-fix-the-fast-gup-race-against-thp-collapse
+++ a/mm/gup.c
@@ -2357,8 +2357,28 @@ static void __maybe_unused undo_dev_page
}
#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
-static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
- unsigned int flags, struct page **pages, int *nr)
+/*
+ * Fast-gup relies on pte change detection to avoid concurrent pgtable
+ * operations.
+ *
+ * To pin the page, fast-gup needs to do below in order:
+ * (1) pin the page (by prefetching pte), then (2) check pte not changed.
+ *
+ * For the rest of pgtable operations where pgtable updates can be racy
+ * with fast-gup, we need to do (1) clear pte, then (2) check whether page
+ * is pinned.
+ *
+ * Above will work for all pte-level operations, including THP split.
+ *
+ * For THP collapse, it's a bit more complicated because fast-gup may be
+ * walking a pgtable page that is being freed (pte is still valid but pmd
+ * can be cleared already). To avoid race in such condition, we need to
+ * also check pmd here to make sure pmd doesn't change (corresponds to
+ * pmdp_collapse_flush() in the THP collapse code path).
+ */
+static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+ unsigned long end, unsigned int flags,
+ struct page **pages, int *nr)
{
struct dev_pagemap *pgmap = NULL;
int nr_start = *nr, ret = 0;
@@ -2404,7 +2424,8 @@ static int gup_pte_range(pmd_t pmd, unsi
goto pte_unmap;
}
- if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+ if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
+ unlikely(pte_val(pte) != pte_val(*ptep))) {
gup_put_folio(folio, 1, flags);
goto pte_unmap;
}
@@ -2451,8 +2472,9 @@ pte_unmap:
* get_user_pages_fast_only implementation that can pin pages. Thus it's still
* useful to have gup_huge_pmd even if we can't operate on ptes.
*/
-static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
- unsigned int flags, struct page **pages, int *nr)
+static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
+ unsigned long end, unsigned int flags,
+ struct page **pages, int *nr)
{
return 0;
}
@@ -2776,7 +2798,7 @@ static int gup_pmd_range(pud_t *pudp, pu
if (!gup_huge_pd(__hugepd(pmd_val(pmd)), addr,
PMD_SHIFT, next, flags, pages, nr))
return 0;
- } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
+ } else if (!gup_pte_range(pmd, pmdp, addr, next, flags, pages, nr))
return 0;
} while (pmdp++, addr = next, addr != end);
--- a/mm/khugepaged.c~mm-gup-fix-the-fast-gup-race-against-thp-collapse
+++ a/mm/khugepaged.c
@@ -1083,10 +1083,12 @@ static void collapse_huge_page(struct mm
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
- * After this gup_fast can't run anymore. This also removes
- * any huge TLB entry from the CPU so we won't allow
- * huge and small TLB entries for the same virtual address
- * to avoid the risk of CPU bugs in that area.
+ * This removes any huge TLB entry from the CPU so we won't allow
+ * huge and small TLB entries for the same virtual address to
+ * avoid the risk of CPU bugs in that area.
+ *
+ * Parallel fast GUP is fine since fast GUP will back off when
+ * it detects PMD is changed.
*/
_pmd = pmdp_collapse_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
_
Patches currently in -mm which might be from shy828301(a)gmail.com are
mm-madv_collapse-refetch-vm_end-after-reacquiring-mmap_lock.patch
One of the side effects of mb_optimize_scan was that the optimized
functions for selecting the next group to try were called even before we
tried the goal group. As a result, we no longer allocate files close to
their corresponding inodes, nor do we try to expand the currently
allocated extent in the same group. This results in a reaim regression
with the workfile.disk workload of up to 8% with many clients on my test
machine:
                       baseline               mb_optimize_scan
Hmean  disk-1        2114.16 (  0.00%)       2099.37 ( -0.70%)
Hmean  disk-41      87794.43 (  0.00%)      83787.47 * -4.56%*
Hmean  disk-81     148170.73 (  0.00%)     135527.05 * -8.53%*
Hmean  disk-121    177506.11 (  0.00%)     166284.93 * -6.32%*
Hmean  disk-161    220951.51 (  0.00%)     207563.39 * -6.06%*
Hmean  disk-201    208722.74 (  0.00%)     203235.59 ( -2.63%)
Hmean  disk-241    222051.60 (  0.00%)     217705.51 ( -1.96%)
Hmean  disk-281    252244.17 (  0.00%)     241132.72 * -4.41%*
Hmean  disk-321    255844.84 (  0.00%)     245412.84 * -4.08%*
This also causes a huge regression (time increased by a factor of 5 or so)
when untarring an archive with lots of small files on some eMMC storage
cards.
Fix the problem by making sure we try the goal group first.
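A toy userspace sketch (stub chooser, not the real mballoc code) of the loop
restructuring: previously the group chooser ran before the first scan, so
the goal group could be skipped; now the goal group is scanned first and the
chooser only selects the following groups:
#include <stdio.h>

/* stand-in for ext4_mb_choose_next_group(): jump to some "optimized" group */
static void choose_next_group(int *group)
{
	*group = (*group + 3) % 8;
}

int main(void)
{
	int goal = 5, group;

	/* old order: chooser runs before each scan, so the goal group is skipped */
	printf("before fix:");
	group = goal;
	for (int i = 0; i < 4; i++) {
		choose_next_group(&group);
		printf(" scan %d", group);
	}
	printf("   (goal was %d)\n", goal);

	/* new order: scan the goal group first, chooser only picks the next one */
	printf("after fix: ");
	group = goal;
	for (int i = 0; i < 4; i++) {
		printf(" scan %d", group);
		choose_next_group(&group);
	}
	printf("   (goal was %d)\n", goal);
	return 0;
}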
Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable(a)vger.kernel.org
Reported-and-tested-by: Stefan Wahren <stefan.wahren(a)i2se.com>
Tested-by: Ojaswin Mujoo <ojaswin(a)linux.ibm.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list(a)gmail.com>
Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/
Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
Signed-off-by: Jan Kara <jack(a)suse.cz>
---
fs/ext4/mballoc.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index bd8f8b5c3d30..41e1cfecac3b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1049,8 +1049,10 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
{
*new_cr = ac->ac_criteria;
- if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining)
+ if (!should_optimize_scan(ac) || ac->ac_groups_linear_remaining) {
+ *group = next_linear_group(ac, *group, ngroups);
return;
+ }
if (*new_cr == 0) {
ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
@@ -2636,7 +2638,7 @@ static noinline_for_stack int
ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
{
ext4_group_t prefetch_grp = 0, ngroups, group, i;
- int cr = -1;
+ int cr = -1, new_cr;
int err = 0, first_err = 0;
unsigned int nr = 0, prefetch_ios = 0;
struct ext4_sb_info *sbi;
@@ -2711,13 +2713,11 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
ac->ac_groups_linear_remaining = sbi->s_mb_max_linear_groups;
prefetch_grp = group;
- for (i = 0; i < ngroups; group = next_linear_group(ac, group, ngroups),
- i++) {
- int ret = 0, new_cr;
+ for (i = 0, new_cr = cr; i < ngroups; i++,
+ ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
+ int ret = 0;
cond_resched();
-
- ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups);
if (new_cr != cr) {
cr = new_cr;
goto repeat;
--
2.35.3