June 2025 - Linux-stable-mirror

[PATCH v2 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hung

by Kairui Song

From: Kairui Song <kasong(a)tencent.com>

The current swap-in code assumes that, when a swap entry in a shmem
mapping is order 0, its cached folios (if present) must be order 0 too,
which turns out to be not always correct.

The problem is that shmem_split_large_entry is called before verifying
that the folio will eventually be swapped in. One possible race is:

    CPU1                          CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
  folio = swap_cache_get_folio
  /* folio = NULL */
  order = xa_get_order
  /* order > 0 */
  folio = shmem_swap_alloc_folio
  /* mTHP alloc failure, folio = NULL */
  <... Interrupted ...>
                                 shmem_swapin_folio
                                 /* S1 is swapped in */
                                 shmem_writeout
                                 /* S1 is swapped out, folio cached */
  shmem_split_large_entry(..., S1)
  /* S1 is split, but the folio covering it has order > 0 now */

Now any following swapin of S1 will hang: `xa_get_order` returns 0, and
folio lookup will return a folio with order > 0. The
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check will
always be true, causing swap-in to return -EEXIST.

This looks fragile. Fix it by allowing a larger folio to be seen in the
swap cache, and by checking that the whole shmem mapping range covered by
the swapin has the right swap value when inserting the folio. Also drop
the redundant tree walks before the insertion.

This actually improves performance, as it avoids two redundant XArray
tree walks in the hot path. The only side effect is that, in the failure
path, shmem may redundantly reallocate a few folios, causing temporary
slight memory pressure.

It is worth noting that it may seem the order and value check before
inserting might help reduce lock contention, which is not true. The
swap cache layer ensures that a raced swapin will either see a swap cache
folio or fail to do a swapin (we have the SWAP_HAS_CACHE bit even if the
swap cache is bypassed), so holding the folio lock and checking the folio
flag is already good enough to avoid the lock contention. The chance
that a folio passes the swap entry value check but the shmem mapping slot
has changed should be very low.

Cc: stable(a)vger.kernel.org
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <kasong(a)tencent.com>
Reviewed-by: Kemeng Shi <shikemeng(a)huaweicloud.com>
---
 mm/shmem.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index eda35be2a8d9..4e7ef343a29b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
 				   pgoff_t index, void *expected, gfp_t gfp)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
-	long nr = folio_nr_pages(folio);
+	unsigned long nr = folio_nr_pages(folio);
+	swp_entry_t iter, swap;
+	void *entry;
 
 	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
 
 	gfp &= GFP_RECLAIM_MASK;
 	folio_throttle_swaprate(folio, gfp);
+	swap = iter = radix_to_swp_entry(expected);
 
 	do {
 		xas_lock_irq(&xas);
-		if (expected != xas_find_conflict(&xas)) {
-			xas_set_err(&xas, -EEXIST);
-			goto unlock;
+		xas_for_each_conflict(&xas, entry) {
+			/*
+			 * The range must either be empty, or filled with
+			 * expected swap entries. Shmem swap entries are never
+			 * partially freed without split of both entry and
+			 * folio, so there shouldn't be any holes.
+			 */
+			if (!expected || entry != swp_to_radix_entry(iter)) {
+				xas_set_err(&xas, -EEXIST);
+				goto unlock;
+			}
+			iter.val += 1 << xas_get_order(&xas);
 		}
-		if (expected && xas_find_conflict(&xas)) {
+		if (expected && iter.val - nr != swap.val) {
 			xas_set_err(&xas, -EEXIST);
 			goto unlock;
 		}
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			error = -ENOMEM;
 			goto failed;
 		}
-	} else if (order != folio_order(folio)) {
+	} else if (order > folio_order(folio)) {
 		/*
 		 * Swap readahead may swap in order 0 folios into swapcache
 		 * asynchronously, while the shmem mapping can still stores
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 
 			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
 		}
+	} else if (order < folio_order(folio)) {
+		swap.val = round_down(swp_type(swap), folio_order(folio));
 	}
 
 alloced:
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
-	    folio->swap.val != swap.val ||
-	    !shmem_confirm_swap(mapping, index, swap) ||
-	    xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+	    folio->swap.val != swap.val) {
 		error = -EEXIST;
 		goto unlock;
 	}
-- 
2.50.0
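The range check this patch adds to shmem_add_to_page_cache() can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the kernel code: struct slot_entry and range_matches_swap() are hypothetical names, and a plain array stands in for the XArray walk done by xas_for_each_conflict(). It shows only the arithmetic, namely that every entry must hold the next expected swap value and that the entries together must cover exactly nr slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one multi-index entry found in the range. */
struct slot_entry {
	unsigned long swap_val;	/* swap value stored in the entry */
	unsigned int order;	/* the entry covers 1 << order slots */
};

/*
 * Return true if the entries hold consecutive swap values starting at
 * first_val and together cover exactly nr slots (no holes, no overrun).
 */
static bool range_matches_swap(const struct slot_entry *entries, size_t n,
			       unsigned long first_val, unsigned long nr)
{
	unsigned long iter = first_val;
	size_t i;

	assert(nr > 0);
	for (i = 0; i < n; i++) {
		/* Each entry must carry the value expected at its offset. */
		if (entries[i].swap_val != iter)
			return false;
		iter += 1UL << entries[i].order;
	}
	/* Mirrors the patch's "iter.val - nr != swap.val" hole check. */
	return iter - nr == first_val;
}
```

With nr = 4 and a starting value of 64, two order-1 entries holding 64 and 66 pass, as does a single order-2 entry holding 64. A lone order-1 entry (a hole) or a second entry holding 70 is rejected.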


[PATCH v2] lib/alloc_tag: do not acquire non-existent lock in alloc_tag_top_users()

by Harry Yoo

alloc_tag_top_users() attempts to lock alloc_tag_cttype->mod_lock even
when the alloc_tag_cttype is not allocated because:

  1) alloc tagging is disabled because mem profiling is disabled
     (!alloc_tag_cttype)
  2) alloc tagging is enabled, but not yet initialized (!alloc_tag_cttype)
  3) alloc tagging is enabled, but failed initialization
     (!alloc_tag_cttype or IS_ERR(alloc_tag_cttype))

In all cases, alloc_tag_cttype is not allocated, and therefore
alloc_tag_top_users() should not attempt to acquire the semaphore.

This leads to a crash on memory allocation failure by attempting to
acquire a non-existent semaphore:

  Oops: general protection fault, probably for non-canonical address 0xdffffc000000001b: 0000 [#3] SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x00000000000000d8-0x00000000000000df]
  CPU: 2 UID: 0 PID: 1 Comm: systemd Tainted: G D 6.16.0-rc2 #1 VOLUNTARY
  Tainted: [D]=DIE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
  RIP: 0010:down_read_trylock+0xaa/0x3b0
  Code: d0 7c 08 84 d2 0f 85 a0 02 00 00 8b 0d df 31 dd 04 85 c9 75 29 48 b8 00 00 00 00 00 fc ff df 48 8d 6b 68 48 89 ea 48 c1 ea 03 <80> 3c 02 00 0f 85 88 02 00 00 48 3b 5b 68 0f 85 53 01 00 00 65 ff
  RSP: 0000:ffff8881002ce9b8 EFLAGS: 00010016
  RAX: dffffc0000000000 RBX: 0000000000000070 RCX: 0000000000000000
  RDX: 000000000000001b RSI: 000000000000000a RDI: 0000000000000070
  RBP: 00000000000000d8 R08: 0000000000000001 R09: ffffed107dde49d1
  R10: ffff8883eef24e8b R11: ffff8881002cec20 R12: 1ffff11020059d37
  R13: 00000000003fff7b R14: ffff8881002cec20 R15: dffffc0000000000
  FS: 00007f963f21d940(0000) GS:ffff888458ca6000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f963f5edf71 CR3: 000000010672c000 CR4: 0000000000350ef0
  Call Trace:
   <TASK>
   codetag_trylock_module_list+0xd/0x20
   alloc_tag_top_users+0x369/0x4b0
   __show_mem+0x1cd/0x6e0
   warn_alloc+0x2b1/0x390
   __alloc_frozen_pages_noprof+0x12b9/0x21a0
   alloc_pages_mpol+0x135/0x3e0
   alloc_slab_page+0x82/0xe0
   new_slab+0x212/0x240
   ___slab_alloc+0x82a/0xe00
   </TASK>

As David Wang points out, this issue became easier to trigger after
commit 780138b12381 ("alloc_tag: check mem_profiling_support in
alloc_tag_init").

Before the commit, the issue occurred only when alloc_tag_cttype failed
to be allocated and initialized, or when a memory allocation failed
before alloc_tag_init() was called. After the commit, it can be easily
triggered when memory profiling is compiled in but disabled at boot.

To properly determine whether alloc_tag_init() has been called and its
data structures initialized, verify that alloc_tag_cttype is a valid
pointer before acquiring the semaphore. If the variable is NULL or an
error value, it has not been properly initialized. In such a case, just
skip and do not attempt to acquire the semaphore.

Reported-by: kernel test robot <oliver.sang(a)intel.com>
Closes: https://lore.kernel.org/oe-lkp/202506181351.bba867dd-lkp@intel.com
Fixes: 780138b12381 ("alloc_tag: check mem_profiling_support in alloc_tag_init")
Fixes: 1438d349d16b ("lib: add memory allocations report in show_mem()")
Cc: stable(a)vger.kernel.org
Signed-off-by: Harry Yoo <harry.yoo(a)oracle.com>
---

v1 -> v2:

- v1 fixed the bug only when MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=n.

  v2 now fixes the bug even when MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT=y.
  I didn't expect alloc_tag_cttype to be NULL when
  mem_profiling_support is true, but as David points out (Thanks David!)
  if a memory allocation fails before alloc_tag_init(), it can be NULL.

  So instead of indirectly checking mem_profiling_support, just directly
  check if alloc_tag_cttype is allocated.

- Closes: https://lore.kernel.org/oe-lkp/202505071555.e757f1e0-lkp@intel.com
  tag was removed because it was not a crash and not relevant to this
  patch.

- Added Cc: stable because, if an allocation fails before
  alloc_tag_init(), it can be triggered even prior-780138b12381.
  I verified that the bug can be triggered in v6.12 and fixed by this
  patch.

  It should be quite difficult to trigger in practice, though.
  Maybe I'm a bit paranoid?

 lib/alloc_tag.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
index 66a4628185f7..d8ec4c03b7d2 100644
--- a/lib/alloc_tag.c
+++ b/lib/alloc_tag.c
@@ -124,7 +124,9 @@ size_t alloc_tag_top_users(struct codetag_bytes *tags, size_t count, bool can_sl
 	struct codetag_bytes n;
 	unsigned int i, nr = 0;
 
-	if (can_sleep)
+	if (IS_ERR_OR_NULL(alloc_tag_cttype))
+		return 0;
+	else if (can_sleep)
 		codetag_lock_module_list(alloc_tag_cttype, true);
 	else if (!codetag_trylock_module_list(alloc_tag_cttype))
 		return 0;
-- 
2.43.0
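The guard added here covers both the NULL case and the error-pointer case in one test. As a rough userspace model of that idea (all sketch_* names are hypothetical, and the encoding below only imitates the kernel convention of storing a negative errno in the top 4095 addresses), the early return could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Userspace model of an error pointer: a negative errno cast to a pointer. */
static void *sketch_err_ptr(long err)
{
	assert(err < 0 && err >= -SKETCH_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

/* True for NULL and for any address in the top SKETCH_MAX_ERRNO bytes. */
static bool sketch_is_err_or_null(const void *ptr)
{
	return !ptr || (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical stand-in for the object that owns the lock. */
struct sketch_cttype {
	int lock_taken;
};

/* Returns the number of entries reported; 0 if the type is unusable. */
static size_t sketch_top_users(struct sketch_cttype *cttype)
{
	/* Bail out before touching a lock that does not exist. */
	if (sketch_is_err_or_null(cttype))
		return 0;

	cttype->lock_taken++;	/* models acquiring mod_lock */
	return 1;
}
```

Calling sketch_top_users() with NULL or with sketch_err_ptr(-12) returns 0 without dereferencing anything; only a real object has its lock counter bumped.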


+ mm-ptdump-take-the-memory-hotplug-lock-inside-ptdump_walk_pgd.patch added to mm-new branch

by Andrew Morton

The patch titled
     Subject: mm/ptdump: take the memory hotplug lock inside ptdump_walk_pgd()
has been added to the -mm mm-new branch.  Its filename is
     mm-ptdump-take-the-memory-hotplug-lock-inside-ptdump_walk_pgd.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…

This patch will later appear in the mm-new branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews.  Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Anshuman Khandual <anshuman.khandual(a)arm.com>
Subject: mm/ptdump: take the memory hotplug lock inside ptdump_walk_pgd()
Date: Fri, 20 Jun 2025 10:54:27 +0530

Memory hot remove unmaps and tears down various kernel page table
regions as required.  The ptdump code can race with concurrent
modifications of the kernel page tables.  When leaf entries are modified
concurrently, the dump code may log stale or inconsistent information for
a VA range, but this is otherwise not harmful.

But when intermediate levels of kernel page table are freed, the dump
code will continue to use memory that has been freed and potentially
reallocated for another purpose.  In such cases, the ptdump code may
dereference bogus addresses, leading to a number of potential problems.

To avoid the above mentioned race condition, platforms such as arm64,
riscv and s390 take the memory hotplug lock while dumping the kernel page
table via the debugfs interface /sys/kernel/debug/kernel_page_tables.

A similar race condition exists while checking for pages that might have
been marked W+X via /sys/kernel/debug/kernel_page_tables/check_wx_pages,
which in turn calls ptdump_check_wx().  Instead of solving this race
condition again, let's just move the memory hotplug lock inside the
generic ptdump_walk_pgd(), which will benefit both scenarios.

Drop the get_online_mems() and put_online_mems() combination from all
existing platform ptdump code paths.

Link: https://lkml.kernel.org/r/20250620052427.2092093-1-anshuman.khandual@arm.com
Fixes: bbd6ec605c0f ("arm64/mm: Enable memory hot remove")
Signed-off-by: Anshuman Khandual <anshuman.khandual(a)arm.com>
Cc: Catalin Marinas <catalin.marinas(a)arm.com>
Cc: Will Deacon <will(a)kernel.org>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Paul Walmsley <paul.walmsley(a)sifive.com>
Cc: Palmer Dabbelt <palmer(a)dabbelt.com>
Cc: Alexander Gordeev <agordeev(a)linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer(a)linux.ibm.com>
Cc: Heiko Carstens <hca(a)linux.ibm.com>
Cc: Vasily Gorbik <gor(a)linux.ibm.com>
Cc: Christian Borntraeger <borntraeger(a)linux.ibm.com>
Cc: Sven Schnelle <svens(a)linux.ibm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 arch/arm64/mm/ptdump_debugfs.c | 3 ---
 arch/riscv/mm/ptdump.c | 3 ---
 arch/s390/mm/dump_pagetables.c | 2 --
 mm/ptdump.c | 2 ++
 4 files changed, 2 insertions(+), 8 deletions(-)

--- a/arch/arm64/mm/ptdump_debugfs.c~mm-ptdump-take-the-memory-hotplug-lock-inside-ptdump_walk_pgd
+++ a/arch/arm64/mm/ptdump_debugfs.c
@@ -1,6 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/debugfs.h>
-#include <linux/memory_hotplug.h>
 #include <linux/seq_file.h>
 
 #include <asm/ptdump.h>
@@ -9,9 +8,7 @@ static int ptdump_show(struct seq_file *
 {
 	struct ptdump_info *info = m->private;
 
-	get_online_mems();
 	ptdump_walk(m, info);
-	put_online_mems();
 	return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);
--- a/arch/riscv/mm/ptdump.c~mm-ptdump-take-the-memory-hotplug-lock-inside-ptdump_walk_pgd
+++ a/arch/riscv/mm/ptdump.c
@@ -6,7 +6,6 @@
 #include <linux/efi.h>
 #include <linux/init.h>
 #include <linux/debugfs.h>
-#include <linux/memory_hotplug.h>
 #include <linux/seq_file.h>
 #include <linux/ptdump.h>
 
@@ -413,9 +412,7 @@ bool ptdump_check_wx(void)
 
 static int ptdump_show(struct seq_file *m, void *v)
 {
-	get_online_mems();
 	ptdump_walk(m, m->private);
-	put_online_mems();
 
 	return 0;
 }
--- a/arch/s390/mm/dump_pagetables.c~mm-ptdump-take-the-memory-hotplug-lock-inside-ptdump_walk_pgd
+++ a/arch/s390/mm/dump_pagetables.c
@@ -247,11 +247,9 @@ static int ptdump_show(struct seq_file *
 		.marker = markers,
 	};
 
-	get_online_mems();
 	mutex_lock(&cpa_mutex);
 	ptdump_walk_pgd(&st.ptdump, &init_mm, NULL);
 	mutex_unlock(&cpa_mutex);
-	put_online_mems();
 	return 0;
 }
 DEFINE_SHOW_ATTRIBUTE(ptdump);
--- a/mm/ptdump.c~mm-ptdump-take-the-memory-hotplug-lock-inside-ptdump_walk_pgd
+++ a/mm/ptdump.c
@@ -176,6 +176,7 @@ void ptdump_walk_pgd(struct ptdump_state
 {
 	const struct ptdump_range *range = st->range;
 
+	get_online_mems();
 	mmap_write_lock(mm);
 	while (range->start != range->end) {
 		walk_page_range_debug(mm, range->start, range->end,
@@ -183,6 +184,7 @@ void ptdump_walk_pgd(struct ptdump_state
 		range++;
 	}
 	mmap_write_unlock(mm);
+	put_online_mems();
 
 	/* Flush out the last page */
 	st->note_page_flush(st);
_

Patches currently in -mm which might be from anshuman.khandual(a)arm.com are

mm-ptdump-take-the-memory-hotplug-lock-inside-ptdump_walk_pgd.patch
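The design point of this patch is that the lock is taken once in the shared walker rather than in each caller. A small userspace sketch of that structure follows. Everything named sketch_* is hypothetical, and a plain counter stands in for the real memory hotplug lock, so this models only who takes the lock, not actual synchronization.

```c
#include <assert.h>

/* Hypothetical stand-in for the memory hotplug lock: a reader count. */
static int sketch_hotplug_readers;

static void sketch_get_online_mems(void)
{
	sketch_hotplug_readers++;
}

static void sketch_put_online_mems(void)
{
	assert(sketch_hotplug_readers > 0);
	sketch_hotplug_readers--;
}

typedef void (*sketch_visit_fn)(void *ctx);

/* Generic walker: the only place that takes and drops the lock. */
static void sketch_walk_pgd(sketch_visit_fn visit, void *ctx)
{
	sketch_get_online_mems();
	visit(ctx);
	sketch_put_online_mems();
}

/* Visitor that counts how often it ran with the lock held. */
static void sketch_note_page(void *ctx)
{
	int *visits_under_lock = ctx;

	if (sketch_hotplug_readers > 0)
		(*visits_under_lock)++;
}

/* Two independent entry points; neither does any locking itself. */
static int sketch_ptdump_show(void)
{
	int n = 0;

	sketch_walk_pgd(sketch_note_page, &n);
	return n;
}

static int sketch_check_wx(void)
{
	int n = 0;

	sketch_walk_pgd(sketch_note_page, &n);
	return n;
}
```

Both entry points observe the lock as held during the walk, and the reader count is back to zero afterwards, which is the property the per-architecture get_online_mems()/put_online_mems() pairs previously had to provide one caller at a time.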


+ mmhugetlb-change-mechanism-to-detect-a-cow-on-private-mapping.patch added to mm-new branch

by Andrew Morton

The patch titled
     Subject: mm,hugetlb: change mechanism to detect a COW on private mapping
has been added to the -mm mm-new branch.  Its filename is
     mmhugetlb-change-mechanism-to-detect-a-cow-on-private-mapping.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…

This patch will later appear in the mm-new branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Oscar Salvador <osalvador(a)suse.de>
Subject: mm,hugetlb: change mechanism to detect a COW on private mapping
Date: Fri, 20 Jun 2025 14:30:10 +0200

Patch series "Misc rework on hugetlb faulting path", v2.

This patchset aims to give some love to the hugetlb faulting path, doing
so by removing obsolete comments that are no longer true, sorting out the
folio lock, and changing the mechanism we use to determine whether we are
COWing a private mapping already.

The most important patch of the series is #1, as it fixes a deadlock that
was described in [1], where two processes were holding the same lock for
the folio in the pagecache, and then deadlocked in the mutex.

Looking up and locking the folio in the pagecache was done to check
whether that folio was the same folio we had mapped in our pagetables,
meaning that if it was different we knew that we already mapped that
folio privately, so any further CoW would be made on a private mapping,
which led us to the question: __Was the reservation for that address
consumed?__ That is all we care about, because if it was indeed consumed
and we are the owner and we cannot allocate more folios, we need to unmap
the folio from the processes' pagetables and make it exclusive for us.

We figured we do not need to look up the folio at all, and it is just
enough to check whether the folio we have mapped is anonymous, which
means we mapped it privately, so the reservation was indeed consumed.

Patch #2 sorts out folio locking in the faulting path, reducing its
scope, only taking it when we are dealing with an anonymous folio, and
documenting it.  More details in the patch.

Patches #3-5 are cleanups.


hugetlb_wp() checks whether the process is trying to COW on a private
mapping in order to know whether the reservation for that address was
already consumed or not.  If it was consumed and we are the owner of the
mapping, the folio will have to be unmapped from the other processes.

Currently, that check is done by looking up the folio in the pagecache
and comparing it to the folio which is mapped in our pagetables.  If it
differs, it means we already mapped it privately before, consuming a
reservation on the way.  All we are interested in is whether the mapped
folio is anonymous, so we can simplify and check for that instead.

Also, we transition from a trylock to a folio_lock, since the former was
only needed when hugetlb_fault() had to lock both folios, in order to
avoid deadlock.

Link: https://lkml.kernel.org/r/20250620123014.29748-1-osalvador@suse.de
Link: https://lkml.kernel.org/r/20250620123014.29748-2-osalvador@suse.de
Link: https://lore.kernel.org/lkml/20250513093448.592150-1-gavinguo@igalia.com/ [1]
Fixes: 40549ba8f8e0 ("hugetlb: use new vma_lock for pmd sharing synchronization")
Reported-by: Gavin Guo <gavinguo(a)igalia.com>
Closes: https://lore.kernel.org/lkml/20250513093448.592150-1-gavinguo@igalia.com/
Signed-off-by: Oscar Salvador <osalvador(a)suse.de>
Suggested-by: Peter Xu <peterx(a)redhat.com>
Acked-by: David Hildenbrand <david(a)redhat.com>
Cc: Muchun Song <muchun.song(a)linux.dev>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 mm/hugetlb.c | 70 +++++++++++--------------------------------------
 1 file changed, 17 insertions(+), 53 deletions(-)

--- a/mm/hugetlb.c~mmhugetlb-change-mechanism-to-detect-a-cow-on-private-mapping
+++ a/mm/hugetlb.c
@@ -6130,8 +6130,7 @@ static void unmap_ref_private(struct mm_
  * cannot race with other handlers or page migration.
  * Keep the pte_same checks anyway to make transition from the mutex easier.
  */
-static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
-		       struct vm_fault *vmf)
+static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
@@ -6193,16 +6192,17 @@ retry_avoidcopy:
 				       PageAnonExclusive(&old_folio->page), &old_folio->page);
 
 	/*
-	 * If the process that created a MAP_PRIVATE mapping is about to
-	 * perform a COW due to a shared page count, attempt to satisfy
-	 * the allocation without using the existing reserves. The pagecache
-	 * page is used to determine if the reserve at this address was
-	 * consumed or not. If reserves were used, a partial faulted mapping
-	 * at the time of fork() could consume its reserves on COW instead
-	 * of the full address range.
+	 * If the process that created a MAP_PRIVATE mapping is about to perform
+	 * a COW due to a shared page count, attempt to satisfy the allocation
+	 * without using the existing reserves.
+	 * In order to determine where this is a COW on a MAP_PRIVATE mapping it
+	 * is enough to check whether the old_folio is anonymous. This means that
+	 * the reserve for this address was consumed. If reserves were used, a
+	 * partial faulted mapping at the fime of fork() could consume its reserves
+	 * on COW instead of the full address range.
 	 */
 	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
-			old_folio != pagecache_folio)
+	    folio_test_anon(old_folio))
 		cow_from_owner = true;
 
 	folio_get(old_folio);
@@ -6581,7 +6581,7 @@ static vm_fault_t hugetlb_no_page(struct
 	hugetlb_count_add(pages_per_huge_page(h), mm);
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_wp(folio, vmf);
+		ret = hugetlb_wp(vmf);
 	}
 
 	spin_unlock(vmf->ptl);
@@ -6648,11 +6648,9 @@ vm_fault_t hugetlb_fault(struct mm_struc
 {
 	vm_fault_t ret;
 	u32 hash;
-	struct folio *folio = NULL;
-	struct folio *pagecache_folio = NULL;
+	struct folio *folio;
 	struct hstate *h = hstate_vma(vma);
 	struct address_space *mapping;
-	int need_wait_lock = 0;
 	struct vm_fault vmf = {
 		.vma = vma,
 		.address = address & huge_page_mask(h),
@@ -6747,8 +6745,7 @@ vm_fault_t hugetlb_fault(struct mm_struc
 	 * If we are going to COW/unshare the mapping later, we examine the
 	 * pending reservations for this page now. This will ensure that any
 	 * allocations necessary to record that reservation occur outside the
-	 * spinlock. Also lookup the pagecache page now as it is used to
-	 * determine if a reservation has been consumed.
+	 * spinlock.
 	 */
 	if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
 	    !(vma->vm_flags & VM_MAYSHARE) && !huge_pte_write(vmf.orig_pte)) {
@@ -6758,11 +6755,6 @@ vm_fault_t hugetlb_fault(struct mm_struc
 		}
 		/* Just decrements count, does not deallocate */
 		vma_end_reservation(h, vma, vmf.address);
-
-		pagecache_folio = filemap_lock_hugetlb_folio(h, mapping,
-							     vmf.pgoff);
-		if (IS_ERR(pagecache_folio))
-			pagecache_folio = NULL;
 	}
 
 	vmf.ptl = huge_pte_lock(h, mm, vmf.pte);
@@ -6776,10 +6768,6 @@ vm_fault_t hugetlb_fault(struct mm_struc
 	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(vmf.orig_pte)) {
 		if (!userfaultfd_wp_async(vma)) {
 			spin_unlock(vmf.ptl);
-			if (pagecache_folio) {
-				folio_unlock(pagecache_folio);
-				folio_put(pagecache_folio);
-			}
 			hugetlb_vma_unlock_read(vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			return handle_userfault(&vmf, VM_UFFD_WP);
@@ -6791,23 +6779,14 @@ vm_fault_t hugetlb_fault(struct mm_struc
 		/* Fallthrough to CoW */
 	}
 
-	/*
-	 * hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) and
-	 * pagecache_folio, so here we need take the former one
-	 * when folio != pagecache_folio or !pagecache_folio.
-	 */
+	/* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
 	folio = page_folio(pte_page(vmf.orig_pte));
-	if (folio != pagecache_folio)
-		if (!folio_trylock(folio)) {
-			need_wait_lock = 1;
-			goto out_ptl;
-		}
-
+	folio_lock(folio);
 	folio_get(folio);
 
 	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!huge_pte_write(vmf.orig_pte)) {
-			ret = hugetlb_wp(pagecache_folio, &vmf);
+			ret = hugetlb_wp(&vmf);
 			goto out_put_page;
 		} else if (likely(flags & FAULT_FLAG_WRITE)) {
 			vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
@@ -6818,16 +6797,10 @@ vm_fault_t hugetlb_fault(struct mm_struc
 						flags & FAULT_FLAG_WRITE))
 		update_mmu_cache(vma, vmf.address, vmf.pte);
 out_put_page:
-	if (folio != pagecache_folio)
-		folio_unlock(folio);
+	folio_unlock(folio);
 	folio_put(folio);
 out_ptl:
 	spin_unlock(vmf.ptl);
-
-	if (pagecache_folio) {
-		folio_unlock(pagecache_folio);
-		folio_put(pagecache_folio);
-	}
 out_mutex:
 	hugetlb_vma_unlock_read(vma);
 
@@ -6839,15 +6812,6 @@ out_mutex:
 		vma_end_read(vma);
 
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-	/*
-	 * Generally it's safe to hold refcount during waiting page lock. But
-	 * here we just wait to defer the next page fault to avoid busy loop and
-	 * the page is not used after unlocked before returning from the current
-	 * page fault. So we are safe from accessing freed page, even if we wait
-	 * here without taking refcount.
-	 */
-	if (need_wait_lock)
-		folio_wait_locked(folio);
 	return ret;
 }
 
_

Patches currently in -mm which might be from osalvador(a)suse.de are

mmslub-do-not-special-case-n_normal-nodes-for-slab_nodes.patch
mmmemory_hotplug-remove-status_change_nid_normal-and-update-documentation.patch
mmmemory_hotplug-implement-numa-node-notifier.patch
mmslub-use-node-notifier-instead-of-memory-notifier.patch
mmmemory-tiers-use-node-notifier-instead-of-memory-notifier.patch
driverscxl-use-node-notifier-instead-of-memory-notifier.patch
drivershmat-use-node-notifier-instead-of-memory-notifier.patch
kernelcpuset-use-node-notifier-instead-of-memory-notifier.patch
mmmempolicy-use-node-notifier-instead-of-memory-notifier.patch
mmpage_ext-derive-the-node-from-the-pfn.patch
mmmemory_hotplug-drop-status_change_nid-parameter-from-memory_notify.patch
mmhugetlb-change-mechanism-to-detect-a-cow-on-private-mapping.patch
mmhugetlb-sort-out-folio-locking-in-the-faulting-path.patch
mmhugetlb-rename-anon_rmap-to-new_anon_folio-and-make-it-boolean.patch
mmhugetlb-rename-anon_rmap-to-new_anon_folio-and-make-it-boolean-fix.patch
mmhugetlb-drop-obsolete-comment-about-non-present-pte-and-second-faults.patch
mmhugetlb-drop-unlikelys-from-hugetlb_fault.patch


[PATCH v3] net/9p: Fix buffer overflow in USB transport layer

by Dominique Martinet via B4 Relay

From: Dominique Martinet <asmadeus(a)codewreck.org>

A buffer overflow vulnerability exists in the USB 9pfs transport layer
where inconsistent size validation between packet header parsing and
actual data copying allows a malicious USB host to overflow heap buffers.

The issue occurs because:
- usb9pfs_rx_header() validates only the declared size in packet header
- usb9pfs_rx_complete() uses req->actual (actual received bytes) for
  memcpy

This allows an attacker to craft packets with small declared size
(bypassing validation) but large actual payload (triggering overflow
in memcpy).

Add validation in usb9pfs_rx_complete() to ensure req->actual does not
exceed the buffer capacity before copying data.

Reported-by: Yuhao Jiang <danisjiang(a)gmail.com>
Closes: https://lkml.kernel.org/r/20250616132539.63434-1-danisjiang@gmail.com
Fixes: a3be076dc174 ("net/9p/usbg: Add new usb gadget function transport")
Cc: stable(a)vger.kernel.org
Signed-off-by: Dominique Martinet <asmadeus(a)codewreck.org>
---
(still not actually tested, can't get dummy_hcd/gt to create a device
listed by p9_fwd.py/useable in qemu, I give up..)

Changes in v3:
- fix typo s/req_sizel/req_size/ -- sorry for that, module wasn't
  built...
- Link to v2: https://lore.kernel.org/r/20250620-9p-usb_overflow-v2-1-026c6109c7a1@codewr…

Changes in v2:
- run through p9_client_cb() on error
- Link to v1: https://lore.kernel.org/r/20250616132539.63434-1-danisjiang@gmail.com
---
 net/9p/trans_usbg.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/net/9p/trans_usbg.c b/net/9p/trans_usbg.c
index 6b694f117aef296a66419fed5252305e7a1d0936..468f7e8f0277b9ae5f1bb3c94c649fca97d28857 100644
--- a/net/9p/trans_usbg.c
+++ b/net/9p/trans_usbg.c
@@ -231,6 +231,8 @@ static void usb9pfs_rx_complete(struct usb_ep *ep, struct usb_request *req)
 	struct f_usb9pfs *usb9pfs = ep->driver_data;
 	struct usb_composite_dev *cdev = usb9pfs->function.config->cdev;
 	struct p9_req_t *p9_rx_req;
+	unsigned int req_size = req->actual;
+	int status = REQ_STATUS_RCVD;
 
 	if (req->status) {
 		dev_err(&cdev->gadget->dev, "%s usb9pfs complete --> %d, %d/%d\n",
@@ -242,11 +244,19 @@ static void usb9pfs_rx_complete(struct usb_ep *ep, struct usb_request *req)
 	if (!p9_rx_req)
 		return;
 
-	memcpy(p9_rx_req->rc.sdata, req->buf, req->actual);
+	if (req_size > p9_rx_req->rc.capacity) {
+		dev_err(&cdev->gadget->dev,
+			"%s received data size %u exceeds buffer capacity %zu\n",
+			ep->name, req_size, p9_rx_req->rc.capacity);
+		req_size = 0;
+		status = REQ_STATUS_ERROR;
+	}
 
-	p9_rx_req->rc.size = req->actual;
+	memcpy(p9_rx_req->rc.sdata, req->buf, req_size);
 
-	p9_client_cb(usb9pfs->client, p9_rx_req, REQ_STATUS_RCVD);
+	p9_rx_req->rc.size = req_size;
+
+	p9_client_cb(usb9pfs->client, p9_rx_req, status);
 	p9_req_put(usb9pfs->client, p9_rx_req);
 
 	complete(&usb9pfs->received);

---
base-commit: 74b4cc9b8780bfe8a3992c9ac0033bf22ac01f19
change-id: 20250620-9p-usb_overflow-25bfc5e9bef3

Best regards,
-- 
Dominique Martinet <asmadeus(a)codewreck.org>
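The fix follows a common pattern: compare the number of bytes actually received against the destination capacity before copying, and turn an oversized transfer into a zero-length error completion rather than returning early. A minimal userspace sketch of that pattern is shown below. The sketch_* names are hypothetical and only loosely mirror the rc.sdata, rc.capacity and rc.size fields seen in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sketch_status { SKETCH_RCVD, SKETCH_ERROR };

/* Hypothetical receive buffer, loosely modelled on the fields in the diff. */
struct sketch_rx_buf {
	unsigned char *sdata;	/* destination storage */
	size_t capacity;	/* bytes available at sdata */
	size_t size;		/* bytes considered valid after the copy */
};

/*
 * Copy 'actual' received bytes into rc, but only if they fit. An
 * oversized transfer copies nothing and is reported as an error so the
 * caller can still complete the request.
 */
static enum sketch_status sketch_rx_copy(struct sketch_rx_buf *rc,
					 const void *src, size_t actual)
{
	enum sketch_status status = SKETCH_RCVD;

	assert(rc && rc->sdata && src);
	if (actual > rc->capacity) {
		actual = 0;
		status = SKETCH_ERROR;
	}
	memcpy(rc->sdata, src, actual);
	rc->size = actual;
	return status;
}
```

Clamping the length to zero instead of bailing out keeps a single exit path, which matches the v2 change note about still running through the completion callback on error.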


[PATCH v2] net/9p: Fix buffer overflow in USB transport layer

by Dominique Martinet via B4 Relay

From: Dominique Martinet <asmadeus(a)codewreck.org>

A buffer overflow vulnerability exists in the USB 9pfs transport layer
where inconsistent size validation between packet header parsing and
actual data copying allows a malicious USB host to overflow heap buffers.

The issue occurs because:
- usb9pfs_rx_header() validates only the declared size in packet header
- usb9pfs_rx_complete() uses req->actual (actual received bytes) for
  memcpy

This allows an attacker to craft packets with small declared size
(bypassing validation) but large actual payload (triggering overflow
in memcpy).

Add validation in usb9pfs_rx_complete() to ensure req->actual does not
exceed the buffer capacity before copying data.

Reported-by: Yuhao Jiang <danisjiang(a)gmail.com>
Closes: https://lkml.kernel.org/r/20250616132539.63434-1-danisjiang@gmail.com
Fixes: a3be076dc174 ("net/9p/usbg: Add new usb gadget function transport")
Cc: stable(a)vger.kernel.org
Signed-off-by: Dominique Martinet <asmadeus(a)codewreck.org>
---
Not actually tested, I'll try to find time to figure out how to run with
qemu for real this time...

Changes in v2:
- run through p9_client_cb() on error
- Link to v1: https://lore.kernel.org/r/20250616132539.63434-1-danisjiang@gmail.com
---
 net/9p/trans_usbg.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/net/9p/trans_usbg.c b/net/9p/trans_usbg.c
index 6b694f117aef296a66419fed5252305e7a1d0936..43078e0d4ca3f4063660f659d28452c81bef10b4 100644
--- a/net/9p/trans_usbg.c
+++ b/net/9p/trans_usbg.c
@@ -231,6 +231,8 @@ static void usb9pfs_rx_complete(struct usb_ep *ep, struct usb_request *req)
 	struct f_usb9pfs *usb9pfs = ep->driver_data;
 	struct usb_composite_dev *cdev = usb9pfs->function.config->cdev;
 	struct p9_req_t *p9_rx_req;
+	unsigned int req_size = req->actual;
+	int status = REQ_STATUS_RCVD;
 
 	if (req->status) {
 		dev_err(&cdev->gadget->dev, "%s usb9pfs complete --> %d, %d/%d\n",
@@ -242,11 +244,19 @@ static void usb9pfs_rx_complete(struct usb_ep *ep, struct usb_request *req)
 	if (!p9_rx_req)
 		return;
 
-	memcpy(p9_rx_req->rc.sdata, req->buf, req->actual);
+	if (req_size > p9_rx_req->rc.capacity) {
+		dev_err(&cdev->gadget->dev,
+			"%s received data size %u exceeds buffer capacity %zu\n",
+			ep->name, req_size, p9_rx_req->rc.capacity);
+		req_size = 0;
+		status = REQ_STATUS_ERROR;
+	}
 
-	p9_rx_req->rc.size = req->actual;
+	memcpy(p9_rx_req->rc.sdata, req->buf, req_size);
 
-	p9_client_cb(usb9pfs->client, p9_rx_req, REQ_STATUS_RCVD);
+	p9_rx_req->rc.size = req_sizel;
+
+	p9_client_cb(usb9pfs->client, p9_rx_req, status);
 	p9_req_put(usb9pfs->client, p9_rx_req);
 
 	complete(&usb9pfs->received);

---
base-commit: 74b4cc9b8780bfe8a3992c9ac0033bf22ac01f19
change-id: 20250620-9p-usb_overflow-25bfc5e9bef3

Best regards,
-- 
Dominique Martinet <asmadeus(a)codewreck.org>


+ kallsyms-fix-build-without-execinfo.patch added to mm-hotfixes-unstable branch

by Andrew Morton

The patch titled
     Subject: kallsyms: fix build without execinfo
has been added to the -mm mm-hotfixes-unstable branch.  Its filename is
     kallsyms-fix-build-without-execinfo.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…

This patch will later appear in the mm-hotfixes-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days

------------------------------------------------------
From: Achill Gilgenast <fossdd(a)pwned.life>
Subject: kallsyms: fix build without execinfo
Date: Sun, 22 Jun 2025 03:45:49 +0200

Some libc's like musl libc don't provide execinfo.h since it's not part of POSIX. In order to fix compilation on musl, only include execinfo.h if available (HAVE_BACKTRACE_SUPPORT).

This was discovered with c104c16073b7 ("Kunit to check the longest symbol length") which starts to include linux/kallsyms.h with Alpine Linux' configs.

Link: https://lkml.kernel.org/r/20250622014608.448718-1-fossdd@pwned.life
Fixes: c104c16073b7 ("Kunit to check the longest symbol length")
Signed-off-by: Achill Gilgenast <fossdd(a)pwned.life>
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---

 tools/include/linux/kallsyms.h |    4 ++++
 1 file changed, 4 insertions(+)

--- a/tools/include/linux/kallsyms.h~kallsyms-fix-build-without-execinfo
+++ a/tools/include/linux/kallsyms.h
@@ -18,6 +18,7 @@ static inline const char *kallsyms_looku
 	return NULL;
 }

+#ifdef HAVE_BACKTRACE_SUPPORT
 #include <execinfo.h>
 #include <stdlib.h>
 static inline void print_ip_sym(const char *loglvl, unsigned long ip)
@@ -30,5 +31,8 @@ static inline void print_ip_sym(const ch

 	free(name);
 }
+#else
+static inline void print_ip_sym(const char *loglvl, unsigned long ip) {}
+#endif

 #endif
_

Patches currently in -mm which might be from fossdd(a)pwned.life are

kallsyms-fix-build-without-execinfo.patch
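As a standalone illustration of the guard described above, the sketch below compiles the backtrace-based helper only when HAVE_BACKTRACE_SUPPORT is defined and otherwise provides an empty stub with the same signature, so callers build unchanged on a libc without execinfo.h. The PRINT_IP_SYM_STUBBED macro is a hypothetical marker added here purely so the selected branch can be observed; it is not part of the patch, and the body of the glibc branch is a simplified approximation rather than a copy of the header.

```c
#include <assert.h>
#include <stdio.h>

#ifdef HAVE_BACKTRACE_SUPPORT
#include <execinfo.h>
#include <stdlib.h>

/* glibc-style path: resolve the address to a symbol string and print it. */
static inline void print_ip_sym(const char *loglvl, unsigned long ip)
{
	void *addr = (void *)ip;
	char **name = backtrace_symbols(&addr, 1);

	if (!name)
		return;
	printf("%s[<%016lx>] %s\n", loglvl, ip, name[0]);
	free(name);
}
#define PRINT_IP_SYM_STUBBED 0	/* hypothetical marker, illustration only */
#else
/* No <execinfo.h> (e.g. musl): keep the symbol, drop the behaviour. */
static inline void print_ip_sym(const char *loglvl, unsigned long ip)
{
	(void)loglvl;
	(void)ip;
}
#define PRINT_IP_SYM_STUBBED 1	/* hypothetical marker, illustration only */
#endif
```

Building with -DHAVE_BACKTRACE_SUPPORT on a glibc system selects the first branch; without the define, the stub is used and a call such as print_ip_sym("", 0x1234UL) does nothing.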


Backport of `Kunit to check the longest symbol length` to 6.12

by Sergio González Collado

Hello,

Please consider applying the following commits for 6.12.y:

c104c16073b7 ("Kunit to check the longest symbol length")
f710202b2a45 ("x86/tools: Drop duplicate unlikely() definition in insn_decoder_test.c")

They should apply cleanly.

Those two commits implement a kunit test to verify that a symbol with KSYM_NAME_LEN of 512 can be read.

The first commit implements the test. This commit also includes a fix for the test x86/insn_decoder_test. In the case a symbol exceeds the symbol length limit, an error will happen:

    arch/x86/tools/insn_decoder_test: error: malformed line 1152000:
    tBb_+0xf2>

..which overflowed by 10 characters reading this line:

    ffffffff81458193: 74 3d je ffffffff814581d2 <_RNvXse_NtNtNtCshGpAVYOtgW1_4core4iter8adapters7flattenINtB5_13FlattenCompatINtNtB7_3map3MapNtNtNtBb_3str4iter5CharsNtB1v_17CharEscapeDefaultENtNtBb_4char13EscapeDefaultENtNtBb_3fmt5Debug3fmtBb_+0xf2>

The fix was proposed in [1] and initially mentioned at [2].

The second commit fixes a warning when building with clang because there was a definition of unlikely from compiler.h in tools/include/linux, which conflicted with the one in the instruction decoder selftest.

[1] https://lore.kernel.org/lkml/Y9ES4UKl%2F+DtvAVS@gmail.com/
[2] https://lore.kernel.org/lkml/320c4dba-9919-404b-8a26-a8af16be1845@app.fastm…

I will send something similar to 6.6.y and 6.1.y

Thanks!

Best regards,
Sergio
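To make the "malformed line" failure mode concrete, here is a small sketch of how a fixed-size line buffer truncates a disassembly line that carries a very long symbol, and how sizing the buffer with KSYM_NAME_LEN in mind avoids it. Only the value 512 for KSYM_NAME_LEN comes from the message above; the two buffer sizes and the read_line_overflow() helper are assumptions made for this illustration and are not taken from insn_decoder_test.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KSYM_NAME_LEN 512			/* value quoted in the message above */
#define OLD_BUFSIZE   256			/* assumed original line-buffer size */
#define NEW_BUFSIZE   (256 + KSYM_NAME_LEN)	/* assumed enlarged size */

/*
 * Mimic fgets(): copy at most bufsize - 1 bytes of `line` into `buf`
 * and NUL-terminate. Returns the number of bytes that did not fit;
 * 0 means the whole line was read in one go.
 */
size_t read_line_overflow(char *buf, size_t bufsize, const char *line)
{
	size_t len = strlen(line);
	size_t room = bufsize - 1;
	size_t copied = len < room ? len : room;

	memcpy(buf, line, copied);
	buf[copied] = '\0';
	return len - copied;
}
```

With a 339-character line, the smaller buffer leaves 84 characters unread, and those leftovers are what a line-oriented parser would then see as a separate, malformed line; the larger buffer reads the line whole.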


[PATCH 5.15.y 0/5] Backport few sfq fixes

by Harshit Mogalapalli

commit: 10685681bafc ("net_sched: sch_sfq: don't allow 1 packet limit") fixes CVE-2024-57996 and commit: b3bf8f63e617 ("net_sched: sch_sfq: move the limit validation") fixes CVE-2025-37752.

Patches 3 and 5 are CVE fixes for the above mentioned CVEs. Patches 1, 2 and 4 are pulled in as stable-deps.

Testing performed on the patched 5.15.185 kernel with the above 5 patches: (Used latest upstream kselftests for tc-testing)

# ./tdc.py -f tc-tests/qdiscs/sfq.json
 -- ns/SubPlugin.__init__
Test 7482: Create SFQ with default setting
Test c186: Create SFQ with limit setting
Test ae23: Create SFQ with perturb setting
Test a430: Create SFQ with quantum setting
Test 4539: Create SFQ with divisor setting
Test b089: Create SFQ with flows setting
Test 99a0: Create SFQ with depth setting
Test 7389: Create SFQ with headdrop setting
Test 6472: Create SFQ with redflowlimit setting
Test 8929: Show SFQ class
Test 4d6f: Check that limit of 1 is rejected
Test 7f8f: Check that a derived limit of 1 is rejected (limit 2 depth 1 flows 1)
Test 5168: Check that a derived limit of 1 is rejected (limit 2 depth 1 divisor 1)

All test results:

1..13
ok 1 7482 - Create SFQ with default setting
ok 2 c186 - Create SFQ with limit setting
ok 3 ae23 - Create SFQ with perturb setting
ok 4 a430 - Create SFQ with quantum setting
ok 5 4539 - Create SFQ with divisor setting
ok 6 b089 - Create SFQ with flows setting
ok 7 99a0 - Create SFQ with depth setting
ok 8 7389 - Create SFQ with headdrop setting
ok 9 6472 - Create SFQ with redflowlimit setting
ok 10 8929 - Show SFQ class
ok 11 4d6f - Check that limit of 1 is rejected
ok 12 7f8f - Check that a derived limit of 1 is rejected (limit 2 depth 1 flows 1)
ok 13 5168 - Check that a derived limit of 1 is rejected (limit 2 depth 1 divisor 1)

# uname -a
Linux hamogala-vm-6 5.15.185+ #1 SMP Fri Jun 13 18:34:53 GMT 2025 x86_64 x86_64 x86_64 GNU/Linux

I will try to send similar backports to kernels older than 5.15.y as well.

Thanks,
Harshit

Eric Dumazet (2):
  net_sched: sch_sfq: annotate data-races around q->perturb_period
  net_sched: sch_sfq: handle bigger packets

Octavian Purdila (3):
  net_sched: sch_sfq: don't allow 1 packet limit
  net_sched: sch_sfq: use a temporary work area for validating configuration
  net_sched: sch_sfq: move the limit validation

 net/sched/sch_sfq.c | 112 ++++++++++++++++++++++++++++----------------
 1 file changed, 71 insertions(+), 41 deletions(-)

-- 
2.47.1
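The last three test names hint at why the validation had to move: a requested limit of 2 can still end up as an effective limit of 1 once it is capped by depth and flows. The sketch below is one reading of that behaviour, assuming the effective limit is min(limit, depth * flows) and that a result of 1 is rejected. The sfq_check_limit() and min_u32() helpers are hypothetical names for this illustration, not functions from sch_sfq.c, and the divisor case from test 5168 is not modelled.

```c
#include <assert.h>
#include <errno.h>

/* Smaller of two unsigned values. */
static inline unsigned int min_u32(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Validate a hypothetical SFQ configuration.
 * The effective limit is capped by depth * flows (assumed small enough
 * not to overflow); an effective limit of 1 is rejected.
 * Returns 0 and stores the effective limit on success, -EINVAL otherwise.
 */
int sfq_check_limit(unsigned int limit, unsigned int depth,
		    unsigned int flows, unsigned int *effective)
{
	unsigned int derived = min_u32(limit, depth * flows);

	if (derived == 1)
		return -EINVAL;

	*effective = derived;
	return 0;
}
```

Checking the derived value rather than the raw request is what lets the "limit 2 depth 1 flows 1" case fail in the same way as a plain "limit 1".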


FAILED: patch "[PATCH] drm/i915/display: Add check for alloc_ordered_workqueue() and" failed to apply to 6.6-stable tree

by gregkh＠linuxfoundation.org

The patch below does not apply to the 6.6-stable tree.
If someone wants it applied there, or to any other stable or longterm tree, then please email the backport, including the original git commit id to <stable(a)vger.kernel.org>.

To reproduce the conflict and resubmit, you may use the following commands:

git fetch https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/ linux-6.6.y
git checkout FETCH_HEAD
git cherry-pick -x f4c7baa0699b69edb6887a992283b389761e0e81
# <resolve conflicts, build, test, etc.>
git commit -s
git send-email --to '<stable(a)vger.kernel.org>' --in-reply-to '2025062239-erased-bonus-68e2@gregkh' --subject-prefix 'PATCH 6.6.y' HEAD^..

Possible dependencies:

thanks,

greg k-h

------------------ original commit in Linus's tree ------------------

From f4c7baa0699b69edb6887a992283b389761e0e81 Mon Sep 17 00:00:00 2001
From: Haoxiang Li <haoxiang_li2024(a)163.com>
Date: Fri, 16 May 2025 15:16:54 +0300
Subject: [PATCH] drm/i915/display: Add check for alloc_ordered_workqueue() and alloc_workqueue()

Add check for the return value of alloc_ordered_workqueue() and alloc_workqueue(). Furthermore, if some allocations fail, cleanup works are added to avoid potential memory leak problem.

Fixes: 40053823baad ("drm/i915/display: move modeset probe/remove functions to intel_display_driver.c")
Cc: stable(a)vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024(a)163.com>
Reviewed-by: Matthew Auld <matthew.auld(a)intel.com>
Link: https://lore.kernel.org/r/20d3d096c6a4907636f8a1389b3b4dd753ca356e.17473976…
Signed-off-by: Jani Nikula <jani.nikula(a)intel.com>
(cherry picked from commit dcab7a228f4ea9cda3f5b0a1f0679e046d23d7f7)
Signed-off-by: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>

diff --git a/drivers/gpu/drm/i915/display/intel_display_driver.c b/drivers/gpu/drm/i915/display/intel_display_driver.c
index 5c74ab5fd1aa..411fe7b918a7 100644
--- a/drivers/gpu/drm/i915/display/intel_display_driver.c
+++ b/drivers/gpu/drm/i915/display/intel_display_driver.c
@@ -244,31 +244,45 @@ int intel_display_driver_probe_noirq(struct intel_display *display)
 	intel_dmc_init(display);

 	display->wq.modeset = alloc_ordered_workqueue("i915_modeset", 0);
+	if (!display->wq.modeset) {
+		ret = -ENOMEM;
+		goto cleanup_vga_client_pw_domain_dmc;
+	}
+
 	display->wq.flip = alloc_workqueue("i915_flip", WQ_HIGHPRI |
 						WQ_UNBOUND, WQ_UNBOUND_MAX_ACTIVE);
+	if (!display->wq.flip) {
+		ret = -ENOMEM;
+		goto cleanup_wq_modeset;
+	}
+
 	display->wq.cleanup = alloc_workqueue("i915_cleanup", WQ_HIGHPRI, 0);
+	if (!display->wq.cleanup) {
+		ret = -ENOMEM;
+		goto cleanup_wq_flip;
+	}

 	intel_mode_config_init(display);

 	ret = intel_cdclk_init(display);
 	if (ret)
-		goto cleanup_vga_client_pw_domain_dmc;
+		goto cleanup_wq_cleanup;

 	ret = intel_color_init(display);
 	if (ret)
-		goto cleanup_vga_client_pw_domain_dmc;
+		goto cleanup_wq_cleanup;

 	ret = intel_dbuf_init(display);
 	if (ret)
-		goto cleanup_vga_client_pw_domain_dmc;
+		goto cleanup_wq_cleanup;

 	ret = intel_bw_init(display);
 	if (ret)
-		goto cleanup_vga_client_pw_domain_dmc;
+		goto cleanup_wq_cleanup;

 	ret = intel_pmdemand_init(display);
 	if (ret)
-		goto cleanup_vga_client_pw_domain_dmc;
+		goto cleanup_wq_cleanup;

 	intel_init_quirks(display);

@@ -276,6 +290,12 @@ int intel_display_driver_probe_noirq(struct intel_display *display)

 	return 0;

+cleanup_wq_cleanup:
+	destroy_workqueue(display->wq.cleanup);
+cleanup_wq_flip:
+	destroy_workqueue(display->wq.flip);
+cleanup_wq_modeset:
+	destroy_workqueue(display->wq.modeset);
 cleanup_vga_client_pw_domain_dmc:
 	intel_dmc_fini(display);
 	intel_power_domains_driver_remove(display);
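The patch follows the usual kernel unwind idiom: each allocation gets its own failure check, and the error labels are stacked in reverse order of acquisition so that a jump to any label releases exactly what was set up before it. The userspace sketch below shows that shape with malloc() standing in for the workqueue allocators. The struct queues type, alloc_queue(), probe() and the fail_at failure-injection parameter are all hypothetical and exist only to make the pattern testable.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the three workqueues in the patch. */
struct queues {
	void *modeset;
	void *flip;
	void *cleanup;
};

/* Simulate an allocator: step `idx` fails when it equals `fail_at`. */
static void *alloc_queue(int idx, int fail_at)
{
	return idx == fail_at ? NULL : malloc(1);
}

/*
 * fail_at: 0 = everything succeeds, 1..3 = the n-th allocation fails,
 * 4 = a later init step fails after all three allocations succeeded.
 */
int probe(struct queues *q, int fail_at)
{
	int ret;

	q->modeset = alloc_queue(1, fail_at);
	if (!q->modeset) {
		ret = -ENOMEM;
		goto out;
	}

	q->flip = alloc_queue(2, fail_at);
	if (!q->flip) {
		ret = -ENOMEM;
		goto cleanup_modeset;
	}

	q->cleanup = alloc_queue(3, fail_at);
	if (!q->cleanup) {
		ret = -ENOMEM;
		goto cleanup_flip;
	}

	if (fail_at == 4) {
		ret = -EINVAL;
		goto cleanup_cleanup;
	}

	return 0;

	/* Labels run in reverse order of acquisition and fall through. */
cleanup_cleanup:
	free(q->cleanup);
	q->cleanup = NULL;
cleanup_flip:
	free(q->flip);
	q->flip = NULL;
cleanup_modeset:
	free(q->modeset);
	q->modeset = NULL;
out:
	return ret;
}
```

Because the labels fall through, adding a fourth resource only needs one new check and one new label at the top of the unwind section, which is roughly what the patch does for each workqueue.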
