This patch series implements FIPS 140-3 compliance requirements for random
number generation in the Linux kernel 6.12. The changes ensure that when the
kernel is operating in FIPS mode, FIPS-compliant random number
generators are used instead of the default /dev/random implementation.
IMPORTANT: These two patches must be applied together as a series. Applying
only the first patch without the second will cause a deadlock during boot
in FIPS-enabled environments. The second patch fixes a critical timing issue
introduced by the first patch, where the crypto RNG attempts to override the
drivers/char/random interface before a FIPS-mode RNG is available.
The series consists of two patches:
1. Initial implementation to override drivers/char/random in FIPS mode
2. Refinement to ensure override only occurs after FIPS-mode RNGs are available
These two patches are required for FIPS 140-3 certification
and compliance in government and enterprise environments.
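For illustration only, here is a minimal sketch of the ordering rule the
second patch enforces; the helper below is hypothetical and is not the code
in this series. The idea is that the override is installed only once a
FIPS-approved "stdrng" DRBG can actually be allocated, so it can never take
effect before a FIPS-mode RNG exists:

#include <linux/err.h>
#include <linux/fips.h>
#include <crypto/rng.h>

static struct crypto_rng *fips_rng;

/* Hypothetical helper: install the override only once a FIPS DRBG exists. */
static int try_override_devrandom(void)
{
	if (!fips_enabled)
		return 0;		/* nothing to do outside FIPS mode */

	fips_rng = crypto_alloc_rng("stdrng", 0, 0);
	if (IS_ERR(fips_rng)) {
		int err = PTR_ERR(fips_rng);

		fips_rng = NULL;	/* DRBG not ready yet, retry later */
		return err;
	}

	/* ... only now redirect drivers/char/random reads to fips_rng ... */
	return 0;
}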
Herbert Xu (1):
crypto: rng - Override drivers/char/random in FIPS mode
Jay Wang (1):
Override drivers/char/random only after FIPS-mode RNGs become
available
crypto/rng.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 92 insertions(+)
--
2.47.1
The following commit has been merged into the ras/urgent branch of tip:
Commit-ID: 4c113a5b28bfd589e2010b5fc8867578b0135ed7
Gitweb: https://git.kernel.org/tip/4c113a5b28bfd589e2010b5fc8867578b0135ed7
Author: Yazen Ghannam <yazen.ghannam(a)amd.com>
AuthorDate: Tue, 24 Jun 2025 14:15:56
Committer: Borislav Petkov (AMD) <bp(a)alien8.de>
CommitterDate: Thu, 26 Jun 2025 17:28:13 +02:00
x86/mce: Don't remove sysfs if thresholding sysfs init fails
Currently, the MCE subsystem sysfs interface will be removed if the
thresholding sysfs interface fails to be created. A common failure is due to
new MCA bank types that are not recognized and don't have a short name set.
The MCA thresholding feature is optional and should not break the common MCE
sysfs interface. Also, new MCA bank types are occasionally introduced, and
updates will be needed to recognize them. But likewise, this should not break
the common sysfs interface.
Keep the MCE sysfs interface regardless of the status of the thresholding
sysfs interface.
Signed-off-by: Yazen Ghannam <yazen.ghannam(a)amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo(a)intel.com>
Reviewed-by: Tony Luck <tony.luck(a)intel.com>
Tested-by: Tony Luck <tony.luck(a)intel.com>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-1-236dd74f645f@amd.com
---
arch/x86/kernel/cpu/mce/core.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index e9b3c5d..07d6193 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2801,15 +2801,9 @@ static int mce_cpu_dead(unsigned int cpu)
static int mce_cpu_online(unsigned int cpu)
{
struct timer_list *t = this_cpu_ptr(&mce_timer);
- int ret;
mce_device_create(cpu);
-
- ret = mce_threshold_create_device(cpu);
- if (ret) {
- mce_device_remove(cpu);
- return ret;
- }
+ mce_threshold_create_device(cpu);
mce_reenable_cpu();
mce_start_timer(t);
return 0;
The following commit has been merged into the ras/urgent branch of tip:
Commit-ID: 00c092de6f28ebd32208aef83b02d61af2229b60
Gitweb: https://git.kernel.org/tip/00c092de6f28ebd32208aef83b02d61af2229b60
Author: Yazen Ghannam <yazen.ghannam(a)amd.com>
AuthorDate: Tue, 24 Jun 2025 14:15:57
Committer: Borislav Petkov (AMD) <bp(a)alien8.de>
CommitterDate: Fri, 27 Jun 2025 12:41:44 +02:00
x86/mce: Ensure user polling settings are honored when restarting timer
Users can disable MCA polling by setting the "ignore_ce" parameter or by
setting "check_interval=0". This tells the kernel to *not* start the MCE
timer on a CPU.
If the user did not disable CMCI, then storms can occur. When these
happen, the MCE timer will be started with a fixed interval. After the
storm subsides, the timer's next interval is set to check_interval.
This disregards the user's input through "ignore_ce" and
"check_interval". Furthermore, if "check_interval=0", then the new timer
will run faster than expected.
Create a new helper to check these conditions and use it when a CMCI
storm ends.
[ bp: Massage. ]
Fixes: 7eae17c4add5 ("x86/mce: Add per-bank CMCI storm mitigation")
Signed-off-by: Yazen Ghannam <yazen.ghannam(a)amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-2-236dd74f645f@amd.com
---
arch/x86/kernel/cpu/mce/core.c | 16 ++++++++++------
1 file changed, 10 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 07d6193..4da4eab 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1740,6 +1740,11 @@ static void mc_poll_banks_default(void)
void (*mc_poll_banks)(void) = mc_poll_banks_default;
+static bool should_enable_timer(unsigned long iv)
+{
+ return !mca_cfg.ignore_ce && iv;
+}
+
static void mce_timer_fn(struct timer_list *t)
{
struct timer_list *cpu_t = this_cpu_ptr(&mce_timer);
@@ -1763,7 +1768,7 @@ static void mce_timer_fn(struct timer_list *t)
if (mce_get_storm_mode()) {
__start_timer(t, HZ);
- } else {
+ } else if (should_enable_timer(iv)) {
__this_cpu_write(mce_next_interval, iv);
__start_timer(t, iv);
}
@@ -2156,11 +2161,10 @@ static void mce_start_timer(struct timer_list *t)
{
unsigned long iv = check_interval * HZ;
- if (mca_cfg.ignore_ce || !iv)
- return;
-
- this_cpu_write(mce_next_interval, iv);
- __start_timer(t, iv);
+ if (should_enable_timer(iv)) {
+ this_cpu_write(mce_next_interval, iv);
+ __start_timer(t, iv);
+ }
}
static void __mcheck_cpu_setup_timer(void)
The following commit has been merged into the ras/urgent branch of tip:
Commit-ID: d66e1e90b16055d2f0ee76e5384e3f119c3c2773
Gitweb: https://git.kernel.org/tip/d66e1e90b16055d2f0ee76e5384e3f119c3c2773
Author: Yazen Ghannam <yazen.ghannam(a)amd.com>
AuthorDate: Tue, 24 Jun 2025 14:15:58
Committer: Borislav Petkov (AMD) <bp(a)alien8.de>
CommitterDate: Fri, 27 Jun 2025 13:13:36 +02:00
x86/mce/amd: Add default names for MCA banks and blocks
Ensure that sysfs init doesn't fail for new/unrecognized bank types or if
a bank has additional blocks available.
Most MCA banks have a single thresholding block, so the block takes the same
name as the bank.
Unified Memory Controllers (UMCs) are a special case where there are two
blocks and each has a unique name.
However, the microarchitecture allows for five blocks. Any new MCA bank types
with more than one block will be missing names for the extra blocks. The MCE
sysfs will fail to initialize in this case.
Fixes: 87a6d4091bd7 ("x86/mce/AMD: Update sysfs bank names for SMCA systems")
Signed-off-by: Yazen Ghannam <yazen.ghannam(a)amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-3-236dd74f645f@amd.com
---
arch/x86/kernel/cpu/mce/amd.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 9d852c3..6820ebc 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -1113,13 +1113,20 @@ static const char *get_name(unsigned int cpu, unsigned int bank, struct threshol
}
bank_type = smca_get_bank_type(cpu, bank);
- if (bank_type >= N_SMCA_BANK_TYPES)
- return NULL;
if (b && (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2)) {
if (b->block < ARRAY_SIZE(smca_umc_block_names))
return smca_umc_block_names[b->block];
- return NULL;
+ }
+
+ if (b && b->block) {
+ snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN, "th_block_%u", b->block);
+ return buf_mcatype;
+ }
+
+ if (bank_type >= N_SMCA_BANK_TYPES) {
+ snprintf(buf_mcatype, MAX_MCATYPE_NAME_LEN, "th_bank_%u", bank);
+ return buf_mcatype;
}
if (per_cpu(smca_bank_counts, cpu)[bank_type] == 1)
The following commit has been merged into the ras/urgent branch of tip:
Commit-ID: 5f6e3b720694ad771911f637a51930f511427ce1
Gitweb: https://git.kernel.org/tip/5f6e3b720694ad771911f637a51930f511427ce1
Author: Yazen Ghannam <yazen.ghannam(a)amd.com>
AuthorDate: Tue, 24 Jun 2025 14:15:59
Committer: Borislav Petkov (AMD) <bp(a)alien8.de>
CommitterDate: Fri, 27 Jun 2025 13:16:23 +02:00
x86/mce/amd: Fix threshold limit reset
The MCA threshold limit must be reset after servicing the interrupt.
Currently, the restart function doesn't have an explicit check for this. It
makes some assumptions based on the current limit and what's in the registers.
These assumptions don't always hold, so the limit won't be reset in some
cases.
Make the reset condition explicit. Either an interrupt/overflow has occurred
or the bank is being initialized.
Signed-off-by: Yazen Ghannam <yazen.ghannam(a)amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/20250624-wip-mca-updates-v4-4-236dd74f645f@amd.com
---
arch/x86/kernel/cpu/mce/amd.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 6820ebc..5c4eb28 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -350,7 +350,6 @@ static void smca_configure(unsigned int bank, unsigned int cpu)
struct thresh_restart {
struct threshold_block *b;
- int reset;
int set_lvt_off;
int lvt_off;
u16 old_limit;
@@ -432,13 +431,13 @@ static void threshold_restart_bank(void *_tr)
rdmsr(tr->b->address, lo, hi);
- if (tr->b->threshold_limit < (hi & THRESHOLD_MAX))
- tr->reset = 1; /* limit cannot be lower than err count */
-
- if (tr->reset) { /* reset err count and overflow bit */
- hi =
- (hi & ~(MASK_ERR_COUNT_HI | MASK_OVERFLOW_HI)) |
- (THRESHOLD_MAX - tr->b->threshold_limit);
+ /*
+ * Reset error count and overflow bit.
+ * This is done during init or after handling an interrupt.
+ */
+ if (hi & MASK_OVERFLOW_HI || tr->set_lvt_off) {
+ hi &= ~(MASK_ERR_COUNT_HI | MASK_OVERFLOW_HI);
+ hi |= THRESHOLD_MAX - tr->b->threshold_limit;
} else if (tr->old_limit) { /* change limit w/o reset */
int new_count = (hi & THRESHOLD_MAX) +
(tr->old_limit - tr->b->threshold_limit);
From: Lance Yang <lance.yang(a)linux.dev>
As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
can read past the end of a PTE table if a large folio is mapped starting at
the last entry of that table. It would be quite rare in practice, as
MADV_FREE typically splits the large folio ;)
So let's fix the potential out-of-bounds read by refactoring the logic into
a new helper, folio_unmap_pte_batch().
The new helper now correctly calculates the safe number of pages to scan by
limiting the operation to the boundaries of the current VMA and the PTE
table.
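For example (hypothetical addresses, x86-64 with 4 KiB pages and 2 MiB page
tables), if a 16-page folio has its first page mapped at the very last PTE
slot of a table, the clamping works out as:

	addr     = 0x7f00001ff000                    /* last PTE slot of the table */
	end_addr = pmd_addr_end(addr, vma->vm_end)   /* = 0x7f0000200000           */
	max_nr   = (end_addr - addr) >> PAGE_SHIFT   /* = 1                        */

so folio_pte_batch() is capped at one entry instead of scanning 15 PTE slots
past the end of the table.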
In addition, the "all-or-nothing" batching restriction is removed to
support partial batches. The reference counting is also cleaned up to use
folio_put_refs().
[1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redha…
Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
Cc: <stable(a)vger.kernel.org>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Suggested-by: Barry Song <baohua(a)kernel.org>
Signed-off-by: Lance Yang <lance.yang(a)linux.dev>
---
v1 -> v2:
- Update subject and changelog (per Barry)
- https://lore.kernel.org/linux-mm/20250627025214.30887-1-lance.yang@linux.dev
mm/rmap.c | 46 ++++++++++++++++++++++++++++------------------
1 file changed, 28 insertions(+), 18 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index fb63d9256f09..1320b88fab74 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1845,23 +1845,32 @@ void folio_remove_rmap_pud(struct folio *folio, struct page *page,
#endif
}
-/* We support batch unmapping of PTEs for lazyfree large folios */
-static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
- struct folio *folio, pte_t *ptep)
+static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
+ struct page_vma_mapped_walk *pvmw,
+ enum ttu_flags flags, pte_t pte)
{
const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
- int max_nr = folio_nr_pages(folio);
- pte_t pte = ptep_get(ptep);
+ unsigned long end_addr, addr = pvmw->address;
+ struct vm_area_struct *vma = pvmw->vma;
+ unsigned int max_nr;
+
+ if (flags & TTU_HWPOISON)
+ return 1;
+ if (!folio_test_large(folio))
+ return 1;
+ /* We may only batch within a single VMA and a single page table. */
+ end_addr = pmd_addr_end(addr, vma->vm_end);
+ max_nr = (end_addr - addr) >> PAGE_SHIFT;
+
+ /* We only support lazyfree batching for now ... */
if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
- return false;
+ return 1;
if (pte_unused(pte))
- return false;
- if (pte_pfn(pte) != folio_pfn(folio))
- return false;
+ return 1;
- return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
- NULL, NULL) == max_nr;
+ return folio_pte_batch(folio, addr, pvmw->pte, pte, max_nr, fpb_flags,
+ NULL, NULL, NULL);
}
/*
@@ -2024,9 +2033,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (pte_dirty(pteval))
folio_mark_dirty(folio);
} else if (likely(pte_present(pteval))) {
- if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
- can_batch_unmap_folio_ptes(address, folio, pvmw.pte))
- nr_pages = folio_nr_pages(folio);
+ nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
end_addr = address + nr_pages * PAGE_SIZE;
flush_cache_range(vma, address, end_addr);
@@ -2206,13 +2213,16 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
hugetlb_remove_rmap(folio);
} else {
folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
- folio_ref_sub(folio, nr_pages - 1);
}
if (vma->vm_flags & VM_LOCKED)
mlock_drain_local();
- folio_put(folio);
- /* We have already batched the entire folio */
- if (nr_pages > 1)
+ folio_put_refs(folio, nr_pages);
+
+ /*
+ * If we are sure that we batched the entire folio and cleared
+ * all PTEs, we can just optimize and stop right here.
+ */
+ if (nr_pages == folio_nr_pages(folio))
goto walk_done;
continue;
walk_abort:
--
2.49.0
The patch titled
Subject: mm/shmem, swap: improve cached mTHP handling and fix potential hung
has been added to the -mm mm-new branch. Its filename is
mm-shmem-swap-improve-cached-mthp-handling-and-fix-potential-hung.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fix up patches in mm-new.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Kairui Song <kasong(a)tencent.com>
Subject: mm/shmem, swap: improve cached mTHP handling and fix potential hung
Date: Fri, 27 Jun 2025 14:20:14 +0800
Patch series "mm/shmem, swap: bugfix and improvement of mTHP swap in", v3.
The current mTHP swapin path has some problems: it may hang, it may cause
redundant faults due to false-positive swap cache lookups, and it involves
at least four Xarray tree walks (get order, get order again, confirm swap,
insert folio). Even !CONFIG_TRANSPARENT_HUGEPAGE builds perform some mTHP
related checks.
This series fixes all of the issues above and makes the code more robust
and better prepared for the swap table series. Tree walks are now reduced
to two (get order & confirm, insert folio), and more sanity checks and
comments have been added. The !CONFIG_TRANSPARENT_HUGEPAGE build overhead
is also minimized and now comes with a sanity check.
Performance is slightly better after this series. The test is a sequential
swap-in of 24G of data from ZRAM, using transparent_hugepage_tmpfs=always
(24 samples each):
Before: 11.17, Standard Deviation: 0.02
After patch 1: 10.89, Standard Deviation: 0.05
After patch 2: 10.84, Standard Deviation: 0.03
After patch 3: 10.91, Standard Deviation: 0.03
After patch 4: 10.86, Standard Deviation: 0.03
After patch 5: 10.07, Standard Deviation: 0.04
After patch 7: 10.09, Standard Deviation: 0.03
Each patch improves performance a little, for a total improvement of about
10%.
A kernel build test showed a very slight improvement: make -j24 with
defconfig in a 256M memcg, also using ZRAM as swap, and
transparent_hugepage_tmpfs=always (6 test runs):
Before: system time avg: 3911.80s
After: system time avg: 3863.76s
This patch (of 7):
The current swap-in code assumes that, when a swap entry in the shmem
mapping is order 0, its cached folio (if present) must be order 0 too,
which turns out to not always be the case.
The problem is that shmem_split_large_entry() is called before verifying
that the folio will eventually be swapped in. One possible race:
CPU1                                 CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
folio = swap_cache_get_folio
/* folio = NULL */
order = xa_get_order
/* order > 0 */
folio = shmem_swap_alloc_folio
/* mTHP alloc failure, folio = NULL */
<... Interrupted ...>
                                     shmem_swapin_folio
                                     /* S1 is swapped in */
                                     shmem_writeout
                                     /* S1 is swapped out, folio cached */
shmem_split_large_entry(..., S1)
/* S1 is split, but the folio covering it has order > 0 now */
Now any following swapin of S1 will hang: `xa_get_order` returns 0, while
the folio lookup returns a folio with order > 0, so the
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` check always
triggers and swap-in keeps returning -EEXIST.
This is fragile, so fix it by allowing a larger folio to be seen in the
swap cache, and by checking that the whole shmem mapping range covered by
the swapin has the right swap value upon inserting the folio. Also drop the
redundant tree walks before the insertion.
This actually improves performance, as it avoids two redundant Xarray tree
walks in the hot path. The only side effect is that, in the failure path,
shmem may redundantly reallocate a few folios, causing slight temporary
memory pressure.
Worth noting: it may seem that the order and value check before inserting
would help reduce lock contention, but that is not the case. The swap cache
layer ensures that a raced swapin will either see a swap cache folio or
fail to do the swapin (the SWAP_HAS_CACHE bit is set even if the swap cache
is bypassed), so holding the folio lock and checking the folio flag is
already enough to avoid lock contention. The chance that a folio passes the
swap entry value check while the shmem mapping slot has changed should be
very low.
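As a concrete illustration of that insertion-time range check, here is a
simplified userspace model (plain integers, not the real Xarray walk; all
names are made up): every slot covered by the folio must hold the expected
swap value for its offset, and the accumulated value must land exactly at
expected + nr.

#include <stdbool.h>

struct slot { unsigned long val; unsigned int order; };

static bool range_matches(unsigned long expected, unsigned long nr,
			  const struct slot *slots, unsigned int nslots)
{
	unsigned long iter = expected;
	unsigned int i;

	for (i = 0; i < nslots; i++) {
		if (slots[i].val != iter)
			return false;			/* hole or foreign entry */
		iter += 1UL << slots[i].order;		/* skip the entry's span */
	}
	return iter - nr == expected;			/* covered the whole range */
}

/*
 * range_matches(100, 4, (struct slot[]){ {100, 2} }, 1)                        -> true
 * range_matches(100, 4, (struct slot[]){ {100,0},{101,0},{102,0},{103,0} }, 4) -> true
 * range_matches(100, 4, (struct slot[]){ {100, 0}, {102, 0} }, 2)              -> false
 */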
Link: https://lkml.kernel.org/r/20250627062020.534-2-ryncsn@gmail.com
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song <kasong(a)tencent.com>
Reviewed-by: Kemeng Shi <shikemeng(a)huaweicloud.com>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Baoquan He <bhe(a)redhat.com>
Cc: Barry Song <baohua(a)kernel.org>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: Hugh Dickins <hughd(a)google.com>
Cc: Matthew Wilcox (Oracle) <willy(a)infradead.org>
Cc: Nhat Pham <nphamcs(a)gmail.com>
Cc: Dev Jain <dev.jain(a)arm.com>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/shmem.c | 30 +++++++++++++++++++++---------
1 file changed, 21 insertions(+), 9 deletions(-)
--- a/mm/shmem.c~mm-shmem-swap-improve-cached-mthp-handling-and-fix-potential-hung
+++ a/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struc
pgoff_t index, void *expected, gfp_t gfp)
{
XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
- long nr = folio_nr_pages(folio);
+ unsigned long nr = folio_nr_pages(folio);
+ swp_entry_t iter, swap;
+ void *entry;
VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struc
gfp &= GFP_RECLAIM_MASK;
folio_throttle_swaprate(folio, gfp);
+ swap = iter = radix_to_swp_entry(expected);
do {
xas_lock_irq(&xas);
- if (expected != xas_find_conflict(&xas)) {
- xas_set_err(&xas, -EEXIST);
- goto unlock;
+ xas_for_each_conflict(&xas, entry) {
+ /*
+ * The range must either be empty, or filled with
+ * expected swap entries. Shmem swap entries are never
+ * partially freed without split of both entry and
+ * folio, so there shouldn't be any holes.
+ */
+ if (!expected || entry != swp_to_radix_entry(iter)) {
+ xas_set_err(&xas, -EEXIST);
+ goto unlock;
+ }
+ iter.val += 1 << xas_get_order(&xas);
}
- if (expected && xas_find_conflict(&xas)) {
+ if (expected && iter.val - nr != swap.val) {
xas_set_err(&xas, -EEXIST);
goto unlock;
}
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct ino
error = -ENOMEM;
goto failed;
}
- } else if (order != folio_order(folio)) {
+ } else if (order > folio_order(folio)) {
/*
* Swap readahead may swap in order 0 folios into swapcache
* asynchronously, while the shmem mapping can still stores
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct ino
swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
}
+ } else if (order < folio_order(folio)) {
+ swap.val = round_down(swap.val, 1 << folio_order(folio));
}
alloced:
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
- folio->swap.val != swap.val ||
- !shmem_confirm_swap(mapping, index, swap) ||
- xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+ folio->swap.val != swap.val) {
error = -EEXIST;
goto unlock;
}
_
Patches currently in -mm which might be from kasong(a)tencent.com are
mm-list_lru-refactor-the-locking-code.patch
mm-shmem-swap-improve-cached-mthp-handling-and-fix-potential-hung.patch
mm-shmem-swap-avoid-redundant-xarray-lookup-during-swapin.patch
mm-shmem-swap-tidy-up-thp-swapin-checks.patch
mm-shmem-swap-clean-up-swap-entry-splitting.patch
mm-shmem-swap-never-use-swap-cache-and-readahead-for-swp_synchronous_io.patch
mm-shmem-swap-fix-major-fault-counting.patch
mm-shmem-swap-avoid-false-positive-swap-cache-lookup.patch
The patch titled
Subject: mm/rmap: fix potential out-of-bounds page table access during batched unmap
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-rmap-fix-potential-out-of-bounds-page-table-access-during-batched-unmap.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Lance Yang <lance.yang(a)linux.dev>
Subject: mm/rmap: fix potential out-of-bounds page table access during batched unmap
Date: Fri, 27 Jun 2025 14:23:19 +0800
As pointed out by David[1], the batched unmap logic in try_to_unmap_one()
can read past the end of a PTE table if a large folio is mapped starting
at the last entry of that table. It would be quite rare in practice, as
MADV_FREE typically splits the large folio ;)
So let's fix the potential out-of-bounds read by refactoring the logic
into a new helper, folio_unmap_pte_batch().
The new helper now correctly calculates the safe number of pages to scan
by limiting the operation to the boundaries of the current VMA and the PTE
table.
In addition, the "all-or-nothing" batching restriction is removed to
support partial batches. The reference counting is also cleaned up to use
folio_put_refs().
[1] https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redha…
Link: https://lkml.kernel.org/r/20250627062319.84936-1-lance.yang@linux.dev
Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation")
Signed-off-by: Lance Yang <lance.yang(a)linux.dev>
Suggested-by: David Hildenbrand <david(a)redhat.com>
Suggested-by: Barry Song <baohua(a)kernel.org>
Cc: Baolin Wang <baolin.wang(a)linux.alibaba.com>
Cc: Barry Song <v-songbaohua(a)oppo.com>
Cc: Chris Li <chrisl(a)kernel.org>
Cc: "Huang, Ying" <huang.ying.caritas(a)gmail.com>
Cc: Kairui Song <kasong(a)tencent.com>
Cc: Lance Yang <lance.yang(a)linux.dev>
Cc: Liam Howlett <liam.howlett(a)oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes(a)oracle.com>
Cc: Mingzhe Yang <mingzhe.yang(a)ly.com>
Cc: Rik van Riel <riel(a)surriel.com>
Cc: Ryan Roberts <ryan.roberts(a)arm.com>
Cc: Tangquan Zheng <zhengtangquan(a)oppo.com>
Cc: Vlastimil Babka <vbabka(a)suse.cz>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
mm/rmap.c | 46 ++++++++++++++++++++++++++++------------------
1 file changed, 28 insertions(+), 18 deletions(-)
--- a/mm/rmap.c~mm-rmap-fix-potential-out-of-bounds-page-table-access-during-batched-unmap
+++ a/mm/rmap.c
@@ -1845,23 +1845,32 @@ void folio_remove_rmap_pud(struct folio
#endif
}
-/* We support batch unmapping of PTEs for lazyfree large folios */
-static inline bool can_batch_unmap_folio_ptes(unsigned long addr,
- struct folio *folio, pte_t *ptep)
+static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
+ struct page_vma_mapped_walk *pvmw,
+ enum ttu_flags flags, pte_t pte)
{
const fpb_t fpb_flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
- int max_nr = folio_nr_pages(folio);
- pte_t pte = ptep_get(ptep);
+ unsigned long end_addr, addr = pvmw->address;
+ struct vm_area_struct *vma = pvmw->vma;
+ unsigned int max_nr;
+
+ if (flags & TTU_HWPOISON)
+ return 1;
+ if (!folio_test_large(folio))
+ return 1;
+
+ /* We may only batch within a single VMA and a single page table. */
+ end_addr = pmd_addr_end(addr, vma->vm_end);
+ max_nr = (end_addr - addr) >> PAGE_SHIFT;
+ /* We only support lazyfree batching for now ... */
if (!folio_test_anon(folio) || folio_test_swapbacked(folio))
- return false;
+ return 1;
if (pte_unused(pte))
- return false;
- if (pte_pfn(pte) != folio_pfn(folio))
- return false;
+ return 1;
- return folio_pte_batch(folio, addr, ptep, pte, max_nr, fpb_flags, NULL,
- NULL, NULL) == max_nr;
+ return folio_pte_batch(folio, addr, pvmw->pte, pte, max_nr, fpb_flags,
+ NULL, NULL, NULL);
}
/*
@@ -2024,9 +2033,7 @@ static bool try_to_unmap_one(struct foli
if (pte_dirty(pteval))
folio_mark_dirty(folio);
} else if (likely(pte_present(pteval))) {
- if (folio_test_large(folio) && !(flags & TTU_HWPOISON) &&
- can_batch_unmap_folio_ptes(address, folio, pvmw.pte))
- nr_pages = folio_nr_pages(folio);
+ nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
end_addr = address + nr_pages * PAGE_SIZE;
flush_cache_range(vma, address, end_addr);
@@ -2206,13 +2213,16 @@ discard:
hugetlb_remove_rmap(folio);
} else {
folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
- folio_ref_sub(folio, nr_pages - 1);
}
if (vma->vm_flags & VM_LOCKED)
mlock_drain_local();
- folio_put(folio);
- /* We have already batched the entire folio */
- if (nr_pages > 1)
+ folio_put_refs(folio, nr_pages);
+
+ /*
+ * If we are sure that we batched the entire folio and cleared
+ * all PTEs, we can just optimize and stop right here.
+ */
+ if (nr_pages == folio_nr_pages(folio))
goto walk_done;
continue;
walk_abort:
_
Patches currently in -mm which might be from lance.yang(a)linux.dev are
mm-rmap-fix-potential-out-of-bounds-page-table-access-during-batched-unmap.patch
The "is_waiting" flag was updated after calling complete(), which could
lead to a race where the waiting thread wakes up before the flag is
cleared, may cause a missed wakeup or stale state check.
Reorder the operations to update "is_waiting" before signaling completion
to ensure consistent state.
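As a generic illustration of the hazard (the struct and function names below
are hypothetical, not mpi3mr's actual command bookkeeping), a waiter can
observe a stale flag if the completer signals before clearing it, so the
completer must clear the flag first:

#include <linux/completion.h>

struct demo_cmd {
	struct completion done;		/* assumed initialized elsewhere */
	int is_waiting;
};

static void demo_completer(struct demo_cmd *c)
{
	if (c->is_waiting) {
		c->is_waiting = 0;	/* clear the state first ...   */
		complete(&c->done);	/* ... then wake the waiter up */
	}
}

static void demo_waiter(struct demo_cmd *c)
{
	c->is_waiting = 1;
	/* ... submit the command ... */
	wait_for_completion(&c->done);
	/* With the pre-fix ordering, is_waiting could still read as 1 here. */
}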
Fixes: 824a156633df ("scsi: mpi3mr: Base driver code")
Cc: stable(a)vger.kernel.org
Signed-off-by: Chandrakanth Patil <chandrakanth.patil(a)broadcom.com>
Signed-off-by: Ranjan Kumar <ranjan.kumar(a)broadcom.com>
---
drivers/scsi/mpi3mr/mpi3mr_fw.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/scsi/mpi3mr/mpi3mr_fw.c b/drivers/scsi/mpi3mr/mpi3mr_fw.c
index 1d7901a8f0e4..0186676698d4 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_fw.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_fw.c
@@ -428,8 +428,8 @@ static void mpi3mr_process_admin_reply_desc(struct mpi3mr_ioc *mrioc,
MPI3MR_SENSE_BUF_SZ);
}
if (cmdptr->is_waiting) {
- complete(&cmdptr->done);
cmdptr->is_waiting = 0;
+ complete(&cmdptr->done);
} else if (cmdptr->callback)
cmdptr->callback(mrioc, cmdptr);
}
--
2.31.1