On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
On 22.09.25 01:27, Harry Yoo wrote:
Hi. This is supposed to be a patch, but I think it's worth discussing how it should be backported to -stable, so I've labeled it as [DISCUSSION].
The bug described below was unintentionally fixed in v6.5 and not backported to -stable. So technically I would need to use "Option 3" [A],
What is option 3?
Citing Option 3 from [A]:
Send the patch, after verifying that it follows the above rules, to stable@vger.kernel.org and mention the kernel versions you wish it to be applied to. When doing so, you must note the upstream commit ID in the changelog of your submission with a separate line above the commit text,
like this: commit <sha1> upstream.
Or alternatively: [ Upstream commit <sha1> ]
If the submitted patch deviates from the original upstream patch (for example because it had to be adjusted for the older API), this must be very clearly documented and justified in the patch description.
Just to clarify: it's fine to do a backport of a commit even though it was not tagged as a fix.
Thanks for looking into it, David!
Ok, I was worried that the original patch's description would confuse people because 1) the original patch relies on pte_offset_map_lock() being allowed to fail, which is not the case in older kernels, and 2) the patch does not mention the race condition (because it fixed the race 'accidentally' :D).
I'll backport the original patch but make it clear that:
1. while the original patch did not mention the race condition, it does fix one (and add a link to this discussion).
2. we can't remove 1) the pmd_trans_unstable() check in change_pte_range(), and 2) the "bad" pmd check in change_pmd_range(), because we don't allow pte_offset_map_lock() to fail.
3. pmd_read_atomic() is used instead of pmdp_get_lockless() because the latter does not exist in older kernels (see the sketch below).
but since the original patch [B] did not intend to fix a bug (and it's also part of a larger patch series), it looks quite different from the patch below, and I'm not sure what the backport should look like.
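(A quick reference for point 3 above: a minimal sketch of the two lockless PMD read APIs, illustrative only, not verbatim kernel code.)

	/* Older kernels: pmd_read_atomic() plus the THP barrier */
	pmd_t pmdval = pmd_read_atomic(pmd);
	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	barrier();
#endif

	/* Newer kernels, where pmd_read_atomic() no longer exists: */
	pmd_t pmdval = pmdp_get_lockless(pmd);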
I think there are probably two options:

1. Provide the description of the original patch along with a very long, detailed explanation of why the patch deviates from the upstream version, or

2. Post the patch below with a clarification that it was fixed upstream by commit 670ddd8cdcbd1.
Any thoughts?
[A] https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html#opti...
[B] https://lkml.kernel.org/r/725a42a9-91e9-c868-925-e3a5fd40bb4f@google.com (upstream commit 670ddd8cdcbd1)
In any case, no matter how we backport this, it needs some review, and feedback would be appreciated. The patch applies to v6.1, v5.15, and v5.10, but not to v5.4.
From cf45867ab8e48b42160b7253390db7bdecef1455 Mon Sep 17 00:00:00 2001
From: Harry Yoo <harry.yoo@oracle.com>
Date: Thu, 11 Sep 2025 20:05:40 +0900
Subject: [PATCH] mm, numa: fix bad pmd by atomically checking is_swap_pmd()
 in change_prot_numa()
A bad pmd was observed when automatic NUMA balancing was marking page table entries as prot_numa:
[2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02)
With some kernel modification, the call stack was dumped:
[2437548.235022] Call Trace:
[2437548.238234]  <TASK>
[2437548.241060]  dump_stack_lvl+0x46/0x61
[2437548.245689]  panic+0x106/0x2e5
[2437548.249497]  pmd_clear_bad+0x3c/0x3c
[2437548.253967]  change_pmd_range.isra.0+0x34d/0x3a7
[2437548.259537]  change_p4d_range+0x156/0x20e
[2437548.264392]  change_protection_range+0x116/0x1a9
[2437548.269976]  change_prot_numa+0x15/0x37
[2437548.274774]  task_numa_work+0x1b8/0x302
[2437548.279512]  task_work_run+0x62/0x95
[2437548.283882]  exit_to_user_mode_loop+0x1a4/0x1a9
[2437548.289277]  exit_to_user_mode_prepare+0xf4/0xfc
[2437548.294751]  ? sysvec_apic_timer_interrupt+0x34/0x81
[2437548.300677]  irqentry_exit_to_user_mode+0x5/0x25
[2437548.306153]  asm_sysvec_apic_timer_interrupt+0x16/0x1b
This is due to a race condition between change_prot_numa() and THP migration because the kernel doesn't check is_swap_pmd() and pmd_trans_huge() atomically:
change_prot_numa()                         THP migration
- change_pmd_range()
  -> is_swap_pmd() returns false,
     meaning it's not a PMD migration
     entry.
                                           - do_huge_pmd_numa_page()
                                             -> migrate_misplaced_page()
                                                sets migration entries
                                                for the THP.
- change_pmd_range()
  -> pmd_none_or_clear_bad_unless_trans_huge()
     -> pmd_none() and pmd_trans_huge()
        return false
- pmd_none_or_clear_bad_unless_trans_huge()
  -> pmd_bad() returns true for the
     migration entry!
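In code terms, the window is between two reads of the same PMD entry. A simplified sketch of the pre-fix logic (paraphrased from change_pmd_range() and its helper, not verbatim):

	/* change_pmd_range(): first, unlocked read of *pmd */
	if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
	    pmd_none_or_clear_bad_unless_trans_huge(pmd))
		goto next;

	/* ...which re-reads the entry inside the helper: */
	pmd_t pmdval = pmd_read_atomic(pmd);	/* may be a migration entry by
						 * now, even though is_swap_pmd()
						 * above returned false */
	if (pmd_none(pmdval))
		return 1;
	if (pmd_trans_huge(pmdval))		/* false for a migration entry */
		return 0;
	if (unlikely(pmd_bad(pmdval))) {	/* true for a migration entry! */
		pmd_clear_bad(pmd);		/* => the "bad pmd" report above */
		return 1;
	}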
For the race condition described above to occur:
- AutoNUMA must be unmapping a range of pages, with at least part of the
range already unmapped by AutoNUMA.
- While AutoNUMA is in the process of unmapping, a NUMA hinting fault
occurs within that range, specifically when we are about to unmap the PMD entry, between the is_swap_pmd() and pmd_trans_huge() checks.
So this is a really rare race condition; it has been observed that it usually takes a few days of AutoNUMA-intensive testing to trigger.
A bit of history on a similar race condition in the past:
In fact, a similar race condition caused by not checking pmd_trans_huge() atomically was reported [1] back in 2017. However, instead of the patch [1], another patch series [3] fixed the problem [2] by not clearing the pmd entry but invalidating it instead (so that pmd_trans_huge() would still return true).
Despite patch series [3], the bad pmd error continued to be reported in mainline. As a result, [1] was resurrected [4] and landed in mainline in 2020, in the hope that it would resolve the issue. However, it now turns out that [3] was not sufficient.
Fix this race condition by checking is_swap_pmd() and pmd_trans_huge() atomically. With that, the kernel should see either pmd_trans_huge() == true, or is_swap_pmd() == true when another task is migrating the page concurrently.
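For illustration, this is what the helper looks like with the fix applied (reconstructed from the diff below, not verbatim upstream code):

static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
{
	pmd_t pmdval = pmd_read_atomic(pmd);

	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	barrier();
#endif

	if (pmd_none(pmdval))
		return 1;
	/*
	 * A concurrently migrated THP is either still pmd_trans_huge() or
	 * already a migration entry (is_swap_pmd()); checking both against
	 * the same pmdval snapshot closes the race window.
	 */
	if (is_swap_pmd(pmdval) || pmd_trans_huge(pmdval))
		return 0;
	if (unlikely(pmd_bad(pmdval))) {
		pmd_clear_bad(pmd);
		return 1;
	}

	return 0;
}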
This bug was introduced when THP migration support was added; more specifically, by commit 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path").
It was unintentionally fixed in v6.5 by commit 670ddd8cdcbd1 ("mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()"), which removed the pmd_none_or_clear_bad_unless_trans_huge() function. But it was not backported to -stable because the fix was unintentional.
Link: https://lore.kernel.org/linux-mm/20170410094825.2yfo5zehn7pchg6a@techsingula... [1]
Link: https://lore.kernel.org/linux-mm/8A6309F4-DB76-48FA-BE7F-BF9536A4C4E5@cs.rut... [2]
Link: https://lore.kernel.org/linux-mm/20170302151034.27829-1-kirill.shutemov@linu... [3]
Link: https://lore.kernel.org/linux-mm/20200216191800.22423-1-aquini@redhat.com [4]
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Signed-off-by: Harry Yoo <harry.yoo@oracle.com>
---
 mm/mprotect.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 668bfaa6ed2a..c0e796c0f9b0 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -303,7 +303,7 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
This is like the worst function, ever :D
Glad it's removed recently :)
 	if (pmd_none(pmdval))
 		return 1;
-	if (pmd_trans_huge(pmdval))
+	if (is_swap_pmd(pmdval) || pmd_trans_huge(pmdval))
 		return 0;
 	if (unlikely(pmd_bad(pmdval))) {
 		pmd_clear_bad(pmd);
@@ -373,7 +373,7 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
 		 * Hence, it's necessary to atomically read the PMD value
 		 * for all the checks.
 		 */
-		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
+		if (!pmd_devmap(*pmd) &&
 		    pmd_none_or_clear_bad_unless_trans_huge(pmd))
 			goto next;
This is all because we are trying to be smart and walk page tables without the page table lock held. This is just absolutely nasty.
commit 175ad4f1e7a2 ("mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock") did this :(
What about the following check:
if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
Couldn't we have a similar race there when we are concurrently migrating?
An excellent point! I agree that there could be a similar race, but with something other than a "bad pmd" error.
It'd be more robust to do something like:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 668bfaa6ed2a..6feca04f9833 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -288,31 +288,6 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 	return pages;
 }
 
-/*
- * Used when setting automatic NUMA hinting protection where it is
- * critical that a numa hinting PMD is not confused with a bad PMD.
- */
-static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
-{
-	pmd_t pmdval = pmd_read_atomic(pmd);
-
-	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	barrier();
-#endif
-
-	if (pmd_none(pmdval))
-		return 1;
-	if (pmd_trans_huge(pmdval))
-		return 0;
-	if (unlikely(pmd_bad(pmdval))) {
-		pmd_clear_bad(pmd);
-		return 1;
-	}
-
-	return 0;
-}
-
 /* Return true if we're uffd wr-protecting file-backed memory, or false */
 static inline bool uffd_wp_protect_file(struct vm_area_struct *vma,
 					unsigned long cp_flags)
@@ -361,6 +336,7 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
 	pmd = pmd_offset(pud, addr);
 	do {
 		unsigned long this_pages;
+		pmd_t _pmd;
 
 		next = pmd_addr_end(addr, end);
 
@@ -373,9 +349,20 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
 		 * Hence, it's necessary to atomically read the PMD value
 		 * for all the checks.
 		 */
-		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
-		    pmd_none_or_clear_bad_unless_trans_huge(pmd))
-			goto next;
+		_pmd = pmd_read_atomic(pmd);
+		/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		barrier();
+#endif
+		if (!is_swap_pmd(_pmd) && !pmd_devmap(_pmd) &&
+		    !pmd_trans_huge(_pmd)) {
+			if (pmd_none(_pmd))
+				goto next;
+			if (unlikely(pmd_bad(_pmd))) {
+				pmd_clear_bad(pmd);
+				goto next;
+			}
+		}
 
 		/* invoke the mmu notifier if the pmd is populated */
 		if (!range.start) {
@@ -385,7 +372,7 @@ static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
 			mmu_notifier_invalidate_range_start(&range);
 		}
 
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+		if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
 			if ((next - addr != HPAGE_PMD_SIZE) ||
 			    uffd_wp_protect_file(vma, cp_flags)) {
 				__split_huge_pmd(vma, pmd, addr, false, NULL);