Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.

6 Oct 2025

      On 02.10.25 16:07, Harry Yoo wrote:
...
On Wed, Sep 24, 2025 at 05:52:14PM +0200, David Hildenbrand wrote:
...
On 24.09.25 13:54, Harry Yoo wrote:
...
On Tue, Sep 23, 2025 at 04:09:06PM +0200, David Hildenbrand wrote:
...
On 23.09.25 13:46, Harry Yoo wrote:
...
On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
...
On 22.09.25 01:27, Harry Yoo wrote:
In case is_swap_pmd() or pmd_trans_huge() returned true, but another
kernel thread splits THP after we checked it, __split_huge_pmd() or
change_huge_pmd() will just return without actually splitting or changing
pmd entry, if it turns out that evaluating
(is_swap_pmd() || pmd_trans_huge() || pmd_devmap()) as true
was false positive due to race condition, because they both double check
after acquiring pmd lock:

__split_huge_pmd() checks if it's either pmd_trans_huge(), pmd_devmap()

or is_pmd_migration_entry() under pmd lock.

change_huge_pmd() checks if it's either is_swap_pmd(),

pmd_trans_huge(), or pmd_devmap() under pmd lock.
And if either function simply returns because it was not a THP,
pmd migration entry, or pmd devmap, khugepaged cannot colleapse
huge page because we're holding mmap_lock in read mode.
And then we call change_pte_range() and that's safe.
...
After that, I'm not sure ... maybe we'll just retry
Or as you mentioned, if we are misled into thinking it is not a THP,
PMD devmap, or swap PMD due to race condition, we'd end up going into
change_pte_range().
...
or we'll accidentally try treating it as a PTE table.
But then pmd_trans_unstable() check should prevent us from treating
it as PTE table (and we're still holding mmap_lock here).
In such case we don't retry but skip it instead.
...
Looks like
pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would
I think you mean
pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad()?
Yes!
...
...
return "0"
in case we hit migration entry? :/
pmd_none_or_trans_huge_or_clear_bad() open-coded is_swap_pmd(), as it
eventually checks !pmd_none() && !pmd_present() case.
Apologies for the late reply.
...
Ah, right, I missed the pmd_present() while skimming over this extremely
horrible function.
So pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad() would return
"1" and make us retry.
We don't retry in pre-6.5 kernels because retrying is a new behavior
after commit 670ddd8cdcbd1.
:/
...
...
...
...
...
It'd be more robust to do something like:
That's also what I had in mind. But all this lockless stuff makes me a bit
nervous :)
Yeah the code is not very straightforward... :/
But technically the diff that I pasted here should be enough to fix
this... or do you have any alternative approach in mind?
Hopefully, I'm not convinced this code is not buggy, but at least regarding
concurrent migration it should be fine with that.
I've been thinking about this...
Actually, it'll make more sense to open-code what pte_map_offset_lock()
does in the mainline:

do not remove the "bad pte" checks, because pte_offset_map() in pre-6.5
 kernels doesn't do the check for us unlike the mainline.
check is_swap_pmd(), pmd_trans_huge(), pmd_devmap() without ptl, but
 atomically.
after acquiring ptl in change_pte_range(), check if pmd has changed
 since step 1 and 2. if yes, retry (like mainline). if no, we're all good.

What do you think?
Only for -stable, right? Does not sound too wrong for me, but I would 
have to take a closer look at the end result!
-- 
Cheers

David / dhildenb

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.