Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.

24 Sep 2025

      On 24.09.25 13:54, Harry Yoo wrote:
...
On Tue, Sep 23, 2025 at 04:09:06PM +0200, David Hildenbrand wrote:
...
On 23.09.25 13:46, Harry Yoo wrote:
...
On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
...
On 22.09.25 01:27, Harry Yoo wrote:
This is all because we are trying to be smart and walking page tables
without the page table lock held. This is just absolutely nasty.
commit 175ad4f1e7a2 ("mm: mprotect: use pmd_trans_unstable instead of
taking the pmd_lock") did this :(
Right. I can understand why we would not want to grab the lock when there is
a leaf page table. But everything else is just asking for trouble (as we saw
:) ).
...
...
What about the following check:
if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
Couldn't we have a similar race there when we are concurrently migrating?
An excellent point! I agree that there could be a similar race,
but with something other than "bad pmd" error.
Right, instead we'd go into change_pte_range() where we check
pmd_trans_unstable().
Uh, my brain hurts again... :)
Yeah, that's why I gave up faster than you trying to understand these 
horrible, horrible functions in the stable kernel :)
...
In case is_swap_pmd() or pmd_trans_huge() returned true, but another
kernel thread splits THP after we checked it, __split_huge_pmd() or
change_huge_pmd() will just return without actually splitting or changing
pmd entry, if it turns out that evaluating
(is_swap_pmd() || pmd_trans_huge() || pmd_devmap()) as true
was false positive due to race condition, because they both double check
after acquiring pmd lock:

__split_huge_pmd() checks if it's either pmd_trans_huge(), pmd_devmap()

or is_pmd_migration_entry() under pmd lock.

change_huge_pmd() checks if it's either is_swap_pmd(),

pmd_trans_huge(), or pmd_devmap() under pmd lock.
And if either function simply returns because it was not a THP,
pmd migration entry, or pmd devmap, khugepaged cannot colleapse
huge page because we're holding mmap_lock in read mode.
And then we call change_pte_range() and that's safe.
...
After that, I'm not sure ... maybe we'll just retry
Or as you mentioned, if we are misled into thinking it is not a THP,
PMD devmap, or swap PMD due to race condition, we'd end up going into
change_pte_range().
...
or we'll accidentally try treating it as a PTE table.
But then pmd_trans_unstable() check should prevent us from treating
it as PTE table (and we're still holding mmap_lock here).
In such case we don't retry but skip it instead.
...
Looks like
pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would
I think you mean
pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad()?
Yes!
...
...
return "0"
in case we hit migration entry? :/
pmd_none_or_trans_huge_or_clear_bad() open-coded is_swap_pmd(), as it
eventually checks !pmd_none() && !pmd_present() case.
Ah, right, I missed the pmd_present() while skimming over this extremely 
horrible function.
So pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad() would 
return "1" and make us retry.
...
...
...
It'd be more robust to do something like:
That's also what I had in mind. But all this lockless stuff makes me a bit
nervous :)
Yeah the code is not very straightforward... :/
But technically the diff that I pasted here should be enough to fix
this... or do you have any alternative approach in mind?
Hopefully, I'm not convinced this code is not buggy, but at least 
regarding concurrent migration it should be fine with that.
-- 
Cheers

David / dhildenb

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.