Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.

23 Sep 2025

      On 23.09.25 13:46, Harry Yoo wrote:
...
On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
...
On 22.09.25 01:27, Harry Yoo wrote:
...
Hi. This is supposed to be a patch, but I think it's worth discussing
how it should be backported to -stable, so I've labeled it as [DISCUSSION].
The bug described below was unintentionally fixed in v6.5 and not
backported to -stable. So technically I would need to use "Option 3" [A],
What is option 3?
Citing Option 3 from [A]:
Ah, I see.
...
...
Send the patch, after verifying that it follows the above rules, to
stable@vger.kernel.org and mention the kernel versions you wish it to be
applied to. When doing so, you must note the upstream commit ID in the
changelog of your submission with a separate line above the commit text,
like this:
commit <sha1> upstream.
Or alternatively:
[ Upstream commit <sha1> ]
If the submitted patch deviates from the original upstream patch
(for example because it had to be adjusted for the older API),
this must be very clearly documented and justified in the patch description.
...
Just to clarify: it's fine to do a backport of a commit
even though it was not tagged as a fix.
Thanks for looking into it, David!
Ok, I was worried that the original patch's description will confuse
people because 1) we don't allow pte_map_offset_lock() to fail in older
kernels, which the original patch relies on, and 2) the patch does not
mention the race condition (because it fixed the race 'accidentaly' :D).
I'll backport the original patch but make it clear that:

while the original patch did not mention the race condition,
 the patch fixes a it, and add link to this discussion.

we can't remove 1) pmd_trans_unstable() check in change_pte_range(),
 and 2) "bad" pmd check in change_pmd_range() because we don't allow
 pte_offset_map_lock() to fail().

pmd_read_atomic() is used instead of pmdp_get_lockless() beucase it
 does not exist in older kernels.

Right, and backporting all the prerequisites might not be 
feasible/desirable.
[...]
...
...
...
		goto next;

This is all because we are trying to be smart and walking page tables
without the page table lock held. This is just absolutely nasty.
commit 175ad4f1e7a2 ("mm: mprotect: use pmd_trans_unstable instead of
taking the pmd_lock") did this :(
Right. I can understand why we would not want to grab the lock when 
there is a leaf page table. But everything else is just asking for 
trouble (as we saw :) ).
...
...
What about the following check:
if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
Couldn't we have a similar race there when we are concurrently migrating?
An excellent point! I agree that there could be a similar race,
but with something other than "bad pmd" error.
Right, instead we'd go into change_pte_range() where we check 
pmd_trans_unstable().
After that, I'm not sure ... maybe we'll just retry or we'll 
accidentally try treating it as a PTE table.
Looks like 
pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would 
return "0"
in case we hit migration entry? :/
...
It'd be more robust to do something like:
That's also what I had in mind. But all this lockless stuff makes me a 
bit nervous :)
-- 
Cheers

David / dhildenb

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [DISCUSSION] Fixing bad pmd due to a race condition between change_prot_numa() and THP migration in pre-6.5 kernels.