On 24.09.25 13:54, Harry Yoo wrote:
On Tue, Sep 23, 2025 at 04:09:06PM +0200, David Hildenbrand wrote:
On 23.09.25 13:46, Harry Yoo wrote:
On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
On 22.09.25 01:27, Harry Yoo wrote: This is all because we are trying to be smart and walking page tables without the page table lock held. This is just absolutely nasty.
commit 175ad4f1e7a2 ("mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock") did this :(
Right. I can understand why we would not want to grab the lock when there is a leaf page table. But everything else is just asking for trouble (as we saw :) ).
What about the following check:
if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
Couldn't we have a similar race there when we are concurrently migrating?
An excellent point! I agree that there could be a similar race, but with something other than "bad pmd" error.
Right, instead we'd go into change_pte_range() where we check pmd_trans_unstable().
Uh, my brain hurts again... :)
Yeah, that's why I gave up faster than you trying to understand these horrible, horrible functions in the stable kernel :)
In case is_swap_pmd() or pmd_trans_huge() returned true, but another kernel thread splits THP after we checked it, __split_huge_pmd() or change_huge_pmd() will just return without actually splitting or changing pmd entry, if it turns out that evaluating (is_swap_pmd() || pmd_trans_huge() || pmd_devmap()) as true was false positive due to race condition, because they both double check after acquiring pmd lock:
- __split_huge_pmd() checks if it's either pmd_trans_huge(), pmd_devmap()
or is_pmd_migration_entry() under pmd lock.
- change_huge_pmd() checks if it's either is_swap_pmd(),
pmd_trans_huge(), or pmd_devmap() under pmd lock.
And if either function simply returns because it was not a THP, pmd migration entry, or pmd devmap, khugepaged cannot colleapse huge page because we're holding mmap_lock in read mode.
And then we call change_pte_range() and that's safe.
After that, I'm not sure ... maybe we'll just retry
Or as you mentioned, if we are misled into thinking it is not a THP, PMD devmap, or swap PMD due to race condition, we'd end up going into change_pte_range().
or we'll accidentally try treating it as a PTE table.
But then pmd_trans_unstable() check should prevent us from treating it as PTE table (and we're still holding mmap_lock here). In such case we don't retry but skip it instead.
Looks like pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would
I think you mean pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad()?
Yes!
return "0" in case we hit migration entry? :/
pmd_none_or_trans_huge_or_clear_bad() open-coded is_swap_pmd(), as it eventually checks !pmd_none() && !pmd_present() case.
Ah, right, I missed the pmd_present() while skimming over this extremely horrible function.
So pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad() would return "1" and make us retry.
It'd be more robust to do something like:
That's also what I had in mind. But all this lockless stuff makes me a bit nervous :)
Yeah the code is not very straightforward... :/
But technically the diff that I pasted here should be enough to fix this... or do you have any alternative approach in mind?
Hopefully, I'm not convinced this code is not buggy, but at least regarding concurrent migration it should be fine with that.