On 02.10.25 16:07, Harry Yoo wrote:
On Wed, Sep 24, 2025 at 05:52:14PM +0200, David Hildenbrand wrote:
On 24.09.25 13:54, Harry Yoo wrote:
On Tue, Sep 23, 2025 at 04:09:06PM +0200, David Hildenbrand wrote:
On 23.09.25 13:46, Harry Yoo wrote:
On Tue, Sep 23, 2025 at 11:00:57AM +0200, David Hildenbrand wrote:
On 22.09.25 01:27, Harry Yoo wrote:
In case is_swap_pmd() or pmd_trans_huge() returned true, but another kernel thread splits THP after we checked it, __split_huge_pmd() or change_huge_pmd() will just return without actually splitting or changing pmd entry, if it turns out that evaluating (is_swap_pmd() || pmd_trans_huge() || pmd_devmap()) as true was false positive due to race condition, because they both double check after acquiring pmd lock:
- __split_huge_pmd() checks if it's either pmd_trans_huge(), pmd_devmap()
or is_pmd_migration_entry() under pmd lock.
- change_huge_pmd() checks if it's either is_swap_pmd(),
pmd_trans_huge(), or pmd_devmap() under pmd lock.
And if either function simply returns because it was not a THP, pmd migration entry, or pmd devmap, khugepaged cannot colleapse huge page because we're holding mmap_lock in read mode.
And then we call change_pte_range() and that's safe.
After that, I'm not sure ... maybe we'll just retry
Or as you mentioned, if we are misled into thinking it is not a THP, PMD devmap, or swap PMD due to race condition, we'd end up going into change_pte_range().
or we'll accidentally try treating it as a PTE table.
But then pmd_trans_unstable() check should prevent us from treating it as PTE table (and we're still holding mmap_lock here). In such case we don't retry but skip it instead.
Looks like pmd_trans_unstable()->pud_none_or_trans_huge_or_dev_or_clear_bad() would
I think you mean pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad()?
Yes!
return "0" in case we hit migration entry? :/
pmd_none_or_trans_huge_or_clear_bad() open-coded is_swap_pmd(), as it eventually checks !pmd_none() && !pmd_present() case.
Apologies for the late reply.
Ah, right, I missed the pmd_present() while skimming over this extremely horrible function.
So pmd_trans_unstable()->pmd_none_or_trans_huge_or_clear_bad() would return "1" and make us retry.
We don't retry in pre-6.5 kernels because retrying is a new behavior after commit 670ddd8cdcbd1.
:/
It'd be more robust to do something like:
That's also what I had in mind. But all this lockless stuff makes me a bit nervous :)
Yeah the code is not very straightforward... :/
But technically the diff that I pasted here should be enough to fix this... or do you have any alternative approach in mind?
Hopefully, I'm not convinced this code is not buggy, but at least regarding concurrent migration it should be fine with that.
I've been thinking about this...
Actually, it'll make more sense to open-code what pte_map_offset_lock() does in the mainline:
- do not remove the "bad pte" checks, because pte_offset_map() in pre-6.5 kernels doesn't do the check for us unlike the mainline.
- check is_swap_pmd(), pmd_trans_huge(), pmd_devmap() without ptl, but atomically.
- after acquiring ptl in change_pte_range(), check if pmd has changed since step 1 and 2. if yes, retry (like mainline). if no, we're all good.
What do you think?
Only for -stable, right? Does not sound too wrong for me, but I would have to take a closer look at the end result!