When migrating a THP, concurrent access to the PMD migration entry during a deferred split scan can lead to a page fault, as illustrated below. To prevent this page fault, it is necessary to check for the PMD migration entry and return early. In this context, there is no need to use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality of the target folio. Since the folio backing the PMD migration entry is locked, it cannot serve as the target.
BUG: unable to handle page fault for address: ffffea60001db008
CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60
Call Trace:
<TASK>
 try_to_migrate_one+0x28c/0x3730
 rmap_walk_anon+0x4f6/0x770
 unmap_folio+0x196/0x1f0
 split_huge_page_to_list_to_order+0x9f6/0x1560
 deferred_split_scan+0xac5/0x12a0
 shrinker_debugfs_scan_write+0x376/0x470
 full_proxy_write+0x15c/0x220
 vfs_write+0x2fc/0xcb0
 ksys_write+0x146/0x250
 do_syscall_64+0x6a/0x120
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
The bug was found by syzkaller on an internal kernel and then confirmed on upstream.
Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Cc: stable@vger.kernel.org
Signed-off-by: Gavin Guo <gavinguo@igalia.com>
---
 mm/huge_memory.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab7..0cb9547dcff2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3075,6 +3075,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 			   pmd_t *pmd, bool freeze, struct folio *folio)
 {
+	bool pmd_migration = is_pmd_migration_entry(*pmd);
+
 	VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
 	VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
@@ -3085,10 +3087,18 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 	 * require a folio to check the PMD against. Otherwise, there
 	 * is a risk of replacing the wrong folio.
 	 */
-	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
-	    is_pmd_migration_entry(*pmd)) {
-		if (folio && folio != pmd_folio(*pmd))
-			return;
+	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration) {
+		if (folio) {
+			/*
+			 * Do not apply pmd_folio() to a migration entry; and
+			 * folio lock guarantees that it must be of the wrong
+			 * folio anyway.
+			 */
+			if (pmd_migration)
+				return;
+			if (folio != pmd_folio(*pmd))
+				return;
+		}
 		__split_huge_pmd_locked(vma, pmd, address, freeze);
 	}
 }
base-commit: a24588245776dafc227243a01bfbeb8a59bafba9
On 14 Apr 2025, at 3:27, Gavin Guo wrote:
When migrating a THP, concurrent access to the PMD migration entry during a deferred split scan can lead to a page fault, as illustrated
It is an access violation, right? Because pmd_folio(*pmd_migration_entry) does not return a folio address. Calling it a page fault makes this sound like a minor issue.
below. To prevent this page fault, it is necessary to check for the PMD migration entry and return early. In this context, there is no need to use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality of the target folio. Since the folio backing the PMD migration entry is locked, it cannot serve as the target.
You mean split_huge_pmd_address() locks the PMD page table, so that page migration cannot proceed, or the THP is locked by migration, so that it cannot be split? The sentence is a little confusing to me.
BUG: unable to handle page fault for address: ffffea60001db008
CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60
Call Trace:
<TASK>
 try_to_migrate_one+0x28c/0x3730
 rmap_walk_anon+0x4f6/0x770
 unmap_folio+0x196/0x1f0
 split_huge_page_to_list_to_order+0x9f6/0x1560
 deferred_split_scan+0xac5/0x12a0
 shrinker_debugfs_scan_write+0x376/0x470
 full_proxy_write+0x15c/0x220
 vfs_write+0x2fc/0xcb0
 ksys_write+0x146/0x250
 do_syscall_64+0x6a/0x120
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
The bug was found by syzkaller on an internal kernel and then confirmed on upstream.

Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Cc: stable@vger.kernel.org
Signed-off-by: Gavin Guo <gavinguo@igalia.com>

 mm/huge_memory.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab7..0cb9547dcff2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3075,6 +3075,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 			   pmd_t *pmd, bool freeze, struct folio *folio)
 {
+	bool pmd_migration = is_pmd_migration_entry(*pmd);
+
 	VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
 	VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
@@ -3085,10 +3087,18 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 	 * require a folio to check the PMD against. Otherwise, there
 	 * is a risk of replacing the wrong folio.
 	 */
-	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
-	    is_pmd_migration_entry(*pmd)) {
-		if (folio && folio != pmd_folio(*pmd))
-			return;
+	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration) {
+		if (folio) {
+			/*
+			 * Do not apply pmd_folio() to a migration entry; and
+			 * folio lock guarantees that it must be of the wrong
+			 * folio anyway.
Why does the folio lock imply it is a wrong folio?
+			 */
+			if (pmd_migration)
+				return;
+			if (folio != pmd_folio(*pmd))
+				return;
+		}
Why not just
if (folio && pmd_migration)
	return;
if (pmd_trans_huge() …) {
	…
}
Thanks.
Best Regards, Yan, Zi
Hi Zi-Yan,
Thank you for the comment!
On 4/15/25 00:50, Zi Yan wrote:
On 14 Apr 2025, at 3:27, Gavin Guo wrote:
When migrating a THP, concurrent access to the PMD migration entry during a deferred split scan can lead to a page fault, as illustrated
It is an access violation, right? Because pmd_folio(*pmd_migration_entry) does not return a folio address. Calling it a page fault makes this sound like a minor issue.
Right, pmd_folio(*pmd_migration_entry) is defined with the following macro:
#define pmd_folio(pmd) page_folio(pmd_page(pmd))
page_folio will eventually access compound_head:
static __always_inline unsigned long _compound_head(const struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);
	...
}
However, the invalid access address starts with 0xffffea (ffffea60001db008), which lies in the vmemmap region holding the struct page array. This indicates that the page address calculated by pmd_page is invalid during THP migration, where the PMD migration entry has been set up in set_pmd_migration_entry, for example:
do_mbind
  migrate_pages
    migrate_pages_sync
      migrate_pages_batch
        migrate_folio_unmap
          try_to_migrate
            rmap_walk_anon
              try_to_migrate_one
                set_pmd_migration_entry
                  set_pmd_at
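To make the wild-pointer arithmetic concrete, here is a toy model of why treating a non-present migration entry as a present PMD yields a bogus struct page address. The bit layout below (SWAP_OFFSET_SHIFT, the vmemmap base constant) is invented for illustration and does not match any real architecture; only the shape of the mistake mirrors the kernel bug.

```c
#include <stdint.h>

/*
 * Model-only constants: a present PMD stores the pfn at bit 12, while
 * a (non-present) migration entry stores it in a swap-offset field at
 * a different position. Both layouts here are invented.
 */
#define PAGE_SHIFT        12
#define SWAP_OFFSET_SHIFT 10
#define VMEMMAP_BASE      0xffffea0000000000ULL
#define STRUCT_PAGE_SIZE  64ULL

static uint64_t make_present_pmd(uint64_t pfn)
{
	return (pfn << PAGE_SHIFT) | 1;	/* bit 0: present */
}

static uint64_t make_migration_pmd(uint64_t pfn)
{
	return pfn << SWAP_OFFSET_SHIFT;	/* present bit clear */
}

/*
 * What a naive pmd_page()-style lookup does: trust the present-PMD
 * layout, extract "pfn" bits, and index the struct page array. On a
 * migration entry the extracted pfn is garbage, so the computed
 * struct page address is wild.
 */
static uint64_t naive_pmd_page(uint64_t pmd)
{
	uint64_t pfn = pmd >> PAGE_SHIFT;

	return VMEMMAP_BASE + pfn * STRUCT_PAGE_SIZE;
}
```

For a present PMD the lookup lands on the right struct page slot; for the migration entry it lands somewhere else in the vmemmap area, which is exactly the 0xffffea... fault address pattern seen in the oops.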
below. To prevent this page fault, it is necessary to check for the PMD migration entry and return early. In this context, there is no need to use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality of the target folio. Since the folio backing the PMD migration entry is locked, it cannot serve as the target.
You mean split_huge_pmd_address() locks the PMD page table, so that page migration cannot proceed, or the THP is locked by migration, so that it cannot be split? The sentence is a little confusing to me.
During THP migration, the THP folio is locked (folio_trylock). When this folio is identified as partially mapped and inserted into the deferred split queue, the system scans the queue and attempts to split the folio under memory pressure or through the sysfs debug interface. Since the migrating folio remains locked, folio_trylock fails on it during deferred_split_scan. As a result, the folio passed to split_huge_pmd_locked ends up unequal to the folio referenced by the PMD migration entry, indicating the migration entry's folio cannot be the target for splitting and the function needs to return early.
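The guard described above can be sketched as a small, self-contained model. The struct and function names below are stand-ins for the kernel types and predicates, not the real implementation; would_split() returns true exactly when __split_huge_pmd_locked() would be reached under the patched logic.

```c
#include <stdbool.h>
#include <stddef.h>

struct folio_model { int id; };

struct pmd_model {
	bool trans_huge;			/* pmd_trans_huge() */
	bool devmap;				/* pmd_devmap() */
	bool migration;				/* is_pmd_migration_entry() */
	struct folio_model *mapped_folio;	/* what pmd_folio() would return */
};

/* True when __split_huge_pmd_locked() would be reached. */
static bool would_split(const struct pmd_model *pmd,
			const struct folio_model *folio)
{
	bool pmd_migration = pmd->migration;

	if (pmd->trans_huge || pmd->devmap || pmd_migration) {
		if (folio) {
			/*
			 * A migration entry's folio is locked by the
			 * migration path while the caller's folio is locked
			 * by the split path, so they can never be the same
			 * folio: bail out without ever dereferencing the
			 * (bogus) pmd_folio() of a migration entry.
			 */
			if (pmd_migration)
				return false;
			if (folio != pmd->mapped_folio)
				return false;
		}
		return true;
	}
	return false;
}
```

Note that when no folio is passed in (folio == NULL), a migration entry is still split, matching the existing upstream behavior.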
An alternative approach is similar to the following:
+	swp_entry_t entry;
+	struct folio *dst;
+
 	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
 	    is_pmd_migration_entry(*pmd)) {
-		if (folio && folio != pmd_folio(*pmd))
-			return;
+		if (folio) {
+			if (is_pmd_migration_entry(*pmd)) {
+				entry = pmd_to_swp_entry(*pmd);
+				dst = pfn_swap_entry_folio(entry);
+			} else
+				dst = pmd_folio(*pmd);
+
+			if (folio != dst)
+				return;
+		}
 		__split_huge_pmd_locked(vma, pmd, address, freeze);
However, this extra effort to translate the pmd migration folio is unnecessary. Any ideas of exceptions?
BUG: unable to handle page fault for address: ffffea60001db008
CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60
Call Trace:
<TASK>
 try_to_migrate_one+0x28c/0x3730
 rmap_walk_anon+0x4f6/0x770
 unmap_folio+0x196/0x1f0
 split_huge_page_to_list_to_order+0x9f6/0x1560
 deferred_split_scan+0xac5/0x12a0
 shrinker_debugfs_scan_write+0x376/0x470
 full_proxy_write+0x15c/0x220
 vfs_write+0x2fc/0xcb0
 ksys_write+0x146/0x250
 do_syscall_64+0x6a/0x120
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
The bug was found by syzkaller on an internal kernel and then confirmed on upstream.

Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Cc: stable@vger.kernel.org
Signed-off-by: Gavin Guo <gavinguo@igalia.com>

 mm/huge_memory.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab7..0cb9547dcff2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3075,6 +3075,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 			   pmd_t *pmd, bool freeze, struct folio *folio)
 {
+	bool pmd_migration = is_pmd_migration_entry(*pmd);
+
 	VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
 	VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
@@ -3085,10 +3087,18 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 	 * require a folio to check the PMD against. Otherwise, there
 	 * is a risk of replacing the wrong folio.
 	 */
-	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
-	    is_pmd_migration_entry(*pmd)) {
-		if (folio && folio != pmd_folio(*pmd))
-			return;
+	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration) {
+		if (folio) {
+			/*
+			 * Do not apply pmd_folio() to a migration entry; and
+			 * folio lock guarantees that it must be of the wrong
+			 * folio anyway.
Why does the folio lock imply it is a wrong folio?
As explained above.
+			 */
+			if (pmd_migration)
+				return;
+			if (folio != pmd_folio(*pmd))
+				return;
+		}
Why not just
if (folio && pmd_migration)
	return;
if (pmd_trans_huge() …) {
	…
}
Do you mean to implement as follows?
	if (folio && pmd_migration)
		return;

	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
		if (folio && folio != pmd_folio(*pmd))
			return;
	}

	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration)
		__split_huge_pmd_locked(vma, pmd, address, freeze);
To match the current upstream logic, when folio is NULL, __split_huge_pmd_locked must always be executed, whether the condition is pmd_trans_huge, pmd_devmap, or a PMD migration entry. Given that, __split_huge_pmd_locked cannot be moved inside the if block that checks only trans_huge and devmap. Likewise, it cannot be called directly without the wrapping condition checks: if the PMD is none of these, it would be invoked unconditionally.
The original implementation consolidates all the logic into one large if statement with a nested condition check; it is more verbose, but robust and clear. The second variant simplifies the structure, rules out pmd_migration early, and avoids the big if condition, but it is trickier and needs extra care in maintenance.
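For what it's worth, the two shapes can be checked for equivalence exhaustively. The model below reduces both variants to pure boolean guards (names are stand-ins, not kernel code; "returns true" stands for reaching __split_huge_pmd_locked, and the folio-match input only matters when a folio was passed in):

```c
#include <stdbool.h>

/* The nested guard from the patch. */
static bool nested(bool has_folio, bool folio_matches,
		   bool huge, bool devmap, bool migration)
{
	if (huge || devmap || migration) {
		if (has_folio) {
			if (migration)
				return false;
			if (!folio_matches)
				return false;
		}
		return true;	/* __split_huge_pmd_locked() */
	}
	return false;
}

/* The flattened early-return variant suggested in review. */
static bool flattened(bool has_folio, bool folio_matches,
		      bool huge, bool devmap, bool migration)
{
	if (has_folio && migration)
		return false;
	if (huge || devmap) {
		if (has_folio && !folio_matches)
			return false;
	}
	if (huge || devmap || migration)
		return true;	/* __split_huge_pmd_locked() */
	return false;
}
```

Enumerating all 32 input combinations shows the two variants decide identically, so the choice really is only about indentation and readability.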
Thanks.
Best Regards, Yan, Zi
Thanks!
On 15 Apr 2025, at 6:07, Gavin Guo wrote:
Hi Zi-Yan,
Thank you for the comment!
On 4/15/25 00:50, Zi Yan wrote:
On 14 Apr 2025, at 3:27, Gavin Guo wrote:
When migrating a THP, concurrent access to the PMD migration entry during a deferred split scan can lead to a page fault, as illustrated
It is an access violation, right? Because pmd_folio(*pmd_migration_entry) does not return a folio address. Calling it a page fault makes this sound like a minor issue.
Right, pmd_folio(*pmd_migration_entry) is defined with the following macro:
#define pmd_folio(pmd) page_folio(pmd_page(pmd))
page_folio will eventually access compound_head:
static __always_inline unsigned long _compound_head(const struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);
	...
}
However, the invalid access address starts with 0xffffea (ffffea60001db008), which lies in the vmemmap region holding the struct page array. This indicates that the page address calculated by pmd_page is invalid during THP migration, where the PMD migration entry has been set up in set_pmd_migration_entry, for example:
Exactly. Maybe I was not clear. I just want you to update your git commit log to say “… can lead to an invalid address access” instead of “page fault”.
do_mbind
  migrate_pages
    migrate_pages_sync
      migrate_pages_batch
        migrate_folio_unmap
          try_to_migrate
            rmap_walk_anon
              try_to_migrate_one
                set_pmd_migration_entry
                  set_pmd_at
below. To prevent this page fault, it is necessary to check for the PMD migration entry and return early. In this context, there is no need to use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality of the target folio. Since the folio backing the PMD migration entry is locked, it cannot serve as the target.
You mean split_huge_pmd_address() locks the PMD page table, so that page migration cannot proceed, or the THP is locked by migration, so that it cannot be split? The sentence is a little confusing to me.
During THP migration, the THP folio is locked (folio_trylock). When this folio is identified as partially mapped and inserted into the deferred split queue, the system scans the queue and attempts to split the folio under memory pressure or through the sysfs debug interface. Since the migrating folio remains locked, folio_trylock fails on it during deferred_split_scan. As
Wait. If folio_trylock() failed, how could split_folio() be called in deferred_split_scan()? Your call trace has split_folio(), which means the folio is locked by deferred_split_scan() already, not migration. What am I missing?
a result, the folio passed to split_huge_pmd_locked ends up unequal to the folio referenced by the PMD migration entry, indicating the migration entry's folio cannot be the target for splitting and the function needs to return early.
An alternative approach is similar to the following:
+	swp_entry_t entry;
+	struct folio *dst;
+
 	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
 	    is_pmd_migration_entry(*pmd)) {
-		if (folio && folio != pmd_folio(*pmd))
-			return;
+		if (folio) {
+			if (is_pmd_migration_entry(*pmd)) {
+				entry = pmd_to_swp_entry(*pmd);
+				dst = pfn_swap_entry_folio(entry);
+			} else
+				dst = pmd_folio(*pmd);
+
+			if (folio != dst)
+				return;
+		}
 		__split_huge_pmd_locked(vma, pmd, address, freeze);
However, this extra effort to translate the pmd migration folio is unnecessary. Any ideas of exceptions?
I get this part. Your assumption is that split_huge_pmd_address() with a folio passed in is only called by THP split. It would be better to state that assumption explicitly in the comment. It is unclear to people why "folio lock guarantees … wrong folio".
BUG: unable to handle page fault for address: ffffea60001db008
CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60
Call Trace:
<TASK>
 try_to_migrate_one+0x28c/0x3730
 rmap_walk_anon+0x4f6/0x770
 unmap_folio+0x196/0x1f0
 split_huge_page_to_list_to_order+0x9f6/0x1560
 deferred_split_scan+0xac5/0x12a0
 shrinker_debugfs_scan_write+0x376/0x470
 full_proxy_write+0x15c/0x220
 vfs_write+0x2fc/0xcb0
 ksys_write+0x146/0x250
 do_syscall_64+0x6a/0x120
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
The bug was found by syzkaller on an internal kernel and then confirmed on upstream.

Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
Cc: stable@vger.kernel.org
Signed-off-by: Gavin Guo <gavinguo@igalia.com>

 mm/huge_memory.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab7..0cb9547dcff2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3075,6 +3075,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 			   pmd_t *pmd, bool freeze, struct folio *folio)
 {
+	bool pmd_migration = is_pmd_migration_entry(*pmd);
+
 	VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
 	VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
@@ -3085,10 +3087,18 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 	 * require a folio to check the PMD against. Otherwise, there
 	 * is a risk of replacing the wrong folio.
 	 */
-	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
-	    is_pmd_migration_entry(*pmd)) {
-		if (folio && folio != pmd_folio(*pmd))
-			return;
+	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration) {
+		if (folio) {
+			/*
+			 * Do not apply pmd_folio() to a migration entry; and
+			 * folio lock guarantees that it must be of the wrong
+			 * folio anyway.
Why does the folio lock imply it is a wrong folio?
As explained above.
+			 */
+			if (pmd_migration)
+				return;
+			if (folio != pmd_folio(*pmd))
+				return;
+		}
Why not just
if (folio && pmd_migration)
	return;
if (pmd_trans_huge() …) {
	…
}
Do you mean to implement as follows?
	if (folio && pmd_migration)
		return;

	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
		if (folio && folio != pmd_folio(*pmd))
			return;
	}

	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) || pmd_migration)
		__split_huge_pmd_locked(vma, pmd, address, freeze);
Yes.
To match the current upstream logic, when folio is NULL, __split_huge_pmd_locked must always be executed, whether the condition is pmd_trans_huge, pmd_devmap, or a PMD migration entry. Given that, __split_huge_pmd_locked cannot be moved inside the if block that checks only trans_huge and devmap. Likewise, it cannot be called directly without the wrapping condition checks: if the PMD is none of these, it would be invoked unconditionally.
The original implementation consolidates all the logic into one large if statement with a nested condition check; it is more verbose, but robust and clear. The second variant simplifies the structure, rules out pmd_migration early, and avoids the big if condition, but it is trickier and needs extra care in maintenance.
IMHO, less indentation is always better. But I have no strong opinion on it.
Anyway, we first need to figure out how both THP migration and deferred_split_scan() can hold the THP lock, which sounds impossible to me. Or some other execution interleaving is happening.
Thanks.
Best Regards, Yan, Zi