From: Lance Yang lance.yang@linux.dev
When both THP and MTE are enabled, splitting a THP and replacing its zero-filled subpages with the shared zeropage can cause MTE tag mismatch faults in userspace.
Remapping zero-filled subpages to the shared zeropage is unsafe, as the zeropage has a fixed tag of zero, which may not match the tag expected by the userspace pointer.
KSM already avoids this problem by using memcmp_pages(), which on arm64 intentionally reports MTE-tagged pages as non-identical to prevent unsafe merging.
As suggested by David[1], this patch adopts the same pattern, replacing the memchr_inv() byte-level check with a call to pages_identical(). This leverages existing architecture-specific logic to determine if a page is truly identical to the shared zeropage.
Having both the THP shrinker and KSM rely on pages_identical() makes the design more future-proof, IMO. Instead of handling quirks in generic code, we just let the architecture decide what makes two pages identical.
[1] https://lore.kernel.org/all/ca2106a3-4bb2-4457-81af-301fd99fbef4@redhat.com
Cc: stable@vger.kernel.org
Reported-by: Qun-wei Lin Qun-wei.Lin@mediatek.com
Closes: https://lore.kernel.org/all/a7944523fcc3634607691c35311a5d59d1a3f8d4.camel@m...
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
---
Tested on x86_64 and on QEMU for arm64 (with and without MTE support), and the fix works as expected.
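For reference, the relationship between the helpers involved looks roughly like this -- a simplified sketch rather than the exact upstream source; the generic byte-compare fallback can be overridden per architecture, which is what arm64 does for MTE in arch/arm64/kernel/mte.c:

```
/* Simplified sketch of the comparison helpers (not the exact upstream code). */
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/string.h>

/*
 * Generic fallback; arm64 provides its own memcmp_pages() that also refuses
 * to report MTE-tagged pages as identical.
 */
int __weak memcmp_pages(struct page *page1, struct page *page2)
{
        char *addr1, *addr2;
        int ret;

        addr1 = kmap_local_page(page1);
        addr2 = kmap_local_page(page2);
        ret = memcmp(addr1, addr2, PAGE_SIZE);
        kunmap_local(addr2);
        kunmap_local(addr1);
        return ret;
}

/* What both KSM and, with this patch, the THP shrinker/migration path call. */
static inline bool pages_identical(struct page *page1, struct page *page2)
{
        return !memcmp_pages(page1, page2);
}
```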
 mm/huge_memory.c | 15 +++------------
 mm/migrate.c     |  8 +-------
 2 files changed, 4 insertions(+), 19 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 32e0ec2dde36..28d4b02a1aa5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4104,29 +4104,20 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 static bool thp_underused(struct folio *folio)
 {
         int num_zero_pages = 0, num_filled_pages = 0;
-        void *kaddr;
         int i;
 
         for (i = 0; i < folio_nr_pages(folio); i++) {
-                kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
-                if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
-                        num_zero_pages++;
-                        if (num_zero_pages > khugepaged_max_ptes_none) {
-                                kunmap_local(kaddr);
+                if (pages_identical(folio_page(folio, i), ZERO_PAGE(0))) {
+                        if (++num_zero_pages > khugepaged_max_ptes_none)
                                 return true;
-                        }
                 } else {
                         /*
                          * Another path for early exit once the number
                          * of non-zero filled pages exceeds threshold.
                          */
-                        num_filled_pages++;
-                        if (num_filled_pages >= HPAGE_PMD_NR - khugepaged_max_ptes_none) {
-                                kunmap_local(kaddr);
+                        if (++num_filled_pages >= HPAGE_PMD_NR - khugepaged_max_ptes_none)
                                 return false;
-                        }
                 }
-                kunmap_local(kaddr);
         }
         return false;
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index aee61a980374..ce83c2c3c287 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -300,9 +300,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
                                           unsigned long idx)
 {
         struct page *page = folio_page(folio, idx);
-        bool contains_data;
         pte_t newpte;
-        void *addr;
 
         if (PageCompound(page))
                 return false;
@@ -319,11 +317,7 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
          * this subpage has been non present. If the subpage is only zero-filled
          * then map it to the shared zeropage.
          */
-        addr = kmap_local_page(page);
-        contains_data = memchr_inv(addr, 0, PAGE_SIZE);
-        kunmap_local(addr);
-
-        if (contains_data)
+        if (!pages_identical(page, ZERO_PAGE(0)))
                 return false;
 
         newpte = pte_mkspecial(pfn_pte(my_zero_pfn(pvmw->address),
On 21 Sep 2025, at 22:14, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
When both THP and MTE are enabled, splitting a THP and replacing its zero-filled subpages with the shared zeropage can cause MTE tag mismatch faults in userspace.
Remapping zero-filled subpages to the shared zeropage is unsafe, as the zeropage has a fixed tag of zero, which may not match the tag expected by the userspace pointer.
KSM already avoids this problem by using memcmp_pages(), which on arm64 intentionally reports MTE-tagged pages as non-identical to prevent unsafe merging.
As suggested by David[1], this patch adopts the same pattern, replacing the memchr_inv() byte-level check with a call to pages_identical(). This leverages existing architecture-specific logic to determine if a page is truly identical to the shared zeropage.
Having both the THP shrinker and KSM rely on pages_identical() makes the design more future-proof, IMO. Instead of handling quirks in generic code, we just let the architecture decide what makes two pages identical.
[1] https://lore.kernel.org/all/ca2106a3-4bb2-4457-81af-301fd99fbef4@redhat.com
Cc: stable@vger.kernel.org
Reported-by: Qun-wei Lin Qun-wei.Lin@mediatek.com
Closes: https://lore.kernel.org/all/a7944523fcc3634607691c35311a5d59d1a3f8d4.camel@m...
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
Tested on x86_64 and on QEMU for arm64 (with and without MTE support), and the fix works as expected.
From [1], I see you mentioned RISC-V also has the address masking feature. Is it affected by this? And memcmp_pages() is only implemented by ARM64 for MTE. Should any arch with address masking always implement it to avoid the same issue?
[...]
The changes look good to me. Thanks.

Acked-by: Zi Yan ziy@nvidia.com

--
Best Regards,
Yan, Zi
Cc: RISC-V folks
On 2025/9/22 10:36, Zi Yan wrote:
On 21 Sep 2025, at 22:14, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
When both THP and MTE are enabled, splitting a THP and replacing its zero-filled subpages with the shared zeropage can cause MTE tag mismatch faults in userspace.
Remapping zero-filled subpages to the shared zeropage is unsafe, as the zeropage has a fixed tag of zero, which may not match the tag expected by the userspace pointer.
KSM already avoids this problem by using memcmp_pages(), which on arm64 intentionally reports MTE-tagged pages as non-identical to prevent unsafe merging.
As suggested by David[1], this patch adopts the same pattern, replacing the memchr_inv() byte-level check with a call to pages_identical(). This leverages existing architecture-specific logic to determine if a page is truly identical to the shared zeropage.
Having both the THP shrinker and KSM rely on pages_identical() makes the design more future-proof, IMO. Instead of handling quirks in generic code, we just let the architecture decide what makes two pages identical.
[1] https://lore.kernel.org/all/ca2106a3-4bb2-4457-81af-301fd99fbef4@redhat.com
Cc: stable@vger.kernel.org
Reported-by: Qun-wei Lin Qun-wei.Lin@mediatek.com
Closes: https://lore.kernel.org/all/a7944523fcc3634607691c35311a5d59d1a3f8d4.camel@m...
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
Tested on x86_64 and on QEMU for arm64 (with and without MTE support), and the fix works as expected.
From [1], I see you mentioned RISC-V also has the address masking feature. Is it affected by this? And memcmp_pages() is only implemented by ARM64 for MTE. Should any arch with address masking always implement it to avoid the same issue?
Yeah, I'm new to RISC-V, but it seems RISC-V has a similar feature, described in Documentation/arch/riscv/uabi.rst: the Supm (Supervisor-mode Pointer Masking) extension.
```
Pointer masking
---------------

Support for pointer masking in userspace (the Supm extension) is provided via
the ``PR_SET_TAGGED_ADDR_CTRL`` and ``PR_GET_TAGGED_ADDR_CTRL`` ``prctl()``
operations. Pointer masking is disabled by default. To enable it, userspace
must call ``PR_SET_TAGGED_ADDR_CTRL`` with the ``PR_PMLEN`` field set to the
number of mask/tag bits needed by the application. ``PR_PMLEN`` is interpreted
as a lower bound; if the kernel is unable to satisfy the request, the
``PR_SET_TAGGED_ADDR_CTRL`` operation will fail. The actual number of tag bits
is returned in ``PR_PMLEN`` by the ``PR_GET_TAGGED_ADDR_CTRL`` operation.
```
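As a rough illustration of that prctl() interface, enabling pointer masking from userspace looks something like the sketch below (the PR_* fallback definitions are assumptions added only to keep the example self-contained; check linux/prctl.h for the real values):

```
#include <stdio.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; the values below are assumptions, verify locally. */
#ifndef PR_SET_TAGGED_ADDR_CTRL
#define PR_SET_TAGGED_ADDR_CTRL 55
#define PR_GET_TAGGED_ADDR_CTRL 56
#define PR_TAGGED_ADDR_ENABLE   (1UL << 0)
#endif
#ifndef PR_PMLEN_SHIFT
#define PR_PMLEN_SHIFT 24
#define PR_PMLEN_MASK  (0x7fUL << PR_PMLEN_SHIFT)
#endif

int main(void)
{
        /* Request at least 7 mask/tag bits; the kernel may grant more. */
        unsigned long ctrl = PR_TAGGED_ADDR_ENABLE | (7UL << PR_PMLEN_SHIFT);

        if (prctl(PR_SET_TAGGED_ADDR_CTRL, ctrl, 0, 0, 0)) {
                perror("PR_SET_TAGGED_ADDR_CTRL");
                return 1;
        }

        ctrl = prctl(PR_GET_TAGGED_ADDR_CTRL, 0, 0, 0, 0);
        printf("tag bits granted: %lu\n",
               (ctrl & PR_PMLEN_MASK) >> PR_PMLEN_SHIFT);
        return 0;
}
```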
But, IIUC, Supm by itself only ensures that the upper bits are ignored on memory access :)
So, RISC-V today would likely not be affected. However, once it implements full hardware tag checking, it will face the exact same zero-page problem.
Anyway, any architecture with a feature like MTE in the future will need its own memcmp_pages() to prevent unsafe merges ;)
[...]
The changes look good to me. Thanks.

Acked-by: Zi Yan ziy@nvidia.com
Cheers!
On 22.09.25 04:14, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
When both THP and MTE are enabled, splitting a THP and replacing its zero-filled subpages with the shared zeropage can cause MTE tag mismatch faults in userspace.
Remapping zero-filled subpages to the shared zeropage is unsafe, as the zeropage has a fixed tag of zero, which may not match the tag expected by the userspace pointer.
KSM already avoids this problem by using memcmp_pages(), which on arm64 intentionally reports MTE-tagged pages as non-identical to prevent unsafe merging.
As suggested by David[1], this patch adopts the same pattern, replacing the memchr_inv() byte-level check with a call to pages_identical(). This leverages existing architecture-specific logic to determine if a page is truly identical to the shared zeropage.
Having both the THP shrinker and KSM rely on pages_identical() makes the design more future-proof, IMO. Instead of handling quirks in generic code, we just let the architecture decide what makes two pages identical.
[1] https://lore.kernel.org/all/ca2106a3-4bb2-4457-81af-301fd99fbef4@redhat.com
Cc: stable@vger.kernel.org
Reported-by: Qun-wei Lin Qun-wei.Lin@mediatek.com
Closes: https://lore.kernel.org/all/a7944523fcc3634607691c35311a5d59d1a3f8d4.camel@m...
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
LGTM, thanks
Acked-by: David Hildenbrand david@redhat.com
On 22/09/2025 03:14, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
When both THP and MTE are enabled, splitting a THP and replacing its zero-filled subpages with the shared zeropage can cause MTE tag mismatch faults in userspace.
Remapping zero-filled subpages to the shared zeropage is unsafe, as the zeropage has a fixed tag of zero, which may not match the tag expected by the userspace pointer.
KSM already avoids this problem by using memcmp_pages(), which on arm64 intentionally reports MTE-tagged pages as non-identical to prevent unsafe merging.
As suggested by David[1], this patch adopts the same pattern, replacing the memchr_inv() byte-level check with a call to pages_identical(). This leverages existing architecture-specific logic to determine if a page is truly identical to the shared zeropage.
Having both the THP shrinker and KSM rely on pages_identical() makes the design more future-proof, IMO. Instead of handling quirks in generic code, we just let the architecture decide what makes two pages identical.
[1] https://lore.kernel.org/all/ca2106a3-4bb2-4457-81af-301fd99fbef4@redhat.com
Cc: stable@vger.kernel.org
Reported-by: Qun-wei Lin Qun-wei.Lin@mediatek.com
Closes: https://lore.kernel.org/all/a7944523fcc3634607691c35311a5d59d1a3f8d4.camel@m...
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
Tested on x86_64 and on QEMU for arm64 (with and without MTE support), and the fix works as expected.
 mm/huge_memory.c | 15 +++------------
 mm/migrate.c     |  8 +-------
 2 files changed, 4 insertions(+), 19 deletions(-)
Thanks for the fix!
Acked-by: Usama Arif usamaarif642@gmail.com
On Mon, Sep 22, 2025 at 10:14:58AM +0800, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
When both THP and MTE are enabled, splitting a THP and replacing its zero-filled subpages with the shared zeropage can cause MTE tag mismatch faults in userspace.
Remapping zero-filled subpages to the shared zeropage is unsafe, as the zeropage has a fixed tag of zero, which may not match the tag expected by the userspace pointer.
KSM already avoids this problem by using memcmp_pages(), which on arm64 intentionally reports MTE-tagged pages as non-identical to prevent unsafe merging.
As suggested by David[1], this patch adopts the same pattern, replacing the memchr_inv() byte-level check with a call to pages_identical(). This leverages existing architecture-specific logic to determine if a page is truly identical to the shared zeropage.
Having both the THP shrinker and KSM rely on pages_identical() makes the design more future-proof, IMO. Instead of handling quirks in generic code, we just let the architecture decide what makes two pages identical.
[1] https://lore.kernel.org/all/ca2106a3-4bb2-4457-81af-301fd99fbef4@redhat.com
Cc: stable@vger.kernel.org
Reported-by: Qun-wei Lin Qun-wei.Lin@mediatek.com
Closes: https://lore.kernel.org/all/a7944523fcc3634607691c35311a5d59d1a3f8d4.camel@m...
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
Functionally, the patch looks fine, both with and without MTE.
Reviewed-by: Catalin Marinas catalin.marinas@arm.com
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 32e0ec2dde36..28d4b02a1aa5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4104,29 +4104,20 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 static bool thp_underused(struct folio *folio)
 {
         int num_zero_pages = 0, num_filled_pages = 0;
-        void *kaddr;
         int i;
 
         for (i = 0; i < folio_nr_pages(folio); i++) {
-                kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
-                if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
-                        num_zero_pages++;
-                        if (num_zero_pages > khugepaged_max_ptes_none) {
-                                kunmap_local(kaddr);
+                if (pages_identical(folio_page(folio, i), ZERO_PAGE(0))) {
+                        if (++num_zero_pages > khugepaged_max_ptes_none)
                                 return true;
I wonder what the overhead of doing a memcmp() vs memchr_inv() is. The former will need to read from two places. If it's noticeable, it would affect architectures that don't have an MTE equivalent.
Alternatively we could introduce something like folio_has_metadata() which on arm64 simply checks PG_mte_tagged.
On 22.09.25 19:24, Catalin Marinas wrote:
On Mon, Sep 22, 2025 at 10:14:58AM +0800, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
When both THP and MTE are enabled, splitting a THP and replacing its zero-filled subpages with the shared zeropage can cause MTE tag mismatch faults in userspace.
Remapping zero-filled subpages to the shared zeropage is unsafe, as the zeropage has a fixed tag of zero, which may not match the tag expected by the userspace pointer.
KSM already avoids this problem by using memcmp_pages(), which on arm64 intentionally reports MTE-tagged pages as non-identical to prevent unsafe merging.
As suggested by David[1], this patch adopts the same pattern, replacing the memchr_inv() byte-level check with a call to pages_identical(). This leverages existing architecture-specific logic to determine if a page is truly identical to the shared zeropage.
Having both the THP shrinker and KSM rely on pages_identical() makes the design more future-proof, IMO. Instead of handling quirks in generic code, we just let the architecture decide what makes two pages identical.
[1] https://lore.kernel.org/all/ca2106a3-4bb2-4457-81af-301fd99fbef4@redhat.com
Cc: stable@vger.kernel.org
Reported-by: Qun-wei Lin Qun-wei.Lin@mediatek.com
Closes: https://lore.kernel.org/all/a7944523fcc3634607691c35311a5d59d1a3f8d4.camel@m...
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
Functionally, the patch looks fine, both with and without MTE.
Reviewed-by: Catalin Marinas catalin.marinas@arm.com
[...]
I wonder what the overhead of doing a memcmp() vs memchr_inv() is. The former will need to read from two places. If it's noticeable, it would affect architectures that don't have an MTE equivalent.
Alternatively we could introduce something like folio_has_metadata() which on arm64 simply checks PG_mte_tagged.
We discussed something similar in the other thread (I suggested page_is_mergable()). I'd prefer to use pages_identical() for now, so we have the same logic here and in ksm code.
(this patch here almost looks like a cleanup :) )
If this becomes a problem, what we could do is in pages_identical() would be simply doing the memchr_inv() in case is_zero_pfn(). KSM might benefit from that as well when merging with the shared zeropage through try_to_merge_with_zero_page().
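Roughly, the idea would be something like the sketch below (not a real patch; page_has_arch_metadata() is a made-up placeholder for an arch hook along the lines of the folio_has_metadata() suggestion above):

```
/*
 * Rough sketch only: when comparing against the shared zeropage, scan just
 * the one page instead of memcmp()ing two pages. An arch hook -- here the
 * made-up page_has_arch_metadata() -- is still needed so that e.g.
 * MTE-tagged pages are never treated as identical to the zeropage.
 */
static bool pages_identical(struct page *page1, struct page *page2)
{
        if (is_zero_pfn(page_to_pfn(page2)) && !page_has_arch_metadata(page1)) {
                void *addr = kmap_local_page(page1);
                bool zero_filled = !memchr_inv(addr, 0, PAGE_SIZE);

                kunmap_local(addr);
                return zero_filled;
        }
        return !memcmp_pages(page1, page2);
}
```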
On 2025/9/23 01:59, David Hildenbrand wrote:
On 22.09.25 19:24, Catalin Marinas wrote:
On Mon, Sep 22, 2025 at 10:14:58AM +0800, Lance Yang wrote:
From: Lance Yang lance.yang@linux.dev
When both THP and MTE are enabled, splitting a THP and replacing its zero-filled subpages with the shared zeropage can cause MTE tag mismatch faults in userspace.
Remapping zero-filled subpages to the shared zeropage is unsafe, as the zeropage has a fixed tag of zero, which may not match the tag expected by the userspace pointer.
KSM already avoids this problem by using memcmp_pages(), which on arm64 intentionally reports MTE-tagged pages as non-identical to prevent unsafe merging.
As suggested by David[1], this patch adopts the same pattern, replacing the memchr_inv() byte-level check with a call to pages_identical(). This leverages existing architecture-specific logic to determine if a page is truly identical to the shared zeropage.
Having both the THP shrinker and KSM rely on pages_identical() makes the design more future-proof, IMO. Instead of handling quirks in generic code, we just let the architecture decide what makes two pages identical.
[1] https://lore.kernel.org/all/ca2106a3-4bb2-4457-81af-301fd99fbef4@redhat.com
Cc: stable@vger.kernel.org
Reported-by: Qun-wei Lin Qun-wei.Lin@mediatek.com
Closes: https://lore.kernel.org/all/a7944523fcc3634607691c35311a5d59d1a3f8d4.camel@mediatek.com
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Suggested-by: David Hildenbrand david@redhat.com
Signed-off-by: Lance Yang lance.yang@linux.dev
Functionally, the patch looks fine, both with and without MTE.
Reviewed-by: Catalin Marinas catalin.marinas@arm.com
Thanks for taking time to review!
[...]
I wonder what the overhead of doing a memcmp() vs memchr_inv() is. The former will need to read from two places. If it's noticeable, it would affect architectures that don't have an MTE equivalent.
Alternatively we could introduce something like folio_has_metadata() which on arm64 simply checks PG_mte_tagged.
We discussed something similar in the other thread (I suggested page_is_mergable()). I'd prefer to use pages_identical() for now, so we have the same logic here and in ksm code.
(this patch here almost looks like a cleanup :) )
Yeah, let's keep it as-is for now.
Using the same pages_identical() pattern as KSM makes the logic consistent.
And it's simple enough to be easily backported to stable trees ;)
If this becomes a problem, what we could do is in pages_identical() would be simply doing the memchr_inv() in case is_zero_pfn(). KSM might benefit from that as well when merging with the shared zeropage through try_to_merge_with_zero_page().
Right, there is room for that optimization. I will look into it as a follow-up patch after this one is settled and backported, especially if the performance overhead turns out to be a real concern :)
Cheers, Lance
On Mon, Sep 22, 2025 at 07:59:00PM +0200, David Hildenbrand wrote:
On 22.09.25 19:24, Catalin Marinas wrote:
On Mon, Sep 22, 2025 at 10:14:58AM +0800, Lance Yang wrote:
[...]
I wonder what the overhead of doing a memcmp() vs memchr_inv() is. The former will need to read from two places. If it's noticeable, it would affect architectures that don't have an MTE equivalent.
Alternatively we could introduce something like folio_has_metadata() which on arm64 simply checks PG_mte_tagged.
We discussed something similar in the other thread (I suggested page_is_mergable()). I'd prefer to use pages_identical() for now, so we have the same logic here and in ksm code.
(this patch here almost looks like a cleanup :) )
If this becomes a problem, what we could do is in pages_identical() would be simply doing the memchr_inv() in case is_zero_pfn(). KSM might benefit from that as well when merging with the shared zeropage through try_to_merge_with_zero_page().
Yes, we can always optimise it later.
I just realised that on arm64 with MTE we won't get any merging with the zero page even if the user page isn't mapped with PROT_MTE. In cpu_enable_mte() we zero the tags in the zero page and set PG_mte_tagged. The reason is that we want to use the zero page with PROT_MTE mappings (until tag setting causes CoW). Hmm, the arm64 memcmp_pages() messed up KSM merging with the zero page even before this patch.
The MTE tag setting evolved a bit over time with some locking using PG_* flags to avoid a set_pte_at() race trying to initialise the tags on the same page. We also moved the swap restoring to arch_swap_restore() rather than the set_pte_at() path. So it is safe now to merge with the zero page if the other page isn't tagged. A subsequent set_pte_at() attempting to clear the tags would notice that the zero page is already tagged.
We could go a step further and add tag comparison (I had some code around) but I think the quick fix is to just not treat the zero page as tagged. Not fully tested yet:
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index e5e773844889..72a1dfc54659 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -73,6 +73,8 @@ int memcmp_pages(struct page *page1, struct page *page2)
 {
         char *addr1, *addr2;
         int ret;
+        bool page1_tagged = page_mte_tagged(page1) && !is_zero_page(page1);
+        bool page2_tagged = page_mte_tagged(page2) && !is_zero_page(page2);
 
         addr1 = page_address(page1);
         addr2 = page_address(page2);
@@ -83,11 +85,10 @@ int memcmp_pages(struct page *page1, struct page *page2)
 
         /*
          * If the page content is identical but at least one of the pages is
-         * tagged, return non-zero to avoid KSM merging. If only one of the
-         * pages is tagged, __set_ptes() may zero or change the tags of the
-         * other page via mte_sync_tags().
+         * tagged, return non-zero to avoid KSM merging. Ignore the zero page
+         * since it is always tagged with the tags cleared.
          */
-        if (page_mte_tagged(page1) || page_mte_tagged(page2))
+        if (page1_tagged || page2_tagged)
                 return addr1 != addr2;
 
         return ret;
On 23.09.25 13:52, Catalin Marinas wrote:
On Mon, Sep 22, 2025 at 07:59:00PM +0200, David Hildenbrand wrote:
On 22.09.25 19:24, Catalin Marinas wrote:
On Mon, Sep 22, 2025 at 10:14:58AM +0800, Lance Yang wrote:
[...]
I wonder what the overhead of doing a memcmp() vs memchr_inv() is. The former will need to read from two places. If it's noticeable, it would affect architectures that don't have an MTE equivalent.
Alternatively we could introduce something like folio_has_metadata() which on arm64 simply checks PG_mte_tagged.
We discussed something similar in the other thread (I suggested page_is_mergable()). I'd prefer to use pages_identical() for now, so we have the same logic here and in ksm code.
(this patch here almost looks like a cleanup :) )
If this becomes a problem, what we could do is in pages_identical() would be simply doing the memchr_inv() in case is_zero_pfn(). KSM might benefit from that as well when merging with the shared zeropage through try_to_merge_with_zero_page().
Yes, we can always optimise it later.
I just realised that on arm64 with MTE we won't get any merging with the zero page even if the user page isn't mapped with PROT_MTE. In cpu_enable_mte() we zero the tags in the zero page and set PG_mte_tagged. The reason is that we want to use the zero page with PROT_MTE mappings (until tag setting causes CoW). Hmm, the arm64 memcmp_pages() messed up KSM merging with the zero page even before this patch.
The MTE tag setting evolved a bit over time with some locking using PG_* flags to avoid a set_pte_at() race trying to initialise the tags on the same page. We also moved the swap restoring to arch_swap_restore() rather than the set_pte_at() path. So it is safe now to merge with the zero page if the other page isn't tagged. A subsequent set_pte_at() attempting to clear the tags would notice that the zero page is already tagged.
We could go a step further and add tag comparison (I had some code around) but I think the quick fix is to just not treat the zero page as tagged.
I assume any tag changes would result in CoW.
It would be interesting to know if there are use cases with VMs or other workloads where that could be beneficial with KSM.
Not fully tested yet:
[...]
That looks reasonable to me.
@Lance as you had a test setup, could you give this a try as well, with KSM shared zeropage deduplication enabled, to check whether it now works as expected there too?
Then, this should likely be an independent fix.
For KSM you likely have to enable it first through /sys/kernel/mm/ksm/use_zero_pages.
On 2025/9/23 20:00, David Hildenbrand wrote:
On 23.09.25 13:52, Catalin Marinas wrote:
On Mon, Sep 22, 2025 at 07:59:00PM +0200, David Hildenbrand wrote:
On 22.09.25 19:24, Catalin Marinas wrote:
On Mon, Sep 22, 2025 at 10:14:58AM +0800, Lance Yang wrote:
[...]
I wonder what the overhead of doing a memcmp() vs memchr_inv() is. The former will need to read from two places. If it's noticeable, it would affect architectures that don't have an MTE equivalent.
Alternatively we could introduce something like folio_has_metadata() which on arm64 simply checks PG_mte_tagged.
We discussed something similar in the other thread (I suggested page_is_mergable()). I'd prefer to use pages_identical() for now, so we have the same logic here and in ksm code.
(this patch here almost looks like a cleanup :) )
If this becomes a problem, what we could do is in pages_identical() would be simply doing the memchr_inv() in case is_zero_pfn(). KSM might benefit from that as well when merging with the shared zeropage through try_to_merge_with_zero_page().
Yes, we can always optimise it later.
I just realised that on arm64 with MTE we won't get any merging with the zero page even if the user page isn't mapped with PROT_MTE. In cpu_enable_mte() we zero the tags in the zero page and set PG_mte_tagged. The reason is that we want to use the zero page with PROT_MTE mappings (until tag setting causes CoW). Hmm, the arm64 memcmp_pages() messed up KSM merging with the zero page even before this patch.
The MTE tag setting evolved a bit over time with some locking using PG_* flags to avoid a set_pte_at() race trying to initialise the tags on the same page. We also moved the swap restoring to arch_swap_restore() rather than the set_pte_at() path. So it is safe now to merge with the zero page if the other page isn't tagged. A subsequent set_pte_at() attempting to clear the tags would notice that the zero page is already tagged.
We could go a step further and add tag comparison (I had some code around) but I think the quick fix is to just not treat the zero page as tagged.
I assume any tag changes would result in CoW.
It would be interesting to know if there are use cases with VMs or other workloads where that could be beneficial with KSM.
Not fully tested yet:
[...]
That looks reasonable to me.
Yeah, looks good to me as well.
@Lance as you had a test setup, could you give this a try as well, with KSM shared zeropage deduplication enabled, to check whether it now works as expected there too?
Sure. I'll test that and get back to you ;)
Then, this should likely be an independent fix.
For KSM you likely have to enable it first through /sys/kernel/mm/ksm/use_zero_pages.
Got it.
On Tue, Sep 23, 2025 at 02:00:01PM +0200, David Hildenbrand wrote:
On 23.09.25 13:52, Catalin Marinas wrote:
I just realised that on arm64 with MTE we won't get any merging with the zero page even if the user page isn't mapped with PROT_MTE. In cpu_enable_mte() we zero the tags in the zero page and set PG_mte_tagged. The reason is that we want to use the zero page with PROT_MTE mappings (until tag setting causes CoW). Hmm, the arm64 memcmp_pages() messed up KSM merging with the zero page even before this patch.
The MTE tag setting evolved a bit over time with some locking using PG_* flags to avoid a set_pte_at() race trying to initialise the tags on the same page. We also moved the swap restoring to arch_swap_restore() rather than the set_pte_at() path. So it is safe now to merge with the zero page if the other page isn't tagged. A subsequent set_pte_at() attempting to clear the tags would notice that the zero page is already tagged.
We could go a step further and add tag comparison (I had some code around) but I think the quick fix is to just not treat the zero page as tagged.
I assume any tag changes would result in CoW.
Yes.
It would be interesting to know if there are use cases with VMs or other workloads where that could be beneficial with KSM.
With VMs, if MTE is allowed in the guest, we currently treat any VM page as tagged. In the initial version of the MTE spec, we did not have any fine-grained control at the stage 2 page table over whether MTE is in use by the guest (just a big knob in a control register). We later got FEAT_MTE_PERM which allows stage 2 to trap tag accesses in a VM on a page by page basis, though we haven't added KVM support for it yet.
If we add full tag comparison, VMs may be able to share more pages. For example, code pages are never tagged in a VM but the hypervisor doesn't know this, so it just avoids sharing. I posted tag comparison some years ago but dropped it eventually to keep things simple:
https://lore.kernel.org/all/20200421142603.3894-9-catalin.marinas@arm.com/
However, it needs a bit of tidying up since at the time we assumed everything was tagged. I can respin the above (on top of the fix below), though I don't see many vendors rushing to deploy MTE in a multi-VM scenario (Android + pKVM maybe but not sure they enable KSM due to power constraints).
[...]
That looks reasonable to me.
@Lance as you had a test setup, could you give this a try as well, with KSM shared zeropage deduplication enabled, to check whether it now works as expected there too?
Thanks!
Then, this should likely be an independent fix.
Yes, I'll add a proper commit message. We could do a cc stable, though it's more of an optimisation.
On 2025/9/23 20:00, David Hildenbrand wrote:
On 23.09.25 13:52, Catalin Marinas wrote:
On Mon, Sep 22, 2025 at 07:59:00PM +0200, David Hildenbrand wrote:
On 22.09.25 19:24, Catalin Marinas wrote:
On Mon, Sep 22, 2025 at 10:14:58AM +0800, Lance Yang wrote:
[...]
I wonder what the overhead of doing a memcmp() vs memchr_inv() is. The former will need to read from two places. If it's noticeable, it would affect architectures that don't have an MTE equivalent.
Alternatively we could introduce something like folio_has_metadata() which on arm64 simply checks PG_mte_tagged.
We discussed something similar in the other thread (I suggested page_is_mergable()). I'd prefer to use pages_identical() for now, so we have the same logic here and in ksm code.
(this patch here almost looks like a cleanup :) )
If this becomes a problem, what we could do is in pages_identical() would be simply doing the memchr_inv() in case is_zero_pfn(). KSM might benefit from that as well when merging with the shared zeropage through try_to_merge_with_zero_page().
Yes, we can always optimise it later.
I just realised that on arm64 with MTE we won't get any merging with the zero page even if the user page isn't mapped with PROT_MTE. In cpu_enable_mte() we zero the tags in the zero page and set PG_mte_tagged. The reason is that we want to use the zero page with PROT_MTE mappings (until tag setting causes CoW). Hmm, the arm64 memcmp_pages() messed up KSM merging with the zero page even before this patch.
The MTE tag setting evolved a bit over time with some locking using PG_* flags to avoid a set_pte_at() race trying to initialise the tags on the same page. We also moved the swap restoring to arch_swap_restore() rather than the set_pte_at() path. So it is safe now to merge with the zero page if the other page isn't tagged. A subsequent set_pte_at() attempting to clear the tags would notice that the zero page is already tagged.
We could go a step further and add tag comparison (I had some code around) but I think the quick fix is to just not treat the zero page as tagged.
I assume any tag changes would result in CoW.
Right, I did a test and confirmed that any attempt to change tags on the shared zero page results in a Copy-on-Write, as expected ;)
It would be interesting to know if there are use cases with VMs or other workloads where that could be beneficial with KSM.
Not fully tested yet:
[...]
That looks reasonable to me.
@Lance as you had a test setup, could you give this a try as well, with KSM shared zeropage deduplication enabled, to check whether it now works as expected there too?
This works as expected. Both KSM (with use_zero_pages enabled) and the THP shrinker are now able to successfully merge zero-filled pages with the shared zero page, as long as those pages are not mapped with PROT_MTE.
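For reference, a self-contained way to exercise the KSM side looks roughly like the sketch below (illustrative only, not the exact test used here; it assumes KSM is already running with use_zero_pages=1 and that the ksm_zero_pages counter is present in sysfs):

```
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 64UL << 20;        /* 64M of dirtied, zero-filled memory */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        memset(buf, 0, len);            /* fault in private zero-filled pages */
        madvise(buf, len, MADV_MERGEABLE);

        /*
         * Give ksmd time to scan, then check the zeropage dedup counter:
         * with the fix (and no PROT_MTE on this range) it should increase.
         */
        sleep(60);
        system("cat /sys/kernel/mm/ksm/ksm_zero_pages");
        return 0;
}
```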
Then, this should likely be an independent fix.
@Catalin could you send a real patch?
On Tue, Sep 23, 2025 at 12:52:06PM +0100, Catalin Marinas wrote:
I just realised that on arm64 with MTE we won't get any merging with the zero page even if the user page isn't mapped with PROT_MTE. In cpu_enable_mte() we zero the tags in the zero page and set PG_mte_tagged. The reason is that we want to use the zero page with PROT_MTE mappings (until tag setting causes CoW). Hmm, the arm64 memcmp_pages() messed up KSM merging with the zero page even before this patch.
[...]
Unrelated to this discussion, I got an internal report that Linux hangs during boot with CONFIG_DEFERRED_STRUCT_PAGE_INIT because try_page_mte_tagging() locks up on uninitialised page flags.
Since we (always?) map the zero page as pte_special(), set_pte_at() won't check if the tags have to be initialised, so we can skip the PG_mte_tagged altogether. We actually had this code for some time until we introduced the pte_special() check in set_pte_at().
So alternative patch that also fixes the deferred struct page init (on the assumptions that the zero page is always mapped as pte_special():
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 7b78c95a9017..e325ba34f45c 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2419,17 +2419,21 @@ static void bti_enable(const struct arm64_cpu_capabilities *__unused)
 #ifdef CONFIG_ARM64_MTE
 static void cpu_enable_mte(struct arm64_cpu_capabilities const *cap)
 {
+        static bool cleared_zero_page = false;
+
         sysreg_clear_set(sctlr_el1, 0, SCTLR_ELx_ATA | SCTLR_EL1_ATA0);
 
         mte_cpu_setup();
 
         /*
          * Clear the tags in the zero page. This needs to be done via the
-         * linear map which has the Tagged attribute.
+         * linear map which has the Tagged attribute. Since this page is
+         * always mapped as pte_special(), set_pte_at() will not attempt to
+         * clear the tags or set PG_mte_tagged.
          */
-        if (try_page_mte_tagging(ZERO_PAGE(0))) {
+        if (!cleared_zero_page) {
+                cleared_zero_page = true;
                 mte_clear_page_tags(lm_alias(empty_zero_page));
-                set_page_mte_tagged(ZERO_PAGE(0));
         }
 
         kasan_init_hw_tags_cpu();
On 23.09.25 18:14, Catalin Marinas wrote:
On Tue, Sep 23, 2025 at 12:52:06PM +0100, Catalin Marinas wrote:
I just realised that on arm64 with MTE we won't get any merging with the zero page even if the user page isn't mapped with PROT_MTE. In cpu_enable_mte() we zero the tags in the zero page and set PG_mte_tagged. The reason is that we want to use the zero page with PROT_MTE mappings (until tag setting causes CoW). Hmm, the arm64 memcmp_pages() messed up KSM merging with the zero page even before this patch.
[...]
Unrelated to this discussion, I got an internal report that Linux hangs during boot with CONFIG_DEFERRED_STRUCT_PAGE_INIT because try_page_mte_tagging() locks up on uninitialised page flags.
Since we (always?) map the zero page as pte_special(), set_pte_at()
Yes. (if pte_special is implemented by the arch of course)
won't check if the tags have to be initialised, so we can skip the PG_mte_tagged altogether. We actually had this code for some time until we introduced the pte_special() check in set_pte_at().
So alternative patch that also fixes the deferred struct page init (on the assumptions that the zero page is always mapped as pte_special():
LGTM!
On 2025/9/24 00:14, Catalin Marinas wrote:
On Tue, Sep 23, 2025 at 12:52:06PM +0100, Catalin Marinas wrote:
I just realised that on arm64 with MTE we won't get any merging with the zero page even if the user page isn't mapped with PROT_MTE. In cpu_enable_mte() we zero the tags in the zero page and set PG_mte_tagged. The reason is that we want to use the zero page with PROT_MTE mappings (until tag setting causes CoW). Hmm, the arm64 memcmp_pages() messed up KSM merging with the zero page even before this patch.
[...]
Unrelated to this discussion, I got an internal report that Linux hangs during boot with CONFIG_DEFERRED_STRUCT_PAGE_INIT because try_page_mte_tagging() locks up on uninitialised page flags.
Since we (always?) map the zero page as pte_special(), set_pte_at() won't check if the tags have to be initialised, so we can skip the PG_mte_tagged altogether. We actually had this code for some time until we introduced the pte_special() check in set_pte_at().
So alternative patch that also fixes the deferred struct page init (on the assumptions that the zero page is always mapped as pte_special():
I can confirm that this alternative patch also works correctly; my tests for MTE all pass ;)
This looks like a better fix since it solves the boot hang issue too.
[...]
On Wed, Sep 24, 2025 at 10:49:27AM +0800, Lance Yang wrote:
On 2025/9/24 00:14, Catalin Marinas wrote:
So alternative patch that also fixes the deferred struct page init (on the assumptions that the zero page is always mapped as pte_special():
I can confirm that this alternative patch also works correctly; my tests for MTE all pass ;)
Thanks Lance for testing. I'll post one of the variants today.
This looks like a better fix since it solves the boot hang issue too.
In principle, yes, until I tracked down why I changed it in the first place - 68d54ceeec0e ("arm64: mte: Allow PTRACE_PEEKMTETAGS access to the zero page"). ptrace() can read tags from PROT_MTE mappings and we want to allow reading zeroes as well if the page points to the zero page. Not flagging the page as PG_mte_tagged caused issues.
I can change the logic in the ptrace() code, I just need to figure out what happens to the huge zero page. Ideally we should treat both in the same way but, AFAICT, we don't use pmd_mkspecial() on the huge zero page, so it gets flagged with PG_mte_tagged.
If I go with the first fix for the page merging, I'll look to defer the zero page initialisation post page_alloc_init_late() in a separate patch.
On 24.09.25 10:50, Catalin Marinas wrote:
On Wed, Sep 24, 2025 at 10:49:27AM +0800, Lance Yang wrote:
On 2025/9/24 00:14, Catalin Marinas wrote:
So alternative patch that also fixes the deferred struct page init (on the assumptions that the zero page is always mapped as pte_special():
I can confirm that this alternative patch also works correctly; my tests for MTE all pass ;)
Thanks Lance for testing. I'll post one of the variants today.
This looks like a better fix since it solves the boot hang issue too.
In principle, yes, until I tracked down why I changed it in the first place - 68d54ceeec0e ("arm64: mte: Allow PTRACE_PEEKMTETAGS access to the zero page"). ptrace() can read tags from PROT_MTE mappings and we want to allow reading zeroes as well if the page points to the zero page. Not flagging the page as PG_mte_tagged caused issues.
I can change the logic in the ptrace() code, I just need to figure out what happens to the huge zero page. Ideally we should treat both in the same way but, AFAICT, we don't use pmd_mkspecial() on the huge zero page, so it gets flagged with PG_mte_tagged.
I changed that recently :) The huge zero folio will now always have pmd_special() set.
On Wed, Sep 24, 2025 at 11:13:18AM +0200, David Hildenbrand wrote:
I changed that recently :) The huge zero folio will now always have pmd_special() set.
Oh, which commit was this? It means that we can end up with uninitialised tags if we have a PROT_MTE huge zero page since set_pmd_at/set_pte_at() skips mte_sync_tags().
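To make the skipping explicit: roughly paraphrased (not the exact upstream code, and maybe_sync_tags() is a made-up wrapper name), the arm64 set_ptes()/set_pmd_at() path only synchronises tags for present, tagged, non-special entries, so a special huge zero folio mapping never reaches mte_sync_tags():

/*
 * Paraphrased sketch of the behaviour under discussion, not the exact
 * upstream code: special entries are skipped, so a pmd_special() huge
 * zero folio never gets its tags initialised here.
 */
static inline void maybe_sync_tags(pte_t pte, unsigned int nr_pages)
{
	if (system_supports_mte() && pte_present(pte) &&
	    pte_tagged(pte) && !pte_special(pte))
		mte_sync_tags(pte, nr_pages);
}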
On 24.09.25 11:34, Catalin Marinas wrote:
On Wed, Sep 24, 2025 at 11:13:18AM +0200, David Hildenbrand wrote:
On 24.09.25 10:50, Catalin Marinas wrote:
On Wed, Sep 24, 2025 at 10:49:27AM +0800, Lance Yang wrote:
On 2025/9/24 00:14, Catalin Marinas wrote:
So alternative patch that also fixes the deferred struct page init (on the assumptions that the zero page is always mapped as pte_special():
I can confirm that this alternative patch also works correctly; my tests for MTE all pass ;)
Thanks Lance for testing. I'll post one of the variants today.
This looks like a better fix since it solves the boot hang issue too.
In principle, yes, until I tracked down why I changed it in the first place - 68d54ceeec0e ("arm64: mte: Allow PTRACE_PEEKMTETAGS access to the zero page"). ptrace() can read tags from PROT_MTE mappings and we want to allow reading zeroes as well if the page points to the zero page. Not flagging the page as PG_mte_tagged caused issues.
I can change the logic in the ptrace() code, I just need to figure out what happens to the huge zero page. Ideally we should treat both in the same way but, AFAICT, we don't use pmd_mkspecial() on the huge zero page, so it gets flagged with PG_mte_tagged.
I changed that recently :) The huge zero folio will now always have pmd_special() set.
Oh, which commit was this? It means that we can end up with uninitialised tags if we have a PROT_MTE huge zero page since set_pmd_at/set_pte_at() skips mte_sync_tags().
This one:
commit d82d09e482199e6bbc204df10b2082f764cbe1f4
Author: David Hildenbrand david@redhat.com
Date:   Mon Aug 11 13:26:25 2025 +0200
mm/huge_memory: mark PMD mappings of the huge zero folio special
The huge zero folio is refcounted (+mapcounted -- is that a word?) differently than "normal" folios, similarly (but different) to the ordinary shared zeropage.
It should be in mm-stable, to go upstream in the upcoming merge window. It's been lurking in -next for a while now.
As it behaves just like the ordinary shared zeropage now, would we have to zero/initialize the tags after allocating it?
On Wed, Sep 24, 2025 at 11:44:19AM +0200, David Hildenbrand wrote:
It should be in mm-stable, to go upstream in the upcoming merge window. It's been lurking in -next for a while now.
Thanks. At least it's something to address in the next kernel version. I need to improve the MTE kselftests to catch the zero page scenarios.
As it behaves just like the ordinary shared zeropage now, would we have to zero/initialize the tags after allocating it?
Yes. Before pmd_special(), it used to be done lazily via set_pmd_at(). I think it just needs a __GFP_ZEROTAGS. The only other place we use this flag is in vma_alloc_zeroed_movable_folio(), as an optimisation to avoid a separate loop for zeroing the tags after the data.
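For illustration, a minimal sketch of what adding that flag at the huge zero folio allocation site could look like; the exact call and flag combination in mm/huge_memory.c may differ, so treat this as an assumption rather than the actual patch:

	/*
	 * Illustrative sketch only: ask the allocator to zero the MTE tags
	 * together with the data when the huge zero folio is created, since
	 * a pmd_special() mapping no longer initialises them lazily via
	 * set_pmd_at().
	 */
	zero_folio = folio_alloc((GFP_TRANSHUGE | __GFP_ZERO | __GFP_ZEROTAGS) &
				 ~__GFP_MOVABLE, HPAGE_PMD_ORDER);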
On Mon, Sep 22, 2025 at 10:14:58AM +0800, Lance Yang wrote:
Nice catch.
Reviewed-by: Wei Yang richard.weiyang@gmail.com