Hi all,
I have found a problem in recent kernels 6.9+ when running under the Xen hypervisor as a PV dom0. In my setup (PV Dom0 with 4G RAM, using XFS), I shrink memory to 256M via the balloon driver after boot:
xl mem-set Domain-0 256m
Once memory is reduced, even running a simple command like ls usually triggers the following warning:
[ 27.963562] [ T2553] page: refcount:88 mapcount:21 mapping:ffff88813ff6f6a8 index:0x110 pfn:0x10085c [ 27.963564] [ T2553] head: order:2 mapcount:83 entire_mapcount:0 nr_pages_mapped:4 pincount:0 [ 27.963565] [ T2553] memcg:ffff888003573000 [ 27.963567] [ T2553] aops:0xffffffff8226fd20 ino:82467c dentry name(?):"libc.so.6" [ 27.963570] [ T2553] flags: 0x2000000000416c(referenced|uptodate|lru|active|private|head|node=0|zone=2) [ 27.963573] [ T2553] raw: 002000000000416c ffffea0004021e88 ffffea0004021908 ffff88813ff6f6a8 [ 27.963574] [ T2553] raw: 0000000000000110 ffff88811bf06b60 0000005800000014 ffff888003573000 [ 27.963576] [ T2553] head: 002000000000416c ffffea0004021e88 ffffea0004021908 ffff88813ff6f6a8 [ 27.963577] [ T2553] head: 0000000000000110 ffff88811bf06b60 0000005800000014 ffff888003573000 [ 27.963578] [ T2553] head: 0020000000000202 ffffea0004021701 0000000400000052 00000000ffffffff [ 27.963580] [ T2553] head: 0000000300000003 8000000300000002 0000000000000014 0000000000000004 [ 27.963581] [ T2553] page dumped because: VM_WARN_ON_FOLIO((_Generic((page + nr_pages - 1), const struct page *: (const struct folio *)_compound_head(page + nr_pages - 1), struct page *: (struct folio *)_compound_head(page + nr_pages - 1))) != folio) [ 27.963590] [ T2553] ------------[ cut here ]------------ [ 27.963591] [ T2553] WARNING: CPU: 1 PID: 2553 at include/linux/rmap.h:427 __folio_rmap_sanity_checks+0x1a0/0x270 [ 27.963596] [ T2553] CPU: 1 UID: 0 PID: 2553 Comm: ls Not tainted 6.15.0-rc4-00002-ge0363f942651 #270 PREEMPT(undef) [ 27.963599] [ T2553] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 [ 27.963601] [ T2553] RIP: e030:__folio_rmap_sanity_checks+0x1a0/0x270 [ 27.963603] [ T2553] Code: 89 df e8 b3 7d fd ff 0f 0b e9 eb fe ff ff 48 8d 42 ff 48 39 c3 0f 84 0e ff ff ff 48 c7 c6 e8 49 5b 82 48 89 df e8 90 7d fd ff <0f> 0b e9 f8 fe ff ff 41 f7 c4 ff 0f 00 00 0f 85 b2 fe ff ff 49 8b [ 27.963605] [ T2553] RSP: e02b:ffffc90040f8fac0 EFLAGS: 00010246 [ 27.963608] [ T2553] RAX: 00000000000000e5 RBX: ffffea0004021700 RCX: 0000000000000000 [ 27.963609] [ T2553] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff [ 27.963610] [ T2553] RBP: ffffc90040f8fae0 R08: 00000000ffffdfff R09: ffffffff82929308 [ 27.963612] [ T2553] R10: ffffffff82879360 R11: 0000000000000002 R12: ffffea0004021700 [ 27.963613] [ T2553] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000005 [ 27.963625] [ T2553] FS: 00007ff06dafe740(0000) GS:ffff8880b7ccb000(0000) knlGS:0000000000000000 [ 27.963627] [ T2553] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 27.963628] [ T2553] CR2: 000055deb932f630 CR3: 0000000002844000 CR4: 0000000000050660 [ 27.963630] [ T2553] Call Trace: [ 27.963632] [ T2553] <TASK> [ 27.963633] [ T2553] folio_remove_rmap_ptes+0x24/0x2b0 [ 27.963635] [ T2553] unmap_page_range+0x132e/0x17e0 [ 27.963638] [ T2553] unmap_single_vma+0x81/0xd0 [ 27.963640] [ T2553] unmap_vmas+0xb5/0x180 [ 27.963642] [ T2553] exit_mmap+0x10c/0x460 [ 27.963644] [ T2553] mmput+0x59/0x120 [ 27.963647] [ T2553] do_exit+0x2d1/0xbd0 [ 27.963649] [ T2553] do_group_exit+0x2f/0x90 [ 27.963651] [ T2553] __x64_sys_exit_group+0x13/0x20 [ 27.963652] [ T2553] x64_sys_call+0x126b/0x1d70 [ 27.963655] [ T2553] do_syscall_64+0x54/0x120 [ 27.963657] [ T2553] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 27.963659] [ T2553] RIP: 0033:0x7ff06dbdbe9d [ 27.963660] [ T2553] Code: Unable to access opcode bytes at 0x7ff06dbdbe73. [ 27.963661] [ T2553] RSP: 002b:00007ffde44f9ac8 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7 [ 27.963664] [ T2553] RAX: ffffffffffffffda RBX: 00007ff06dce4fa8 RCX: 00007ff06dbdbe9d [ 27.963665] [ T2553] RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI: 0000000000000000 [ 27.963666] [ T2553] RBP: 0000000000000000 R08: 00007ffde44f9a70 R09: 00007ffde44f99ff [ 27.963667] [ T2553] R10: 00007ffde44f9980 R11: 0000000000000206 R12: 00007ff06dce3680 [ 27.963668] [ T2553] R13: 00007ff06dd010e0 R14: 0000000000000002 R15: 00007ff06dce4fc0 [ 27.963670] [ T2553] </TASK> [ 27.963671] [ T2553] ---[ end trace 0000000000000000 ]---
Later on bigger problems start to appear:
[ 69.035059] [ T2593] BUG: Bad page map in process ls pte:10006c125 pmd:1003dc067 [ 69.035061] [ T2593] page: refcount:8 mapcount:20 mapping:ffff88813fd736a8 index:0x110 pfn:0x10006c [ 69.035064] [ T2593] head: order:2 mapcount:-2 entire_mapcount:0 nr_pages_mapped:8388532 pincount:0 [ 69.035066] [ T2593] memcg:ffff888003573000 [ 69.035067] [ T2593] aops:0xffffffff8226fd20 ino:82467c dentry name(?):"libc.so.6" [ 69.035069] [ T2593] flags: 0x2000000000416c(referenced|uptodate|lru|active|private|head|node=0|zone=2) [ 69.035072] [ T2593] raw: 002000000000416c ffffea0004002348 ffffea0004001d08 ffff88813fd736a8 [ 69.035074] [ T2593] raw: 0000000000000110 ffff888100d19860 0000000800000013 ffff888003573000 [ 69.035076] [ T2593] head: 002000000000416c ffffea0004002348 ffffea0004001d08 ffff88813fd736a8 [ 69.035078] [ T2593] head: 0000000000000110 ffff888100d19860 0000000800000013 ffff888003573000 [ 69.035079] [ T2593] head: 0020000000000202 ffffea0004001b01 ffffffb4fffffffd 00000000ffffffff [ 69.035081] [ T2593] head: 0000000300000003 0000000500000002 0000000000000013 0000000000000004 [ 69.035082] [ T2593] page dumped because: bad pte [ 69.035083] [ T2593] addr:00007f6524b96000 vm_flags:00000075 anon_vma:0000000000000000 mapping:ffff88813fd736a8 index:110 [ 69.035086] [ T2593] file:libc.so.6 fault:xfs_filemap_fault mmap:xfs_file_mmap read_folio:xfs_vm_read_folio
The system eventually becomes unusable and typically crashes.
I was able to bisect this issue to the commit: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP").
If I understand correctly, the folio from the first warning has 4 pages, but folio_pte_batch incorrectly counts 5 pages, because expected_pte and pte are both zero-valued PTEs, and the loop is not broken in such a case because pte_same() returns true.
[ 27.963266] [ T2553] folio_pte_batch debug: printing PTE values starting at addr=0x7ff06dc11000 [ 27.963268] [ T2553] PTE[ 0] = 000000010085c125 [ 27.963272] [ T2553] PTE[ 1] = 000000010085d125 [ 27.963274] [ T2553] PTE[ 2] = 000000010085e125 [ 27.963276] [ T2553] PTE[ 3] = 000000010085f125 [ 27.963277] [ T2553] PTE[ 4] = 0000000000000000 <-- not present [ 27.963279] [ T2553] PTE[ 5] = 0000000102e47125 [ 27.963281] [ T2553] PTE[ 6] = 0000000102e48125 [ 27.963283] [ T2553] PTE[ 7] = 0000000102e49125
As a consequence, zap_present_folio_ptes() is called with nr = 5, which later calls folio_remove_rmap_ptes(), where __folio_rmap_sanity_checks() emits the warning log.
The patch in the following message fixes the problem for me. It adds a check that breaks the loop when encountering non-present PTE, before the !pte_same() check.
I still don't fully understand how the zero-valued PTE appears there or why this issue is triggered only when xen-balloon driver shrinks the memory. I wasn't able to reproduce it without shrinking.
Cheers, Petr
Petr Vaněk (1): mm: Fix folio_pte_batch() overcount with zero PTEs
mm/internal.h | 2 ++ 1 file changed, 2 insertions(+)
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") Cc: stable@vger.kernel.org Signed-off-by: Petr Vaněk arkamar@atlas.cz --- mm/internal.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h index e9695baa5922..c181fe2bac9d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, dirty = !!pte_dirty(pte); pte = __pte_batch_clear_ignored(pte, flags);
+ if (!pte_present(pte)) + break; if (!pte_same(pte, expected_pte)) break;
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") Cc: stable@vger.kernel.org Signed-off-by: Petr Vaněk arkamar@atlas.cz
mm/internal.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h index e9695baa5922..c181fe2bac9d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, dirty = !!pte_dirty(pte); pte = __pte_batch_clear_ignored(pte, flags);
if (!pte_present(pte))
if (!pte_same(pte, expected_pte)) break;break;
How could pte_same() suddenly match on a present and non-present PTE.
Something with XEN is really problematic here.
On 29/04/2025 15:29, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") Cc: stable@vger.kernel.org Signed-off-by: Petr Vaněk arkamar@atlas.cz
mm/internal.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h index e9695baa5922..c181fe2bac9d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, dirty = !!pte_dirty(pte); pte = __pte_batch_clear_ignored(pte, flags); + if (!pte_present(pte)) + break; if (!pte_same(pte, expected_pte)) break;
How could pte_same() suddenly match on a present and non-present PTE.
Something with XEN is really problematic here.
We are inside a lazy MMU region (arch_enter_lazy_mmu_mode()) at this point, which I believe XEN uses. If a PTE was written then read back while in lazy mode you could get a stale value.
See https://lore.kernel.org/all/912c7a32-b39c-494f-a29c-4865cd92aeba@agordeev.lo... for an example bug.
On 29.04.25 16:41, Ryan Roberts wrote:
On 29/04/2025 15:29, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") Cc: stable@vger.kernel.org Signed-off-by: Petr Vaněk arkamar@atlas.cz
mm/internal.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h index e9695baa5922..c181fe2bac9d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, dirty = !!pte_dirty(pte); pte = __pte_batch_clear_ignored(pte, flags); + if (!pte_present(pte)) + break; if (!pte_same(pte, expected_pte)) break;
How could pte_same() suddenly match on a present and non-present PTE.
Something with XEN is really problematic here.
We are inside a lazy MMU region (arch_enter_lazy_mmu_mode()) at this point, which I believe XEN uses. If a PTE was written then read back while in lazy mode you could get a stale value.
See https://lore.kernel.org/all/912c7a32-b39c-494f-a29c-4865cd92aeba@agordeev.lo... for an example bug.
So if we cannot trust ptep_get() output, then, ... how could we trust anything here and ever possibly batch?
On 29/04/2025 15:46, David Hildenbrand wrote:
On 29.04.25 16:41, Ryan Roberts wrote:
On 29/04/2025 15:29, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") Cc: stable@vger.kernel.org Signed-off-by: Petr Vaněk arkamar@atlas.cz
mm/internal.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h index e9695baa5922..c181fe2bac9d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, dirty = !!pte_dirty(pte); pte = __pte_batch_clear_ignored(pte, flags); + if (!pte_present(pte)) + break; if (!pte_same(pte, expected_pte)) break;
How could pte_same() suddenly match on a present and non-present PTE.
Something with XEN is really problematic here.
We are inside a lazy MMU region (arch_enter_lazy_mmu_mode()) at this point, which I believe XEN uses. If a PTE was written then read back while in lazy mode you could get a stale value.
See https://lore.kernel.org/all/912c7a32-b39c-494f-a29c-4865cd92aeba@agordeev.lo... for an example bug.
So if we cannot trust ptep_get() output, then, ... how could we trust anything here and ever possibly batch?
The point is that for a write followed by a read to the same PTE, the read may not return what was written. It could return the value of the PTE at the point of entry into the lazy mmu mode.
I guess one quick way to test is to hack out lazy mmu support. Something like this? (totally untested):
----8<---- diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index c4c23190925c..1f0a1a713072 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -541,22 +541,6 @@ static inline void arch_end_context_switch(struct task_struct *next) PVOP_VCALL1(cpu.end_context_switch, next); }
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE -static inline void arch_enter_lazy_mmu_mode(void) -{ - PVOP_VCALL0(mmu.lazy_mode.enter); -} - -static inline void arch_leave_lazy_mmu_mode(void) -{ - PVOP_VCALL0(mmu.lazy_mode.leave); -} - -static inline void arch_flush_lazy_mmu_mode(void) -{ - PVOP_VCALL0(mmu.lazy_mode.flush); -} - static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, phys_addr_t phys, pgprot_t flags) { ----8<----
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") Cc: stable@vger.kernel.org Signed-off-by: Petr Vaněk arkamar@atlas.cz
mm/internal.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h index e9695baa5922..c181fe2bac9d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, dirty = !!pte_dirty(pte); pte = __pte_batch_clear_ignored(pte, flags);
if (!pte_present(pte))
if (!pte_same(pte, expected_pte)) break;break;
How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Something with XEN is really problematic here.
-- Cheers,
David / dhildenb
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") Cc: stable@vger.kernel.org Signed-off-by: Petr Vaněk arkamar@atlas.cz
mm/internal.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h index e9695baa5922..c181fe2bac9d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, dirty = !!pte_dirty(pte); pte = __pte_batch_clear_ignored(pte, flags);
if (!pte_present(pte))
break; if (!pte_same(pte, expected_pte)) break;
How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range() if (pte_present(ptent)) zap_present_ptes() folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
On 29.04.25 16:52, David Hildenbrand wrote:
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") Cc: stable@vger.kernel.org Signed-off-by: Petr Vaněk arkamar@atlas.cz
mm/internal.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h index e9695baa5922..c181fe2bac9d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, dirty = !!pte_dirty(pte); pte = __pte_batch_clear_ignored(pte, flags);
if (!pte_present(pte))
break; if (!pte_same(pte, expected_pte)) break;
How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range() if (pte_present(ptent)) zap_present_ptes() folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
nr = pte_batch_hint(start_ptep, pte); expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The really weird thing is that this has only been seen on XEN.
But even on XEN, a present pte should not suddenly get !present -- we're not re-reading from ptep :/
On Tue, Apr 29, 2025 at 05:45:53PM +0200, David Hildenbrand wrote:
On 29.04.25 16:52, David Hildenbrand wrote:
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") Cc: stable@vger.kernel.org Signed-off-by: Petr Vaněk arkamar@atlas.cz
mm/internal.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h index e9695baa5922..c181fe2bac9d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, dirty = !!pte_dirty(pte); pte = __pte_batch_clear_ignored(pte, flags);
if (!pte_present(pte))
break; if (!pte_same(pte, expected_pte)) break;
How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range() if (pte_present(ptent)) zap_present_ptes() folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
nr = pte_batch_hint(start_ptep, pte); expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The issue is not happening in __pte_batch_clear_ignored but later in following line:
expected_pte = pte_advance_pfn(expected_pte, nr);
The issue seems to be in __pte function which converts PTE value to pte_t in pte_advance_pfn, because warnings disappears when I change the line to
expected_pte = (pte_t){ .pte = pte_val(expected_pte) + (nr << PFN_PTE_SHIFT) };
The kernel probably uses __pte function from arch/x86/include/asm/paravirt.h because it is configured with CONFIG_PARAVIRT=y:
static inline pte_t __pte(pteval_t val) { return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val, "mov %%rdi, %%rax", ALT_NOT_XEN) }; }
I guess it might cause this weird magic, but I need more time to understand what it does :)
The really weird thing is that this has only been seen on XEN.
But even on XEN, a present pte should not suddenly get !present -- we're not re-reading from ptep :/
-- Cheers,
David / dhildenb
On 29.04.25 20:33, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 05:45:53PM +0200, David Hildenbrand wrote:
On 29.04.25 16:52, David Hildenbrand wrote:
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") Cc: stable@vger.kernel.org Signed-off-by: Petr Vaněk arkamar@atlas.cz
mm/internal.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h index e9695baa5922..c181fe2bac9d 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, dirty = !!pte_dirty(pte); pte = __pte_batch_clear_ignored(pte, flags);
if (!pte_present(pte))
break; if (!pte_same(pte, expected_pte)) break;
How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range() if (pte_present(ptent)) zap_present_ptes() folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
nr = pte_batch_hint(start_ptep, pte); expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The issue is not happening in __pte_batch_clear_ignored but later in following line:
expected_pte = pte_advance_pfn(expected_pte, nr);
The issue seems to be in __pte function which converts PTE value to pte_t in pte_advance_pfn, because warnings disappears when I change the line to
expected_pte = (pte_t){ .pte = pte_val(expected_pte) + (nr << PFN_PTE_SHIFT) };
The kernel probably uses __pte function from arch/x86/include/asm/paravirt.h because it is configured with CONFIG_PARAVIRT=y:
static inline pte_t __pte(pteval_t val) { return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val, "mov %%rdi, %%rax", ALT_NOT_XEN) }; }
I guess it might cause this weird magic, but I need more time to understand what it does :)
What XEN does with basic primitives that convert between pteval and pte_t is beyond horrible.
How come set_ptes() that uses pte_next_pfn()->pte_advance_pfn() does not run into this?
Is it only a problem if we exceed a certain pfn?
linux-stable-mirror@lists.linaro.org