Hi all,
I have found a problem in recent kernels (6.9+) when running under the Xen hypervisor as a PV dom0. In my setup (PV dom0 with 4G RAM, using XFS), I shrink the domain's memory to 256M via the balloon driver after boot:
xl mem-set Domain-0 256m
Once memory is reduced, even running a simple command like ls usually triggers the following warning:
[ 27.963562] [ T2553] page: refcount:88 mapcount:21 mapping:ffff88813ff6f6a8 index:0x110 pfn:0x10085c
[ 27.963564] [ T2553] head: order:2 mapcount:83 entire_mapcount:0 nr_pages_mapped:4 pincount:0
[ 27.963565] [ T2553] memcg:ffff888003573000
[ 27.963567] [ T2553] aops:0xffffffff8226fd20 ino:82467c dentry name(?):"libc.so.6"
[ 27.963570] [ T2553] flags: 0x2000000000416c(referenced|uptodate|lru|active|private|head|node=0|zone=2)
[ 27.963573] [ T2553] raw: 002000000000416c ffffea0004021e88 ffffea0004021908 ffff88813ff6f6a8
[ 27.963574] [ T2553] raw: 0000000000000110 ffff88811bf06b60 0000005800000014 ffff888003573000
[ 27.963576] [ T2553] head: 002000000000416c ffffea0004021e88 ffffea0004021908 ffff88813ff6f6a8
[ 27.963577] [ T2553] head: 0000000000000110 ffff88811bf06b60 0000005800000014 ffff888003573000
[ 27.963578] [ T2553] head: 0020000000000202 ffffea0004021701 0000000400000052 00000000ffffffff
[ 27.963580] [ T2553] head: 0000000300000003 8000000300000002 0000000000000014 0000000000000004
[ 27.963581] [ T2553] page dumped because: VM_WARN_ON_FOLIO((_Generic((page + nr_pages - 1), const struct page *: (const struct folio *)_compound_head(page + nr_pages - 1), struct page *: (struct folio *)_compound_head(page + nr_pages - 1))) != folio)
[ 27.963590] [ T2553] ------------[ cut here ]------------
[ 27.963591] [ T2553] WARNING: CPU: 1 PID: 2553 at include/linux/rmap.h:427 __folio_rmap_sanity_checks+0x1a0/0x270
[ 27.963596] [ T2553] CPU: 1 UID: 0 PID: 2553 Comm: ls Not tainted 6.15.0-rc4-00002-ge0363f942651 #270 PREEMPT(undef)
[ 27.963599] [ T2553] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-20240910_120124-localhost 04/01/2014
[ 27.963601] [ T2553] RIP: e030:__folio_rmap_sanity_checks+0x1a0/0x270
[ 27.963603] [ T2553] Code: 89 df e8 b3 7d fd ff 0f 0b e9 eb fe ff ff 48 8d 42 ff 48 39 c3 0f 84 0e ff ff ff 48 c7 c6 e8 49 5b 82 48 89 df e8 90 7d fd ff <0f> 0b e9 f8 fe ff ff 41 f7 c4 ff 0f 00 00 0f 85 b2 fe ff ff 49 8b
[ 27.963605] [ T2553] RSP: e02b:ffffc90040f8fac0 EFLAGS: 00010246
[ 27.963608] [ T2553] RAX: 00000000000000e5 RBX: ffffea0004021700 RCX: 0000000000000000
[ 27.963609] [ T2553] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
[ 27.963610] [ T2553] RBP: ffffc90040f8fae0 R08: 00000000ffffdfff R09: ffffffff82929308
[ 27.963612] [ T2553] R10: ffffffff82879360 R11: 0000000000000002 R12: ffffea0004021700
[ 27.963613] [ T2553] R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000005
[ 27.963625] [ T2553] FS: 00007ff06dafe740(0000) GS:ffff8880b7ccb000(0000) knlGS:0000000000000000
[ 27.963627] [ T2553] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 27.963628] [ T2553] CR2: 000055deb932f630 CR3: 0000000002844000 CR4: 0000000000050660
[ 27.963630] [ T2553] Call Trace:
[ 27.963632] [ T2553] <TASK>
[ 27.963633] [ T2553] folio_remove_rmap_ptes+0x24/0x2b0
[ 27.963635] [ T2553] unmap_page_range+0x132e/0x17e0
[ 27.963638] [ T2553] unmap_single_vma+0x81/0xd0
[ 27.963640] [ T2553] unmap_vmas+0xb5/0x180
[ 27.963642] [ T2553] exit_mmap+0x10c/0x460
[ 27.963644] [ T2553] mmput+0x59/0x120
[ 27.963647] [ T2553] do_exit+0x2d1/0xbd0
[ 27.963649] [ T2553] do_group_exit+0x2f/0x90
[ 27.963651] [ T2553] __x64_sys_exit_group+0x13/0x20
[ 27.963652] [ T2553] x64_sys_call+0x126b/0x1d70
[ 27.963655] [ T2553] do_syscall_64+0x54/0x120
[ 27.963657] [ T2553] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 27.963659] [ T2553] RIP: 0033:0x7ff06dbdbe9d
[ 27.963660] [ T2553] Code: Unable to access opcode bytes at 0x7ff06dbdbe73.
[ 27.963661] [ T2553] RSP: 002b:00007ffde44f9ac8 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
[ 27.963664] [ T2553] RAX: ffffffffffffffda RBX: 00007ff06dce4fa8 RCX: 00007ff06dbdbe9d
[ 27.963665] [ T2553] RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI: 0000000000000000
[ 27.963666] [ T2553] RBP: 0000000000000000 R08: 00007ffde44f9a70 R09: 00007ffde44f99ff
[ 27.963667] [ T2553] R10: 00007ffde44f9980 R11: 0000000000000206 R12: 00007ff06dce3680
[ 27.963668] [ T2553] R13: 00007ff06dd010e0 R14: 0000000000000002 R15: 00007ff06dce4fc0
[ 27.963670] [ T2553] </TASK>
[ 27.963671] [ T2553] ---[ end trace 0000000000000000 ]---
Later on bigger problems start to appear:
[ 69.035059] [ T2593] BUG: Bad page map in process ls pte:10006c125 pmd:1003dc067
[ 69.035061] [ T2593] page: refcount:8 mapcount:20 mapping:ffff88813fd736a8 index:0x110 pfn:0x10006c
[ 69.035064] [ T2593] head: order:2 mapcount:-2 entire_mapcount:0 nr_pages_mapped:8388532 pincount:0
[ 69.035066] [ T2593] memcg:ffff888003573000
[ 69.035067] [ T2593] aops:0xffffffff8226fd20 ino:82467c dentry name(?):"libc.so.6"
[ 69.035069] [ T2593] flags: 0x2000000000416c(referenced|uptodate|lru|active|private|head|node=0|zone=2)
[ 69.035072] [ T2593] raw: 002000000000416c ffffea0004002348 ffffea0004001d08 ffff88813fd736a8
[ 69.035074] [ T2593] raw: 0000000000000110 ffff888100d19860 0000000800000013 ffff888003573000
[ 69.035076] [ T2593] head: 002000000000416c ffffea0004002348 ffffea0004001d08 ffff88813fd736a8
[ 69.035078] [ T2593] head: 0000000000000110 ffff888100d19860 0000000800000013 ffff888003573000
[ 69.035079] [ T2593] head: 0020000000000202 ffffea0004001b01 ffffffb4fffffffd 00000000ffffffff
[ 69.035081] [ T2593] head: 0000000300000003 0000000500000002 0000000000000013 0000000000000004
[ 69.035082] [ T2593] page dumped because: bad pte
[ 69.035083] [ T2593] addr:00007f6524b96000 vm_flags:00000075 anon_vma:0000000000000000 mapping:ffff88813fd736a8 index:110
[ 69.035086] [ T2593] file:libc.so.6 fault:xfs_filemap_fault mmap:xfs_file_mmap read_folio:xfs_vm_read_folio
The system eventually becomes unusable and typically crashes.
I was able to bisect this issue to commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP").
If I understand correctly, the folio from the first warning has 4 pages, but folio_pte_batch() counts 5, because on the fifth iteration both expected_pte and pte are zero-valued PTEs: pte_same() returns true for them, so the loop is not broken.
[ 27.963266] [ T2553] folio_pte_batch debug: printing PTE values starting at addr=0x7ff06dc11000
[ 27.963268] [ T2553] PTE[ 0] = 000000010085c125
[ 27.963272] [ T2553] PTE[ 1] = 000000010085d125
[ 27.963274] [ T2553] PTE[ 2] = 000000010085e125
[ 27.963276] [ T2553] PTE[ 3] = 000000010085f125
[ 27.963277] [ T2553] PTE[ 4] = 0000000000000000 <-- not present
[ 27.963279] [ T2553] PTE[ 5] = 0000000102e47125
[ 27.963281] [ T2553] PTE[ 6] = 0000000102e48125
[ 27.963283] [ T2553] PTE[ 7] = 0000000102e49125
As a consequence, zap_present_folio_ptes() is called with nr = 5, which later calls folio_remove_rmap_ptes(), where __folio_rmap_sanity_checks() emits the warning log.
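For reference, this is roughly the shape of the relevant part of folio_pte_batch() in mm/internal.h on the affected kernels (paraphrased and trimmed, not the verbatim source):

	nr = pte_batch_hint(start_ptep, pte);
	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
	ptep = start_ptep + nr;

	while (ptep < end_ptep) {
		pte = __pte_batch_clear_ignored(ptep_get(ptep), flags);

		if (!pte_same(pte, expected_pte))
			break;	/* never taken when both values are 0 */

		/* ... accumulate writable/young/dirty ... */

		nr = pte_batch_hint(ptep, pte);
		expected_pte = pte_advance_pfn(expected_pte, nr);
		ptep += nr;
	}

With PTE[4] and the advanced expected_pte both reading as 0, the pte_same() check matches and nr is bumped past the folio.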
The patch in the following message fixes the problem for me. It adds a check that breaks the loop when a non-present PTE is encountered, before the !pte_same() check.
I still don't fully understand how the zero-valued PTE appears there, or why this issue is triggered only when the xen-balloon driver shrinks the memory. I wasn't able to reproduce it without shrinking.
Cheers, Petr
Petr Vaněk (1):
  mm: Fix folio_pte_batch() overcount with zero PTEs

 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
---
 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>

 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
How could pte_same() suddenly match on a present and non-present PTE.
Something with XEN is really problematic here.
On 29/04/2025 15:29, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>

 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
How could pte_same() suddenly match on a present and non-present PTE.
Something with XEN is really problematic here.
We are inside a lazy MMU region (arch_enter_lazy_mmu_mode()) at this point, which I believe XEN uses. If a PTE was written then read back while in lazy mode you could get a stale value.
See https://lore.kernel.org/all/912c7a32-b39c-494f-a29c-4865cd92aeba@agordeev.lo... for an example bug.
On 29.04.25 16:41, Ryan Roberts wrote:
On 29/04/2025 15:29, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>

 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
How could pte_same() suddenly match on a present and non-present PTE.
Something with XEN is really problematic here.
We are inside a lazy MMU region (arch_enter_lazy_mmu_mode()) at this point, which I believe XEN uses. If a PTE was written then read back while in lazy mode you could get a stale value.
See https://lore.kernel.org/all/912c7a32-b39c-494f-a29c-4865cd92aeba@agordeev.lo... for an example bug.
So if we cannot trust ptep_get() output, then, ... how could we trust anything here and ever possibly batch?
On 29/04/2025 15:46, David Hildenbrand wrote:
On 29.04.25 16:41, Ryan Roberts wrote:
On 29/04/2025 15:29, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>

 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
How could pte_same() suddenly match on a present and non-present PTE.
Something with XEN is really problematic here.
We are inside a lazy MMU region (arch_enter_lazy_mmu_mode()) at this point, which I believe XEN uses. If a PTE was written then read back while in lazy mode you could get a stale value.
See https://lore.kernel.org/all/912c7a32-b39c-494f-a29c-4865cd92aeba@agordeev.lo... for an example bug.
So if we cannot trust ptep_get() output, then, ... how could we trust anything here and ever possibly batch?
The point is that for a write followed by a read to the same PTE, the read may not return what was written. It could return the value of the PTE at the point of entry into the lazy mmu mode.
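A hypothetical illustration of the hazard (not actual kernel code, just the write-then-read pattern under lazy MMU mode):

	arch_enter_lazy_mmu_mode();
	set_pte_at(mm, addr, ptep, newpte);	/* update may only be queued */
	pte = ptep_get(ptep);			/* may still observe the old value */
	arch_leave_lazy_mmu_mode();		/* queued updates get flushed here */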
I guess one quick way to test is to hack out lazy mmu support. Something like this? (totally untested):
----8<----
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c4c23190925c..1f0a1a713072 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -541,22 +541,6 @@ static inline void arch_end_context_switch(struct task_struct *next)
 	PVOP_VCALL1(cpu.end_context_switch, next);
 }
 
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-static inline void arch_enter_lazy_mmu_mode(void)
-{
-	PVOP_VCALL0(mmu.lazy_mode.enter);
-}
-
-static inline void arch_leave_lazy_mmu_mode(void)
-{
-	PVOP_VCALL0(mmu.lazy_mode.leave);
-}
-
-static inline void arch_flush_lazy_mmu_mode(void)
-{
-	PVOP_VCALL0(mmu.lazy_mode.flush);
-}
-
 static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 				phys_addr_t phys, pgprot_t flags)
 {
----8<----
On Tue, Apr 29, 2025 at 04:02:10PM +0100, Ryan Roberts wrote:
On 29/04/2025 15:46, David Hildenbrand wrote:
On 29.04.25 16:41, Ryan Roberts wrote:
On 29/04/2025 15:29, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>

 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
How could pte_same() suddenly match on a present and non-present PTE.
Something with XEN is really problematic here.
We are inside a lazy MMU region (arch_enter_lazy_mmu_mode()) at this point, which I believe XEN uses. If a PTE was written then read back while in lazy mode you could get a stale value.
See https://lore.kernel.org/all/912c7a32-b39c-494f-a29c-4865cd92aeba@agordeev.lo... for an example bug.
So if we cannot trust ptep_get() output, then, ... how could we trust anything here and ever possibly batch?
The point is that for a write followed by a read to the same PTE, the read may not return what was written. It could return the value of the PTE at the point of entry into the lazy mmu mode.
I guess one quick way to test is to hack out lazy mmu support. Something like this? (totally untested):
I (blindly) applied the suggested change but I am still seeing the same issue.
Petr
----8<----
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c4c23190925c..1f0a1a713072 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -541,22 +541,6 @@ static inline void arch_end_context_switch(struct task_struct *next)
 	PVOP_VCALL1(cpu.end_context_switch, next);
 }
 
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-static inline void arch_enter_lazy_mmu_mode(void)
-{
-	PVOP_VCALL0(mmu.lazy_mode.enter);
-}
-
-static inline void arch_leave_lazy_mmu_mode(void)
-{
-	PVOP_VCALL0(mmu.lazy_mode.leave);
-}
-
-static inline void arch_flush_lazy_mmu_mode(void)
-{
-	PVOP_VCALL0(mmu.lazy_mode.flush);
-}
-
 static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 				phys_addr_t phys, pgprot_t flags)
 {
----8<----
On 30/04/2025 14:04, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:02:10PM +0100, Ryan Roberts wrote:
On 29/04/2025 15:46, David Hildenbrand wrote:
On 29.04.25 16:41, Ryan Roberts wrote:
On 29/04/2025 15:29, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>

 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
How could pte_same() suddenly match on a present and non-present PTE.
Something with XEN is really problematic here.
We are inside a lazy MMU region (arch_enter_lazy_mmu_mode()) at this point, which I believe XEN uses. If a PTE was written then read back while in lazy mode you could get a stale value.
See https://lore.kernel.org/all/912c7a32-b39c-494f-a29c-4865cd92aeba@agordeev.lo... for an example bug.
So if we cannot trust ptep_get() output, then, ... how could we trust anything here and ever possibly batch?
The point is that for a write followed by a read to the same PTE, the read may not return what was written. It could return the value of the PTE at the point of entry into the lazy mmu mode.
I guess one quick way to test is to hack out lazy mmu support. Something like this? (totally untested):
I (blindly) applied the suggested change but I am still seeing the same issue.
Thanks for trying; it was just something that came to mind as a possibility, knowing it was XEN and inside a lazy mmu region. I think your other discussion has concluded that the x86 implementation of pte_advance_pfn() is not correct when XEN is in use? (I was just scanning, so perhaps I came to the wrong conclusion.)
Thanks, Ryan
Petr
----8<----
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c4c23190925c..1f0a1a713072 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -541,22 +541,6 @@ static inline void arch_end_context_switch(struct task_struct *next)
 	PVOP_VCALL1(cpu.end_context_switch, next);
 }
 
-#define __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-static inline void arch_enter_lazy_mmu_mode(void)
-{
-	PVOP_VCALL0(mmu.lazy_mode.enter);
-}
-
-static inline void arch_leave_lazy_mmu_mode(void)
-{
-	PVOP_VCALL0(mmu.lazy_mode.leave);
-}
-
-static inline void arch_flush_lazy_mmu_mode(void)
-{
-	PVOP_VCALL0(mmu.lazy_mode.flush);
-}
-
 static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx,
 				phys_addr_t phys, pgprot_t flags)
 {
----8<----
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>

 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
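For reference, pte_same() is a raw value comparison, at least on x86 (arch/x86/include/asm/pgtable.h), so two zero-valued, non-present PTEs always compare equal:

	static inline int pte_same(pte_t a, pte_t b)
	{
		return a.pte == b.pte;
	}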
Something with XEN is really problematic here.
-- Cheers,
David / dhildenb
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>

 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range()
  if (pte_present(ptent))
    zap_present_ptes()
      folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
On 29.04.25 16:52, David Hildenbrand wrote:
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>

 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range()
  if (pte_present(ptent))
    zap_present_ptes()
      folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The really weird thing is that this has only been seen on XEN.
But even on XEN, a present pte should not suddenly get !present -- we're not re-reading from ptep :/
On Tue, Apr 29, 2025 at 05:45:53PM +0200, David Hildenbrand wrote:
On 29.04.25 16:52, David Hildenbrand wrote:
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>

 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range()
  if (pte_present(ptent))
    zap_present_ptes()
      folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The issue is not happening in __pte_batch_clear_ignored() but later, in the following line:
expected_pte = pte_advance_pfn(expected_pte, nr);
The issue seems to be in the __pte() function, which converts the PTE value to pte_t in pte_advance_pfn(), because the warning disappears when I change the line to
expected_pte = (pte_t){ .pte = pte_val(expected_pte) + (nr << PFN_PTE_SHIFT) };
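For context, pte_advance_pfn() is the generic helper in include/linux/pgtable.h (quoted from memory, so take the exact shape with a grain of salt); the round-trip through __pte() is where the paravirt hook comes in:

	static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
	{
		return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
	}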
The kernel probably uses the __pte() function from arch/x86/include/asm/paravirt.h, because it is configured with CONFIG_PARAVIRT=y:
static inline pte_t __pte(pteval_t val)
{
	return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val,
					  "mov %%rdi, %%rax", ALT_NOT_XEN) };
}
I guess it might cause this weird magic, but I need more time to understand what it does :)
The really weird thing is that this has only been seen on XEN.
But even on XEN, a present pte should not suddenly get !present -- we're not re-reading from ptep :/
-- Cheers,
David / dhildenb
On 29.04.25 20:33, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 05:45:53PM +0200, David Hildenbrand wrote:
On 29.04.25 16:52, David Hildenbrand wrote:
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>

 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);
 
+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range()
  if (pte_present(ptent))
    zap_present_ptes()
      folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The issue is not happening in __pte_batch_clear_ignored but later in following line:
expected_pte = pte_advance_pfn(expected_pte, nr);
The issue seems to be in __pte function which converts PTE value to pte_t in pte_advance_pfn, because warnings disappears when I change the line to
expected_pte = (pte_t){ .pte = pte_val(expected_pte) + (nr << PFN_PTE_SHIFT) };
The kernel probably uses __pte function from arch/x86/include/asm/paravirt.h because it is configured with CONFIG_PARAVIRT=y:
static inline pte_t __pte(pteval_t val)
{
	return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val,
					  "mov %%rdi, %%rax", ALT_NOT_XEN) };
}
I guess it might cause this weird magic, but I need more time to understand what it does :)
What XEN does with basic primitives that convert between pteval and pte_t is beyond horrible.
How come set_ptes() that uses pte_next_pfn()->pte_advance_pfn() does not run into this?
Is it only a problem if we exceed a certain pfn?
On Tue, Apr 29, 2025 at 08:56:03PM +0200, David Hildenbrand wrote:
On 29.04.25 20:33, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 05:45:53PM +0200, David Hildenbrand wrote:
On 29.04.25 16:52, David Hildenbrand wrote:
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
> folio_pte_batch() could overcount the number of contiguous PTEs when
> pte_advance_pfn() returns a zero-valued PTE and the following PTE in
> memory also happens to be zero. The loop doesn't break in such a case
> because pte_same() returns true, and the batch size is advanced by one
> more than it should be.
>
> To fix this, bail out early if a non-present PTE is encountered,
> preventing the invalid comparison.
>
> This issue started to appear after commit 10ebac4f95e7 ("mm/memory:
> optimize unmap/zap with PTE-mapped THP") and was discovered via git
> bisect.
>
> Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
> Cc: stable@vger.kernel.org
> Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
> ---
> mm/internal.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index e9695baa5922..c181fe2bac9d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> 		dirty = !!pte_dirty(pte);
> 		pte = __pte_batch_clear_ignored(pte, flags);
>
> +		if (!pte_present(pte))
> +			break;
> 		if (!pte_same(pte, expected_pte))
> 			break;
How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range()
  if (pte_present(ptent))
    zap_present_ptes()
      folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The issue is not happening in __pte_batch_clear_ignored but later in following line:
expected_pte = pte_advance_pfn(expected_pte, nr);
The issue seems to be in __pte function which converts PTE value to pte_t in pte_advance_pfn, because warnings disappears when I change the line to
expected_pte = (pte_t){ .pte = pte_val(expected_pte) + (nr << PFN_PTE_SHIFT) };
The kernel probably uses __pte function from arch/x86/include/asm/paravirt.h because it is configured with CONFIG_PARAVIRT=y:
static inline pte_t __pte(pteval_t val)
{
	return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val,
					  "mov %%rdi, %%rax", ALT_NOT_XEN) };
}
I guess it might cause this weird magic, but I need more time to understand what it does :)
I understand it a bit more now. __pte() uses xen_make_pte(), which calls pte_pfn_to_mfn(); however, the mfn for this pfn is INVALID_P2M_ENTRY, therefore pte_pfn_to_mfn() returns 0, see [1].
I guess that the mfn was invalidated by the xen-balloon driver?
[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/...
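For reference, the branch in question in pte_pfn_to_mfn() (arch/x86/xen/mmu_pv.c; the same code the patch further below modifies) looks like this:

	mfn = __pfn_to_mfn(pfn);

	/*
	 * If there's no mfn for the pfn, then just create an
	 * empty non-present pte.  Unfortunately this loses
	 * information about the original pfn, so
	 * pte_mfn_to_pfn is asymmetric.
	 */
	if (unlikely(mfn == INVALID_P2M_ENTRY)) {
		mfn = 0;
		flags = 0;
	}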
What XEN does with basic primitives that convert between pteval and pte_t is beyond horrible.
How come set_ptes() that uses pte_next_pfn()->pte_advance_pfn() does not run into this?
I don't know, but I guess it is somehow related to pfn->mfn translation.
Is it only a problem if we exceed a certain pfn?
No, it is a problem whenever the mfn corresponding to a given pfn is invalid.
I am not sure if my original patch is a good fix. Maybe it would be better to have some sort of native_pte_advance_pfn() which would use native_make_pte() rather than __pte(). Or do you think the issue is in the Xen part?
Cheers, Petr
On 30.04.25 13:52, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 08:56:03PM +0200, David Hildenbrand wrote:
On 29.04.25 20:33, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 05:45:53PM +0200, David Hildenbrand wrote:
On 29.04.25 16:52, David Hildenbrand wrote:
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
> On 29.04.25 16:22, Petr Vaněk wrote:
>> folio_pte_batch() could overcount the number of contiguous PTEs when
>> pte_advance_pfn() returns a zero-valued PTE and the following PTE in
>> memory also happens to be zero. The loop doesn't break in such a case
>> because pte_same() returns true, and the batch size is advanced by one
>> more than it should be.
>>
>> To fix this, bail out early if a non-present PTE is encountered,
>> preventing the invalid comparison.
>>
>> This issue started to appear after commit 10ebac4f95e7 ("mm/memory:
>> optimize unmap/zap with PTE-mapped THP") and was discovered via git
>> bisect.
>>
>> Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
>> ---
>> mm/internal.h | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index e9695baa5922..c181fe2bac9d 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>> 		dirty = !!pte_dirty(pte);
>> 		pte = __pte_batch_clear_ignored(pte, flags);
>>
>> +		if (!pte_present(pte))
>> +			break;
>> 		if (!pte_same(pte, expected_pte))
>> 			break;
>
> How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range()
  if (pte_present(ptent))
    zap_present_ptes()
      folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The issue is not happening in __pte_batch_clear_ignored but later in following line:
expected_pte = pte_advance_pfn(expected_pte, nr);
The issue seems to be in __pte function which converts PTE value to pte_t in pte_advance_pfn, because warnings disappears when I change the line to
expected_pte = (pte_t){ .pte = pte_val(expected_pte) + (nr << PFN_PTE_SHIFT) };
The kernel probably uses __pte function from arch/x86/include/asm/paravirt.h because it is configured with CONFIG_PARAVIRT=y:
static inline pte_t __pte(pteval_t val)
{
	return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val,
					  "mov %%rdi, %%rax", ALT_NOT_XEN) };
}
I guess it might cause this weird magic, but I need more time to understand what it does :)
I understand it slightly more. __pte() uses xen_make_pte(), which calls pte_pfn_to_mfn(), however, mfn for this pfn contains INVALID_P2M_ENTRY value, therefore the pte_pfn_to_mfn() returns 0, see [1].
I guess that the mfn was invalidated by xen-balloon driver?
[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/...
What XEN does with basic primitives that convert between pteval and pte_t is beyond horrible.
How come set_ptes() that uses pte_next_pfn()->pte_advance_pfn() does not run into this?
I don't know, but I guess it is somehow related to pfn->mfn translation.
Is it only a problem if we exceed a certain pfn?
No, it is a problem whenever the mfn corresponding to a given pfn is invalid.
I am not sure if my original patch is a good fix.
No :)
Maybe it would be
better to have some sort of native_pte_advance_pfn() which will use native_make_pte() rather than __pte(). Or do you think the issue is in Xen part?
I think what's happening is that -- under XEN only -- we might get garbage when calling pte_advance_pfn() and the next PFN would no longer fall into the folio. And the current code cannot deal with that XEN garbage.
But still not 100% sure.
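Assuming that theory is right, the sequence with the values from the first report would be something like:

	folio spans pfn 0x10085c..0x10085f (order-2, 4 pages)
	PTE[3] = 0x10085f125                    (present, last page of the folio)
	expected_pte = pte_advance_pfn(..., 1)  -> pfn 0x100860, outside the folio
	  -> xen_make_pte(): __pfn_to_mfn(0x100860) == INVALID_P2M_ENTRY
	  -> returns the empty PTE 0
	PTE[4] = 0                              (not present)
	pte_same(0, 0) -> true, so nr ends up as 5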
The following is completely untested, could you give that a try? I might find some time this evening to test myself and try to further improve it.
From 7d4149a5ea18cba6a694946e59efa9f51d793a4e Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Wed, 30 Apr 2025 16:35:12 +0200
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index e9695baa59226..a9ea7f62486ec 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -248,11 +248,9 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
 		bool *any_writable, bool *any_young, bool *any_dirty)
 {
-	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
-	const pte_t *end_ptep = start_ptep + max_nr;
 	pte_t expected_pte, *ptep;
 	bool writable, young, dirty;
-	int nr;
+	int nr, cur_nr;
 
 	if (any_writable)
 		*any_writable = false;
@@ -265,11 +263,17 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
 	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);
 
+	/* Limit max_nr to the actual remaining PFNs in the folio. */
+	max_nr = min_t(unsigned long, max_nr,
+		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
+	if (unlikely(max_nr == 1))
+		return 1;
+
 	nr = pte_batch_hint(start_ptep, pte);
 	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
 	ptep = start_ptep + nr;
 
-	while (ptep < end_ptep) {
+	while (nr < max_nr) {
 		pte = ptep_get(ptep);
 		if (any_writable)
 			writable = !!pte_write(pte);
@@ -282,14 +286,6 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		if (!pte_same(pte, expected_pte))
 			break;
 
-		/*
-		 * Stop immediately once we reached the end of the folio. In
-		 * corner cases the next PFN might fall into a different
-		 * folio.
-		 */
-		if (pte_pfn(pte) >= folio_end_pfn)
-			break;
-
 		if (any_writable)
 			*any_writable |= writable;
 		if (any_young)
@@ -297,12 +293,13 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		if (any_dirty)
 			*any_dirty |= dirty;
 
-		nr = pte_batch_hint(ptep, pte);
-		expected_pte = pte_advance_pfn(expected_pte, nr);
-		ptep += nr;
+		cur_nr = pte_batch_hint(ptep, pte);
+		expected_pte = pte_advance_pfn(expected_pte, cur_nr);
+		ptep += cur_nr;
+		nr += cur_nr;
 	}
 
-	return min(ptep - start_ptep, max_nr);
+	return min(nr, max_nr);
 }
 
 /**
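To see how the clamp would neutralize the reported case, plug in the numbers from the first warning (illustrative values, assuming the batch starts at the folio head):

	folio_pfn(folio)      == 0x10085c
	folio_nr_pages(folio) == 4
	pte_pfn(pte)          == 0x10085c

	max_nr = min_t(unsigned long, max_nr,
		       0x10085c + 4 - 0x10085c)  == min(max_nr, 4)

so the loop can never consume the fifth, garbage PTE, whatever pte_advance_pfn() returns for it.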
On Wed, Apr 30, 2025 at 04:37:21PM +0200, David Hildenbrand wrote:
On 30.04.25 13:52, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 08:56:03PM +0200, David Hildenbrand wrote:
On 29.04.25 20:33, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 05:45:53PM +0200, David Hildenbrand wrote:
On 29.04.25 16:52, David Hildenbrand wrote:
On 29.04.25 16:45, Petr Vaněk wrote:
> On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
>> On 29.04.25 16:22, Petr Vaněk wrote:
>>> folio_pte_batch() could overcount the number of contiguous PTEs when
>>> pte_advance_pfn() returns a zero-valued PTE and the following PTE in
>>> memory also happens to be zero. The loop doesn't break in such a case
>>> because pte_same() returns true, and the batch size is advanced by one
>>> more than it should be.
>>>
>>> To fix this, bail out early if a non-present PTE is encountered,
>>> preventing the invalid comparison.
>>>
>>> This issue started to appear after commit 10ebac4f95e7 ("mm/memory:
>>> optimize unmap/zap with PTE-mapped THP") and was discovered via git
>>> bisect.
>>>
>>> Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
>>> ---
>>> mm/internal.h | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index e9695baa5922..c181fe2bac9d 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>> 		dirty = !!pte_dirty(pte);
>>> 		pte = __pte_batch_clear_ignored(pte, flags);
>>>
>>> +		if (!pte_present(pte))
>>> +			break;
>>> 		if (!pte_same(pte, expected_pte))
>>> 			break;
>>
>> How could pte_same() suddenly match on a present and non-present PTE.
>
> In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well.
> pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range()
  if (pte_present(ptent))
    zap_present_ptes()
      folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The issue is not happening in __pte_batch_clear_ignored but later in following line:
expected_pte = pte_advance_pfn(expected_pte, nr);
The issue seems to be in __pte function which converts PTE value to pte_t in pte_advance_pfn, because warnings disappears when I change the line to
expected_pte = (pte_t){ .pte = pte_val(expected_pte) + (nr << PFN_PTE_SHIFT) };
The kernel probably uses __pte function from arch/x86/include/asm/paravirt.h because it is configured with CONFIG_PARAVIRT=y:
static inline pte_t __pte(pteval_t val)
{
	return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val,
					  "mov %%rdi, %%rax", ALT_NOT_XEN) };
}
I guess it might cause this weird magic, but I need more time to understand what it does :)
I understand it slightly more. __pte() uses xen_make_pte(), which calls pte_pfn_to_mfn(), however, mfn for this pfn contains INVALID_P2M_ENTRY value, therefore the pte_pfn_to_mfn() returns 0, see [1].
I guess that the mfn was invalidated by xen-balloon driver?
[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/...
What XEN does with basic primitives that convert between pteval and pte_t is beyond horrible.
How come set_ptes() that uses pte_next_pfn()->pte_advance_pfn() does not run into this?
I don't know, but I guess it is somehow related to pfn->mfn translation.
Is it only a problem if we exceed a certain pfn?
No, it is a problem whenever the mfn corresponding to a given pfn is invalid.
I am not sure if my original patch is a good fix.
No :)
Maybe it would be
better to have some sort of native_pte_advance_pfn() which will use native_make_pte() rather than __pte(). Or do you think the issue is in Xen part?
I think what's happening is that -- under XEN only -- we might get garbage when calling pte_advance_pfn() and the next PFN would no longer fall into the folio. And the current code cannot deal with that XEN garbage.
But still not 100% sure.
The following is completely untested, could you give that a try?
Yes, it solves the issue for me.
However, maybe it would be better to solve it with the following patch. pte_pfn_to_mfn() actually returns the same value for non-present PTEs. I suggest returning the original PTE when the mfn is INVALID_P2M_ENTRY, rather than an empty non-present PTE, but with the _PAGE_PRESENT bit cleared. Thus we do not lose the information about the original pfn, yet it is still clear that the page is not present.
From e84781f9ec4fb7275d5e7629cf7e222466caf759 Mon Sep 17 00:00:00 2001
From: Petr Vaněk <arkamar@atlas.cz>
Date: Wed, 30 Apr 2025 17:08:41 +0200
Subject: [PATCH] x86/mm: Reset pte _PAGE_PRESENT bit for invalid mfn

Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
---
 arch/x86/xen/mmu_pv.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 38971c6dcd4b..92a6a9af0c65 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -392,28 +392,25 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
 static pteval_t pte_pfn_to_mfn(pteval_t val)
 {
 	if (val & _PAGE_PRESENT) {
 		unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
 		pteval_t flags = val & PTE_FLAGS_MASK;
 		unsigned long mfn;
 
 		mfn = __pfn_to_mfn(pfn);
 
 		/*
-		 * If there's no mfn for the pfn, then just create an
-		 * empty non-present pte.  Unfortunately this loses
-		 * information about the original pfn, so
-		 * pte_mfn_to_pfn is asymmetric.
+		 * If there's no mfn for the pfn, then just reset present pte bit.
 		 */
 		if (unlikely(mfn == INVALID_P2M_ENTRY)) {
-			mfn = 0;
-			flags = 0;
+			mfn = pfn;
+			flags &= ~_PAGE_PRESENT;
 		} else
 			mfn &= ~(FOREIGN_FRAME_BIT | IDENTITY_FRAME_BIT);
 		val = ((pteval_t)mfn << PAGE_SHIFT) | flags;
 	}
 
 	return val;
 }
 
 __visible pteval_t xen_pte_val(pte_t pte)
 {
On 30.04.25 18:00, Petr Vaněk wrote:
On Wed, Apr 30, 2025 at 04:37:21PM +0200, David Hildenbrand wrote:
On 30.04.25 13:52, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 08:56:03PM +0200, David Hildenbrand wrote:
On 29.04.25 20:33, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 05:45:53PM +0200, David Hildenbrand wrote:
On 29.04.25 16:52, David Hildenbrand wrote:
> On 29.04.25 16:45, Petr Vaněk wrote:
>> On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
>>> On 29.04.25 16:22, Petr Vaněk wrote:
>>>> folio_pte_batch() could overcount the number of contiguous PTEs when
>>>> pte_advance_pfn() returns a zero-valued PTE and the following PTE in
>>>> memory also happens to be zero. The loop doesn't break in such a case
>>>> because pte_same() returns true, and the batch size is advanced by one
>>>> more than it should be.
>>>>
>>>> To fix this, bail out early if a non-present PTE is encountered,
>>>> preventing the invalid comparison.
>>>>
>>>> This issue started to appear after commit 10ebac4f95e7 ("mm/memory:
>>>> optimize unmap/zap with PTE-mapped THP") and was discovered via git
>>>> bisect.
>>>>
>>>> Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
>>>> Cc: stable@vger.kernel.org
>>>> Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
>>>> ---
>>>> mm/internal.h | 2 ++
>>>> 1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index e9695baa5922..c181fe2bac9d 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>> 		dirty = !!pte_dirty(pte);
>>>> 		pte = __pte_batch_clear_ignored(pte, flags);
>>>>
>>>> +		if (!pte_present(pte))
>>>> +			break;
>>>> 		if (!pte_same(pte, expected_pte))
>>>> 			break;
>>>
>>> How could pte_same() suddenly match on a present and non-present PTE.
>>
>> In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well.
>> pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
>
> Observe that folio_pte_batch() was called *with a present pte*.
>
> do_zap_pte_range()
>   if (pte_present(ptent))
>     zap_present_ptes()
>       folio_pte_batch()
>
> How can we end up with an expected_pte that is !present, if it is based
> on the provided pte that *is present* and we only used pte_advance_pfn()
> to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The issue is not happening in __pte_batch_clear_ignored but later in following line:
expected_pte = pte_advance_pfn(expected_pte, nr);
The issue seems to be in __pte function which converts PTE value to pte_t in pte_advance_pfn, because warnings disappears when I change the line to
expected_pte = (pte_t){ .pte = pte_val(expected_pte) + (nr << PFN_PTE_SHIFT) };
The kernel probably uses __pte function from arch/x86/include/asm/paravirt.h because it is configured with CONFIG_PARAVIRT=y:
static inline pte_t __pte(pteval_t val)
{
	return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val,
					  "mov %%rdi, %%rax", ALT_NOT_XEN) };
}
I guess it might cause this weird magic, but I need more time to understand what it does :)
I understand it slightly more. __pte() uses xen_make_pte(), which calls pte_pfn_to_mfn(), however, mfn for this pfn contains INVALID_P2M_ENTRY value, therefore the pte_pfn_to_mfn() returns 0, see [1].
I guess that the mfn was invalidated by xen-balloon driver?
[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/...
What XEN does with basic primitives that convert between pteval and pte_t is beyond horrible.
How come set_ptes() that uses pte_next_pfn()->pte_advance_pfn() does not run into this?
I don't know, but I guess it is somehow related to pfn->mfn translation.
Is it only a problem if we exceed a certain pfn?
No, it is a problem whenever the mfn corresponding to a given pfn is invalid.
I am not sure if my original patch is a good fix.
No :)
Maybe it would be
better to have some sort of native_pte_advance_pfn() which will use native_make_pte() rather than __pte(). Or do you think the issue is in Xen part?
I think what's happening is that -- under XEN only -- we might get garbage when calling pte_advance_pfn() and the next PFN would no longer fall into the folio. And the current code cannot deal with that XEN garbage.
But still not 100% sure.
The following is completely untested, could you give that a try?
Yes, it solves the issue for me.
Cool!
However, maybe it would be better to solve it with the following patch. pte_pfn_to_mfn() actually returns the same value for non-present PTEs. I suggest returning the original PTE when the mfn is INVALID_P2M_ENTRY, rather than an empty non-present PTE, but with the _PAGE_PRESENT bit cleared. Thus we do not lose the information about the original pfn, yet it is still clear that the page is not present.
From e84781f9ec4fb7275d5e7629cf7e222466caf759 Mon Sep 17 00:00:00 2001
From: Petr Vaněk <arkamar@atlas.cz>
Date: Wed, 30 Apr 2025 17:08:41 +0200
Subject: [PATCH] x86/mm: Reset pte _PAGE_PRESENT bit for invalid mfn
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
---
 arch/x86/xen/mmu_pv.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 38971c6dcd4b..92a6a9af0c65 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -392,28 +392,25 @@ static pteval_t pte_mfn_to_pfn(pteval_t val)
 static pteval_t pte_pfn_to_mfn(pteval_t val)
 {
 	if (val & _PAGE_PRESENT) {
 		unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
 		pteval_t flags = val & PTE_FLAGS_MASK;
 		unsigned long mfn;

 		mfn = __pfn_to_mfn(pfn);

 		/*
-		 * If there's no mfn for the pfn, then just create an
-		 * empty non-present pte. Unfortunately this loses
-		 * information about the original pfn, so
-		 * pte_mfn_to_pfn is asymmetric.
+		 * If there's no mfn for the pfn, then just reset present pte bit.
 		 */
 		if (unlikely(mfn == INVALID_P2M_ENTRY)) {
-			mfn = 0;
-			flags = 0;
+			mfn = pfn;
+			flags &= ~_PAGE_PRESENT;
 		} else
 			mfn &= ~(FOREIGN_FRAME_BIT | IDENTITY_FRAME_BIT);

 		val = ((pteval_t)mfn << PAGE_SHIFT) | flags;
 	}

 	return val;
 }

 __visible pteval_t xen_pte_val(pte_t pte)
 {
That might do as well.
I assume the following would also work? (removing the early ==1 check)
It has the general benefit of removing the pte_pfn() call from the loop body, which is why I like that fix. (almost looks like a cleanup)
From 75948778b586d4759a480bf412fd4682067b12ea Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Wed, 30 Apr 2025 16:35:12 +0200
Subject: [PATCH] tmp
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index e9695baa59226..25a29872c634b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -248,11 +248,9 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
 		bool *any_writable, bool *any_young, bool *any_dirty)
 {
-	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
-	const pte_t *end_ptep = start_ptep + max_nr;
 	pte_t expected_pte, *ptep;
 	bool writable, young, dirty;
-	int nr;
+	int nr, cur_nr;

 	if (any_writable)
 		*any_writable = false;
@@ -265,11 +263,15 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
 	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);

+	/* Limit max_nr to the actual remaining PFNs in the folio we could batch. */
+	max_nr = min_t(unsigned long, max_nr,
+		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
+
 	nr = pte_batch_hint(start_ptep, pte);
 	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
 	ptep = start_ptep + nr;

-	while (ptep < end_ptep) {
+	while (nr < max_nr) {
 		pte = ptep_get(ptep);
 		if (any_writable)
 			writable = !!pte_write(pte);
@@ -282,14 +284,6 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		if (!pte_same(pte, expected_pte))
 			break;

-		/*
-		 * Stop immediately once we reached the end of the folio. In
-		 * corner cases the next PFN might fall into a different
-		 * folio.
-		 */
-		if (pte_pfn(pte) >= folio_end_pfn)
-			break;
-
 		if (any_writable)
 			*any_writable |= writable;
 		if (any_young)
@@ -297,12 +291,13 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		if (any_dirty)
 			*any_dirty |= dirty;

-		nr = pte_batch_hint(ptep, pte);
-		expected_pte = pte_advance_pfn(expected_pte, nr);
-		ptep += nr;
+		cur_nr = pte_batch_hint(ptep, pte);
+		expected_pte = pte_advance_pfn(expected_pte, cur_nr);
+		ptep += cur_nr;
+		nr += cur_nr;
 	}

-	return min(ptep - start_ptep, max_nr);
+	return min(nr, max_nr);
 }

 /**
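A quick worked example of the new clamp, with assumed numbers: if the folio
spans pfns 0x200..0x3ff (512 pages), the first pte maps pfn 0x3fe, and the
caller passed max_nr = 16, then

	max_nr = min_t(unsigned long, max_nr,
		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));

clamps max_nr to 0x200 + 512 - 0x3fe = 2, so the loop can never walk past the
end of the folio, no matter what pte_advance_pfn() produced for expected_pte.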
On Wed, Apr 30, 2025 at 11:25:56PM +0200, David Hildenbrand wrote:
> I assume the following would also work? (removing the early ==1 check)
Yes, it also works in my case and the removal makes sense to me.
> It has the general benefit of removing the pte_pfn() call from the loop
> body, which is why I like that fix. (almost looks like a cleanup)
Indeed, it looks like a cleanup to me as well :)
I am still considering whether it would make sense to send both patches; I am not sure if resetting the _PAGE_PRESENT flag is enough, because of swapping or other areas I am not aware of.
On 01.05.25 09:45, Petr Vaněk wrote:
> Indeed, it looks like a cleanup to me as well :)
Okay, let me polish the patch up and send it out later.
> I am still considering whether it would make sense to send both patches; I am
> not sure if resetting the _PAGE_PRESENT flag is enough, because of swapping
> or other areas I am not aware of.
The problem I'm having with the pte_mfn_to_pfn() updates is that I don't understand what the expected output is even supposed to be. You should probably ask the Xen folks (I realized that they are not CCed).
On 02.05.25 09:37, David Hildenbrand wrote:
On 01.05.25 09:45, Petr Vaněk wrote:
On Wed, Apr 30, 2025 at 11:25:56PM +0200, David Hildenbrand wrote:
On 30.04.25 18:00, Petr Vaněk wrote:
On Wed, Apr 30, 2025 at 04:37:21PM +0200, David Hildenbrand wrote:
On 30.04.25 13:52, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 08:56:03PM +0200, David Hildenbrand wrote: > On 29.04.25 20:33, Petr Vaněk wrote: >> On Tue, Apr 29, 2025 at 05:45:53PM +0200, David Hildenbrand wrote: >>> On 29.04.25 16:52, David Hildenbrand wrote: >>>> On 29.04.25 16:45, Petr Vaněk wrote: >>>>> On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote: >>>>>> On 29.04.25 16:22, Petr Vaněk wrote: >>>>>>> folio_pte_batch() could overcount the number of contiguous PTEs when >>>>>>> pte_advance_pfn() returns a zero-valued PTE and the following PTE in >>>>>>> memory also happens to be zero. The loop doesn't break in such a case >>>>>>> because pte_same() returns true, and the batch size is advanced by one >>>>>>> more than it should be. >>>>>>> >>>>>>> To fix this, bail out early if a non-present PTE is encountered, >>>>>>> preventing the invalid comparison. >>>>>>> >>>>>>> This issue started to appear after commit 10ebac4f95e7 ("mm/memory: >>>>>>> optimize unmap/zap with PTE-mapped THP") and was discovered via git >>>>>>> bisect. >>>>>>> >>>>>>> Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") >>>>>>> Cc: stable@vger.kernel.org >>>>>>> Signed-off-by: Petr Vaněk arkamar@atlas.cz >>>>>>> --- >>>>>>> mm/internal.h | 2 ++ >>>>>>> 1 file changed, 2 insertions(+) >>>>>>> >>>>>>> diff --git a/mm/internal.h b/mm/internal.h >>>>>>> index e9695baa5922..c181fe2bac9d 100644 >>>>>>> --- a/mm/internal.h >>>>>>> +++ b/mm/internal.h >>>>>>> @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, >>>>>>> dirty = !!pte_dirty(pte); >>>>>>> pte = __pte_batch_clear_ignored(pte, flags); >>>>>>> >>>>>>> + if (!pte_present(pte)) >>>>>>> + break; >>>>>>> if (!pte_same(pte, expected_pte)) >>>>>>> break; >>>>>> >>>>>> How could pte_same() suddenly match on a present and non-present PTE. >>>>> >>>>> In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. >>>>> pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs. >>>> >>>> Observe that folio_pte_batch() was called *with a present pte*. >>>> >>>> do_zap_pte_range() >>>> if (pte_present(ptent)) >>>> zap_present_ptes() >>>> folio_pte_batch() >>>> >>>> How can we end up with an expected_pte that is !present, if it is based >>>> on the provided pte that *is present* and we only used pte_advance_pfn() >>>> to advance the pfn? >>> >>> I've been staring at the code for too long and don't see the issue. >>> >>> We even have >>> >>> VM_WARN_ON_FOLIO(!pte_present(pte), folio); >>> >>> So the initial pteval we got is present. >>> >>> I don't see how >>> >>> nr = pte_batch_hint(start_ptep, pte); >>> expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags); >>> >>> would suddenly result in !pte_present(expected_pte). 
>> >> The issue is not happening in __pte_batch_clear_ignored but later in >> following line: >> >> expected_pte = pte_advance_pfn(expected_pte, nr); >> >> The issue seems to be in __pte function which converts PTE value to >> pte_t in pte_advance_pfn, because warnings disappears when I change the >> line to >> >> expected_pte = (pte_t){ .pte = pte_val(expected_pte) + (nr << PFN_PTE_SHIFT) }; >> >> The kernel probably uses __pte function from >> arch/x86/include/asm/paravirt.h because it is configured with >> CONFIG_PARAVIRT=y: >> >> static inline pte_t __pte(pteval_t val) >> { >> return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val, >> "mov %%rdi, %%rax", ALT_NOT_XEN) }; >> } >> >> I guess it might cause this weird magic, but I need more time to >> understand what it does :)
I understand it slightly more. __pte() uses xen_make_pte(), which calls pte_pfn_to_mfn(), however, mfn for this pfn contains INVALID_P2M_ENTRY value, therefore the pte_pfn_to_mfn() returns 0, see [1].
I guess that the mfn was invalidated by xen-balloon driver?
[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/...
> What XEN does with basic primitives that convert between pteval and > pte_t is beyond horrible. > > How come set_ptes() that uses pte_next_pfn()->pte_advance_pfn() does not > run into this?
I don't know, but I guess it is somehow related to pfn->mfn translation.
> Is it only a problem if we exceed a certain pfn?
No, it is a problem if the corresponding mft to given pfn is invalid.
I am not sure if my original patch is a good fix.
No :)
Maybe it would be
better to have some sort of native_pte_advance_pfn() which will use native_make_pte() rather than __pte(). Or do you think the issue is in Xen part?
I think what's happening is that -- under XEN only -- we might get garbage when calling pte_advance_pfn() and the next PFN would no longer fall into the folio. And the current code cannot deal with that XEN garbage.
But still not 100% sure.
The following is completely untested, could you give that a try?
Yes, it solves the issue for me.
Cool!
However, maybe it would be better to solve it with the following patch. The pte_pfn_to_mfn() actually returns the same value for non-present PTEs. I suggest to return original PTE if the mfn is INVALID_P2M_ENTRY, rather than empty non-present PTE, but the _PAGE_PRESENT bit will be set to zero. Thus, we will not loose information about original pfn but it will be clear that the page is not present.
From e84781f9ec4fb7275d5e7629cf7e222466caf759 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Petr=20Van=C4=9Bk?= arkamar@atlas.cz Date: Wed, 30 Apr 2025 17:08:41 +0200 Subject: [PATCH] x86/mm: Reset pte _PAGE_PRESENT bit for invalid mft MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Signed-off-by: Petr Vaněk arkamar@atlas.cz
arch/x86/xen/mmu_pv.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 38971c6dcd4b..92a6a9af0c65 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -392,28 +392,25 @@ static pteval_t pte_mfn_to_pfn(pteval_t val) static pteval_t pte_pfn_to_mfn(pteval_t val) { if (val & _PAGE_PRESENT) { unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT; pteval_t flags = val & PTE_FLAGS_MASK; unsigned long mfn; mfn = __pfn_to_mfn(pfn); /*
* If there's no mfn for the pfn, then just create an
* empty non-present pte. Unfortunately this loses
* information about the original pfn, so
* pte_mfn_to_pfn is asymmetric.
* If there's no mfn for the pfn, then just reset present pte bit. */ if (unlikely(mfn == INVALID_P2M_ENTRY)) {
mfn = 0;
flags = 0;
mfn = pfn;
} __visible pteval_t xen_pte_val(pte_t pte) {flags &= ~_PAGE_PRESENT; } else mfn &= ~(FOREIGN_FRAME_BIT | IDENTITY_FRAME_BIT); val = ((pteval_t)mfn << PAGE_SHIFT) | flags; } return val;
That might do as well.
I assume the following would also work? (removing the early ==1 check)
Yes, it also works in my case and the removal makes sense to me.
It has the general benefit of removing the pte_pfn() call from the loop body, which is why I like that fix. (almost looks like a cleanup)
Indeed, it looks like a cleanup to me as well :)
> Okay, let me polish the patch up and send it out later.
Actually, can you polish it up and send it out? I think we need a better description of how exactly that problem happens, in particular involving pte_none(). So stuff from the cover letter should probably be living in here.
Feel free to add my

Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

and stay first author.
Thanks!