On 29.04.25 20:33, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 05:45:53PM +0200, David Hildenbrand wrote:
On 29.04.25 16:52, David Hildenbrand wrote:
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
folio_pte_batch() could overcount the number of contiguous PTEs when pte_advance_pfn() returns a zero-valued PTE and the following PTE in memory also happens to be zero. The loop doesn't break in such a case because pte_same() returns true, and the batch size is advanced by one more than it should be.
To fix this, bail out early if a non-present PTE is encountered, preventing the invalid comparison.
This issue started to appear after commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP") and was discovered via git bisect.
Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
Cc: stable@vger.kernel.org
Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
---
 mm/internal.h | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/mm/internal.h b/mm/internal.h
index e9695baa5922..c181fe2bac9d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 			dirty = !!pte_dirty(pte);
 		pte = __pte_batch_clear_ignored(pte, flags);

+		if (!pte_present(pte))
+			break;
 		if (!pte_same(pte, expected_pte))
 			break;
How could pte_same() suddenly match on a present and a non-present PTE?
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range()
	if (pte_present(ptent))
		zap_present_ptes()
			folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The issue does not happen in __pte_batch_clear_ignored() but later, in the following line:
expected_pte = pte_advance_pfn(expected_pte, nr);
The issue seems to be in the __pte() function, which converts a PTE value to pte_t in pte_advance_pfn(), because the warnings disappear when I change the line to
expected_pte = (pte_t){ .pte = pte_val(expected_pte) + (nr << PFN_PTE_SHIFT) };
The kernel probably uses the __pte() function from arch/x86/include/asm/paravirt.h because it is configured with CONFIG_PARAVIRT=y:
static inline pte_t __pte(pteval_t val)
{
	return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val,
					  "mov %%rdi, %%rax", ALT_NOT_XEN) };
}
I guess it might cause this weird magic, but I need more time to understand what it does :)
What XEN does with basic primitives that convert between pteval and pte_t is beyond horrible.
How come set_ptes() that uses pte_next_pfn()->pte_advance_pfn() does not run into this?
Is it only a problem if we exceed a certain pfn?