On Tue, Apr 29, 2025 at 08:56:03PM +0200, David Hildenbrand wrote:
On 29.04.25 20:33, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 05:45:53PM +0200, David Hildenbrand wrote:
On 29.04.25 16:52, David Hildenbrand wrote:
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
On 29.04.25 16:22, Petr Vaněk wrote:
> folio_pte_batch() could overcount the number of contiguous PTEs when
> pte_advance_pfn() returns a zero-valued PTE and the following PTE in
> memory also happens to be zero. The loop doesn't break in such a case
> because pte_same() returns true, and the batch size is advanced by one
> more than it should be.
>
> To fix this, bail out early if a non-present PTE is encountered,
> preventing the invalid comparison.
>
> This issue started to appear after commit 10ebac4f95e7 ("mm/memory:
> optimize unmap/zap with PTE-mapped THP") and was discovered via git
> bisect.
>
> Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
> Cc: stable@vger.kernel.org
> Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
> ---
>  mm/internal.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index e9695baa5922..c181fe2bac9d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>  			dirty = !!pte_dirty(pte);
>  		pte = __pte_batch_clear_ignored(pte, flags);
>
> +		if (!pte_present(pte))
> +			break;
>  		if (!pte_same(pte, expected_pte))
>  			break;
How could pte_same() suddenly match on a present and a non-present PTE?
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
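The overcount described above can be reproduced in a small userspace model. This is only a sketch under stated assumptions: the toy_* names, the bit layout, and the single "invalid" pfn are all hypothetical stand-ins rather than kernel definitions, and the loop omits the folio-end pfn check and the ignored-bits masking that the real folio_pte_batch() performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins (hypothetical), not the kernel's pte_t or bit layout. */
typedef struct { uint64_t pte; } toy_pte_t;

#define TOY_PAGE_PRESENT 0x1ULL
#define TOY_PFN_SHIFT    12
/* Assumption: the one pfn whose translation fails in this model. */
#define TOY_INVALID_PFN  102ULL

typedef toy_pte_t (*toy_make_pte_fn)(uint64_t val);

/* Native-style conversion: the value is stored unchanged. */
static toy_pte_t toy_native_make_pte(uint64_t val)
{
	return (toy_pte_t){ .pte = val };
}

/* Xen-like conversion: a present value with an untranslatable pfn becomes 0. */
static toy_pte_t toy_xen_like_make_pte(uint64_t val)
{
	if ((val & TOY_PAGE_PRESENT) &&
	    (val >> TOY_PFN_SHIFT) == TOY_INVALID_PFN)
		val = 0;
	return (toy_pte_t){ .pte = val };
}

static bool toy_pte_present(toy_pte_t p)
{
	return (p.pte & TOY_PAGE_PRESENT) != 0;
}

static bool toy_pte_same(toy_pte_t a, toy_pte_t b)
{
	return a.pte == b.pte;
}

static toy_pte_t toy_pte_advance_pfn(toy_pte_t pte, unsigned int nr,
				     toy_make_pte_fn make_pte)
{
	return make_pte(pte.pte + ((uint64_t)nr << TOY_PFN_SHIFT));
}

/* Simplified batch loop; check_present models the proposed early bail-out. */
static size_t toy_pte_batch(const toy_pte_t *ptes, size_t max_nr,
			    toy_make_pte_fn make_pte, bool check_present)
{
	toy_pte_t expected;
	size_t i = 1;

	/* The caller only batches starting from a present PTE. */
	assert(max_nr >= 1 && toy_pte_present(ptes[0]));

	expected = toy_pte_advance_pfn(ptes[0], 1, make_pte);
	while (i < max_nr) {
		toy_pte_t pte = ptes[i];

		if (check_present && !toy_pte_present(pte))
			break;
		if (!toy_pte_same(pte, expected))
			break;
		expected = toy_pte_advance_pfn(expected, 1, make_pte);
		i++;
	}
	return i;
}
```

With two present PTEs for pfns 100 and 101 followed by two zero PTEs, the native-style conversion yields a batch of 2, while the Xen-like conversion yields 3 unless the present check is enabled.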
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range()
  if (pte_present(ptent))
    zap_present_ptes()
      folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
	nr = pte_batch_hint(start_ptep, pte);
	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The issue does not happen in __pte_batch_clear_ignored() but later, in the following line:
expected_pte = pte_advance_pfn(expected_pte, nr);
The issue seems to be in the __pte() function, which converts the PTE value to pte_t in pte_advance_pfn(), because the warnings disappear when I change the line to
expected_pte = (pte_t){ .pte = pte_val(expected_pte) + (nr << PFN_PTE_SHIFT) };
The kernel probably uses the __pte() function from arch/x86/include/asm/paravirt.h because it is configured with CONFIG_PARAVIRT=y:

	static inline pte_t __pte(pteval_t val)
	{
		return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val,
						  "mov %%rdi, %%rax", ALT_NOT_XEN) };
	}
I guess it might cause this weird magic, but I need more time to understand what it does :)
I understand it slightly better now. __pte() uses xen_make_pte(), which calls pte_pfn_to_mfn(). However, the mfn for this pfn contains the INVALID_P2M_ENTRY value, so pte_pfn_to_mfn() returns 0, see [1].
I guess that the mfn was invalidated by the xen-balloon driver?
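The translation step described here can be modelled roughly as follows. This is a hedged userspace sketch, not the Xen code: the toy_* names, the four-entry p2m table, and the bit layout are hypothetical, chosen only to show how a present PTE value whose pfn has no valid mfn collapses to 0 while a non-present value passes through unchanged.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-ins (hypothetical), not the kernel's pteval_t or p2m. */
typedef uint64_t toy_pteval_t;

#define TOY_PAGE_PRESENT 0x1ULL
#define TOY_PAGE_SHIFT   12
#define TOY_FLAGS_MASK   ((1ULL << TOY_PAGE_SHIFT) - 1)
#define TOY_INVALID_P2M  (~0ULL)
#define TOY_NR_PFNS      4

/* Assumption: pfn 2 has lost its machine frame (e.g. it was ballooned out). */
static const uint64_t toy_p2m[TOY_NR_PFNS] = {
	0x500, 0x501, TOY_INVALID_P2M, 0x503,
};

static uint64_t toy_pfn_to_mfn(uint64_t pfn)
{
	return pfn < TOY_NR_PFNS ? toy_p2m[pfn] : TOY_INVALID_P2M;
}

/* Model of the pfn->mfn rewrite applied when building a PTE. */
static toy_pteval_t toy_pte_pfn_to_mfn(toy_pteval_t val)
{
	if (val & TOY_PAGE_PRESENT) {
		uint64_t pfn = val >> TOY_PAGE_SHIFT;
		toy_pteval_t flags = val & TOY_FLAGS_MASK;
		uint64_t mfn = toy_pfn_to_mfn(pfn);

		/* No mfn for this pfn: produce an empty, non-present value. */
		if (mfn == TOY_INVALID_P2M) {
			mfn = 0;
			flags = 0;
		}
		val = (mfn << TOY_PAGE_SHIFT) | flags;
	}
	return val;
}

/* Usage: advancing a present PTE from pfn 1 to pfn 2 hits the hole. */
static void toy_p2m_demo(void)
{
	toy_pteval_t pte1 = (1ULL << TOY_PAGE_SHIFT) | TOY_PAGE_PRESENT;
	toy_pteval_t pte2 = pte1 + (1ULL << TOY_PAGE_SHIFT);

	assert(toy_pte_pfn_to_mfn(pte1) == 0x501001ULL); /* valid mfn kept */
	assert(toy_pte_pfn_to_mfn(pte2) == 0);           /* collapses to 0 */
	assert(toy_pte_pfn_to_mfn(0x2000ULL) == 0x2000ULL); /* !present: untouched */
}
```

In this model, the zero result carries no trace of the original pfn, which is why a later comparison against a genuinely empty PTE cannot tell the two apart.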
[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/...
What XEN does with basic primitives that convert between pteval and pte_t is beyond horrible.
How come set_ptes() that uses pte_next_pfn()->pte_advance_pfn() does not run into this?
I don't know, but I guess it is somehow related to pfn->mfn translation.
Is it only a problem if we exceed a certain pfn?
No, it is a problem if the mfn corresponding to the given pfn is invalid.
I am not sure if my original patch is a good fix. Maybe it would be better to have some sort of native_pte_advance_pfn() which would use native_make_pte() rather than __pte(). Or do you think the issue is in the Xen part?
Cheers, Petr