On 30.04.25 13:52, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 08:56:03PM +0200, David Hildenbrand wrote:
On 29.04.25 20:33, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 05:45:53PM +0200, David Hildenbrand wrote:
On 29.04.25 16:52, David Hildenbrand wrote:
On 29.04.25 16:45, Petr Vaněk wrote:
On Tue, Apr 29, 2025 at 04:29:30PM +0200, David Hildenbrand wrote:
> On 29.04.25 16:22, Petr Vaněk wrote:
>> folio_pte_batch() could overcount the number of contiguous PTEs when
>> pte_advance_pfn() returns a zero-valued PTE and the following PTE in
>> memory also happens to be zero. The loop doesn't break in such a case
>> because pte_same() returns true, and the batch size is advanced by one
>> more than it should be.
>>
>> To fix this, bail out early if a non-present PTE is encountered,
>> preventing the invalid comparison.
>>
>> This issue started to appear after commit 10ebac4f95e7 ("mm/memory:
>> optimize unmap/zap with PTE-mapped THP") and was discovered via git
>> bisect.
>>
>> Fixes: 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Petr Vaněk <arkamar@atlas.cz>
>> ---
>>  mm/internal.h | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index e9695baa5922..c181fe2bac9d 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -279,6 +279,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>  		dirty = !!pte_dirty(pte);
>>  		pte = __pte_batch_clear_ignored(pte, flags);
>>
>> +		if (!pte_present(pte))
>> +			break;
>>  		if (!pte_same(pte, expected_pte))
>>  			break;
>
> How could pte_same() suddenly match on a present and non-present PTE.
In the problematic case pte.pte == 0 and expected_pte.pte == 0 as well. pte_same() returns a.pte == b.pte -> 0 == 0. Both are non-present PTEs.
Observe that folio_pte_batch() was called *with a present pte*.
do_zap_pte_range()
  if (pte_present(ptent))
    zap_present_ptes()
      folio_pte_batch()
How can we end up with an expected_pte that is !present, if it is based on the provided pte that *is present* and we only used pte_advance_pfn() to advance the pfn?
I've been staring at the code for too long and don't see the issue.
We even have
VM_WARN_ON_FOLIO(!pte_present(pte), folio);
So the initial pteval we got is present.
I don't see how
	nr = pte_batch_hint(start_ptep, pte);
	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
would suddenly result in !pte_present(expected_pte).
The issue is not happening in __pte_batch_clear_ignored() but later, in the following line:
expected_pte = pte_advance_pfn(expected_pte, nr);
The issue seems to be in the __pte() function, which converts a PTE value to pte_t in pte_advance_pfn(), because the warnings disappear when I change the line to
expected_pte = (pte_t){ .pte = pte_val(expected_pte) + (nr << PFN_PTE_SHIFT) };
The kernel probably uses the __pte() function from arch/x86/include/asm/paravirt.h because it is configured with CONFIG_PARAVIRT=y:
static inline pte_t __pte(pteval_t val)
{
	return (pte_t) { PVOP_ALT_CALLEE1(pteval_t, mmu.make_pte, val,
					  "mov %%rdi, %%rax", ALT_NOT_XEN) };
}
I guess it might cause this weird magic, but I need more time to understand what it does :)
I understand it slightly better now. __pte() uses xen_make_pte(), which calls pte_pfn_to_mfn(); however, the mfn for this pfn contains the INVALID_P2M_ENTRY value, therefore pte_pfn_to_mfn() returns 0, see [1].
I guess that the mfn was invalidated by xen-balloon driver?
[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/...
What XEN does with basic primitives that convert between pteval and pte_t is beyond horrible.
How come set_ptes() that uses pte_next_pfn()->pte_advance_pfn() does not run into this?
I don't know, but I guess it is somehow related to pfn->mfn translation.
Is it only a problem if we exceed a certain pfn?
No, it is a problem whenever the corresponding mfn for the given pfn is invalid.
I am not sure if my original patch is a good fix.
No :)
Maybe it would be better to have some sort of native_pte_advance_pfn() which would use native_make_pte() rather than __pte(). Or do you think the issue is in the Xen part?
I think what's happening is that -- under XEN only -- we might get garbage when calling pte_advance_pfn() and the next PFN would no longer fall into the folio. And the current code cannot deal with that XEN garbage.
But still not 100% sure.
The following is completely untested, could you give that a try? I might find some time this evening to test myself and try to further improve it.
From 7d4149a5ea18cba6a694946e59efa9f51d793a4e Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Wed, 30 Apr 2025 16:35:12 +0200
Subject: [PATCH] tmp

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index e9695baa59226..a9ea7f62486ec 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -248,11 +248,9 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
 		bool *any_writable, bool *any_young, bool *any_dirty)
 {
-	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
-	const pte_t *end_ptep = start_ptep + max_nr;
 	pte_t expected_pte, *ptep;
 	bool writable, young, dirty;
-	int nr;
+	int nr, cur_nr;

 	if (any_writable)
 		*any_writable = false;
@@ -265,11 +263,17 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 	VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio);
 	VM_WARN_ON_FOLIO(page_folio(pfn_to_page(pte_pfn(pte))) != folio, folio);

+	/* Limit max_nr to the actual remaining PFNs in the folio. */
+	max_nr = min_t(unsigned long, max_nr,
+		       folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte));
+	if (unlikely(max_nr == 1))
+		return 1;
+
 	nr = pte_batch_hint(start_ptep, pte);
 	expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
 	ptep = start_ptep + nr;

-	while (ptep < end_ptep) {
+	while (nr < max_nr) {
 		pte = ptep_get(ptep);
 		if (any_writable)
 			writable = !!pte_write(pte);
@@ -282,14 +286,6 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		if (!pte_same(pte, expected_pte))
 			break;

-		/*
-		 * Stop immediately once we reached the end of the folio. In
-		 * corner cases the next PFN might fall into a different
-		 * folio.
-		 */
-		if (pte_pfn(pte) >= folio_end_pfn)
-			break;
-
 		if (any_writable)
 			*any_writable |= writable;
 		if (any_young)
@@ -297,12 +293,13 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		if (any_dirty)
 			*any_dirty |= dirty;

-		nr = pte_batch_hint(ptep, pte);
-		expected_pte = pte_advance_pfn(expected_pte, nr);
-		ptep += nr;
+		cur_nr = pte_batch_hint(ptep, pte);
+		expected_pte = pte_advance_pfn(expected_pte, cur_nr);
+		ptep += cur_nr;
+		nr += cur_nr;
 	}

-	return min(ptep - start_ptep, max_nr);
+	return min(nr, max_nr);
 }

 /**
/**