On 6/19/20 4:33 PM, Greg Kroah-Hartman wrote:
From: Andrea Arcangeli aarcange@redhat.com
commit c444eb564fb16645c172d550359cb3d75fe8a040 upstream.
Write protect anon page faults require an accurate mapcount to decide if to break the COW or not. This is implemented in the THP path with reuse_swap_page() -> page_trans_huge_map_swapcount()/page_trans_huge_mapcount().
If the COW triggers while the other processes sharing the page are under a huge pmd split, to do an accurate reading, we must ensure the mapcount isn't computed while it's being transferred from the head page to the tail pages.
reuse_swap_cache() already runs serialized by the page lock, so it's enough to add the page lock around __split_huge_pmd_locked too, in order to add the missing serialization.
Note: the commit in "Fixes" is just to facilitate the backporting, because the code before such commit didn't try to do an accurate THP mapcount calculation and it instead used the page_count() to decide if to COW or not. Both the page_count and the pin_count are THP-wide refcounts, so they're inaccurate if used in reuse_swap_page(). Reverting such commit (besides the unrelated fix to the local anon_vma assignment) would have also opened the window for memory corruption side effects to certain workloads as documented in such commit header.
Signed-off-by: Andrea Arcangeli aarcange@redhat.com Suggested-by: Jann Horn jannh@google.com Reported-by: Jann Horn jannh@google.com Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults") Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds torvalds@linux-foundation.org Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Hi, when evaluating this backport for our 4.12 based kernel, Nicolai found out that Jann's POC still triggers, AFAICS because do_huge_pmd_wp_page() doesn't take the page lock, which was only added by ba3c4ce6def4 ("mm, THP, swap: make reuse_swap_page() works for THP swapped out") in 4.14. The upstream stable 4.9 is thus in the same situation (didn't actually test the POC there, but should be obvious), so this is a heads up.
Now just backporting ba3c4ce6def4 to 4.9 stable isn't that simple, as that's part of a larger series (maybe with even more prerequisities, didn't check). I'm considering just taking the part of ba3c4ce6def4 that's wrapping page_trans_huge_mapcount() in the page lock (without changing it to reuse_swap_page() and changing the latter to deal with swapped out THP) and will look at it tomorrow. But suggestions (and/or later review) from Andrea/Kirill are welcome.
Thanks, Vlastimil
mm/huge_memory.c | 31 ++++++++++++++++++++++++++++--- 1 file changed, 28 insertions(+), 3 deletions(-)
--- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1755,6 +1755,8 @@ void __split_huge_pmd(struct vm_area_str spinlock_t *ptl; struct mm_struct *mm = vma->vm_mm; unsigned long haddr = address & HPAGE_PMD_MASK;
- bool was_locked = false;
- pmd_t _pmd;
mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE); ptl = pmd_lock(mm, pmd); @@ -1764,11 +1766,32 @@ void __split_huge_pmd(struct vm_area_str * pmd against. Otherwise we can end up replacing wrong page. */ VM_BUG_ON(freeze && !page);
- if (page && page != pmd_page(*pmd))
goto out;
- if (page) {
VM_WARN_ON_ONCE(!PageLocked(page));
was_locked = true;
if (page != pmd_page(*pmd))
goto out;
- }
+repeat: if (pmd_trans_huge(*pmd)) {
page = pmd_page(*pmd);
if (!page) {
page = pmd_page(*pmd);
if (unlikely(!trylock_page(page))) {
get_page(page);
_pmd = *pmd;
spin_unlock(ptl);
lock_page(page);
spin_lock(ptl);
if (unlikely(!pmd_same(*pmd, _pmd))) {
unlock_page(page);
put_page(page);
page = NULL;
goto repeat;
}
put_page(page);
}
if (PageMlocked(page)) clear_page_mlock(page); } else if (!pmd_devmap(*pmd))}
@@ -1776,6 +1799,8 @@ void __split_huge_pmd(struct vm_area_str __split_huge_pmd_locked(vma, pmd, haddr, freeze); out: spin_unlock(ptl);
- if (!was_locked && page)
mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);unlock_page(page);
}