Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages

30 Aug 2018

On 08/29/2018 02:11 PM, Jerome Glisse wrote:
...
On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
...
On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
...
On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
[...]
...
...
What would be the best mmu notifier interface to use where there are no
start/end calls?
Or, is the best solution to add the start/end calls as is done in later
versions of the code?  If that is the suggestion, has there been any change
in invalidate start/end semantics that we should take into account?
start/end would be the one to add, 4.4 seems broken in respect to THP
and mmu notification. Another solution is to fix user of mmu notifier,
they were only a handful back then. For instance properly adjust the
address to match first address covered by pmd or pud and passing down
correct page size to mmu_notifier_invalidate_page() would allow to fix
this easily.
This is ok because user of try_to_unmap_one() replace the pte/pmd/pud
with an invalid one (either poison, migration or swap) inside the
function. So anyone racing would synchronize on those special entry
hence why it is fine to delay mmu_notifier_invalidate_page() to after
dropping the page table lock.
Adding start/end might the solution with less code churn as you would
only need to change try_to_unmap_one().
What about dependencies? 369ea8242c0fb sounds like it needs work for all
notifiers need to be updated as well.
This commit remove mmu_notifier_invalidate_page() hence why everything
need to be updated. But in 4.4 you can get away with just adding start/
end and keep around mmu_notifier_invalidate_page() to minimize disruption.
So the new semantic in 369ea8242c0fb is that all page table changes are
bracketed with mmu notifier start/end calls and invalidate_range right
after tlb flush. This simplify thing and make it more reliable for mmu
notifier users like IOMMU or ODP or GPUs drivers.
Here is what I came up with by adding the start/end calls to the 4.4 version
of try_to_unmap_one.  Note that this assumes/uses the new routine
adjust_range_if_pmd_sharing_possible to adjust the notifier/flush range if
huge pmd sharing is possible.  I changed the mmu_notifier_invalidate_page
to a mmu_notifier_invalidate_range, but am not sure if that needs to happen
earlier in the routine (like right after tlb flush as you said above).
Does this look reasonable?

diff --git a/mm/rmap.c b/mm/rmap.c
index b577fbb98d4b..7ba8bfeddb4b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1302,11 +1302,30 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
    pte_t pteval;
    spinlock_t *ptl;
    int ret = SWAP_AGAIN;
+	unsigned long start = address, end;
    enum ttu_flags flags = (enum ttu_flags)arg;
/* munlock has nothing to gain from examining un-locked vmas */
    if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
-		goto out;
+		return ret;
+
+	/*
+	 * For THP, we have to assume the worse case ie pmd for invalidation.
+	 * For hugetlb, it could be much worse if we need to do pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page can not be free in this function as call of
+	 * try_to_unmap() must hold a reference on the page.
+	 */
+	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
+	}
+	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
pte = page_check_address(page, mm, address, &ptl, 0);
    if (!pte)
@@ -1334,6 +1353,29 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
    	}
    }
+	if (PageHuge(page) && huge_pmd_unshare(mm, &address, pte)) {
+		/*
+		 * huge_pmd_unshare unmapped an entire PMD page.  There is
+		 * no way of knowing exactly which PMDs may be cached for
+		 * this mm, so flush them all.  start/end were already
+		 * adjusted to cover this range.
+		 */
+		flush_cache_range(vma, start, end);
+		flush_tlb_range(vma, start, end);
+
+		/*
+		 * The ref count of the PMD page was dropped which is part
+		 * of the way map counting is done for shared PMDs.  When
+		 * there is no other sharing, huge_pmd_unshare returns false
+		 * and we will unmap the actual page and drop map count
+		 * to zero.
+		 *
+		 * Note that huge_pmd_unshare modified address and is likely
+		 * not what you would expect.
+		 */
+		goto out_unmap;
+	}
+
    /* Nuke the page table entry. */
    flush_cache_page(vma, address, page_to_pfn(page));
    if (should_defer_flush(mm, flags)) {
@@ -1424,10 +1466,11 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
    page_cache_release(page);
out_unmap:
-	pte_unmap_unlock(pte, ptl);
    if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_range(mm, start, end);
+	pte_unmap_unlock(pte, ptl);
 out:
+	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
    return ret;
 }
-- 
Mike Kravetz

    

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

Re: [PATCH v6 1/2] mm: migration: fix migration of huge PMD shared pages