On 08/29/2018 02:11 PM, Jerome Glisse wrote:
On Wed, Aug 29, 2018 at 08:39:06PM +0200, Michal Hocko wrote:
On Wed 29-08-18 14:14:25, Jerome Glisse wrote:
On Wed, Aug 29, 2018 at 10:24:44AM -0700, Mike Kravetz wrote:
[...]
What would be the best mmu notifier interface to use where there are no start/end calls? Or, is the best solution to add the start/end calls as is done in later versions of the code? If that is the suggestion, has there been any change in invalidate start/end semantics that we should take into account?
start/end would be the one to add; 4.4 seems broken with respect to THP and mmu notification. Another solution is to fix the users of the mmu notifier; there were only a handful back then. For instance, properly adjusting the address to match the first address covered by the pmd or pud and passing down the correct page size to mmu_notifier_invalidate_page() would allow fixing this easily.
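The address arithmetic behind that suggestion can be sketched in user space. This is only an illustration, not kernel code: sketch_notify_range(), struct sketch_range and the SKETCH_* constants are hypothetical names, and the shift values assume x86-64 with 4 KiB base pages and 2 MiB pmd mappings.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes: 4 KiB base pages, 2 MiB pmd mappings (x86-64). */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PMD_SHIFT  21u

/* Hypothetical container for what a notifier user would be told. */
struct sketch_range {
	uint64_t start;	/* first address covered by the mapping */
	uint64_t size;	/* number of bytes covered by the mapping */
};

/*
 * Hypothetical helper: round 'address' down to the first address
 * covered by a mapping of (1 << shift) bytes and report that size.
 */
static struct sketch_range sketch_notify_range(uint64_t address,
					       unsigned int shift)
{
	struct sketch_range r;

	assert(shift >= SKETCH_PAGE_SHIFT && shift < 64);
	r.size = (uint64_t)1 << shift;
	r.start = address & ~(r.size - 1);
	return r;
}
```

With these assumptions, an address inside a 2 MiB mapping is reported as the 2 MiB aligned base plus a 2 MiB size, rather than as a single base page.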
This is ok because users of try_to_unmap_one() replace the pte/pmd/pud with an invalid one (either poison, migration or swap) inside the function. So anyone racing would synchronize on those special entries, which is why it is fine to delay mmu_notifier_invalidate_page() until after dropping the page table lock.
Adding start/end might be the solution with the least code churn, as you would only need to change try_to_unmap_one().
What about dependencies? 369ea8242c0fb sounds like it needs all notifiers to be updated as well.
This commit removes mmu_notifier_invalidate_page(), which is why everything needs to be updated. But in 4.4 you can get away with just adding start/end and keeping mmu_notifier_invalidate_page() around to minimize disruption.
So the new semantic in 369ea8242c0fb is that all page table changes are bracketed with mmu notifier start/end calls, with invalidate_range right after the tlb flush. This simplifies things and makes it more reliable for mmu notifier users like IOMMU, ODP or GPU drivers.
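As a rough user-space model of that ordering (not kernel code), the bracketing can be written as a small sequence checker. The enum values and sketch_sequence_ok() are hypothetical names, and the model is deliberately simplified: it assumes a single, non-nested start/end pair and requires invalidate_range to be the very next event after a tlb flush.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event names standing in for the calls discussed above. */
enum sketch_event {
	EV_RANGE_START,		/* invalidate_range_start */
	EV_PTE_CLEAR,		/* a page table change */
	EV_TLB_FLUSH,		/* CPU tlb flush */
	EV_INVALIDATE_RANGE,	/* invalidate_range */
	EV_RANGE_END,		/* invalidate_range_end */
};

/*
 * Return 1 if every page table change sits between start and end and
 * every tlb flush is immediately followed by invalidate_range,
 * 0 otherwise. Nested start/end pairs are rejected in this sketch.
 */
static int sketch_sequence_ok(const enum sketch_event *ev, size_t n)
{
	int in_bracket = 0;	/* inside a start/end pair */
	int flush_pending = 0;	/* tlb flushed, invalidate_range not yet seen */
	size_t i;

	assert(ev != NULL || n == 0);
	for (i = 0; i < n; i++) {
		switch (ev[i]) {
		case EV_RANGE_START:
			if (in_bracket)
				return 0;
			in_bracket = 1;
			break;
		case EV_PTE_CLEAR:
			if (!in_bracket || flush_pending)
				return 0;
			break;
		case EV_TLB_FLUSH:
			if (!in_bracket || flush_pending)
				return 0;
			flush_pending = 1;
			break;
		case EV_INVALIDATE_RANGE:
			if (!flush_pending)
				return 0;
			flush_pending = 0;
			break;
		case EV_RANGE_END:
			if (!in_bracket || flush_pending)
				return 0;
			in_bracket = 0;
			break;
		}
	}
	return !in_bracket && !flush_pending;
}
```

A sequence of start, pte clear, tlb flush, invalidate_range, end passes this check, while a pte clear outside any start/end pair does not.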
Here is what I came up with by adding the start/end calls to the 4.4 version of try_to_unmap_one. Note that this assumes/uses the new routine adjust_range_if_pmd_sharing_possible to adjust the notifier/flush range if huge pmd sharing is possible. I changed the mmu_notifier_invalidate_page to a mmu_notifier_invalidate_range, but am not sure if that needs to happen earlier in the routine (like right after tlb flush as you said above). Does this look reasonable?
diff --git a/mm/rmap.c b/mm/rmap.c
index b577fbb98d4b..7ba8bfeddb4b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1302,11 +1302,30 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_t pteval;
 	spinlock_t *ptl;
 	int ret = SWAP_AGAIN;
+	unsigned long start = address, end;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
 	if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
-		goto out;
+		return ret;
+
+	/*
+	 * For THP, we have to assume the worse case ie pmd for invalidation.
+	 * For hugetlb, it could be much worse if we need to do pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page can not be free in this function as call of
+	 * try_to_unmap() must hold a reference on the page.
+	 */
+	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
+	}
+	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
 
 	pte = page_check_address(page, mm, address, &ptl, 0);
 	if (!pte)
@@ -1334,6 +1353,29 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
 	}
 
+	if (PageHuge(page) && huge_pmd_unshare(mm, &address, pte)) {
+		/*
+		 * huge_pmd_unshare unmapped an entire PMD page.  There is
+		 * no way of knowing exactly which PMDs may be cached for
+		 * this mm, so flush them all.  start/end were already
+		 * adjusted to cover this range.
+		 */
+		flush_cache_range(vma, start, end);
+		flush_tlb_range(vma, start, end);
+
+		/*
+		 * The ref count of the PMD page was dropped which is part
+		 * of the way map counting is done for shared PMDs.  When
+		 * there is no other sharing, huge_pmd_unshare returns false
+		 * and we will unmap the actual page and drop map count
+		 * to zero.
+		 *
+		 * Note that huge_pmd_unshare modified address and is likely
+		 * not what you would expect.
+		 */
+		goto out_unmap;
+	}
+
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
 	if (should_defer_flush(mm, flags)) {
@@ -1424,10 +1466,11 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	page_cache_release(page);
 
 out_unmap:
-	pte_unmap_unlock(pte, ptl);
 	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
-		mmu_notifier_invalidate_page(mm, address);
+		mmu_notifier_invalidate_range(mm, start, end);
+	pte_unmap_unlock(pte, ptl);
 out:
+	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
 	return ret;
 }