On 10/26/22 17:42, Peter Xu wrote:
Hi, Mike,
On Sat, Oct 22, 2022 at 07:50:47PM -0700, Mike Kravetz wrote:
[...]
-void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+static void __unmap_hugepage_range_locking(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
-			  zap_flags_t zap_flags)
+			  zap_flags_t zap_flags, bool final)
 {
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);

 	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
-	/*
-	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
-	 * the vma_lock is freed, this makes the vma ineligible for pmd
-	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
-	 * This is important as page tables for this unmapped range will
-	 * be asynchronously deleted.  If the page tables are shared, there
-	 * will be issues when accessed by someone else.
-	 */
-	__hugetlb_vma_unlock_write_free(vma);
+	if (final) {
+		/*
+		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
+		 * When the vma_lock is freed, this makes the vma ineligible
+		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
+		 * pmd sharing.  This is important as page tables for this
+		 * unmapped range will be asynchronously deleted.  If the page
+		 * tables are shared, there will be issues when accessed by
+		 * someone else.
+		 */
+		__hugetlb_vma_unlock_write_free(vma);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
Pure question: can we rely on hugetlb_vm_op_close() to destroy the hugetlb vma lock?
I read the comment above; it seems we are trying to avoid racing with pmd sharing, but I don't see how that could ever happen, since iiuc there should only be two places that unmap the vma (final==true):

  (1) munmap: we're holding the write lock, so no page fault possible
  (2) exit_mmap: we've already reset current->mm so no page fault possible
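For reference, the call chains behind those two paths look roughly like the following; this is traced by hand from the 6.0-era source, so treat the exact intermediate calls as an assumption:

	/*
	 * The two final-unmap (final==true) paths, abbreviated:
	 *
	 *   munmap():    do_munmap() -> unmap_region() -> unmap_vmas()
	 *                  -> unmap_single_vma() -> __unmap_hugepage_range_final()
	 *
	 *   exit_mmap(): unmap_vmas() -> unmap_single_vma()
	 *                  -> __unmap_hugepage_range_final()
	 */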
Thanks for taking a look, Peter!

The possible sharing we are trying to stop would be initiated by a fault in a different process on the same underlying mapping object (inode). The specific vma in exit processing is still linked into the mapping's interval tree. So, even though we call huge_pmd_unshare in the unmap processing (in __unmap_hugepage_range), the sharing could later be initiated by another process.
Hope that makes sense. That is also the reason the routine page_table_shareable contains this check:
	/*
	 * match the virtual addresses, permission and the alignment of the
	 * page table page.
	 *
	 * Also, vma_lock (vm_private_data) is required for sharing.
	 */
	if (pmd_index(addr) != pmd_index(saddr) ||
	    vm_flags != svm_flags ||
	    !range_in_vma(svma, sbase, s_end) ||
	    !svma->vm_private_data)
		return 0;
FYI - The 'vm_flags != svm_flags' check also prevents a non-uffd mapping from initiating sharing with a uffd mapping.
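To make that concrete, here is a rough sketch of the sharing-side walk, loosely based on huge_pmd_share() in mm/hugetlb.c; the locking and locals are reconstructed from memory, so treat the details as assumptions:

pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
		      unsigned long addr, pud_t *pud)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	pgoff_t idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
	struct vm_area_struct *svma;
	unsigned long saddr;

	i_mmap_lock_read(mapping);
	/*
	 * An exiting vma is still in this interval tree until unlinked, so
	 * a fault in another process can find it here.  What stops the race
	 * is page_table_shareable() rejecting any svma whose vm_private_data
	 * (the vma_lock) has already been freed.
	 */
	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
		if (svma == vma)
			continue;
		saddr = page_table_shareable(svma, vma, addr, idx);
		if (saddr) {
			/* ... adopt svma's pmd page for this range ... */
			break;
		}
	}
	i_mmap_unlock_read(mapping);
	return NULL;	/* remainder of the routine elided */
}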
+	} else {
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+		hugetlb_vma_unlock_write(vma);
+	}
+}
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+			  struct vm_area_struct *vma, unsigned long start,
+			  unsigned long end, struct page *ref_page,
+			  zap_flags_t zap_flags)
+{
+	__unmap_hugepage_range_locking(tlb, vma, start, end, ref_page,
+					zap_flags, true);
 }

+#ifdef CONFIG_ADVISE_SYSCALLS
+/*
+ * Similar setup as in zap_page_range().  madvise(MADV_DONTNEED) can not call
+ * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
+ * the associated vma_lock.
+ */
+void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
+				unsigned long end)
+{
+	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				start, end);
Is mmu_notifier_invalidate_range_start() missing here?
It certainly does look like it. When I created this routine, I was trying to mimic what was done in the current calling path from zap_page_range to __unmap_hugepage_range_final. Now when I look at that, I am not seeing a mmu_notifier_invalidate_range_start/end. Am I missing something, or are these missing today? Do note that we do MMU_NOTIFY_UNMAP in __unmap_hugepage_range. A sketch with the start/end pairing restored follows the hunk below.
+	tlb_gather_mmu(&tlb, vma->vm_mm);
+	update_hiwater_rss(vma->vm_mm);
+
+	__unmap_hugepage_range_locking(&tlb, vma, start, end, NULL, 0, false);
+
+	mmu_notifier_invalidate_range_end(&range);
+	tlb_finish_mmu(&tlb);
+}
+#endif
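For reference, here is a minimal sketch of the routine with the mmu_notifier_invalidate_range_start/end pairing restored, mirroring the pattern zap_page_range() uses in mm/memory.c; this is illustrative only, not a posted patch:

void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
				unsigned long end)
{
	struct mmu_notifier_range range;
	struct mmu_gather tlb;

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
				start, end);
	tlb_gather_mmu(&tlb, vma->vm_mm);
	update_hiwater_rss(vma->vm_mm);
	mmu_notifier_invalidate_range_start(&range);	/* the missing call */

	__unmap_hugepage_range_locking(&tlb, vma, start, end, NULL, 0, false);

	mmu_notifier_invalidate_range_end(&range);
	tlb_finish_mmu(&tlb);
}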
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
 			  zap_flags_t zap_flags)

diff --git a/mm/madvise.c b/mm/madvise.c
index 2baa93ca2310..90577a669635 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -790,7 +790,10 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range(vma, start, end - start);
+	if (!is_vm_hugetlb_page(vma))
+		zap_page_range(vma, start, end - start);
+	else
+		clear_hugetlb_page_range(vma, start, end);
 	return 0;
 }
This does look a bit unfortunate - zap_page_range() contains yet another is_vm_hugetlb_page() check (further down, in unmap_single_vma), so it can be very confusing which code path is really handling hugetlb.
The mm_users check in the v3 approach doesn't need this change, but it looked a bit hackish to me; IIUC we know exactly which call paths trigger this (unmap_vmas), so it seems cleaner to pass that info down from the upper stack.
Maybe we can have a new zap flag passed into unmap_single_vma() showing that it's destroying the vma?
I thought about that. However, we would need to start passing the flag into zap_page_range as well, since that is the beginning of the call path down into the hugetlb code where we do not want to remove the vma_lock.
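For concreteness, the flag-based alternative might take roughly this shape; the flag name ZAP_FLAG_UNMAP and the plumbing are hypothetical, not code from this series:

/*
 * Hypothetical sketch: have the unmap_vmas() path set a "vma is being
 * destroyed" zap flag, and key the vma_lock teardown off that flag
 * instead of a separate 'final' parameter.  ZAP_FLAG_UNMAP is an
 * assumed name.
 */
#define ZAP_FLAG_UNMAP		((__force zap_flags_t)BIT(1))

void __unmap_hugepage_range_final(struct mmu_gather *tlb,
			  struct vm_area_struct *vma, unsigned long start,
			  unsigned long end, struct page *ref_page,
			  zap_flags_t zap_flags)
{
	hugetlb_vma_lock_write(vma);
	i_mmap_lock_write(vma->vm_file->f_mapping);

	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);

	if (zap_flags & ZAP_FLAG_UNMAP) {
		/* Final unmap: free the vma_lock to block future pmd sharing. */
		__hugetlb_vma_unlock_write_free(vma);
		i_mmap_unlock_write(vma->vm_file->f_mapping);
	} else {
		/* MADV_DONTNEED: keep the vma_lock for later faults. */
		i_mmap_unlock_write(vma->vm_file->f_mapping);
		hugetlb_vma_unlock_write(vma);
	}
}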