On 10/24/22 14:55, Mike Kravetz wrote:
On 10/22/22 19:50, Mike Kravetz wrote:
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the page tables associated with the address range. For hugetlb vmas, zap_page_range will call __unmap_hugepage_range_final. However, __unmap_hugepage_range_final assumes the passed vma is about to be removed and deletes the vma_lock to prevent pmd sharing as the vma is on the way out. In the case of madvise(MADV_DONTNEED) the vma remains, but the missing vma_lock not only makes the vma ineligible for pmd sharing, it could also lead to issues with truncation/fault races.
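For reference, the calling sequence in question is roughly:

	madvise(MADV_DONTNEED)
	  madvise_dontneed_single_vma()
	    zap_page_range()
	      unmap_single_vma()
	        __unmap_hugepage_range_final()	/* deletes the vma_lock */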
This issue was originally reported here [1] as a BUG triggered in page_try_dup_anon_rmap. Prior to the introduction of the hugetlb vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to prevent pmd sharing. Subsequent faults on this vma were confused: VM_MAYSHARE indicates a sharable vma, but since the flag was no longer set, page_mapping was not set in new pages added to the page table. This resulted in pages that appeared anonymous in a VM_SHARED vma and triggered the BUG.
Create a new routine clear_hugetlb_page_range() that can be called from madvise(MADV_DONTNEED) for hugetlb vmas. It has the same setup as zap_page_range, but does not delete the vma_lock.
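Something along these lines is what I have in mind (a sketch only -- the setup mirrors zap_page_range(), and the vma_lock is taken and released rather than freed; the actual patch has the details):

void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
				unsigned long end)
{
	struct mmu_notifier_range range;
	struct mmu_gather tlb;

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
				start, end);
	tlb_gather_mmu(&tlb, vma->vm_mm);
	update_hiwater_rss(vma->vm_mm);
	mmu_notifier_invalidate_range_start(&range);

	hugetlb_vma_lock_write(vma);
	i_mmap_lock_write(vma->vm_file->f_mapping);

	__unmap_hugepage_range(&tlb, vma, start, end, NULL, 0);

	i_mmap_unlock_write(vma->vm_file->f_mapping);
	/* unlock only -- do NOT free the vma_lock as the vma lives on */
	hugetlb_vma_unlock_write(vma);

	mmu_notifier_invalidate_range_end(&range);
	tlb_finish_mmu(&tlb);
}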
After seeing a syzbot use after free report [2] that is also addressed by this patch, I started thinking ...
When __unmap_hugepage_range_final was created, the only time unmap_single_vma was called for hugetlb vmas was at process exit time via exit_mmap. I got into trouble when I added a call via madvise(MADV_DONTNEED), which calls zap_page_range. This patch takes care of that calling path by having madvise(MADV_DONTNEED) call a new routine, clear_hugetlb_page_range, instead of zap_page_range for hugetlb vmas. The use after free bug had me auditing code paths to make sure __unmap_hugepage_range_final was REALLY only called at process exit time. If not, and we could fault on a vma after calling __unmap_hugepage_range_final, we would be in trouble.
My thought was: what if we had __unmap_hugepage_range_final check mm->mm_users to determine whether it is being called in the process exit path? If !mm_users, we can delete the vma_lock to prevent pmd sharing, as we know the process is exiting. If not, we do not delete the lock. That seems more robust and would prevent issues if someone accidentally introduces a new code path where __unmap_hugepage_range_final (unmap_single_vma for a hugetlb vma) could be called outside process exit context.
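For context, exit_mmap() is only reached from __mmput() once the last mm_users reference is dropped; this is essentially mmput() from kernel/fork.c:

void mmput(struct mm_struct *mm)
{
	might_sleep();

	if (atomic_dec_and_test(&mm->mm_users))
		__mmput(mm);		/* eventually calls exit_mmap(mm) */
}

So a !atomic_read(&mm->mm_users) check in the unmap path should tell us whether we are tearing down the whole address space or just zapping a range of a still-live vma.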
Thoughts?
[2] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
Sorry if this seems like I am talking to myself. Here is a proposed v3 as described above.
From 1466fd43e180ede3f6479d1dca4e7f350f86f80b Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Mon, 24 Oct 2022 15:40:05 -0700
Subject: [PATCH v3] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED
 processing
When hugetlb madvise(MADV_DONTNEED) support was added, the existing code that calls zap_page_range() to clear the page tables associated with the address range was not modified. However, for hugetlb vmas zap_page_range will call __unmap_hugepage_range_final. This routine assumes the passed hugetlb vma is about to be removed and deletes the vma_lock to prevent pmd sharing as the vma is on the way out. In the case of madvise(MADV_DONTNEED) the vma remains, but the missing vma_lock not only makes the vma ineligible for pmd sharing, it could also lead to issues with truncation/fault races.
This issue was originally reported here [1] as a BUG triggered in page_try_dup_anon_rmap. Prior to the introduction of the hugetlb vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to prevent pmd sharing. Subsequent faults on this vma were confused: VM_MAYSHARE indicates a sharable vma, but since the flag was no longer set, page_mapping was not set in new pages added to the page table. This resulted in pages that appeared anonymous in a VM_SHARED vma and triggered the BUG.
__unmap_hugepage_range_final was originally designed only to be called in the context of process exit (exit_mmap). It is now also called in the context of madvise(MADV_DONTNEED). Restructure the routine to check for !mm_users, which indicates it is being called in the process exit context. If so, delete the vma_lock. Otherwise, just unmap and leave the lock in place. Since the routine is no longer called only in process exit context, rename it to __unmap_hugetlb_page_range, dropping the 'final'.
[1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6Ju...
Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings")
Reported-by: Wei Chen <harperchen1110@gmail.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: stable@vger.kernel.org
---
v3 - Check for !mm_users in __unmap_hugepage_range_final instead of
     creating a separate function.
 include/linux/hugetlb.h |  4 ++--
 mm/hugetlb.c            | 30 ++++++++++++++++++++----------
 mm/memory.c             |  2 +-
 3 files changed, 23 insertions(+), 13 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index a899bc76d677..bc19a1f6ca10 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -158,7 +158,7 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 void unmap_hugepage_range(struct vm_area_struct *,
 			  unsigned long, unsigned long, struct page *,
 			  zap_flags_t);
-void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+void __unmap_hugetlb_page_range(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma,
 			  unsigned long start, unsigned long end,
 			  struct page *ref_page, zap_flags_t zap_flags);
@@ -418,7 +418,7 @@ static inline unsigned long hugetlb_change_protection(
 	return 0;
 }
 
-static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+static inline void __unmap_hugetlb_page_range(struct mmu_gather *tlb,
 			struct vm_area_struct *vma, unsigned long start,
 			unsigned long end, struct page *ref_page,
 			zap_flags_t zap_flags)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 931789a8f734..3fe1152c3c20 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5202,27 +5202,37 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
 	tlb_flush_mmu_tlbonly(tlb);
 }
 
-void __unmap_hugepage_range_final(struct mmu_gather *tlb,
+void __unmap_hugetlb_page_range(struct mmu_gather *tlb,
 			  struct vm_area_struct *vma, unsigned long start,
 			  unsigned long end, struct page *ref_page,
 			  zap_flags_t zap_flags)
 {
+	struct mm_struct *mm = vma->vm_mm;
+
 	hugetlb_vma_lock_write(vma);
 	i_mmap_lock_write(vma->vm_file->f_mapping);
 
 	__unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
 
 	/*
-	 * Unlock and free the vma lock before releasing i_mmap_rwsem.  When
-	 * the vma_lock is freed, this makes the vma ineligible for pmd
-	 * sharing.  And, i_mmap_rwsem is required to set up pmd sharing.
-	 * This is important as page tables for this unmapped range will
-	 * be asynchrously deleted.  If the page tables are shared, there
-	 * will be issues when accessed by someone else.
+	 * Free the vma_lock here if process exiting
 	 */
-	__hugetlb_vma_unlock_write_free(vma);
-
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	if (!atomic_read(&mm->mm_users)) {
+		/*
+		 * Unlock and free the vma lock before releasing i_mmap_rwsem.
+		 * When the vma_lock is freed, this makes the vma ineligible
+		 * for pmd sharing.  And, i_mmap_rwsem is required to set up
+		 * pmd sharing.  This is important as page tables for this
+		 * unmapped range will be asynchrously deleted.  If the page
+		 * tables are shared, there will be issues when accessed by
+		 * someone else.
+		 */
+		__hugetlb_vma_unlock_write_free(vma);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	} else {
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+		hugetlb_vma_unlock_write(vma);
+	}
 }
 
 void unmap_hugepage_range(struct vm_area_struct *vma, unsigned long start,
diff --git a/mm/memory.c b/mm/memory.c
index 8e72f703ed99..1de8ea504047 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1687,7 +1687,7 @@ static void unmap_single_vma(struct mmu_gather *tlb,
 		if (vma->vm_file) {
 			zap_flags_t zap_flags = details ?
 			    details->zap_flags : 0;
-			__unmap_hugepage_range_final(tlb, vma, start, end,
+			__unmap_hugetlb_page_range(tlb, vma, start, end,
 						     NULL, zap_flags);
 		}
 	} else