On 10/22/22 19:50, Mike Kravetz wrote:
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the page tables associated with the address range. For hugetlb vmas, zap_page_range will call __unmap_hugepage_range_final. However, __unmap_hugepage_range_final assumes the passed vma is about to be removed and deletes the vma_lock to prevent pmd sharing as the vma is on the way out. In the case of madvise(MADV_DONTNEED) the vma remains, but the missing vma_lock prevents pmd sharing and could potentially lead to issues with truncation/fault races.
This issue was originally reported here [1] as a BUG triggered in page_try_dup_anon_rmap. Prior to the introduction of the hugetlb vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to prevent pmd sharing. Subsequent faults on this vma were confused as VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was not set in new pages added to the page table. This resulted in pages that appeared anonymous in a VM_SHARED vma and triggered the BUG.
Create a new routine clear_hugetlb_page_range() that can be called from madvise(MADV_DONTNEED) for hugetlb vmas. It has the same setup as zap_page_range, but does not delete the vma_lock.
After seeing a syzbot use after free report [2] that is also addressed by this patch, I started thinking ...
When __unmap_hugepage_range_final was created, the only time unmap_single_vma was called for hugetlb vmas was during process exit time via exit_mmap. I got in trouble when I added a call via madvise(MADV_DONTNEED) which calls zap_page_range. This patch takes care of that calling path by having madvise(MADV_DONTNEED) call a new routine clear_hugetlb_page_range instead of zap_page_range for hugetlb vmas. The use after free bug had me auditing code paths to make sure __unmap_hugepage_range_final was REALLY only called at process exit time. If not, and we could fault on a vma after calling __unmap_hugepage_range_final we would be in trouble.
My thought was, what if we had __unmap_hugepage_range_final check mm->mm_users to determine if it was being called in the process exit path? If !mm_users, then we can delete the vma_lock to prevent pmd sharing as we know the process is exiting. If not, we do not delete the lock. That seems to be more robust and would prevent issues if someone accidentally introduces a new code path where __unmap_hugepage_range_final (unmap_single_vma for a hugetlb vma) could be called outside process exit context.
Thoughts?
[2] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/