From: Miaohe Lin linmiaohe@huawei.com
commit 35e351780fa9d8240dd6f7e4f245f9ea37e96c19 upstream.
Thorvald reported a WARNING [1]. The root cause is the race below:

 CPU 1                                  CPU 2
 fork                                   hugetlbfs_fallocate
  dup_mmap                               hugetlbfs_punch_hole
   i_mmap_lock_write(mapping);
   vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
   i_mmap_unlock_write(mapping);
   hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
                                         i_mmap_lock_write(mapping);
                                         hugetlb_vmdelete_list
                                          vma_interval_tree_foreach
                                           hugetlb_vma_trylock_write -- Vma_lock is cleared.
   tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
                                           hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
                                         i_mmap_unlock_write(mapping);

hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside the i_mmap_rwsem lock while the vma lock can be used at the same time. Fix this by deferring linking the file vma until the vma is fully initialized. Those vmas should be initialized first before they can be used.
Backport notes:
The first backport attempt (cec11fa2e) was reverted (dd782da4707). This is the new backport of the original fix (35e351780fa9).
35e351780f ("fork: defer linking file vma until vma is fully initialized") fixed a hugetlb locking race by moving a bunch of intialization code to earlier in the function. The call to open() was included in the move but the call to copy_page_range was not, effectively inverting their relative ordering. This created an issue for the vfio code which assumes copy_page_range happens before the call to open() - vfio's open zaps the vma so that the fault handler is invoked later, but when we inverted the ordering, copy_page_range can set up mappings post-zap which would prevent the fault handler from being invoked later. This patch moves the call to copy_page_range to earlier than the call to open() to restore the original ordering of the two functions while keeping the fix for hugetlb intact.
Commit aac6db75a9 made several changes to vfio_pci_core.c, including removing the vfio-pci custom open function. This resolves the issue on the main branch and so we only need to apply these changes when backporting to stable branches.
35e351780f ("fork: defer linking file vma until vma is fully initialized")-> v6.9-rc5 aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") -> v6.10-rc4
Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com
Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
Signed-off-by: Miaohe Lin linmiaohe@huawei.com
Reported-by: Thorvald Natvig thorvald@google.com
Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
Reviewed-by: Jane Chu jane.chu@oracle.com
Cc: Christian Brauner brauner@kernel.org
Cc: Heiko Carstens hca@linux.ibm.com
Cc: Kent Overstreet kent.overstreet@linux.dev
Cc: Liam R. Howlett Liam.Howlett@oracle.com
Cc: Mateusz Guzik mjguzik@gmail.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Miaohe Lin linmiaohe@huawei.com
Cc: Muchun Song muchun.song@linux.dev
Cc: Oleg Nesterov oleg@redhat.com
Cc: Peng Zhang zhangpeng.00@bytedance.com
Cc: Tycho Andersen tandersen@netflix.com
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Miaohe Lin linmiaohe@huawei.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Leah Rumancik leah.rumancik@gmail.com
---
 kernel/fork.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 177ce7438db6..122d2cd124d5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -727,6 +727,19 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		} else if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
 		vm_flags_clear(tmp, VM_LOCKED_MASK);
+		/*
+		 * Copy/update hugetlb private vma information.
+		 */
+		if (is_vm_hugetlb_page(tmp))
+			hugetlb_dup_vma_private(tmp);
+
+		if (!(tmp->vm_flags & VM_WIPEONFORK) &&
+		    copy_page_range(tmp, mpnt))
+			goto fail_nomem_vmi_store;
+
+		if (tmp->vm_ops && tmp->vm_ops->open)
+			tmp->vm_ops->open(tmp);
+
 		file = tmp->vm_file;
 		if (file) {
 			struct address_space *mapping = file->f_mapping;
@@ -743,25 +756,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			i_mmap_unlock_write(mapping);
 		}
 
-		/*
-		 * Copy/update hugetlb private vma information.
-		 */
-		if (is_vm_hugetlb_page(tmp))
-			hugetlb_dup_vma_private(tmp);
-
 		/* Link the vma into the MT */
 		if (vma_iter_bulk_store(&vmi, tmp))
 			goto fail_nomem_vmi_store;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
-			retval = copy_page_range(tmp, mpnt);
-
-		if (tmp->vm_ops && tmp->vm_ops->open)
-			tmp->vm_ops->open(tmp);
-
-		if (retval)
-			goto loop_out;
 	}
 	/* a new mm has just been created */
 	retval = arch_dup_mmap(oldmm, mm);
On 2024/7/2 12:29, Leah Rumancik wrote:
From: Miaohe Lin linmiaohe@huawei.com
commit 35e351780fa9d8240dd6f7e4f245f9ea37e96c19 upstream.
Thorvald reported a WARNING [1]. The root cause is the race below:

 CPU 1                                  CPU 2
 fork                                   hugetlbfs_fallocate
  dup_mmap                               hugetlbfs_punch_hole
   i_mmap_lock_write(mapping);
   vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
   i_mmap_unlock_write(mapping);
   hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
                                         i_mmap_lock_write(mapping);
                                         hugetlb_vmdelete_list
                                          vma_interval_tree_foreach
                                           hugetlb_vma_trylock_write -- Vma_lock is cleared.
   tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
                                           hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
                                         i_mmap_unlock_write(mapping);

hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside the i_mmap_rwsem lock while the vma lock can be used at the same time. Fix this by deferring linking the file vma until the vma is fully initialized. Those vmas should be initialized first before they can be used.
Backport notes:
The first backport attempt (cec11fa2e) was reverted (dd782da4707). This is the new backport of the original fix (35e351780fa9).
35e351780f ("fork: defer linking file vma until vma is fully initialized") fixed a hugetlb locking race by moving a bunch of intialization code to earlier in the function. The call to open() was included in the move but the call to copy_page_range was not, effectively inverting their relative ordering. This created an issue for the vfio code which assumes copy_page_range happens before the call to open() - vfio's open zaps the vma so that the fault handler is invoked later, but when we inverted the ordering, copy_page_range can set up mappings post-zap which would prevent the fault handler from being invoked later. This patch moves the call to copy_page_range to earlier than the call to open() to restore the original ordering of the two functions while keeping the fix for hugetlb intact.
Thanks for your update!
Commit aac6db75a9 made several changes to vfio_pci_core.c, including removing the vfio-pci custom open function. This resolves the issue on the main branch and so we only need to apply these changes when backporting to stable branches.
35e351780f ("fork: defer linking file vma until vma is fully initialized")-> v6.9-rc5 aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") -> v6.10-rc4
Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com
Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
Signed-off-by: Miaohe Lin linmiaohe@huawei.com
Reported-by: Thorvald Natvig thorvald@google.com
Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
Reviewed-by: Jane Chu jane.chu@oracle.com
Cc: Christian Brauner brauner@kernel.org
Cc: Heiko Carstens hca@linux.ibm.com
Cc: Kent Overstreet kent.overstreet@linux.dev
Cc: Liam R. Howlett Liam.Howlett@oracle.com
Cc: Mateusz Guzik mjguzik@gmail.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Miaohe Lin linmiaohe@huawei.com
Cc: Muchun Song muchun.song@linux.dev
Cc: Oleg Nesterov oleg@redhat.com
Cc: Peng Zhang zhangpeng.00@bytedance.com
Cc: Tycho Andersen tandersen@netflix.com
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Miaohe Lin linmiaohe@huawei.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Leah Rumancik leah.rumancik@gmail.com
 kernel/fork.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 177ce7438db6..122d2cd124d5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -727,6 +727,19 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		} else if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
 		vm_flags_clear(tmp, VM_LOCKED_MASK);
+		/*
+		 * Copy/update hugetlb private vma information.
+		 */
+		if (is_vm_hugetlb_page(tmp))
+			hugetlb_dup_vma_private(tmp);
+
+		if (!(tmp->vm_flags & VM_WIPEONFORK) &&
+		    copy_page_range(tmp, mpnt))
+			goto fail_nomem_vmi_store;
+
+		if (tmp->vm_ops && tmp->vm_ops->open)
+			tmp->vm_ops->open(tmp);
+
 		file = tmp->vm_file;
 		if (file) {
 			struct address_space *mapping = file->f_mapping;
@@ -743,25 +756,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			i_mmap_unlock_write(mapping);
 		}
 
-		/*
-		 * Copy/update hugetlb private vma information.
-		 */
-		if (is_vm_hugetlb_page(tmp))
-			hugetlb_dup_vma_private(tmp);
-
 		/* Link the vma into the MT */
 		if (vma_iter_bulk_store(&vmi, tmp))
 			goto fail_nomem_vmi_store;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
-			retval = copy_page_range(tmp, mpnt);
I have a vague memory that copy_page_range should be called after the vma is inserted into the i_mmap tree. Otherwise there might be a problem:

 dup_mmap                               remove_migration_ptes
  copy_page_range -- Child process copies
   migration entry from parent.
                                         rmap_walk
                                          rmap_walk_file
                                           i_mmap_lock_read(mapping);
                                           vma_interval_tree_foreach
                                            remove_migration_pte -- The vma of the child process
                                             is still invisible, so a migration entry is left in
                                             the child process's address space.
                                           i_mmap_unlock_read(mapping);
  i_mmap_lock_write(mapping);
  vma_interval_tree_insert_after -- Too late! The child process has a stale
   migration entry left while the migration is already done!
  i_mmap_unlock_write(mapping);

But I'm not really sure. I might be missing something. Thanks.
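To make the window concrete, here is a condensed sketch of the file rmap walk that remove_migration_pte() is driven by (simplified from rmap_walk_file(); the demo_* name and the simplified address handling are illustrative, not verbatim kernel code). The walk only visits vmas already linked into mapping->i_mmap, so a child vma that has not been inserted yet keeps whatever migration entries copy_page_range() gave it.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/rmap.h>

/* Condensed shape of the file rmap walk; not verbatim kernel code. */
static void demo_rmap_walk_file(struct folio *folio, struct rmap_walk_control *rwc,
				struct address_space *mapping,
				pgoff_t pgoff_start, pgoff_t pgoff_end)
{
	struct vm_area_struct *vma;

	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) {
		/*
		 * rmap_one is e.g. remove_migration_pte().  A vma that has
		 * not been inserted into the interval tree yet is simply
		 * never visited here.  (Address computation simplified.)
		 */
		if (!rwc->rmap_one(folio, vma, vma->vm_start, rwc->arg))
			break;
	}
	i_mmap_unlock_read(mapping);
}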
On Mon, Jul 01, 2024 at 09:29:48PM -0700, Leah Rumancik wrote:
From: Miaohe Lin linmiaohe@huawei.com
commit 35e351780fa9d8240dd6f7e4f245f9ea37e96c19 upstream.
Thorvald reported a WARNING [1]. The root cause is the race below:

 CPU 1                                  CPU 2
 fork                                   hugetlbfs_fallocate
  dup_mmap                               hugetlbfs_punch_hole
   i_mmap_lock_write(mapping);
   vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
   i_mmap_unlock_write(mapping);
   hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
                                         i_mmap_lock_write(mapping);
                                         hugetlb_vmdelete_list
                                          vma_interval_tree_foreach
                                           hugetlb_vma_trylock_write -- Vma_lock is cleared.
   tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
                                           hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
                                         i_mmap_unlock_write(mapping);

hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside the i_mmap_rwsem lock while the vma lock can be used at the same time. Fix this by deferring linking the file vma until the vma is fully initialized. Those vmas should be initialized first before they can be used.
Backport notes:
The first backport attempt (cec11fa2e) was reverted (dd782da4707). This is the new backport of the original fix (35e351780fa9).
35e351780f ("fork: defer linking file vma until vma is fully initialized") fixed a hugetlb locking race by moving a bunch of intialization code to earlier in the function. The call to open() was included in the move but the call to copy_page_range was not, effectively inverting their relative ordering. This created an issue for the vfio code which assumes copy_page_range happens before the call to open() - vfio's open zaps the vma so that the fault handler is invoked later, but when we inverted the ordering, copy_page_range can set up mappings post-zap which would prevent the fault handler from being invoked later. This patch moves the call to copy_page_range to earlier than the call to open() to restore the original ordering of the two functions while keeping the fix for hugetlb intact.
Commit aac6db75a9 made several changes to vfio_pci_core.c, including removing the vfio-pci custom open function. This resolves the issue on the main branch and so we only need to apply these changes when backporting to stable branches.
35e351780f ("fork: defer linking file vma until vma is fully initialized")-> v6.9-rc5 aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") -> v6.10-rc4
Is there a strong reason not to take the commit above instead? That way:
a. We stay aligned with upstream, not needing a custom backport.
b. We avoid similar issues in the future.
I'd say aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is a pretty significant change for stable. Then again, mucking around in dup_mmap doesn't seem very clear cut either..
cc'ing Alex from this patch
- leah
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
On Mon, 15 Jul 2024 13:35:41 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
If you were to take those, I think you'd also want:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
which helps avoid a potential regression in VM startup latency vs faulting each page of the VMA. Ideally we'd have had huge_fault working for pfnmaps before this conversion to avoid the latter commit.
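For context, a rough sketch of what "insert the full vma on the first MMIO fault" means in practice. This is a hedged illustration only: the handler name and the assumption that vm_pgoff holds the base pfn are mine, not the actual code of d71a989cf5d9.

#include <linux/mm.h>

/*
 * Hypothetical fault handler: populate every pfn of a VM_PFNMAP range on
 * the first fault so later accesses do not fault again.
 */
static vm_fault_t demo_mmio_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long base_pfn = vma->vm_pgoff;	/* assumption: pgoff holds the base pfn */
	unsigned long addr;

	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
		unsigned long pfn = base_pfn + ((addr - vma->vm_start) >> PAGE_SHIFT);

		/* vmf_insert_pfn() returns VM_FAULT_NOPAGE on success */
		if (vmf_insert_pfn(vma, addr, pfn) != VM_FAULT_NOPAGE)
			return VM_FAULT_SIGBUS;
	}

	return VM_FAULT_NOPAGE;
}

Filling the whole range on the first fault avoids taking a separate fault for every page of a large BAR, which is the VM startup latency concern mentioned above.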
I'm a bit confused by the lineage here though, 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") entered v6.9 whereas these vfio changes all came in v6.10, so why does the v6.6 backport end up with dependencies on these newer commits? Is there something that needs to be fixed in v6.9-stable as well?
Aside from the size of aac6db75a9 in particular, I'm not aware of any outstanding issues that would otherwise dissuade backport to v6.6-stable. Thanks,
Alex
On Mon, Jul 15, 2024 at 3:21 PM Alex Williamson alex.williamson@redhat.com wrote:
On Mon, 15 Jul 2024 13:35:41 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
If you were to take those, I think you'd also want:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
which helps avoid a potential regression in VM startup latency vs faulting each page of the VMA. Ideally we'd have had huge_fault working for pfnmaps before this conversion to avoid the latter commit.
I'm a bit confused by the lineage here though, 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") entered v6.9 whereas these vfio changes all came in v6.10, so why does the v6.6 backport end up with dependencies on these newer commits? Is there something that needs to be fixed in v6.9-stable as well?
Right, I believe 35e351780fa9 introduced a bug for VFIO by calling vm_ops->open() *before* copy_page_range(). So I think this bug affects not just 6.6 (to which 35e351780fa9 was stable backported) but also 6.9 as you say.
The reason to bring up all these newer commits is, it's unclear how to fix the bug. :) We thought we had a simple solution to just reorder when vm_ops->open() is called, but Miaohe pointed out elsewhere in this thread an issue with doing that.
Assuming the reordering is unworkable, the only other idea I have for fixing the bug without the larger refactor is:
1. Mark VFIO VMAs VM_WIPEONFORK so we don't copy_page_range after vm_ops->open() is called (a sketch of this follows below).
2. Remove the WARN_ON_ONCE(1) in get_pat_info() so when VFIO zaps a not-fully-populated range (expected if we never copy_page_range!) we don't get a warning.
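A minimal sketch of idea 1, assuming a driver mmap handler; the demo_* names are hypothetical and this is not the actual vfio-pci code.

#include <linux/mm.h>

static const struct vm_operations_struct demo_vm_ops;	/* .fault omitted for brevity */

/*
 * Hypothetical mmap handler: VM_WIPEONFORK makes dup_mmap() skip
 * copy_page_range() for this vma, so the child starts with an empty
 * mapping regardless of when ->open() runs.
 */
static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_WIPEONFORK);
	vma->vm_ops = &demo_vm_ops;
	return 0;
}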
There are downsides to this fix. It's kind of abusing VM_WIPEONFORK for a new purpose. It's removing a warning which may catch other legitimate problems. And it's diverging stable kernels from upstream as Sasha points out.
Just backporting the refactors fixes (well, totally avoids) the bug, and it doesn't require special hackery only for stable kernels.
Aside from the size of aac6db75a9 in particular, I'm not aware of any outstanding issues that would otherwise dissuade backport to v6.6-stable. Thanks,
Alex
On Mon, 15 Jul 2024 18:06:25 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
On Mon, Jul 15, 2024 at 3:21 PM Alex Williamson alex.williamson@redhat.com wrote:
On Mon, 15 Jul 2024 13:35:41 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
If you were to take those, I think you'd also want:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
which helps avoid a potential regression in VM startup latency vs faulting each page of the VMA. Ideally we'd have had huge_fault working for pfnmaps before this conversion to avoid the latter commit.
I'm a bit confused by the lineage here though, 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") entered v6.9 whereas these vfio changes all came in v6.10, so why does the v6.6 backport end up with dependencies on these newer commits? Is there something that needs to be fixed in v6.9-stable as well?
Right, I believe 35e351780fa9 introduced a bug for VFIO by calling vm_ops->open() *before* copy_page_range(). So I think this bug affects not just 6.6 (to which 35e351780fa9 was stable backported) but also 6.9 as you say.
The reason to bring up all these newer commits is, it's unclear how to fix the bug. :) We thought we had a simple solution to just reorder when vm_ops->open() is called, but Miaohe pointed out elsewhere in this thread an issue with doing that.
Assuming the reordering is unworkable, the only other idea I have for fixing the bug without the larger refactor is:
1. Mark VFIO VMAs VM_WIPEONFORK so we don't copy_page_range after vm_ops->open() is called
2. Remove the WARN_ON_ONCE(1) in get_pat_info() so when VFIO zaps a not-fully-populated range (expected if we never copy_page_range!) we don't get a warning
There are downsides to this fix. It's kind of abusing VM_WIPEONFORK for a new purpose. It's removing a warning which may catch other legitimate problems. And it's diverging stable kernels from upstream as Sasha points out.
Just backporting the refactors fixes (well, totally avoids) the bug, and it doesn't require special hackery only for stable kernels.
Yes, I'd agree that we want to stay as close as possible to the current upstream solution, even if we got there pretty haphazardly. Therefore it sounds like we should queue the following for v6.9-stable:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
aac6db75a9fc ("vfio/pci: Use unmap_mapping_range()")
b7c5e64fecfa ("vfio: Create vfio_fs_type with inode per device")
And then anywhere that 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") gets backported, those will also need to follow.
Did anyone report an issue with 35e351780fa9 and vfio on v6.9 or the previous v6.6 backport to use as a test case or do we just know it's an issue from inspection? The revert only notes an xfstest issue. Thanks,
Alex
On Tue, Jul 16, 2024 at 9:08 AM Alex Williamson alex.williamson@redhat.com wrote:
On Mon, 15 Jul 2024 18:06:25 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
On Mon, Jul 15, 2024 at 3:21 PM Alex Williamson alex.williamson@redhat.com wrote:
On Mon, 15 Jul 2024 13:35:41 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
If you were to take those, I think you'd also want:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
which helps avoid a potential regression in VM startup latency vs faulting each page of the VMA. Ideally we'd have had huge_fault working for pfnmaps before this conversion to avoid the latter commit.
I'm a bit confused by the lineage here though, 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") entered v6.9 whereas these vfio changes all came in v6.10, so why does the v6.6 backport end up with dependencies on these newer commits? Is there something that needs to be fixed in v6.9-stable as well?
Right, I believe 35e351780fa9 introduced a bug for VFIO by calling vm_ops->open() *before* copy_page_range(). So I think this bug affects not just 6.6 (to which 35e351780fa9 was stable backported) but also 6.9 as you say.
The reason to bring up all these newer commits is, it's unclear how to fix the bug. :) We thought we had a simple solution to just reorder when vm_ops->open() is called, but Miaohe pointed out elsewhere in this thread an issue with doing that.
Assuming the reordering is unworkable, the only other idea I have for fixing the bug without the larger refactor is:
1. Mark VFIO VMAs VM_WIPEONFORK so we don't copy_page_range after vm_ops->open() is called
2. Remove the WARN_ON_ONCE(1) in get_pat_info() so when VFIO zaps a not-fully-populated range (expected if we never copy_page_range!) we don't get a warning
There are downsides to this fix. It's kind of abusing VM_WIPEONFORK for a new purpose. It's removing a warning which may catch other legitimate problems. And it's diverging stable kernels from upstream as Sasha points out.
Just backporting the refactors fixes (well, totally avoids) the bug, and it doesn't require special hackery only for stable kernels.
Yes, I'd agree that we want to stay as close as possible to the current upstream solution, even if we got there pretty haphazardly. Therefore it sounds like we should queue the following for v6.9-stable:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
aac6db75a9fc ("vfio/pci: Use unmap_mapping_range()")
b7c5e64fecfa ("vfio: Create vfio_fs_type with inode per device")
And then anywhere that 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") gets backported, those will also need to follow.
Sounds good to me. I can send these patches for 6.9 and then 6.6.
Did anyone report an issue with 35e351780fa9 and vfio on v6.9 or the previous v6.6 backport to use as a test case or do we just know it's an issue from inspection? The revert only notes an xfstest issue. Thanks,
I'm not aware of any reports of this, besides our own detection internally.
We originally noticed via xfstests the failure mode where we call copy_page_range, so underneath untrack_pfn we find a 'hole' in the mapping and WARN. A fair question is, why does running xfstests involve exercising vfio-pci? :) Internally our test machines use vfio-pci for other reasons; xfstests is an innocent bystander here. We just happened to trigger this WARN while xfstests was running, so it noticed and reported the WARN in the test results.
Since that repro is specific to our test machine setup, it unfortunately isn't an easily shareable regression test. :/
Alex
On Tue, 16 Jul 2024 09:58:34 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
On Tue, Jul 16, 2024 at 9:08 AM Alex Williamson alex.williamson@redhat.com wrote:
On Mon, 15 Jul 2024 18:06:25 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
On Mon, Jul 15, 2024 at 3:21 PM Alex Williamson alex.williamson@redhat.com wrote:
On Mon, 15 Jul 2024 13:35:41 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
If you were to take those, I think you'd also want:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
which helps avoid a potential regression in VM startup latency vs faulting each page of the VMA. Ideally we'd have had huge_fault working for pfnmaps before this conversion to avoid the latter commit.
I'm a bit confused by the lineage here though, 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") entered v6.9 whereas these vfio changes all came in v6.10, so why does the v6.6 backport end up with dependencies on these newer commits? Is there something that needs to be fixed in v6.9-stable as well?
Right, I believe 35e351780fa9 introduced a bug for VFIO by calling vm_ops->open() *before* copy_page_range(). So I think this bug affects not just 6.6 (to which 35e351780fa9 was stable backported) but also 6.9 as you say.
The reason to bring up all these newer commits is, it's unclear how to fix the bug. :) We thought we had a simple solution to just reorder when vm_ops->open() is called, but Miaohe pointed out elsewhere in this thread an issue with doing that.
Assuming the reordering is unworkable, the only other idea I have for fixing the bug without the larger refactor is:
1. Mark VFIO VMAs VM_WIPEONFORK so we don't copy_page_range after vm_ops->open() is called
2. Remove the WARN_ON_ONCE(1) in get_pat_info() so when VFIO zaps a not-fully-populated range (expected if we never copy_page_range!) we don't get a warning
There are downsides to this fix. It's kind of abusing VM_WIPEONFORK for a new purpose. It's removing a warning which may catch other legitimate problems. And it's diverging stable kernels from upstream as Sasha points out.
Just backporting the refactors fixes (well, totally avoids) the bug, and it doesn't require special hackery only for stable kernels.
Yes, I'd agree that we want to stay as close as possible to the current upstream solution, even if we got there pretty haphazardly. Therefore it sounds like we should queue the following for v6.9-stable:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
aac6db75a9fc ("vfio/pci: Use unmap_mapping_range()")
b7c5e64fecfa ("vfio: Create vfio_fs_type with inode per device")
And then anywhere that 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") gets backported, those will also need to follow.
Sounds good to me. I can send these patches for 6.9 and then 6.6.
Did anyone report an issue with 35e351780fa9 and vfio on v6.9 or the previous v6.6 backport to use as a test case or do we just know it's an issue from inspection? The revert only notes an xfstest issue. Thanks,
I'm not aware of any reports of this, besides our own detection internally.
We originally noticed via xfstests the failure mode where we call copy_page_range, so underneath untrack_pfn we find a 'hole' in the mapping and WARN. A fair question is, why does running xfstests involve exercising vfio-pci? :) Internally our test machines use vfio-pci for other reasons; xfstests is an innocent bystander here. We just happened to trigger this WARN while xfstests was running, so it noticed and reported the WARN in the test results.
Since that repro is specific to our test machine setup, it unfortunately isn't an easily shareable regression test. :/
Aha, that helps put the pieces together and is an interesting data point for vfio-pci use :) Sounds like you're then best able to verify the issue exists on v6.9 and the fix with the above three vfio patches, assuming you've got the bandwidth. Let me know if you need any support. Thanks!
Alex