From: Miaohe Lin linmiaohe@huawei.com
commit 35e351780fa9d8240dd6f7e4f245f9ea37e96c19 upstream.
Thorvald reported a WARNING [1]. The root cause is the race below:

 CPU 1                                  CPU 2
 fork                                   hugetlbfs_fallocate
  dup_mmap                               hugetlbfs_punch_hole
   i_mmap_lock_write(mapping);
   vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
   i_mmap_unlock_write(mapping);
   hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
                                         i_mmap_lock_write(mapping);
                                         hugetlb_vmdelete_list
                                          vma_interval_tree_foreach
                                           hugetlb_vma_trylock_write -- Vma_lock is cleared.
   tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
                                           hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
                                         i_mmap_unlock_write(mapping);

hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside the i_mmap_rwsem lock while the vma lock can be used at the same time. Fix this by deferring linking the file vma until the vma is fully initialized. Those vmas should be initialized first before they can be used.
Backport notes:
The first backport attempt (cec11fa2e) was reverted (dd782da4707). This is the new backport of the original fix (35e351780fa9).
35e351780f ("fork: defer linking file vma until vma is fully initialized") fixed a hugetlb locking race by moving a bunch of intialization code to earlier in the function. The call to open() was included in the move but the call to copy_page_range was not, effectively inverting their relative ordering. This created an issue for the vfio code which assumes copy_page_range happens before the call to open() - vfio's open zaps the vma so that the fault handler is invoked later, but when we inverted the ordering, copy_page_range can set up mappings post-zap which would prevent the fault handler from being invoked later. This patch moves the call to copy_page_range to earlier than the call to open() to restore the original ordering of the two functions while keeping the fix for hugetlb intact.
Commit aac6db75a9 made several changes to vfio_pci_core.c, including removing the vfio-pci custom open function. This resolves the issue on the main branch and so we only need to apply these changes when backporting to stable branches.
35e351780f ("fork: defer linking file vma until vma is fully initialized")-> v6.9-rc5 aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") -> v6.10-rc4
Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com
Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
Signed-off-by: Miaohe Lin linmiaohe@huawei.com
Reported-by: Thorvald Natvig thorvald@google.com
Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
Reviewed-by: Jane Chu jane.chu@oracle.com
Cc: Christian Brauner brauner@kernel.org
Cc: Heiko Carstens hca@linux.ibm.com
Cc: Kent Overstreet kent.overstreet@linux.dev
Cc: Liam R. Howlett Liam.Howlett@oracle.com
Cc: Mateusz Guzik mjguzik@gmail.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Miaohe Lin linmiaohe@huawei.com
Cc: Muchun Song muchun.song@linux.dev
Cc: Oleg Nesterov oleg@redhat.com
Cc: Peng Zhang zhangpeng.00@bytedance.com
Cc: Tycho Andersen tandersen@netflix.com
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Miaohe Lin linmiaohe@huawei.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Leah Rumancik leah.rumancik@gmail.com
---
 kernel/fork.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 177ce7438db6..122d2cd124d5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -727,6 +727,19 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		} else if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
 		vm_flags_clear(tmp, VM_LOCKED_MASK);
+		/*
+		 * Copy/update hugetlb private vma information.
+		 */
+		if (is_vm_hugetlb_page(tmp))
+			hugetlb_dup_vma_private(tmp);
+
+		if (!(tmp->vm_flags & VM_WIPEONFORK) &&
+		    copy_page_range(tmp, mpnt))
+			goto fail_nomem_vmi_store;
+
+		if (tmp->vm_ops && tmp->vm_ops->open)
+			tmp->vm_ops->open(tmp);
+
 		file = tmp->vm_file;
 		if (file) {
 			struct address_space *mapping = file->f_mapping;
@@ -743,25 +756,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			i_mmap_unlock_write(mapping);
 		}
 
-		/*
-		 * Copy/update hugetlb private vma information.
-		 */
-		if (is_vm_hugetlb_page(tmp))
-			hugetlb_dup_vma_private(tmp);
-
 		/* Link the vma into the MT */
 		if (vma_iter_bulk_store(&vmi, tmp))
 			goto fail_nomem_vmi_store;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
-			retval = copy_page_range(tmp, mpnt);
-
-		if (tmp->vm_ops && tmp->vm_ops->open)
-			tmp->vm_ops->open(tmp);
-
-		if (retval)
-			goto loop_out;
 	}
 	/* a new mm has just been created */
 	retval = arch_dup_mmap(oldmm, mm);
On 2024/7/2 12:29, Leah Rumancik wrote:
From: Miaohe Lin linmiaohe@huawei.com
commit 35e351780fa9d8240dd6f7e4f245f9ea37e96c19 upstream.
Thorvald reported a WARNING [1]. The root cause is the race below:

 CPU 1                                  CPU 2
 fork                                   hugetlbfs_fallocate
  dup_mmap                               hugetlbfs_punch_hole
   i_mmap_lock_write(mapping);
   vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
   i_mmap_unlock_write(mapping);
   hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
                                         i_mmap_lock_write(mapping);
                                         hugetlb_vmdelete_list
                                          vma_interval_tree_foreach
                                           hugetlb_vma_trylock_write -- Vma_lock is cleared.
   tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
                                           hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
                                         i_mmap_unlock_write(mapping);

hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside the i_mmap_rwsem lock while the vma lock can be used at the same time. Fix this by deferring linking the file vma until the vma is fully initialized. Those vmas should be initialized first before they can be used.
Backport notes:
The first backport attempt (cec11fa2e) was reverted (dd782da4707). This is the new backport of the original fix (35e351780fa9).
35e351780f ("fork: defer linking file vma until vma is fully initialized") fixed a hugetlb locking race by moving a bunch of intialization code to earlier in the function. The call to open() was included in the move but the call to copy_page_range was not, effectively inverting their relative ordering. This created an issue for the vfio code which assumes copy_page_range happens before the call to open() - vfio's open zaps the vma so that the fault handler is invoked later, but when we inverted the ordering, copy_page_range can set up mappings post-zap which would prevent the fault handler from being invoked later. This patch moves the call to copy_page_range to earlier than the call to open() to restore the original ordering of the two functions while keeping the fix for hugetlb intact.
Thanks for your update!
Commit aac6db75a9 made several changes to vfio_pci_core.c, including removing the vfio-pci custom open function. This resolves the issue on the main branch and so we only need to apply these changes when backporting to stable branches.
35e351780f ("fork: defer linking file vma until vma is fully initialized")-> v6.9-rc5 aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") -> v6.10-rc4
Link: https://lkml.kernel.org/r/20240410091441.3539905-1-linmiaohe@huawei.com
Fixes: 8d9bfb260814 ("hugetlb: add vma based lock for pmd sharing")
Signed-off-by: Miaohe Lin linmiaohe@huawei.com
Reported-by: Thorvald Natvig thorvald@google.com
Closes: https://lore.kernel.org/linux-mm/20240129161735.6gmjsswx62o4pbja@revolver/T/ [1]
Reviewed-by: Jane Chu jane.chu@oracle.com
Cc: Christian Brauner brauner@kernel.org
Cc: Heiko Carstens hca@linux.ibm.com
Cc: Kent Overstreet kent.overstreet@linux.dev
Cc: Liam R. Howlett Liam.Howlett@oracle.com
Cc: Mateusz Guzik mjguzik@gmail.com
Cc: Matthew Wilcox (Oracle) willy@infradead.org
Cc: Miaohe Lin linmiaohe@huawei.com
Cc: Muchun Song muchun.song@linux.dev
Cc: Oleg Nesterov oleg@redhat.com
Cc: Peng Zhang zhangpeng.00@bytedance.com
Cc: Tycho Andersen tandersen@netflix.com
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton akpm@linux-foundation.org
Signed-off-by: Miaohe Lin linmiaohe@huawei.com
Signed-off-by: Greg Kroah-Hartman gregkh@linuxfoundation.org
Signed-off-by: Leah Rumancik leah.rumancik@gmail.com
 kernel/fork.c | 27 +++++++++++++--------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 177ce7438db6..122d2cd124d5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -727,6 +727,19 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		} else if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
 		vm_flags_clear(tmp, VM_LOCKED_MASK);
+		/*
+		 * Copy/update hugetlb private vma information.
+		 */
+		if (is_vm_hugetlb_page(tmp))
+			hugetlb_dup_vma_private(tmp);
+
+		if (!(tmp->vm_flags & VM_WIPEONFORK) &&
+		    copy_page_range(tmp, mpnt))
+			goto fail_nomem_vmi_store;
+
+		if (tmp->vm_ops && tmp->vm_ops->open)
+			tmp->vm_ops->open(tmp);
+
 		file = tmp->vm_file;
 		if (file) {
 			struct address_space *mapping = file->f_mapping;
@@ -743,25 +756,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			i_mmap_unlock_write(mapping);
 		}
 
-		/*
-		 * Copy/update hugetlb private vma information.
-		 */
-		if (is_vm_hugetlb_page(tmp))
-			hugetlb_dup_vma_private(tmp);
-
 		/* Link the vma into the MT */
 		if (vma_iter_bulk_store(&vmi, tmp))
 			goto fail_nomem_vmi_store;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
-			retval = copy_page_range(tmp, mpnt);
I have a vague memory that copy_page_range should be called after the vma is inserted into the i_mmap tree. Otherwise there might be a problem:

 dup_mmap                               remove_migration_ptes
  copy_page_range -- Child process copies
   migration entry from parent.
                                         rmap_walk
                                          rmap_walk_file
                                           i_mmap_lock_read(mapping);
                                           vma_interval_tree_foreach
                                            remove_migration_pte -- The vma of the child process
                                             is still invisible, so a migration entry is left in
                                             the child process's address space.
                                           i_mmap_unlock_read(mapping);
  i_mmap_lock_write(mapping);
  vma_interval_tree_insert_after -- Too late! The child process has a stale
   migration entry left while the migration is already done!
  i_mmap_unlock_write(mapping);

But I'm not really sure. I might be missing something. Thanks.
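To make the window concrete, here is a condensed sketch of the file rmap walk that remove_migration_pte() is driven by (simplified from rmap_walk_file(); the demo_* name and the simplified address handling are illustrative, not verbatim kernel code). The walk only visits vmas already linked into mapping->i_mmap, so a child vma that has not been inserted yet keeps whatever migration entries copy_page_range() gave it.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/rmap.h>

/* Condensed shape of the file rmap walk; not verbatim kernel code. */
static void demo_rmap_walk_file(struct folio *folio, struct rmap_walk_control *rwc,
				struct address_space *mapping,
				pgoff_t pgoff_start, pgoff_t pgoff_end)
{
	struct vm_area_struct *vma;

	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff_start, pgoff_end) {
		/*
		 * rmap_one is e.g. remove_migration_pte().  A vma that has
		 * not been inserted into the interval tree yet is simply
		 * never visited here.  (Address computation simplified.)
		 */
		if (!rwc->rmap_one(folio, vma, vma->vm_start, rwc->arg))
			break;
	}
	i_mmap_unlock_read(mapping);
}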
On Mon, Jul 01, 2024 at 09:29:48PM -0700, Leah Rumancik wrote:
From: Miaohe Lin linmiaohe@huawei.com
commit 35e351780fa9d8240dd6f7e4f245f9ea37e96c19 upstream.
Thorvald reported a WARNING [1]. The root cause is the race below:

 CPU 1                                  CPU 2
 fork                                   hugetlbfs_fallocate
  dup_mmap                               hugetlbfs_punch_hole
   i_mmap_lock_write(mapping);
   vma_interval_tree_insert_after -- Child vma is visible through i_mmap tree.
   i_mmap_unlock_write(mapping);
   hugetlb_dup_vma_private -- Clear vma_lock outside i_mmap_rwsem!
                                         i_mmap_lock_write(mapping);
                                         hugetlb_vmdelete_list
                                          vma_interval_tree_foreach
                                           hugetlb_vma_trylock_write -- Vma_lock is cleared.
   tmp->vm_ops->open -- Alloc new vma_lock outside i_mmap_rwsem!
                                           hugetlb_vma_unlock_write -- Vma_lock is assigned!!!
                                         i_mmap_unlock_write(mapping);

hugetlb_dup_vma_private() and hugetlb_vm_op_open() are called outside the i_mmap_rwsem lock while the vma lock can be used at the same time. Fix this by deferring linking the file vma until the vma is fully initialized. Those vmas should be initialized first before they can be used.
Backport notes:
The first backport attempt (cec11fa2e) was reverted (dd782da4707). This is the new backport of the original fix (35e351780fa9).
35e351780f ("fork: defer linking file vma until vma is fully initialized") fixed a hugetlb locking race by moving a bunch of intialization code to earlier in the function. The call to open() was included in the move but the call to copy_page_range was not, effectively inverting their relative ordering. This created an issue for the vfio code which assumes copy_page_range happens before the call to open() - vfio's open zaps the vma so that the fault handler is invoked later, but when we inverted the ordering, copy_page_range can set up mappings post-zap which would prevent the fault handler from being invoked later. This patch moves the call to copy_page_range to earlier than the call to open() to restore the original ordering of the two functions while keeping the fix for hugetlb intact.
Commit aac6db75a9 made several changes to vfio_pci_core.c, including removing the vfio-pci custom open function. This resolves the issue on the main branch and so we only need to apply these changes when backporting to stable branches.
35e351780f ("fork: defer linking file vma until vma is fully initialized")-> v6.9-rc5 aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") -> v6.10-rc4
Is there a strong reason not to take the commit above instead? That way:
a. We stay aligned with upstream, not needing a custom backport.
b. We avoid similar issues in the future.
I'd say aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is a pretty significant change for stable. Then again, mucking around in dup_mmap doesn't seem very clear cut either..
cc'ing Alex from this patch
- leah
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
On Mon, 15 Jul 2024 13:35:41 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
If you were to take those, I think you'd also want:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
which helps avoid a potential regression in VM startup latency vs faulting each page of the VMA. Ideally we'd have had huge_fault working for pfnmaps before this conversion to avoid the latter commit.
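For context, a rough sketch of what "insert the full vma on the first MMIO fault" means in practice. This is a hedged illustration only: the handler name and the assumption that vm_pgoff holds the base pfn are mine, not the actual code of d71a989cf5d9.

#include <linux/mm.h>

/*
 * Hypothetical fault handler: populate every pfn of a VM_PFNMAP range on
 * the first fault so later accesses do not fault again.
 */
static vm_fault_t demo_mmio_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long base_pfn = vma->vm_pgoff;	/* assumption: pgoff holds the base pfn */
	unsigned long addr;

	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
		unsigned long pfn = base_pfn + ((addr - vma->vm_start) >> PAGE_SHIFT);

		/* vmf_insert_pfn() returns VM_FAULT_NOPAGE on success */
		if (vmf_insert_pfn(vma, addr, pfn) != VM_FAULT_NOPAGE)
			return VM_FAULT_SIGBUS;
	}

	return VM_FAULT_NOPAGE;
}

Filling the whole range on the first fault avoids taking a separate fault for every page of a large BAR, which is the VM startup latency concern mentioned above.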
I'm a bit confused by the lineage here though, 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") entered v6.9 whereas these vfio changes all came in v6.10, so why does the v6.6 backport end up with dependencies on these newer commits? Is there something that needs to be fixed in v6.9-stable as well?
Aside from the size of aac6db75a9 in particular, I'm not aware of any outstanding issues that would otherwise dissuade backport to v6.6-stable. Thanks,
Alex
On Mon, Jul 15, 2024 at 3:21 PM Alex Williamson alex.williamson@redhat.com wrote:
On Mon, 15 Jul 2024 13:35:41 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
If you were to take those, I think you'd also want:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
which helps avoid a potential regression in VM startup latency vs faulting each page of the VMA. Ideally we'd have had huge_fault working for pfnmaps before this conversion to avoid the latter commit.
I'm a bit confused by the lineage here though, 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") entered v6.9 whereas these vfio changes all came in v6.10, so why does the v6.6 backport end up with dependencies on these newer commits? Is there something that needs to be fixed in v6.9-stable as well?
Right, I believe 35e351780fa9 introduced a bug for VFIO by calling vm_ops->open() *before* copy_page_range(). So I think this bug affects not just 6.6 (to which 35e351780fa9 was stable backported) but also 6.9 as you say.
The reason to bring up all these newer commits is, it's unclear how to fix the bug. :) We thought we had a simple solution to just reorder when vm_ops->open() is called, but Miaohe pointed out elsewhere in this thread an issue with doing that.
Assuming the reordering is unworkable, the only other idea I have for fixing the bug without the larger refactor is:
1. Mark VFIO VMAs VM_WIPEONFORK so we don't copy_page_range after vm_ops->open() is called (a sketch of this follows below).
2. Remove the WARN_ON_ONCE(1) in get_pat_info() so when VFIO zaps a not-fully-populated range (expected if we never copy_page_range!) we don't get a warning.
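A minimal sketch of idea 1, assuming a driver mmap handler; the demo_* names are hypothetical and this is not the actual vfio-pci code.

#include <linux/mm.h>

static const struct vm_operations_struct demo_vm_ops;	/* .fault omitted for brevity */

/*
 * Hypothetical mmap handler: VM_WIPEONFORK makes dup_mmap() skip
 * copy_page_range() for this vma, so the child starts with an empty
 * mapping regardless of when ->open() runs.
 */
static int demo_mmap(struct file *file, struct vm_area_struct *vma)
{
	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_WIPEONFORK);
	vma->vm_ops = &demo_vm_ops;
	return 0;
}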
There are downsides to this fix. It's kind of abusing VM_WIPEONFORK for a new purpose. It's removing a warning which may catch other legitimate problems. And it's diverging stable kernels from upstream as Sasha points out.
Just backporting the refactors fixes (well, totally avoids) the bug, and it doesn't require special hackery only for stable kernels.
Aside from the size of aac6db75a9 in particular, I'm not aware of any outstanding issues that would otherwise dissuade backport to v6.6-stable. Thanks,
Alex
On Mon, 15 Jul 2024 18:06:25 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
On Mon, Jul 15, 2024 at 3:21 PM Alex Williamson alex.williamson@redhat.com wrote:
On Mon, 15 Jul 2024 13:35:41 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
If you were to take those, I think you'd also want:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
which helps avoid a potential regression in VM startup latency vs faulting each page of the VMA. Ideally we'd have had huge_fault working for pfnmaps before this conversion to avoid the latter commit.
I'm a bit confused by the lineage here though, 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") entered v6.9 whereas these vfio changes all came in v6.10, so why does the v6.6 backport end up with dependencies on these newer commits? Is there something that needs to be fixed in v6.9-stable as well?
Right, I believe 35e351780fa9 introduced a bug for VFIO by calling vm_ops->open() *before* copy_page_range(). So I think this bug affects not just 6.6 (to which 35e351780fa9 was stable backported) but also 6.9 as you say.
The reason to bring up all these newer commits is, it's unclear how to fix the bug. :) We thought we had a simple solution to just reorder when vm_ops->open() is called, but Miaohe pointed out elsewhere in this thread an issue with doing that.
Assuming the reordering is unworkable, the only other idea I have for fixing the bug without the larger refactor is:
1. Mark VFIO VMAs VM_WIPEONFORK so we don't copy_page_range after vm_ops->open() is called
2. Remove the WARN_ON_ONCE(1) in get_pat_info() so when VFIO zaps a not-fully-populated range (expected if we never copy_page_range!) we don't get a warning
There are downsides to this fix. It's kind of abusing VM_WIPEONFORK for a new purpose. It's removing a warning which may catch other legitimate problems. And it's diverging stable kernels from upstream as Sasha points out.
Just backporting the refactors fixes (well, totally avoids) the bug, and it doesn't require special hackery only for stable kernels.
Yes, I'd agree that we want to stay as close as possible to the current upstream solution, even if we got there pretty haphazardly. Therefore it sounds like we should queue the following for v6.9-stable:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
aac6db75a9fc ("vfio/pci: Use unmap_mapping_range()")
b7c5e64fecfa ("vfio: Create vfio_fs_type with inode per device")
And then anywhere that 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") gets backported, those will also need to follow.
Did anyone report an issue with 35e351780fa9 and vfio on v6.9 or the previous v6.6 backport to use as a test case or do we just know it's an issue from inspection? The revert only notes an xfstest issue. Thanks,
Alex
On Tue, Jul 16, 2024 at 9:08 AM Alex Williamson alex.williamson@redhat.com wrote:
On Mon, 15 Jul 2024 18:06:25 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
On Mon, Jul 15, 2024 at 3:21 PM Alex Williamson alex.williamson@redhat.com wrote:
On Mon, 15 Jul 2024 13:35:41 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
If you were to take those, I think you'd also want:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
which helps avoid a potential regression in VM startup latency vs faulting each page of the VMA. Ideally we'd have had huge_fault working for pfnmaps before this conversion to avoid the latter commit.
I'm a bit confused by the lineage here though, 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") entered v6.9 whereas these vfio changes all came in v6.10, so why does the v6.6 backport end up with dependencies on these newer commits? Is there something that needs to be fixed in v6.9-stable as well?
Right, I believe 35e351780fa9 introduced a bug for VFIO by calling vm_ops->open() *before* copy_page_range(). So I think this bug affects not just 6.6 (to which 35e351780fa9 was stable backported) but also 6.9 as you say.
The reason to bring up all these newer commits is, it's unclear how to fix the bug. :) We thought we had a simple solution to just reorder when vm_ops->open() is called, but Miaohe pointed out elsewhere in this thread an issue with doing that.
Assuming the reordering is unworkable, the only other idea I have for fixing the bug without the larger refactor is:
1. Mark VFIO VMAs VM_WIPEONFORK so we don't copy_page_range after vm_ops->open() is called
2. Remove the WARN_ON_ONCE(1) in get_pat_info() so when VFIO zaps a not-fully-populated range (expected if we never copy_page_range!) we don't get a warning
There are downsides to this fix. It's kind of abusing VM_WIPEONFORK for a new purpose. It's removing a warning which may catch other legitimate problems. And it's diverging stable kernels from upstream as Sasha points out.
Just backporting the refactors fixes (well, totally avoids) the bug, and it doesn't require special hackery only for stable kernels.
Yes, I'd agree that we want to stay as close as possible to the current upstream solution, even if we got there pretty haphazardly. Therefore it sounds like we should queue the following for v6.9-stable:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
aac6db75a9fc ("vfio/pci: Use unmap_mapping_range()")
b7c5e64fecfa ("vfio: Create vfio_fs_type with inode per device")
And then anywhere that 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") gets backported, those will also need to follow.
Sounds good to me. I can send these patches for 6.9 and then 6.6.
Did anyone report an issue with 35e351780fa9 and vfio on v6.9 or the previous v6.6 backport to use as a test case or do we just know it's an issue from inspection? The revert only notes an xfstest issue. Thanks,
I'm not aware of any reports of this, besides our own detection internally.
We originally noticed via xfstests the failure mode where we call copy_page_range, so underneath untrack_pfn we find a 'hole' in the mapping and WARN. A fair question is, why does running xfstests involve exercising vfio-pci? :) Internally our test machines use vfio-pci for other reasons; xfstests is an innocent bystander here. We just happened to trigger this WARN while xfstests was running, so it noticed and reported the WARN in the test results.
Since that repro is specific to our test machine setup, it unfortunately isn't an easily shareable regression test. :/
Alex
On Tue, 16 Jul 2024 09:58:34 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
On Tue, Jul 16, 2024 at 9:08 AM Alex Williamson alex.williamson@redhat.com wrote:
On Mon, 15 Jul 2024 18:06:25 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
On Mon, Jul 15, 2024 at 3:21 PM Alex Williamson alex.williamson@redhat.com wrote:
On Mon, 15 Jul 2024 13:35:41 -0700 Axel Rasmussen axelrasmussen@google.com wrote:
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient, we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
If you were to take those, I think you'd also want:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
which helps avoid a potential regression in VM startup latency vs faulting each page of the VMA. Ideally we'd have had huge_fault working for pfnmaps before this conversion to avoid the latter commit.
I'm a bit confused by the lineage here though, 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") entered v6.9 whereas these vfio changes all came in v6.10, so why does the v6.6 backport end up with dependencies on these newer commits? Is there something that needs to be fixed in v6.9-stable as well?
Right, I believe 35e351780fa9 introduced a bug for VFIO by calling vm_ops->open() *before* copy_page_range(). So I think this bug affects not just 6.6 (to which 35e351780fa9 was stable backported) but also 6.9 as you say.
The reason to bring up all these newer commits is, it's unclear how to fix the bug. :) We thought we had a simple solution to just reorder when vm_ops->open() is called, but Miaohe pointed out elsewhere in this thread an issue with doing that.
Assuming the reordering is unworkable, the only other idea I have for fixing the bug without the larger refactor is:
1. Mark VFIO VMAs VM_WIPEONFORK so we don't copy_page_range after vm_ops->open() is called
2. Remove the WARN_ON_ONCE(1) in get_pat_info() so when VFIO zaps a not-fully-populated range (expected if we never copy_page_range!) we don't get a warning
There are downsides to this fix. It's kind of abusing VM_WIPEONFORK for a new purpose. It's removing a warning which may catch other legitimate problems. And it's diverging stable kernels from upstream as Sasha points out.
Just backporting the refactors fixes (well, totally avoids) the bug, and it doesn't require special hackery only for stable kernels.
Yes, I'd agree that we want to stay as close as possible to the current upstream solution, even if we got there pretty haphazardly. Therefore it sounds like we should queue the following for v6.9-stable:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
aac6db75a9fc ("vfio/pci: Use unmap_mapping_range()")
b7c5e64fecfa ("vfio: Create vfio_fs_type with inode per device")
And then anywhere that 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") gets backported, those will also need to follow.
Sounds good to me. I can send these patches for 6.9 and then 6.6.
Did anyone report an issue with 35e351780fa9 and vfio on v6.9 or the previous v6.6 backport to use as a test case or do we just know it's an issue from inspection? The revert only notes an xfstest issue. Thanks,
I'm not aware of any reports of this, besides our own detection internally.
We originally noticed via xfstests the failure mode where we call copy_page_range, so underneath untrack_pfn we find a 'hole' in the mapping and WARN. A fair question is, why does running xfstests involve exercising vfio-pci? :) Internally our test machines use vfio-pci for other reasons; xfstests is an innocent bystander here. We just happened to trigger this WARN while xfstests was running, so it noticed and reported the WARN in the test results.
Since that repro is specific to our test machine setup, it unfortunately isn't an easily shareable regression test. :/
Aha, that helps put the pieces together and is an interesting data point for vfio-pci use :) Sounds like you're then best able to verify the issue exists on v6.9 and the fix with the above three vfio patches, assuming you've got the bandwidth. Let me know if you need any support. Thanks!
Alex