On Mon, 15 Jul 2024 18:06:25 -0700 Axel Rasmussen <axelrasmussen@google.com> wrote:
On Mon, Jul 15, 2024 at 3:21 PM Alex Williamson <alex.williamson@redhat.com> wrote:
On Mon, 15 Jul 2024 13:35:41 -0700 Axel Rasmussen <axelrasmussen@google.com> wrote:
I tried out Sasha's suggestion. Note that *just* taking aac6db75a9 ("vfio/pci: Use unmap_mapping_range()") is not sufficient; we also need b7c5e64fec ("vfio: Create vfio_fs_type with inode per device").
But, the good news is both of those apply more or less cleanly to 6.6. And, at least under a very basic test which exercises VFIO memory mapping, things seem to work properly with that change.
I would agree with Leah that these seem a bit big to be stable fixes. But, I'm encouraged by the fact that Sasha suggested taking them. If there are no big objections (Alex? :) ) I can send the backport patches this week.
If you were to take those, I think you'd also want:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
which helps avoid a potential regression in VM startup latency vs faulting each page of the VMA. Ideally we'd have had huge_fault working for pfnmaps before this conversion to avoid the latter commit.
I'm a bit confused by the lineage here though, 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") entered v6.9 whereas these vfio changes all came in v6.10, so why does the v6.6 backport end up with dependencies on these newer commits? Is there something that needs to be fixed in v6.9-stable as well?
Right, I believe 35e351780fa9 introduced a bug for VFIO by calling vm_ops->open() *before* copy_page_range(). So I think this bug affects not just 6.6 (to which 35e351780fa9 was stable backported) but also 6.9 as you say.
The reason to bring up all these newer commits is, it's unclear how to fix the bug. :) We thought we had a simple solution to just reorder when vm_ops->open() is called, but Miaohe pointed out elsewhere in this thread an issue with doing that.
Assuming the reordering is unworkable, the only other idea I have for fixing the bug without the larger refactor is:
1. Mark VFIO VMAs VM_WIPEONFORK so we don't copy_page_range after vm_ops->open() is called
2. Remove the WARN_ON_ONCE(1) in get_pat_info() so when VFIO zaps a not-fully-populated range (expected if we never copy_page_range!) we don't get a warning
There are downsides to this fix. It's kind of abusing VM_WIPEONFORK for a new purpose. It's removing a warning which may catch other legitimate problems. And it's diverging stable kernels from upstream as Sasha points out.
Just backporting the refactors fixes (well, totally avoids) the bug, and it doesn't require special hackery only for stable kernels.
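For readers following along, the ordering problem discussed in this thread can be illustrated with a toy model. This is only a sketch, not kernel code: the `Vma` class, the `fork_vma()` helper, and the assumption that the pre-refactor VFIO open handler zaps the child's range are all hypothetical stand-ins inferred from the discussion.

```python
# Toy model only: names echo the kernel discussion, but none of this is kernel code.

class Vma:
    def __init__(self, ptes=0):
        self.ptes = ptes      # number of populated page-table entries
        self.warnings = []    # stand-in for WARN_ON_ONCE() output


def vfio_open(child):
    """Stand-in for a vm_ops->open() that zaps the new VMA (assumed behaviour)."""
    if child.ptes == 0:
        # Models the zap path finding nothing populated in the range.
        child.warnings.append("zap of unpopulated range")
    child.ptes = 0


def copy_page_range(parent, child):
    """Stand-in for copying the parent's populated entries into the child."""
    child.ptes = parent.ptes


def fork_vma(parent, open_first):
    """Duplicate a VMA, calling open() either before or after the copy."""
    child = Vma()
    if open_first:
        vfio_open(child)                # zap runs on an empty child: warning
        copy_page_range(parent, child)  # entries the zap meant to drop come back
    else:
        copy_page_range(parent, child)
        vfio_open(child)                # zap runs on a populated child: no warning
    return child
```

In this model, calling open() after the copy leaves the child empty with no warning, while calling open() first produces the warning and leaves the child with copied entries, which is roughly the pair of symptoms the two workarounds above try to paper over.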
Yes, I'd agree that we want to stay as close as possible to the current upstream solution, even if we got there pretty haphazardly. Therefore it sounds like we should queue the following for v6.9-stable:
d71a989cf5d9 ("vfio/pci: Insert full vma on mmap'd MMIO fault")
aac6db75a9fc ("vfio/pci: Use unmap_mapping_range()")
b7c5e64fecfa ("vfio: Create vfio_fs_type with inode per device")
And then anywhere that 35e351780fa9 ("fork: defer linking file vma until vma is fully initialized") gets backported, those will also need to follow.
Did anyone report an issue with 35e351780fa9 and vfio on v6.9 or the previous v6.6 backport to use as a test case or do we just know it's an issue from inspection? The revert only notes an xfstest issue. Thanks,
Alex