On Wed, Jun 10, 2026 at 04:43:21PM +0100, Matt Evans wrote:
Hi Matt,
[...]
> + *
> + * With the goal of taking vdev->memory_lock in a world where
> + * vdev might not still exist:
> + *
> + * 1. Take the resv lock on the DMABUF:
> + * - If racing cleanup got in first, the buffer is revoked;
> + * stop/exit if so.
> + * - If we got in first, the buffer is not revoked so vdev is
> + * non-NULL, accessible, and cleanup _has not yet put the
> + * VFIO device registration_. So, the device refcount must
> + * be >0.
> + *
> + * 2. Take vfio_device registration (refcount guaranteed >0
> + * hereafter).
> + *
> + * 3. Unlock the DMABUF's resv lock:
> + * - A racing cleanup can now complete.
> + * - But, the device refcount >0, meaning the vfio_device
> + * (and vfio_pcie_core device vdev) have not yet been
> + * freed. vdev is accessible, even if the DMABUF has been
> + * revoked or cleanup has happened, because
> + * vfio_unregister_group_dev() can't complete.
> + *
> + * 4. Take the vdev->memory_lock
> + * - Either the DMABUF is usable, or has been cleaned up.
> + * Whichever, it can no longer change under us.
> + * - Test the DMABUF revocation status again: if it was
> + * revoked between 1 and 4 return a SIGBUS. Otherwise,
> + * return a PFN.
> + * - It's not necessary to also take the resv lock, because
> + * the status/vdev can't change while memory_lock is held.
> + *
> + * 5. Unlock, done.
> */
> +
> + dma_resv_lock(priv->dmabuf->resv, NULL);
> +
> + if (priv->revoked) {
> + pr_debug_ratelimited("%s VA 0x%lx, pgoff 0x%lx: DMABUF revoked/cleaned up\n",
> + __func__, vmf->address, vma->vm_pgoff);
> + dma_resv_unlock(priv->dmabuf->resv);
> + return VM_FAULT_SIGBUS;
> + }
> +
> + /* If the buffer isn't revoked, vdev is valid */
> vdev = priv->vdev;
>
> + if (!vfio_device_try_get_registration(&vdev->vdev)) {
> + /*
> + * If vdev != NULL (above), the registration should
> + * already be >0 and so this try_get should never
> + * fail.
> + */
> + dev_warn(&vdev->pdev->dev, "%s: Unexpected registration failure\n",
> + __func__);
> + dma_resv_unlock(priv->dmabuf->resv);
> + return VM_FAULT_SIGBUS;
> + }
> + dma_resv_unlock(priv->dmabuf->resv);
> +
> scoped_guard(rwsem_read, &vdev->memory_lock) {
> + /* Revocation status must be re-read, under memory_lock */
> if (!priv->revoked) {
> int pres = vfio_pci_dma_buf_find_pfn(priv, vma,
> vmf->address,
Wait, I noticed that the is_aligned_for_order() check from mainline was
removed here. Was that intentional?
For hugepage faults (order > 0), we must ensure the PFN and address are
properly aligned before calling vfio_pci_vmf_insert_pfn().
In the current upstream code, we have:
	if (is_aligned_for_order(vma, addr, pfn, order))
Should we restore that check here?
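For reference, the kind of check in question can be modelled in userspace C roughly as below. This is a minimal sketch, assuming 4 KiB pages; `sketch_vma` and `sketch_aligned_for_order()` are hypothetical names and not the kernel's definitions. Note that the comment on vfio_pci_dma_buf_find_pfn() elsewhere in this series lists VMA-edge and PFN-alignment failures among its retry cases, so the check may have moved into that helper rather than been dropped.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12UL
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Hypothetical stand-in for the VMA bounds used by a fault handler. */
struct sketch_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * True if a mapping of the given order can be inserted for addr:
 * the naturally aligned block containing addr must lie inside the
 * VMA, and the PFN must be aligned to the same order. Order 0 always
 * passes the PFN test, since every PFN is a multiple of 1.
 */
static bool sketch_aligned_for_order(const struct sketch_vma *vma,
				     unsigned long addr, unsigned long pfn,
				     unsigned int order)
{
	unsigned long size, base;

	/* Keeps the shifts in range even where long is 32 bits wide. */
	assert(order <= 18);

	size = SKETCH_PAGE_SIZE << order;
	base = addr & ~(size - 1);

	if (base < vma->vm_start || base + size > vma->vm_end)
		return false;
	return (pfn & ((1UL << order) - 1)) == 0;
}
```

With a VMA that starts on a 2 MiB boundary, an order-9 fault succeeds only when the PFN is also a multiple of 512; otherwise the caller would fall back to a smaller order.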
> @@ -1766,6 +1827,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
> __func__, order, pfn, vmf->address,
> vma->vm_pgoff, (unsigned int)ret);
>
> + vfio_device_put_registration(&vdev->vdev);
> return ret;
> }
>
> @@ -1774,7 +1836,7 @@ static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf)
> return vfio_pci_mmap_huge_fault(vmf, 0);
> }
>
> -static const struct vm_operations_struct vfio_pci_mmap_ops = {
> +const struct vm_operations_struct vfio_pci_mmap_ops = {
> .fault = vfio_pci_mmap_page_fault,
Nit: Instead of making this global, should we add a helper? E.g.:
	void vfio_pci_set_vma_ops(struct vm_area_struct *vma)
	{
		vma->vm_ops = &vfio_pci_mmap_ops;
	}
[...]
> +
> +static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
> +{
> + struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> + /*
> + * If we observe that the buffer is revoked now then refuse
> + * the mmap(). This is a belt-and-braces early failure to
> + * ease debugging a revoked buffer being used. Userspace
> + * might also race an mmap() against an explicit revocation,
> + * or an action doing a temporary revoke; race scenarios are
> + * still safe because the fault handler ultimately prevents
> + * access to a revoked buffer if it isn't caught here.
> + */
> + if (READ_ONCE(priv->revoked))
> + return -ENODEV;
> + if ((vma->vm_flags & VM_SHARED) == 0)
> + return -EINVAL;
> +
> + /*
> + * dma_buf_mmap_internal() has asserted that the VMA is
> + * contained within the DMABUF size before calling this.
> + */
> +
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
> +
> + /* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED. */
> + vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
> + VM_DONTEXPAND | VM_DONTDUMP);
> + vma->vm_private_data = priv;
> + vma->vm_ops = &vfio_pci_mmap_ops;
> +
> + return 0;
> +}
> #endif /* CONFIG_VFIO_PCI_DMABUF */
>
Thanks,
Praan
On Fri, Jun 12, 2026 at 04:22:12PM +0100, Matt Evans wrote:
> Hi Pranjal,
>
> On 12/06/2026 11:41, Pranjal Shrivastava wrote:
> > On Wed, Jun 10, 2026 at 04:43:18PM +0100, Matt Evans wrote:
> >> Convert the VFIO device fd fops->mmap to create a DMABUF representing
> >> the BAR mapping, and make the VMA fault handler look up PFNs from the
> >> corresponding DMABUF. This supports future code mmap()ing BAR
> >> DMABUFs, and iommufd work to support Type1 P2P.
> >>
> >> First, vfio_pci_core_mmap() uses the new
> >> vfio_pci_core_mmap_prep_dmabuf() helper to export a DMABUF
> >> representing a single BAR range. Then, the vfio_pci_mmap_huge_fault()
> >> callback is updated to understand revoked buffers, and uses the new
> >> vfio_pci_dma_buf_find_pfn() helper to determine the PFN for a given
> >> fault address.
> >>
> >> Now that the VFIO DMABUFs can be mmap()ed, vfio_pci_dma_buf_move()
> >> zaps PTEs (used on the revocation and cleanup paths).
> >>
> >> CONFIG_VFIO_PCI_CORE now unconditionally depends on
> >> CONFIG_DMA_SHARED_BUFFER and CONFIG_PCI_P2PDMA_CORE. The
> >> CONFIG_VFIO_PCI_DMABUF feature conditionally includes support for
> >> VFIO_DEVICE_FEATURE_DMA_BUF, depending on the availability of
> >> CONFIG_PCI_P2PDMA.
> >>
> >> Signed-off-by: Matt Evans <matt(a)ozlabs.org>
> >> ---
> >> drivers/vfio/pci/Kconfig | 5 +-
> >> drivers/vfio/pci/Makefile | 3 +-
> >> drivers/vfio/pci/vfio_pci_core.c | 75 +++++++++++++++++++-----------
> >> drivers/vfio/pci/vfio_pci_dmabuf.c | 12 +++++
> >> drivers/vfio/pci/vfio_pci_priv.h | 11 +----
> >> 5 files changed, 67 insertions(+), 39 deletions(-)
Hi Matt,
[...]
> >> int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
> >> struct vm_area_struct *vma,
> >> @@ -532,6 +538,10 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> >> struct vfio_pci_dma_buf *tmp;
> >>
> >> lockdep_assert_held_write(&vdev->memory_lock);
> >> + /*
> >> + * Holding memory_lock ensures a racing VMA fault observes
> >> + * priv->revoked properly.
> >> + */
> >
> > Nit: This comment should appear before the lockdep_assert_held_write()
> > Also, it is slightly verbose.. (not against it though).
>
> Right, I'll move it. Agree it's wordy but if anyone changes that I want
> them to "think faulthandler".
>
That's fair I guess.
> >> list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> >> if (!get_file_active(&priv->dmabuf->file))
> >> @@ -549,6 +559,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> >> if (revoked) {
> >> kref_put(&priv->kref, vfio_pci_dma_buf_done);
> >> wait_for_completion(&priv->comp);
> >> + unmap_mapping_range(priv->dmabuf->file->f_mapping,
> >> + 0, priv->size, 1);
> >
> > Have we run this series with lockdep enabled?
> > I guess it'd be nice to check with lockdep once..
>
> I've (generally) always run testing of this series with lockdep. (No
> issues (anymore).)
That sounds good! Thanks for confirming! :)
Praan
Most of this patch series has already been pushed upstream; this is just
the second half of the series that has not been pushed yet, plus some
additional changes that were required to address feedback from the
mailing list. This patch series is originally from Asahi and was
previously posted by Daniel Almeida.
The previous version of the patch series can be found here:
https://patchwork.freedesktop.org/series/164580/
Branch with patches applied available here:
https://gitlab.freedesktop.org/lyudess/linux/-/commits/rust/gem-shmem
This patch series applies on top of drm-rust-next
Patch-series wide changes since V15:
* Fix some major rebasing errors I somehow didn't notice :(
* Drop the dependency on LazyInit, use the trick that Alice suggested
instead.
* Fix dependency ordering so that Tyr can get the vmap stuff first
without the other bits.
Patch-series wide changes since V16:
* Fix ordering one more time (SetOnce::reset() doesn't need to come
before adding vmap functions)
* Rebase against the latest DeviceContext changes from me that got
pushed.
Patch-series wide changes since V20:
* Lots of Sashiko fixes, excluding the comments that I couldn't prove
weren't just bogus.
Lyude Paul (4):
rust: drm: gem: shmem: Add DmaResvGuard helper
rust: drm: gem: shmem: Add vmap functions
rust: faux: Allow retrieving a bound Device
rust: drm: gem: Introduce shmem::Object::sg_table()
rust/kernel/drm/gem/shmem.rs | 547 ++++++++++++++++++++++++++++++++++-
rust/kernel/faux.rs | 18 +-
2 files changed, 549 insertions(+), 16 deletions(-)
base-commit: 550dc7536644db2d67c6f8cf525bba682fba08d9
--
2.54.0
On Wed, Jun 10, 2026 at 04:43:20PM +0100, Matt Evans wrote:
> Previously, vfio_pci_zap_bars() (and the wrapper
> vfio_pci_zap_and_down_write_memory_lock()) calls were paired with
> calls to vfio_pci_dma_buf_move().
>
> This commit replaces them with a unified new function,
> vfio_pci_zap_revoke_bars() containing both the vfio_pci_dma_buf_move()
> and the unmap_mapping_range(), making it harder for callers to omit
> one. It adds a wrapper, vfio_pci_lock_zap_revoke_bars(), which takes
> the write memory_lock before zapping, and adds a new
> vfio_pci_unrevoke_bars() for the re-enable path.
>
> As of "vfio/pci: Convert BAR mmap() to use a DMABUF", the
> unmap_mapping_range() to zap is no longer performed for vfio-pci since
> the DMABUFs used for BAR mappings already zap PTEs when the
> vfio_pci_dma_buf_move() occurs.
>
> However, it must be assumed that VFIO drivers which override the .mmap
> op could create mappings _not_ backed by DMABUFs. So, the zap is
> still performed on revoke if .mmap is overridden, using a new
> zap_bars_on_revoke flag. A driver can explicitly opt out; the flag is
> cleared by the hisi_acc_vfio_pci driver, since its .mmap just wraps
> vfio_pci_core_mmap() and so still uses DMABUFs.
>
> Signed-off-by: Matt Evans <matt(a)ozlabs.org>
> ---
> .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c | 8 +++
> drivers/vfio/pci/vfio_pci_config.c | 30 ++++----
> drivers/vfio/pci/vfio_pci_core.c | 70 +++++++++++++------
> drivers/vfio/pci/vfio_pci_priv.h | 3 +-
> include/linux/vfio_pci_core.h | 1 +
> 5 files changed, 73 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> index 86362ec424a5..51990f6d66d5 100644
> --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> @@ -1692,6 +1692,14 @@ static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device
> if (ret)
> goto out_put_vdev;
>
> + /*
> + * hisi_acc_vfio_pci_mmap() calls down to
> + * vfio_pci_core_mmap(), so BAR mappings are still
> + * DMABUF-backed. They don't require a zap on revoke, so opt
> + * out:
> + */
> + hisi_acc_vdev->core_device.zap_bars_on_revoke = false;
> +
This happens after vfio_pci_core_register_device() has been called, which
could be slightly problematic if another device in the same group races
to trigger a hot reset before we set this to false. Could we
initialize this flag before registration instead?
> hisi_acc_vfio_debug_init(hisi_acc_vdev);
> return 0;
>
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index a10ed733f0e3..8bfab0da481c 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -590,12 +590,10 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
> virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
> new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
>
> - if (!new_mem) {
> - vfio_pci_zap_and_down_write_memory_lock(vdev);
> - vfio_pci_dma_buf_move(vdev, true);
> - } else {
> + if (!new_mem)
> + vfio_pci_lock_zap_revoke_bars(vdev);
> + else
> down_write(&vdev->memory_lock);
> - }
>
> /*
> * If the user is writing mem/io enable (new_mem/io) and we
> @@ -631,7 +629,7 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
> *virt_cmd |= cpu_to_le16(new_cmd & mask);
>
> if (__vfio_pci_memory_enabled(vdev))
> - vfio_pci_dma_buf_move(vdev, false);
> + vfio_pci_unrevoke_bars(vdev);
> up_write(&vdev->memory_lock);
> }
>
> @@ -712,16 +710,14 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
> static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
> pci_power_t state)
> {
> - if (state >= PCI_D3hot) {
> - vfio_pci_zap_and_down_write_memory_lock(vdev);
> - vfio_pci_dma_buf_move(vdev, true);
> - } else {
> + if (state >= PCI_D3hot)
> + vfio_pci_lock_zap_revoke_bars(vdev);
> + else
> down_write(&vdev->memory_lock);
> - }
>
> vfio_pci_set_power_state(vdev, state);
> if (__vfio_pci_memory_enabled(vdev))
> - vfio_pci_dma_buf_move(vdev, false);
> + vfio_pci_unrevoke_bars(vdev);
> up_write(&vdev->memory_lock);
> }
>
> @@ -908,11 +904,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
> &cap);
>
> if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
> - vfio_pci_zap_and_down_write_memory_lock(vdev);
> - vfio_pci_dma_buf_move(vdev, true);
> + vfio_pci_lock_zap_revoke_bars(vdev);
> pci_try_reset_function(vdev->pdev);
> if (__vfio_pci_memory_enabled(vdev))
> - vfio_pci_dma_buf_move(vdev, false);
> + vfio_pci_unrevoke_bars(vdev);
> up_write(&vdev->memory_lock);
> }
> }
> @@ -993,11 +988,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
> &cap);
>
> if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
> - vfio_pci_zap_and_down_write_memory_lock(vdev);
> - vfio_pci_dma_buf_move(vdev, true);
> + vfio_pci_lock_zap_revoke_bars(vdev);
> pci_try_reset_function(vdev->pdev);
> if (__vfio_pci_memory_enabled(vdev))
> - vfio_pci_dma_buf_move(vdev, false);
> + vfio_pci_unrevoke_bars(vdev);
> up_write(&vdev->memory_lock);
> }
> }
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index f9636d8f9e2a..5ea0bd4e7876 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -319,8 +319,7 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
> * The vdev power related flags are protected with 'memory_lock'
> * semaphore.
> */
> - vfio_pci_zap_and_down_write_memory_lock(vdev);
> - vfio_pci_dma_buf_move(vdev, true);
> + vfio_pci_lock_zap_revoke_bars(vdev);
>
> if (vdev->pm_runtime_engaged) {
> up_write(&vdev->memory_lock);
> @@ -406,7 +405,7 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
> down_write(&vdev->memory_lock);
> __vfio_pci_runtime_pm_exit(vdev);
> if (__vfio_pci_memory_enabled(vdev))
> - vfio_pci_dma_buf_move(vdev, false);
> + vfio_pci_unrevoke_bars(vdev);
> up_write(&vdev->memory_lock);
> }
>
> @@ -1256,6 +1255,8 @@ static int vfio_pci_ioctl_set_irqs(struct vfio_pci_core_device *vdev,
> return ret;
> }
>
> +static void vfio_pci_zap_revoke_bars(struct vfio_pci_core_device *vdev);
> +
> static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
> void __user *arg)
> {
> @@ -1264,7 +1265,7 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
> if (!vdev->reset_works)
> return -EINVAL;
>
> - vfio_pci_zap_and_down_write_memory_lock(vdev);
> + down_write(&vdev->memory_lock);
>
> /*
> * This function can be invoked while the power state is non-D0. If
> @@ -1277,10 +1278,11 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
> */
> vfio_pci_set_power_state(vdev, PCI_D0);
>
> - vfio_pci_dma_buf_move(vdev, true);
> + vfio_pci_zap_revoke_bars(vdev);
I'm wondering if this change in behavior is correct?
BEFORE this patch the sequence was:
1. zap vma mappings
2. Enter D0
After this patch the sequence becomes
1. Take the lock
2. Enter D0
3. zap vma mappings
My worry is if user-space accesses a BAR *during* the transition to D0,
it could crash since the mappings still exist during the transition?
The old code is immune to it because it removed user-mappings first.
Following the discussion from v1 regarding the ordering of
vfio_pci_dma_buf_move() and the D0 transition: while it makes sense to
perform the DMABUF revocation/move after the hardware is in D0, I'm not
too confident about moving the zap after D0 :/
I mean, sure, the user would just see all Fs on a read, and writes would
be dropped silently until we are in D0. But before this change, a user
access would fault and block on the memory_lock instead, which ensures
that the user observes a consistent device state.
> +
> ret = pci_try_reset_function(vdev->pdev);
> if (__vfio_pci_memory_enabled(vdev))
> - vfio_pci_dma_buf_move(vdev, false);
> + vfio_pci_unrevoke_bars(vdev);
> up_write(&vdev->memory_lock);
>
> return ret;
> @@ -1648,20 +1650,37 @@ ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *bu
> }
Thanks,
Praan
Most of this patch series has already been pushed upstream; this is just
the second half of the series that has not been pushed yet, plus some
additional changes that were required to address feedback from the
mailing list. This patch series is originally from Asahi and was
previously posted by Daniel Almeida.
The previous version of the patch series can be found here:
https://patchwork.freedesktop.org/series/164580/
Branch with patches applied available here:
https://gitlab.freedesktop.org/lyudess/linux/-/commits/rust/gem-shmem
This patch series applies on top of drm-rust-next
Patch-series wide changes since V15:
* Fix some major rebasing errors I somehow didn't notice :(
* Drop the dependency on LazyInit, use the trick that Alice suggested
instead.
* Fix dependency ordering so that Tyr can get the vmap stuff first
without the other bits.
Patch-series wide changes since V16:
* Fix ordering one more time (SetOnce::reset() doesn't need to come
before adding vmap functions)
* Rebase against the latest DeviceContext changes from me that got
pushed.
Patch-series wide changes since V20:
* Lots of Sashiko fixes, excluding the comments that I couldn't prove
weren't just bogus.
Lyude Paul (4):
rust: drm: gem: shmem: Add DmaResvGuard helper
rust: drm: gem: shmem: Add vmap functions
rust: faux: Allow retrieving a bound Device
rust: drm: gem: Introduce shmem::Object::sg_table()
rust/kernel/drm/gem/shmem.rs | 546 ++++++++++++++++++++++++++++++++++-
rust/kernel/faux.rs | 16 +-
2 files changed, 546 insertions(+), 16 deletions(-)
base-commit: 848bf57e98e1678ce7a49eb4e0bf0502da95dc07
--
2.54.0
On Wed, Jun 10, 2026 at 04:43:16PM +0100, Matt Evans wrote:
> Add vfio_pci_dma_buf_find_pfn(), which a VMA fault handler can use to
> find a PFN.
>
> This supports multi-range DMABUFs, which typically would be used to
> represent scattered spans but might even represent overlapping or
> aliasing spans of PFNs.
>
> Because this is intended to be used in vfio_pci_core.c, we also need
> to expose the struct vfio_pci_dma_buf in the vfio_pci_priv.h header.
>
> Signed-off-by: Matt Evans <matt(a)ozlabs.org>
> ---
> drivers/vfio/pci/vfio_pci_dmabuf.c | 137 ++++++++++++++++++++++++++---
> drivers/vfio/pci/vfio_pci_priv.h | 20 +++++
> 2 files changed, 144 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index c16f460c01d6..9e5e865f6fb6 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -9,19 +9,6 @@
>
> MODULE_IMPORT_NS("DMA_BUF");
>
> -struct vfio_pci_dma_buf {
> - struct dma_buf *dmabuf;
> - struct vfio_pci_core_device *vdev;
> - struct list_head dmabufs_elm;
> - size_t size;
> - struct phys_vec *phys_vec;
> - struct p2pdma_provider *provider;
> - u32 nr_ranges;
> - struct kref kref;
> - struct completion comp;
> - u8 revoked : 1;
> -};
> -
> static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> struct dma_buf_attachment *attachment)
> {
> @@ -106,6 +93,130 @@ static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
> .release = vfio_pci_dma_buf_release,
> };
>
> +int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *priv,
> + struct vm_area_struct *vma,
> + unsigned long address,
Nit: s/address/fault_addr ?
> + unsigned int order,
> + unsigned long *out_pfn)
> +{
> + /*
> + * Given a VMA (start, end, pgoffs) and a fault address,
> + * search the corresponding DMABUF's phys_vec[] to find the
> + * range representing the address's offset into the VMA, and
> + * its PFN.
> + *
> + * The phys_vec[] ranges represent contiguous spans of VAs
> + * upwards from the buffer offset 0; the actual PFNs might be
> + * in any order, overlap/alias, etc. Calculate an offset of
> + * the desired page given VMA start/pgoff and address, then
> + * search upwards from 0 to find which span contains it.
> + *
> + * On success, a valid PFN for a page sized by 'order' is
> + * returned into out_pfn.
> + *
> + * Failure occurs if:
> + * - The page would cross the edge of the VMA
> + * - The page isn't entirely contained within a range
> + * - We find a range, but the final PFN isn't aligned to the
> + * requested order.
> + *
> + * (Upon failure, the caller is expected to try again with a
> + * smaller order; the tests above will always succeed for
> + * order=0 as the limit case.)
> + *
> + * It's suboptimal if DMABUFs are created with neighbouring
> + * ranges that are physically contiguous, since hugepages
> + * can't straddle range boundaries. (The construction of the
> + * ranges vector should merge such ranges.)
> + *
> + * Finally, vma_pgoff_adjust is used for a DMABUF representing
> + * a VFIO BAR mmap, which is created from the start of the
> + * offset region.
> + */
> +
> + const unsigned long pagesize = PAGE_SIZE << order;
> + unsigned long vma_off = ((vma->vm_pgoff - priv->vma_pgoff_adjust) <<
> + PAGE_SHIFT) & VFIO_PCI_OFFSET_MASK;
> + unsigned long rounded_page_addr = ALIGN_DOWN(address, pagesize);
> + unsigned long rounded_page_end = rounded_page_addr + pagesize;
> + unsigned long page_buf_offset;
> + unsigned long page_buf_offset_end;
> + unsigned long range_buf_offset = 0;
> + unsigned int i;
> +
> + if (rounded_page_addr < vma->vm_start || rounded_page_end > vma->vm_end) {
> + if (order > 0)
> + return -EAGAIN;
> +
> + /* A fault address outside of the VMA is absurd. */
> + WARN(1, "Fault addr 0x%lx outside VMA 0x%lx-0x%lx\n",
> + address, vma->vm_start, vma->vm_end);
This could flood dmesg if triggered repeatedly by userspace :(
Since a fault outside the VMA is an invalid access that already results
in a SIGBUS, we could probably avoid the WARN here?
Perhaps pr_warn_ratelimited() should suffice?
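To illustrate the difference in log volume, here is a minimal userspace sketch of a windowed rate limiter of the sort that a ratelimited print relies on. The struct, its fields and `sketch_ratelimit_ok()` are hypothetical names for illustration, and the window and burst values are arbitrary; this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical rate-limit state: at most 'burst' messages per window. */
struct sketch_ratelimit {
	unsigned long interval;	/* window length, in ticks */
	unsigned int burst;	/* messages allowed per window */
	unsigned long begin;	/* tick at which the current window began */
	unsigned int printed;	/* messages emitted in the current window */
};

/* Returns true if a message may be emitted at time 'now'. */
static bool sketch_ratelimit_ok(struct sketch_ratelimit *rs, unsigned long now)
{
	assert(rs->interval > 0);

	/* Start a fresh window once the current one has expired. */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed >= rs->burst)
		return false;	/* suppressed: budget for this window used up */
	rs->printed++;
	return true;
}
```

However often userspace triggers the fault, the number of log lines per window stays bounded by the burst value, whereas an unconditional WARN() emits a full stack dump each time.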
> + return -EFAULT;
> + }
> +
> + /*
> + * page_buf_offset[_end] is the span of DMABUF offsets
> + * corresponding to the faulting page:
> + */
> + if (unlikely(check_add_overflow(rounded_page_addr - vma->vm_start,
> + vma_off, &page_buf_offset) ||
> + check_add_overflow(page_buf_offset, pagesize,
> + &page_buf_offset_end)))
> + return -EFAULT;
> +
> + for (i = 0; i < priv->nr_ranges; i++) {
> + size_t range_len = priv->phys_vec[i].len;
> + phys_addr_t range_start = priv->phys_vec[i].paddr;
> +
> + /*
> + * If the current range starts after the page's span,
> + * this and any future range won't match. Bail early.
> + */
> + if (page_buf_offset_end <= range_buf_offset)
> + break;
> +
> + if (page_buf_offset >= range_buf_offset &&
> + page_buf_offset_end <= range_buf_offset + range_len) {
> + /*
> + * The faulting page is wholly contained
> + * within the span represented by the range.
> + * Validate PFN alignment for the order:
> + */
> + unsigned long pfn = (range_start + page_buf_offset -
> + range_buf_offset) / PAGE_SIZE;
Minor nit: I'm aware that decent compilers convert power-of-two divides to
shifts. However, we seem to be using `>> PAGE_SHIFT` across vfio-pci. E.g.:
	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
	unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
Let's consider using the same pattern?
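The two forms are interchangeable for unsigned values, which a small userspace sketch can show. It assumes 4 KiB pages, and `SKETCH_PAGE_SHIFT`, `SKETCH_PAGE_SIZE` and `sketch_pfn()` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Convert a physical address to a PFN using the shift idiom. */
static uint64_t sketch_pfn(uint64_t paddr)
{
	uint64_t pfn = paddr >> SKETCH_PAGE_SHIFT;

	/* Unsigned division by a power of two yields the same value. */
	assert(pfn == paddr / SKETCH_PAGE_SIZE);
	return pfn;
}
```

So the suggestion is purely about matching the surrounding code style; the computed PFN does not change.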
> +
> + if (IS_ALIGNED(pfn, 1 << order)) {
> + *out_pfn = pfn;
> + return 0;
> + }
> + /* Retry with smaller order */
> + return -EAGAIN;
> + }
> + range_buf_offset += range_len;
> + }
> +
> + /*
> + * A hugepage straddling a range boundary will fail to match a
> + * range, but the address will (eventually) match when retried
> + * with a smaller page.
> + */
> + if (order > 0)
> + return -EAGAIN;
> +
> + /*
> + * If we get here, the address fell outside of the span
> + * represented by the (concatenated) ranges. Setup of a
Nit: double space before "Setup" and "But" below.
> + * mapping must ensure that the VMA is <= the total size of
> + * the ranges, so this should never happen. But, if it does,
> + * force SIGBUS for the access and warn.
> + */
> + WARN_ONCE(1, "No range for addr 0x%lx, order %d: VMA 0x%lx-0x%lx pgoff 0x%lx, %u ranges, size 0x%zx\n",
> + address, order, vma->vm_start, vma->vm_end, vma->vm_pgoff,
> + priv->nr_ranges, priv->size);
> +
> + return -EFAULT;
The fall-through logic at the end feels a bit redundant.
If we've exhausted the phys_vec list without finding a match, returning
-EAGAIN for order > 0 seems like the correct fallback behavior.
However, the subsequent WARN_ONCE for the order == 0 seems unnecessary?
An out-of-bounds access is an error that should simply return -EFAULT
(converting to SIGBUS) without polluting the kernel log with stackdumps?
Can we instead convert this to a pr_warn or something? Something like:
	ret = order ? -EAGAIN : -EFAULT;
	if (ret == -EFAULT)
		pr_warn_ratelimited("No range for addr 0x%lx...\n", address);
	return ret;
(with appropriate comments)
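For context, the lookup and the suggested fall-through can be modelled in userspace as below. This is a simplified sketch and not the patch's code: `sketch_range` and `sketch_find_pfn()` are hypothetical names, pages are assumed to be 4 KiB, the offset is assumed to be already aligned to the requested order, and the overflow checks and logging are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12

/* Hypothetical stand-in for one physical range of the buffer. */
struct sketch_range {
	uint64_t paddr;	/* physical start address */
	uint64_t len;	/* length in bytes */
};

/*
 * Find the PFN backing the page of the given order at buffer offset
 * 'off', where the buffer is the concatenation of the ranges.
 * Returns 0 on success, -EAGAIN if the caller should retry with a
 * smaller order, or -EFAULT if an order-0 lookup matches no range.
 */
static int sketch_find_pfn(const struct sketch_range *r, size_t nr,
			   uint64_t off, unsigned int order,
			   uint64_t *out_pfn)
{
	uint64_t end = off + (UINT64_C(1) << (SKETCH_PAGE_SHIFT + order));
	uint64_t range_off = 0;
	size_t i;

	assert(out_pfn != NULL);

	for (i = 0; i < nr; i++) {
		/* Ranges only move upwards, so later ones cannot match. */
		if (end <= range_off)
			break;

		if (off >= range_off && end <= range_off + r[i].len) {
			uint64_t pfn = (r[i].paddr + off - range_off) >>
				       SKETCH_PAGE_SHIFT;

			/* The PFN must be aligned to the requested order. */
			if (pfn & ((UINT64_C(1) << order) - 1))
				return -EAGAIN;
			*out_pfn = pfn;
			return 0;
		}
		range_off += r[i].len;
	}

	/* No range matched: retry smaller, or give up at order 0. */
	return order ? -EAGAIN : -EFAULT;
}
```

A huge page that straddles two ranges falls out of the loop with -EAGAIN, while an order-0 offset past the end of the last range yields -EFAULT, which is where a ratelimited message would be printed.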
Thanks,
Praan
Every devmem dmabuf binding hands the page_pool PAGE_SIZE niovs today.
On NICs that consume one descriptor per netmem, this caps a single RX
descriptor at PAGE_SIZE and burns CPU on buffer churn.
In this series, we add a bind-time netlink attribute,
NETDEV_A_DMABUF_RX_BUF_SIZE, that lets userspace request a larger niov size
(power of two >= PAGE_SIZE). Drivers must opt in via
queue_mgmt_ops.QCFG_RX_PAGE_SIZE.
Selftests use udmabuf, but udmabuf sgtables were previously hardcoded to
PAGE_SIZE. This series modifies udmabuf to respect folio sizes in its exported
sgtable. The result is that when backing udmabuf with MFD_HUGETLB 2MB pages,
the sgtable is populated with 2MB entries, allowing devmem's gen_pool to carve
out large (e.g. 64K) niovs.
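The size constraint and the carving arithmetic can be sketched in userspace C as follows. This is only a model under stated assumptions: 4 KiB pages, and the hypothetical helpers `sketch_rx_buf_size_valid()` and `sketch_niovs_per_sg()`, which are not functions from the series; the kernel's actual validation may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL

/* A requested niov size must be a power of two and at least PAGE_SIZE. */
static bool sketch_rx_buf_size_valid(unsigned long size)
{
	return size >= SKETCH_PAGE_SIZE && (size & (size - 1)) == 0;
}

/*
 * Number of niovs that can be carved from one sg entry, or 0 if the
 * entry cannot be carved (invalid niov size, or an entry length that
 * is not a multiple of the niov size).
 */
static size_t sketch_niovs_per_sg(unsigned long sg_len,
				  unsigned long niov_size)
{
	assert(sg_len != 0);	/* an empty sg entry would be a caller bug */

	if (!sketch_rx_buf_size_valid(niov_size) || sg_len % niov_size)
		return 0;
	return sg_len / niov_size;
}
```

Under this model a 2MB sg entry yields 32 niovs of 64K, while a PAGE_SIZE sg entry yields none, which is why the udmabuf change is needed for the selftests.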
Measurements
------------
Setup: kperf devmem RX/TX cuda, 4 flows, 64 MB messages, 60s, dctcp,
num-rx-queues=4, dmabuf-rx/tx-size-mb=2048, 10 runs per niov size,
mlx5.
niov    RX dev Gbps         RX flow avg Gbps    app sys %
-----   -----------------   -----------------   -----------------
4K      300.63 +/- 53.21    75.16 +/- 13.30     54.15 +/- 10.23
16K     321.35 +/- 28.20    80.34 +/-  7.05     41.05 +/-  8.87
32K     347.63 +/-  2.20    86.91 +/-  0.55     44.54 +/-  3.51
64K     332.11 +/- 14.26    83.03 +/-  3.56     35.47 +/-  3.11
RX app sys % drops by ~19 percentage points (54.15 to 35.47) from 4K to 64K.
kperf support (not yet merged):
https://github.com/facebookexperimental/kperf/commit/8837577f920876bce6986e…
Signed-off-by: Bobby Eshleman <bobbyeshleman(a)meta.com>
---
Changes in v3:
- fix a bunch of non-reverse christmas tree declarations (Stan)
- remove extra uint32 cast for getpagesize() (Stan)
- remove overzealous strtoul checking (Stan)
- remove value checks that the kernel already performs on rx_buf_size
(Stan)
- Link to v2: https://lore.kernel.org/r/20260611-tcpdm-large-niovs-v2-0-ee2bf15e7523@meta…
Changes in v2:
- Use NL_SET_ERR_MSG_FMT for sg alignment failure details (Stan)
- Keep -E2BIG (not a direct ask, but seemed preferred, Stan)
- Update udmabuf commit message and comments explaining why
"one sg ent per folio" is useful (Christian)
- Set/restore nr_hugepages in py harness (Stan)
- Link to v1: https://lore.kernel.org/r/20260603-tcpdm-large-niovs-v1-0-f37a4ac6726c@meta…
---
Bobby Eshleman (4):
net: devmem: allow rx-buf-size > PAGE_SIZE per dmabuf binding
udmabuf: emit one sg entry per pinned folio
selftests/net: ncdevmem: add -b option to set rx-buf-size on bind
selftests/net: devmem.py: add check_rx_large_niov
Documentation/netlink/specs/netdev.yaml | 8 +++
drivers/dma-buf/udmabuf.c | 52 +++++++++++++++++--
include/uapi/linux/netdev.h | 1 +
net/core/devmem.c | 51 +++++++++++--------
net/core/devmem.h | 13 +++--
net/core/netdev-genl-gen.c | 5 +-
net/core/netdev-genl.c | 19 ++++++-
tools/include/uapi/linux/netdev.h | 1 +
tools/testing/selftests/drivers/net/hw/config | 1 +
tools/testing/selftests/drivers/net/hw/devmem.py | 12 ++++-
.../testing/selftests/drivers/net/hw/devmem_lib.py | 58 +++++++++++++++++++++-
tools/testing/selftests/drivers/net/hw/ncdevmem.c | 36 ++++++++++++--
.../testing/selftests/drivers/net/hw/nk_devmem.py | 11 +++-
13 files changed, 225 insertions(+), 43 deletions(-)
---
base-commit: 518d8d0199538a4d6d5e51064044ece71e0c42e7
change-id: 20260602-tcpdm-large-niovs-56523a3a1077
Best regards,
--
Bobby Eshleman <bobbyeshleman(a)meta.com>
On Fri, Jun 12, 2026 at 04:11:50PM +0100, Matt Evans wrote:
> Hi Kevin,
>
> On 12/06/2026 09:27, Tian, Kevin wrote:
> >> From: Matt Evans <matt(a)ozlabs.org>
> >> Sent: Wednesday, June 10, 2026 11:43 PM
> >>
> > [...]
> >>
> >> vfio/pci: Support mmap() of a VFIO DMABUF
> >>
> >> Adds mmap() for a DMABUF fd exported from vfio-pci.
> >>
> >> It was a goal to keep the VFIO device fd lifetime behaviour
> >> unchanged with respect to the DMABUFs. An application can close
> >> all device fds, and this will revoke/clean up all DMABUFs; no
> >> mappings or other access can be performed now. When enabling
> >> mmap() of the DMABUFs, this means access through the VMA is also
> >> revoked. This complicates the fault handler because whilst the
> >> DMABUF exists, it has no guarantee that the corresponding VFIO
> >> device is still alive. Adds synchronisation ensuring the vdev is
> >> available before vdev->memory_lock is touched; this holds the
> >> device registration so that even if the buffer has been cleaned up,
> >> vdev hasn't been freed and so the lock can be safely taken.
> >>
> >> This commit makes VFIO_PCI_CORE depend on PCI_P2PDMA_CORE
> >> (commit
> >> 1) to bring in (only) the P2PDMA provider code.
> >
> > the last sentence is stale as the dependency is now added in patch4.
>
> Right, will fix.
>
> >>
> >> End
> >> ===
> >>
> >> This is based on VFIO next (e.g. at b9285405c5f6).
> >>
> >
> > Sashiko failed to apply this series. Is there dependent work in vfio-next?
> >
> > otherwise getting a Sashiko review is helpful here.
>
> It _did_ depend on (at least the context of) some fixes in vfio-next.
> Looks like it'll rebase on master now those are merged. I should've
> re-checked this for v3, oops. :|
>
> (FWIW, I had Robot Claude Opus 4.8 to review several times up to v3.
> But I agree, Sashiko would be interesting too. Can it be manually
> triggered with branch guidance?)
I guess the relevant steps to run it locally are here:
https://github.com/sashiko-dev/sashiko/blob/main/README.md
Additionally, we can try providing a base-commit (which points to a
public commit).
Thanks,
Praan
On Wed, Jun 10, 2026 at 04:43:18PM +0100, Matt Evans wrote:
> Convert the VFIO device fd fops->mmap to create a DMABUF representing
> the BAR mapping, and make the VMA fault handler look up PFNs from the
> corresponding DMABUF. This supports future code mmap()ing BAR
> DMABUFs, and iommufd work to support Type1 P2P.
>
> First, vfio_pci_core_mmap() uses the new
> vfio_pci_core_mmap_prep_dmabuf() helper to export a DMABUF
> representing a single BAR range. Then, the vfio_pci_mmap_huge_fault()
> callback is updated to understand revoked buffers, and uses the new
> vfio_pci_dma_buf_find_pfn() helper to determine the PFN for a given
> fault address.
>
> Now that the VFIO DMABUFs can be mmap()ed, vfio_pci_dma_buf_move()
> zaps PTEs (used on the revocation and cleanup paths).
>
> CONFIG_VFIO_PCI_CORE now unconditionally depends on
> CONFIG_DMA_SHARED_BUFFER and CONFIG_PCI_P2PDMA_CORE. The
> CONFIG_VFIO_PCI_DMABUF feature conditionally includes support for
> VFIO_DEVICE_FEATURE_DMA_BUF, depending on the availability of
> CONFIG_PCI_P2PDMA.
>
> Signed-off-by: Matt Evans <matt(a)ozlabs.org>
> ---
> drivers/vfio/pci/Kconfig | 5 +-
> drivers/vfio/pci/Makefile | 3 +-
> drivers/vfio/pci/vfio_pci_core.c | 75 +++++++++++++++++++-----------
> drivers/vfio/pci/vfio_pci_dmabuf.c | 12 +++++
> drivers/vfio/pci/vfio_pci_priv.h | 11 +----
> 5 files changed, 67 insertions(+), 39 deletions(-)
>
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 296bf01e185e..67a2ae1fbc04 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -6,6 +6,8 @@ config VFIO_PCI_CORE
> tristate
> select VFIO_VIRQFD
> select IRQ_BYPASS_MANAGER
> + select PCI_P2PDMA_CORE
> + select DMA_SHARED_BUFFER
>
> config VFIO_PCI_INTX
> def_bool y if !S390
> @@ -56,7 +58,8 @@ config VFIO_PCI_ZDEV_KVM
> To enable s390x KVM vfio-pci extensions, say Y.
>
> config VFIO_PCI_DMABUF
> - def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
> + def_bool y if PCI_P2PDMA
> + depends on VFIO_PCI_CORE
>
> source "drivers/vfio/pci/mlx5/Kconfig"
>
[...]
> int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
> struct vm_area_struct *vma,
> @@ -532,6 +538,10 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> struct vfio_pci_dma_buf *tmp;
>
> lockdep_assert_held_write(&vdev->memory_lock);
> + /*
> + * Holding memory_lock ensures a racing VMA fault observes
> + * priv->revoked properly.
> + */
Nit: This comment should appear before the lockdep_assert_held_write()
Also, it is slightly verbose.. (not against it though).
>
> list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> if (!get_file_active(&priv->dmabuf->file))
> @@ -549,6 +559,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> if (revoked) {
> kref_put(&priv->kref, vfio_pci_dma_buf_done);
> wait_for_completion(&priv->comp);
> + unmap_mapping_range(priv->dmabuf->file->f_mapping,
> + 0, priv->size, 1);
Have we run this series with lockdep enabled?
I guess it'd be nice to check with lockdep once..
Apart from these,
Reviewed-by: Pranjal Shrivastava <praan(a)google.com>
Thanks,
Praan