Hi all,
This series is based on previous RFCs/discussions:
Tech topic: https://lore.kernel.org/linux-iommu/20250918214425.2677057-1-amastro@fb.com/ RFCv1: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/ RFCv2: https://lore.kernel.org/kvm/20260312184613.3710705-1-mattev@meta.com/
The background/rationale is covered in more detail in the RFC cover letters. The TL;DR is:
The goal is to enable userspace driver designs that use VFIO to export DMABUFs representing subsets of PCI device BARs, and "vend" those buffers from a primary process to other subordinate processes by fd. These processes then mmap() the buffers and their access to the device is isolated to the exported ranges. This is an improvement on sharing the VFIO device fd to subordinate processes, which would allow unfettered access.
This is achieved by enabling mmap() of vfio-pci DMABUFs, passed by fd to subordinate processes. Second, a new revocation mechanism is added to allow the primary process to forcibly revoke access to previously-shared BAR spans, even if the subordinate processes haven't cleanly exited.
(The related topic of safe delegation of iommufd control to the subordinate processes is not addressed here, and is follow-up work.)
As well as isolation and revocation, another advantage to accessing a BAR through a VMA backed by a DMABUF is that it's straightforward to mmap() the buffer with access attributes, such as write-combining.
Feedback from the RFCs requested that, instead of creating DMABUF-specific vm_ops and .fault paths, to go the whole way and migrate the existing VFIO PCI BAR mmap() to be backed by a DMABUF too, resulting in a common vm_ops and fault handler for mmap()s of both the VFIO device and explicitly-exported DMABUFs. This will help future iommufd emulation of VFIO Type1 peer-to-peer, making it easier to get a DMABUF for a VFIO BAR as a DMA target.
mmap() conversion to use DMABUF underneath has been done for vfio-pci, but not sub-drivers:
nvgrace-gpu's mmap() override path is unchanged; I kept this out of scope for now not least because I don't have a thorough test setup for this system. I would prefer to help the nvgrace-gpu maintainers enable BAR mmap() DMABUFs themselves.
Notes on patches ================
PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE
Later in the series, vfio-pci's mmap() is going to depend on pcim_p2pdma_provider() which depended on CONFIG_PCI_P2PDMA, which in turn depended on ZONE_DEVICE (which isn't available on 32-bit and some archs, because they lack MEMORY_HOTPLUG and friends). VFIO does _not_ require actual P2P to be present for basic mmap() functionality, only for the optional CONFIG_DMA_SHARED_BUFFER feature.
This splits out p2pdma_core.c under CONFIG_PCI_P2PDMA_CORE (which currently contains pcim_p2pdma_provider()), and an optional CONFIG_PCI_P2PDMA which depends on ZONE_DEVICE etc. providing P2P functionality in the existing p2pdma.c.
vfio/pci: Add a helper to look up PFNs for DMABUFs vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
The first is for a DMABUF VMA fault handler to determine arbitrary-sized PFNs from ranges in DMABUF. Secondly, refactor DMABUF export for use by the existing export feature and add a helper that creates a DMABUF corresponding to a VFIO BAR mmap() request.
vfio/pci: Convert BAR mmap() to use a DMABUF
The vfio-pci core mmap() creates a DMABUF with the helper, and the vm_ops fault handler uses the other helper to resolve the fault. Because this depends on DMABUF structs/code, CONFIG_VFIO_PCI_CORE needs to depend on CONFIG_DMA_SHARED_BUFFER. The CONFIG_VFIO_PCI_DMABUF still conditionally enables the export support code.
NOTE: The user mmap()s a device fd, but the resulting VMA's vm_file becomes that of the DMABUF. The DMABUF takes ownership of the device and put()s it on release, which maintains the existing behaviour of a VMA keeping the VFIO device open.
BAR zapping then happens via the existing vfio_pci_dma_buf_move() path, which now needs to unmap PTEs in the DMABUF's address_space.
vfio/pci: Provide a user-facing name for BAR mappings
There was a request for decent debug naming in /proc/<pid>/maps etc. comparable to the existing VFIO names: since the VMAs are DMABUFs, they have a "dmabuf:" prefix and can't be 100% identical to before. This is a user-visible change, but this patch at least now gives us extra info on the BDF & BAR being mapped.
vfio/pci: Clean up BAR zap and revocation
In general (see NOTE!) the vfio_pci_zap_bars() is now obsolete, since it unmaps PTEs in the VFIO device address_space which is now unused. This consolidates all calls (e.g. around reset) with the neighbouring vfio_pci_dma_buf_move()s into new functions, to revoke-zap/unrevoke. This makes the "revoke/un-revoke" steps clearer.
NOTE: Because drivers can use their own vm_ops and override .mmap, the core must conservatively assume an overridden .mmap might still add PTEs to the VFIO device address_space and therefore still does the zap. A new flag, zap_bars_on_revoke, enables the zap when .mmap is overridden. A driver that does not need the zap can clear this to opt-out, e.g. if the driver calls down to the common mmap (and so uses DMABUFs).
vfio/pci: Support mmap() of a VFIO DMABUF
Adds mmap() for a DMABUF fd exported from vfio-pci.
It was a goal to keep the VFIO device fd lifetime behaviour unchanged with respect to the DMABUFs. An application can close all device fds, and this will revoke/clean up all DMABUFs; no mappings or other access can be performed now. When enabling mmap() of the DMABUFs, this means access through the VMA is also revoked. This complicates the fault handler because whilst the DMABUF exists, it has no guarantee that the corresponding VFIO device is still alive. Adds synchronisation ensuring the vdev is available before vdev->memory_lock is touched; this holds the device registration so that even if the buffer has been cleaned up, vdev hasn't been freed and so the lock can be safely taken.
This commit makes VFIO_PCI_CORE depend on PCI_P2PDMA_CORE (commit 1) to bring in (only) the P2PDMA provider code.
vfio/pci: Permanently revoke a DMABUF on request
By weight, this is mostly a rename of revoked to an enum, status. There are now 3 states for a buffer, usable and revoked temporary/permanent. A new VFIO feature is added, VFIO_DEVICE_FEATURE_DMA_BUF_REVOKE, which takes a DMABUF (exported from the same device) and permanently revokes it. Thus a userspace driver can guarantee any downstream consumers of a shared fd are prevented from accessing a BAR range, and that range can be reused. NOTE: This might block userspace, waiting on importers to detach.
The code doing revocation in vfio_pci_dma_buf_move() is moved, unchanged, to a common function for use by ..._move() and this new feature.
vfio/pci: Add mmap() attributes to DMABUF feature
Adds a new VFIO feature, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR. After a DMABUF is exported, this feature is used to set a memory attribute that will be used by future mmap()s of the DMABUF fd. It doesn't affect existing maps.
The default is UC, and via the feature one can specify CPU access as WC. The attribute is an enum/scalar rather than bitmap/cumulative. The attributes follow a "try-fail" model where a client can request an attribute and either succeed or fail with ENOENT if it's unknown; if future attributes are platform-specific then their support can be probed.
(Since it's just UC/WC for now, there is no reservation or numeric structure to the namespace yet, but we could support system/arch-specific values in future by carving out base + arch-specific + IMPDEF ranges.)
Testing =======
(The [RFC ONLY] userspace test program, for QEMU edu-plus, can be found in the GitHub branch below. It at least illustrates how the export, map, revoke, attribute, and close semantics interoperate.)
This code has been tested in mapping DMABUFs of single/multiple ranges from multiple BARs, aliasing mmap()s, aliasing ranges across DMABUFs, vm_pgoff > 0, revocation, shutdown/cleanup scenarios, and hugepage mappings. I've lightly tested WC mappings also (by observing resulting PTEs as having the correct attributes...). No regressions observed on the VFIO selftests, or on our internal vfio-pci applications. VFIO on i386 has been build-tested.
End ===
This is based on VFIO next (e.g. at b9285405c5f6).
These commits are on GitHub for easier browsing, along with "[RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test":
https://github.com/metamev/linux/compare/b9285405c5f6...metamev:linux:dev/me...
Thanks for reading,
Matt
================================================================================ Change log:
v3: - Refactor p2pdma.c: split out pcim_p2pdma_provider() into a new p2pdma_core.c under CONFIG_PCI_P2PDMA_CORE.
- vfio_pci_dma_buf_find_pfn() cleanups: Rename parameter to priv, remove bad WARN, move unnecessary addition out of inner loop.
- vfio_pci_core_mmap_prep_dmabuf() cleanups: Remove uint32_t, remove unnecessary const variable.
- Conversion of BAR mmap() to DMABUF: VFIO_PCI_DMABUF depends on VFIO_PCI_CORE. vfio_pci_mmap_huge_fault(): move dev_dbg() outside of lock (argh), remove READ_ONCE(vdev)/move priv->vdev read and improve comment explanation.
- On revoke, BAR zap defaults to on if .mmap is overridden by a driver (and implements an opt-out for the hisi_acc_vfio_pci driver, which overrides mmap() with a simple wrapper that ends up using the common DMABUF mmap() rather than custom mappings).
- Reworded commit "vfio/pci: Support mmap() of a VFIO DMABUF" message for clarity. Reworded vfio_pci_mmap_huge_fault() comment for accuracy (vdev validity depends on not being revoked). Added comment in mmap() explaining belt-and-braces approach for early detecting a map of a revoked buffer.
- Revoke now uses VFIO_DEVICE_FEATURE_DMA_BUF rather than a new ioctl(); instead of the revoke helper taking 'revoked/permanently' bools, it's become vfio_pci_dma_buf_set_status() taking a single status enum. Added a READ_ONCE() for the lockless test of priv->vdev (flags it as intentional, even if it's in practice going to be a single-copy atomic read).
- Removed GET on vfio_pci_core_feature_dma_buf_memattr(), removed unnecessary taking of memory_lock, fixed error return values. In particular, removes ENOTSUPP, and uses ENOENT to indicate an unknown attribute enum value was passed to SET. In the discussion here, https://lore.kernel.org/all/20260602131417.41366391@shazbot.org/ we'd agreed on EOPNOTSUPP before I realised that's already used elsewhere. ENOENT uniquely indicates an unknown attribute.
v2: https://lore.kernel.org/all/20260527102319.100128-1-mattev@meta.com/
- Rebase on VFIO next, picking up Alex's vfio_pci_dma_buf_move()/vfio_pci_dma_buf_cleanup() fixes, and dropping "vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put"
- Added "PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE" so that the newly-added vfio-pci hard dependency on the P2PDMA provider instead pulls in the _CORE variant and not the full-fat CONFIG_PCI_P2PDMA. This means that the core of vfio-pci does not need ZONE_DEVICE, but if it's available then enabling P2PDMA in turn enables DMABUF export. Fixes basic VFIO operation on 32b or other platforms without ZONE_DEVICE.
- Fixed comment inaccuracy in vfio_pci_dma_buf_revoke() and cleaned up vdev validity test.
- vfio_pci_dma_buf_find_pfn(): use PAGE_ALIGN(), better span variable naming, OVF check
- Made vm_pgoffs use consistent (keeping the resource index at the top and masking where offset is used). For BAR mmap, use new vma_pgoff_adjust to create the DMABUF with the exact mmap()ed span instead of from the start of the BAR with an invisible portion before the mapping.
- Added VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR to set memory attributes, instead of using the export `flags` field.
- vfio_pci_ioctl_reset: Moved vfio_pci_zap_revoke_bars() (effectively, vfio_pci_dma_buf_move()) back after D0 transition. Note, if a BAR zap is needed, it's done in this function so now happens after this D0 transition with the _move; it was done before it at the time of the memory_lock taking.
- Minimised vfio_pci_dma_buf_mmap() (removed redundant span check), added READ_ONCE for memattr
- Misc fixes: comment in DMABUF name generation, removed superfluous READ_ONCE from faulthandler
v1: https://lore.kernel.org/kvm/20260416131815.2729131-1-mattev@meta.com/
- Cleanup of the common DMABUF-aware VMA vm_ops fault handler and export code. - Fixed a lot of races, particularly faults racing with DMABUF cleanup (if the VFIO device fds close, for example). - Added nicer human-readable names for VFIO mmap() VMAs
RFCv2: Respin based on the feedback/suggestions: https://lore.kernel.org/kvm/20260312184613.3710705-1-mattev@meta.com/
- Transform the existing VFIO BAR mmap path to also use DMABUFs behind the scenes, and then simply share that code for explicitly-mapped DMABUFs. Jason wanted to go that direction to enable iommufd VFIO type 1 emulation to pick up a DMABUF for an IO mapping.
- Revoke buffers using a VFIO device fd ioctl
RFCv1: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/
Matt Evans (9): PCI/P2PDMA: Add CONFIG_PCI_P2PDMA_CORE vfio/pci: Add a helper to look up PFNs for DMABUFs vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA vfio/pci: Convert BAR mmap() to use a DMABUF vfio/pci: Provide a user-facing name for BAR mappings vfio/pci: Clean up BAR zap and revocation vfio/pci: Support mmap() of a VFIO DMABUF vfio/pci: Permanently revoke a DMABUF on request vfio/pci: Add mmap() attributes to DMABUF feature
MAINTAINERS | 2 +- drivers/pci/Kconfig | 10 +- drivers/pci/Makefile | 1 + drivers/pci/p2pdma.c | 109 +--- drivers/pci/p2pdma.h | 29 + drivers/pci/p2pdma_core.c | 118 ++++ drivers/vfio/pci/Kconfig | 5 +- drivers/vfio/pci/Makefile | 3 +- .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c | 8 + drivers/vfio/pci/vfio_pci_config.c | 30 +- drivers/vfio/pci/vfio_pci_core.c | 213 +++++-- drivers/vfio/pci/vfio_pci_dmabuf.c | 564 +++++++++++++++--- drivers/vfio/pci/vfio_pci_priv.h | 64 +- include/linux/pci-p2pdma.h | 24 +- include/linux/pci.h | 2 +- include/linux/vfio_pci_core.h | 1 + include/uapi/linux/vfio.h | 47 ++ 17 files changed, 958 insertions(+), 272 deletions(-) create mode 100644 drivers/pci/p2pdma.h create mode 100644 drivers/pci/p2pdma_core.c
The P2PDMA code currently provides two features under the same CONFIG_PCI_P2PDMA option:
1. Locate providers via pcim_p2pdma_provider() 2. Manage actual P2P DMA
Some drivers (such as vfio-pci) depend on 1, without having a hard dependency on 2.
A future commit expands the use of DMABUF in vfio-pci for non-P2P scenarios, relying on pcim_p2pdma_provider() always being present. If that depended on CONFIG_PCI_P2PDMA, it would make vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), even when P2P is not needed.
To resolve this, introduce CONFIG_PCI_P2PDMA_CORE and refactor the basic provider functionality into a new p2pdma_core.c file. This is available even if the CONFIG_PCI_P2PDMA feature is disabled (or unavailable due to !CONFIG_ZONE_DEVICE). Then, drivers can enable any additional P2P features with the original CONFIG_PCI_P2PDMA (available when CONFIG_ZONE_DEVICE is set).
Signed-off-by: Matt Evans matt@ozlabs.org --- MAINTAINERS | 2 +- drivers/pci/Kconfig | 10 ++-- drivers/pci/Makefile | 1 + drivers/pci/p2pdma.c | 109 ++-------------------------------- drivers/pci/p2pdma.h | 29 +++++++++ drivers/pci/p2pdma_core.c | 118 +++++++++++++++++++++++++++++++++++++ include/linux/pci-p2pdma.h | 24 ++++---- include/linux/pci.h | 2 +- 8 files changed, 174 insertions(+), 121 deletions(-) create mode 100644 drivers/pci/p2pdma.h create mode 100644 drivers/pci/p2pdma_core.c
diff --git a/MAINTAINERS b/MAINTAINERS index c2c6d79275c6..b21523b3bd8b 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -20617,7 +20617,7 @@ B: https://bugzilla.kernel.org C: irc://irc.oftc.net/linux-pci T: git git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git F: Documentation/driver-api/pci/p2pdma.rst -F: drivers/pci/p2pdma.c +F: drivers/pci/p2pdma* F: include/linux/pci-p2pdma.h
PCI POWER CONTROL diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 33c88432b728..59d70bc84cc9 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -206,11 +206,7 @@ config PCIE_TPH config PCI_P2PDMA bool "PCI peer-to-peer transfer support" depends on ZONE_DEVICE - # - # The need for the scatterlist DMA bus address flag means PCI P2PDMA - # requires 64bit - # - depends on 64BIT + select PCI_P2PDMA_CORE select GENERIC_ALLOCATOR select NEED_SG_DMA_FLAGS help @@ -226,6 +222,10 @@ config PCI_P2PDMA
If unsure, say N.
+config PCI_P2PDMA_CORE + default n + bool + config PCI_LABEL def_bool y if (DMI || ACPI) select NLS diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 41ebc3b9a518..0b32572d57a1 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -30,6 +30,7 @@ obj-$(CONFIG_PCI_SYSCALL) += syscall.o obj-$(CONFIG_PCI_STUB) += pci-stub.o obj-$(CONFIG_PCI_PF_STUB) += pci-pf-stub.o obj-$(CONFIG_PCI_ECAM) += ecam.o +obj-$(CONFIG_PCI_P2PDMA_CORE) += p2pdma_core.o obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o obj-$(CONFIG_VGA_ARB) += vgaarb.o diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 7c898542af8d..50b1a7daf55c 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -21,12 +21,7 @@ #include <linux/seq_buf.h> #include <linux/xarray.h>
-struct pci_p2pdma { - struct gen_pool *pool; - bool p2pmem_published; - struct xarray map_types; - struct p2pdma_provider mem[PCI_STD_NUM_BARS]; -}; +#include "p2pdma.h"
struct pci_p2pdma_pagemap { struct dev_pagemap pgmap; @@ -226,110 +221,16 @@ static const struct dev_pagemap_ops p2pdma_pgmap_ops = { .folio_free = p2pdma_folio_free, };
-static void pci_p2pdma_release(void *data) +void pci_p2pdma_release_pool(struct pci_dev *pdev, struct pci_p2pdma *p2pdma) { - struct pci_dev *pdev = data; - struct pci_p2pdma *p2pdma; - - p2pdma = rcu_dereference_protected(pdev->p2pdma, 1); - if (!p2pdma) - return; - - /* Flush and disable pci_alloc_p2p_mem() */ - pdev->p2pdma = NULL; - if (p2pdma->pool) - synchronize_rcu(); - xa_destroy(&p2pdma->map_types); - if (!p2pdma->pool) return;
+ synchronize_rcu(); gen_pool_destroy(p2pdma->pool); sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group); }
-/** - * pcim_p2pdma_init - Initialise peer-to-peer DMA providers - * @pdev: The PCI device to enable P2PDMA for - * - * This function initializes the peer-to-peer DMA infrastructure - * for a PCI device. It allocates and sets up the necessary data - * structures to support P2PDMA operations, including mapping type - * tracking. - */ -int pcim_p2pdma_init(struct pci_dev *pdev) -{ - struct pci_p2pdma *p2p; - int i, ret; - - p2p = rcu_dereference_protected(pdev->p2pdma, 1); - if (p2p) - return 0; - - p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL); - if (!p2p) - return -ENOMEM; - - xa_init(&p2p->map_types); - /* - * Iterate over all standard PCI BARs and record only those that - * correspond to MMIO regions. Skip non-memory resources (e.g. I/O - * port BARs) since they cannot be used for peer-to-peer (P2P) - * transactions. - */ - for (i = 0; i < PCI_STD_NUM_BARS; i++) { - if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM)) - continue; - - p2p->mem[i].owner = &pdev->dev; - p2p->mem[i].bus_offset = - pci_bus_address(pdev, i) - pci_resource_start(pdev, i); - } - - ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev); - if (ret) - goto out_p2p; - - rcu_assign_pointer(pdev->p2pdma, p2p); - return 0; - -out_p2p: - devm_kfree(&pdev->dev, p2p); - return ret; -} -EXPORT_SYMBOL_GPL(pcim_p2pdma_init); - -/** - * pcim_p2pdma_provider - Get peer-to-peer DMA provider - * @pdev: The PCI device to enable P2PDMA for - * @bar: BAR index to get provider - * - * This function gets peer-to-peer DMA provider for a PCI device. The lifetime - * of the provider (and of course the MMIO) is bound to the lifetime of the - * driver. A driver calling this function must ensure that all references to the - * provider, and any DMA mappings created for any MMIO, are all cleaned up - * before the driver remove() completes. - * - * Since P2P is almost always shared with a second driver this means some system - * to notify, invalidate and revoke the MMIO's DMA must be in place to use this - * function. For example a revoke can be built using DMABUF. - */ -struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar) -{ - struct pci_p2pdma *p2p; - - if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM)) - return NULL; - - p2p = rcu_dereference_protected(pdev->p2pdma, 1); - if (WARN_ON(!p2p)) - /* Someone forgot to call to pcim_p2pdma_init() before */ - return NULL; - - return &p2p->mem[bar]; -} -EXPORT_SYMBOL_GPL(pcim_p2pdma_provider); - static int pci_p2pdma_setup_pool(struct pci_dev *pdev) { struct pci_p2pdma *p2pdma; @@ -932,8 +833,8 @@ void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size) struct pci_p2pdma *p2pdma;
/* - * Pairs with synchronize_rcu() in pci_p2pdma_release() to - * ensure pdev->p2pdma is non-NULL for the duration of the + * Pairs with synchronize_rcu() in pci_p2pdma_release_pool() + * to ensure pdev->p2pdma is non-NULL for the duration of the * read-lock. */ rcu_read_lock(); diff --git a/drivers/pci/p2pdma.h b/drivers/pci/p2pdma.h new file mode 100644 index 000000000000..453f4aa7ade8 --- /dev/null +++ b/drivers/pci/p2pdma.h @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * PCI Peer 2 Peer DMA support. + */ + +#ifndef _PCI_P2PDMA_H +#define _PCI_P2PDMA_H + +#include <linux/genalloc.h> +#include <linux/pci-p2pdma.h> +#include <linux/xarray.h> + +struct pci_p2pdma { + struct gen_pool *pool; + bool p2pmem_published; + struct xarray map_types; + struct p2pdma_provider mem[PCI_STD_NUM_BARS]; +}; + +#ifdef CONFIG_PCI_P2PDMA +void pci_p2pdma_release_pool(struct pci_dev *pdev, struct pci_p2pdma *p2pdma); +#else +static inline void pci_p2pdma_release_pool(struct pci_dev *pdev, struct pci_p2pdma *p2pdma) +{ +} +#endif + +#endif + diff --git a/drivers/pci/p2pdma_core.c b/drivers/pci/p2pdma_core.c new file mode 100644 index 000000000000..1fda15d40196 --- /dev/null +++ b/drivers/pci/p2pdma_core.c @@ -0,0 +1,118 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * PCI Peer 2 Peer DMA support core, providing a bare-bones + * pcim_p2pdma_provider() interface to drivers even if full P2PDMA + * isn't present. The full P2PDMA feature is in p2pdma.c (see + * CONFIG_PCI_P2PDMA). + * + * Copyright (c) 2016-2018, Logan Gunthorpe + * Copyright (c) 2016-2017, Microsemi Corporation + * Copyright (c) 2017, Christoph Hellwig + * Copyright (c) 2018, Eideticom Inc. + */ + +#define pr_fmt(fmt) "pci-p2pdma: " fmt +#include <linux/ctype.h> +#include <linux/genalloc.h> +#include <linux/memremap.h> +#include <linux/pci-p2pdma.h> +#include <linux/xarray.h> + +#include "p2pdma.h" + +static void pci_p2pdma_release(void *data) +{ + struct pci_dev *pdev = data; + struct pci_p2pdma *p2pdma; + + p2pdma = rcu_dereference_protected(pdev->p2pdma, 1); + if (!p2pdma) + return; + + /* Flush and disable pci_alloc_p2p_mem() */ + pdev->p2pdma = NULL; + pci_p2pdma_release_pool(pdev, p2pdma); + xa_destroy(&p2pdma->map_types); +} + +/** + * pcim_p2pdma_init - Initialise peer-to-peer DMA providers + * @pdev: The PCI device to enable P2PDMA for + * + * This function initializes the peer-to-peer DMA infrastructure + * for a PCI device. It allocates and sets up the necessary data + * structures to support P2PDMA operations, including mapping type + * tracking. + */ +int pcim_p2pdma_init(struct pci_dev *pdev) +{ + struct pci_p2pdma *p2p; + int i, ret; + + p2p = rcu_dereference_protected(pdev->p2pdma, 1); + if (p2p) + return 0; + + p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL); + if (!p2p) + return -ENOMEM; + + xa_init(&p2p->map_types); + /* + * Iterate over all standard PCI BARs and record only those that + * correspond to MMIO regions. Skip non-memory resources (e.g. I/O + * port BARs) since they cannot be used for peer-to-peer (P2P) + * transactions. + */ + for (i = 0; i < PCI_STD_NUM_BARS; i++) { + if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM)) + continue; + + p2p->mem[i].owner = &pdev->dev; + p2p->mem[i].bus_offset = + pci_bus_address(pdev, i) - pci_resource_start(pdev, i); + } + + ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev); + if (ret) + goto out_p2p; + + rcu_assign_pointer(pdev->p2pdma, p2p); + return 0; + +out_p2p: + devm_kfree(&pdev->dev, p2p); + return ret; +} +EXPORT_SYMBOL_GPL(pcim_p2pdma_init); + +/** + * pcim_p2pdma_provider - Get peer-to-peer DMA provider + * @pdev: The PCI device to enable P2PDMA for + * @bar: BAR index to get provider + * + * This function gets peer-to-peer DMA provider for a PCI device. The lifetime + * of the provider (and of course the MMIO) is bound to the lifetime of the + * driver. A driver calling this function must ensure that all references to the + * provider, and any DMA mappings created for any MMIO, are all cleaned up + * before the driver remove() completes. + * + * Since P2P is almost always shared with a second driver this means some system + * to notify, invalidate and revoke the MMIO's DMA must be in place to use this + * function. For example a revoke can be built using DMABUF. + */ +struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar) +{ + struct pci_p2pdma *p2p; + + if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM)) + return NULL; + + p2p = rcu_dereference_protected(pdev->p2pdma, 1); + if (WARN_ON(!p2p)) + /* Someone forgot to call to pcim_p2pdma_init() before */ + return NULL; + + return &p2p->mem[bar]; +} +EXPORT_SYMBOL_GPL(pcim_p2pdma_provider); diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index 873de20a2247..4c42a7b2ee85 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -67,9 +67,22 @@ enum pci_p2pdma_map_type { PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, };
-#ifdef CONFIG_PCI_P2PDMA +#ifdef CONFIG_PCI_P2PDMA_CORE int pcim_p2pdma_init(struct pci_dev *pdev); struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar); +#else +static inline int pcim_p2pdma_init(struct pci_dev *pdev) +{ + return -EOPNOTSUPP; +} +static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, + int bar) +{ + return NULL; +} +#endif + +#ifdef CONFIG_PCI_P2PDMA int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset); int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients, @@ -89,15 +102,6 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev, enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev); #else /* CONFIG_PCI_P2PDMA */ -static inline int pcim_p2pdma_init(struct pci_dev *pdev) -{ - return -EOPNOTSUPP; -} -static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, - int bar) -{ - return NULL; -} static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset) { diff --git a/include/linux/pci.h b/include/linux/pci.h index 2c4454583c11..531aec355686 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -557,7 +557,7 @@ struct pci_dev { u16 pasid_cap; /* PASID Capability offset */ u16 pasid_features; #endif -#ifdef CONFIG_PCI_P2PDMA +#ifdef CONFIG_PCI_P2PDMA_CORE struct pci_p2pdma __rcu *p2pdma; #endif #ifdef CONFIG_PCI_DOE
On Wed, Jun 10, 2026 at 04:43:15PM +0100, Matt Evans wrote:
The P2PDMA code currently provides two features under the same CONFIG_PCI_P2PDMA option:
- Locate providers via pcim_p2pdma_provider()
- Manage actual P2P DMA
Some drivers (such as vfio-pci) depend on 1, without having a hard dependency on 2.
A future commit expands the use of DMABUF in vfio-pci for non-P2P scenarios, relying on pcim_p2pdma_provider() always being present. If that depended on CONFIG_PCI_P2PDMA, it would make vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), even when P2P is not needed.
To resolve this, introduce CONFIG_PCI_P2PDMA_CORE and refactor the basic provider functionality into a new p2pdma_core.c file. This is available even if the CONFIG_PCI_P2PDMA feature is disabled (or unavailable due to !CONFIG_ZONE_DEVICE). Then, drivers can enable any additional P2P features with the original CONFIG_PCI_P2PDMA (available when CONFIG_ZONE_DEVICE is set).
Signed-off-by: Matt Evans matt@ozlabs.org
MAINTAINERS | 2 +- drivers/pci/Kconfig | 10 ++-- drivers/pci/Makefile | 1 + drivers/pci/p2pdma.c | 109 ++-------------------------------- drivers/pci/p2pdma.h | 29 +++++++++ drivers/pci/p2pdma_core.c | 118 +++++++++++++++++++++++++++++++++++++ include/linux/pci-p2pdma.h | 24 ++++---- include/linux/pci.h | 2 +- 8 files changed, 174 insertions(+), 121 deletions(-) create mode 100644 drivers/pci/p2pdma.h create mode 100644 drivers/pci/p2pdma_core.c
Thanks, Reviewed-by: Leon Romanovsky leonro@nvidia.com
On Wed, Jun 10, 2026 at 04:43:15PM +0100, Matt Evans wrote:
The P2PDMA code currently provides two features under the same CONFIG_PCI_P2PDMA option:
- Locate providers via pcim_p2pdma_provider()
- Manage actual P2P DMA
Some drivers (such as vfio-pci) depend on 1, without having a hard dependency on 2.
A future commit expands the use of DMABUF in vfio-pci for non-P2P scenarios, relying on pcim_p2pdma_provider() always being present. If that depended on CONFIG_PCI_P2PDMA, it would make vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), even when P2P is not needed.
To resolve this, introduce CONFIG_PCI_P2PDMA_CORE and refactor the basic provider functionality into a new p2pdma_core.c file. This is available even if the CONFIG_PCI_P2PDMA feature is disabled (or unavailable due to !CONFIG_ZONE_DEVICE). Then, drivers can enable any additional P2P features with the original CONFIG_PCI_P2PDMA (available when CONFIG_ZONE_DEVICE is set).
Signed-off-by: Matt Evans matt@ozlabs.org
I thought this was going to be just a code move and new Kconfig option, but it involves a little more than that, e.g., adding pci_p2pdma_release_pool() and tweaking the RCU synchronization.
If possible, it would be nice to do that refactoring in a smaller preliminary patch so it's easier to review/bisect/etc and make this one a pure code move.
I guess CONFIG_PCI_P2PDMA_CORE selects just part 1 ("Locate providers via pcim_p2pdma_provider()"), right?
+++ b/drivers/pci/p2pdma.h @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/*
- PCI Peer 2 Peer DMA support.
- */
+#ifndef _PCI_P2PDMA_H +#define _PCI_P2PDMA_H
+#include <linux/genalloc.h> +#include <linux/pci-p2pdma.h> +#include <linux/xarray.h>
+struct pci_p2pdma {
- struct gen_pool *pool;
- bool p2pmem_published;
- struct xarray map_types;
- struct p2pdma_provider mem[PCI_STD_NUM_BARS];
+};
+#ifdef CONFIG_PCI_P2PDMA +void pci_p2pdma_release_pool(struct pci_dev *pdev, struct pci_p2pdma *p2pdma); +#else +static inline void pci_p2pdma_release_pool(struct pci_dev *pdev, struct pci_p2pdma *p2pdma)
Wrap to fit in 80 columns like the rest of drivers/pci/
+{ +} +#endif
+#endif
Spurious blank line at end.
+++ b/drivers/pci/p2pdma_core.c @@ -0,0 +1,118 @@ +// SPDX-License-Identifier: GPL-2.0 +/*
- PCI Peer 2 Peer DMA support core, providing a bare-bones
In this English text, I think I would spell out "Peer to Peer" instead of relying on the "2" homophone. Same in p2pdma.h.
Hi Bjorn,
On 11/06/2026 17:07, Bjorn Helgaas wrote:
On Wed, Jun 10, 2026 at 04:43:15PM +0100, Matt Evans wrote:
The P2PDMA code currently provides two features under the same CONFIG_PCI_P2PDMA option:
- Locate providers via pcim_p2pdma_provider()
- Manage actual P2P DMA
Some drivers (such as vfio-pci) depend on 1, without having a hard dependency on 2.
A future commit expands the use of DMABUF in vfio-pci for non-P2P scenarios, relying on pcim_p2pdma_provider() always being present. If that depended on CONFIG_PCI_P2PDMA, it would make vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), even when P2P is not needed.
To resolve this, introduce CONFIG_PCI_P2PDMA_CORE and refactor the basic provider functionality into a new p2pdma_core.c file. This is available even if the CONFIG_PCI_P2PDMA feature is disabled (or unavailable due to !CONFIG_ZONE_DEVICE). Then, drivers can enable any additional P2P features with the original CONFIG_PCI_P2PDMA (available when CONFIG_ZONE_DEVICE is set).
Signed-off-by: Matt Evans matt@ozlabs.org
I thought this was going to be just a code move and new Kconfig option, but it involves a little more than that, e.g., adding pci_p2pdma_release_pool() and tweaking the RCU synchronization.
If possible, it would be nice to do that refactoring in a smaller preliminary patch so it's easier to review/bisect/etc and make this one a pure code move.
pci_p2pdma_release_pool() is really part of the split, not a functional change, i.e. factoring the (P2P-only) pool handling out from the rest of the pci_p2pdma_release() actions still needed for the _CORE case.
I guess .._release() could be split in place in a small patch, and then moved alongside the rest in a second.
I guess CONFIG_PCI_P2PDMA_CORE selects just part 1 ("Locate providers via pcim_p2pdma_provider()"), right?
Yes, I can reword that last paragraph of the commit message to make this clearer.
+++ b/drivers/pci/p2pdma.h @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/*
- PCI Peer 2 Peer DMA support.
- */
+#ifndef _PCI_P2PDMA_H +#define _PCI_P2PDMA_H
+#include <linux/genalloc.h> +#include <linux/pci-p2pdma.h> +#include <linux/xarray.h>
+struct pci_p2pdma {
- struct gen_pool *pool;
- bool p2pmem_published;
- struct xarray map_types;
- struct p2pdma_provider mem[PCI_STD_NUM_BARS];
+};
+#ifdef CONFIG_PCI_P2PDMA +void pci_p2pdma_release_pool(struct pci_dev *pdev, struct pci_p2pdma *p2pdma); +#else +static inline void pci_p2pdma_release_pool(struct pci_dev *pdev, struct pci_p2pdma *p2pdma)
Wrap to fit in 80 columns like the rest of drivers/pci/
Fixed.
+{ +} +#endif
+#endif
Spurious blank line at end.
Doh, fixed.
+++ b/drivers/pci/p2pdma_core.c @@ -0,0 +1,118 @@ +// SPDX-License-Identifier: GPL-2.0 +/*
- PCI Peer 2 Peer DMA support core, providing a bare-bones
In this English text, I think I would spell out "Peer to Peer" instead of relying on the "2" homophone. Same in p2pdma.h.
I would've preferred that too, it isn't correct; I was seeking to keep the blurb here (and in p2pdma.h) as close to the original p2pdma.c as possible. I'll correct it in the new files (hyphenated "peer-to-peer" in PCI-SIG language).
Thanks,
Matt
On Wed, Jun 10, 2026 at 04:43:15PM +0100, Matt Evans wrote:
The P2PDMA code currently provides two features under the same CONFIG_PCI_P2PDMA option:
- Locate providers via pcim_p2pdma_provider()
- Manage actual P2P DMA
Some drivers (such as vfio-pci) depend on 1, without having a hard dependency on 2.
A future commit expands the use of DMABUF in vfio-pci for non-P2P scenarios, relying on pcim_p2pdma_provider() always being present. If that depended on CONFIG_PCI_P2PDMA, it would make vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), even when P2P is not needed.
To resolve this, introduce CONFIG_PCI_P2PDMA_CORE and refactor the basic provider functionality into a new p2pdma_core.c file. This is available even if the CONFIG_PCI_P2PDMA feature is disabled (or unavailable due to !CONFIG_ZONE_DEVICE). Then, drivers can enable any additional P2P features with the original CONFIG_PCI_P2PDMA (available when CONFIG_ZONE_DEVICE is set).
Signed-off-by: Matt Evans matt@ozlabs.org
MAINTAINERS | 2 +- drivers/pci/Kconfig | 10 ++-- drivers/pci/Makefile | 1 + drivers/pci/p2pdma.c | 109 ++-------------------------------- drivers/pci/p2pdma.h | 29 +++++++++ drivers/pci/p2pdma_core.c | 118 +++++++++++++++++++++++++++++++++++++ include/linux/pci-p2pdma.h | 24 ++++---- include/linux/pci.h | 2 +- 8 files changed, 174 insertions(+), 121 deletions(-) create mode 100644 drivers/pci/p2pdma.h create mode 100644 drivers/pci/p2pdma_core.c
diff --git a/MAINTAINERS b/MAINTAINERS index c2c6d79275c6..b21523b3bd8b 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -20617,7 +20617,7 @@ B: https://bugzilla.kernel.org C: irc://irc.oftc.net/linux-pci T: git git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git F: Documentation/driver-api/pci/p2pdma.rst -F: drivers/pci/p2pdma.c +F: drivers/pci/p2pdma* F: include/linux/pci-p2pdma.h PCI POWER CONTROL diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 33c88432b728..59d70bc84cc9 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -206,11 +206,7 @@ config PCIE_TPH config PCI_P2PDMA bool "PCI peer-to-peer transfer support" depends on ZONE_DEVICE
- #
- # The need for the scatterlist DMA bus address flag means PCI P2PDMA
- # requires 64bit
- #
- depends on 64BIT
- select PCI_P2PDMA_CORE select GENERIC_ALLOCATOR select NEED_SG_DMA_FLAGS help
Nit: Did we drop depends on 64BIT intentionally here? I guess the full PCI_P2PDMA stack still selects NEED_SG_DMA_FLAGS? IIRC, NEED_SG_DMA_FLAGS doesn't select 64BIT?
With the nit (and Bjorn's comments addressed) Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks, Praan
From: Pranjal Shrivastava praan@google.com Sent: Friday, June 12, 2026 2:38 AM
On Wed, Jun 10, 2026 at 04:43:15PM +0100, Matt Evans wrote:
--- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -206,11 +206,7 @@ config PCIE_TPH config PCI_P2PDMA bool "PCI peer-to-peer transfer support" depends on ZONE_DEVICE
- #
- # The need for the scatterlist DMA bus address flag means PCI
P2PDMA
- # requires 64bit
- #
- depends on 64BIT
- select PCI_P2PDMA_CORE select GENERIC_ALLOCATOR select NEED_SG_DMA_FLAGS help
Nit: Did we drop depends on 64BIT intentionally here? I guess the full PCI_P2PDMA stack still selects NEED_SG_DMA_FLAGS? IIRC, NEED_SG_DMA_FLAGS doesn't select 64BIT?
seems that comment is stale. According to the commit msg:
" it would make vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), "
so it sounds a redundant dependency hence is removed.
Hi Kevin, Pranjal, (+Robin, hi!)
On 12/06/2026 04:39, Tian, Kevin wrote:
From: Pranjal Shrivastava praan@google.com Sent: Friday, June 12, 2026 2:38 AM
On Wed, Jun 10, 2026 at 04:43:15PM +0100, Matt Evans wrote:
--- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -206,11 +206,7 @@ config PCIE_TPH config PCI_P2PDMA bool "PCI peer-to-peer transfer support" depends on ZONE_DEVICE
- #
- # The need for the scatterlist DMA bus address flag means PCI
P2PDMA
- # requires 64bit
- #
- depends on 64BIT
- select PCI_P2PDMA_CORE select GENERIC_ALLOCATOR select NEED_SG_DMA_FLAGS help
Nit: Did we drop depends on 64BIT intentionally here? I guess the full PCI_P2PDMA stack still selects NEED_SG_DMA_FLAGS? IIRC, NEED_SG_DMA_FLAGS doesn't select 64BIT?
seems that comment is stale. According to the commit msg:
" it would make vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), "
so it sounds a redundant dependency hence is removed.
This was intentional. In practice there is still a dependency on 64BIT for PCI_P2PDMA, but it is because of ZONE_DEVICE (and mem hotplug). The key need is PCI_P2PDMA_CORE is available on !64BIT for VFIO, but I didn't see a requirement from PCI_P2PDMA itself (as opposed to its dependencies). If I've missed one, I can put it back...
But NEED_SG_DMA_FLAGS doesn't smell quite right; I see from comments in
af2880ec44021 ("scatterlist: add dedicated config for DMA flags")
that it assumes 64BIT, but it seems to be missing a "depends on 64BIT".
Robin -- should that depend on 64BIT?
Cheers,
Matt
On 12/06/2026 3:31 pm, Matt Evans wrote:
Hi Kevin, Pranjal, (+Robin, hi!)
Oh hey there! :)
On 12/06/2026 04:39, Tian, Kevin wrote:
From: Pranjal Shrivastava praan@google.com Sent: Friday, June 12, 2026 2:38 AM
On Wed, Jun 10, 2026 at 04:43:15PM +0100, Matt Evans wrote:
--- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -206,11 +206,7 @@ config PCIE_TPH config PCI_P2PDMA bool "PCI peer-to-peer transfer support" depends on ZONE_DEVICE
- #
- # The need for the scatterlist DMA bus address flag means PCI
P2PDMA
- # requires 64bit
- #
- depends on 64BIT
- select PCI_P2PDMA_CORE select GENERIC_ALLOCATOR select NEED_SG_DMA_FLAGS help
Nit: Did we drop depends on 64BIT intentionally here? I guess the full PCI_P2PDMA stack still selects NEED_SG_DMA_FLAGS? IIRC, NEED_SG_DMA_FLAGS doesn't select 64BIT?
seems that comment is stale. According to the commit msg:
" it would make vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), "
so it sounds a redundant dependency hence is removed.
This was intentional. In practice there is still a dependency on 64BIT for PCI_P2PDMA, but it is because of ZONE_DEVICE (and mem hotplug). The key need is PCI_P2PDMA_CORE is available on !64BIT for VFIO, but I didn't see a requirement from PCI_P2PDMA itself (as opposed to its dependencies). If I've missed one, I can put it back...
But NEED_SG_DMA_FLAGS doesn't smell quite right; I see from comments in
af2880ec44021 ("scatterlist: add dedicated config for DMA flags")
that it assumes 64BIT, but it seems to be missing a "depends on 64BIT".
Robin -- should that depend on 64BIT?
Indeed, looking at the history it seems like that was overlooked, but it worked out at the time since the only selector of NEED_SG_DMA_FLAGS was PCI_P2PDMA as you say. If we're now generalising then moving the explicit 64BIT dependency to NEED_SG_DMA_FLAGS itself sounds like the right thing to do.
Cheers, Robin.
Heya Robin,
On 23/06/2026 16:48, Robin Murphy wrote:
On 12/06/2026 3:31 pm, Matt Evans wrote:
Hi Kevin, Pranjal, (+Robin, hi!)
Oh hey there! :)
On 12/06/2026 04:39, Tian, Kevin wrote:
From: Pranjal Shrivastava praan@google.com Sent: Friday, June 12, 2026 2:38 AM
On Wed, Jun 10, 2026 at 04:43:15PM +0100, Matt Evans wrote:
--- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -206,11 +206,7 @@ config PCIE_TPH Â config PCI_P2PDMA Â Â Â Â Â bool "PCI peer-to-peer transfer support" Â Â Â Â Â depends on ZONE_DEVICE -Â Â Â # -Â Â Â # The need for the scatterlist DMA bus address flag means PCI
P2PDMA
-Â Â Â # requires 64bit -Â Â Â # -Â Â Â depends on 64BIT +Â Â Â select PCI_P2PDMA_CORE Â Â Â Â Â select GENERIC_ALLOCATOR Â Â Â Â Â select NEED_SG_DMA_FLAGS Â Â Â Â Â help
Nit: Did we drop depends on 64BIT intentionally here? I guess the full PCI_P2PDMA stack still selects NEED_SG_DMA_FLAGS? IIRC, NEED_SG_DMA_FLAGS doesn't select 64BIT?
seems that comment is stale. According to the commit msg:
" it would make vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), "
so it sounds a redundant dependency hence is removed.
This was intentional. In practice there is still a dependency on 64BIT for PCI_P2PDMA, but it is because of ZONE_DEVICE (and mem hotplug). The key need is PCI_P2PDMA_CORE is available on !64BIT for VFIO, but I didn't see a requirement from PCI_P2PDMA itself (as opposed to its dependencies). If I've missed one, I can put it back...
But NEED_SG_DMA_FLAGS doesn't smell quite right; I see from comments in
af2880ec44021 ("scatterlist: add dedicated config for DMA flags")
that it assumes 64BIT, but it seems to be missing a "depends on 64BIT".
Robin -- should that depend on 64BIT?
Indeed, looking at the history it seems like that was overlooked, but it worked out at the time since the only selector of NEED_SG_DMA_FLAGS was PCI_P2PDMA as you say. If we're now generalising then moving the explicit 64BIT dependency to NEED_SG_DMA_FLAGS itself sounds like the right thing to do.
Cheers for confirming. I'll send a patch separate to this series (since the deps work out OK for PCI_P2PDMA for the reasons mentioned).
Ta,
Matt
Hi Robin, me,
On 23/06/2026 16:59, Matt Evans wrote:
Heya Robin,
On 23/06/2026 16:48, Robin Murphy wrote:
On 12/06/2026 3:31 pm, Matt Evans wrote:
Hi Kevin, Pranjal, (+Robin, hi!)
Oh hey there! :)
On 12/06/2026 04:39, Tian, Kevin wrote:
From: Pranjal Shrivastava praan@google.com Sent: Friday, June 12, 2026 2:38 AM
On Wed, Jun 10, 2026 at 04:43:15PM +0100, Matt Evans wrote:
--- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -206,11 +206,7 @@ config PCIE_TPH Â config PCI_P2PDMA Â Â Â Â Â bool "PCI peer-to-peer transfer support" Â Â Â Â Â depends on ZONE_DEVICE -Â Â Â # -Â Â Â # The need for the scatterlist DMA bus address flag means PCI
P2PDMA
-Â Â Â # requires 64bit -Â Â Â # -Â Â Â depends on 64BIT +Â Â Â select PCI_P2PDMA_CORE Â Â Â Â Â select GENERIC_ALLOCATOR Â Â Â Â Â select NEED_SG_DMA_FLAGS Â Â Â Â Â help
Nit: Did we drop depends on 64BIT intentionally here? I guess the full PCI_P2PDMA stack still selects NEED_SG_DMA_FLAGS? IIRC, NEED_SG_DMA_FLAGS doesn't select 64BIT?
seems that comment is stale. According to the commit msg:
" it would make vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), "
so it sounds a redundant dependency hence is removed.
This was intentional. In practice there is still a dependency on 64BIT for PCI_P2PDMA, but it is because of ZONE_DEVICE (and mem hotplug). The key need is PCI_P2PDMA_CORE is available on !64BIT for VFIO, but I didn't see a requirement from PCI_P2PDMA itself (as opposed to its dependencies). If I've missed one, I can put it back...
But NEED_SG_DMA_FLAGS doesn't smell quite right; I see from comments in
af2880ec44021 ("scatterlist: add dedicated config for DMA flags")
that it assumes 64BIT, but it seems to be missing a "depends on 64BIT".
Robin -- should that depend on 64BIT?
Indeed, looking at the history it seems like that was overlooked, but it worked out at the time since the only selector of NEED_SG_DMA_FLAGS was PCI_P2PDMA as you say. If we're now generalising then moving the explicit 64BIT dependency to NEED_SG_DMA_FLAGS itself sounds like the right thing to do.
Cheers for confirming. I'll send a patch separate to this series (since the deps work out OK for PCI_P2PDMA for the reasons mentioned).
I think we were wrong, NEED_SG_DMA_FLAGS doesn't _need_ 64BIT.
Other than P2PDMA, the other consumer of NEED_SG_DMA_FLAGS is IOMMU_DMA, and turns out if one builds an i386 kernel with INTEL_IOMMU (or some other configs, like Xen) then NEED_SG_DMA_FLAGS is enabled on 32-bit builds too.
The scatterlist.h comments af2880ec44021 touched are just saying that _since_ P2PDMA depends on 64BIT, there _is_ circumstantial padding so let's use it for flags. It doesn't require 64BIT.
For example struct scatterlist isn't pushed over some special (e.g. power-of-two) size when NEED_SG_DMA_FLAGS is enabled on 32-bit; I can't find a reason it should be prevented on 32-bit builds (and found cases above in which it is already enabled in them).
So I won't change the NEED_SG_DMA_FLAGS dependencies after all. Sorry for the noise -- as ever if I've missed something do please explain.
(I'll continue with removing the P2PDMA dependency on 64BIT because it seems P2PDMA's dependencies rely on 64BIT, though P2PDMA itself doesn't.)
Thanks,
Matt
On 23/06/2026 6:47 pm, Matt Evans wrote:
Hi Robin, me,
On 23/06/2026 16:59, Matt Evans wrote:
Heya Robin,
On 23/06/2026 16:48, Robin Murphy wrote:
On 12/06/2026 3:31 pm, Matt Evans wrote:
Hi Kevin, Pranjal, (+Robin, hi!)
Oh hey there! :)
On 12/06/2026 04:39, Tian, Kevin wrote:
From: Pranjal Shrivastava praan@google.com Sent: Friday, June 12, 2026 2:38 AM
On Wed, Jun 10, 2026 at 04:43:15PM +0100, Matt Evans wrote: > --- a/drivers/pci/Kconfig > +++ b/drivers/pci/Kconfig > @@ -206,11 +206,7 @@ config PCIE_TPH > Â config PCI_P2PDMA > Â Â Â Â Â bool "PCI peer-to-peer transfer support" > Â Â Â Â Â depends on ZONE_DEVICE > -Â Â Â # > -Â Â Â # The need for the scatterlist DMA bus address flag means PCI P2PDMA > -Â Â Â # requires 64bit > -Â Â Â # > -Â Â Â depends on 64BIT > +Â Â Â select PCI_P2PDMA_CORE > Â Â Â Â Â select GENERIC_ALLOCATOR > Â Â Â Â Â select NEED_SG_DMA_FLAGS > Â Â Â Â Â help
Nit: Did we drop depends on 64BIT intentionally here? I guess the full PCI_P2PDMA stack still selects NEED_SG_DMA_FLAGS? IIRC, NEED_SG_DMA_FLAGS doesn't select 64BIT?
seems that comment is stale. According to the commit msg:
" it would make vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), "
so it sounds a redundant dependency hence is removed.
This was intentional. In practice there is still a dependency on 64BIT for PCI_P2PDMA, but it is because of ZONE_DEVICE (and mem hotplug). The key need is PCI_P2PDMA_CORE is available on !64BIT for VFIO, but I didn't see a requirement from PCI_P2PDMA itself (as opposed to its dependencies). If I've missed one, I can put it back...
But NEED_SG_DMA_FLAGS doesn't smell quite right; I see from comments in
af2880ec44021 ("scatterlist: add dedicated config for DMA flags")
that it assumes 64BIT, but it seems to be missing a "depends on 64BIT".
Robin -- should that depend on 64BIT?
Indeed, looking at the history it seems like that was overlooked, but it worked out at the time since the only selector of NEED_SG_DMA_FLAGS was PCI_P2PDMA as you say. If we're now generalising then moving the explicit 64BIT dependency to NEED_SG_DMA_FLAGS itself sounds like the right thing to do.
Cheers for confirming. I'll send a patch separate to this series (since the deps work out OK for PCI_P2PDMA for the reasons mentioned).
I think we were wrong, NEED_SG_DMA_FLAGS doesn't _need_ 64BIT.
Other than P2PDMA, the other consumer of NEED_SG_DMA_FLAGS is IOMMU_DMA, and turns out if one builds an i386 kernel with INTEL_IOMMU (or some other configs, like Xen) then NEED_SG_DMA_FLAGS is enabled on 32-bit builds too.
The scatterlist.h comments af2880ec44021 touched are just saying that _since_ P2PDMA depends on 64BIT, there _is_ circumstantial padding so let's use it for flags. It doesn't require 64BIT.
For example struct scatterlist isn't pushed over some special (e.g. power-of-two) size when NEED_SG_DMA_FLAGS is enabled on 32-bit; I can't find a reason it should be prevented on 32-bit builds (and found cases above in which it is already enabled in them).
So I won't change the NEED_SG_DMA_FLAGS dependencies after all. Sorry for the noise -- as ever if I've missed something do please explain.
Ah, I was also taking the comments at face value, plus conflating the memory of the original series which did stash the P2P flag in bit 2 of the page_link pointer - in my defence, it's far too hot for my brain to work properly :)
Doing a bit more archaeology, the separate dma_flags field was a late suggestion just before the series was merged, so I guess it's actually just that the comments never got updated accordingly. Sorry for perpetuating the confusion!
Cheers, Robin.
(I'll continue with removing the P2PDMA dependency on 64BIT because it seems P2PDMA's dependencies rely on 64BIT, though P2PDMA itself doesn't.)
Thanks,
Matt
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
The P2PDMA code currently provides two features under the same CONFIG_PCI_P2PDMA option:
- Locate providers via pcim_p2pdma_provider()
- Manage actual P2P DMA
Some drivers (such as vfio-pci) depend on 1, without having a hard dependency on 2.
A future commit expands the use of DMABUF in vfio-pci for non-P2P scenarios, relying on pcim_p2pdma_provider() always being present. If that depended on CONFIG_PCI_P2PDMA, it would make vfio-pci only available if CONFIG_ZONE_DEVICE is present (e.g. 64-bit systems), even when P2P is not needed.
To resolve this, introduce CONFIG_PCI_P2PDMA_CORE and refactor the basic provider functionality into a new p2pdma_core.c file. This is available even if the CONFIG_PCI_P2PDMA feature is disabled (or unavailable due to !CONFIG_ZONE_DEVICE). Then, drivers can enable any additional P2P features with the original CONFIG_PCI_P2PDMA (available when CONFIG_ZONE_DEVICE is set).
Signed-off-by: Matt Evans matt@ozlabs.org
Reviewed-by: Kevin Tian kevin.tian@intel.com
Add vfio_pci_dma_buf_find_pfn(), which a VMA fault handler can use to find a PFN.
This supports multi-range DMABUFs, which typically would be used to represent scattered spans but might even represent overlapping or aliasing spans of PFNs.
Because this is intended to be used in vfio_pci_core.c, we also need to expose the struct vfio_pci_dma_buf in the vfio_pci_priv.h header.
Signed-off-by: Matt Evans matt@ozlabs.org --- drivers/vfio/pci/vfio_pci_dmabuf.c | 137 ++++++++++++++++++++++++++--- drivers/vfio/pci/vfio_pci_priv.h | 20 +++++ 2 files changed, 144 insertions(+), 13 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index c16f460c01d6..9e5e865f6fb6 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -9,19 +9,6 @@
MODULE_IMPORT_NS("DMA_BUF");
-struct vfio_pci_dma_buf { - struct dma_buf *dmabuf; - struct vfio_pci_core_device *vdev; - struct list_head dmabufs_elm; - size_t size; - struct phys_vec *phys_vec; - struct p2pdma_provider *provider; - u32 nr_ranges; - struct kref kref; - struct completion comp; - u8 revoked : 1; -}; - static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, struct dma_buf_attachment *attachment) { @@ -106,6 +93,130 @@ static const struct dma_buf_ops vfio_pci_dmabuf_ops = { .release = vfio_pci_dma_buf_release, };
+int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *priv, + struct vm_area_struct *vma, + unsigned long address, + unsigned int order, + unsigned long *out_pfn) +{ + /* + * Given a VMA (start, end, pgoffs) and a fault address, + * search the corresponding DMABUF's phys_vec[] to find the + * range representing the address's offset into the VMA, and + * its PFN. + * + * The phys_vec[] ranges represent contiguous spans of VAs + * upwards from the buffer offset 0; the actual PFNs might be + * in any order, overlap/alias, etc. Calculate an offset of + * the desired page given VMA start/pgoff and address, then + * search upwards from 0 to find which span contains it. + * + * On success, a valid PFN for a page sized by 'order' is + * returned into out_pfn. + * + * Failure occurs if: + * - The page would cross the edge of the VMA + * - The page isn't entirely contained within a range + * - We find a range, but the final PFN isn't aligned to the + * requested order. + * + * (Upon failure, the caller is expected to try again with a + * smaller order; the tests above will always succeed for + * order=0 as the limit case.) + * + * It's suboptimal if DMABUFs are created with neigbouring + * ranges that are physically contiguous, since hugepages + * can't straddle range boundaries. (The construction of the + * ranges vector should merge such ranges.) + * + * Finally, vma_pgoff_adjust is used for a DMABUF representing + * a VFIO BAR mmap, which is created from the start of the + * offset region. + */ + + const unsigned long pagesize = PAGE_SIZE << order; + unsigned long vma_off = ((vma->vm_pgoff - priv->vma_pgoff_adjust) << + PAGE_SHIFT) & VFIO_PCI_OFFSET_MASK; + unsigned long rounded_page_addr = ALIGN_DOWN(address, pagesize); + unsigned long rounded_page_end = rounded_page_addr + pagesize; + unsigned long page_buf_offset; + unsigned long page_buf_offset_end; + unsigned long range_buf_offset = 0; + unsigned int i; + + if (rounded_page_addr < vma->vm_start || rounded_page_end > vma->vm_end) { + if (order > 0) + return -EAGAIN; + + /* A fault address outside of the VMA is absurd. */ + WARN(1, "Fault addr 0x%lx outside VMA 0x%lx-0x%lx\n", + address, vma->vm_start, vma->vm_end); + return -EFAULT; + } + + /* + * page_buff_offset[_end] is the span of DMABUF offsets + * corresponding to the faulting page: + */ + if (unlikely(check_add_overflow(rounded_page_addr - vma->vm_start, + vma_off, &page_buf_offset) || + check_add_overflow(page_buf_offset, pagesize, + &page_buf_offset_end))) + return -EFAULT; + + for (i = 0; i < priv->nr_ranges; i++) { + size_t range_len = priv->phys_vec[i].len; + phys_addr_t range_start = priv->phys_vec[i].paddr; + + /* + * If the current range starts after the page's span, + * this and any future range won't match. Bail early. + */ + if (page_buf_offset_end <= range_buf_offset) + break; + + if (page_buf_offset >= range_buf_offset && + page_buf_offset_end <= range_buf_offset + range_len) { + /* + * The faulting page is wholly contained + * within the span represented by the range. + * Validate PFN alignment for the order: + */ + unsigned long pfn = (range_start + page_buf_offset - + range_buf_offset) / PAGE_SIZE; + + if (IS_ALIGNED(pfn, 1 << order)) { + *out_pfn = pfn; + return 0; + } + /* Retry with smaller order */ + return -EAGAIN; + } + range_buf_offset += range_len; + } + + /* + * A hugepage straddling a range boundary will fail to match a + * range, but the address will (eventually) match when retried + * with a smaller page. + */ + if (order > 0) + return -EAGAIN; + + /* + * If we get here, the address fell outside of the span + * represented by the (concatenated) ranges. Setup of a + * mapping must ensure that the VMA is <= the total size of + * the ranges, so this should never happen. But, if it does, + * force SIGBUS for the access and warn. + */ + WARN_ONCE(1, "No range for addr 0x%lx, order %d: VMA 0x%lx-0x%lx pgoff 0x%lx, %u ranges, size 0x%zx\n", + address, order, vma->vm_start, vma->vm_end, vma->vm_pgoff, + priv->nr_ranges, priv->size); + + return -EFAULT; +} + /* * This is a temporary "private interconnect" between VFIO DMABUF and iommufd. * It allows the two co-operating drivers to exchange the physical address of diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index fca9d0dfac90..c8f6f959056a 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -23,6 +23,20 @@ struct vfio_pci_ioeventfd { bool test_mem; };
+struct vfio_pci_dma_buf { + struct dma_buf *dmabuf; + struct vfio_pci_core_device *vdev; + struct list_head dmabufs_elm; + size_t size; + struct phys_vec *phys_vec; + struct p2pdma_provider *provider; + u32 nr_ranges; + struct kref kref; + struct completion comp; + unsigned long vma_pgoff_adjust; + u8 revoked : 1; +}; + bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev); void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
@@ -114,6 +128,12 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; }
+int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf, + struct vm_area_struct *vma, + unsigned long address, + unsigned int order, + unsigned long *out_pfn); + #ifdef CONFIG_VFIO_PCI_DMABUF int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, struct vfio_device_feature_dma_buf __user *arg,
On Wed, Jun 10, 2026 at 04:43:16PM +0100, Matt Evans wrote:
Add vfio_pci_dma_buf_find_pfn(), which a VMA fault handler can use to find a PFN.
This supports multi-range DMABUFs, which typically would be used to represent scattered spans but might even represent overlapping or aliasing spans of PFNs.
Because this is intended to be used in vfio_pci_core.c, we also need to expose the struct vfio_pci_dma_buf in the vfio_pci_priv.h header.
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/vfio_pci_dmabuf.c | 137 ++++++++++++++++++++++++++--- drivers/vfio/pci/vfio_pci_priv.h | 20 +++++ 2 files changed, 144 insertions(+), 13 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index c16f460c01d6..9e5e865f6fb6 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -9,19 +9,6 @@ MODULE_IMPORT_NS("DMA_BUF"); -struct vfio_pci_dma_buf {
- struct dma_buf *dmabuf;
- struct vfio_pci_core_device *vdev;
- struct list_head dmabufs_elm;
- size_t size;
- struct phys_vec *phys_vec;
- struct p2pdma_provider *provider;
- u32 nr_ranges;
- struct kref kref;
- struct completion comp;
- u8 revoked : 1;
-};
static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, struct dma_buf_attachment *attachment) { @@ -106,6 +93,130 @@ static const struct dma_buf_ops vfio_pci_dmabuf_ops = { .release = vfio_pci_dma_buf_release, }; +int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *priv,
struct vm_area_struct *vma,unsigned long address,
Nit: s/address/fault_addr ?
unsigned int order,unsigned long *out_pfn)+{
- /*
* Given a VMA (start, end, pgoffs) and a fault address,* search the corresponding DMABUF's phys_vec[] to find the* range representing the address's offset into the VMA, and* its PFN.** The phys_vec[] ranges represent contiguous spans of VAs* upwards from the buffer offset 0; the actual PFNs might be* in any order, overlap/alias, etc. Calculate an offset of* the desired page given VMA start/pgoff and address, then* search upwards from 0 to find which span contains it.** On success, a valid PFN for a page sized by 'order' is* returned into out_pfn.** Failure occurs if:* - The page would cross the edge of the VMA* - The page isn't entirely contained within a range* - We find a range, but the final PFN isn't aligned to the* requested order.** (Upon failure, the caller is expected to try again with a* smaller order; the tests above will always succeed for* order=0 as the limit case.)** It's suboptimal if DMABUFs are created with neigbouring* ranges that are physically contiguous, since hugepages* can't straddle range boundaries. (The construction of the* ranges vector should merge such ranges.)** Finally, vma_pgoff_adjust is used for a DMABUF representing* a VFIO BAR mmap, which is created from the start of the* offset region.*/- const unsigned long pagesize = PAGE_SIZE << order;
- unsigned long vma_off = ((vma->vm_pgoff - priv->vma_pgoff_adjust) <<
PAGE_SHIFT) & VFIO_PCI_OFFSET_MASK;- unsigned long rounded_page_addr = ALIGN_DOWN(address, pagesize);
- unsigned long rounded_page_end = rounded_page_addr + pagesize;
- unsigned long page_buf_offset;
- unsigned long page_buf_offset_end;
- unsigned long range_buf_offset = 0;
- unsigned int i;
- if (rounded_page_addr < vma->vm_start || rounded_page_end > vma->vm_end) {
if (order > 0)return -EAGAIN;/* A fault address outside of the VMA is absurd. */WARN(1, "Fault addr 0x%lx outside VMA 0x%lx-0x%lx\n",address, vma->vm_start, vma->vm_end);
This could flood dmesg if triggered repeatedly by userspace :( Since a fault outside the VMA is an invalid access that already results in a SIGBUS, we could probably avoid the WARN here? Perhaps pr_warn_ratelimited() should suffice?
return -EFAULT;- }
- /*
* page_buff_offset[_end] is the span of DMABUF offsets* corresponding to the faulting page:*/- if (unlikely(check_add_overflow(rounded_page_addr - vma->vm_start,
vma_off, &page_buf_offset) ||check_add_overflow(page_buf_offset, pagesize,&page_buf_offset_end)))return -EFAULT;- for (i = 0; i < priv->nr_ranges; i++) {
size_t range_len = priv->phys_vec[i].len;phys_addr_t range_start = priv->phys_vec[i].paddr;/** If the current range starts after the page's span,* this and any future range won't match. Bail early.*/if (page_buf_offset_end <= range_buf_offset)break;if (page_buf_offset >= range_buf_offset &&page_buf_offset_end <= range_buf_offset + range_len) {/** The faulting page is wholly contained* within the span represented by the range.* Validate PFN alignment for the order:*/unsigned long pfn = (range_start + page_buf_offset -range_buf_offset) / PAGE_SIZE;
Minor nit: I'm aware that decent compilers convert pow(2) divides to >> However, we seem to be using `>> PAGE_SHIFT` across vfio-pci. E.g.:
return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff; unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
Let's consider using the same pattern?
if (IS_ALIGNED(pfn, 1 << order)) {*out_pfn = pfn;return 0;}/* Retry with smaller order */return -EAGAIN;}range_buf_offset += range_len;- }
- /*
* A hugepage straddling a range boundary will fail to match a* range, but the address will (eventually) match when retried* with a smaller page.*/- if (order > 0)
return -EAGAIN;- /*
* If we get here, the address fell outside of the span* represented by the (concatenated) ranges. Setup of a
Nit: double space before "Setup" and "But" below.
* mapping must ensure that the VMA is <= the total size of* the ranges, so this should never happen. But, if it does,* force SIGBUS for the access and warn.*/- WARN_ONCE(1, "No range for addr 0x%lx, order %d: VMA 0x%lx-0x%lx pgoff 0x%lx, %u ranges, size 0x%zx\n",
address, order, vma->vm_start, vma->vm_end, vma->vm_pgoff,priv->nr_ranges, priv->size);- return -EFAULT;
The fall-through logic at the end feels a bit redundant.
If we've exhausted the phys_vec list without finding a match, returning -EAGAIN for order > 0 seems like the correct fallback behavior.
However, the subsequent WARN_ONCE for the order == 0 seems unnecessary? An out-of-bounds access is an error that should simply return -EFAULT (converting to SIGBUS) without polluting the kernel log with stackdumps? Can we instead convert this to a pr_warn or something? Something like:
ret = order ? -EAGAIN : -EFAULT;
if (ret == -EFAULT) pr_warn_ratelimited("No range for addr 0x%lx...\n", address);
return ret;
(with appropriate comments)
Thanks, Praan
On Thu, 11 Jun 2026 20:30:32 +0000 Pranjal Shrivastava praan@google.com wrote:
On Wed, Jun 10, 2026 at 04:43:16PM +0100, Matt Evans wrote:
- /*
* If we get here, the address fell outside of the span* represented by the (concatenated) ranges. Setup of aNit: double space before "Setup" and "But" below.
Some of us old school'ers consider this proper writing style ;)
Alex
* mapping must ensure that the VMA is <= the total size of* the ranges, so this should never happen. But, if it does,* force SIGBUS for the access and warn.*/- WARN_ONCE(1, "No range for addr 0x%lx, order %d: VMA 0x%lx-0x%lx pgoff 0x%lx, %u ranges, size 0x%zx\n",
address, order, vma->vm_start, vma->vm_end, vma->vm_pgoff,priv->nr_ranges, priv->size);- return -EFAULT;
On Fri, Jun 12, 2026 at 11:37:35AM -0600, Alex Williamson wrote:
On Thu, 11 Jun 2026 20:30:32 +0000 Pranjal Shrivastava praan@google.com wrote:
On Wed, Jun 10, 2026 at 04:43:16PM +0100, Matt Evans wrote:
- /*
* If we get here, the address fell outside of the span* represented by the (concatenated) ranges. Setup of aNit: double space before "Setup" and "But" below.
Some of us old school'ers consider this proper writing style ;)
Ohh okay, I wasn't aware, I withdraw that nit then :) and come on, I don't think you're old school! :)
Thanks, Praan
Hi Praan,
On 11/06/2026 21:30, Pranjal Shrivastava wrote:
On Wed, Jun 10, 2026 at 04:43:16PM +0100, Matt Evans wrote:
Add vfio_pci_dma_buf_find_pfn(), which a VMA fault handler can use to find a PFN.
This supports multi-range DMABUFs, which typically would be used to represent scattered spans but might even represent overlapping or aliasing spans of PFNs.
Because this is intended to be used in vfio_pci_core.c, we also need to expose the struct vfio_pci_dma_buf in the vfio_pci_priv.h header.
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/vfio_pci_dmabuf.c | 137 ++++++++++++++++++++++++++--- drivers/vfio/pci/vfio_pci_priv.h | 20 +++++ 2 files changed, 144 insertions(+), 13 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index c16f460c01d6..9e5e865f6fb6 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -9,19 +9,6 @@ MODULE_IMPORT_NS("DMA_BUF"); -struct vfio_pci_dma_buf {
- struct dma_buf *dmabuf;
- struct vfio_pci_core_device *vdev;
- struct list_head dmabufs_elm;
- size_t size;
- struct phys_vec *phys_vec;
- struct p2pdma_provider *provider;
- u32 nr_ranges;
- struct kref kref;
- struct completion comp;
- u8 revoked : 1;
-};
static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, struct dma_buf_attachment *attachment) { @@ -106,6 +93,130 @@ static const struct dma_buf_ops vfio_pci_dmabuf_ops = { .release = vfio_pci_dma_buf_release, }; +int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *priv,
struct vm_area_struct *vma,unsigned long address,Nit: s/address/fault_addr ?
Sure.
unsigned int order,unsigned long *out_pfn)+{
- /*
* Given a VMA (start, end, pgoffs) and a fault address,* search the corresponding DMABUF's phys_vec[] to find the* range representing the address's offset into the VMA, and* its PFN.** The phys_vec[] ranges represent contiguous spans of VAs* upwards from the buffer offset 0; the actual PFNs might be* in any order, overlap/alias, etc. Calculate an offset of* the desired page given VMA start/pgoff and address, then* search upwards from 0 to find which span contains it.** On success, a valid PFN for a page sized by 'order' is* returned into out_pfn.** Failure occurs if:* - The page would cross the edge of the VMA* - The page isn't entirely contained within a range* - We find a range, but the final PFN isn't aligned to the* requested order.** (Upon failure, the caller is expected to try again with a* smaller order; the tests above will always succeed for* order=0 as the limit case.)** It's suboptimal if DMABUFs are created with neigbouring* ranges that are physically contiguous, since hugepages* can't straddle range boundaries. (The construction of the* ranges vector should merge such ranges.)** Finally, vma_pgoff_adjust is used for a DMABUF representing* a VFIO BAR mmap, which is created from the start of the* offset region.*/- const unsigned long pagesize = PAGE_SIZE << order;
- unsigned long vma_off = ((vma->vm_pgoff - priv->vma_pgoff_adjust) <<
PAGE_SHIFT) & VFIO_PCI_OFFSET_MASK;- unsigned long rounded_page_addr = ALIGN_DOWN(address, pagesize);
- unsigned long rounded_page_end = rounded_page_addr + pagesize;
- unsigned long page_buf_offset;
- unsigned long page_buf_offset_end;
- unsigned long range_buf_offset = 0;
- unsigned int i;
- if (rounded_page_addr < vma->vm_start || rounded_page_end > vma->vm_end) {
if (order > 0)return -EAGAIN;/* A fault address outside of the VMA is absurd. */WARN(1, "Fault addr 0x%lx outside VMA 0x%lx-0x%lx\n",address, vma->vm_start, vma->vm_end);This could flood dmesg if triggered repeatedly by userspace :( Since a fault outside the VMA is an invalid access that already results in a SIGBUS, we could probably avoid the WARN here? Perhaps pr_warn_ratelimited() should suffice?
I'm OK moving to a pr_warn_ratelimited(). Note though that this case is "genuinely impossible" currently and the check exists in case something changes elsewhere. (Re your flood comment, am I missing a way for userspace to trigger this? The scenario is a faulthandler for a VMA getting a VA outside the bounds of that VMA; such a fault address wouldn't match that VMA.)
return -EFAULT;- }
- /*
* page_buff_offset[_end] is the span of DMABUF offsets* corresponding to the faulting page:*/- if (unlikely(check_add_overflow(rounded_page_addr - vma->vm_start,
vma_off, &page_buf_offset) ||check_add_overflow(page_buf_offset, pagesize,&page_buf_offset_end)))return -EFAULT;- for (i = 0; i < priv->nr_ranges; i++) {
size_t range_len = priv->phys_vec[i].len;phys_addr_t range_start = priv->phys_vec[i].paddr;/** If the current range starts after the page's span,* this and any future range won't match. Bail early.*/if (page_buf_offset_end <= range_buf_offset)break;if (page_buf_offset >= range_buf_offset &&page_buf_offset_end <= range_buf_offset + range_len) {/** The faulting page is wholly contained* within the span represented by the range.* Validate PFN alignment for the order:*/unsigned long pfn = (range_start + page_buf_offset -range_buf_offset) / PAGE_SIZE;Minor nit: I'm aware that decent compilers convert pow(2) divides to >> However, we seem to be using `>> PAGE_SHIFT` across vfio-pci. E.g.:
return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff; unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
Let's consider using the same pattern?
(Do you know of a compiler that both builds the kernel and does NOT perform this transformation? I am confident that resulting object code will be OK here.)
In an earlier revision I was using shifts but they were fairly messy compared to this expression, which arises from a request by Jason.
if (IS_ALIGNED(pfn, 1 << order)) {*out_pfn = pfn;return 0;}/* Retry with smaller order */return -EAGAIN;}range_buf_offset += range_len;- }
- /*
* A hugepage straddling a range boundary will fail to match a* range, but the address will (eventually) match when retried* with a smaller page.*/- if (order > 0)
return -EAGAIN;- /*
* If we get here, the address fell outside of the span* represented by the (concatenated) ranges. Setup of aNit: double space before "Setup" and "But" below.
I liked Alex's response :-) This is common practice for monospaced text since increasing inter-sentence spacing helps readability in paragraph blocks (see Documentation/ for many examples ...).
* mapping must ensure that the VMA is <= the total size of* the ranges, so this should never happen. But, if it does,* force SIGBUS for the access and warn.*/- WARN_ONCE(1, "No range for addr 0x%lx, order %d: VMA 0x%lx-0x%lx pgoff 0x%lx, %u ranges, size 0x%zx\n",
address, order, vma->vm_start, vma->vm_end, vma->vm_pgoff,priv->nr_ranges, priv->size);- return -EFAULT;
The fall-through logic at the end feels a bit redundant.
If we've exhausted the phys_vec list without finding a match, returning -EAGAIN for order > 0 seems like the correct fallback behavior.
This path can happen (for order > 0) e.g. mis-alignment of VA versus the PFN, i.e. is likely...
However, the subsequent WARN_ONCE for the order == 0 seems unnecessary? An out-of-bounds access is an error that should simply return -EFAULT (converting to SIGBUS) without polluting the kernel log with stackdumps?
...but the only way this can happen, for order == 0, is if the VMA extends beyond the underlying resource. For example, if the VMA is larger than the DMABUF size (the total length of phys ranges set up inside the DMABUF). Both VFIO BAR mmap() and a DMABUF mmap() disallow mapping off the end of the underlying resource. That is, this also "cannot happen" but if logic changes elsewhere then we will really want to know about hitting this case -- the check is not redundant.
Still, it doesn't need a regdump/backtrace (at least while this is only called from one spot), so a pr_warn_* is better.
Thanks,
Matt
Can we instead convert this to a pr_warn or something? Something like:
ret = order ? -EAGAIN : -EFAULT;
if (ret == -EFAULT) pr_warn_ratelimited("No range for addr 0x%lx...\n", address);
return ret;
(with appropriate comments)
Thanks, Praan
On Mon, Jun 15, 2026 at 03:27:01PM +0100, Matt Evans wrote:
Hi Matt,
On 11/06/2026 21:30, Pranjal Shrivastava wrote:
On Wed, Jun 10, 2026 at 04:43:16PM +0100, Matt Evans wrote:
Add vfio_pci_dma_buf_find_pfn(), which a VMA fault handler can use to find a PFN.
This supports multi-range DMABUFs, which typically would be used to represent scattered spans but might even represent overlapping or aliasing spans of PFNs.
Because this is intended to be used in vfio_pci_core.c, we also need to expose the struct vfio_pci_dma_buf in the vfio_pci_priv.h header.
Signed-off-by: Matt Evans matt@ozlabs.org
[...]
- const unsigned long pagesize = PAGE_SIZE << order;
- unsigned long vma_off = ((vma->vm_pgoff - priv->vma_pgoff_adjust) <<
PAGE_SHIFT) & VFIO_PCI_OFFSET_MASK;- unsigned long rounded_page_addr = ALIGN_DOWN(address, pagesize);
- unsigned long rounded_page_end = rounded_page_addr + pagesize;
- unsigned long page_buf_offset;
- unsigned long page_buf_offset_end;
- unsigned long range_buf_offset = 0;
- unsigned int i;
- if (rounded_page_addr < vma->vm_start || rounded_page_end > vma->vm_end) {
if (order > 0)return -EAGAIN;/* A fault address outside of the VMA is absurd. */WARN(1, "Fault addr 0x%lx outside VMA 0x%lx-0x%lx\n",address, vma->vm_start, vma->vm_end);This could flood dmesg if triggered repeatedly by userspace :( Since a fault outside the VMA is an invalid access that already results in a SIGBUS, we could probably avoid the WARN here? Perhaps pr_warn_ratelimited() should suffice?
I'm OK moving to a pr_warn_ratelimited(). Note though that this case is "genuinely impossible" currently and the check exists in case something changes elsewhere. (Re your flood comment, am I missing a way for userspace to trigger this? The scenario is a faulthandler for a VMA getting a VA outside the bounds of that VMA; such a fault address wouldn't match that VMA.)
I should've been explicit, I guess I worded it wrong, my bad. I didn't mean that a user-space could hit this on it's own. However, I meant to say if there's a bug in core mm during some future work, dmesg being flooded by stackdumps gets messy (especially during dev) as the underlying reason might get missed in the flood. Hence, I prefer moving to pr_warn_ratelimited.
return -EFAULT;- }
- /*
* page_buff_offset[_end] is the span of DMABUF offsets* corresponding to the faulting page:*/- if (unlikely(check_add_overflow(rounded_page_addr - vma->vm_start,
vma_off, &page_buf_offset) ||check_add_overflow(page_buf_offset, pagesize,&page_buf_offset_end)))return -EFAULT;- for (i = 0; i < priv->nr_ranges; i++) {
size_t range_len = priv->phys_vec[i].len;phys_addr_t range_start = priv->phys_vec[i].paddr;/** If the current range starts after the page's span,* this and any future range won't match. Bail early.*/if (page_buf_offset_end <= range_buf_offset)break;if (page_buf_offset >= range_buf_offset &&page_buf_offset_end <= range_buf_offset + range_len) {/** The faulting page is wholly contained* within the span represented by the range.* Validate PFN alignment for the order:*/unsigned long pfn = (range_start + page_buf_offset -range_buf_offset) / PAGE_SIZE;Minor nit: I'm aware that decent compilers convert pow(2) divides to >> However, we seem to be using `>> PAGE_SHIFT` across vfio-pci. E.g.:
return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff; unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
Let's consider using the same pattern?
(Do you know of a compiler that both builds the kernel and does NOT perform this transformation? I am confident that resulting object code will be OK here.)
I guess most of the modern compiler do, I was just referring to the style across the file. I don't have any strong opinion.
In an earlier revision I was using shifts but they were fairly messy compared to this expression, which arises from a request by Jason.
Yea, looking back at Jason's comment [1], I think he was mainly pointing out that the common factor (PAGE_SIZE) could be grouped together. But again, no strong feeling about this, just picked up a pattern across the file. If it breaks on some compiler we can fix it later..
if (IS_ALIGNED(pfn, 1 << order)) {*out_pfn = pfn;return 0;}/* Retry with smaller order */return -EAGAIN;}range_buf_offset += range_len;- }
- /*
* A hugepage straddling a range boundary will fail to match a* range, but the address will (eventually) match when retried* with a smaller page.*/- if (order > 0)
return -EAGAIN;- /*
* If we get here, the address fell outside of the span* represented by the (concatenated) ranges. Setup of aNit: double space before "Setup" and "But" below.
I liked Alex's response :-) This is common practice for monospaced text since increasing inter-sentence spacing helps readability in paragraph blocks (see Documentation/ for many examples ...).
Ack. :)
* mapping must ensure that the VMA is <= the total size of* the ranges, so this should never happen. But, if it does,* force SIGBUS for the access and warn.*/- WARN_ONCE(1, "No range for addr 0x%lx, order %d: VMA 0x%lx-0x%lx pgoff 0x%lx, %u ranges, size 0x%zx\n",
address, order, vma->vm_start, vma->vm_end, vma->vm_pgoff,priv->nr_ranges, priv->size);- return -EFAULT;
The fall-through logic at the end feels a bit redundant.
If we've exhausted the phys_vec list without finding a match, returning -EAGAIN for order > 0 seems like the correct fallback behavior.
This path can happen (for order > 0) e.g. mis-alignment of VA versus the PFN, i.e. is likely...
However, the subsequent WARN_ONCE for the order == 0 seems unnecessary? An out-of-bounds access is an error that should simply return -EFAULT (converting to SIGBUS) without polluting the kernel log with stackdumps?
...but the only way this can happen, for order == 0, is if the VMA extends beyond the underlying resource. For example, if the VMA is larger than the DMABUF size (the total length of phys ranges set up inside the DMABUF). Both VFIO BAR mmap() and a DMABUF mmap() disallow mapping off the end of the underlying resource. That is, this also "cannot happen" but if logic changes elsewhere then we will really want to know about hitting this case -- the check is not redundant.
I didn't mean to imply that the path itself is impossible or won't happen.. I just meant that the logic / structure felt a bit redundant at the end of the function..
Instead of having the separate `if (order > 0)` block falling through to the base case, I suggest it could be cleaner as:
ret = order ? -EAGAIN : -EFAULT;
if (ret == -EFAULT) pr_warn_ratelimited(...);
return ret;
But again, that's a preference. I'd leave that to your judgement.
Still, it doesn't need a regdump/backtrace (at least while this is only called from one spot), so a pr_warn_* is better.
Ack.
Thanks, Praan
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
+int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *priv,
struct vm_area_struct *vma,unsigned long address,unsigned int order,unsigned long *out_pfn)+{
- /*
* Given a VMA (start, end, pgoffs) and a fault address,* search the corresponding DMABUF's phys_vec[] to find the* range representing the address's offset into the VMA, and* its PFN.** The phys_vec[] ranges represent contiguous spans of VAs* upwards from the buffer offset 0; the actual PFNs might be* in any order, overlap/alias, etc. Calculate an offset of* the desired page given VMA start/pgoff and address, then* search upwards from 0 to find which span contains it.** On success, a valid PFN for a page sized by 'order' is* returned into out_pfn.** Failure occurs if:* - The page would cross the edge of the VMA* - The page isn't entirely contained within a range* - We find a range, but the final PFN isn't aligned to the* requested order.** (Upon failure, the caller is expected to try again with a* smaller order; the tests above will always succeed for* order=0 as the limit case.)** It's suboptimal if DMABUFs are created with neigbouring
s/neigbouring/neighboring/
* ranges that are physically contiguous, since hugepages* can't straddle range boundaries. (The construction of the* ranges vector should merge such ranges.)
though the field is called 'phys_vec', removing 'vector' in description is clearer here.
** Finally, vma_pgoff_adjust is used for a DMABUF representing* a VFIO BAR mmap, which is created from the start of the* offset region.
Elaborate it a little bit that the vm_pgoff is already counted in paddr of phys_vec so it should be skipped when finding the pfn.
*/- const unsigned long pagesize = PAGE_SIZE << order;
- unsigned long vma_off = ((vma->vm_pgoff - priv->vma_pgoff_adjust)
<<
PAGE_SHIFT) & VFIO_PCI_OFFSET_MASK;- unsigned long rounded_page_addr = ALIGN_DOWN(address,
pagesize);
- unsigned long rounded_page_end = rounded_page_addr + pagesize;
- unsigned long page_buf_offset;
- unsigned long page_buf_offset_end;
what about "fault_offset[_end]"? page_buf is a bit confusing.
- unsigned long range_buf_offset = 0;
could this be called 'range_start' then the 'range_start' in latter loop is renamed to 'phys_start'?
Not strong... just feel such naming helps me understand the logic easier
- unsigned int i;
- if (rounded_page_addr < vma->vm_start || rounded_page_end >
vma->vm_end) {
if (order > 0)return -EAGAIN;/* A fault address outside of the VMA is absurd. */WARN(1, "Fault addr 0x%lx outside VMA 0x%lx-0x%lx\n",address, vma->vm_start, vma->vm_end);return -EFAULT;- }
- /*
* page_buff_offset[_end] is the span of DMABUF offsets* corresponding to the faulting page:*/
if the naming is kept then s/page_buff_offset/page_buf_offset/
otherwise,
Reviewed-by: Kevin Tian kevin.tian@intel.com
Hi Kevin,
On 12/06/2026 09:42, Tian, Kevin wrote:
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
+int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *priv,
struct vm_area_struct *vma,unsigned long address,unsigned int order,unsigned long *out_pfn)+{
- /*
* Given a VMA (start, end, pgoffs) and a fault address,* search the corresponding DMABUF's phys_vec[] to find the* range representing the address's offset into the VMA, and* its PFN.** The phys_vec[] ranges represent contiguous spans of VAs* upwards from the buffer offset 0; the actual PFNs might be* in any order, overlap/alias, etc. Calculate an offset of* the desired page given VMA start/pgoff and address, then* search upwards from 0 to find which span contains it.** On success, a valid PFN for a page sized by 'order' is* returned into out_pfn.** Failure occurs if:* - The page would cross the edge of the VMA* - The page isn't entirely contained within a range* - We find a range, but the final PFN isn't aligned to the* requested order.** (Upon failure, the caller is expected to try again with a* smaller order; the tests above will always succeed for* order=0 as the limit case.)** It's suboptimal if DMABUFs are created with neigbourings/neigbouring/neighboring/
Ah, not a typo. :) That is en_GB and AFAIK is permitted.
* ranges that are physically contiguous, since hugepages* can't straddle range boundaries. (The construction of the* ranges vector should merge such ranges.)though the field is called 'phys_vec', removing 'vector' in description is clearer here.
Fair, reworded.
** Finally, vma_pgoff_adjust is used for a DMABUF representing* a VFIO BAR mmap, which is created from the start of the* offset region.Elaborate it a little bit that the vm_pgoff is already counted in paddr of phys_vec so it should be skipped when finding the pfn.
OK! Expanded this paragraph slightly to explain that vma_pgoff_adjust avoids double-accounting, and that a BAR mmap() DMABUF is created such that the start of the VMA (even with an offset) equals the start of the DMABUF and equals the start of the physical range.
*/- const unsigned long pagesize = PAGE_SIZE << order;
- unsigned long vma_off = ((vma->vm_pgoff - priv->vma_pgoff_adjust)
<<
PAGE_SHIFT) & VFIO_PCI_OFFSET_MASK;- unsigned long rounded_page_addr = ALIGN_DOWN(address,
pagesize);
- unsigned long rounded_page_end = rounded_page_addr + pagesize;
- unsigned long page_buf_offset;
- unsigned long page_buf_offset_end;
what about "fault_offset[_end]"? page_buf is a bit confusing.
I went round several times with these names, thanks for the input. Just tried it out and your suggestion is clearer.
- unsigned long range_buf_offset = 0;
could this be called 'range_start' then the 'range_start' in latter loop is renamed to 'phys_start'?
Not strong... just feel such naming helps me understand the logic easier
Anything that helps helps, thanks. I ended up renaming this to range_start_offset (as offset is IMHO important).
I'm a fan of diagrams but this is too large to include in a comment. But for posterity on the list, and using the new names, an illustration of a DMABUF with 3 ranges in phys_vec, where a mapping's faulting page offset lies in range [1]:
fault_addr--+ v VMA +-----------------+----------+-----------------+ | | Faulting | | | | (hg)page | | | | | | |---- vma_off ---->+-----------------+----------+-----------------+ | . . | . . |--------- fault_offset ------------>. . DMABUF +-------------------------+---------------------------+--------------+ | phys_vec[0] | phys_vec[1] . | phys_vec[2] | | .paddr | . . | | | .len | . . | | +-------------------------+---------------------------+--------------+ 0 : . . : L |-- range_start_offset -->: . . -->: range_len : . . : V . . : +----------+----------+-----+ |.paddr | PFN | | | | | | | | | | +----------+----------+-----+ P
P = paddr + (fault_offset - range_start_offset) L = sum(phys_vec[0...2].len)
- unsigned int i;
- if (rounded_page_addr < vma->vm_start || rounded_page_end >
vma->vm_end) {
if (order > 0)return -EAGAIN;/* A fault address outside of the VMA is absurd. */WARN(1, "Fault addr 0x%lx outside VMA 0x%lx-0x%lx\n",address, vma->vm_start, vma->vm_end);return -EFAULT;- }
- /*
* page_buff_offset[_end] is the span of DMABUF offsets* corresponding to the faulting page:*/if the naming is kept then s/page_buff_offset/page_buf_offset/
otherwise,
Reviewed-by: Kevin Tian kevin.tian@intel.com
Thank you,
Matt
From: Matt Evans matt@ozlabs.org Sent: Tuesday, June 16, 2026 2:04 AM
On 12/06/2026 09:42, Tian, Kevin wrote:
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
+int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *priv,
struct vm_area_struct *vma,unsigned long address,unsigned int order,unsigned long *out_pfn)+{
- /*
* Given a VMA (start, end, pgoffs) and a fault address,* search the corresponding DMABUF's phys_vec[] to find the* range representing the address's offset into the VMA, and* its PFN.** The phys_vec[] ranges represent contiguous spans of VAs* upwards from the buffer offset 0; the actual PFNs might be* in any order, overlap/alias, etc. Calculate an offset of* the desired page given VMA start/pgoff and address, then* search upwards from 0 to find which span contains it.** On success, a valid PFN for a page sized by 'order' is* returned into out_pfn.** Failure occurs if:* - The page would cross the edge of the VMA* - The page isn't entirely contained within a range* - We find a range, but the final PFN isn't aligned to the* requested order.** (Upon failure, the caller is expected to try again with a* smaller order; the tests above will always succeed for* order=0 as the limit case.)** It's suboptimal if DMABUFs are created with neigbourings/neigbouring/neighboring/
Ah, not a typo. :) That is en_GB and AFAIK is permitted.
I guess you meant 'neighbouring' and 'neighboring' are both valid.
but here lacking a 'h' should be a typo? :)
*/- const unsigned long pagesize = PAGE_SIZE << order;
- unsigned long vma_off = ((vma->vm_pgoff - priv->vma_pgoff_adjust)
<<
PAGE_SHIFT) & VFIO_PCI_OFFSET_MASK;- unsigned long rounded_page_addr = ALIGN_DOWN(address,
pagesize);
- unsigned long rounded_page_end = rounded_page_addr + pagesize;
- unsigned long page_buf_offset;
- unsigned long page_buf_offset_end;
what about "fault_offset[_end]"? page_buf is a bit confusing.
I went round several times with these names, thanks for the input. Just tried it out and your suggestion is clearer.
- unsigned long range_buf_offset = 0;
could this be called 'range_start' then the 'range_start' in latter loop is renamed to 'phys_start'?
Not strong... just feel such naming helps me understand the logic easier
Anything that helps helps, thanks. I ended up renaming this to range_start_offset (as offset is IMHO important).
I'm a fan of diagrams but this is too large to include in a comment. But for posterity on the list, and using the new names, an illustration of a DMABUF with 3 ranges in phys_vec, where a mapping's faulting page offset lies in range [1]:
fault_addr--+ v VMA +-----------------+----------+-----------------+ | | Faulting | | | | (hg)page | | | | | ||---- vma_off ---->+-----------------+----------+-----------------+ | . . | . . |--------- fault_offset ------------>. . DMABUF +-------------------------+---------------------------+--------------+ | phys_vec[0] | phys_vec[1] . | phys_vec[2] | | .paddr | . . | | | .len | . . | | +-------------------------+---------------------------+--------------+ 0 : . . : L |-- range_start_offset -->: . . -->: range_len : . . : V . . : +----------+----------+-----+ |.paddr | PFN | | | | | | | | | | +----------+----------+-----+ P
P = paddr + (fault_offset - range_start_offset) L = sum(phys_vec[0...2].len)
yes, much clearer now.
Hi Kevin,
On 16/06/2026 10:28, Tian, Kevin wrote:
From: Matt Evans matt@ozlabs.org Sent: Tuesday, June 16, 2026 2:04 AM
On 12/06/2026 09:42, Tian, Kevin wrote:
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
+int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *priv,
struct vm_area_struct *vma,unsigned long address,unsigned int order,unsigned long *out_pfn)+{ [...]
** It's suboptimal if DMABUFs are created with neigbourings/neigbouring/neighboring/
Ah, not a typo. 🙂 That is en_GB and AFAIK is permitted.
I guess you meant 'neighbouring' and 'neighboring' are both valid.
but here lacking a 'h' should be a typo? 🙂
:D Doh! Sure enough, you're right and that very much _is_ a typo after all. :)
Thanks,
Matt
This helper, vfio_pci_core_mmap_prep_dmabuf(), creates a single-range DMABUF for the purpose of mapping a PCI BAR. This is used in a future commit by VFIO's ordinary mmap() path.
This function transfers ownership of the VFIO device fd to the DMABUF, which fput()s when it's released.
Refactor the existing vfio_pci_core_feature_dma_buf() to split out export code common to the two paths, VFIO_DEVICE_FEATURE_DMA_BUF and this new VFIO_BAR mmap().
Signed-off-by: Matt Evans matt@ozlabs.org --- drivers/vfio/pci/vfio_pci_dmabuf.c | 142 +++++++++++++++++++++++------ drivers/vfio/pci/vfio_pci_priv.h | 5 + 2 files changed, 117 insertions(+), 30 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 9e5e865f6fb6..21180acf9600 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -82,6 +82,8 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) up_write(&priv->vdev->memory_lock); vfio_device_put_registration(&priv->vdev->vdev); } + if (priv->vfile) + fput(priv->vfile); kfree(priv->phys_vec); kfree(priv); } @@ -217,6 +219,45 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *priv, return -EFAULT; }
+/* + * Create a DMABUF corresponding to priv, add it to vdev->dmabufs list + * for tracking (meaning cleanup or revocation will zap it), and take + * a vfio_device registration. + */ +static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev, + struct vfio_pci_dma_buf *priv, u32 flags) +{ + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); + + if (!vfio_device_try_get_registration(&vdev->vdev)) + return -ENODEV; + + exp_info.ops = &vfio_pci_dmabuf_ops; + exp_info.size = priv->size; + exp_info.flags = flags; + exp_info.priv = priv; + + priv->dmabuf = dma_buf_export(&exp_info); + if (IS_ERR(priv->dmabuf)) { + vfio_device_put_registration(&vdev->vdev); + return PTR_ERR(priv->dmabuf); + } + + kref_init(&priv->kref); + init_completion(&priv->comp); + + /* dma_buf_put() now frees priv */ + INIT_LIST_HEAD(&priv->dmabufs_elm); + down_write(&vdev->memory_lock); + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->revoked = !__vfio_pci_memory_enabled(vdev); + list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs); + dma_resv_unlock(priv->dmabuf->resv); + up_write(&vdev->memory_lock); + + return 0; +} + /* * This is a temporary "private interconnect" between VFIO DMABUF and iommufd. * It allows the two co-operating drivers to exchange the physical address of @@ -335,7 +376,6 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, { struct vfio_device_feature_dma_buf get_dma_buf = {}; struct vfio_region_dma_range *dma_ranges; - DEFINE_DMA_BUF_EXPORT_INFO(exp_info); struct vfio_pci_dma_buf *priv; size_t length; int ret; @@ -395,34 +435,9 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, kfree(dma_ranges); dma_ranges = NULL;
- if (!vfio_device_try_get_registration(&vdev->vdev)) { - ret = -ENODEV; + ret = vfio_pci_dmabuf_export(vdev, priv, get_dma_buf.open_flags); + if (ret) goto err_free_phys; - } - - exp_info.ops = &vfio_pci_dmabuf_ops; - exp_info.size = priv->size; - exp_info.flags = get_dma_buf.open_flags; - exp_info.priv = priv; - - priv->dmabuf = dma_buf_export(&exp_info); - if (IS_ERR(priv->dmabuf)) { - ret = PTR_ERR(priv->dmabuf); - goto err_dev_put; - } - - kref_init(&priv->kref); - init_completion(&priv->comp); - - /* dma_buf_put() now frees priv */ - INIT_LIST_HEAD(&priv->dmabufs_elm); - down_write(&vdev->memory_lock); - dma_resv_lock(priv->dmabuf->resv, NULL); - priv->revoked = !__vfio_pci_memory_enabled(vdev); - list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs); - dma_resv_unlock(priv->dmabuf->resv); - up_write(&vdev->memory_lock); - /* * dma_buf_fd() consumes the reference, when the file closes the dmabuf * will be released. @@ -433,8 +448,6 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
return ret;
-err_dev_put: - vfio_device_put_registration(&vdev->vdev); err_free_phys: kfree(priv->phys_vec); err_free_priv: @@ -444,6 +457,75 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, return ret; }
+int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, + struct vm_area_struct *vma, + u64 phys_start, u64 req_len, + unsigned int res_index) +{ + struct vfio_pci_dma_buf *priv; + unsigned long vma_pgoff = vma->vm_pgoff & (VFIO_PCI_OFFSET_MASK >> PAGE_SHIFT); + int ret; + + priv = kzalloc_obj(*priv); + if (!priv) + return -ENOMEM; + + priv->phys_vec = kzalloc_obj(*priv->phys_vec); + if (!priv->phys_vec) { + ret = -ENOMEM; + goto err_free_priv; + } + + /* + * The DMABUF begins from the mmap()'s BAR offset, i.e. the + * start of the VMA corresponds to byte 0 of the DMABUF and + * byte (vma_pgoff << PAGE_SHIFT) of the BAR. + * + * vfio_pci_dma_buf_find_pfn() reverses this offset using + * vma_pgoff_adjust, so that ultimately a fault's offset from + * the start of the _VMA_ has a consistent usage whether the + * VMA originates from an mmap() of the VFIO device here or a + * direct DMABUF mmap(). + */ + priv->vdev = vdev; + priv->size = req_len; + priv->nr_ranges = 1; + priv->vma_pgoff_adjust = vma_pgoff; + priv->provider = pcim_p2pdma_provider(vdev->pdev, res_index); + if (!priv->provider) { + ret = -EINVAL; + goto err_free_phys; + } + + priv->phys_vec[0].paddr = phys_start + ((u64)vma_pgoff << PAGE_SHIFT); + priv->phys_vec[0].len = priv->size; + + ret = vfio_pci_dmabuf_export(vdev, priv, O_CLOEXEC | O_RDWR); + if (ret) + goto err_free_phys; + + /* + * Ownership of the DMABUF file transfers to the VMA so that + * other users can locate the DMABUF via a VA. Ownership of + * the original VFIO device file being mmap()ed transfers to + * priv, and is put when the DMABUF is released. This + * intentionally does not use get_file()/vma_set_file() + * because the references are already held, and ownership + * moves. + */ + priv->vfile = vma->vm_file; + vma->vm_file = priv->dmabuf->file; + vma->vm_private_data = priv; + + return 0; + +err_free_phys: + kfree(priv->phys_vec); +err_free_priv: + kfree(priv); + return ret; +} + void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) { struct vfio_pci_dma_buf *priv; diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index c8f6f959056a..06dc0fd3e230 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -30,6 +30,7 @@ struct vfio_pci_dma_buf { size_t size; struct phys_vec *phys_vec; struct p2pdma_provider *provider; + struct file *vfile; u32 nr_ranges; struct kref kref; struct completion comp; @@ -133,6 +134,10 @@ int vfio_pci_dma_buf_find_pfn(struct vfio_pci_dma_buf *vpdmabuf, unsigned long address, unsigned int order, unsigned long *out_pfn); +int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, + struct vm_area_struct *vma, + u64 phys_start, u64 req_len, + unsigned int res_index);
#ifdef CONFIG_VFIO_PCI_DMABUF int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
This helper, vfio_pci_core_mmap_prep_dmabuf(), creates a single-range DMABUF for the purpose of mapping a PCI BAR. This is used in a future commit by VFIO's ordinary mmap() path.
This function transfers ownership of the VFIO device fd to the DMABUF, which fput()s when it's released.
Refactor the existing vfio_pci_core_feature_dma_buf() to split out export code common to the two paths, VFIO_DEVICE_FEATURE_DMA_BUF and this new VFIO_BAR mmap().
Signed-off-by: Matt Evans matt@ozlabs.org
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Wed, Jun 10, 2026 at 04:43:17PM +0100, Matt Evans wrote:
This helper, vfio_pci_core_mmap_prep_dmabuf(), creates a single-range DMABUF for the purpose of mapping a PCI BAR. This is used in a future commit by VFIO's ordinary mmap() path.
This function transfers ownership of the VFIO device fd to the DMABUF, which fput()s when it's released.
Refactor the existing vfio_pci_core_feature_dma_buf() to split out export code common to the two paths, VFIO_DEVICE_FEATURE_DMA_BUF and this new VFIO_BAR mmap().
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/vfio_pci_dmabuf.c | 142 +++++++++++++++++++++++------ drivers/vfio/pci/vfio_pci_priv.h | 5 + 2 files changed, 117 insertions(+), 30 deletions(-)
[...]
- /*
* Ownership of the DMABUF file transfers to the VMA so that* other users can locate the DMABUF via a VA. Ownership of* the original VFIO device file being mmap()ed transfers to* priv, and is put when the DMABUF is released. This* intentionally does not use get_file()/vma_set_file()* because the references are already held, and ownership* moves.*/- priv->vfile = vma->vm_file;
- vma->vm_file = priv->dmabuf->file;
- vma->vm_private_data = priv;
I appreciate this comment. Thanks for being clear!
Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks, Praan
Convert the VFIO device fd fops->mmap to create a DMABUF representing the BAR mapping, and make the VMA fault handler look up PFNs from the corresponding DMABUF. This supports future code mmap()ing BAR DMABUFs, and iommufd work to support Type1 P2P.
First, vfio_pci_core_mmap() uses the new vfio_pci_core_mmap_prep_dmabuf() helper to export a DMABUF representing a single BAR range. Then, the vfio_pci_mmap_huge_fault() callback is updated to understand revoked buffers, and uses the new vfio_pci_dma_buf_find_pfn() helper to determine the PFN for a given fault address.
Now that the VFIO DMABUFs can be mmap()ed, vfio_pci_dma_buf_move() zaps PTEs (used on the revocation and cleanup paths).
CONFIG_VFIO_PCI_CORE now unconditionally depends on CONFIG_DMA_SHARED_BUFFER and CONFIG_PCI_P2PDMA_CORE. The CONFIG_VFIO_PCI_DMABUF feature conditionally includes support for VFIO_DEVICE_FEATURE_DMA_BUF, depending on the availability of CONFIG_PCI_P2PDMA.
Signed-off-by: Matt Evans matt@ozlabs.org --- drivers/vfio/pci/Kconfig | 5 +- drivers/vfio/pci/Makefile | 3 +- drivers/vfio/pci/vfio_pci_core.c | 75 +++++++++++++++++++----------- drivers/vfio/pci/vfio_pci_dmabuf.c | 12 +++++ drivers/vfio/pci/vfio_pci_priv.h | 11 +---- 5 files changed, 67 insertions(+), 39 deletions(-)
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 296bf01e185e..67a2ae1fbc04 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -6,6 +6,8 @@ config VFIO_PCI_CORE tristate select VFIO_VIRQFD select IRQ_BYPASS_MANAGER + select PCI_P2PDMA_CORE + select DMA_SHARED_BUFFER
config VFIO_PCI_INTX def_bool y if !S390 @@ -56,7 +58,8 @@ config VFIO_PCI_ZDEV_KVM To enable s390x KVM vfio-pci extensions, say Y.
config VFIO_PCI_DMABUF - def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER + def_bool y if PCI_P2PDMA + depends on VFIO_PCI_CORE
source "drivers/vfio/pci/mlx5/Kconfig"
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 6138f1bf241d..881452ea89be 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -1,8 +1,7 @@ # SPDX-License-Identifier: GPL-2.0-only
-vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o +vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio_pci_dmabuf.o vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o -vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
vfio-pci-y := vfio_pci.o diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 041243a84d81..f9636d8f9e2a 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1683,18 +1683,6 @@ void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 c up_write(&vdev->memory_lock); }
-static unsigned long vma_to_pfn(struct vm_area_struct *vma) -{ - struct vfio_pci_core_device *vdev = vma->vm_private_data; - int index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT); - u64 pgoff; - - pgoff = vma->vm_pgoff & - ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1); - - return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff; -} - vm_fault_t vfio_pci_vmf_insert_pfn(struct vfio_pci_core_device *vdev, struct vm_fault *vmf, unsigned long pfn, @@ -1722,23 +1710,42 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf, unsigned int order) { struct vm_area_struct *vma = vmf->vma; - struct vfio_pci_core_device *vdev = vma->vm_private_data; - unsigned long addr = vmf->address & ~((PAGE_SIZE << order) - 1); - unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT; - unsigned long pfn = vma_to_pfn(vma) + pgoff; - vm_fault_t ret = VM_FAULT_FALLBACK; - - if (is_aligned_for_order(vma, addr, pfn, order)) { - scoped_guard(rwsem_read, &vdev->memory_lock) - ret = vfio_pci_vmf_insert_pfn(vdev, vmf, pfn, order); + struct vfio_pci_dma_buf *priv = vma->vm_private_data; + struct vfio_pci_core_device *vdev; + unsigned long pfn = 0; + vm_fault_t ret = VM_FAULT_SIGBUS; + + /* + * We can rely on the existence of both a DMABUF (priv) and + * the VFIO device it was exported from (vdev). This fault's + * VMA was established using vfio_pci_core_mmap_prep_dmabuf() + * which transfers ownership of the VFIO device fd to the + * DMABUF, and so the VFIO device is held open because the + * VMA's vm_file (DMABUF) is open. + * + * Since vfio_pci_dma_buf_cleanup() cannot have happened, + * vdev must be valid; we can take memory_lock. + */ + vdev = priv->vdev; + + scoped_guard(rwsem_read, &vdev->memory_lock) { + if (!priv->revoked) { + int pres = vfio_pci_dma_buf_find_pfn(priv, vma, + vmf->address, + order, &pfn); + + if (pres == 0) + ret = vfio_pci_vmf_insert_pfn(vdev, vmf, + pfn, order); + else if (pres == -EAGAIN) + ret = VM_FAULT_FALLBACK; + } }
dev_dbg_ratelimited(&vdev->pdev->dev, - "%s(,order = %d) BAR %ld page offset 0x%lx: 0x%x\n", - __func__, order, - vma->vm_pgoff >> - (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT), - pgoff, (unsigned int)ret); + "%s(order = %d) PFN 0x%lx, VA 0x%lx, pgoff 0x%lx: 0x%x\n", + __func__, order, pfn, vmf->address, + vma->vm_pgoff, (unsigned int)ret);
return ret; } @@ -1763,6 +1770,7 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma unsigned int index; u64 phys_len, req_len, pgoff, req_start; void __iomem *bar_io; + int ret;
index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
@@ -1802,7 +1810,20 @@ int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma if (IS_ERR(bar_io)) return PTR_ERR(bar_io);
- vma->vm_private_data = vdev; + /* + * Create a DMABUF with a single range corresponding to this + * mapping, and wire it into vma->vm_private_data. The VMA's + * vm_file becomes that of the DMABUF, and the DMABUF takes + * ownership of the VFIO device file (put upon DMABUF + * release). This maintains the behaviour of a live VMA + * mapping holding the VFIO device file open. + */ + ret = vfio_pci_core_mmap_prep_dmabuf(vdev, vma, + pci_resource_start(pdev, index), + req_len, index); + if (ret) + return ret; + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 21180acf9600..2fd3629789bf 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -9,6 +9,7 @@
MODULE_IMPORT_NS("DMA_BUF");
+#ifdef CONFIG_VFIO_PCI_DMABUF static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, struct dma_buf_attachment *attachment) { @@ -25,6 +26,7 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
return 0; } +#endif /* CONFIG_VFIO_PCI_DMABUF */
static void vfio_pci_dma_buf_done(struct kref *kref) { @@ -89,7 +91,9 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) }
static const struct dma_buf_ops vfio_pci_dmabuf_ops = { +#ifdef CONFIG_VFIO_PCI_DMABUF .attach = vfio_pci_dma_buf_attach, +#endif .map_dma_buf = vfio_pci_dma_buf_map, .unmap_dma_buf = vfio_pci_dma_buf_unmap, .release = vfio_pci_dma_buf_release, @@ -258,6 +262,7 @@ static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev, return 0; }
+#ifdef CONFIG_VFIO_PCI_DMABUF /* * This is a temporary "private interconnect" between VFIO DMABUF and iommufd. * It allows the two co-operating drivers to exchange the physical address of @@ -456,6 +461,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, kfree(dma_ranges); return ret; } +#endif /* CONFIG_VFIO_PCI_DMABUF */
int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, struct vm_area_struct *vma, @@ -532,6 +538,10 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) struct vfio_pci_dma_buf *tmp;
lockdep_assert_held_write(&vdev->memory_lock); + /* + * Holding memory_lock ensures a racing VMA fault observes + * priv->revoked properly. + */
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { if (!get_file_active(&priv->dmabuf->file)) @@ -549,6 +559,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) if (revoked) { kref_put(&priv->kref, vfio_pci_dma_buf_done); wait_for_completion(&priv->comp); + unmap_mapping_range(priv->dmabuf->file->f_mapping, + 0, priv->size, 1); /* * Re-arm the registered kref reference and the * completion so the post-revoke state matches the diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index 06dc0fd3e230..d38e1b98b2e9 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -138,13 +138,13 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, struct vm_area_struct *vma, u64 phys_start, u64 req_len, unsigned int res_index); +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
#ifdef CONFIG_VFIO_PCI_DMABUF int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, struct vfio_device_feature_dma_buf __user *arg, size_t argsz); -void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); -void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); #else static inline int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, @@ -153,13 +153,6 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, { return -ENOTTY; } -static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) -{ -} -static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, - bool revoked) -{ -} #endif
#endif
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
Convert the VFIO device fd fops->mmap to create a DMABUF representing the BAR mapping, and make the VMA fault handler look up PFNs from the corresponding DMABUF. This supports future code mmap()ing BAR DMABUFs, and iommufd work to support Type1 P2P.
First, vfio_pci_core_mmap() uses the new vfio_pci_core_mmap_prep_dmabuf() helper to export a DMABUF representing a single BAR range. Then, the vfio_pci_mmap_huge_fault() callback is updated to understand revoked buffers, and uses the new vfio_pci_dma_buf_find_pfn() helper to determine the PFN for a given fault address.
Now that the VFIO DMABUFs can be mmap()ed, vfio_pci_dma_buf_move() zaps PTEs (used on the revocation and cleanup paths).
CONFIG_VFIO_PCI_CORE now unconditionally depends on CONFIG_DMA_SHARED_BUFFER and CONFIG_PCI_P2PDMA_CORE. The CONFIG_VFIO_PCI_DMABUF feature conditionally includes support for VFIO_DEVICE_FEATURE_DMA_BUF, depending on the availability of CONFIG_PCI_P2PDMA.
Signed-off-by: Matt Evans matt@ozlabs.org
Reviewed-by: Kevin Tian kevin.tian@intel.com
with a nit:
- vma->vm_private_data = vdev;
- /*
* Create a DMABUF with a single range corresponding to this* mapping, and wire it into vma->vm_private_data. The VMA's* vm_file becomes that of the DMABUF, and the DMABUF takes* ownership of the VFIO device file (put upon DMABUF* release). This maintains the behaviour of a live VMA* mapping holding the VFIO device file open.*/- ret = vfio_pci_core_mmap_prep_dmabuf(vdev, vma,
pci_resource_start(pdev, index),req_len, index);
the comment is redundant as it's about internal logic of the callee and is well covered by the comment there.
Hi Kevin,
On 12/06/2026 09:46, Tian, Kevin wrote:
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
Convert the VFIO device fd fops->mmap to create a DMABUF representing the BAR mapping, and make the VMA fault handler look up PFNs from the corresponding DMABUF. This supports future code mmap()ing BAR DMABUFs, and iommufd work to support Type1 P2P.
First, vfio_pci_core_mmap() uses the new vfio_pci_core_mmap_prep_dmabuf() helper to export a DMABUF representing a single BAR range. Then, the vfio_pci_mmap_huge_fault() callback is updated to understand revoked buffers, and uses the new vfio_pci_dma_buf_find_pfn() helper to determine the PFN for a given fault address.
Now that the VFIO DMABUFs can be mmap()ed, vfio_pci_dma_buf_move() zaps PTEs (used on the revocation and cleanup paths).
CONFIG_VFIO_PCI_CORE now unconditionally depends on CONFIG_DMA_SHARED_BUFFER and CONFIG_PCI_P2PDMA_CORE. The CONFIG_VFIO_PCI_DMABUF feature conditionally includes support for VFIO_DEVICE_FEATURE_DMA_BUF, depending on the availability of CONFIG_PCI_P2PDMA.
Signed-off-by: Matt Evans matt@ozlabs.org
Reviewed-by: Kevin Tian kevin.tian@intel.com
with a nit:
- vma->vm_private_data = vdev;
- /*
* Create a DMABUF with a single range corresponding to this* mapping, and wire it into vma->vm_private_data. The VMA's* vm_file becomes that of the DMABUF, and the DMABUF takes* ownership of the VFIO device file (put upon DMABUF* release). This maintains the behaviour of a live VMA* mapping holding the VFIO device file open.*/- ret = vfio_pci_core_mmap_prep_dmabuf(vdev, vma,
pci_resource_start(pdev, index),req_len, index);the comment is redundant as it's about internal logic of the callee and is well covered by the comment there.
Thanks on both points! I see what you mean, removed.
Matt
On Wed, Jun 10, 2026 at 04:43:18PM +0100, Matt Evans wrote:
Convert the VFIO device fd fops->mmap to create a DMABUF representing the BAR mapping, and make the VMA fault handler look up PFNs from the corresponding DMABUF. This supports future code mmap()ing BAR DMABUFs, and iommufd work to support Type1 P2P.
First, vfio_pci_core_mmap() uses the new vfio_pci_core_mmap_prep_dmabuf() helper to export a DMABUF representing a single BAR range. Then, the vfio_pci_mmap_huge_fault() callback is updated to understand revoked buffers, and uses the new vfio_pci_dma_buf_find_pfn() helper to determine the PFN for a given fault address.
Now that the VFIO DMABUFs can be mmap()ed, vfio_pci_dma_buf_move() zaps PTEs (used on the revocation and cleanup paths).
CONFIG_VFIO_PCI_CORE now unconditionally depends on CONFIG_DMA_SHARED_BUFFER and CONFIG_PCI_P2PDMA_CORE. The CONFIG_VFIO_PCI_DMABUF feature conditionally includes support for VFIO_DEVICE_FEATURE_DMA_BUF, depending on the availability of CONFIG_PCI_P2PDMA.
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/Kconfig | 5 +- drivers/vfio/pci/Makefile | 3 +- drivers/vfio/pci/vfio_pci_core.c | 75 +++++++++++++++++++----------- drivers/vfio/pci/vfio_pci_dmabuf.c | 12 +++++ drivers/vfio/pci/vfio_pci_priv.h | 11 +---- 5 files changed, 67 insertions(+), 39 deletions(-)
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 296bf01e185e..67a2ae1fbc04 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -6,6 +6,8 @@ config VFIO_PCI_CORE tristate select VFIO_VIRQFD select IRQ_BYPASS_MANAGER
- select PCI_P2PDMA_CORE
- select DMA_SHARED_BUFFER
config VFIO_PCI_INTX def_bool y if !S390 @@ -56,7 +58,8 @@ config VFIO_PCI_ZDEV_KVM To enable s390x KVM vfio-pci extensions, say Y. config VFIO_PCI_DMABUF
- def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
- def_bool y if PCI_P2PDMA
- depends on VFIO_PCI_CORE
source "drivers/vfio/pci/mlx5/Kconfig"
[...]
int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, struct vm_area_struct *vma, @@ -532,6 +538,10 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) struct vfio_pci_dma_buf *tmp; lockdep_assert_held_write(&vdev->memory_lock);
- /*
* Holding memory_lock ensures a racing VMA fault observes* priv->revoked properly.*/
Nit: This comment should appear before the lockdep_assert_held_write() Also, it is slightly verbose.. (not against it though).
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { if (!get_file_active(&priv->dmabuf->file)) @@ -549,6 +559,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) if (revoked) { kref_put(&priv->kref, vfio_pci_dma_buf_done); wait_for_completion(&priv->comp);
unmap_mapping_range(priv->dmabuf->file->f_mapping,0, priv->size, 1);
Have we run this series with lockdep enabled? I guess it'd be nice to check with lockdep once..
Apart from these,
Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks, Praan
Hi Pranjal,
On 12/06/2026 11:41, Pranjal Shrivastava wrote:
On Wed, Jun 10, 2026 at 04:43:18PM +0100, Matt Evans wrote:
Convert the VFIO device fd fops->mmap to create a DMABUF representing the BAR mapping, and make the VMA fault handler look up PFNs from the corresponding DMABUF. This supports future code mmap()ing BAR DMABUFs, and iommufd work to support Type1 P2P.
First, vfio_pci_core_mmap() uses the new vfio_pci_core_mmap_prep_dmabuf() helper to export a DMABUF representing a single BAR range. Then, the vfio_pci_mmap_huge_fault() callback is updated to understand revoked buffers, and uses the new vfio_pci_dma_buf_find_pfn() helper to determine the PFN for a given fault address.
Now that the VFIO DMABUFs can be mmap()ed, vfio_pci_dma_buf_move() zaps PTEs (used on the revocation and cleanup paths).
CONFIG_VFIO_PCI_CORE now unconditionally depends on CONFIG_DMA_SHARED_BUFFER and CONFIG_PCI_P2PDMA_CORE. The CONFIG_VFIO_PCI_DMABUF feature conditionally includes support for VFIO_DEVICE_FEATURE_DMA_BUF, depending on the availability of CONFIG_PCI_P2PDMA.
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/Kconfig | 5 +- drivers/vfio/pci/Makefile | 3 +- drivers/vfio/pci/vfio_pci_core.c | 75 +++++++++++++++++++----------- drivers/vfio/pci/vfio_pci_dmabuf.c | 12 +++++ drivers/vfio/pci/vfio_pci_priv.h | 11 +---- 5 files changed, 67 insertions(+), 39 deletions(-)
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 296bf01e185e..67a2ae1fbc04 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -6,6 +6,8 @@ config VFIO_PCI_CORE tristate select VFIO_VIRQFD select IRQ_BYPASS_MANAGER
- select PCI_P2PDMA_CORE
- select DMA_SHARED_BUFFER
config VFIO_PCI_INTX def_bool y if !S390 @@ -56,7 +58,8 @@ config VFIO_PCI_ZDEV_KVM To enable s390x KVM vfio-pci extensions, say Y. config VFIO_PCI_DMABUF
- def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
- def_bool y if PCI_P2PDMA
- depends on VFIO_PCI_CORE
source "drivers/vfio/pci/mlx5/Kconfig"
[...]
int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, struct vm_area_struct *vma, @@ -532,6 +538,10 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) struct vfio_pci_dma_buf *tmp; lockdep_assert_held_write(&vdev->memory_lock);
- /*
* Holding memory_lock ensures a racing VMA fault observes* priv->revoked properly.*/Nit: This comment should appear before the lockdep_assert_held_write() Also, it is slightly verbose.. (not against it though).
Right, I'll move it. Agree it's wordy but if anyone changes that I want them to "think faulthandler".
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { if (!get_file_active(&priv->dmabuf->file)) @@ -549,6 +559,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) if (revoked) { kref_put(&priv->kref, vfio_pci_dma_buf_done); wait_for_completion(&priv->comp);
unmap_mapping_range(priv->dmabuf->file->f_mapping,0, priv->size, 1);Have we run this series with lockdep enabled? I guess it'd be nice to check with lockdep once..
I've (generally) always run testing of this series with lockdep. (No issues (anymore).)
Apart from these,
Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks!
Matt
On Fri, Jun 12, 2026 at 04:22:12PM +0100, Matt Evans wrote:
Hi Pranjal,
On 12/06/2026 11:41, Pranjal Shrivastava wrote:
On Wed, Jun 10, 2026 at 04:43:18PM +0100, Matt Evans wrote:
Convert the VFIO device fd fops->mmap to create a DMABUF representing the BAR mapping, and make the VMA fault handler look up PFNs from the corresponding DMABUF. This supports future code mmap()ing BAR DMABUFs, and iommufd work to support Type1 P2P.
First, vfio_pci_core_mmap() uses the new vfio_pci_core_mmap_prep_dmabuf() helper to export a DMABUF representing a single BAR range. Then, the vfio_pci_mmap_huge_fault() callback is updated to understand revoked buffers, and uses the new vfio_pci_dma_buf_find_pfn() helper to determine the PFN for a given fault address.
Now that the VFIO DMABUFs can be mmap()ed, vfio_pci_dma_buf_move() zaps PTEs (used on the revocation and cleanup paths).
CONFIG_VFIO_PCI_CORE now unconditionally depends on CONFIG_DMA_SHARED_BUFFER and CONFIG_PCI_P2PDMA_CORE. The CONFIG_VFIO_PCI_DMABUF feature conditionally includes support for VFIO_DEVICE_FEATURE_DMA_BUF, depending on the availability of CONFIG_PCI_P2PDMA.
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/Kconfig | 5 +- drivers/vfio/pci/Makefile | 3 +- drivers/vfio/pci/vfio_pci_core.c | 75 +++++++++++++++++++----------- drivers/vfio/pci/vfio_pci_dmabuf.c | 12 +++++ drivers/vfio/pci/vfio_pci_priv.h | 11 +---- 5 files changed, 67 insertions(+), 39 deletions(-)
Hi Matt,
[...]
int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, struct vm_area_struct *vma, @@ -532,6 +538,10 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) struct vfio_pci_dma_buf *tmp; lockdep_assert_held_write(&vdev->memory_lock);
- /*
* Holding memory_lock ensures a racing VMA fault observes* priv->revoked properly.*/Nit: This comment should appear before the lockdep_assert_held_write() Also, it is slightly verbose.. (not against it though).
Right, I'll move it. Agree it's wordy but if anyone changes that I want them to "think faulthandler".
That's fair I guess.
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { if (!get_file_active(&priv->dmabuf->file)) @@ -549,6 +559,8 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) if (revoked) { kref_put(&priv->kref, vfio_pci_dma_buf_done); wait_for_completion(&priv->comp);
unmap_mapping_range(priv->dmabuf->file->f_mapping,0, priv->size, 1);Have we run this series with lockdep enabled? I guess it'd be nice to check with lockdep once..
I've (generally) always run testing of this series with lockdep. (No issues (anymore).)
That sounds good! Thanks for confirming! :)
Praan
Since converting BAR mmap()s to using DMABUFs, we lose the original device path in /proc/<pid>/maps, lsof, etc. Generate a debug-oriented synthetic 'filename' based on the cdev, plus BDF, plus resource index.
This applies only to BAR mappings via the VFIO device fd, as explicitly-exported DMABUFs are named by userspace via the DMA_BUF_SET_NAME ioctl.
Signed-off-by: Matt Evans matt@ozlabs.org --- drivers/vfio/pci/vfio_pci_dmabuf.c | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 2fd3629789bf..8f7f1b909b94 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -4,6 +4,7 @@ #include <linux/dma-buf-mapping.h> #include <linux/pci-p2pdma.h> #include <linux/dma-resv.h> +#include <uapi/linux/dma-buf.h>
#include "vfio_pci_priv.h"
@@ -470,6 +471,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, { struct vfio_pci_dma_buf *priv; unsigned long vma_pgoff = vma->vm_pgoff & (VFIO_PCI_OFFSET_MASK >> PAGE_SHIFT); + char *bufname; int ret;
priv = kzalloc_obj(*priv); @@ -482,6 +484,20 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, goto err_free_priv; }
+ bufname = kzalloc(DMA_BUF_NAME_LEN, GFP_KERNEL); + if (!bufname) { + ret = -ENOMEM; + goto err_free_phys; + } + + /* + * Maximum size of the friendly debug name is + * vfio1234567890:ffff:ff:3f.7/5 = 30, which fits within + * DMA_BUF_NAME_LEN. + */ + snprintf(bufname, DMA_BUF_NAME_LEN, "%s:%s/%x", + dev_name(&vdev->vdev.device), pci_name(vdev->pdev), res_index); + /* * The DMABUF begins from the mmap()'s BAR offset, i.e. the * start of the VMA corresponds to byte 0 of the DMABUF and @@ -500,7 +516,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, priv->provider = pcim_p2pdma_provider(vdev->pdev, res_index); if (!priv->provider) { ret = -EINVAL; - goto err_free_phys; + goto err_free_name; }
priv->phys_vec[0].paddr = phys_start + ((u64)vma_pgoff << PAGE_SHIFT); @@ -508,7 +524,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
ret = vfio_pci_dmabuf_export(vdev, priv, O_CLOEXEC | O_RDWR); if (ret) - goto err_free_phys; + goto err_free_name;
/* * Ownership of the DMABUF file transfers to the VMA so that @@ -523,8 +539,15 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, vma->vm_file = priv->dmabuf->file; vma->vm_private_data = priv;
+ spin_lock(&priv->dmabuf->name_lock); + kfree(priv->dmabuf->name); + priv->dmabuf->name = bufname; + spin_unlock(&priv->dmabuf->name_lock); + return 0;
+err_free_name: + kfree(bufname); err_free_phys: kfree(priv->phys_vec); err_free_priv:
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
Since converting BAR mmap()s to using DMABUFs, we lose the original device path in /proc/<pid>/maps, lsof, etc. Generate a debug-oriented synthetic 'filename' based on the cdev, plus BDF, plus resource index.
This applies only to BAR mappings via the VFIO device fd, as explicitly-exported DMABUFs are named by userspace via the DMA_BUF_SET_NAME ioctl.
Signed-off-by: Matt Evans matt@ozlabs.org
Reviewed-by: Kevin Tian kevin.tian@intel.com
On Wed, Jun 10, 2026 at 04:43:19PM +0100, Matt Evans wrote:
Since converting BAR mmap()s to using DMABUFs, we lose the original device path in /proc/<pid>/maps, lsof, etc. Generate a debug-oriented synthetic 'filename' based on the cdev, plus BDF, plus resource index.
This applies only to BAR mappings via the VFIO device fd, as explicitly-exported DMABUFs are named by userspace via the DMA_BUF_SET_NAME ioctl.
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/vfio_pci_dmabuf.c | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 2fd3629789bf..8f7f1b909b94 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -4,6 +4,7 @@ #include <linux/dma-buf-mapping.h> #include <linux/pci-p2pdma.h> #include <linux/dma-resv.h> +#include <uapi/linux/dma-buf.h> #include "vfio_pci_priv.h" @@ -470,6 +471,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, { struct vfio_pci_dma_buf *priv; unsigned long vma_pgoff = vma->vm_pgoff & (VFIO_PCI_OFFSET_MASK >> PAGE_SHIFT);
- char *bufname; int ret;
priv = kzalloc_obj(*priv); @@ -482,6 +484,20 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, goto err_free_priv; }
- bufname = kzalloc(DMA_BUF_NAME_LEN, GFP_KERNEL);
- if (!bufname) {
ret = -ENOMEM;goto err_free_phys;- }
- /*
* Maximum size of the friendly debug name is* vfio1234567890:ffff:ff:3f.7/5 = 30, which fits within* DMA_BUF_NAME_LEN.*/- snprintf(bufname, DMA_BUF_NAME_LEN, "%s:%s/%x",
dev_name(&vdev->vdev.device), pci_name(vdev->pdev), res_index);
Nit: Could we instead use:
bufname = kasprintf(GFP_KERNEL, "%s:%s/%x", dev_name(&vdev->vdev.device), pci_name(vdev->pdev), res_index); if (!bufname) ret = -ENOMEM; [...]
/* * The DMABUF begins from the mmap()'s BAR offset, i.e. the * start of the VMA corresponds to byte 0 of the DMABUF and @@ -500,7 +516,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, priv->provider = pcim_p2pdma_provider(vdev->pdev, res_index); if (!priv->provider) { ret = -EINVAL;
goto err_free_phys;
}goto err_free_name;priv->phys_vec[0].paddr = phys_start + ((u64)vma_pgoff << PAGE_SHIFT); @@ -508,7 +524,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, ret = vfio_pci_dmabuf_export(vdev, priv, O_CLOEXEC | O_RDWR); if (ret)
goto err_free_phys;
goto err_free_name;/* * Ownership of the DMABUF file transfers to the VMA so that @@ -523,8 +539,15 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, vma->vm_file = priv->dmabuf->file; vma->vm_private_data = priv;
- spin_lock(&priv->dmabuf->name_lock);
- kfree(priv->dmabuf->name);
- priv->dmabuf->name = bufname;
- spin_unlock(&priv->dmabuf->name_lock);
- return 0;
+err_free_name:
- kfree(bufname);
err_free_phys: kfree(priv->phys_vec); err_free_priv:
Apart from that,
Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks, Praan
Hi Praan,
On 12/06/2026 15:06, Pranjal Shrivastava wrote:
On Wed, Jun 10, 2026 at 04:43:19PM +0100, Matt Evans wrote:
Since converting BAR mmap()s to using DMABUFs, we lose the original device path in /proc/<pid>/maps, lsof, etc. Generate a debug-oriented synthetic 'filename' based on the cdev, plus BDF, plus resource index.
This applies only to BAR mappings via the VFIO device fd, as explicitly-exported DMABUFs are named by userspace via the DMA_BUF_SET_NAME ioctl.
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/vfio_pci_dmabuf.c | 27 +++++++++++++++++++++++++-- 1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 2fd3629789bf..8f7f1b909b94 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -4,6 +4,7 @@ #include <linux/dma-buf-mapping.h> #include <linux/pci-p2pdma.h> #include <linux/dma-resv.h> +#include <uapi/linux/dma-buf.h> #include "vfio_pci_priv.h" @@ -470,6 +471,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, { struct vfio_pci_dma_buf *priv; unsigned long vma_pgoff = vma->vm_pgoff & (VFIO_PCI_OFFSET_MASK >> PAGE_SHIFT);
- char *bufname; int ret;
priv = kzalloc_obj(*priv); @@ -482,6 +484,20 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, goto err_free_priv; }
- bufname = kzalloc(DMA_BUF_NAME_LEN, GFP_KERNEL);
- if (!bufname) {
ret = -ENOMEM;goto err_free_phys;- }
- /*
* Maximum size of the friendly debug name is* vfio1234567890:ffff:ff:3f.7/5 = 30, which fits within* DMA_BUF_NAME_LEN.*/- snprintf(bufname, DMA_BUF_NAME_LEN, "%s:%s/%x",
dev_name(&vdev->vdev.device), pci_name(vdev->pdev), res_index);Nit: Could we instead use:
bufname = kasprintf(GFP_KERNEL, "%s:%s/%x", dev_name(&vdev->vdev.device), pci_name(vdev->pdev), res_index); if (!bufname) ret = -ENOMEM; [...]
That's a great suggestion, thank you. Done.
/* * The DMABUF begins from the mmap()'s BAR offset, i.e. the * start of the VMA corresponds to byte 0 of the DMABUF and @@ -500,7 +516,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, priv->provider = pcim_p2pdma_provider(vdev->pdev, res_index); if (!priv->provider) { ret = -EINVAL;
goto err_free_phys;
}goto err_free_name;priv->phys_vec[0].paddr = phys_start + ((u64)vma_pgoff << PAGE_SHIFT); @@ -508,7 +524,7 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, ret = vfio_pci_dmabuf_export(vdev, priv, O_CLOEXEC | O_RDWR); if (ret)
goto err_free_phys;
goto err_free_name;/* * Ownership of the DMABUF file transfers to the VMA so that @@ -523,8 +539,15 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, vma->vm_file = priv->dmabuf->file; vma->vm_private_data = priv;
- spin_lock(&priv->dmabuf->name_lock);
- kfree(priv->dmabuf->name);
- priv->dmabuf->name = bufname;
- spin_unlock(&priv->dmabuf->name_lock);
- return 0;
+err_free_name:
- kfree(bufname);
err_free_phys: kfree(priv->phys_vec); err_free_priv:
Apart from that,
Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks!
Matt
Previously, vfio_pci_zap_bars() (and the wrapper vfio_pci_zap_and_down_write_memory_lock()) calls were paired with calls to vfio_pci_dma_buf_move().
This commit replaces them with a unified new function, vfio_pci_zap_revoke_bars() containing both the vfio_pci_dma_buf_move() and the unmap_mapping_range(), making it harder for callers to omit one. It adds a wrapper, vfio_pci_lock_zap_revoke_bars(), which takes the write memory_lock before zapping, and adds a new vfio_pci_unrevoke_bars() for the re-enable path.
As of "vfio/pci: Convert BAR mmap() to use a DMABUF", the unmap_mapping_range() to zap is no longer performed for vfio-pci since the DMABUFs used for BAR mappings already zap PTEs when the vfio_pci_dma_buf_move() occurs.
However, it must be assumed that VFIO drivers which override the .mmap op could create mappings _not_ backed by DMABUFs. So, the zap is still performed on revoke if .mmap is overridden, using a new zap_bars_on_revoke flag. A driver can explicitly opt out; the flag is cleared by the hisi_acc_vfio_pci driver, since its .mmap just wraps vfio_pci_core_mmap() and so still uses DMABUFs.
Signed-off-by: Matt Evans matt@ozlabs.org --- .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c | 8 +++ drivers/vfio/pci/vfio_pci_config.c | 30 ++++---- drivers/vfio/pci/vfio_pci_core.c | 70 +++++++++++++------ drivers/vfio/pci/vfio_pci_priv.h | 3 +- include/linux/vfio_pci_core.h | 1 + 5 files changed, 73 insertions(+), 39 deletions(-)
diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c index 86362ec424a5..51990f6d66d5 100644 --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c @@ -1692,6 +1692,14 @@ static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device if (ret) goto out_put_vdev;
+ /* + * hisi_acc_vfio_pci_mmap() calls down to + * vfio_pci_core_mmap(), so BAR mappings are still + * DMABUF-backed. They don't require a zap on revoke, so opt + * out: + */ + hisi_acc_vdev->core_device.zap_bars_on_revoke = false; + hisi_acc_vfio_debug_init(hisi_acc_vdev); return 0;
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index a10ed733f0e3..8bfab0da481c 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -590,12 +590,10 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY); new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
- if (!new_mem) { - vfio_pci_zap_and_down_write_memory_lock(vdev); - vfio_pci_dma_buf_move(vdev, true); - } else { + if (!new_mem) + vfio_pci_lock_zap_revoke_bars(vdev); + else down_write(&vdev->memory_lock); - }
/* * If the user is writing mem/io enable (new_mem/io) and we @@ -631,7 +629,7 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, *virt_cmd |= cpu_to_le16(new_cmd & mask);
if (__vfio_pci_memory_enabled(vdev)) - vfio_pci_dma_buf_move(vdev, false); + vfio_pci_unrevoke_bars(vdev); up_write(&vdev->memory_lock); }
@@ -712,16 +710,14 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm) static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state) { - if (state >= PCI_D3hot) { - vfio_pci_zap_and_down_write_memory_lock(vdev); - vfio_pci_dma_buf_move(vdev, true); - } else { + if (state >= PCI_D3hot) + vfio_pci_lock_zap_revoke_bars(vdev); + else down_write(&vdev->memory_lock); - }
vfio_pci_set_power_state(vdev, state); if (__vfio_pci_memory_enabled(vdev)) - vfio_pci_dma_buf_move(vdev, false); + vfio_pci_unrevoke_bars(vdev); up_write(&vdev->memory_lock); }
@@ -908,11 +904,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos, &cap);
if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) { - vfio_pci_zap_and_down_write_memory_lock(vdev); - vfio_pci_dma_buf_move(vdev, true); + vfio_pci_lock_zap_revoke_bars(vdev); pci_try_reset_function(vdev->pdev); if (__vfio_pci_memory_enabled(vdev)) - vfio_pci_dma_buf_move(vdev, false); + vfio_pci_unrevoke_bars(vdev); up_write(&vdev->memory_lock); } } @@ -993,11 +988,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos, &cap);
if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) { - vfio_pci_zap_and_down_write_memory_lock(vdev); - vfio_pci_dma_buf_move(vdev, true); + vfio_pci_lock_zap_revoke_bars(vdev); pci_try_reset_function(vdev->pdev); if (__vfio_pci_memory_enabled(vdev)) - vfio_pci_dma_buf_move(vdev, false); + vfio_pci_unrevoke_bars(vdev); up_write(&vdev->memory_lock); } } diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index f9636d8f9e2a..5ea0bd4e7876 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -319,8 +319,7 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev, * The vdev power related flags are protected with 'memory_lock' * semaphore. */ - vfio_pci_zap_and_down_write_memory_lock(vdev); - vfio_pci_dma_buf_move(vdev, true); + vfio_pci_lock_zap_revoke_bars(vdev);
if (vdev->pm_runtime_engaged) { up_write(&vdev->memory_lock); @@ -406,7 +405,7 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev) down_write(&vdev->memory_lock); __vfio_pci_runtime_pm_exit(vdev); if (__vfio_pci_memory_enabled(vdev)) - vfio_pci_dma_buf_move(vdev, false); + vfio_pci_unrevoke_bars(vdev); up_write(&vdev->memory_lock); }
@@ -1256,6 +1255,8 @@ static int vfio_pci_ioctl_set_irqs(struct vfio_pci_core_device *vdev, return ret; }
+static void vfio_pci_zap_revoke_bars(struct vfio_pci_core_device *vdev); + static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, void __user *arg) { @@ -1264,7 +1265,7 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, if (!vdev->reset_works) return -EINVAL;
- vfio_pci_zap_and_down_write_memory_lock(vdev); + down_write(&vdev->memory_lock);
/* * This function can be invoked while the power state is non-D0. If @@ -1277,10 +1278,11 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, */ vfio_pci_set_power_state(vdev, PCI_D0);
- vfio_pci_dma_buf_move(vdev, true); + vfio_pci_zap_revoke_bars(vdev); + ret = pci_try_reset_function(vdev->pdev); if (__vfio_pci_memory_enabled(vdev)) - vfio_pci_dma_buf_move(vdev, false); + vfio_pci_unrevoke_bars(vdev); up_write(&vdev->memory_lock);
return ret; @@ -1648,20 +1650,37 @@ ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *bu } EXPORT_SYMBOL_GPL(vfio_pci_core_write);
-static void vfio_pci_zap_bars(struct vfio_pci_core_device *vdev) +static void vfio_pci_zap_revoke_bars(struct vfio_pci_core_device *vdev) { - struct vfio_device *core_vdev = &vdev->vdev; - loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX); - loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX); - loff_t len = end - start; + lockdep_assert_held_write(&vdev->memory_lock); + vfio_pci_dma_buf_move(vdev, true);
- unmap_mapping_range(core_vdev->inode->i_mapping, start, len, true); + /* + * If a driver could possibly create BAR mappings in the + * vdev's address_space, do an additional zap on revoke. See + * vfio_pci_core_register_device(). + */ + if (vdev->zap_bars_on_revoke) { + struct vfio_device *core_vdev = &vdev->vdev; + loff_t start = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_BAR0_REGION_INDEX); + loff_t end = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_ROM_REGION_INDEX); + loff_t len = end - start; + + unmap_mapping_range(core_vdev->inode->i_mapping, + start, len, true); + } }
-void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev) +void vfio_pci_lock_zap_revoke_bars(struct vfio_pci_core_device *vdev) { down_write(&vdev->memory_lock); - vfio_pci_zap_bars(vdev); + vfio_pci_zap_revoke_bars(vdev); +} + +void vfio_pci_unrevoke_bars(struct vfio_pci_core_device *vdev) +{ + lockdep_assert_held_write(&vdev->memory_lock); + vfio_pci_dma_buf_move(vdev, false); }
u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev) @@ -2242,6 +2261,17 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev) if (ret) goto out_vf;
+ /* + * If a driver overrides .mmap, it has to be assumed that it + * might not use the DMABUF-backed core mmap; this flag + * enables a zap at revoke time. Drivers can opt out by + * clearing this flag after registration, e.g. if their .mmap + * override calls down to vfio_pci_core_mmap() so that maps + * are DMABUF-backed. + */ + if (vdev->vdev.ops->mmap != vfio_pci_core_mmap) + vdev->zap_bars_on_revoke = true; + vfio_pci_probe_power_state(vdev);
/* @@ -2517,9 +2547,10 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, }
/* - * Take the memory write lock for each device and zap BAR - * mappings to prevent the user accessing the device while in - * reset. Locking multiple devices is prone to deadlock, + * Take the memory write lock for each device and + * zap/revoke BAR mappings to prevent the user (or + * peers) accessing the device while in reset. + * Locking multiple devices is prone to deadlock, * runaway and unwind if we hit contention. */ if (!down_write_trylock(&vdev->memory_lock)) { @@ -2527,8 +2558,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, break; }
- vfio_pci_dma_buf_move(vdev, true); - vfio_pci_zap_bars(vdev); + vfio_pci_zap_revoke_bars(vdev); }
if (!list_entry_is_head(vdev, @@ -2558,7 +2588,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, list_for_each_entry_from_reverse(vdev, &dev_set->device_list, vdev.dev_set_list) { if (vdev->vdev.open_count && __vfio_pci_memory_enabled(vdev)) - vfio_pci_dma_buf_move(vdev, false); + vfio_pci_unrevoke_bars(vdev); up_write(&vdev->memory_lock); }
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index d38e1b98b2e9..10833aabd7fb 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -83,7 +83,8 @@ void vfio_config_free(struct vfio_pci_core_device *vdev); int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state);
-void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device *vdev); +void vfio_pci_lock_zap_revoke_bars(struct vfio_pci_core_device *vdev); +void vfio_pci_unrevoke_bars(struct vfio_pci_core_device *vdev); u16 vfio_pci_memory_lock_and_enable(struct vfio_pci_core_device *vdev); void vfio_pci_memory_unlock_and_restore(struct vfio_pci_core_device *vdev, u16 cmd); diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index 7accd0eac457..93629b7fb8b8 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -127,6 +127,7 @@ struct vfio_pci_core_device { bool needs_pm_restore:1; bool pm_intx_masked:1; bool pm_runtime_engaged:1; + bool zap_bars_on_revoke:1; struct pci_saved_state *pci_saved_state; struct pci_saved_state *pm_save; int ioeventfds_nr;
On Wed, Jun 10, 2026 at 04:43:20PM +0100, Matt Evans wrote:
Previously, vfio_pci_zap_bars() (and the wrapper vfio_pci_zap_and_down_write_memory_lock()) calls were paired with calls to vfio_pci_dma_buf_move().
This commit replaces them with a unified new function, vfio_pci_zap_revoke_bars() containing both the vfio_pci_dma_buf_move() and the unmap_mapping_range(), making it harder for callers to omit one. It adds a wrapper, vfio_pci_lock_zap_revoke_bars(), which takes the write memory_lock before zapping, and adds a new vfio_pci_unrevoke_bars() for the re-enable path.
As of "vfio/pci: Convert BAR mmap() to use a DMABUF", the unmap_mapping_range() to zap is no longer performed for vfio-pci since the DMABUFs used for BAR mappings already zap PTEs when the vfio_pci_dma_buf_move() occurs.
However, it must be assumed that VFIO drivers which override the .mmap op could create mappings _not_ backed by DMABUFs. So, the zap is still performed on revoke if .mmap is overridden, using a new zap_bars_on_revoke flag. A driver can explicitly opt out; the flag is cleared by the hisi_acc_vfio_pci driver, since its .mmap just wraps vfio_pci_core_mmap() and so still uses DMABUFs.
Signed-off-by: Matt Evans matt@ozlabs.org
.../vfio/pci/hisilicon/hisi_acc_vfio_pci.c | 8 +++ drivers/vfio/pci/vfio_pci_config.c | 30 ++++---- drivers/vfio/pci/vfio_pci_core.c | 70 +++++++++++++------ drivers/vfio/pci/vfio_pci_priv.h | 3 +- include/linux/vfio_pci_core.h | 1 + 5 files changed, 73 insertions(+), 39 deletions(-)
diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c index 86362ec424a5..51990f6d66d5 100644 --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c @@ -1692,6 +1692,14 @@ static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device if (ret) goto out_put_vdev;
- /*
* hisi_acc_vfio_pci_mmap() calls down to* vfio_pci_core_mmap(), so BAR mappings are still* DMABUF-backed. They don't require a zap on revoke, so opt* out:*/- hisi_acc_vdev->core_device.zap_bars_on_revoke = false;
This seems to be happening after we vfio_pci_core_register_device, which could be slightly problematic if another device in the same group races to trigger a hot reset before we can set this to false. Could we initialize this flag before registration instead?
hisi_acc_vfio_debug_init(hisi_acc_vdev); return 0; diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index a10ed733f0e3..8bfab0da481c 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -590,12 +590,10 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY); new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
if (!new_mem) {vfio_pci_zap_and_down_write_memory_lock(vdev);vfio_pci_dma_buf_move(vdev, true);} else {
if (!new_mem)vfio_pci_lock_zap_revoke_bars(vdev);else down_write(&vdev->memory_lock);
}/* * If the user is writing mem/io enable (new_mem/io) and we @@ -631,7 +629,7 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, *virt_cmd |= cpu_to_le16(new_cmd & mask); if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock); }vfio_pci_unrevoke_bars(vdev);@@ -712,16 +710,14 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm) static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state) {
- if (state >= PCI_D3hot) {
vfio_pci_zap_and_down_write_memory_lock(vdev);vfio_pci_dma_buf_move(vdev, true);- } else {
- if (state >= PCI_D3hot)
vfio_pci_lock_zap_revoke_bars(vdev);- else down_write(&vdev->memory_lock);
- }
vfio_pci_set_power_state(vdev, state); if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);vfio_pci_unrevoke_bars(vdev);} @@ -908,11 +904,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos, &cap); if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
vfio_pci_zap_and_down_write_memory_lock(vdev);vfio_pci_dma_buf_move(vdev, true);
vfio_pci_lock_zap_revoke_bars(vdev); pci_try_reset_function(vdev->pdev); if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
} }vfio_pci_unrevoke_bars(vdev); up_write(&vdev->memory_lock);@@ -993,11 +988,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos, &cap); if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
vfio_pci_zap_and_down_write_memory_lock(vdev);vfio_pci_dma_buf_move(vdev, true);
vfio_pci_lock_zap_revoke_bars(vdev); pci_try_reset_function(vdev->pdev); if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
} }vfio_pci_unrevoke_bars(vdev); up_write(&vdev->memory_lock);diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index f9636d8f9e2a..5ea0bd4e7876 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -319,8 +319,7 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev, * The vdev power related flags are protected with 'memory_lock' * semaphore. */
- vfio_pci_zap_and_down_write_memory_lock(vdev);
- vfio_pci_dma_buf_move(vdev, true);
- vfio_pci_lock_zap_revoke_bars(vdev);
if (vdev->pm_runtime_engaged) { up_write(&vdev->memory_lock); @@ -406,7 +405,7 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev) down_write(&vdev->memory_lock); __vfio_pci_runtime_pm_exit(vdev); if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);vfio_pci_unrevoke_bars(vdev);} @@ -1256,6 +1255,8 @@ static int vfio_pci_ioctl_set_irqs(struct vfio_pci_core_device *vdev, return ret; } +static void vfio_pci_zap_revoke_bars(struct vfio_pci_core_device *vdev);
static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, void __user *arg) { @@ -1264,7 +1265,7 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, if (!vdev->reset_works) return -EINVAL;
- vfio_pci_zap_and_down_write_memory_lock(vdev);
- down_write(&vdev->memory_lock);
/* * This function can be invoked while the power state is non-D0. If @@ -1277,10 +1278,11 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, */ vfio_pci_set_power_state(vdev, PCI_D0);
- vfio_pci_dma_buf_move(vdev, true);
- vfio_pci_zap_revoke_bars(vdev);
I'm wondering if this change in behavior is correct? BEFORE this patch the sequence was:
1. zap vma mappings 2. Enter D0
After this patch the sequence becomes
1. Take the lock 2. Enter D0 3. zap vma mappings
My worry is if user-space accesses a BAR *during* the transition to D0, it could crash since the mappings still exist during the transition?
The old code is immune to it because it removed user-mappings first.
Following the discussion from v1 regarding the ordering of vfio_pci_dma_buf_move() and the D0 transition.. while it makes sense to perform the DMABUF revocation/move after the hardware is in D0.. I'm not too confident about moving zap after D0 :/
I mean, sure, the user would just see all Fs on a read and writes will be dropped silently until we are in D0.. but the behaviour before this change was that the user access will fault and hang on the memory_lock instead which ensures that the user observes a consistent dev state..
- ret = pci_try_reset_function(vdev->pdev); if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);vfio_pci_unrevoke_bars(vdev);return ret; @@ -1648,20 +1650,37 @@ ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *bu }
Thanks, Praan
From: Pranjal Shrivastava praan@google.com Sent: Saturday, June 13, 2026 3:39 AM
On Wed, Jun 10, 2026 at 04:43:20PM +0100, Matt Evans wrote:
@@ -1264,7 +1265,7 @@ static int vfio_pci_ioctl_reset(struct
vfio_pci_core_device *vdev,
if (!vdev->reset_works) return -EINVAL;
- vfio_pci_zap_and_down_write_memory_lock(vdev);
down_write(&vdev->memory_lock);
/*
- This function can be invoked while the power state is non-D0. If
@@ -1277,10 +1278,11 @@ static int vfio_pci_ioctl_reset(struct
vfio_pci_core_device *vdev,
*/vfio_pci_set_power_state(vdev, PCI_D0);
- vfio_pci_dma_buf_move(vdev, true);
- vfio_pci_zap_revoke_bars(vdev);
I'm wondering if this change in behavior is correct? BEFORE this patch the sequence was:
- zap vma mappings
- Enter D0
After this patch the sequence becomes
- Take the lock
- Enter D0
- zap vma mappings
My worry is if user-space accesses a BAR *during* the transition to D0, it could crash since the mappings still exist during the transition?
not 'crash' as you also noted later with all Fs on read and dropped writes.
The old code is immune to it because it removed user-mappings first.
Following the discussion from v1 regarding the ordering of vfio_pci_dma_buf_move() and the D0 transition.. while it makes sense to perform the DMABUF revocation/move after the hardware is in D0.. I'm not too confident about moving zap after D0 :/
probably add a comment to remind that ordering requirement for dma
I mean, sure, the user would just see all Fs on a read and writes will be dropped silently until we are in D0.. but the behaviour before this change was that the user access will fault and hang on the memory_lock instead which ensures that the user observes a consistent dev state..
I see this more consistent from another angle.
Old code only removes/blocks cpu access but not for device. DMAs are allowed to this device while it's transitioning between D0/D3.
New code at least make this part consistent - both cpu/p2p are allowed in the transition window.
Ideally a sane userspace shouldn't rely on the content read back when it has initiated a reset in parallel. So this behavior change sounds ok?
On Tue, Jun 16, 2026 at 09:48:14AM +0000, Tian, Kevin wrote:
From: Pranjal Shrivastava praan@google.com Sent: Saturday, June 13, 2026 3:39 AM
On Wed, Jun 10, 2026 at 04:43:20PM +0100, Matt Evans wrote:
@@ -1264,7 +1265,7 @@ static int vfio_pci_ioctl_reset(struct
vfio_pci_core_device *vdev,
if (!vdev->reset_works) return -EINVAL;
- vfio_pci_zap_and_down_write_memory_lock(vdev);
down_write(&vdev->memory_lock);
/*
- This function can be invoked while the power state is non-D0. If
@@ -1277,10 +1278,11 @@ static int vfio_pci_ioctl_reset(struct
vfio_pci_core_device *vdev,
*/vfio_pci_set_power_state(vdev, PCI_D0);
- vfio_pci_dma_buf_move(vdev, true);
- vfio_pci_zap_revoke_bars(vdev);
I'm wondering if this change in behavior is correct? BEFORE this patch the sequence was:
- zap vma mappings
- Enter D0
After this patch the sequence becomes
- Take the lock
- Enter D0
- zap vma mappings
My worry is if user-space accesses a BAR *during* the transition to D0, it could crash since the mappings still exist during the transition?
not 'crash' as you also noted later with all Fs on read and dropped writes.
Ack, "crash" is definitely a strong word, I just meant that the user-space program isn't expecting to see all Fs today. Since today any access during reset is faulted, however with this all apps may have to lookout for all Fs during a read. Could this change cause existing apps to crash?
The old code is immune to it because it removed user-mappings first.
Following the discussion from v1 regarding the ordering of vfio_pci_dma_buf_move() and the D0 transition.. while it makes sense to perform the DMABUF revocation/move after the hardware is in D0.. I'm not too confident about moving zap after D0 :/
probably add a comment to remind that ordering requirement for dma
+1. That'd be helpful.
I mean, sure, the user would just see all Fs on a read and writes will be dropped silently until we are in D0.. but the behaviour before this change was that the user access will fault and hang on the memory_lock instead which ensures that the user observes a consistent dev state..
I see this more consistent from another angle.
Old code only removes/blocks cpu access but not for device. DMAs are allowed to this device while it's transitioning between D0/D3.
New code at least make this part consistent - both cpu/p2p are allowed in the transition window.
Ideally a sane userspace shouldn't rely on the content read back when it has initiated a reset in parallel. So this behavior change sounds ok?
I agree on the CPU / P2P consistency part. However, my concern is for a shared reset scenario where a reset triggered by one process (I guess it was vfio_assign_device_set?) can affect multiple devices in a dev_set that are owned by different, unrelated processes.
In the old code, these peer processes are protected because their BAR mappings are zapped immediately. Their MMIO threads simply stall in a page fault until the reset is complete.
I agree for a single-reset scenario, sane user-space should never access regions during a self-triggered reset.
Am I missing something?
Thanks, Praan
From: Pranjal Shrivastava praan@google.com Sent: Wednesday, June 17, 2026 2:52 AM
On Tue, Jun 16, 2026 at 09:48:14AM +0000, Tian, Kevin wrote:
From: Pranjal Shrivastava praan@google.com Sent: Saturday, June 13, 2026 3:39 AM
On Wed, Jun 10, 2026 at 04:43:20PM +0100, Matt Evans wrote:
@@ -1264,7 +1265,7 @@ static int vfio_pci_ioctl_reset(struct
vfio_pci_core_device *vdev,
if (!vdev->reset_works) return -EINVAL;
- vfio_pci_zap_and_down_write_memory_lock(vdev);
down_write(&vdev->memory_lock);
/*
- This function can be invoked while the power state is non-D0. If
@@ -1277,10 +1278,11 @@ static int vfio_pci_ioctl_reset(struct
vfio_pci_core_device *vdev,
*/vfio_pci_set_power_state(vdev, PCI_D0);
- vfio_pci_dma_buf_move(vdev, true);
- vfio_pci_zap_revoke_bars(vdev);
I'm wondering if this change in behavior is correct? BEFORE this patch the sequence was:
- zap vma mappings
- Enter D0
After this patch the sequence becomes
- Take the lock
- Enter D0
- zap vma mappings
My worry is if user-space accesses a BAR *during* the transition to D0, it could crash since the mappings still exist during the transition?
not 'crash' as you also noted later with all Fs on read and dropped writes.
Ack, "crash" is definitely a strong word, I just meant that the user-space program isn't expecting to see all Fs today. Since today any access during reset is faulted, however with this all apps may have to lookout for all Fs during a read. Could this change cause existing apps to crash?
I expect there will be certain handshake between the resetting process and any subordinary processes using the exported dmabuf. The device state right after a resetting is not functional. Presumably the resetting process (as the userspace driver of the entire device) needs to re-initialize the device into a state allowing dmabuf to work correctly again. This window is much larger than above, within which I'm not sure what'd be reasonable expectations from those apps.
The old code is immune to it because it removed user-mappings first.
Following the discussion from v1 regarding the ordering of vfio_pci_dma_buf_move() and the D0 transition.. while it makes sense to perform the DMABUF revocation/move after the hardware is in D0.. I'm
not
too confident about moving zap after D0 :/
probably add a comment to remind that ordering requirement for dma
+1. That'd be helpful.
I mean, sure, the user would just see all Fs on a read and writes will be dropped silently until we are in D0.. but the behaviour before this change was that the user access will fault and hang on the memory_lock instead which ensures that the user observes a consistent dev state..
I see this more consistent from another angle.
Old code only removes/blocks cpu access but not for device. DMAs are allowed to this device while it's transitioning between D0/D3.
New code at least make this part consistent - both cpu/p2p are allowed in the transition window.
Ideally a sane userspace shouldn't rely on the content read back when it has initiated a reset in parallel. So this behavior change sounds ok?
I agree on the CPU / P2P consistency part. However, my concern is for a shared reset scenario where a reset triggered by one process (I guess it was vfio_assign_device_set?) can affect multiple devices in a dev_set that are owned by different, unrelated processes.
In the old code, these peer processes are protected because their BAR mappings are zapped immediately. Their MMIO threads simply stall in a page fault until the reset is complete.
I agree for a single-reset scenario, sane user-space should never access regions during a self-triggered reset.
Am I missing something?
Given the resetting impact is intrusive, IMHO handshake/coordination is also required between processes operating on devices in a same dev_set otherwise peer processes will break quickly even with the protection in the old code.
btw I don't remember all the detail but holds an impression there are restrictions on the caller owning all devices in a dev_set or they all belong to the same iommufd context...
On Wed, Jun 17, 2026 at 06:22:59AM +0000, Tian, Kevin wrote:
I'm wondering if this change in behavior is correct? BEFORE this patch the sequence was:
- zap vma mappings
- Enter D0
After this patch the sequence becomes
- Take the lock
- Enter D0
- zap vma mappings
My worry is if user-space accesses a BAR *during* the transition to D0, it could crash since the mappings still exist during the transition?
not 'crash' as you also noted later with all Fs on read and dropped writes.
Ack, "crash" is definitely a strong word, I just meant that the user-space program isn't expecting to see all Fs today. Since today any access during reset is faulted, however with this all apps may have to lookout for all Fs during a read. Could this change cause existing apps to crash?
I don't think that order should be changed, you might get an AER on access in D0 too..
The purpose of all this zappery is to only allow access while the device is fully operating normally.
Jason
Hi Kevin, Praan, (+bonus Jason)
On 17/06/2026 07:22, Tian, Kevin wrote:
From: Pranjal Shrivastava praan@google.com Sent: Wednesday, June 17, 2026 2:52 AM
On Tue, Jun 16, 2026 at 09:48:14AM +0000, Tian, Kevin wrote:
From: Pranjal Shrivastava praan@google.com Sent: Saturday, June 13, 2026 3:39 AM
On Wed, Jun 10, 2026 at 04:43:20PM +0100, Matt Evans wrote:
@@ -1264,7 +1265,7 @@ static int vfio_pci_ioctl_reset(struct
vfio_pci_core_device *vdev,
if (!vdev->reset_works) return -EINVAL;
- vfio_pci_zap_and_down_write_memory_lock(vdev);
down_write(&vdev->memory_lock);
/*
- This function can be invoked while the power state is non-D0. If
@@ -1277,10 +1278,11 @@ static int vfio_pci_ioctl_reset(struct
vfio_pci_core_device *vdev,
*/vfio_pci_set_power_state(vdev, PCI_D0);
- vfio_pci_dma_buf_move(vdev, true);
- vfio_pci_zap_revoke_bars(vdev);
I'm wondering if this change in behavior is correct? BEFORE this patch the sequence was:
- zap vma mappings
- Enter D0
After this patch the sequence becomes
- Take the lock
- Enter D0
- zap vma mappings
My worry is if user-space accesses a BAR *during* the transition to D0, it could crash since the mappings still exist during the transition?
not 'crash' as you also noted later with all Fs on read and dropped writes.
Ack, "crash" is definitely a strong word, I just meant that the user-space program isn't expecting to see all Fs today. Since today any access during reset is faulted, however with this all apps may have to lookout for all Fs during a read. Could this change cause existing apps to crash?
I expect there will be certain handshake between the resetting process and any subordinary processes using the exported dmabuf. The device state right after a resetting is not functional. Presumably the resetting process (as the userspace driver of the entire device) needs to re-initialize the device into a state allowing dmabuf to work correctly again. This window is much larger than above, within which I'm not sure what'd be reasonable expectations from those apps.
The old code is immune to it because it removed user-mappings first.
Following the discussion from v1 regarding the ordering of vfio_pci_dma_buf_move() and the D0 transition.. while it makes sense to perform the DMABUF revocation/move after the hardware is in D0.. I'm
not
too confident about moving zap after D0 :/
probably add a comment to remind that ordering requirement for dma
+1. That'd be helpful.
Will do.
I mean, sure, the user would just see all Fs on a read and writes will be dropped silently until we are in D0.. but the behaviour before this change was that the user access will fault and hang on the memory_lock instead which ensures that the user observes a consistent dev state..
I see this more consistent from another angle.
Old code only removes/blocks cpu access but not for device. DMAs are allowed to this device while it's transitioning between D0/D3.
New code at least make this part consistent - both cpu/p2p are allowed in the transition window.
Ideally a sane userspace shouldn't rely on the content read back when it has initiated a reset in parallel. So this behavior change sounds ok?
I agree on the CPU / P2P consistency part. However, my concern is for a shared reset scenario where a reset triggered by one process (I guess it was vfio_assign_device_set?) can affect multiple devices in a dev_set that are owned by different, unrelated processes.
In the old code, these peer processes are protected because their BAR mappings are zapped immediately. Their MMIO threads simply stall in a page fault until the reset is complete.
I agree for a single-reset scenario, sane user-space should never access regions during a self-triggered reset.
Am I missing something?
Given the resetting impact is intrusive, IMHO handshake/coordination is also required between processes operating on devices in a same dev_set otherwise peer processes will break quickly even with the protection in the old code.
btw I don't remember all the detail but holds an impression there are restrictions on the caller owning all devices in a dev_set or they all belong to the same iommufd context...
(Apologies for being late to the thread.) This is my understanding too; whether it's one process or a group, a reset has to be coordinated and access held back, otherwise you're exposed to all manner of unpredictable things. But isn't this thread really about possible powersave states leading up to the reset, rather than the reset itself?
Revisiting the sequence Praan queried (with add'l context):
Current:
A1. Take the lock A2. Zap VMA mappings A3. Enter D0 A4. DMABUF move (revoke) [*] A5. Function reset A6. DMABUF move (unrevoke) A7. Unlock
[*]: In v1, Alex/Leon point out that this is after D0 to give the importer the opportunity to access the device in order to let go. Added a comment to this effect, as requested above.
This patch:
B1. Take the lock B2. Enter D0 _ switched B3. Zap VMA mappings. / B4. DMABUF move (revoke) B5. Function reset B6. DMABUF move (unrevoke) B7. Unlock
Can you please say a bit more about the access racing in these sequences? I'm afraid I don't follow.
Since access to a BAR on a device in D3 will UR, it's a bit of a problem if we have it exposed to userspace at any time.
My understanding is that the sequences above wake a device that happens to have previously been put into D3, and AFAICT it could only have got there because of a previous vfio_pci_set_power_state(). Seems its only caller is from the emulation of PCI_PM_CTRL using vfio_lock_and_set_power_state(), and this zaps/revokes BAR access before a transition to D3. Similarly, an attempt to access a BAR via an ioctl/through vfio_pci_core_do_io_rw() fails the D3 check in __vfio_pci_memory_enabled(), and besides will try to take the memory_lock.
What accesses could there be to the BARs in order to trigger the race you're warning of? If there _could_ a path to access somehow, then couldn't it happen currently before A2 or B3 above anyway?
I still think this sequence is okay; it's not unlikely that I'm missing a nuance and would be grateful for help seeing it.
Other thing to highlight is both of these sequences ensure the device is inaccessible from CPU or P2P _across the reset itself_. I think we're just debating the preamble (up to A4 or B4).
(Kevin, aside: regarding your suggestion on PATCH 8/9 for a warning when setting a revocation status that is the same as current, smells like this is a situation where that could arise. I'll dig, but if so perhaps we just warn on the double-PERM_REVOKED case.)
Thoughts?
Thanks,
Matt
On Thu, Jun 18, 2026 at 05:02:58PM +0100, Matt Evans wrote:
My understanding is that the sequences above wake a device that happens to have previously been put into D3, and AFAICT it could only have got there because of a previous vfio_pci_set_power_state(). Seems its only caller is from the emulation of PCI_PM_CTRL using vfio_lock_and_set_power_state(), and this zaps/revokes BAR access before a transition to D3. Similarly, an attempt to access a BAR via an ioctl/through vfio_pci_core_do_io_rw() fails the D3 check in __vfio_pci_memory_enabled(), and besides will try to take the memory_lock.
I thought the general design was the bars were made inaccessible before going to a low power state, and remain inaccessible while it is in low power?
So the order of D0 doesn't matter. If it is not in D0 then there is no mappings and zap/revoke is a NOP.
If is it in D0 then it doesn't matter because D0 is a nop.
Jason
Hi Jason,
On 19/06/2026 14:31, Jason Gunthorpe wrote:
On Thu, Jun 18, 2026 at 05:02:58PM +0100, Matt Evans wrote:
My understanding is that the sequences above wake a device that happens to have previously been put into D3, and AFAICT it could only have got there because of a previous vfio_pci_set_power_state(). Seems its only caller is from the emulation of PCI_PM_CTRL using vfio_lock_and_set_power_state(), and this zaps/revokes BAR access before a transition to D3. Similarly, an attempt to access a BAR via an ioctl/through vfio_pci_core_do_io_rw() fails the D3 check in __vfio_pci_memory_enabled(), and besides will try to take the memory_lock.
I thought the general design was the bars were made inaccessible before going to a low power state, and remain inaccessible while it is in low power?
So the order of D0 doesn't matter. If it is not in D0 then there is no mappings and zap/revoke is a NOP.
If is it in D0 then it doesn't matter because D0 is a nop.
Yes, that's what I'm getting at. :) If it's in D3 then BARs are inaccessible, so as long as we go into D0 before the DMABUF move, the order of the zap relative to the "go to D0" doesn't matter.
M
On Fri, 19 Jun 2026 16:13:17 +0100 Matt Evans matt@ozlabs.org wrote:
Hi Jason,
On 19/06/2026 14:31, Jason Gunthorpe wrote:
On Thu, Jun 18, 2026 at 05:02:58PM +0100, Matt Evans wrote:
My understanding is that the sequences above wake a device that happens to have previously been put into D3, and AFAICT it could only have got there because of a previous vfio_pci_set_power_state(). Seems its only caller is from the emulation of PCI_PM_CTRL using vfio_lock_and_set_power_state(), and this zaps/revokes BAR access before a transition to D3. Similarly, an attempt to access a BAR via an ioctl/through vfio_pci_core_do_io_rw() fails the D3 check in __vfio_pci_memory_enabled(), and besides will try to take the memory_lock.
I thought the general design was the bars were made inaccessible before going to a low power state, and remain inaccessible while it is in low power?
So the order of D0 doesn't matter. If it is not in D0 then there is no mappings and zap/revoke is a NOP.
If is it in D0 then it doesn't matter because D0 is a nop.
Yes, that's what I'm getting at. :) If it's in D3 then BARs are inaccessible, so as long as we go into D0 before the DMABUF move, the order of the zap relative to the "go to D0" doesn't matter.
I believe this is correct as well, but importantly we cannot assume that a stray read or write just returns -1 or gets dropped. This is exactly why we have such hard protections against the user accessing the device while it's disabled. Not all platforms, even within architectures that might otherwise be considered lenient of such accesses, consider this benign and might escalate to system level faults.
Let's be careful not to frame this as "the access doesn't matter anyway", the answer is instead that non-D0 devices already lack any mappings to access the device. Thanks,
Alex
Hi Alex,
On 23/06/2026 00:13, Alex Williamson wrote:
On Fri, 19 Jun 2026 16:13:17 +0100 Matt Evans matt@ozlabs.org wrote:
Hi Jason,
On 19/06/2026 14:31, Jason Gunthorpe wrote:
On Thu, Jun 18, 2026 at 05:02:58PM +0100, Matt Evans wrote:
My understanding is that the sequences above wake a device that happens to have previously been put into D3, and AFAICT it could only have got there because of a previous vfio_pci_set_power_state(). Seems its only caller is from the emulation of PCI_PM_CTRL using vfio_lock_and_set_power_state(), and this zaps/revokes BAR access before a transition to D3. Similarly, an attempt to access a BAR via an ioctl/through vfio_pci_core_do_io_rw() fails the D3 check in __vfio_pci_memory_enabled(), and besides will try to take the memory_lock.
I thought the general design was the bars were made inaccessible before going to a low power state, and remain inaccessible while it is in low power?
So the order of D0 doesn't matter. If it is not in D0 then there is no mappings and zap/revoke is a NOP.
If is it in D0 then it doesn't matter because D0 is a nop.
Yes, that's what I'm getting at. :) If it's in D3 then BARs are inaccessible, so as long as we go into D0 before the DMABUF move, the order of the zap relative to the "go to D0" doesn't matter.
I believe this is correct as well, but importantly we cannot assume that a stray read or write just returns -1 or gets dropped. This is exactly why we have such hard protections against the user accessing the device while it's disabled. Not all platforms, even within architectures that might otherwise be considered lenient of such accesses, consider this benign and might escalate to system level faults.
We are in enthusiastic agreement here.
Let's be careful not to frame this as "the access doesn't matter anyway", the answer is instead that non-D0 devices already lack any mappings to access the device. Thanks,
I agree that is not the right thing to say, for exactly that reason. (For avoidance of any doubt, I didn't say that :) )
Thanks for confirming the behaviour. I hope Praan and Kevin are satisfied that this patch doesn't cause the issues they first worried about (the changed order of the zap relative to the D0 transition doesn't have a detrimental effect because of the existing inaccessibility).
Alex, I'll post v4 soon, but if you have any comments in the pipeline please shout and I'll hold off awhile.
Thanks,
Matt
On Tue, Jun 23, 2026 at 12:08:30PM +0100, Matt Evans wrote:
Hi Alex,
On 23/06/2026 00:13, Alex Williamson wrote:
On Fri, 19 Jun 2026 16:13:17 +0100 Matt Evans matt@ozlabs.org wrote:
Hi Jason,
On 19/06/2026 14:31, Jason Gunthorpe wrote:
On Thu, Jun 18, 2026 at 05:02:58PM +0100, Matt Evans wrote:
My understanding is that the sequences above wake a device that happens to have previously been put into D3, and AFAICT it could only have got there because of a previous vfio_pci_set_power_state(). Seems its only caller is from the emulation of PCI_PM_CTRL using vfio_lock_and_set_power_state(), and this zaps/revokes BAR access before a transition to D3. Similarly, an attempt to access a BAR via an ioctl/through vfio_pci_core_do_io_rw() fails the D3 check in __vfio_pci_memory_enabled(), and besides will try to take the memory_lock.
I thought the general design was the bars were made inaccessible before going to a low power state, and remain inaccessible while it is in low power?
So the order of D0 doesn't matter. If it is not in D0 then there is no mappings and zap/revoke is a NOP.
If is it in D0 then it doesn't matter because D0 is a nop.
Yes, that's what I'm getting at. :) If it's in D3 then BARs are inaccessible, so as long as we go into D0 before the DMABUF move, the order of the zap relative to the "go to D0" doesn't matter.
I believe this is correct as well, but importantly we cannot assume that a stray read or write just returns -1 or gets dropped. This is exactly why we have such hard protections against the user accessing the device while it's disabled. Not all platforms, even within architectures that might otherwise be considered lenient of such accesses, consider this benign and might escalate to system level faults.
We are in enthusiastic agreement here.
Let's be careful not to frame this as "the access doesn't matter anyway", the answer is instead that non-D0 devices already lack any mappings to access the device. Thanks,
I agree that is not the right thing to say, for exactly that reason. (For avoidance of any doubt, I didn't say that :) )
Thanks for confirming the behaviour. I hope Praan and Kevin are satisfied that this patch doesn't cause the issues they first worried about (the changed order of the zap relative to the D0 transition doesn't have a detrimental effect because of the existing inaccessibility).
Alex, I'll post v4 soon, but if you have any comments in the pipeline please shout and I'll hold off awhile.
I think the discussion addresses my concerns. I'm in agreement as well.
Thanks, Praan
Hi Praan,
On 12/06/2026 20:39, Pranjal Shrivastava wrote:
On Wed, Jun 10, 2026 at 04:43:20PM +0100, Matt Evans wrote:
Previously, vfio_pci_zap_bars() (and the wrapper vfio_pci_zap_and_down_write_memory_lock()) calls were paired with calls to vfio_pci_dma_buf_move().
This commit replaces them with a unified new function, vfio_pci_zap_revoke_bars() containing both the vfio_pci_dma_buf_move() and the unmap_mapping_range(), making it harder for callers to omit one. It adds a wrapper, vfio_pci_lock_zap_revoke_bars(), which takes the write memory_lock before zapping, and adds a new vfio_pci_unrevoke_bars() for the re-enable path.
As of "vfio/pci: Convert BAR mmap() to use a DMABUF", the unmap_mapping_range() to zap is no longer performed for vfio-pci since the DMABUFs used for BAR mappings already zap PTEs when the vfio_pci_dma_buf_move() occurs.
However, it must be assumed that VFIO drivers which override the .mmap op could create mappings _not_ backed by DMABUFs. So, the zap is still performed on revoke if .mmap is overridden, using a new zap_bars_on_revoke flag. A driver can explicitly opt out; the flag is cleared by the hisi_acc_vfio_pci driver, since its .mmap just wraps vfio_pci_core_mmap() and so still uses DMABUFs.
Signed-off-by: Matt Evans matt@ozlabs.org
.../vfio/pci/hisilicon/hisi_acc_vfio_pci.c | 8 +++ drivers/vfio/pci/vfio_pci_config.c | 30 ++++---- drivers/vfio/pci/vfio_pci_core.c | 70 +++++++++++++------ drivers/vfio/pci/vfio_pci_priv.h | 3 +- include/linux/vfio_pci_core.h | 1 + 5 files changed, 73 insertions(+), 39 deletions(-)
diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c index 86362ec424a5..51990f6d66d5 100644 --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c @@ -1692,6 +1692,14 @@ static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device if (ret) goto out_put_vdev;
- /*
* hisi_acc_vfio_pci_mmap() calls down to* vfio_pci_core_mmap(), so BAR mappings are still* DMABUF-backed. They don't require a zap on revoke, so opt* out:*/- hisi_acc_vdev->core_device.zap_bars_on_revoke = false;
This seems to be happening after we vfio_pci_core_register_device, which could be slightly problematic if another device in the same group races to trigger a hot reset before we can set this to false. Could we initialize this flag before registration instead?
Remember it is a safe default, so in the event of a driver not managing to opt-out before it's required then all that happens is a redundant unmap_mapping_range(). The default-safe was a nice suggestion from Alex on v2.
Matt
On Thu, Jun 18, 2026 at 05:06:27PM +0100, Matt Evans wrote:
Hi Praan,
On 12/06/2026 20:39, Pranjal Shrivastava wrote:
[...]
diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c index 86362ec424a5..51990f6d66d5 100644 --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c @@ -1692,6 +1692,14 @@ static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device if (ret) goto out_put_vdev;
- /*
* hisi_acc_vfio_pci_mmap() calls down to* vfio_pci_core_mmap(), so BAR mappings are still* DMABUF-backed. They don't require a zap on revoke, so opt* out:*/- hisi_acc_vdev->core_device.zap_bars_on_revoke = false;
This seems to be happening after we vfio_pci_core_register_device, which could be slightly problematic if another device in the same group races to trigger a hot reset before we can set this to false. Could we initialize this flag before registration instead?
Remember it is a safe default, so in the event of a driver not managing to opt-out before it's required then all that happens is a redundant unmap_mapping_range(). The default-safe was a nice suggestion from Alex on v2.
Ack. I see. That makes sense.
Thanks, Praan
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
Previously, vfio_pci_zap_bars() (and the wrapper vfio_pci_zap_and_down_write_memory_lock()) calls were paired with calls to vfio_pci_dma_buf_move().
This commit replaces them with a unified new function, vfio_pci_zap_revoke_bars() containing both the vfio_pci_dma_buf_move() and the unmap_mapping_range(), making it harder for callers to omit one. It adds a wrapper, vfio_pci_lock_zap_revoke_bars(), which takes the write memory_lock before zapping, and adds a new vfio_pci_unrevoke_bars() for the re-enable path.
It's unusual to have three verbs (lock/zap/revoke) in one function name.
I wonder whether it's simpler to have: vfio_pci_zap_bars_locked() // caller already holds the lock vfio_pci_zap_bars()
'revoke' is just a side-effect of 'zap', not necessarily to highlight it in the name.
As of "vfio/pci: Convert BAR mmap() to use a DMABUF", the unmap_mapping_range() to zap is no longer performed for vfio-pci since the DMABUFs used for BAR mappings already zap PTEs when the vfio_pci_dma_buf_move() occurs.
However, it must be assumed that VFIO drivers which override the .mmap op could create mappings _not_ backed by DMABUFs. So, the zap is still performed on revoke if .mmap is overridden, using a new zap_bars_on_revoke flag. A driver can explicitly opt out; the flag is cleared by the hisi_acc_vfio_pci driver, since its .mmap just wraps vfio_pci_core_mmap() and so still uses DMABUFs.
the cost of unmap_mapping_range() is trivial when there is no mmap on the device fd.
so it could be simpler by always doing:
vfio_pci_dma_buf_move(); unmap_mapping_range();
and remove the flag.
A VFIO DMABUF can export a subset of a BAR to userspace by fd; add support for mmap() of this fd. This provides another route for a process to map BARs, in which the process can only map a specific subset of a BAR represented by the exported DMABUF.
mmap() support enables userspace driver designs that safely delegate access to BAR sub-ranges to other client processes by sharing a DMABUF fd, without having to share the (omnipotent) VFIO device fd with them.
Since the main VFIO BAR mmap() is now DMABUF-aware, the new mmap() reuses the existing vm_ops.
The lifecycle of an exported DMABUF remains decoupled from that of the device fd it came from, i.e. the device fd could be closed with DMABUF VMAs present, meaning a fault on a VMA could happen concurrently with vfio_pci_dma_buf_cleanup().
To deal with this scenario, the fault handler now temporarily takes a VFIO device registration to ensure the vdev remains valid, and then vdev->memory_lock can be taken on it.
Signed-off-by: Matt Evans matt@ozlabs.org --- drivers/vfio/pci/vfio_pci_core.c | 80 ++++++++++++++++++++++++++---- drivers/vfio/pci/vfio_pci_dmabuf.c | 36 ++++++++++++++ drivers/vfio/pci/vfio_pci_priv.h | 2 + 3 files changed, 109 insertions(+), 9 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 5ea0bd4e7876..508a5eca910a 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -12,6 +12,8 @@
#include <linux/aperture.h> #include <linux/device.h> +#include <linux/dma-buf.h> +#include <linux/dma-resv.h> #include <linux/eventfd.h> #include <linux/file.h> #include <linux/interrupt.h> @@ -1735,19 +1737,78 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf, vm_fault_t ret = VM_FAULT_SIGBUS;
/* - * We can rely on the existence of both a DMABUF (priv) and - * the VFIO device it was exported from (vdev). This fault's - * VMA was established using vfio_pci_core_mmap_prep_dmabuf() - * which transfers ownership of the VFIO device fd to the - * DMABUF, and so the VFIO device is held open because the - * VMA's vm_file (DMABUF) is open. + * The only thing this can rely on is that the DMABUF relating + * to the VMA's vm_file exists (priv). * - * Since vfio_pci_dma_buf_cleanup() cannot have happened, - * vdev must be valid; we can take memory_lock. + * A DMABUF for a VFIO device fd mmap() holds a reference to + * the original VFIO device fd, but an explicitly-exported + * DMABUF does not. The original fd might have closed, + * meaning this fault can race with + * vfio_pci_dma_buf_cleanup(), meaning the buffer could have + * been revoked (in which case priv->vdev might be NULL), and + * the VFIO device registration might have been dropped. + * + * With the goal of taking vdev->memory_lock in a world where + * vdev might not still exist: + * + * 1. Take the resv lock on the DMABUF: + * - If racing cleanup got in first, the buffer is revoked; + * stop/exit if so. + * - If we got in first, the buffer is not revoked so vdev is + * non-NULL, accessible, and cleanup _has not yet put the + * VFIO device registration_. So, the device refcount must + * be >0. + * + * 2. Take vfio_device registration (refcount guaranteed >0 + * hereafter). + * + * 3. Unlock the DMABUF's resv lock: + * - A racing cleanup can now complete. + * - But, the device refcount >0, meaning the vfio_device + * (and vfio_pcie_core device vdev) have not yet been + * freed. vdev is accessible, even if the DMABUF has been + * revoked or cleanup has happened, because + * vfio_unregister_group_dev() can't complete. + * + * 4. Take the vdev->memory_lock + * - Either the DMABUF is usable, or has been cleaned up. + * Whichever, it can no longer change under us. + * - Test the DMABUF revocation status again: if it was + * revoked between 1 and 4 return a SIGBUS. Otherwise, + * return a PFN. + * - It's not necessary to also take the resv lock, because + * the status/vdev can't change while memory_lock is held. + * + * 5. Unlock, done. */ + + dma_resv_lock(priv->dmabuf->resv, NULL); + + if (priv->revoked) { + pr_debug_ratelimited("%s VA 0x%lx, pgoff 0x%lx: DMABUF revoked/cleaned up\n", + __func__, vmf->address, vma->vm_pgoff); + dma_resv_unlock(priv->dmabuf->resv); + return VM_FAULT_SIGBUS; + } + + /* If the buffer isn't revoked, vdev is valid */ vdev = priv->vdev;
+ if (!vfio_device_try_get_registration(&vdev->vdev)) { + /* + * If vdev != NULL (above), the registration should + * already be >0 and so this try_get should never + * fail. + */ + dev_warn(&vdev->pdev->dev, "%s: Unexpected registration failure\n", + __func__); + dma_resv_unlock(priv->dmabuf->resv); + return VM_FAULT_SIGBUS; + } + dma_resv_unlock(priv->dmabuf->resv); + scoped_guard(rwsem_read, &vdev->memory_lock) { + /* Revocation status must be re-read, under memory_lock */ if (!priv->revoked) { int pres = vfio_pci_dma_buf_find_pfn(priv, vma, vmf->address, @@ -1766,6 +1827,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf, __func__, order, pfn, vmf->address, vma->vm_pgoff, (unsigned int)ret);
+ vfio_device_put_registration(&vdev->vdev); return ret; }
@@ -1774,7 +1836,7 @@ static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf) return vfio_pci_mmap_huge_fault(vmf, 0); }
-static const struct vm_operations_struct vfio_pci_mmap_ops = { +const struct vm_operations_struct vfio_pci_mmap_ops = { .fault = vfio_pci_mmap_page_fault, #ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP .huge_fault = vfio_pci_mmap_huge_fault, diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 8f7f1b909b94..2fb09a2c0f6b 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -27,6 +27,41 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
return 0; } + +static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma) +{ + struct vfio_pci_dma_buf *priv = dmabuf->priv; + + /* + * If we observe that the buffer is revoked now then refuse + * the mmap(). This is a belt-and-braces early failure to + * ease debugging a revoked buffer being used. Userspace + * might also race an mmap() against an explicit revocation, + * or an action doing a temporary revoke; race scenarios are + * still safe because the fault handler ultimately prevents + * access to a revoked buffer if it isn't caught here. + */ + if (READ_ONCE(priv->revoked)) + return -ENODEV; + if ((vma->vm_flags & VM_SHARED) == 0) + return -EINVAL; + + /* + * dma_buf_mmap_internal() has asserted that the VMA is + * contained within the DMABUF size before calling this. + */ + + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot); + + /* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED. */ + vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP | + VM_DONTEXPAND | VM_DONTDUMP); + vma->vm_private_data = priv; + vma->vm_ops = &vfio_pci_mmap_ops; + + return 0; +} #endif /* CONFIG_VFIO_PCI_DMABUF */
static void vfio_pci_dma_buf_done(struct kref *kref) @@ -94,6 +129,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) static const struct dma_buf_ops vfio_pci_dmabuf_ops = { #ifdef CONFIG_VFIO_PCI_DMABUF .attach = vfio_pci_dma_buf_attach, + .mmap = vfio_pci_dma_buf_mmap, #endif .map_dma_buf = vfio_pci_dma_buf_map, .unmap_dma_buf = vfio_pci_dma_buf_unmap, diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index 10833aabd7fb..db2e2aeae88f 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -38,6 +38,8 @@ struct vfio_pci_dma_buf { u8 revoked : 1; };
+extern const struct vm_operations_struct vfio_pci_mmap_ops; + bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev); void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
On Wed, Jun 10, 2026 at 04:43:21PM +0100, Matt Evans wrote:
Hi Matt,
[...]
** With the goal of taking vdev->memory_lock in a world where* vdev might not still exist:** 1. Take the resv lock on the DMABUF:* - If racing cleanup got in first, the buffer is revoked;* stop/exit if so.* - If we got in first, the buffer is not revoked so vdev is* non-NULL, accessible, and cleanup _has not yet put the* VFIO device registration_. So, the device refcount must* be >0.** 2. Take vfio_device registration (refcount guaranteed >0* hereafter).** 3. Unlock the DMABUF's resv lock:* - A racing cleanup can now complete.* - But, the device refcount >0, meaning the vfio_device* (and vfio_pcie_core device vdev) have not yet been* freed. vdev is accessible, even if the DMABUF has been* revoked or cleanup has happened, because* vfio_unregister_group_dev() can't complete.** 4. Take the vdev->memory_lock* - Either the DMABUF is usable, or has been cleaned up.* Whichever, it can no longer change under us.* - Test the DMABUF revocation status again: if it was* revoked between 1 and 4 return a SIGBUS. Otherwise,* return a PFN.* - It's not necessary to also take the resv lock, because* the status/vdev can't change while memory_lock is held.* */* 5. Unlock, done.- dma_resv_lock(priv->dmabuf->resv, NULL);
- if (priv->revoked) {
pr_debug_ratelimited("%s VA 0x%lx, pgoff 0x%lx: DMABUF revoked/cleaned up\n",__func__, vmf->address, vma->vm_pgoff);dma_resv_unlock(priv->dmabuf->resv);return VM_FAULT_SIGBUS;- }
- /* If the buffer isn't revoked, vdev is valid */ vdev = priv->vdev;
- if (!vfio_device_try_get_registration(&vdev->vdev)) {
/** If vdev != NULL (above), the registration should* already be >0 and so this try_get should never* fail.*/dev_warn(&vdev->pdev->dev, "%s: Unexpected registration failure\n",__func__);dma_resv_unlock(priv->dmabuf->resv);return VM_FAULT_SIGBUS;- }
- dma_resv_unlock(priv->dmabuf->resv);
scoped_guard(rwsem_read, &vdev->memory_lock) {
if (!priv->revoked) { int pres = vfio_pci_dma_buf_find_pfn(priv, vma, vmf->address,/* Revocation status must be re-read, under memory_lock */
Wait, I noticed that the is_aligned_for_order() check from mainline was removed here. Was that intentional?
For hugepage faults (order > 0), we must ensure the PFN and address are properly aligned before calling vfio_pci_vmf_insert_pfn().
In the current upstream code, we have: if (is_aligned_for_order(vma, addr, pfn, order))
Should we restore that check here?
@@ -1766,6 +1827,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf, __func__, order, pfn, vmf->address, vma->vm_pgoff, (unsigned int)ret);
- vfio_device_put_registration(&vdev->vdev); return ret;
} @@ -1774,7 +1836,7 @@ static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf) return vfio_pci_mmap_huge_fault(vmf, 0); } -static const struct vm_operations_struct vfio_pci_mmap_ops = { +const struct vm_operations_struct vfio_pci_mmap_ops = { .fault = vfio_pci_mmap_page_fault,
Nit: Instead of making this global, should we add a helper? E.g.:
void vfio_pci_set_vma_ops(struct vm_area_struct *vma) { vma->vm_ops = &vfio_pci_mmap_ops; }
[...]
+static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma) +{
- struct vfio_pci_dma_buf *priv = dmabuf->priv;
- /*
* If we observe that the buffer is revoked now then refuse* the mmap(). This is a belt-and-braces early failure to* ease debugging a revoked buffer being used. Userspace* might also race an mmap() against an explicit revocation,* or an action doing a temporary revoke; race scenarios are* still safe because the fault handler ultimately prevents* access to a revoked buffer if it isn't caught here.*/- if (READ_ONCE(priv->revoked))
return -ENODEV;- if ((vma->vm_flags & VM_SHARED) == 0)
return -EINVAL;- /*
* dma_buf_mmap_internal() has asserted that the VMA is* contained within the DMABUF size before calling this.*/- vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
- vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
- /* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED. */
- vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
VM_DONTEXPAND | VM_DONTDUMP);- vma->vm_private_data = priv;
- vma->vm_ops = &vfio_pci_mmap_ops;
- return 0;
+} #endif /* CONFIG_VFIO_PCI_DMABUF */
Thanks, Praan
Hi Praan,
On 12/06/2026 21:35, Pranjal Shrivastava wrote:
On Wed, Jun 10, 2026 at 04:43:21PM +0100, Matt Evans wrote:
Hi Matt,
[...]
** With the goal of taking vdev->memory_lock in a world where* vdev might not still exist:** 1. Take the resv lock on the DMABUF:* - If racing cleanup got in first, the buffer is revoked;* stop/exit if so.* - If we got in first, the buffer is not revoked so vdev is* non-NULL, accessible, and cleanup _has not yet put the* VFIO device registration_. So, the device refcount must* be >0.** 2. Take vfio_device registration (refcount guaranteed >0* hereafter).** 3. Unlock the DMABUF's resv lock:* - A racing cleanup can now complete.* - But, the device refcount >0, meaning the vfio_device* (and vfio_pcie_core device vdev) have not yet been* freed. vdev is accessible, even if the DMABUF has been* revoked or cleanup has happened, because* vfio_unregister_group_dev() can't complete.** 4. Take the vdev->memory_lock* - Either the DMABUF is usable, or has been cleaned up.* Whichever, it can no longer change under us.* - Test the DMABUF revocation status again: if it was* revoked between 1 and 4 return a SIGBUS. Otherwise,* return a PFN.* - It's not necessary to also take the resv lock, because* the status/vdev can't change while memory_lock is held.* */* 5. Unlock, done.- dma_resv_lock(priv->dmabuf->resv, NULL);
- if (priv->revoked) {
pr_debug_ratelimited("%s VA 0x%lx, pgoff 0x%lx: DMABUF revoked/cleaned up\n",__func__, vmf->address, vma->vm_pgoff);dma_resv_unlock(priv->dmabuf->resv);return VM_FAULT_SIGBUS;- }
- /* If the buffer isn't revoked, vdev is valid */ vdev = priv->vdev;
- if (!vfio_device_try_get_registration(&vdev->vdev)) {
/** If vdev != NULL (above), the registration should* already be >0 and so this try_get should never* fail.*/dev_warn(&vdev->pdev->dev, "%s: Unexpected registration failure\n",__func__);dma_resv_unlock(priv->dmabuf->resv);return VM_FAULT_SIGBUS;- }
- dma_resv_unlock(priv->dmabuf->resv);
scoped_guard(rwsem_read, &vdev->memory_lock) {
if (!priv->revoked) { int pres = vfio_pci_dma_buf_find_pfn(priv, vma, vmf->address,/* Revocation status must be re-read, under memory_lock */Wait, I noticed that the is_aligned_for_order() check from mainline was removed here. Was that intentional?
For hugepage faults (order > 0), we must ensure the PFN and address are properly aligned before calling vfio_pci_vmf_insert_pfn().
In the current upstream code, we have: if (is_aligned_for_order(vma, addr, pfn, order))
Should we restore that check here?
The alignment check is done within the helper vfio_pci_dma_buf_find_pfn(), which returns -EAGAIN if order > 0 and a search result isn't usable due to alignment. That leads to VM_FAULT_FALLBACK here, ensuring vfio_pci_vmf_insert_pfn() isn't called with anything weird.
@@ -1766,6 +1827,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf, __func__, order, pfn, vmf->address, vma->vm_pgoff, (unsigned int)ret);
- vfio_device_put_registration(&vdev->vdev); return ret;
} @@ -1774,7 +1836,7 @@ static vm_fault_t vfio_pci_mmap_page_fault(struct vm_fault *vmf) return vfio_pci_mmap_huge_fault(vmf, 0); } -static const struct vm_operations_struct vfio_pci_mmap_ops = { +const struct vm_operations_struct vfio_pci_mmap_ops = { .fault = vfio_pci_mmap_page_fault,
Nit: Instead of making this global, should we add a helper? E.g.:
void vfio_pci_set_vma_ops(struct vm_area_struct *vma) { vma->vm_ops = &vfio_pci_mmap_ops; }
I'll give it a go, it would be nice to keep that encapsulated.
Thanks,
Matt
[...]
+static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma) +{
- struct vfio_pci_dma_buf *priv = dmabuf->priv;
- /*
* If we observe that the buffer is revoked now then refuse* the mmap(). This is a belt-and-braces early failure to* ease debugging a revoked buffer being used. Userspace* might also race an mmap() against an explicit revocation,* or an action doing a temporary revoke; race scenarios are* still safe because the fault handler ultimately prevents* access to a revoked buffer if it isn't caught here.*/- if (READ_ONCE(priv->revoked))
return -ENODEV;- if ((vma->vm_flags & VM_SHARED) == 0)
return -EINVAL;- /*
* dma_buf_mmap_internal() has asserted that the VMA is* contained within the DMABUF size before calling this.*/- vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
- vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
- /* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED. */
- vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
VM_DONTEXPAND | VM_DONTDUMP);- vma->vm_private_data = priv;
- vma->vm_ops = &vfio_pci_mmap_ops;
- return 0;
+} #endif /* CONFIG_VFIO_PCI_DMABUF */
Thanks, Praan
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
- /*
* dma_buf_mmap_internal() has asserted that the VMA is* contained within the DMABUF size before calling this.*/- vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
- vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
remove the blank line.
Reviewed-by: Kevin Tian kevin.tian@intel.com
Expand the VFIO DMABUF revocation state to three states: Not revoked, temporarily revoked, and permanently revoked.
The first two are for existing transient revocation, e.g. across a function reset, and the DMABUF is put into the last in response to a new VFIO feature VFIO_DEVICE_FEATURE_DMA_BUF.
VFIO_DEVICE_FEATURE_DMA_BUF passes a DMABUF by fd and requests that the DMABUF is permanently revoked. On success, it's guaranteed that the buffer can never be imported/attached/mmap()ed in future, that dynamic imports have been cleanly detached, and that all mappings have been made inaccessible/PTEs zapped.
This is useful for lifecycle management, to reclaim VFIO PCI BAR ranges previously delegated to a subordinate client process: The driver process can ensure that the loaned resources are revoked when the client is deemed "done", and exported ranges can be safely re-used elsewhere.
Refactor the revocation code out of vfio_pci_dma_buf_move() to a function common to move and the new feature request path.
Signed-off-by: Matt Evans matt@ozlabs.org --- drivers/vfio/pci/vfio_pci_core.c | 6 +- drivers/vfio/pci/vfio_pci_dmabuf.c | 169 ++++++++++++++++++++++------- drivers/vfio/pci/vfio_pci_priv.h | 19 +++- include/uapi/linux/vfio.h | 20 ++++ 4 files changed, 173 insertions(+), 41 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 508a5eca910a..064906b25467 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1573,6 +1573,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, return vfio_pci_core_feature_token(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_DMA_BUF: return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz); + case VFIO_DEVICE_FEATURE_DMA_BUF_REVOKE: + return vfio_pci_core_feature_dma_buf_revoke(vdev, flags, arg, argsz); default: return -ENOTTY; } @@ -1784,7 +1786,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
dma_resv_lock(priv->dmabuf->resv, NULL);
- if (priv->revoked) { + if (priv->status != VFIO_PCI_DMABUF_OK) { pr_debug_ratelimited("%s VA 0x%lx, pgoff 0x%lx: DMABUF revoked/cleaned up\n", __func__, vmf->address, vma->vm_pgoff); dma_resv_unlock(priv->dmabuf->resv); @@ -1809,7 +1811,7 @@ static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
scoped_guard(rwsem_read, &vdev->memory_lock) { /* Revocation status must be re-read, under memory_lock */ - if (!priv->revoked) { + if (priv->status == VFIO_PCI_DMABUF_OK) { int pres = vfio_pci_dma_buf_find_pfn(priv, vma, vmf->address, order, &pfn); diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 2fb09a2c0f6b..b47411992ab6 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -19,7 +19,7 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, if (!attachment->peer2peer) return -EOPNOTSUPP;
- if (priv->revoked) + if (priv->status != VFIO_PCI_DMABUF_OK) return -ENODEV;
if (!dma_buf_attach_revocable(attachment)) @@ -41,7 +41,7 @@ static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct * * still safe because the fault handler ultimately prevents * access to a revoked buffer if it isn't caught here. */ - if (READ_ONCE(priv->revoked)) + if (READ_ONCE(priv->status) != VFIO_PCI_DMABUF_OK) return -ENODEV; if ((vma->vm_flags & VM_SHARED) == 0) return -EINVAL; @@ -81,7 +81,7 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
dma_resv_assert_held(priv->dmabuf->resv);
- if (priv->revoked) + if (priv->status != VFIO_PCI_DMABUF_OK) return ERR_PTR(-ENODEV);
ret = dma_buf_phys_vec_to_sgt(attachment, priv->provider, @@ -291,7 +291,8 @@ static int vfio_pci_dmabuf_export(struct vfio_pci_core_device *vdev, INIT_LIST_HEAD(&priv->dmabufs_elm); down_write(&vdev->memory_lock); dma_resv_lock(priv->dmabuf->resv, NULL); - priv->revoked = !__vfio_pci_memory_enabled(vdev); + priv->status = __vfio_pci_memory_enabled(vdev) ? VFIO_PCI_DMABUF_OK : + VFIO_PCI_DMABUF_TEMP_REVOKED; list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs); dma_resv_unlock(priv->dmabuf->resv); up_write(&vdev->memory_lock); @@ -322,7 +323,7 @@ int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, return -EOPNOTSUPP;
priv = attachment->dmabuf->priv; - if (priv->revoked) + if (priv->status != VFIO_PCI_DMABUF_OK) return -ENODEV;
/* More than one range to iommufd will require proper DMABUF support */ @@ -591,6 +592,64 @@ int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev, return ret; }
+/* Set the DMABUF's revocation status (OK or temporarily/permanently revoked) */ +static void vfio_pci_dma_buf_set_status(struct vfio_pci_dma_buf *priv, + enum vfio_pci_dma_buf_status new_status) +{ + bool was_revoked; + + lockdep_assert_held_write(&priv->vdev->memory_lock); + + if (priv->status == VFIO_PCI_DMABUF_PERM_REVOKED || + priv->status == new_status) { + return; + } + + dma_resv_lock(priv->dmabuf->resv, NULL); + was_revoked = (priv->status == VFIO_PCI_DMABUF_TEMP_REVOKED); + + if (new_status != VFIO_PCI_DMABUF_OK) { + priv->status = new_status; /* Temp or permanently revoked */ + + if (was_revoked) { + /* + * TEMP_REVOKED is being upgraded to + * PERM_REVOKED. The buffer is already gone, + * don't wait on it again. + */ + dma_resv_unlock(priv->dmabuf->resv); + return; + } + } + + dma_buf_invalidate_mappings(priv->dmabuf); + dma_resv_wait_timeout(priv->dmabuf->resv, + DMA_RESV_USAGE_BOOKKEEP, false, + MAX_SCHEDULE_TIMEOUT); + dma_resv_unlock(priv->dmabuf->resv); + if (new_status != VFIO_PCI_DMABUF_OK) { + kref_put(&priv->kref, vfio_pci_dma_buf_done); + wait_for_completion(&priv->comp); + unmap_mapping_range(priv->dmabuf->file->f_mapping, + 0, priv->size, 1); + /* + * Re-arm the registered kref reference and the + * completion so the post-revoke state matches the + * post-creation state. An un-revoke followed by a + * new mapping needs the kref to be non-zero before + * kref_get(), and vfio_pci_dma_buf_cleanup() + * delegates its drain back through this revoke + * path on a possibly-already-revoked dma-buf. + */ + kref_init(&priv->kref); + reinit_completion(&priv->comp); + } else { + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->status = VFIO_PCI_DMABUF_OK; + dma_resv_unlock(priv->dmabuf->resv); + } +} + void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) { struct vfio_pci_dma_buf *priv; @@ -599,44 +658,15 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) lockdep_assert_held_write(&vdev->memory_lock); /* * Holding memory_lock ensures a racing VMA fault observes - * priv->revoked properly. + * priv->status properly. */
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { if (!get_file_active(&priv->dmabuf->file)) continue; - - if (priv->revoked != revoked) { - dma_resv_lock(priv->dmabuf->resv, NULL); - if (revoked) - priv->revoked = true; - dma_buf_invalidate_mappings(priv->dmabuf); - dma_resv_wait_timeout(priv->dmabuf->resv, - DMA_RESV_USAGE_BOOKKEEP, false, - MAX_SCHEDULE_TIMEOUT); - dma_resv_unlock(priv->dmabuf->resv); - if (revoked) { - kref_put(&priv->kref, vfio_pci_dma_buf_done); - wait_for_completion(&priv->comp); - unmap_mapping_range(priv->dmabuf->file->f_mapping, - 0, priv->size, 1); - /* - * Re-arm the registered kref reference and the - * completion so the post-revoke state matches the - * post-creation state. An un-revoke followed by a - * new mapping needs the kref to be non-zero before - * kref_get(), and vfio_pci_dma_buf_cleanup() - * delegates its drain back through this revoke - * path on a possibly-already-revoked dma-buf. - */ - kref_init(&priv->kref); - reinit_completion(&priv->comp); - } else { - dma_resv_lock(priv->dmabuf->resv, NULL); - priv->revoked = false; - dma_resv_unlock(priv->dmabuf->resv); - } - } + vfio_pci_dma_buf_set_status(priv, revoked ? + VFIO_PCI_DMABUF_TEMP_REVOKED : + VFIO_PCI_DMABUF_OK); fput(priv->dmabuf->file); } } @@ -668,3 +698,66 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) } up_write(&vdev->memory_lock); } + +#ifdef CONFIG_VFIO_PCI_DMABUF +int vfio_pci_core_feature_dma_buf_revoke( + struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf_revoke __user *arg, + size_t argsz) +{ + struct vfio_device_feature_dma_buf_revoke db_revoke; + struct vfio_pci_dma_buf *priv; + struct dma_buf *dmabuf; + int ret; + + if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys) + return -EOPNOTSUPP; + + ret = vfio_check_feature(flags, argsz, + VFIO_DEVICE_FEATURE_SET, + sizeof(db_revoke)); + if (ret != 1) + return ret; + + if (copy_from_user(&db_revoke, arg, sizeof(db_revoke))) + return -EFAULT; + + dmabuf = dma_buf_get(db_revoke.dmabuf_fd); + if (IS_ERR(dmabuf)) + return PTR_ERR(dmabuf); + + priv = dmabuf->priv; + /* + * Sanity-check the DMABUF is really a vfio_pci_dma_buf _and_ + * relates to the VFIO device it was provided with. + * + * If the DMABUF relates to this vdev then priv->vdev is + * stable because this open fd prevents cleanup. + * + * If it relates to a different vdev, reading priv->vdev might + * race with a concurrent cleanup on that device. But if so, + * it points to a non-matching vdev or NULL and is unusable + * either way. + */ + if (dmabuf->ops != &vfio_pci_dmabuf_ops || + READ_ONCE(priv->vdev) != vdev) { + ret = -ENODEV; + goto out_put_buf; + } + + scoped_guard(rwsem_write, &vdev->memory_lock) { + if (priv->status == VFIO_PCI_DMABUF_PERM_REVOKED) { + ret = -EBADFD; + } else { + vfio_pci_dma_buf_set_status(priv, + VFIO_PCI_DMABUF_PERM_REVOKED); + ret = 0; + } + } + +out_put_buf: + dma_buf_put(dmabuf); + + return ret; +} +#endif /* CONFIG_VFIO_PCI_DMABUF */ diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index db2e2aeae88f..3c2f2575b670 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -23,6 +23,12 @@ struct vfio_pci_ioeventfd { bool test_mem; };
+enum vfio_pci_dma_buf_status { + VFIO_PCI_DMABUF_OK = 0, + VFIO_PCI_DMABUF_TEMP_REVOKED = 1, + VFIO_PCI_DMABUF_PERM_REVOKED = 2, +}; + struct vfio_pci_dma_buf { struct dma_buf *dmabuf; struct vfio_pci_core_device *vdev; @@ -35,7 +41,7 @@ struct vfio_pci_dma_buf { struct kref kref; struct completion comp; unsigned long vma_pgoff_adjust; - u8 revoked : 1; + enum vfio_pci_dma_buf_status status; };
extern const struct vm_operations_struct vfio_pci_mmap_ops; @@ -148,6 +154,10 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, struct vfio_device_feature_dma_buf __user *arg, size_t argsz); +int vfio_pci_core_feature_dma_buf_revoke( + struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf_revoke __user *arg, + size_t argsz); #else static inline int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, @@ -156,6 +166,13 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, { return -ENOTTY; } +static inline int vfio_pci_core_feature_dma_buf_revoke( + struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf_revoke __user *arg, + size_t argsz) +{ + return -ENOTTY; +} #endif
#endif diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 5de618a3a5ee..697c0bb4b9bc 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -1534,6 +1534,26 @@ struct vfio_device_feature_dma_buf { */ #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 12
+/** + * Given a dma_buf fd previously created by + * VFIO_DEVICE_FEATURE_DMA_BUF, a SET of this feature requests that + * access to the corresponding DMABUF is immediately and permanently + * revoked. On successful return, the buffer is not accessible + * through any mmap() or dma-buf import. The buffer is permanently + * disabled, and VFIO refuses all map, mmap, attach, etc. requests. + * + * Return: 0 on success, -1 and errno is set on failure: + * + * EBADF, EINVAL: dmabuf_fd is not a DMABUF fd. + * ENODEV: The dmabuf_fd does not match this VFIO device. + * EBADFD: The DMABUF is already revoked. + */ +#define VFIO_DEVICE_FEATURE_DMA_BUF_REVOKE 13 + +struct vfio_device_feature_dma_buf_revoke { + __s32 dmabuf_fd; +}; + /* -------- API for Type1 VFIO IOMMU -------- */
/**
On Wed, Jun 10, 2026 at 04:43:22PM +0100, Matt Evans wrote:
Expand the VFIO DMABUF revocation state to three states: Not revoked, temporarily revoked, and permanently revoked.
The first two are for existing transient revocation, e.g. across a function reset, and the DMABUF is put into the last in response to a new VFIO feature VFIO_DEVICE_FEATURE_DMA_BUF.
VFIO_DEVICE_FEATURE_DMA_BUF passes a DMABUF by fd and requests that the DMABUF is permanently revoked. On success, it's guaranteed that the buffer can never be imported/attached/mmap()ed in future, that dynamic imports have been cleanly detached, and that all mappings have been made inaccessible/PTEs zapped.
This is useful for lifecycle management, to reclaim VFIO PCI BAR ranges previously delegated to a subordinate client process: The driver process can ensure that the loaned resources are revoked when the client is deemed "done", and exported ranges can be safely re-used elsewhere.
Refactor the revocation code out of vfio_pci_dma_buf_move() to a function common to move and the new feature request path.
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/vfio_pci_core.c | 6 +- drivers/vfio/pci/vfio_pci_dmabuf.c | 169 ++++++++++++++++++++++------- drivers/vfio/pci/vfio_pci_priv.h | 19 +++- include/uapi/linux/vfio.h | 20 ++++ 4 files changed, 173 insertions(+), 41 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 508a5eca910a..064906b25467 100644
[...]
+/* Set the DMABUF's revocation status (OK or temporarily/permanently revoked) */ +static void vfio_pci_dma_buf_set_status(struct vfio_pci_dma_buf *priv,
enum vfio_pci_dma_buf_status new_status)+{
- bool was_revoked;
- lockdep_assert_held_write(&priv->vdev->memory_lock);
- if (priv->status == VFIO_PCI_DMABUF_PERM_REVOKED ||
priv->status == new_status) {return;- }
- dma_resv_lock(priv->dmabuf->resv, NULL);
- was_revoked = (priv->status == VFIO_PCI_DMABUF_TEMP_REVOKED);
- if (new_status != VFIO_PCI_DMABUF_OK) {
priv->status = new_status; /* Temp or permanently revoked */if (was_revoked) {/** TEMP_REVOKED is being upgraded to* PERM_REVOKED. The buffer is already gone,* don't wait on it again.*/dma_resv_unlock(priv->dmabuf->resv);return;}- }
- dma_buf_invalidate_mappings(priv->dmabuf);
Nit: We seem to be calling this even if new_status == OK, while it works as importers (like IOMMUFD and RDMA core) are immune to a double invalidate / revoke. I'm wondering if we could move this within the if (new_status != VFIO_PCI_DMABUF_OK) block?
Since this is only needed to be called when we TEMP/PERM _REVOKE?
I'm just worried that this may overload the dma_buf_invalidate_mappings to be a state-change notification instead of a revoke / invalidate notification.
- dma_resv_wait_timeout(priv->dmabuf->resv,
DMA_RESV_USAGE_BOOKKEEP, false,MAX_SCHEDULE_TIMEOUT);- dma_resv_unlock(priv->dmabuf->resv);
- if (new_status != VFIO_PCI_DMABUF_OK) {
kref_put(&priv->kref, vfio_pci_dma_buf_done);wait_for_completion(&priv->comp);unmap_mapping_range(priv->dmabuf->file->f_mapping,0, priv->size, 1);/** Re-arm the registered kref reference and the* completion so the post-revoke state matches the* post-creation state. An un-revoke followed by a* new mapping needs the kref to be non-zero before* kref_get(), and vfio_pci_dma_buf_cleanup()* delegates its drain back through this revoke* path on a possibly-already-revoked dma-buf.*/kref_init(&priv->kref);reinit_completion(&priv->comp);- } else {
dma_resv_lock(priv->dmabuf->resv, NULL);priv->status = VFIO_PCI_DMABUF_OK;dma_resv_unlock(priv->dmabuf->resv);- }
+}
Otherwise, Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks, Praan
Hi Praan,
On 16/06/2026 09:05, Pranjal Shrivastava wrote:
On Wed, Jun 10, 2026 at 04:43:22PM +0100, Matt Evans wrote:
Expand the VFIO DMABUF revocation state to three states: Not revoked, temporarily revoked, and permanently revoked.
The first two are for existing transient revocation, e.g. across a function reset, and the DMABUF is put into the last in response to a new VFIO feature VFIO_DEVICE_FEATURE_DMA_BUF.
VFIO_DEVICE_FEATURE_DMA_BUF passes a DMABUF by fd and requests that the DMABUF is permanently revoked. On success, it's guaranteed that the buffer can never be imported/attached/mmap()ed in future, that dynamic imports have been cleanly detached, and that all mappings have been made inaccessible/PTEs zapped.
This is useful for lifecycle management, to reclaim VFIO PCI BAR ranges previously delegated to a subordinate client process: The driver process can ensure that the loaned resources are revoked when the client is deemed "done", and exported ranges can be safely re-used elsewhere.
Refactor the revocation code out of vfio_pci_dma_buf_move() to a function common to move and the new feature request path.
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/vfio_pci_core.c | 6 +- drivers/vfio/pci/vfio_pci_dmabuf.c | 169 ++++++++++++++++++++++------- drivers/vfio/pci/vfio_pci_priv.h | 19 +++- include/uapi/linux/vfio.h | 20 ++++ 4 files changed, 173 insertions(+), 41 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 508a5eca910a..064906b25467 100644
[...]
+/* Set the DMABUF's revocation status (OK or temporarily/permanently revoked) */ +static void vfio_pci_dma_buf_set_status(struct vfio_pci_dma_buf *priv,
enum vfio_pci_dma_buf_status new_status)+{
- bool was_revoked;
- lockdep_assert_held_write(&priv->vdev->memory_lock);
- if (priv->status == VFIO_PCI_DMABUF_PERM_REVOKED ||
priv->status == new_status) {return;- }
- dma_resv_lock(priv->dmabuf->resv, NULL);
- was_revoked = (priv->status == VFIO_PCI_DMABUF_TEMP_REVOKED);
- if (new_status != VFIO_PCI_DMABUF_OK) {
priv->status = new_status; /* Temp or permanently revoked */if (was_revoked) {/** TEMP_REVOKED is being upgraded to* PERM_REVOKED. The buffer is already gone,* don't wait on it again.*/dma_resv_unlock(priv->dmabuf->resv);return;}- }
- dma_buf_invalidate_mappings(priv->dmabuf);
Nit: We seem to be calling this even if new_status == OK, while it works as importers (like IOMMUFD and RDMA core) are immune to a double invalidate / revoke. I'm wondering if we could move this within the if (new_status != VFIO_PCI_DMABUF_OK) block?
Since this is only needed to be called when we TEMP/PERM _REVOKE?
I'm just worried that this may overload the dma_buf_invalidate_mappings to be a state-change notification instead of a revoke / invalidate notification.
(In case you didn't see my reply to Kevin who made a similar request: https://lore.kernel.org/all/31d1265b-e264-4dc6-a8c3-1b64dc9867a1@ozlabs.org/ )
Done, moved to a common `!= VFIO_PCI_DMABUF_OK` block.
Just to flag that since it's currently called for both revoke and un-revoke paths, someone could already rely on this as a state-change notification. I don't see any current examples, but giving it a mention in case any readers know of any.
- dma_resv_wait_timeout(priv->dmabuf->resv,
DMA_RESV_USAGE_BOOKKEEP, false,MAX_SCHEDULE_TIMEOUT);- dma_resv_unlock(priv->dmabuf->resv);
- if (new_status != VFIO_PCI_DMABUF_OK) {
kref_put(&priv->kref, vfio_pci_dma_buf_done);wait_for_completion(&priv->comp);unmap_mapping_range(priv->dmabuf->file->f_mapping,0, priv->size, 1);/** Re-arm the registered kref reference and the* completion so the post-revoke state matches the* post-creation state. An un-revoke followed by a* new mapping needs the kref to be non-zero before* kref_get(), and vfio_pci_dma_buf_cleanup()* delegates its drain back through this revoke* path on a possibly-already-revoked dma-buf.*/kref_init(&priv->kref);reinit_completion(&priv->comp);- } else {
dma_resv_lock(priv->dmabuf->resv, NULL);priv->status = VFIO_PCI_DMABUF_OK;dma_resv_unlock(priv->dmabuf->resv);- }
+}
Otherwise, Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks, Praan
Thanks!
Matt
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
Expand the VFIO DMABUF revocation state to three states: Not revoked, temporarily revoked, and permanently revoked.
The first two are for existing transient revocation, e.g. across a function reset, and the DMABUF is put into the last in response to a new VFIO feature VFIO_DEVICE_FEATURE_DMA_BUF.
VFIO_DEVICE_FEATURE_DMA_BUF_REVOKE
VFIO_DEVICE_FEATURE_DMA_BUF passes a DMABUF by fd and requests that the DMABUF is permanently revoked. On success, it's guaranteed that
ditto
the buffer can never be imported/attached/mmap()ed in future, that dynamic imports have been cleanly detached, and that all mappings have been made inaccessible/PTEs zapped.
This is useful for lifecycle management, to reclaim VFIO PCI BAR ranges previously delegated to a subordinate client process: The driver process can ensure that the loaned resources are revoked when the client is deemed "done", and exported ranges can be safely re-used elsewhere.
probably clarify that re-use by creating a new dmabuf fd as the original one is essentially zombie now.
+/* Set the DMABUF's revocation status (OK or temporarily/permanently revoked) */ +static void vfio_pci_dma_buf_set_status(struct vfio_pci_dma_buf *priv,
enum vfio_pci_dma_buf_statusnew_status) +{
- bool was_revoked;
- lockdep_assert_held_write(&priv->vdev->memory_lock);
- if (priv->status == VFIO_PCI_DMABUF_PERM_REVOKED ||
priv->status == new_status) {return;- }
the only interface to request PERM_REVOKED is via the new ioctl.
vfio_pci_core_feature_dma_buf_revoke() returns -EBADFD if it's already in PERM_REVOKED.
so this check shouldn't be reached, suggesting a warning.
- dma_buf_invalidate_mappings(priv->dmabuf);
- dma_resv_wait_timeout(priv->dmabuf->resv,
DMA_RESV_USAGE_BOOKKEEP, false,MAX_SCHEDULE_TIMEOUT);- dma_resv_unlock(priv->dmabuf->resv);
It's existing code but while at it let's make above conditional to the actual revoke path. for unrevoked it's not required given the previous revoke already cleans up everything.
otherwise,
Reviewed-by: Kevin Tian kevin.tian@intel.com
Hi Kevin,
On 16/06/2026 10:26, Tian, Kevin wrote:
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
Expand the VFIO DMABUF revocation state to three states: Not revoked, temporarily revoked, and permanently revoked.
The first two are for existing transient revocation, e.g. across a function reset, and the DMABUF is put into the last in response to a new VFIO feature VFIO_DEVICE_FEATURE_DMA_BUF.
VFIO_DEVICE_FEATURE_DMA_BUF_REVOKE
VFIO_DEVICE_FEATURE_DMA_BUF passes a DMABUF by fd and requests that the DMABUF is permanently revoked. On success, it's guaranteed that
ditto
Argh, thanks for catching these. Fixed.
the buffer can never be imported/attached/mmap()ed in future, that dynamic imports have been cleanly detached, and that all mappings have been made inaccessible/PTEs zapped.
This is useful for lifecycle management, to reclaim VFIO PCI BAR ranges previously delegated to a subordinate client process: The driver process can ensure that the loaned resources are revoked when the client is deemed "done", and exported ranges can be safely re-used elsewhere.
probably clarify that re-use by creating a new dmabuf fd as the original one is essentially zombie now.
Reworded this, plus added a note re the change below.
+/* Set the DMABUF's revocation status (OK or temporarily/permanently revoked) */ +static void vfio_pci_dma_buf_set_status(struct vfio_pci_dma_buf *priv,
enum vfio_pci_dma_buf_statusnew_status) +{
- bool was_revoked;
- lockdep_assert_held_write(&priv->vdev->memory_lock);
- if (priv->status == VFIO_PCI_DMABUF_PERM_REVOKED ||
priv->status == new_status) {return;- }
the only interface to request PERM_REVOKED is via the new ioctl.
vfio_pci_core_feature_dma_buf_revoke() returns -EBADFD if it's already in PERM_REVOKED.
so this check shouldn't be reached, suggesting a warning.
Good point, both any change to PERM_REVOKED or a double-set of the same state indicate some caller has gone wrong. Added a warning.
- dma_buf_invalidate_mappings(priv->dmabuf);
- dma_resv_wait_timeout(priv->dmabuf->resv,
DMA_RESV_USAGE_BOOKKEEP, false,MAX_SCHEDULE_TIMEOUT);- dma_resv_unlock(priv->dmabuf->resv);
It's existing code but while at it let's make above conditional to the actual revoke path. for unrevoked it's not required given the previous revoke already cleans up everything.
I noticed this too though I was consciously trying to keep the diff as small as possible. But with this feedback from both you and Praan, I'll move this. It's still pretty readable before/after.
otherwise,
Reviewed-by: Kevin Tian kevin.tian@intel.com
Thank you.
Matt
A new VFIO feature, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR, is added to set CPU-facing memory type attributes for a DMABUF exported from vfio-pci. These are used for subsequent mmap()s of the buffer.
There are two attributes supported: - The default, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC - VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC, which results in WC PTEs for the DMABUF's BAR region.
Signed-off-by: Matt Evans matt@ozlabs.org --- drivers/vfio/pci/vfio_pci_core.c | 2 ++ drivers/vfio/pci/vfio_pci_dmabuf.c | 57 +++++++++++++++++++++++++++++- drivers/vfio/pci/vfio_pci_priv.h | 14 ++++++++ include/uapi/linux/vfio.h | 27 ++++++++++++++ 4 files changed, 99 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 064906b25467..dc9c6f479e2c 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -1575,6 +1575,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_DMA_BUF_REVOKE: return vfio_pci_core_feature_dma_buf_revoke(vdev, flags, arg, argsz); + case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR: + return vfio_pci_core_feature_dma_buf_memattr(vdev, flags, arg, argsz); default: return -ENOTTY; } diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index b47411992ab6..58b769e65ab8 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -51,7 +51,10 @@ static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct * * contained within the DMABUF size before calling this. */
- vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); + if (READ_ONCE(priv->memattr) == VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC) + vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot); + else + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
/* See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED. */ @@ -468,6 +471,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, priv->vdev = vdev; priv->nr_ranges = get_dma_buf.nr_ranges; priv->size = length; + priv->memattr = VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC; ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider, get_dma_buf.region_index, priv->phys_vec, dma_ranges, @@ -755,6 +759,57 @@ int vfio_pci_core_feature_dma_buf_revoke( } }
+out_put_buf: + dma_buf_put(dmabuf); + + return ret; +} + +int vfio_pci_core_feature_dma_buf_memattr( + struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf_memattr __user *arg, + size_t argsz) +{ + struct vfio_device_feature_dma_buf_memattr db_attr; + struct vfio_pci_dma_buf *priv; + struct dma_buf *dmabuf; + int ret; + + if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys) + return -EOPNOTSUPP; + + ret = vfio_check_feature(flags, argsz, + VFIO_DEVICE_FEATURE_SET, + sizeof(db_attr)); + if (ret != 1) + return ret; + + if (copy_from_user(&db_attr, arg, sizeof(db_attr))) + return -EFAULT; + + dmabuf = dma_buf_get(db_attr.dmabuf_fd); + if (IS_ERR(dmabuf)) + return PTR_ERR(dmabuf); + + /* Verify DMABUF: see comments in vfio_pci_dma_buf_revoke() */ + priv = dmabuf->priv; + if (dmabuf->ops != &vfio_pci_dmabuf_ops || + READ_ONCE(priv->vdev) != vdev) { + ret = -ENODEV; + goto out_put_buf; + } + + switch (db_attr.memattr) { + case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC: + case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC: + WRITE_ONCE(priv->memattr, db_attr.memattr); + ret = 0; + break; + + default: + ret = -ENOENT; + } + out_put_buf: dma_buf_put(dmabuf);
diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index 3c2f2575b670..3ec8b62194f3 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -41,6 +41,7 @@ struct vfio_pci_dma_buf { struct kref kref; struct completion comp; unsigned long vma_pgoff_adjust; + u32 memattr; enum vfio_pci_dma_buf_status status; };
@@ -158,6 +159,10 @@ int vfio_pci_core_feature_dma_buf_revoke( struct vfio_pci_core_device *vdev, u32 flags, struct vfio_device_feature_dma_buf_revoke __user *arg, size_t argsz); +int vfio_pci_core_feature_dma_buf_memattr( + struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf_memattr __user *arg, + size_t argsz); #else static inline int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, @@ -166,6 +171,7 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, { return -ENOTTY; } + static inline int vfio_pci_core_feature_dma_buf_revoke( struct vfio_pci_core_device *vdev, u32 flags, struct vfio_device_feature_dma_buf_revoke __user *arg, @@ -173,6 +179,14 @@ static inline int vfio_pci_core_feature_dma_buf_revoke( { return -ENOTTY; } + +static inline int vfio_pci_core_feature_dma_buf_memattr( + struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf_memattr __user *arg, + size_t argsz) +{ + return -ENOTTY; +} #endif
#endif diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 697c0bb4b9bc..ab30b89399d0 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -1554,6 +1554,33 @@ struct vfio_device_feature_dma_buf_revoke { __s32 dmabuf_fd; };
+/** + * Given a dma_buf fd previously created by + * VFIO_DEVICE_FEATURE_DMA_BUF, SET the memory attribute that will be + * used by future mmap()s of that fd. SETting a new attribute does + * not affect existing VMAs. + * + * The default, if no previous SET has been performed, is NC. + * + * Return: 0 on success, -1 and errno is set on failure: + * + * EBADF, EINVAL: dmabuf_fd is not a DMABUF fd. + * ENODEV: The dmabuf_fd does not match this VFIO device. + * ENOENT: The given memattr is not supported. + */ +#define VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR 14 + +/* Valid memory attributes for the memattr field */ +enum vfio_device_dma_buf_memattr { + VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC = 0, /* pgprot_noncached */ + VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC = 1, /* pgprot_writecombine */ +}; + +struct vfio_device_feature_dma_buf_memattr { + __s32 dmabuf_fd; + __u32 memattr; +}; + /* -------- API for Type1 VFIO IOMMU -------- */
/**
On Wed, Jun 10, 2026 at 04:43:23PM +0100, Matt Evans wrote:
A new VFIO feature, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR, is added to set CPU-facing memory type attributes for a DMABUF exported from vfio-pci. These are used for subsequent mmap()s of the buffer.
There are two attributes supported:
- The default, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC
- VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC, which results in WC PTEs for the DMABUF's BAR region.
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/vfio_pci_core.c | 2 ++ drivers/vfio/pci/vfio_pci_dmabuf.c | 57 +++++++++++++++++++++++++++++- drivers/vfio/pci/vfio_pci_priv.h | 14 ++++++++ include/uapi/linux/vfio.h | 27 ++++++++++++++ 4 files changed, 99 insertions(+), 1 deletion(-)
+int vfio_pci_core_feature_dma_buf_memattr(
- struct vfio_pci_core_device *vdev, u32 flags,
- struct vfio_device_feature_dma_buf_memattr __user *arg,
- size_t argsz)
+{
- struct vfio_device_feature_dma_buf_memattr db_attr;
- struct vfio_pci_dma_buf *priv;
- struct dma_buf *dmabuf;
- int ret;
- if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys)
return -EOPNOTSUPP;- ret = vfio_check_feature(flags, argsz,
VFIO_DEVICE_FEATURE_SET,sizeof(db_attr));- if (ret != 1)
return ret;- if (copy_from_user(&db_attr, arg, sizeof(db_attr)))
return -EFAULT;- dmabuf = dma_buf_get(db_attr.dmabuf_fd);
- if (IS_ERR(dmabuf))
return PTR_ERR(dmabuf);- /* Verify DMABUF: see comments in vfio_pci_dma_buf_revoke() */
- priv = dmabuf->priv;
- if (dmabuf->ops != &vfio_pci_dmabuf_ops ||
READ_ONCE(priv->vdev) != vdev) {ret = -ENODEV;goto out_put_buf;- }
- switch (db_attr.memattr) {
- case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC:
- case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC:
WRITE_ONCE(priv->memattr, db_attr.memattr);ret = 0;break;- default:
ret = -ENOENT;
Nit: Looks like the agreement [1] was on -EOPNOTSUPP / -EINVAL but we took -ENOENT here and in the doc string? Was that intentional?
I tend to agree with Alex's suggestion here, we'd prefer one of those two (-EINVAL / -EOPNOTSUPP) since it clearly communicates to the user that "You sent a wrong arg" or "We don't support this"
-ENOENT means no such file or directory [2] to the user. Users may not be kernel engineers who'd wanna peek into the code and they may simply look at the uAPI files which doesn't give them an answer as to what went wrong.
- }
out_put_buf: dma_buf_put(dmabuf);
Apart from that, Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks, Praan
[1] https://lore.kernel.org/all/20260602131417.41366391@shazbot.org/ [2] https://elixir.bootlin.com/linux/v7.1-rc6/source/include/uapi/asm-generic/er...
Hi Praan,
On 16/06/2026 09:47, Pranjal Shrivastava wrote:
On Wed, Jun 10, 2026 at 04:43:23PM +0100, Matt Evans wrote:
A new VFIO feature, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR, is added to set CPU-facing memory type attributes for a DMABUF exported from vfio-pci. These are used for subsequent mmap()s of the buffer.
There are two attributes supported:
- The default, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC
- VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC, which results in WC PTEs for the DMABUF's BAR region.
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/vfio_pci_core.c | 2 ++ drivers/vfio/pci/vfio_pci_dmabuf.c | 57 +++++++++++++++++++++++++++++- drivers/vfio/pci/vfio_pci_priv.h | 14 ++++++++ include/uapi/linux/vfio.h | 27 ++++++++++++++ 4 files changed, 99 insertions(+), 1 deletion(-)
+int vfio_pci_core_feature_dma_buf_memattr(
- struct vfio_pci_core_device *vdev, u32 flags,
- struct vfio_device_feature_dma_buf_memattr __user *arg,
- size_t argsz)
+{
- struct vfio_device_feature_dma_buf_memattr db_attr;
- struct vfio_pci_dma_buf *priv;
- struct dma_buf *dmabuf;
- int ret;
- if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys)
return -EOPNOTSUPP;- ret = vfio_check_feature(flags, argsz,
VFIO_DEVICE_FEATURE_SET,sizeof(db_attr));- if (ret != 1)
return ret;- if (copy_from_user(&db_attr, arg, sizeof(db_attr)))
return -EFAULT;- dmabuf = dma_buf_get(db_attr.dmabuf_fd);
- if (IS_ERR(dmabuf))
return PTR_ERR(dmabuf);- /* Verify DMABUF: see comments in vfio_pci_dma_buf_revoke() */
- priv = dmabuf->priv;
- if (dmabuf->ops != &vfio_pci_dmabuf_ops ||
READ_ONCE(priv->vdev) != vdev) {ret = -ENODEV;goto out_put_buf;- }
- switch (db_attr.memattr) {
- case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC:
- case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC:
WRITE_ONCE(priv->memattr, db_attr.memattr);ret = 0;break;- default:
ret = -ENOENT;Nit: Looks like the agreement [1] was on -EOPNOTSUPP / -EINVAL but we took -ENOENT here and in the doc string? Was that intentional?
I tend to agree with Alex's suggestion here, we'd prefer one of those two (-EINVAL / -EOPNOTSUPP) since it clearly communicates to the user that "You sent a wrong arg" or "We don't support this"
Yes, it was intentional. This was noted in the v3 changelog entry in the cover letter:
- Removed GET on vfio_pci_core_feature_dma_buf_memattr(), removed unnecessary taking of memory_lock, fixed error return values. In particular, removes ENOTSUPP, and uses ENOENT to indicate an unknown attribute enum value was passed to SET. In the discussion here, https://lore.kernel.org/all/20260602131417.41366391@shazbot.org/ we'd agreed on EOPNOTSUPP before I realised that's already used elsewhere. ENOENT uniquely indicates an unknown attribute.
EINVAL/EOPNOTSUPP would indeed be semantically perfect, but after posting my reply there I remembered they are already overloaded with a load of different meanings.
I think uniqueness is important here so that memattr issues (for example any future arch-specific porting issues) show up as an immediately-understandable error value.
-ENOENT means no such file or directory [2] to the user. Users may not be kernel engineers who'd wanna peek into the code and they may simply look at the uAPI files which doesn't give them an answer as to what went wrong.
But surely when they look at the uAPI header they will then see "* ENOENT: The given memattr is not supported." and understand what went wrong.
- }
out_put_buf: dma_buf_put(dmabuf);
Apart from that, Reviewed-by: Pranjal Shrivastava praan@google.com
Thanks!
Matt
Thanks, Praan
[1] https://lore.kernel.org/all/20260602131417.41366391@shazbot.org/ [2] https://elixir.bootlin.com/linux/v7.1-rc6/source/include/uapi/asm-generic/er...
On Tue, Jun 16, 2026 at 12:37:29PM +0100, Matt Evans wrote:
Hi Praan,
On 16/06/2026 09:47, Pranjal Shrivastava wrote:
On Wed, Jun 10, 2026 at 04:43:23PM +0100, Matt Evans wrote:
A new VFIO feature, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR, is added to set CPU-facing memory type attributes for a DMABUF exported from vfio-pci. These are used for subsequent mmap()s of the buffer.
There are two attributes supported:
- The default, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC
- VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC, which results in WC PTEs for the DMABUF's BAR region.
Signed-off-by: Matt Evans matt@ozlabs.org
drivers/vfio/pci/vfio_pci_core.c | 2 ++ drivers/vfio/pci/vfio_pci_dmabuf.c | 57 +++++++++++++++++++++++++++++- drivers/vfio/pci/vfio_pci_priv.h | 14 ++++++++ include/uapi/linux/vfio.h | 27 ++++++++++++++ 4 files changed, 99 insertions(+), 1 deletion(-)
[...]
- /* Verify DMABUF: see comments in vfio_pci_dma_buf_revoke() */
- priv = dmabuf->priv;
- if (dmabuf->ops != &vfio_pci_dmabuf_ops ||
READ_ONCE(priv->vdev) != vdev) {ret = -ENODEV;goto out_put_buf;- }
- switch (db_attr.memattr) {
- case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC:
- case VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC:
WRITE_ONCE(priv->memattr, db_attr.memattr);ret = 0;break;- default:
ret = -ENOENT;Nit: Looks like the agreement [1] was on -EOPNOTSUPP / -EINVAL but we took -ENOENT here and in the doc string? Was that intentional?
I tend to agree with Alex's suggestion here, we'd prefer one of those two (-EINVAL / -EOPNOTSUPP) since it clearly communicates to the user that "You sent a wrong arg" or "We don't support this"
Yes, it was intentional. This was noted in the v3 changelog entry in the cover letter:
- Removed GET on vfio_pci_core_feature_dma_buf_memattr(), removed unnecessary taking of memory_lock, fixed error return values. In particular, removes ENOTSUPP, and uses ENOENT to indicate an unknown attribute enum value was passed to SET. In the discussion here, https://lore.kernel.org/all/20260602131417.41366391@shazbot.org/ we'd agreed on EOPNOTSUPP before I realised that's already used elsewhere. ENOENT uniquely indicates an unknown attribute.
Ahh okay. I missed the changelogs in the cover letter.
EINVAL/EOPNOTSUPP would indeed be semantically perfect, but after posting my reply there I remembered they are already overloaded with a load of different meanings.
I think uniqueness is important here so that memattr issues (for example any future arch-specific porting issues) show up as an immediately-understandable error value.
-ENOENT means no such file or directory [2] to the user. Users may not be kernel engineers who'd wanna peek into the code and they may simply look at the uAPI files which doesn't give them an answer as to what went wrong.
But surely when they look at the uAPI header they will then see "* ENOENT: The given memattr is not supported." and understand what went wrong.
Fair enough. Since its documented it clearly in the uAPI header.
Thanks, Praan
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
A new VFIO feature, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR, is added to set CPU-facing memory type attributes for a DMABUF exported from vfio-pci. These are used for subsequent mmap()s of the buffer.
There are two attributes supported:
- The default, VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_NC
- VFIO_DEVICE_FEATURE_DMA_BUF_MEMATTR_WC, which results in WC PTEs for the DMABUF's BAR region.
Signed-off-by: Matt Evans matt@ozlabs.org
Reviewed-by: Kevin Tian kevin.tian@intel.com
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
[...]
vfio/pci: Support mmap() of a VFIO DMABUF
Adds mmap() for a DMABUF fd exported from vfio-pci.
It was a goal to keep the VFIO device fd lifetime behaviour unchanged with respect to the DMABUFs. An application can close all device fds, and this will revoke/clean up all DMABUFs; no mappings or other access can be performed now. When enabling mmap() of the DMABUFs, this means access through the VMA is also revoked. This complicates the fault handler because whilst the DMABUF exists, it has no guarantee that the corresponding VFIO device is still alive. Adds synchronisation ensuring the vdev is available before vdev->memory_lock is touched; this holds the device registration so that even if the buffer has been cleaned up, vdev hasn't been freed and so the lock can be safely taken.
This commit makes VFIO_PCI_CORE depend on PCI_P2PDMA_CORE (commit
- to bring in (only) the P2PDMA provider code.
the last sentence is stale as the dependency is now added in patch4.
End
This is based on VFIO next (e.g. at b9285405c5f6).
Sashiko failed to apply this series. Is there dependent work in vfio-next?
otherwise getting a Sashiko review is helpful here.
Hi Kevin,
On 12/06/2026 09:27, Tian, Kevin wrote:
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
[...]
vfio/pci: Support mmap() of a VFIO DMABUF
Adds mmap() for a DMABUF fd exported from vfio-pci.
It was a goal to keep the VFIO device fd lifetime behaviour unchanged with respect to the DMABUFs. An application can close all device fds, and this will revoke/clean up all DMABUFs; no mappings or other access can be performed now. When enabling mmap() of the DMABUFs, this means access through the VMA is also revoked. This complicates the fault handler because whilst the DMABUF exists, it has no guarantee that the corresponding VFIO device is still alive. Adds synchronisation ensuring the vdev is available before vdev->memory_lock is touched; this holds the device registration so that even if the buffer has been cleaned up, vdev hasn't been freed and so the lock can be safely taken.
This commit makes VFIO_PCI_CORE depend on PCI_P2PDMA_CORE (commit
- to bring in (only) the P2PDMA provider code.
the last sentence is stale as the dependency is now added in patch4.
Right, will fix.
End
This is based on VFIO next (e.g. at b9285405c5f6).
Sashiko failed to apply this series. Is there dependent work in vfio-next?
otherwise getting a Sashiko review is helpful here.
It _did_ depend on (at least the context of) some fixes in vfio-next. Looks like it'll rebase on master now those are merged. I should've re-checked this for v3, oops. :|
(FWIW, I had Robot Claude Opus 4.8 to review several times up to v3. But I agree, Sashiko would be interesting too. Can it be manually triggered with branch guidance?)
Thanks,
Matt
On Fri, Jun 12, 2026 at 04:11:50PM +0100, Matt Evans wrote:
Hi Kevin,
On 12/06/2026 09:27, Tian, Kevin wrote:
From: Matt Evans matt@ozlabs.org Sent: Wednesday, June 10, 2026 11:43 PM
[...]
vfio/pci: Support mmap() of a VFIO DMABUF
Adds mmap() for a DMABUF fd exported from vfio-pci.
It was a goal to keep the VFIO device fd lifetime behaviour unchanged with respect to the DMABUFs. An application can close all device fds, and this will revoke/clean up all DMABUFs; no mappings or other access can be performed now. When enabling mmap() of the DMABUFs, this means access through the VMA is also revoked. This complicates the fault handler because whilst the DMABUF exists, it has no guarantee that the corresponding VFIO device is still alive. Adds synchronisation ensuring the vdev is available before vdev->memory_lock is touched; this holds the device registration so that even if the buffer has been cleaned up, vdev hasn't been freed and so the lock can be safely taken.
This commit makes VFIO_PCI_CORE depend on PCI_P2PDMA_CORE (commit
- to bring in (only) the P2PDMA provider code.
the last sentence is stale as the dependency is now added in patch4.
Right, will fix.
End
This is based on VFIO next (e.g. at b9285405c5f6).
Sashiko failed to apply this series. Is there dependent work in vfio-next?
otherwise getting a Sashiko review is helpful here.
It _did_ depend on (at least the context of) some fixes in vfio-next. Looks like it'll rebase on master now those are merged. I should've re-checked this for v3, oops. :|
(FWIW, I had Robot Claude Opus 4.8 to review several times up to v3. But I agree, Sashiko would be interesting too. Can it be manually triggered with branch guidance?)
I guess relevant steps to run locally are here: https://github.com/sashiko-dev/sashiko/blob/main/README.md
Additionally, we can try providing a base-commit (which points to a public commit).
Thanks, Praan
linaro-mm-sig@lists.linaro.org