Changelog:
v5:
 * Rebased on top of v6.18-rc1.
 * Added more validation logic to make sure that DMA-BUF length doesn't
   overflow in various scenarios.
 * Hid the kernel config option from users.
 * Fixed a type conversion issue: DMA ranges are exposed with a u64
   length, but DMA-BUF uses "unsigned int" as the length of SG entries.
 * Added a check so VFIO drivers that report a BAR size different from
   PCI do not use the DMA-BUF functionality.
v4: https://lore.kernel.org/all/cover.1759070796.git.leon@kernel.org
 * Split pcim_p2pdma_provider() into two functions, one that initializes
   the array of providers and another that returns the right provider
   pointer.
v3: https://lore.kernel.org/all/cover.1758804980.git.leon@kernel.org
 * Changed pcim_p2pdma_enable() to be pcim_p2pdma_provider().
 * Cache the provider in the vfio_pci_dma_buf struct instead of the BAR
   index.
 * Removed a misleading comment from pcim_p2pdma_provider().
 * Moved the MMIO check into pcim_p2pdma_provider().
v2: https://lore.kernel.org/all/cover.1757589589.git.leon@kernel.org/
 * Added an extra patch which adds a new CONFIG, so the next patches can
   reuse it.
 * Squashed "PCI/P2PDMA: Remove redundant bus_offset from map state"
   into the other patch.
 * Fixed revoke calls to be aligned with true->false semantics.
 * Extended p2pdma_providers to be per-BAR and not global to the whole
   device.
 * Fixed a possible race between dmabuf states and revoke.
 * Moved revoke to the PCI BAR zap block.
v1: https://lore.kernel.org/all/cover.1754311439.git.leon@kernel.org
 * Changed commit messages.
 * Reused the DMA_ATTR_MMIO attribute.
 * Returned support for multiple DMA ranges per-DMABUF.
v0: https://lore.kernel.org/all/cover.1753274085.git.leonro@nvidia.com
---------------------------------------------------------------------------
Based on "[PATCH v6 00/16] dma-mapping: migrate to physical address-based API"
https://lore.kernel.org/all/cover.1757423202.git.leonro@nvidia.com/ series.
---------------------------------------------------------------------------
This series extends the VFIO PCI subsystem to support exporting MMIO regions from PCI device BARs as dma-buf objects, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations.
The series supports a use case for SPDK where an NVMe device will be owned by SPDK through VFIO while interacting with an RDMA device. The RDMA device may directly access the NVMe CMB or directly manipulate the NVMe device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with VFIO. This dmabuf approach is also usable by iommufd for generic and safe P2P mappings.
In addition to the SPDK use-case mentioned above, the capability added in this patch series can also be useful when a buffer (located in device memory such as VRAM) needs to be shared between any two dGPU devices or instances (assuming one of them is bound to VFIO PCI) as long as they are P2P DMA compatible.
The implementation provides a revocable attachment mechanism using dma-buf move operations. MMIO regions are normally pinned as BARs don't change physical addresses, but access is revoked when the VFIO device is closed or a PCI reset is issued. This ensures kernel self-defense against potentially hostile userspace.
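For context, requesting such a dma-buf from userspace is expected to look roughly like the sketch below. This is illustrative only: the structure layout follows the field names used by the validation code in the last patch but may differ from the final uapi header, the FD is assumed to come back as the ioctl return value, and the BAR index and length are arbitrary example values.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int get_bar_dmabuf(int device_fd)
{
	/* argsz covers the feature header, the dma_buf struct and one range. */
	__u8 buf[sizeof(struct vfio_device_feature) +
		 sizeof(struct vfio_device_feature_dma_buf) +
		 sizeof(struct vfio_region_dma_range)]
		 __attribute__((aligned(8))) = {};
	struct vfio_device_feature *feat = (void *)buf;
	struct vfio_device_feature_dma_buf *get = (void *)feat->data;

	feat->argsz = sizeof(buf);
	feat->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF;
	get->region_index = VFIO_PCI_BAR2_REGION_INDEX;	/* e.g. an NVMe CMB BAR */
	get->open_flags = O_CLOEXEC | O_RDWR;
	get->nr_ranges = 1;
	get->dma_ranges[0].offset = 0;		/* page-aligned slice of the BAR */
	get->dma_ranges[0].length = 0x10000;

	/* The returned FD can be handed to an importer, e.g. ibv_reg_dmabuf_mr(). */
	return ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);
}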
The series includes significant refactoring of the PCI P2PDMA subsystem to separate core P2P functionality from memory allocation features, making it more modular and suitable for VFIO use cases that don't need struct page support.
-----------------------------------------------------------------------
The series is originally based on
https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.co...
but heavily rewritten to build on the physical address-based DMA API.
-----------------------------------------------------------------------
The WIP branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=d...
Thanks
Leon Romanovsky (7):
  PCI/P2PDMA: Separate the mmap() support from the core logic
  PCI/P2PDMA: Simplify bus address mapping API
  PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
  PCI/P2PDMA: Export pci_p2pdma_map_type() function
  types: move phys_vec definition to common header
  vfio/pci: Enable peer-to-peer DMA transactions by default
  vfio/pci: Add dma-buf export support for MMIO regions
Vivek Kasireddy (2):
  vfio: Export vfio device get and put registration helpers
  vfio/pci: Share the core device pointer while invoking feature functions
 block/blk-mq-dma.c                 |   7 +-
 drivers/iommu/dma-iommu.c          |   4 +-
 drivers/pci/p2pdma.c               | 175 ++++++++---
 drivers/vfio/pci/Kconfig           |   3 +
 drivers/vfio/pci/Makefile          |   2 +
 drivers/vfio/pci/vfio_pci_config.c |  22 +-
 drivers/vfio/pci/vfio_pci_core.c   |  63 ++--
 drivers/vfio/pci/vfio_pci_dmabuf.c | 446 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_priv.h   |  23 ++
 drivers/vfio/vfio_main.c           |   2 +
 include/linux/pci-p2pdma.h         | 120 +++++---
 include/linux/types.h              |   5 +
 include/linux/vfio.h               |   2 +
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  25 ++
 kernel/dma/direct.c                |   4 +-
 mm/hmm.c                           |   2 +-
 17 files changed, 785 insertions(+), 121 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
From: Leon Romanovsky leonro@nvidia.com
Currently the P2PDMA code requires a pgmap and a struct page to function. This was serving three important purposes:
- DMA API compatibility, where scatterlist required a struct page as input
- Life cycle management, the percpu_ref is used to prevent UAF during device hot unplug
- A way to get the P2P provider data through the pci_p2pdma_pagemap
The DMA API now has a new flow, and has gained phys_addr_t support, so it no longer needs struct pages to perform P2P mapping.
Lifecycle management can be delegated to the user, DMABUF for instance has a suitable invalidation protocol that does not require struct page.
Finding the P2P provider data can also be managed by the caller without need to look it up from the phys_addr.
Split the P2PDMA code into two layers. The optional upper layer, effectively, provides a way to mmap() P2P memory into a VMA by providing struct page, pgmap, a genalloc and sysfs.
The lower layer provides the actual P2P infrastructure and is wrapped up in a new struct p2pdma_provider. Rework the mmap layer to use new p2pdma_provider based APIs.
Drivers that do not want to put P2P memory into VMAs can allocate a struct p2pdma_provider after probe() starts and free it before remove() completes. When DMA mapping the driver must convey the struct p2pdma_provider to the DMA mapping code along with a phys_addr of the MMIO BAR slice to map. The driver must ensure that no DMA mapping outlives the lifetime of the struct p2pdma_provider.
The intended target of this new API layer is DMABUF. There is usually only a single p2pdma_provider for a DMABUF exporter. Most drivers can establish the p2pdma_provider during probe, access the single instance during DMABUF attach and use that to drive the DMA mapping.
DMABUF provides an invalidation mechanism that can guarantee all DMA is halted and the DMA mappings are undone prior to destroying the struct p2pdma_provider. This ensures there is no UAF through DMABUFs that are lingering past driver removal.
The new p2pdma_provider layer cannot be used to create P2P memory that can be mapped into VMAs, be used with pin_user_pages(), O_DIRECT, and so on. These use cases must still use the mmap() layer. The p2pdma_provider layer is principally for DMABUF-like use cases where DMABUF natively manages the life cycle and access instead of vmas/pin_user_pages()/struct page.
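As a rough illustration of that contract (not taken from this patch; the pcim_* convenience helpers only appear later in the series, and the my_p2p_* names are hypothetical), a pgmap-less driver would do something like:

#include <linux/pci.h>
#include <linux/pci-p2pdma.h>

/* Hypothetical driver state; only struct p2pdma_provider comes from this patch. */
struct my_p2p_dev {
	struct p2pdma_provider provider;
};

static int my_p2p_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	struct my_p2p_dev *md;

	md = devm_kzalloc(&pdev->dev, sizeof(*md), GFP_KERNEL);
	if (!md)
		return -ENOMEM;

	/* Valid from here until remove() completes. */
	md->provider.owner = &pdev->dev;
	md->provider.bus_offset = pci_bus_address(pdev, 0) -
				  pci_resource_start(pdev, 0);
	pci_set_drvdata(pdev, md);
	return 0;
}

static void my_p2p_remove(struct pci_dev *pdev)
{
	/*
	 * Every DMA mapping built from md->provider must already be undone
	 * (e.g. via a DMABUF invalidation) before this returns.
	 */
}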
In addition, remove the bus_off field from pci_p2pdma_map_state since it duplicates information already available in the pgmap structure. The bus_offset is only used in one location (pci_p2pdma_bus_addr_map) and is always identical to pgmap->bus_offset.
Signed-off-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Leon Romanovsky leonro@nvidia.com --- drivers/pci/p2pdma.c | 43 ++++++++++++++++++++------------------ include/linux/pci-p2pdma.h | 19 ++++++++++++----- 2 files changed, 37 insertions(+), 25 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 78e108e47254..59cd6fb40e83 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -28,9 +28,8 @@ struct pci_p2pdma { };
struct pci_p2pdma_pagemap { - struct pci_dev *provider; - u64 bus_offset; struct dev_pagemap pgmap; + struct p2pdma_provider mem; };
static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap) @@ -204,8 +203,8 @@ static void p2pdma_page_free(struct page *page) { struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page)); /* safe to dereference while a reference is held to the percpu ref */ - struct pci_p2pdma *p2pdma = - rcu_dereference_protected(pgmap->provider->p2pdma, 1); + struct pci_p2pdma *p2pdma = rcu_dereference_protected( + to_pci_dev(pgmap->mem.owner)->p2pdma, 1); struct percpu_ref *ref;
gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page), @@ -270,14 +269,15 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
static void pci_p2pdma_unmap_mappings(void *data) { - struct pci_dev *pdev = data; + struct pci_p2pdma_pagemap *p2p_pgmap = data;
/* * Removing the alloc attribute from sysfs will call * unmap_mapping_range() on the inode, teardown any existing userspace * mappings and prevent new ones from being created. */ - sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr, + sysfs_remove_file_from_group(&p2p_pgmap->mem.owner->kobj, + &p2pmem_alloc_attr.attr, p2pmem_group.name); }
@@ -328,10 +328,9 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, pgmap->nr_range = 1; pgmap->type = MEMORY_DEVICE_PCI_P2PDMA; pgmap->ops = &p2pdma_pgmap_ops; - - p2p_pgmap->provider = pdev; - p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) - - pci_resource_start(pdev, bar); + p2p_pgmap->mem.owner = &pdev->dev; + p2p_pgmap->mem.bus_offset = + pci_bus_address(pdev, bar) - pci_resource_start(pdev, bar);
addr = devm_memremap_pages(&pdev->dev, pgmap); if (IS_ERR(addr)) { @@ -340,7 +339,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, }
error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings, - pdev); + p2p_pgmap); if (error) goto pages_free;
@@ -972,16 +971,16 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish) } EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
-static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, - struct device *dev) +static enum pci_p2pdma_map_type +pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev) { enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED; - struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider; + struct pci_dev *pdev = to_pci_dev(provider->owner); struct pci_dev *client; struct pci_p2pdma *p2pdma; int dist;
- if (!provider->p2pdma) + if (!pdev->p2pdma) return PCI_P2PDMA_MAP_NOT_SUPPORTED;
if (!dev_is_pci(dev)) @@ -990,7 +989,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, client = to_pci_dev(dev);
rcu_read_lock(); - p2pdma = rcu_dereference(provider->p2pdma); + p2pdma = rcu_dereference(pdev->p2pdma);
if (p2pdma) type = xa_to_value(xa_load(&p2pdma->map_types, @@ -998,7 +997,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, rcu_read_unlock();
if (type == PCI_P2PDMA_MAP_UNKNOWN) - return calc_map_type_and_dist(provider, client, &dist, true); + return calc_map_type_and_dist(pdev, client, &dist, true);
return type; } @@ -1006,9 +1005,13 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap, void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state, struct device *dev, struct page *page) { - state->pgmap = page_pgmap(page); - state->map = pci_p2pdma_map_type(state->pgmap, dev); - state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset; + struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page)); + + if (state->mem == &p2p_pgmap->mem) + return; + + state->mem = &p2p_pgmap->mem; + state->map = pci_p2pdma_map_type(&p2p_pgmap->mem, dev); }
/** diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index 951f81a38f3a..1400f3ad4299 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -16,6 +16,16 @@ struct block_device; struct scatterlist;
+/** + * struct p2pdma_provider + * + * A p2pdma provider is a range of MMIO address space available to the CPU. + */ +struct p2pdma_provider { + struct device *owner; + u64 bus_offset; +}; + #ifdef CONFIG_PCI_P2PDMA int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset); @@ -139,11 +149,11 @@ enum pci_p2pdma_map_type { };
struct pci_p2pdma_map_state { - struct dev_pagemap *pgmap; + struct p2pdma_provider *mem; enum pci_p2pdma_map_type map; - u64 bus_off; };
+ /* helper for pci_p2pdma_state(), do not use directly */ void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state, struct device *dev, struct page *page); @@ -162,8 +172,7 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev, struct page *page) { if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) { - if (state->pgmap != page_pgmap(page)) - __pci_p2pdma_update_state(state, dev, page); + __pci_p2pdma_update_state(state, dev, page); return state->map; } return PCI_P2PDMA_MAP_NONE; @@ -181,7 +190,7 @@ static inline dma_addr_t pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr) { WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR); - return paddr + state->bus_off; + return paddr + state->mem->bus_offset; }
#endif /* _LINUX_PCI_P2P_H */
On Mon, Oct 13, 2025 at 06:26:03PM +0300, Leon Romanovsky wrote:
The DMA API now has a new flow, and has gained phys_addr_t support, so it no longer needs struct pages to perform P2P mapping.
That's news to me. All the pci_p2pdma_map_state machinery is still based on pgmaps and thus pages.
Lifecycle management can be delegated to the user, DMABUF for instance has a suitable invalidation protocol that does not require struct page.
How?
On Thu, Oct 16, 2025 at 11:30:06PM -0700, Christoph Hellwig wrote:
On Mon, Oct 13, 2025 at 06:26:03PM +0300, Leon Romanovsky wrote:
The DMA API now has a new flow, and has gained phys_addr_t support, so it no longer needs struct pages to perform P2P mapping.
That's news to me. All the pci_p2pdma_map_state machinery is still based on pgmaps and thus pages.
We had this discussion already three months ago:
https://lore.kernel.org/all/20250729131502.GJ36037@nvidia.com/
These couple patches make the core pci_p2pdma_map_state machinery work on struct p2pdma_provider, and pgmap is just one way to get a p2pdma_provider *
The struct page paths through pgmap go page->pgmap->mem to get p2pdma_provider.
The non-struct page paths just have a p2pdma_provider * without a pgmap. In this series VFIO uses
+ *provider = pcim_p2pdma_provider(pdev, bar);
To get the provider for a specific BAR.
Lifecycle management can be delegated to the user, DMABUF for instance has a suitable invalidation protocol that does not require struct page.
How?
I think I've answered this three times now - for DMABUF the DMABUF invalidation scheme is used to control the lifetime and no DMA mapping outlives the provider, and the provider doesn't outlive the driver.
Hotplug works fine. VFIO gets the driver removal callback, invalidates all the DMABUFs, refuses to re-validate them, destroys the P2P provider, and completes its driver remove. There is no lifetime issue.
Obviously you cannot use the new p2pdma_provider mechanism without some kind of protection against use after hot unplug, but it doesn't have to be struct page based.
For VFIO the invalidation scheme is linked to dma_buf_move_notify(), for instance the hotunplug case goes:
static const struct vfio_device_ops vfio_pci_ops = { .close_device = vfio_pci_core_close_device,
vfio_pci_dma_buf_cleanup(vdev);
dma_buf_move_notify(priv->dmabuf);
And then if we follow that into an importer like RDMA:
static struct dma_buf_attach_ops mlx5_ib_dmabuf_attach_ops = { .move_notify = mlx5_ib_dmabuf_invalidate_cb,
mlx5r_umr_update_mr_pas(mr, MLX5_IB_UPD_XLT_ZAP); ib_umem_dmabuf_unmap_pages(umem_dmabuf); dma_buf_unmap_attachment(umem_dmabuf->attach, umem_dmabuf->sgt, DMA_BIDIRECTIONAL); vfio_pci_dma_buf_unmap()
XLT_ZAP tells the HW to stop doing DMA and the unmap_pages -> unmap_attachment -> vfio_pci_dma_buf_unmap() flow will tear down the DMA API mapping and remove it from the IOMMU. All of this happens before device_driver remove completes.
There is no lifecycle issue here and we don't need pgmap to solve a lifecycle problem or to help find the p2pdma_provider.
Jason
From: Leon Romanovsky leonro@nvidia.com
Update the pci_p2pdma_bus_addr_map() function to take a direct pointer to the p2pdma_provider structure instead of the pci_p2pdma_map_state. This simplifies the API by removing the need for callers to extract the provider from the state structure.
The change updates all callers across the kernel (block layer, IOMMU, DMA direct, and HMM) to pass the provider pointer directly, making the code more explicit and reducing unnecessary indirection. This also removes the runtime warning check since callers now have direct control over which provider they use.
Signed-off-by: Leon Romanovsky leonro@nvidia.com --- block/blk-mq-dma.c | 2 +- drivers/iommu/dma-iommu.c | 4 ++-- include/linux/pci-p2pdma.h | 7 +++---- kernel/dma/direct.c | 4 ++-- mm/hmm.c | 2 +- 5 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c index 9495b78b6fd3..badef1d925b2 100644 --- a/block/blk-mq-dma.c +++ b/block/blk-mq-dma.c @@ -85,7 +85,7 @@ static inline bool blk_can_dma_map_iova(struct request *req,
static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec) { - iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr); + iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr); iter->len = vec->len; return true; } diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 7944a3af4545..e52d19d2e833 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1439,8 +1439,8 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents, * as a bus address, __finalise_sg() will copy the dma * address into the output segment. */ - s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state, - sg_phys(s)); + s->dma_address = pci_p2pdma_bus_addr_map( + p2pdma_state.mem, sg_phys(s)); sg_dma_len(s) = sg->length; sg_dma_mark_bus_address(s); continue; diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index 1400f3ad4299..9516ef97b17a 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -181,16 +181,15 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev, /** * pci_p2pdma_bus_addr_map - Translate a physical address to a bus address * for a PCI_P2PDMA_MAP_BUS_ADDR transfer. - * @state: P2P state structure + * @provider: P2P provider structure * @paddr: physical address to map * * Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer. */ static inline dma_addr_t -pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr) +pci_p2pdma_bus_addr_map(struct p2pdma_provider *provider, phys_addr_t paddr) { - WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR); - return paddr + state->mem->bus_offset; + return paddr + provider->bus_offset; }
#endif /* _LINUX_PCI_P2P_H */ diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 1f9ee9759426..d8b3dfc598b2 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -479,8 +479,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents, } break; case PCI_P2PDMA_MAP_BUS_ADDR: - sg->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state, - sg_phys(sg)); + sg->dma_address = pci_p2pdma_bus_addr_map( + p2pdma_state.mem, sg_phys(sg)); sg_dma_mark_bus_address(sg); continue; default: diff --git a/mm/hmm.c b/mm/hmm.c index 87562914670a..9bf0b831a029 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -811,7 +811,7 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map, break; case PCI_P2PDMA_MAP_BUS_ADDR: pfns[idx] |= HMM_PFN_P2PDMA_BUS | HMM_PFN_DMA_MAPPED; - return pci_p2pdma_bus_addr_map(p2pdma_state, paddr); + return pci_p2pdma_bus_addr_map(p2pdma_state->mem, paddr); default: return DMA_MAPPING_ERROR; }
From: Leon Romanovsky leonro@nvidia.com
Refactor the PCI P2PDMA subsystem to separate the core peer-to-peer DMA functionality from the optional memory allocation layer. This creates a two-tier architecture:
The core layer provides P2P mapping functionality for physical addresses based on PCI device MMIO BARs and integrates with the DMA API for mapping operations. This layer is required for all P2PDMA users.
The optional upper layer provides memory allocation capabilities including gen_pool allocator, struct page support, and sysfs interface for user space access.
This separation allows subsystems like VFIO to use only the core P2P mapping functionality without the overhead of memory allocation features they don't need. The core functionality is now available through the new pcim_p2pdma_provider() function that returns a p2pdma_provider structure.
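A minimal sketch of the expected call pattern for a core-layer-only consumer, using just the two new entry points (the surrounding my_get_provider() function is illustrative; the same sequence is what the VFIO patches later in the series perform):

static struct p2pdma_provider *my_get_provider(struct pci_dev *pdev, int bar)
{
	struct p2pdma_provider *provider;
	int ret;

	/* Idempotent and devres-managed; no gen_pool/sysfs/struct page setup. */
	ret = pcim_p2pdma_init(pdev);
	if (ret)
		return ERR_PTR(ret);

	/* Returns NULL for BARs that are not MMIO (e.g. I/O port BARs). */
	provider = pcim_p2pdma_provider(pdev, bar);
	if (!provider)
		return ERR_PTR(-EINVAL);

	return provider;
}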
Signed-off-by: Leon Romanovsky leonro@nvidia.com --- drivers/pci/p2pdma.c | 139 ++++++++++++++++++++++++++++--------- include/linux/pci-p2pdma.h | 11 +++ 2 files changed, 119 insertions(+), 31 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index 59cd6fb40e83..a2ec7e93fd71 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -25,11 +25,12 @@ struct pci_p2pdma { struct gen_pool *pool; bool p2pmem_published; struct xarray map_types; + struct p2pdma_provider mem[PCI_STD_NUM_BARS]; };
struct pci_p2pdma_pagemap { struct dev_pagemap pgmap; - struct p2pdma_provider mem; + struct p2pdma_provider *mem; };
static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap) @@ -204,7 +205,7 @@ static void p2pdma_page_free(struct page *page) struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page)); /* safe to dereference while a reference is held to the percpu ref */ struct pci_p2pdma *p2pdma = rcu_dereference_protected( - to_pci_dev(pgmap->mem.owner)->p2pdma, 1); + to_pci_dev(pgmap->mem->owner)->p2pdma, 1); struct percpu_ref *ref;
gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page), @@ -227,44 +228,111 @@ static void pci_p2pdma_release(void *data)
/* Flush and disable pci_alloc_p2p_mem() */ pdev->p2pdma = NULL; - synchronize_rcu(); + if (p2pdma->pool) + synchronize_rcu(); + xa_destroy(&p2pdma->map_types); + + if (!p2pdma->pool) + return;
gen_pool_destroy(p2pdma->pool); sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group); - xa_destroy(&p2pdma->map_types); }
-static int pci_p2pdma_setup(struct pci_dev *pdev) +/** + * pcim_p2pdma_init - Initialise peer-to-peer DMA providers + * @pdev: The PCI device to enable P2PDMA for + * + * This function initializes the peer-to-peer DMA infrastructure + * for a PCI device. It allocates and sets up the necessary data + * structures to support P2PDMA operations, including mapping type + * tracking. + */ +int pcim_p2pdma_init(struct pci_dev *pdev) { - int error = -ENOMEM; struct pci_p2pdma *p2p; + int i, ret; + + p2p = rcu_dereference_protected(pdev->p2pdma, 1); + if (p2p) + return 0;
p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL); if (!p2p) return -ENOMEM;
xa_init(&p2p->map_types); + /* + * Iterate over all standard PCI BARs and record only those that + * correspond to MMIO regions. Skip non-memory resources (e.g. I/O + * port BARs) since they cannot be used for peer-to-peer (P2P) + * transactions. + */ + for (i = 0; i < PCI_STD_NUM_BARS; i++) { + if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM)) + continue;
- p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev)); - if (!p2p->pool) - goto out; + p2p->mem[i].owner = &pdev->dev; + p2p->mem[i].bus_offset = + pci_bus_address(pdev, i) - pci_resource_start(pdev, i); + }
- error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev); - if (error) - goto out_pool_destroy; + ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev); + if (ret) + goto out_p2p;
- error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group); - if (error) + rcu_assign_pointer(pdev->p2pdma, p2p); + return 0; + +out_p2p: + devm_kfree(&pdev->dev, p2p); + return ret; +} +EXPORT_SYMBOL_GPL(pcim_p2pdma_init); + +/** + * pcim_p2pdma_provider - Get peer-to-peer DMA provider + * @pdev: The PCI device to enable P2PDMA for + * @bar: BAR index to get provider + * + * This function gets peer-to-peer DMA provider for a PCI device. + */ +struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar) +{ + struct pci_p2pdma *p2p; + + if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM)) + return NULL; + + p2p = rcu_dereference_protected(pdev->p2pdma, 1); + return &p2p->mem[bar]; +} +EXPORT_SYMBOL_GPL(pcim_p2pdma_provider); + +static int pci_p2pdma_setup_pool(struct pci_dev *pdev) +{ + struct pci_p2pdma *p2pdma; + int ret; + + p2pdma = rcu_dereference_protected(pdev->p2pdma, 1); + if (p2pdma->pool) + /* We already setup pools, do nothing, */ + return 0; + + p2pdma->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev)); + if (!p2pdma->pool) + return -ENOMEM; + + ret = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group); + if (ret) goto out_pool_destroy;
- rcu_assign_pointer(pdev->p2pdma, p2p); return 0;
out_pool_destroy: - gen_pool_destroy(p2p->pool); -out: - devm_kfree(&pdev->dev, p2p); - return error; + gen_pool_destroy(p2pdma->pool); + p2pdma->pool = NULL; + return ret; }
static void pci_p2pdma_unmap_mappings(void *data) @@ -276,7 +344,7 @@ static void pci_p2pdma_unmap_mappings(void *data) * unmap_mapping_range() on the inode, teardown any existing userspace * mappings and prevent new ones from being created. */ - sysfs_remove_file_from_group(&p2p_pgmap->mem.owner->kobj, + sysfs_remove_file_from_group(&p2p_pgmap->mem->owner->kobj, &p2pmem_alloc_attr.attr, p2pmem_group.name); } @@ -295,6 +363,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset) { struct pci_p2pdma_pagemap *p2p_pgmap; + struct p2pdma_provider *mem; struct dev_pagemap *pgmap; struct pci_p2pdma *p2pdma; void *addr; @@ -312,11 +381,21 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, if (size + offset > pci_resource_len(pdev, bar)) return -EINVAL;
- if (!pdev->p2pdma) { - error = pci_p2pdma_setup(pdev); - if (error) - return error; - } + error = pcim_p2pdma_init(pdev); + if (error) + return error; + + error = pci_p2pdma_setup_pool(pdev); + if (error) + return error; + + mem = pcim_p2pdma_provider(pdev, bar); + /* + * We checked validity of BAR prior to call + * to pcim_p2pdma_provider. It should never return NULL. + */ + if (WARN_ON(!mem)) + return -EINVAL;
p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL); if (!p2p_pgmap) @@ -328,9 +407,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, pgmap->nr_range = 1; pgmap->type = MEMORY_DEVICE_PCI_P2PDMA; pgmap->ops = &p2pdma_pgmap_ops; - p2p_pgmap->mem.owner = &pdev->dev; - p2p_pgmap->mem.bus_offset = - pci_bus_address(pdev, bar) - pci_resource_start(pdev, bar); + p2p_pgmap->mem = mem;
addr = devm_memremap_pages(&pdev->dev, pgmap); if (IS_ERR(addr)) { @@ -1007,11 +1084,11 @@ void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state, { struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page));
- if (state->mem == &p2p_pgmap->mem) + if (state->mem == p2p_pgmap->mem) return;
- state->mem = &p2p_pgmap->mem; - state->map = pci_p2pdma_map_type(&p2p_pgmap->mem, dev); + state->mem = p2p_pgmap->mem; + state->map = pci_p2pdma_map_type(p2p_pgmap->mem, dev); }
/** diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index 9516ef97b17a..e307c9380d46 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -27,6 +27,8 @@ struct p2pdma_provider { };
#ifdef CONFIG_PCI_P2PDMA +int pcim_p2pdma_init(struct pci_dev *pdev); +struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar); int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset); int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients, @@ -44,6 +46,15 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev, ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev, bool use_p2pdma); #else /* CONFIG_PCI_P2PDMA */ +static inline int pcim_p2pdma_init(struct pci_dev *pdev) +{ + return -EOPNOTSUPP; +} +static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, + int bar) +{ + return ERR_PTR(-EOPNOTSUPP); +} static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size, u64 offset) {
From: Leon Romanovsky leonro@nvidia.com
Export the pci_p2pdma_map_type() function to allow external modules and subsystems to determine the appropriate mapping type for P2PDMA transfers between a provider and target device.
The function determines whether peer-to-peer DMA transfers can be done directly through PCI switches (PCI_P2PDMA_MAP_BUS_ADDR) or must go through the host bridge (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE), or if the transfer is not supported at all.
This export enables subsystems like VFIO to properly handle P2PDMA operations by querying the mapping type before attempting transfers, ensuring correct DMA address programming and error handling.
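For reference, the intended use on the importer-facing side looks roughly like the sketch below; the helper name is hypothetical, while the map types, pci_p2pdma_bus_addr_map() and dma_map_phys()/DMA_ATTR_MMIO come from this series and the physical address-based DMA API it builds on:

static dma_addr_t my_map_one(struct p2pdma_provider *provider,
			     struct device *dma_dev, phys_addr_t paddr,
			     size_t len)
{
	switch (pci_p2pdma_map_type(provider, dma_dev)) {
	case PCI_P2PDMA_MAP_BUS_ADDR:
		/* Peer traffic stays below the host bridge: use bus addresses. */
		return pci_p2pdma_bus_addr_map(provider, paddr);
	case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
		/* Allowlisted host bridge: map through the normal DMA API. */
		return dma_map_phys(dma_dev, paddr, len, DMA_BIDIRECTIONAL,
				    DMA_ATTR_MMIO);
	default:
		/* e.g. PCI_P2PDMA_MAP_NOT_SUPPORTED */
		return DMA_MAPPING_ERROR;
	}
}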
Signed-off-by: Leon Romanovsky leonro@nvidia.com --- drivers/pci/p2pdma.c | 15 ++++++- include/linux/pci-p2pdma.h | 85 +++++++++++++++++++++----------------- 2 files changed, 59 insertions(+), 41 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c index a2ec7e93fd71..bdbbc49f46ee 100644 --- a/drivers/pci/p2pdma.c +++ b/drivers/pci/p2pdma.c @@ -1048,8 +1048,18 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish) } EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
-static enum pci_p2pdma_map_type -pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev) +/** + * pci_p2pdma_map_type - Determine the mapping type for P2PDMA transfers + * @provider: P2PDMA provider structure + * @dev: Target device for the transfer + * + * Determines how peer-to-peer DMA transfers should be mapped between + * the provider and the target device. The mapping type indicates whether + * the transfer can be done directly through PCI switches or must go + * through the host bridge. + */ +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider, + struct device *dev) { enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED; struct pci_dev *pdev = to_pci_dev(provider->owner); @@ -1078,6 +1088,7 @@ pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
return type; } +EXPORT_SYMBOL_GPL(pci_p2pdma_map_type);
void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state, struct device *dev, struct page *page) diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h index e307c9380d46..1e499a8e0099 100644 --- a/include/linux/pci-p2pdma.h +++ b/include/linux/pci-p2pdma.h @@ -26,6 +26,45 @@ struct p2pdma_provider { u64 bus_offset; };
+enum pci_p2pdma_map_type { + /* + * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before + * the mapping type has been calculated. Exported routines for the API + * will never return this value. + */ + PCI_P2PDMA_MAP_UNKNOWN = 0, + + /* + * Not a PCI P2PDMA transfer. + */ + PCI_P2PDMA_MAP_NONE, + + /* + * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will + * traverse the host bridge and the host bridge is not in the + * allowlist. DMA Mapping routines should return an error when + * this is returned. + */ + PCI_P2PDMA_MAP_NOT_SUPPORTED, + + /* + * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to + * each other directly through a PCI switch and the transaction will + * not traverse the host bridge. Such a mapping should program + * the DMA engine with PCI bus addresses. + */ + PCI_P2PDMA_MAP_BUS_ADDR, + + /* + * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk + * to each other, but the transaction traverses a host bridge on the + * allowlist. In this case, a normal mapping either with CPU physical + * addresses (in the case of dma-direct) or IOVA addresses (in the + * case of IOMMUs) should be used to program the DMA engine. + */ + PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, +}; + #ifdef CONFIG_PCI_P2PDMA int pcim_p2pdma_init(struct pci_dev *pdev); struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar); @@ -45,6 +84,8 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev, bool *use_p2pdma); ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev, bool use_p2pdma); +enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider, + struct device *dev); #else /* CONFIG_PCI_P2PDMA */ static inline int pcim_p2pdma_init(struct pci_dev *pdev) { @@ -106,6 +147,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page, { return sprintf(page, "none\n"); } +static inline enum pci_p2pdma_map_type +pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev) +{ + return PCI_P2PDMA_MAP_NOT_SUPPORTED; +} #endif /* CONFIG_PCI_P2PDMA */
@@ -120,45 +166,6 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client) return pci_p2pmem_find_many(&client, 1); }
-enum pci_p2pdma_map_type { - /* - * PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before - * the mapping type has been calculated. Exported routines for the API - * will never return this value. - */ - PCI_P2PDMA_MAP_UNKNOWN = 0, - - /* - * Not a PCI P2PDMA transfer. - */ - PCI_P2PDMA_MAP_NONE, - - /* - * PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will - * traverse the host bridge and the host bridge is not in the - * allowlist. DMA Mapping routines should return an error when - * this is returned. - */ - PCI_P2PDMA_MAP_NOT_SUPPORTED, - - /* - * PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to - * each other directly through a PCI switch and the transaction will - * not traverse the host bridge. Such a mapping should program - * the DMA engine with PCI bus addresses. - */ - PCI_P2PDMA_MAP_BUS_ADDR, - - /* - * PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk - * to each other, but the transaction traverses a host bridge on the - * allowlist. In this case, a normal mapping either with CPU physical - * addresses (in the case of dma-direct) or IOVA addresses (in the - * case of IOMMUs) should be used to program the DMA engine. - */ - PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, -}; - struct pci_p2pdma_map_state { struct p2pdma_provider *mem; enum pci_p2pdma_map_type map;
Nacked-by: Christoph Hellwig hch@lst.de
As explained to you multiple times, pci_p2pdma_map_type is a low-level helper that absolutely MUST be wrapped in proper accessors. It is dangerous when used incorrectly and requires too much boilerplate. There is no way this can be directly exported, and you really need to stop resending this.
On Thu, Oct 16, 2025 at 11:31:53PM -0700, Christoph Hellwig wrote:
Nacked-by: Christoph Hellwig hch@lst.de
As explained to you multiple times, pci_p2pdma_map_type is a low-level helper that absolutely MUST be wrapped in proper accessors.
You never responded to the discussion:
https://lore.kernel.org/all/20250727190252.GF7551@nvidia.com/
What is the plan here? Is the new DMA API unusable by modules? That seems a little challenging.
It is dangerous when used incorrectly and requires too much boilerplate. There is no way this can be directly exported, and you really need to stop resending this.
Yeah, I don't like the boilerplate at all either.
It looks like there is a simple enough solution here. I wanted to tackle this after, but maybe it is small enough to do it now.
dmabuf should gain some helpers like BIO has to manage its map/unmap flows, so let's put a start of some helpers in drivers/dma/dma-mapping.c (or whatever). dmabuf is built-in so it can call the function without exporting it, just like block and hmm are doing.
The same code as in this vfio patch will get moved into the helper and vfio will call it under its dmabuf map/unmap ops.
The goal would be to make it much easier for other dmabuf exporters to switch from dma_map_resource() to this new dmabuf api which is safe for P2P.
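Purely to make that concrete, such a helper could take a shape like the following; the name and signature are hypothetical and not part of this series:

/* Hypothetical helper living in the DMA/dmabuf core, sketching the proposal only. */
struct sg_table *dma_buf_map_mmio_ranges(struct dma_buf_attachment *attach,
					 struct p2pdma_provider *provider,
					 const struct phys_vec *ranges,
					 u32 nr_ranges,
					 enum dma_data_direction dir);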
Jason
From: Leon Romanovsky leonro@nvidia.com
Move the struct phys_vec definition from block/blk-mq-dma.c to include/linux/types.h to make it available for use across the kernel.
The phys_vec structure represents a physical address range with a length, which is used by the new physical address-based DMA mapping API. This structure is already used by the block layer and will be needed by upcoming VFIO patches for dma-buf operations.
Moving this definition to types.h provides a centralized location for this common data structure and eliminates code duplication across subsystems that need to work with physical address ranges.
Signed-off-by: Leon Romanovsky leonro@nvidia.com --- block/blk-mq-dma.c | 5 ----- include/linux/types.h | 5 +++++ 2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/block/blk-mq-dma.c b/block/blk-mq-dma.c index badef1d925b2..38f5c34ca223 100644 --- a/block/blk-mq-dma.c +++ b/block/blk-mq-dma.c @@ -6,11 +6,6 @@ #include <linux/blk-mq-dma.h> #include "blk.h"
-struct phys_vec { - phys_addr_t paddr; - u32 len; -}; - static bool __blk_map_iter_next(struct blk_map_iter *iter) { if (iter->iter.bi_size) diff --git a/include/linux/types.h b/include/linux/types.h index 6dfdb8e8e4c3..2bc56681b2e6 100644 --- a/include/linux/types.h +++ b/include/linux/types.h @@ -170,6 +170,11 @@ typedef u64 phys_addr_t; typedef u32 phys_addr_t; #endif
+struct phys_vec { + phys_addr_t paddr; + u32 len; +}; + typedef phys_addr_t resource_size_t;
/*
From: Vivek Kasireddy vivek.kasireddy@intel.com
These helpers are useful for managing additional references taken on the device from other associated VFIO modules.
Original-patch-by: Jason Gunthorpe jgg@nvidia.com Signed-off-by: Vivek Kasireddy vivek.kasireddy@intel.com Signed-off-by: Leon Romanovsky leonro@nvidia.com --- drivers/vfio/vfio_main.c | 2 ++ include/linux/vfio.h | 2 ++ 2 files changed, 4 insertions(+)
diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c index 38c8e9350a60..9aa4a5d081e8 100644 --- a/drivers/vfio/vfio_main.c +++ b/drivers/vfio/vfio_main.c @@ -172,11 +172,13 @@ void vfio_device_put_registration(struct vfio_device *device) if (refcount_dec_and_test(&device->refcount)) complete(&device->comp); } +EXPORT_SYMBOL_GPL(vfio_device_put_registration);
bool vfio_device_try_get_registration(struct vfio_device *device) { return refcount_inc_not_zero(&device->refcount); } +EXPORT_SYMBOL_GPL(vfio_device_try_get_registration);
/* * VFIO driver API diff --git a/include/linux/vfio.h b/include/linux/vfio.h index eb563f538dee..217ba4ef1752 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -297,6 +297,8 @@ static inline void vfio_put_device(struct vfio_device *device) int vfio_register_group_dev(struct vfio_device *device); int vfio_register_emulated_iommu_dev(struct vfio_device *device); void vfio_unregister_group_dev(struct vfio_device *device); +bool vfio_device_try_get_registration(struct vfio_device *device); +void vfio_device_put_registration(struct vfio_device *device);
int vfio_assign_device_set(struct vfio_device *device, void *set_id); unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);
From: Vivek Kasireddy vivek.kasireddy@intel.com
There is no need to share the main device pointer (struct vfio_device *) with all the feature functions as they only need the core device pointer. Therefore, extract the core device pointer once in the caller (vfio_pci_core_ioctl_feature) and share it instead.
Signed-off-by: Vivek Kasireddy vivek.kasireddy@intel.com Signed-off-by: Leon Romanovsky leonro@nvidia.com --- drivers/vfio/pci/vfio_pci_core.c | 30 +++++++++++++----------------- 1 file changed, 13 insertions(+), 17 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 7dcf5439dedc..ca9a95716a85 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -299,11 +299,9 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev, return 0; }
-static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags, +static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags, void __user *arg, size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0); @@ -320,12 +318,10 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags, }
static int vfio_pci_core_pm_entry_with_wakeup( - struct vfio_device *device, u32 flags, + struct vfio_pci_core_device *vdev, u32 flags, struct vfio_device_low_power_entry_with_wakeup __user *arg, size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); struct vfio_device_low_power_entry_with_wakeup entry; struct eventfd_ctx *efdctx; int ret; @@ -376,11 +372,9 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev) up_write(&vdev->memory_lock); }
-static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags, +static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags, void __user *arg, size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0); @@ -1473,11 +1467,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, } EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
-static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, - uuid_t __user *arg, size_t argsz) +static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev, + u32 flags, uuid_t __user *arg, + size_t argsz) { - struct vfio_pci_core_device *vdev = - container_of(device, struct vfio_pci_core_device, vdev); uuid_t uuid; int ret;
@@ -1504,16 +1497,19 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, void __user *arg, size_t argsz) { + struct vfio_pci_core_device *vdev = + container_of(device, struct vfio_pci_core_device, vdev); + switch (flags & VFIO_DEVICE_FEATURE_MASK) { case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY: - return vfio_pci_core_pm_entry(device, flags, arg, argsz); + return vfio_pci_core_pm_entry(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP: - return vfio_pci_core_pm_entry_with_wakeup(device, flags, + return vfio_pci_core_pm_entry_with_wakeup(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT: - return vfio_pci_core_pm_exit(device, flags, arg, argsz); + return vfio_pci_core_pm_exit(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: - return vfio_pci_core_feature_token(device, flags, arg, argsz); + return vfio_pci_core_feature_token(vdev, flags, arg, argsz); default: return -ENOTTY; }
From: Leon Romanovsky leonro@nvidia.com
Make sure that all VFIO PCI devices have peer-to-peer capabilities enabled, so that we are able to export their MMIO memory through DMABUF.
Signed-off-by: Leon Romanovsky leonro@nvidia.com --- drivers/vfio/pci/vfio_pci_core.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index ca9a95716a85..fe247d0e2831 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -28,6 +28,7 @@ #include <linux/nospec.h> #include <linux/sched/mm.h> #include <linux/iommufd.h> +#include <linux/pci-p2pdma.h> #if IS_ENABLED(CONFIG_EEH) #include <asm/eeh.h> #endif @@ -2081,6 +2082,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev) { struct vfio_pci_core_device *vdev = container_of(core_vdev, struct vfio_pci_core_device, vdev); + int ret;
vdev->pdev = to_pci_dev(core_vdev->dev); vdev->irq_type = VFIO_PCI_NUM_IRQS; @@ -2090,6 +2092,9 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev) INIT_LIST_HEAD(&vdev->dummy_resources_list); INIT_LIST_HEAD(&vdev->ioeventfds_list); INIT_LIST_HEAD(&vdev->sriov_pfs_item); + ret = pcim_p2pdma_init(vdev->pdev); + if (ret != -EOPNOTSUPP) + return ret; init_rwsem(&vdev->memory_lock); xa_init(&vdev->ctx);
On Mon, Oct 13, 2025 at 06:26:10PM +0300, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Make sure that all VFIO PCI devices have peer-to-peer capabilities enabled, so that we are able to export their MMIO memory through DMABUF.
How do you know that they are safe to use with P2P?
On Thu, Oct 16, 2025 at 11:32:59PM -0700, Christoph Hellwig wrote:
On Mon, Oct 13, 2025 at 06:26:10PM +0300, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Make sure that all VFIO PCI devices have peer-to-peer capabilities enabled, so that we are able to export their MMIO memory through DMABUF.
How do you know that they are safe to use with P2P?
All PCI devices are "safe" for P2P by spec. I've never heard of a non-compliant device causing problems in this area.
The issue is always SoC support inside the CPU and that is dealt with inside the P2P subsystem logic.
If we ever see a problem it would be dealt with by quirking the broken device through pci-quirks and having the p2p subsystem refuse any p2p with that device.
Jason
From: Leon Romanovsky leonro@nvidia.com
Add support for exporting PCI device MMIO regions through dma-buf, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations.
The implementation provides a revocable attachment mechanism using dma-buf move operations. MMIO regions are normally pinned as BARs don't change physical addresses, but access is revoked when the VFIO device is closed or a PCI reset is issued. This ensures kernel self-defense against potentially hostile userspace.
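For orientation, the revoke side that the config/core hunks below hook into looks roughly like the sketch here. It is reconstructed from the pieces visible in this patch (the per-device dmabufs list, the revoked flag, dma_resv_lock() and dma_buf_move_notify()); the real vfio_pci_dma_buf_move() additionally has to hold a reference on each dmabuf file while walking the list, so treat this only as an illustration of the flow:

static void vfio_pci_dma_buf_move_sketch(struct vfio_pci_core_device *vdev,
					 bool revoked)
{
	struct vfio_pci_dma_buf *priv;

	lockdep_assert_held_write(&vdev->memory_lock);

	list_for_each_entry(priv, &vdev->dmabufs, dmabufs_elm) {
		if (priv->revoked == revoked)
			continue;
		dma_resv_lock(priv->dmabuf->resv, NULL);
		priv->revoked = revoked;
		/* Importers must unmap on true and may re-map once false. */
		dma_buf_move_notify(priv->dmabuf);
		dma_resv_unlock(priv->dmabuf->resv);
	}
}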
Signed-off-by: Jason Gunthorpe jgg@nvidia.com
Signed-off-by: Vivek Kasireddy vivek.kasireddy@intel.com
Signed-off-by: Leon Romanovsky leonro@nvidia.com
---
 drivers/vfio/pci/Kconfig           |   3 +
 drivers/vfio/pci/Makefile          |   2 +
 drivers/vfio/pci/vfio_pci_config.c |  22 +-
 drivers/vfio/pci/vfio_pci_core.c   |  28 ++
 drivers/vfio/pci/vfio_pci_dmabuf.c | 446 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_priv.h   |  23 ++
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  25 ++
 8 files changed, 546 insertions(+), 4 deletions(-)
 create mode 100644 drivers/vfio/pci/vfio_pci_dmabuf.c
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 2b0172f54665..2b9fca00e9e8 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -55,6 +55,9 @@ config VFIO_PCI_ZDEV_KVM
To enable s390x KVM vfio-pci extensions, say Y.
+config VFIO_PCI_DMABUF + def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER + source "drivers/vfio/pci/mlx5/Kconfig"
source "drivers/vfio/pci/hisilicon/Kconfig" diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index cf00c0a7e55c..f9155e9c5f63 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -2,7 +2,9 @@
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o + obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o +vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
vfio-pci-y := vfio_pci.o vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 8f02f236b5b4..1f6008eabf23 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -589,10 +589,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY); new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
- if (!new_mem) + if (!new_mem) { vfio_pci_zap_and_down_write_memory_lock(vdev); - else + vfio_pci_dma_buf_move(vdev, true); + } else { down_write(&vdev->memory_lock); + }
/* * If the user is writing mem/io enable (new_mem/io) and we @@ -627,6 +629,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos, *virt_cmd &= cpu_to_le16(~mask); *virt_cmd |= cpu_to_le16(new_cmd & mask);
+ if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); }
@@ -707,12 +711,16 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm) static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state) { - if (state >= PCI_D3hot) + if (state >= PCI_D3hot) { vfio_pci_zap_and_down_write_memory_lock(vdev); - else + vfio_pci_dma_buf_move(vdev, true); + } else { down_write(&vdev->memory_lock); + }
vfio_pci_set_power_state(vdev, state); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); }
@@ -900,7 +908,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) { vfio_pci_zap_and_down_write_memory_lock(vdev); + vfio_pci_dma_buf_move(vdev, true); pci_try_reset_function(vdev->pdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); } } @@ -982,7 +993,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) { vfio_pci_zap_and_down_write_memory_lock(vdev); + vfio_pci_dma_buf_move(vdev, true); pci_try_reset_function(vdev->pdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); } } diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index fe247d0e2831..56b1320238a9 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -287,6 +287,8 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev, * semaphore. */ vfio_pci_zap_and_down_write_memory_lock(vdev); + vfio_pci_dma_buf_move(vdev, true); + if (vdev->pm_runtime_engaged) { up_write(&vdev->memory_lock); return -EINVAL; @@ -370,6 +372,8 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev) */ down_write(&vdev->memory_lock); __vfio_pci_runtime_pm_exit(vdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock); }
@@ -690,6 +694,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev) #endif vfio_pci_core_disable(vdev);
+ vfio_pci_dma_buf_cleanup(vdev); + mutex_lock(&vdev->igate); if (vdev->err_trigger) { eventfd_ctx_put(vdev->err_trigger); @@ -1222,7 +1228,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev, */ vfio_pci_set_power_state(vdev, PCI_D0);
+ vfio_pci_dma_buf_move(vdev, true); ret = pci_try_reset_function(vdev->pdev); + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); up_write(&vdev->memory_lock);
return ret; @@ -1511,6 +1520,19 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, return vfio_pci_core_pm_exit(vdev, flags, arg, argsz); case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: return vfio_pci_core_feature_token(vdev, flags, arg, argsz); + case VFIO_DEVICE_FEATURE_DMA_BUF: + if (device->ops->ioctl != vfio_pci_core_ioctl) + /* + * Devices that overwrite general .ioctl() callback + * usually do it to implement their own + * VFIO_DEVICE_GET_REGION_INFO handlerm and they present + * different BAR information from the real PCI. + * + * DMABUF relies on real PCI information. + */ + return -EOPNOTSUPP; + + return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz); default: return -ENOTTY; } @@ -2095,6 +2117,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev) ret = pcim_p2pdma_init(vdev->pdev); if (ret != -EOPNOTSUPP) return ret; + INIT_LIST_HEAD(&vdev->dmabufs); init_rwsem(&vdev->memory_lock); xa_init(&vdev->ctx);
@@ -2459,6 +2482,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, break; }
+ vfio_pci_dma_buf_move(vdev, true); vfio_pci_zap_bars(vdev); }
@@ -2482,6 +2506,10 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
ret = pci_reset_bus(pdev);
+ list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list) + if (__vfio_pci_memory_enabled(vdev)) + vfio_pci_dma_buf_move(vdev, false); + vdev = list_last_entry(&dev_set->device_list, struct vfio_pci_core_device, vdev.dev_set_list);
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c new file mode 100644 index 000000000000..eaba010777f3 --- /dev/null +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -0,0 +1,446 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. + */ +#include <linux/dma-buf.h> +#include <linux/pci-p2pdma.h> +#include <linux/dma-resv.h> + +#include "vfio_pci_priv.h" + +MODULE_IMPORT_NS("DMA_BUF"); + +struct vfio_pci_dma_buf { + struct dma_buf *dmabuf; + struct vfio_pci_core_device *vdev; + struct list_head dmabufs_elm; + size_t size; + struct phys_vec *phys_vec; + struct p2pdma_provider *provider; + u32 nr_ranges; + u8 revoked : 1; +}; + +static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, + struct dma_buf_attachment *attachment) +{ + struct vfio_pci_dma_buf *priv = dmabuf->priv; + + if (!attachment->peer2peer) + return -EOPNOTSUPP; + + if (priv->revoked) + return -ENODEV; + + switch (pci_p2pdma_map_type(priv->provider, attachment->dev)) { + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: + break; + case PCI_P2PDMA_MAP_BUS_ADDR: + /* + * There is no need in IOVA at all for this flow. + * We rely on attachment->priv == NULL as a marker + * for this mode. + */ + return 0; + default: + return -EINVAL; + } + + attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL); + if (!attachment->priv) + return -ENOMEM; + + dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->size); + return 0; +} + +static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf, + struct dma_buf_attachment *attachment) +{ + kfree(attachment->priv); +} + +static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, u64 length, + dma_addr_t addr) +{ + unsigned int len, nents; + int i; + + nents = DIV_ROUND_UP(length, UINT_MAX); + for (i = 0; i < nents; i++) { + len = min_t(u64, length, UINT_MAX); + length -= len; + /* + * Follow the DMABUF rules for scatterlist, the struct page can + * be NULL'd for MMIO only memory. + */ + sg_set_page(sgl, NULL, len, 0); + sg_dma_address(sgl) = addr + i * UINT_MAX; + sg_dma_len(sgl) = len; + sgl = sg_next(sgl); + } + + return sgl; +} + +static unsigned int calc_sg_nents(struct vfio_pci_dma_buf *priv, + struct dma_iova_state *state) +{ + struct phys_vec *phys_vec = priv->phys_vec; + unsigned int nents = 0; + u32 i; + + if (!state || !dma_use_iova(state)) + for (i = 0; i < priv->nr_ranges; i++) + nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX); + else + /* + * In IOVA case, there is only one SG entry which spans + * for whole IOVA address space, but we need to make sure + * that it fits sg->length, maybe we need more. 
+ */ + nents = DIV_ROUND_UP(priv->size, UINT_MAX); + + return nents; +} + +static struct sg_table * +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, + enum dma_data_direction dir) +{ + struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; + struct dma_iova_state *state = attachment->priv; + struct phys_vec *phys_vec = priv->phys_vec; + unsigned long attrs = DMA_ATTR_MMIO; + unsigned int nents, mapped_len = 0; + struct scatterlist *sgl; + struct sg_table *sgt; + dma_addr_t addr; + int ret; + u32 i; + + dma_resv_assert_held(priv->dmabuf->resv); + + if (priv->revoked) + return ERR_PTR(-ENODEV); + + sgt = kzalloc(sizeof(*sgt), GFP_KERNEL); + if (!sgt) + return ERR_PTR(-ENOMEM); + + nents = calc_sg_nents(priv, state); + ret = sg_alloc_table(sgt, nents, GFP_KERNEL | __GFP_ZERO); + if (ret) + goto err_kfree_sgt; + + sgl = sgt->sgl; + + for (i = 0; i < priv->nr_ranges; i++) { + if (!state) { + addr = pci_p2pdma_bus_addr_map(priv->provider, + phys_vec[i].paddr); + } else if (dma_use_iova(state)) { + ret = dma_iova_link(attachment->dev, state, + phys_vec[i].paddr, 0, + phys_vec[i].len, dir, attrs); + if (ret) + goto err_unmap_dma; + + mapped_len += phys_vec[i].len; + } else { + addr = dma_map_phys(attachment->dev, phys_vec[i].paddr, + phys_vec[i].len, dir, attrs); + ret = dma_mapping_error(attachment->dev, addr); + if (ret) + goto err_unmap_dma; + } + + if (!state || !dma_use_iova(state)) + sgl = fill_sg_entry(sgl, phys_vec[i].len, addr); + } + + if (state && dma_use_iova(state)) { + WARN_ON_ONCE(mapped_len != priv->size); + ret = dma_iova_sync(attachment->dev, state, 0, mapped_len); + if (ret) + goto err_unmap_dma; + sgl = fill_sg_entry(sgl, mapped_len, state->addr); + } + + /* + * SGL must be NULL to indicate that SGL is the last one + * and we allocated correct number of entries in sg_alloc_table() + */ + WARN_ON_ONCE(sgl); + return sgt; + +err_unmap_dma: + if (!i || !state) + ; /* Do nothing */ + else if (dma_use_iova(state)) + dma_iova_destroy(attachment->dev, state, mapped_len, dir, + attrs); + else + for_each_sgtable_dma_sg(sgt, sgl, i) + dma_unmap_phys(attachment->dev, sg_dma_address(sgl), + sg_dma_len(sgl), dir, attrs); + sg_free_table(sgt); +err_kfree_sgt: + kfree(sgt); + return ERR_PTR(ret); +} + +static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment, + struct sg_table *sgt, + enum dma_data_direction dir) +{ + struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; + struct dma_iova_state *state = attachment->priv; + unsigned long attrs = DMA_ATTR_MMIO; + struct scatterlist *sgl; + int i; + + if (!state) + ; /* Do nothing */ + else if (dma_use_iova(state)) + dma_iova_destroy(attachment->dev, state, priv->size, dir, + attrs); + else + for_each_sgtable_dma_sg(sgt, sgl, i) + dma_unmap_phys(attachment->dev, sg_dma_address(sgl), + sg_dma_len(sgl), dir, attrs); + + sg_free_table(sgt); + kfree(sgt); +} + +static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf) +{ + struct vfio_pci_dma_buf *priv = dmabuf->priv; + + /* + * Either this or vfio_pci_dma_buf_cleanup() will remove from the list. + * The refcount prevents both. 
+ */ + if (priv->vdev) { + down_write(&priv->vdev->memory_lock); + list_del_init(&priv->dmabufs_elm); + up_write(&priv->vdev->memory_lock); + vfio_device_put_registration(&priv->vdev->vdev); + } + kfree(priv->phys_vec); + kfree(priv); +} + +static const struct dma_buf_ops vfio_pci_dmabuf_ops = { + .attach = vfio_pci_dma_buf_attach, + .detach = vfio_pci_dma_buf_detach, + .map_dma_buf = vfio_pci_dma_buf_map, + .release = vfio_pci_dma_buf_release, + .unmap_dma_buf = vfio_pci_dma_buf_unmap, +}; + +static void dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv, + struct vfio_device_feature_dma_buf *dma_buf, + struct vfio_region_dma_range *dma_ranges, + struct p2pdma_provider *provider) +{ + struct pci_dev *pdev = priv->vdev->pdev; + phys_addr_t pci_start; + u32 i; + + pci_start = pci_resource_start(pdev, dma_buf->region_index); + for (i = 0; i < dma_buf->nr_ranges; i++) { + priv->phys_vec[i].len = dma_ranges[i].length; + priv->phys_vec[i].paddr = pci_start + dma_ranges[i].offset; + priv->size += priv->phys_vec[i].len; + } + priv->nr_ranges = dma_buf->nr_ranges; + priv->provider = provider; +} + +static int validate_dmabuf_input(struct vfio_pci_core_device *vdev, + struct vfio_device_feature_dma_buf *dma_buf, + struct vfio_region_dma_range *dma_ranges, + struct p2pdma_provider **provider) +{ + struct pci_dev *pdev = vdev->pdev; + u32 bar = dma_buf->region_index; + resource_size_t bar_size; + u64 length = 0, sum; + u32 i; + + if (dma_buf->flags) + return -EINVAL; + /* + * For PCI the region_index is the BAR number like everything else. + */ + if (bar >= VFIO_PCI_ROM_REGION_INDEX) + return -ENODEV; + + *provider = pcim_p2pdma_provider(pdev, bar); + if (!*provider) + return -EINVAL; + + bar_size = pci_resource_len(pdev, bar); + for (i = 0; i < dma_buf->nr_ranges; i++) { + u64 offset = dma_ranges[i].offset; + u64 len = dma_ranges[i].length; + + if (!len || !PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) + return -EINVAL; + + if (check_add_overflow(offset, len, &sum) || sum > bar_size) + return -EINVAL; + + /* Total requested length can't overflow IOVA size */ + if (check_add_overflow(length, len, &sum)) + return -EINVAL; + + length = sum; + } + + /* + * DMA API uses size_t, so make sure that requested region length + * can fit into size_t variable, which can be unsigned int (32bits). + * + * In addition make sure that high bit of total length is not used too + * as it is used as a marker for DMA IOVA API. 
+ */ + if (overflows_type(length, size_t) || length & DMA_IOVA_USE_SWIOTLB) + return -EINVAL; + + return 0; +} + +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz) +{ + struct vfio_device_feature_dma_buf get_dma_buf = {}; + struct vfio_region_dma_range *dma_ranges; + DEFINE_DMA_BUF_EXPORT_INFO(exp_info); + struct p2pdma_provider *provider; + struct vfio_pci_dma_buf *priv; + int ret; + + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET, + sizeof(get_dma_buf)); + if (ret != 1) + return ret; + + if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf))) + return -EFAULT; + + if (!get_dma_buf.nr_ranges) + return -EINVAL; + + dma_ranges = memdup_array_user(&arg->dma_ranges, get_dma_buf.nr_ranges, + sizeof(*dma_ranges)); + if (IS_ERR(dma_ranges)) + return PTR_ERR(dma_ranges); + + ret = validate_dmabuf_input(vdev, &get_dma_buf, dma_ranges, &provider); + if (ret) + goto err_free_ranges; + + priv = kzalloc(sizeof(*priv), GFP_KERNEL); + if (!priv) { + ret = -ENOMEM; + goto err_free_ranges; + } + priv->phys_vec = kcalloc(get_dma_buf.nr_ranges, sizeof(*priv->phys_vec), + GFP_KERNEL); + if (!priv->phys_vec) { + ret = -ENOMEM; + goto err_free_priv; + } + + priv->vdev = vdev; + dma_ranges_to_p2p_phys(priv, &get_dma_buf, dma_ranges, provider); + kfree(dma_ranges); + dma_ranges = NULL; + + if (!vfio_device_try_get_registration(&vdev->vdev)) { + ret = -ENODEV; + goto err_free_phys; + } + + exp_info.ops = &vfio_pci_dmabuf_ops; + exp_info.size = priv->size; + exp_info.flags = get_dma_buf.open_flags; + exp_info.priv = priv; + + priv->dmabuf = dma_buf_export(&exp_info); + if (IS_ERR(priv->dmabuf)) { + ret = PTR_ERR(priv->dmabuf); + goto err_dev_put; + } + + /* dma_buf_put() now frees priv */ + INIT_LIST_HEAD(&priv->dmabufs_elm); + down_write(&vdev->memory_lock); + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->revoked = !__vfio_pci_memory_enabled(vdev); + list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs); + dma_resv_unlock(priv->dmabuf->resv); + up_write(&vdev->memory_lock); + + /* + * dma_buf_fd() consumes the reference, when the file closes the dmabuf + * will be released. 
+ */ + return dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags); + +err_dev_put: + vfio_device_put_registration(&vdev->vdev); +err_free_phys: + kfree(priv->phys_vec); +err_free_priv: + kfree(priv); +err_free_ranges: + kfree(dma_ranges); + return ret; +} + +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked) +{ + struct vfio_pci_dma_buf *priv; + struct vfio_pci_dma_buf *tmp; + + lockdep_assert_held_write(&vdev->memory_lock); + + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { + if (!get_file_active(&priv->dmabuf->file)) + continue; + + if (priv->revoked != revoked) { + dma_resv_lock(priv->dmabuf->resv, NULL); + priv->revoked = revoked; + dma_buf_move_notify(priv->dmabuf); + dma_resv_unlock(priv->dmabuf->resv); + } + dma_buf_put(priv->dmabuf); + } +} + +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{ + struct vfio_pci_dma_buf *priv; + struct vfio_pci_dma_buf *tmp; + + down_write(&vdev->memory_lock); + list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) { + if (!get_file_active(&priv->dmabuf->file)) + continue; + + dma_resv_lock(priv->dmabuf->resv, NULL); + list_del_init(&priv->dmabufs_elm); + priv->vdev = NULL; + priv->revoked = true; + dma_buf_move_notify(priv->dmabuf); + dma_resv_unlock(priv->dmabuf->resv); + vfio_device_put_registration(&vdev->vdev); + dma_buf_put(priv->dmabuf); + } + up_write(&vdev->memory_lock); +} diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h index a9972eacb293..28a405f8b97c 100644 --- a/drivers/vfio/pci/vfio_pci_priv.h +++ b/drivers/vfio/pci/vfio_pci_priv.h @@ -107,4 +107,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev) return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA; }
+#ifdef CONFIG_VFIO_PCI_DMABUF +int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz); +void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev); +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked); +#else +static inline int +vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, + struct vfio_device_feature_dma_buf __user *arg, + size_t argsz) +{ + return -ENOTTY; +} +static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev) +{ +} +static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, + bool revoked) +{ +} +#endif + #endif diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index f541044e42a2..30d74b364f25 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -94,6 +94,7 @@ struct vfio_pci_core_device { struct vfio_pci_core_device *sriov_pf_core_dev; struct notifier_block nb; struct rw_semaphore memory_lock; + struct list_head dmabufs; };
/* Will be exported for vfio pci drivers usage */ diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 75100bf009ba..63214467c875 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -1478,6 +1478,31 @@ struct vfio_device_feature_bus_master { }; #define VFIO_DEVICE_FEATURE_BUS_MASTER 10
+/** + * Upon VFIO_DEVICE_FEATURE_GET create a dma_buf fd for the + * regions selected. + * + * open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC, + * etc. offset/length specify a slice of the region to create the dmabuf from. + * nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf. + * + * Return: The fd number on success, -1 and errno is set on failure. + */ +#define VFIO_DEVICE_FEATURE_DMA_BUF 11 + +struct vfio_region_dma_range { + __u64 offset; + __u64 length; +}; + +struct vfio_device_feature_dma_buf { + __u32 region_index; + __u32 open_flags; + __u32 flags; + __u32 nr_ranges; + struct vfio_region_dma_range dma_ranges[]; +}; + /* -------- API for Type1 VFIO IOMMU -------- */
/**
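For reference, a minimal userspace sketch of how the VFIO_DEVICE_FEATURE_DMA_BUF payload above is meant to be driven. It is illustrative only: the payload is passed through the generic VFIO_DEVICE_FEATURE ioctl with a struct vfio_device_feature header, and on success the ioctl's return value is the new dmabuf fd. The device fd setup, the choice of BAR and size, and the error handling are assumptions of the example, not part of the uAPI.

#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Export the first 8KiB (page aligned, assuming 4KiB pages) of BAR 0 of an
 * already-open VFIO device fd as a dmabuf fd.
 */
static int vfio_bar0_to_dmabuf(int device_fd)
{
	struct vfio_device_feature *feature;
	struct vfio_device_feature_dma_buf *get_dma_buf;
	struct vfio_region_dma_range *range;
	size_t argsz = sizeof(*feature) + sizeof(*get_dma_buf) + sizeof(*range);
	int dmabuf_fd;

	feature = calloc(1, argsz);
	if (!feature)
		return -1;

	feature->argsz = argsz;
	feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF;

	get_dma_buf = (void *)feature->data;
	get_dma_buf->region_index = 0;		/* BAR 0 */
	get_dma_buf->open_flags = O_RDWR | O_CLOEXEC;
	get_dma_buf->nr_ranges = 1;

	range = get_dma_buf->dma_ranges;
	range->offset = 0;
	range->length = 8192;

	/* Returns the dmabuf fd on success, -1 with errno set on failure */
	dmabuf_fd = ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
	free(feature);
	return dmabuf_fd;
}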
On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
+{
- struct vfio_pci_dma_buf *priv = dmabuf->priv;
- if (!attachment->peer2peer)
return -EOPNOTSUPP;
- if (priv->revoked)
return -ENODEV;
- switch (pci_p2pdma_map_type(priv->provider, attachment->dev)) {
- case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
break;
- case PCI_P2PDMA_MAP_BUS_ADDR:
/*
* There is no need in IOVA at all for this flow.
* We rely on attachment->priv == NULL as a marker
* for this mode.
*/
return 0;
- default:
return -EINVAL;
- }
- attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL);
- if (!attachment->priv)
return -ENOMEM;
- dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->size);
The lifetime of this isn't good..
- return 0;
+}
+static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
+{
- kfree(attachment->priv);
+}
If the caller fails to call map then it leaks the iova.
+static struct sg_table * +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
enum dma_data_direction dir)
+{
[..]
+err_unmap_dma:
- if (!i || !state)
; /* Do nothing */
- else if (dma_use_iova(state))
dma_iova_destroy(attachment->dev, state, mapped_len, dir,
attrs);
If we hit this error path then it is freed..
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
struct sg_table *sgt,
enum dma_data_direction dir)
+{
- struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
- struct dma_iova_state *state = attachment->priv;
- unsigned long attrs = DMA_ATTR_MMIO;
- struct scatterlist *sgl;
- int i;
- if (!state)
; /* Do nothing */
- else if (dma_use_iova(state))
dma_iova_destroy(attachment->dev, state, priv->size, dir,
attrs);
It is freed here too, but we can call map multiple times; every move_notify can trigger another call to map.
I think just call unlink in those two paths and put dma_iova_free in detach.
Jason
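A minimal sketch of the split Jason is suggesting, assuming the dma_iova_state allocated at attach time stays with the attachment for its whole lifetime: map/unmap only link and unlink ranges, and the IOVA allocation itself is released in detach. These helpers are illustrative glue, not code from the posted patch.

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Unmap tears down the mappings but keeps the IOVA allocation alive */
static void sketch_unmap_iova(struct dma_buf_attachment *attachment,
			      struct dma_iova_state *state, size_t size,
			      enum dma_data_direction dir)
{
	dma_iova_unlink(attachment->dev, state, 0, size, dir, DMA_ATTR_MMIO);
}

/* Detach is the only place the IOVA range itself is given back */
static void sketch_detach_iova(struct dma_buf_attachment *attachment,
			       struct dma_iova_state *state)
{
	dma_iova_free(attachment->dev, state);
	kfree(state);
}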
On Thu, Oct 16, 2025 at 08:53:32PM -0300, Jason Gunthorpe wrote:
On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
+{
- struct vfio_pci_dma_buf *priv = dmabuf->priv;
- if (!attachment->peer2peer)
return -EOPNOTSUPP;
- if (priv->revoked)
return -ENODEV;
- switch (pci_p2pdma_map_type(priv->provider, attachment->dev)) {
- case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
break;
- case PCI_P2PDMA_MAP_BUS_ADDR:
/*
* There is no need in IOVA at all for this flow.
* We rely on attachment->priv == NULL as a marker
* for this mode.
*/
return 0;
- default:
return -EINVAL;
- }
- attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL);
- if (!attachment->priv)
return -ENOMEM;
- dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->size);
The lifetime of this isn't good..
- return 0;
+}
+static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
+{
- kfree(attachment->priv);
+}
If the caller fails to call map then it leaks the iova.
I'm relying on dmabuf code and documentation:
926 /**
927  * dma_buf_dynamic_attach - Add the device to dma_buf's attachments list
...
932  *
933  * Returns struct dma_buf_attachment pointer for this attachment. Attachments
934  * must be cleaned up by calling dma_buf_detach().
A successful call to vfio_pci_dma_buf_attach() MUST be accompanied by a call to vfio_pci_dma_buf_detach(), so as long as the dmabuf implementation follows that, there is no leak.
+static struct sg_table * +vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
enum dma_data_direction dir)
+{
[..]
+err_unmap_dma:
- if (!i || !state)
; /* Do nothing */
- else if (dma_use_iova(state))
dma_iova_destroy(attachment->dev, state, mapped_len, dir,
attrs);
If we hit this error path then it is freed..
+static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
struct sg_table *sgt,
enum dma_data_direction dir)
+{
- struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
- struct dma_iova_state *state = attachment->priv;
- unsigned long attrs = DMA_ATTR_MMIO;
- struct scatterlist *sgl;
- int i;
- if (!state)
; /* Do nothing */
- else if (dma_use_iova(state))
dma_iova_destroy(attachment->dev, state, priv->size, dir,
attrs);
It is freed here too, but we can call map multiple times; every move_notify can trigger another call to map.
I think just call unlink in those two paths and put dma_iova_free in detach.
Yes, it can work.
Thanks
Jason
On Fri, Oct 17, 2025 at 08:40:07AM +0300, Leon Romanovsky wrote:
+static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
+{
- kfree(attachment->priv);
+}
If the caller fails to call map then it leaks the iova.
I'm relying on dmabuf code and documentation:
926 /**
927  * dma_buf_dynamic_attach - Add the device to dma_buf's attachments list
...
932  *
933  * Returns struct dma_buf_attachment pointer for this attachment. Attachments
934  * must be cleaned up by calling dma_buf_detach().
A successful call to vfio_pci_dma_buf_attach() MUST be accompanied by a call to vfio_pci_dma_buf_detach(), so as long as the dmabuf implementation follows that, there is no leak.
It leaks the iova because there is no dma_iova_destroy() unless you call unmap. detach is not unmap, and unmap is not mandatory to call.
Jason
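To make the failure mode concrete: the sequence below is a perfectly legal dynamic importer under the dma-buf contract, yet with the v5 code the IOVA space reserved by dma_iova_try_alloc() in attach is never returned, because nothing ever calls unmap. The importer shown is a stub for illustration only.

#include <linux/dma-buf.h>
#include <linux/err.h>

static void stub_move_notify(struct dma_buf_attachment *attach)
{
	/* A real importer would invalidate its mappings here */
}

static const struct dma_buf_attach_ops stub_importer_ops = {
	.allow_peer2peer = true,
	.move_notify = stub_move_notify,
};

/* Attach and detach without ever mapping: allowed by the dma-buf API */
static int stub_attach_only(struct dma_buf *dmabuf, struct device *dev)
{
	struct dma_buf_attachment *attach;

	attach = dma_buf_dynamic_attach(dmabuf, dev, &stub_importer_ops, NULL);
	if (IS_ERR(attach))
		return PTR_ERR(attach);

	/* importer changes its mind before mapping, e.g. hits its own error */

	dma_buf_detach(dmabuf, attach);	/* v5: IOVA allocated in attach leaks */
	return 0;
}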
On Fri, Oct 17, 2025 at 12:58:50PM -0300, Jason Gunthorpe wrote:
On Fri, Oct 17, 2025 at 08:40:07AM +0300, Leon Romanovsky wrote:
+static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
+{
- kfree(attachment->priv);
+}
If the caller fails to call map then it leaks the iova.
I'm relying on dmabuf code and documentation:
926 /**
927  * dma_buf_dynamic_attach - Add the device to dma_buf's attachments list
...
932  *
933  * Returns struct dma_buf_attachment pointer for this attachment. Attachments
934  * must be cleaned up by calling dma_buf_detach().
A successful call to vfio_pci_dma_buf_attach() MUST be accompanied by a call to vfio_pci_dma_buf_detach(), so as long as the dmabuf implementation follows that, there is no leak.
It leaks the iova because there is no dma_iova_destroy() unless you call unmap. detach is not unmap, and unmap is not mandatory to call.
Though putting the iova free in detach is problematic for the hot-unplug case. In that instance we need to ensure the iova is cleaned up prior to returning from vfio's remove(). detach is called on the importer's timeline, but unmap is required to be called from move_notify.
So I guess we need some kind of flag to trigger the unmap after cleanup to free the iova?
Jason
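One possible shape for the flag being wondered about here, purely as a sketch: the attach-side state could record how much is currently linked, so a forced teardown path (revoke or hot-unplug) can unlink without waiting for the importer's unmap, and detach can then free the IOVA unconditionally. The struct and helper below use invented names for illustration; they are not a proposal from the thread.

#include <linux/dma-mapping.h>

/* Hypothetical per-attachment state: remember what is currently linked */
struct sketch_attach_state {
	struct dma_iova_state state;
	enum dma_data_direction dir;
	size_t linked_len;	/* 0 when nothing is mapped */
};

/* Safe to call from unmap, revoke or cleanup; idempotent */
static void sketch_force_unlink(struct device *dev,
				struct sketch_attach_state *attach)
{
	if (!attach->linked_len)
		return;
	dma_iova_unlink(dev, &attach->state, 0, attach->linked_len,
			attach->dir, DMA_ATTR_MMIO);
	attach->linked_len = 0;
}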
On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Add support for exporting PCI device MMIO regions through dma-buf, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations.
The implementation provides a revocable attachment mechanism using dma-buf move operations. MMIO regions are normally pinned as BARs don't change physical addresses, but access is revoked when the VFIO device is closed or a PCI reset is issued. This ensures kernel self-defense against potentially hostile userspace.
I have drafted the iommufd importer side of this using the private interconnect approach for now.
https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf/
Due to this, iommufd never calls map and we run into trouble here:
+static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
+{
- struct vfio_pci_dma_buf *priv = dmabuf->priv;
- if (!attachment->peer2peer)
return -EOPNOTSUPP;
- if (priv->revoked)
return -ENODEV;
- switch (pci_p2pdma_map_type(priv->provider, attachment->dev)) {
- case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
break;
- case PCI_P2PDMA_MAP_BUS_ADDR:
/*
* There is no need in IOVA at all for this flow.
* We rely on attachment->priv == NULL as a marker
* for this mode.
*/
return 0;
- default:
return -EINVAL;
Here the dev from iommufd is also not P2P capable, so the attach fails.
This is OK since it won't call map.
So I reworked this logic to succeed at attach but block map in this case. Can we fold this in for the next version? This diff has the fix for the iova lifecycle too.
I have a few more checks to make, but so far it looks OK, and with some luck we can get some iommufd P2P support this cycle.
Jason
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index eaba010777f3b7..a0650bd816d99b 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -20,10 +20,21 @@ struct vfio_pci_dma_buf { u8 revoked : 1; };
+struct vfio_pci_attach { + struct dma_iova_state state; + enum { + VFIO_ATTACH_NONE, + VFIO_ATTACH_HOST_BRIDGE_DMA, + VFIO_ATTACH_HOST_BRIDGE_IOVA, + VFIO_ATTACH_BUS + } kind; +}; + static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, struct dma_buf_attachment *attachment) { struct vfio_pci_dma_buf *priv = dmabuf->priv; + struct vfio_pci_attach *attach;
if (!attachment->peer2peer) return -EOPNOTSUPP; @@ -31,32 +42,38 @@ static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf, if (priv->revoked) return -ENODEV;
+ attach = kzalloc(sizeof(*attach), GFP_KERNEL); + if (!attach) + return -ENOMEM; + attachment->priv = attach; + switch (pci_p2pdma_map_type(priv->provider, attachment->dev)) { case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: - break; + if (dma_iova_try_alloc(attachment->dev, &attach->state, 0, + priv->size)) + attach->kind = VFIO_ATTACH_HOST_BRIDGE_IOVA; + else + attach->kind = VFIO_ATTACH_HOST_BRIDGE_DMA; + return 0; case PCI_P2PDMA_MAP_BUS_ADDR: - /* - * There is no need in IOVA at all for this flow. - * We rely on attachment->priv == NULL as a marker - * for this mode. - */ + /* There is no need in IOVA at all for this flow. */ + attach->kind = VFIO_ATTACH_BUS; return 0; default: - return -EINVAL; + attach->kind = VFIO_ATTACH_NONE; + return 0; } - - attachment->priv = kzalloc(sizeof(struct dma_iova_state), GFP_KERNEL); - if (!attachment->priv) - return -ENOMEM; - - dma_iova_try_alloc(attachment->dev, attachment->priv, 0, priv->size); return 0; }
static void vfio_pci_dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attachment) { - kfree(attachment->priv); + struct vfio_pci_attach *attach = attachment->priv; + + if (attach->kind == VFIO_ATTACH_HOST_BRIDGE_IOVA) + dma_iova_free(attachment->dev, &attach->state); + kfree(attach); }
static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, u64 length, @@ -83,22 +100,23 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, u64 length, }
static unsigned int calc_sg_nents(struct vfio_pci_dma_buf *priv, - struct dma_iova_state *state) + struct vfio_pci_attach *attach) { struct phys_vec *phys_vec = priv->phys_vec; unsigned int nents = 0; u32 i;
- if (!state || !dma_use_iova(state)) + if (attach->kind != VFIO_ATTACH_HOST_BRIDGE_IOVA) { for (i = 0; i < priv->nr_ranges; i++) nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX); - else + } else { /* * In IOVA case, there is only one SG entry which spans * for whole IOVA address space, but we need to make sure * that it fits sg->length, maybe we need more. */ nents = DIV_ROUND_UP(priv->size, UINT_MAX); + }
return nents; } @@ -108,7 +126,7 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, enum dma_data_direction dir) { struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; - struct dma_iova_state *state = attachment->priv; + struct vfio_pci_attach *attach = attachment->priv; struct phys_vec *phys_vec = priv->phys_vec; unsigned long attrs = DMA_ATTR_MMIO; unsigned int nents, mapped_len = 0; @@ -127,7 +145,7 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, if (!sgt) return ERR_PTR(-ENOMEM);
- nents = calc_sg_nents(priv, state); + nents = calc_sg_nents(priv, attach); ret = sg_alloc_table(sgt, nents, GFP_KERNEL | __GFP_ZERO); if (ret) goto err_kfree_sgt; @@ -135,35 +153,42 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, sgl = sgt->sgl;
for (i = 0; i < priv->nr_ranges; i++) { - if (!state) { + switch (attach->kind) { + case VFIO_ATTACH_BUS: addr = pci_p2pdma_bus_addr_map(priv->provider, phys_vec[i].paddr); - } else if (dma_use_iova(state)) { - ret = dma_iova_link(attachment->dev, state, + break; + case VFIO_ATTACH_HOST_BRIDGE_IOVA: + ret = dma_iova_link(attachment->dev, &attach->state, phys_vec[i].paddr, 0, phys_vec[i].len, dir, attrs); if (ret) goto err_unmap_dma;
mapped_len += phys_vec[i].len; - } else { + break; + case VFIO_ATTACH_HOST_BRIDGE_DMA: addr = dma_map_phys(attachment->dev, phys_vec[i].paddr, phys_vec[i].len, dir, attrs); ret = dma_mapping_error(attachment->dev, addr); if (ret) goto err_unmap_dma; + break; + default: + ret = -EINVAL; + goto err_unmap_dma; }
- if (!state || !dma_use_iova(state)) + if (attach->kind != VFIO_ATTACH_HOST_BRIDGE_IOVA) sgl = fill_sg_entry(sgl, phys_vec[i].len, addr); }
- if (state && dma_use_iova(state)) { + if (attach->kind == VFIO_ATTACH_HOST_BRIDGE_IOVA) { WARN_ON_ONCE(mapped_len != priv->size); - ret = dma_iova_sync(attachment->dev, state, 0, mapped_len); + ret = dma_iova_sync(attachment->dev, &attach->state, 0, mapped_len); if (ret) goto err_unmap_dma; - sgl = fill_sg_entry(sgl, mapped_len, state->addr); + sgl = fill_sg_entry(sgl, mapped_len, attach->state.addr); }
/* @@ -174,15 +199,22 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment, return sgt;
err_unmap_dma: - if (!i || !state) - ; /* Do nothing */ - else if (dma_use_iova(state)) - dma_iova_destroy(attachment->dev, state, mapped_len, dir, - attrs); - else + switch (attach->kind) { + case VFIO_ATTACH_HOST_BRIDGE_IOVA: + if (mapped_len) + dma_iova_unlink(attachment->dev, &attach->state, 0, + mapped_len, dir, attrs); + break; + case VFIO_ATTACH_HOST_BRIDGE_DMA: + if (!i) + break; for_each_sgtable_dma_sg(sgt, sgl, i) dma_unmap_phys(attachment->dev, sg_dma_address(sgl), - sg_dma_len(sgl), dir, attrs); + sg_dma_len(sgl), dir, attrs); + break; + default: + break; + } sg_free_table(sgt); err_kfree_sgt: kfree(sgt); @@ -194,20 +226,24 @@ static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment, enum dma_data_direction dir) { struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv; - struct dma_iova_state *state = attachment->priv; + struct vfio_pci_attach *attach = attachment->priv; unsigned long attrs = DMA_ATTR_MMIO; struct scatterlist *sgl; int i;
- if (!state) - ; /* Do nothing */ - else if (dma_use_iova(state)) - dma_iova_destroy(attachment->dev, state, priv->size, dir, - attrs); - else + switch (attach->kind) { + case VFIO_ATTACH_HOST_BRIDGE_IOVA: + dma_iova_destroy(attachment->dev, &attach->state, priv->size, + dir, attrs); + break; + case VFIO_ATTACH_HOST_BRIDGE_DMA: for_each_sgtable_dma_sg(sgt, sgl, i) dma_unmap_phys(attachment->dev, sg_dma_address(sgl), sg_dma_len(sgl), dir, attrs); + break; + default: + break; + }
sg_free_table(sgt); kfree(sgt);
On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Add support for exporting PCI device MMIO regions through dma-buf, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations.
The implementation provides a revocable attachment mechanism using dma-buf move operations. MMIO regions are normally pinned as BARs don't change physical addresses, but access is revoked when the VFIO device is closed or a PCI reset is issued. This ensures kernel self-defense against potentially hostile userspace.
This still completely fails to explain why you think that it actually is safe without the proper pgmap handling.
On Thu, Oct 16, 2025 at 11:33:52PM -0700, Christoph Hellwig wrote:
On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Add support for exporting PCI device MMIO regions through dma-buf, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations.
The implementation provides a revocable attachment mechanism using dma-buf move operations. MMIO regions are normally pinned as BARs don't change physical addresses, but access is revoked when the VFIO device is closed or a PCI reset is issued. This ensures kernel self-defense against potentially hostile userspace.
This still completely fails to explain why you think that it actually is safe without the proper pgmap handling.
Leon, this keeps coming up. Please clean up and copy the text from my prior email into the commit message & cover letter, explaining in detail how lifetime management works via revocation/invalidation driven by dmabuf move_notify callbacks.
Jason
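For context on the lifetime model being asked for: the exporter side is already in the patch (vfio_pci_dma_buf_move() flips revoked and calls dma_buf_move_notify() under the reservation lock), and a dynamic importer is expected to react roughly as sketched below, dropping its mapping in move_notify and re-mapping later under the reservation lock if access comes back. The importer structure and the quiesce step are assumptions of the sketch, not taken from a real driver.

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/err.h>

struct sketch_importer {
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
};

/* Called by the exporter with the dmabuf reservation lock held */
static void sketch_move_notify(struct dma_buf_attachment *attach)
{
	struct sketch_importer *imp = attach->importer_priv;

	dma_resv_assert_held(attach->dmabuf->resv);

	/* ... quiesce any DMA the importer has outstanding on this buffer ... */

	if (imp->sgt) {
		dma_buf_unmap_attachment(attach, imp->sgt, DMA_BIDIRECTIONAL);
		imp->sgt = NULL;
	}
}

/* If access is restored later, the importer re-maps under the resv lock */
static int sketch_remap(struct sketch_importer *imp)
{
	dma_resv_lock(imp->attach->dmabuf->resv, NULL);
	imp->sgt = dma_buf_map_attachment(imp->attach, DMA_BIDIRECTIONAL);
	dma_resv_unlock(imp->attach->dmabuf->resv);
	return PTR_ERR_OR_ZERO(imp->sgt);
}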
On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
+static void dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
struct vfio_device_feature_dma_buf *dma_buf,
struct vfio_region_dma_range *dma_ranges,
struct p2pdma_provider *provider)
+{
- struct pci_dev *pdev = priv->vdev->pdev;
- phys_addr_t pci_start;
- u32 i;
- pci_start = pci_resource_start(pdev, dma_buf->region_index);
- for (i = 0; i < dma_buf->nr_ranges; i++) {
priv->phys_vec[i].len = dma_ranges[i].length;
priv->phys_vec[i].paddr = pci_start + dma_ranges[i].offset;
priv->size += priv->phys_vec[i].len;
- }
This is missing validation; userspace can pass in any length/offset, but the resource is of limited size. Like this:
static int dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
				  struct vfio_device_feature_dma_buf *dma_buf,
				  struct vfio_region_dma_range *dma_ranges,
				  struct p2pdma_provider *provider)
{
	struct pci_dev *pdev = priv->vdev->pdev;
	phys_addr_t len = pci_resource_len(pdev, dma_buf->region_index);
	phys_addr_t pci_start;
	phys_addr_t pci_last;
	u32 i;

	if (!len)
		return -EINVAL;

	pci_start = pci_resource_start(pdev, dma_buf->region_index);
	pci_last = pci_start + len - 1;
	for (i = 0; i < dma_buf->nr_ranges; i++) {
		phys_addr_t last;

		if (!dma_ranges[i].length)
			return -EINVAL;

		if (check_add_overflow(pci_start, dma_ranges[i].offset,
				       &priv->phys_vec[i].paddr) ||
		    check_add_overflow(priv->phys_vec[i].paddr,
				       dma_ranges[i].length - 1, &last))
			return -EOVERFLOW;
		if (last > pci_last)
			return -EINVAL;

		priv->phys_vec[i].len = dma_ranges[i].length;
		priv->size += priv->phys_vec[i].len;
	}
	priv->nr_ranges = dma_buf->nr_ranges;
	priv->provider = provider;
	return 0;
}
Jason
On Fri, Oct 17, 2025 at 10:02:49AM -0300, Jason Gunthorpe wrote:
On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
+static void dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
struct vfio_device_feature_dma_buf *dma_buf,
struct vfio_region_dma_range *dma_ranges,
struct p2pdma_provider *provider)
+{
- struct pci_dev *pdev = priv->vdev->pdev;
- phys_addr_t pci_start;
- u32 i;
- pci_start = pci_resource_start(pdev, dma_buf->region_index);
- for (i = 0; i < dma_buf->nr_ranges; i++) {
priv->phys_vec[i].len = dma_ranges[i].length;
priv->phys_vec[i].paddr = pci_start + dma_ranges[i].offset;
priv->size += priv->phys_vec[i].len;
- }
This is missing validation; userspace can pass in any length/offset, but the resource is of limited size. Like this:
static int dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv,
				  struct vfio_device_feature_dma_buf *dma_buf,
				  struct vfio_region_dma_range *dma_ranges,
				  struct p2pdma_provider *provider)
{
	struct pci_dev *pdev = priv->vdev->pdev;
	phys_addr_t len = pci_resource_len(pdev, dma_buf->region_index);
	phys_addr_t pci_start;
	phys_addr_t pci_last;
	u32 i;

	if (!len)
		return -EINVAL;

	pci_start = pci_resource_start(pdev, dma_buf->region_index);
	pci_last = pci_start + len - 1;
	for (i = 0; i < dma_buf->nr_ranges; i++) {
		phys_addr_t last;

		if (!dma_ranges[i].length)
			return -EINVAL;

		if (check_add_overflow(pci_start, dma_ranges[i].offset,
				       &priv->phys_vec[i].paddr) ||
		    check_add_overflow(priv->phys_vec[i].paddr,
				       dma_ranges[i].length - 1, &last))
			return -EOVERFLOW;
		if (last > pci_last)
			return -EINVAL;

		priv->phys_vec[i].len = dma_ranges[i].length;
		priv->size += priv->phys_vec[i].len;
	}
	priv->nr_ranges = dma_buf->nr_ranges;
	priv->provider = provider;
	return 0;
}
I have these checks in validate_dmabuf_input(). Do you think that I need to add extra checks?
Thanks
Jason
On Mon, Oct 13, 2025 at 06:26:11PM +0300, Leon Romanovsky wrote:
From: Leon Romanovsky leonro@nvidia.com
Add support for exporting PCI device MMIO regions through dma-buf, enabling safe sharing of non-struct page memory with controlled lifetime management. This allows RDMA and other subsystems to import dma-buf FDs and build them into memory regions for PCI P2P operations.
I was looking at how to address Alex's note about not all drivers being compatible, and how to enable the non-compatible drivers.
It looks like the simplest thing is to make dma_ranges_to_p2p_phys into an op and have the driver provide it. If it is not provided then there is no support.
Drivers with special needs can fill in phys in their own way and get their own provider.
Sort of like this:
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index ac10f14417f2f3..6d41cf26b53994 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -147,6 +147,10 @@ static const struct vfio_device_ops vfio_pci_ops = { .pasid_detach_ioas = vfio_iommufd_physical_pasid_detach_ioas, };
+static const struct vfio_pci_device_ops vfio_pci_dev_ops = { + .get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys, +}; + static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) { struct vfio_pci_core_device *vdev; @@ -161,6 +165,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) return PTR_ERR(vdev);
dev_set_drvdata(&pdev->dev, vdev); + vdev->pci_ops = &vfio_pci_dev_ops; ret = vfio_pci_core_register_device(vdev); if (ret) goto out_put_vdev; diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c index 358856e6b8a820..dad880781a9352 100644 --- a/drivers/vfio/pci/vfio_pci_dmabuf.c +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c @@ -309,47 +309,52 @@ int vfio_pci_dma_buf_iommufd_map(struct dma_buf_attachment *attachment, } EXPORT_SYMBOL_GPL(vfio_pci_dma_buf_iommufd_map);
-static int dma_ranges_to_p2p_phys(struct vfio_pci_dma_buf *priv, - struct vfio_device_feature_dma_buf *dma_buf, +int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev, + struct p2pdma_provider **provider, + unsigned int region_index, + struct phys_vec *phys_vec, struct vfio_region_dma_range *dma_ranges, - struct p2pdma_provider *provider) + size_t nr_ranges) { - struct pci_dev *pdev = priv->vdev->pdev; - phys_addr_t len = pci_resource_len(pdev, dma_buf->region_index); + struct pci_dev *pdev = vdev->pdev; + phys_addr_t len = pci_resource_len(pdev, region_index); phys_addr_t pci_start; phys_addr_t pci_last; u32 i;
if (!len) return -EINVAL; - pci_start = pci_resource_start(pdev, dma_buf->region_index); + + *provider = pcim_p2pdma_provider(pdev, region_index); + if (!*provider) + return -EINVAL; + + pci_start = pci_resource_start(pdev, region_index); pci_last = pci_start + len - 1; - for (i = 0; i < dma_buf->nr_ranges; i++) { + for (i = 0; i < nr_ranges; i++) { phys_addr_t last;
if (!dma_ranges[i].length) return -EINVAL;
if (check_add_overflow(pci_start, dma_ranges[i].offset, - &priv->phys_vec[i].paddr) || - check_add_overflow(priv->phys_vec[i].paddr, + &phys_vec[i].paddr) || + check_add_overflow(phys_vec[i].paddr, dma_ranges[i].length - 1, &last)) return -EOVERFLOW; if (last > pci_last) return -EINVAL;
- priv->phys_vec[i].len = dma_ranges[i].length; - priv->size += priv->phys_vec[i].len; + phys_vec[i].len = dma_ranges[i].length; } - priv->nr_ranges = dma_buf->nr_ranges; - priv->provider = provider; return 0; } +EXPORT_SYMBOL_GPL(vfio_pci_core_get_dmabuf_phys);
static int validate_dmabuf_input(struct vfio_pci_core_device *vdev, struct vfio_device_feature_dma_buf *dma_buf, struct vfio_region_dma_range *dma_ranges, - struct p2pdma_provider **provider) + size_t *lengthp) { struct pci_dev *pdev = vdev->pdev; u32 bar = dma_buf->region_index; @@ -365,10 +370,6 @@ static int validate_dmabuf_input(struct vfio_pci_core_device *vdev, if (bar >= VFIO_PCI_ROM_REGION_INDEX) return -ENODEV;
- *provider = pcim_p2pdma_provider(pdev, bar); - if (!*provider) - return -EINVAL; - bar_size = pci_resource_len(pdev, bar); for (i = 0; i < dma_buf->nr_ranges; i++) { u64 offset = dma_ranges[i].offset; @@ -397,6 +398,7 @@ static int validate_dmabuf_input(struct vfio_pci_core_device *vdev, if (overflows_type(length, size_t) || length & DMA_IOVA_USE_SWIOTLB) return -EINVAL;
+ *lengthp = length; return 0; }
@@ -407,10 +409,13 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, struct vfio_device_feature_dma_buf get_dma_buf = {}; struct vfio_region_dma_range *dma_ranges; DEFINE_DMA_BUF_EXPORT_INFO(exp_info); - struct p2pdma_provider *provider; struct vfio_pci_dma_buf *priv; + size_t length; int ret;
+ if (!vdev->pci_ops->get_dmabuf_phys) + return -EOPNOTSUPP; + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET, sizeof(get_dma_buf)); if (ret != 1) @@ -427,7 +432,7 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, if (IS_ERR(dma_ranges)) return PTR_ERR(dma_ranges);
- ret = validate_dmabuf_input(vdev, &get_dma_buf, dma_ranges, &provider); + ret = validate_dmabuf_input(vdev, &get_dma_buf, dma_ranges, &length); if (ret) goto err_free_ranges;
@@ -444,10 +449,16 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags, }
priv->vdev = vdev; - ret = dma_ranges_to_p2p_phys(priv, &get_dma_buf, dma_ranges, provider); + priv->nr_ranges = get_dma_buf.nr_ranges; + priv->size = length; + ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider, + get_dma_buf.region_index, + priv->phys_vec, dma_ranges, + priv->nr_ranges); if (ret) goto err_free_phys;
+ kfree(dma_ranges); dma_ranges = NULL;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index 37ce02e30c7632..4ea2095381eb24 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -26,6 +26,7 @@
struct vfio_pci_core_device; struct vfio_pci_region; +struct p2pdma_provider;
struct vfio_pci_regops { ssize_t (*rw)(struct vfio_pci_core_device *vdev, char __user *buf, @@ -49,9 +50,26 @@ struct vfio_pci_region { u32 flags; };
+struct vfio_pci_device_ops { + int (*get_dmabuf_phys)(struct vfio_pci_core_device *vdev, + struct p2pdma_provider **provider, + unsigned int region_index, + struct phys_vec *phys_vec, + struct vfio_region_dma_range *dma_ranges, + size_t nr_ranges); +}; + +int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev, + struct p2pdma_provider **provider, + unsigned int region_index, + struct phys_vec *phys_vec, + struct vfio_region_dma_range *dma_ranges, + size_t nr_ranges); + struct vfio_pci_core_device { struct vfio_device vdev; struct pci_dev *pdev; + const struct vfio_pci_device_ops *pci_ops; void __iomem *barmap[PCI_STD_NUM_BARS]; bool bar_mmap_supported[PCI_STD_NUM_BARS]; u8 *pci_config_map;