Changelog:
v8:
* Fixed spelling errors in p2pdma documentation file.
* Added vdev->pci_ops check for NULL in vfio_pci_core_feature_dma_buf().
* Simplified the nvgrace_get_dmabuf_phys() function.
* Added extra check in pcim_p2pdma_provider() to catch missing call
to pcim_p2pdma_init().
v7: https://patch.msgid.link/20251106-dmabuf-vfio-v7-0-2503bf390699@nvidia.com
* Dropped restore_revoke flag and added vfio_pci_dma_buf_move
to reverse loop.
* Fixed spelling errors in documentation patch.
* Rebased on top of v6.18-rc3.
* Added include to stddef.h to vfio.h, to keep uapi header file independent.
v6: https://patch.msgid.link/20251102-dmabuf-vfio-v6-0-d773cff0db9f@nvidia.com
* Fixed wrong error check from pcim_p2pdma_init().
* Documented pcim_p2pdma_provider() function.
* Improved commit messages.
* Added VFIO DMA-BUF selftest, not sent yet.
* Added __counted_by(nr_ranges) annotation to struct vfio_device_feature_dma_buf.
* Fixed error unwind when dma_buf_fd() fails.
* Documented latest changes to p2pmem.
* Removed EXPORT_SYMBOL_GPL from pci_p2pdma_map_type.
* Moved DMA mapping logic to DMA-BUF.
* Removed types patch to avoid dependencies between subsystems.
* Moved vfio_pci_dma_buf_move() in err_undo block.
* Added nvgrace patch.
v5: https://lore.kernel.org/all/cover.1760368250.git.leon@kernel.org
* Rebased on top of v6.18-rc1.
* Added more validation logic to make sure that DMA-BUF length doesn't
overflow in various scenarios.
* Hid the kernel config option from users.
* Fixed type conversion issue. DMA ranges are exposed with u64 length,
but DMA-BUF uses "unsigned int" as a length for SG entries.
* Added a check so that VFIO drivers which report a BAR size different
  from the PCI BAR size cannot use the DMA-BUF functionality.
v4: https://lore.kernel.org/all/cover.1759070796.git.leon@kernel.org
* Split pcim_p2pdma_provider() into two functions: one that initializes
  the array of providers and another that returns the right provider pointer.
v3: https://lore.kernel.org/all/cover.1758804980.git.leon@kernel.org
* Renamed pcim_p2pdma_enable() to pcim_p2pdma_provider().
* Cached the provider in the vfio_pci_dma_buf struct instead of the BAR index.
* Removed misleading comment from pcim_p2pdma_provider().
* Moved MMIO check to be in pcim_p2pdma_provider().
v2: https://lore.kernel.org/all/cover.1757589589.git.leon@kernel.org/
* Added an extra patch which adds a new CONFIG option, so subsequent
  patches can reuse it.
* Squashed "PCI/P2PDMA: Remove redundant bus_offset from map state"
into the other patch.
* Fixed revoke calls to be aligned with true->false semantics.
* Extended p2pdma_providers to be per-BAR instead of global to the
  whole device.
* Fixed possible race between dmabuf states and revoke.
* Moved revoke to PCI BAR zap block.
v1: https://lore.kernel.org/all/cover.1754311439.git.leon@kernel.org
* Changed commit messages.
* Reused DMA_ATTR_MMIO attribute.
* Restored support for multiple DMA ranges per DMA-BUF.
v0: https://lore.kernel.org/all/cover.1753274085.git.leonro@nvidia.com
---------------------------------------------------------------------------
Based on the "[PATCH v6 00/16] dma-mapping: migrate to physical address-based API" series:
https://lore.kernel.org/all/cover.1757423202.git.leonro@nvidia.com/
---------------------------------------------------------------------------
This series extends the VFIO PCI subsystem to support exporting MMIO
regions from PCI device BARs as dma-buf objects, enabling safe sharing of
non-struct page memory with controlled lifetime management. This allows RDMA
and other subsystems to import dma-buf FDs and build them into memory regions
for PCI P2P operations.
The series supports a use case for SPDK where an NVMe device is owned by
SPDK through VFIO while interacting with an RDMA device. The RDMA device
may directly access the NVMe CMB or directly manipulate the NVMe device's
doorbell using PCI P2P.

However, as a general mechanism, it can support many other scenarios with
VFIO. This dma-buf approach is also usable by iommufd for generic and safe
P2P mappings.
In addition to the SPDK use-case mentioned above, the capability added
in this patch series can also be useful when a buffer (located in device
memory such as VRAM) needs to be shared between any two dGPU devices or
instances (assuming one of them is bound to VFIO PCI) as long as they
are P2P DMA compatible.
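To give a feel for the userspace flow, here is a rough sketch of creating a
dma-buf FD for a BAR range through the new VFIO feature. The struct, field
and flag names below are illustrative only (based on the changelog's mention
of struct vfio_device_feature_dma_buf and nr_ranges); the uapi header in the
series is authoritative, including how the new FD is returned. device_fd is
an already-open VFIO device FD:

	/*
	 * Illustrative sketch only: export the first 4 KiB of BAR 0 as a
	 * dma-buf FD via VFIO_DEVICE_FEATURE. Names are assumed, not copied
	 * from the series; check include/uapi/linux/vfio.h in the patches.
	 */
	char buf[sizeof(struct vfio_device_feature) +
		 sizeof(struct vfio_device_feature_dma_buf) +
		 sizeof(struct vfio_region_dma_range)] = { 0 };
	struct vfio_device_feature *feat = (void *)buf;
	struct vfio_device_feature_dma_buf *get = (void *)feat->data;
	int dmabuf_fd;

	feat->argsz = sizeof(buf);
	feat->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF;
	get->region_index = 0;			/* BAR 0 */
	get->open_flags = O_RDWR | O_CLOEXEC;
	get->nr_ranges = 1;
	get->dma_ranges[0].offset = 0;
	get->dma_ranges[0].length = 4096;

	/* Assumed: on success the ioctl returns the new dma-buf FD. */
	dmabuf_fd = ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);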
The implementation provides a revocable attachment mechanism using dma-buf
move operations. MMIO regions are normally pinned as BARs don't change
physical addresses, but access is revoked when the VFIO device is closed
or a PCI reset is issued. This ensures kernel self-defense against
potentially hostile userspace.
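For importers, the revoke shows up as a dma-buf move. Below is a minimal
importer-side sketch, not taken from this series, of how a dynamic
attachment reacts to it; my_importer, my_stop_dma and my_attach are made-up
placeholders, only the dma-buf calls are the real in-tree API:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

struct my_importer {
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
};

static void my_stop_dma(struct my_importer *imp);	/* hypothetical quiesce */

/* Called by the exporter with the dma-buf reservation lock already held. */
static void my_move_notify(struct dma_buf_attachment *attach)
{
	struct my_importer *imp = attach->importer_priv;

	my_stop_dma(imp);
	dma_buf_unmap_attachment(attach, imp->sgt, DMA_BIDIRECTIONAL);
	imp->sgt = NULL;
}

static const struct dma_buf_attach_ops my_attach_ops = {
	.allow_peer2peer = true,	/* required to accept P2P/MMIO backing */
	.move_notify	 = my_move_notify,
};

static int my_attach(struct my_importer *imp, struct dma_buf *dmabuf,
		     struct device *dma_dev)
{
	imp->attach = dma_buf_dynamic_attach(dmabuf, dma_dev, &my_attach_ops,
					     imp);
	if (IS_ERR(imp->attach))
		return PTR_ERR(imp->attach);

	/* Dynamic attachments map and unmap under the reservation lock. */
	dma_resv_lock(dmabuf->resv, NULL);
	imp->sgt = dma_buf_map_attachment(imp->attach, DMA_BIDIRECTIONAL);
	dma_resv_unlock(dmabuf->resv);

	return PTR_ERR_OR_ZERO(imp->sgt);
}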
The series includes significant refactoring of the PCI P2PDMA subsystem
to separate core P2P functionality from memory allocation features,
making it more modular and suitable for VFIO use cases that don't need
struct page support.
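From a driver's point of view the refactored provider API boils down to
something like the sketch below, pieced together from the changelog; exact
names, signatures and error conventions may differ in the actual patches:

#include <linux/pci-p2pdma.h>

/*
 * Hedged sketch: pcim_p2pdma_init() sets up the per-BAR providers once,
 * and pcim_p2pdma_provider() looks up the provider for one BAR without
 * any struct page / allocator involvement.
 */
static int my_enable_p2p(struct pci_dev *pdev, int bar)
{
	struct p2pdma_provider *provider;
	int ret;

	ret = pcim_p2pdma_init(pdev);	/* devres-managed, once per device */
	if (ret)
		return ret;

	provider = pcim_p2pdma_provider(pdev, bar);	/* per-BAR lookup */
	if (!provider)		/* e.g. non-MMIO BAR or missing init */
		return -EINVAL;

	/* The provider is later used to classify and map P2P DMA to the BAR. */
	return 0;
}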
-----------------------------------------------------------------------
The series was originally based on
https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.c…
but was heavily rewritten around the physical address-based DMA API.
-----------------------------------------------------------------------
The WIP branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=…
Thanks
---
Jason Gunthorpe (2):
PCI/P2PDMA: Document DMABUF model
vfio/nvgrace: Support get_dmabuf_phys
Leon Romanovsky (7):
PCI/P2PDMA: Separate the mmap() support from the core logic
PCI/P2PDMA: Simplify bus address mapping API
PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
dma-buf: provide phys_vec to scatter-gather mapping routine
vfio/pci: Enable peer-to-peer DMA transactions by default
vfio/pci: Add dma-buf export support for MMIO regions
Vivek Kasireddy (2):
vfio: Export vfio device get and put registration helpers
vfio/pci: Share the core device pointer while invoking feature functions
Documentation/driver-api/pci/p2pdma.rst | 95 +++++++---
block/blk-mq-dma.c | 2 +-
drivers/dma-buf/dma-buf.c | 235 ++++++++++++++++++++++++
drivers/iommu/dma-iommu.c | 4 +-
drivers/pci/p2pdma.c | 186 ++++++++++++++-----
drivers/vfio/pci/Kconfig | 3 +
drivers/vfio/pci/Makefile | 1 +
drivers/vfio/pci/nvgrace-gpu/main.c | 56 ++++++
drivers/vfio/pci/vfio_pci.c | 5 +
drivers/vfio/pci/vfio_pci_config.c | 22 ++-
drivers/vfio/pci/vfio_pci_core.c | 53 ++++--
drivers/vfio/pci/vfio_pci_dmabuf.c | 315 ++++++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_priv.h | 23 +++
drivers/vfio/vfio_main.c | 2 +
include/linux/dma-buf.h | 18 ++
include/linux/pci-p2pdma.h | 120 +++++++-----
include/linux/vfio.h | 2 +
include/linux/vfio_pci_core.h | 42 +++++
include/uapi/linux/vfio.h | 28 +++
kernel/dma/direct.c | 4 +-
mm/hmm.c | 2 +-
21 files changed, 1078 insertions(+), 140 deletions(-)
---
base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa
change-id: 20251016-dmabuf-vfio-6cef732adf5a
Best regards,
--
Leon Romanovsky <leonro(a)nvidia.com>
On Thu, Nov 13, 2025 at 04:05:09PM -0800, Nicolin Chen wrote:
> > -struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start,
> > +struct iopt_pages *iopt_alloc_file_pages(struct file *file,
> > + unsigned long start_byte,
> > + unsigned long start,
> > unsigned long length, bool writable);
>
> Passing in start_byte looks like a cleanup to me, aligning with
> what iopt_map_common() has.
> Since we are doing this cleanup, maybe we could follow the same
> sequence: xxx, start, length, start_byte, writable?
??
static int iopt_map_common(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
struct iopt_pages *pages, unsigned long *iova,
unsigned long length, unsigned long start_byte,
int iommu_prot, unsigned int flags)
Not the same argument list; we don't pass both start and start_byte there?
Jason
Hi, Pierre-Eric
On Thu, 2025-11-13 at 17:05 +0100, Pierre-Eric Pelloux-Prayer wrote:
> Until now ttm stored a single pipelined eviction fence which means
> drivers had to use a single entity for these evictions.
>
> To lift this requirement, this commit allows up to 8 entities to
> be used.
>
> Ideally a dma_resv object would have been used as a container of
> the eviction fences, but the locking rules make it complex.
> dma_resv objects all have the same ww_class, which means "Attempting to
> lock more mutexes after ww_acquire_done." is an error.
>
> One alternative considered was to introduce a 2nd ww_class for
> specific resv to hold a single "transient" lock (= the resv lock
> would only be held for a short period, without taking any other
> locks).
Wouldn't it be possible to use lockdep_set_class_and_name() to modify
the resv lock class for these particular resv objects after they are
allocated? Reusing the resv code certainly sounds attractive.
Thanks,
Thomas
On Tue, Nov 18, 2025 at 05:37:59AM +0000, Kasireddy, Vivek wrote:
> Hi Jason,
>
> > Subject: Re: [PATCH 0/9] Initial DMABUF support for iommufd
> >
> > On Thu, Nov 13, 2025 at 11:37:12AM -0700, Alex Williamson wrote:
> > > > The latest series for interconnect negotiation to exchange a phys_addr is:
> > > > https://lore.kernel.org/r/20251027044712.1676175-1-
> > vivek.kasireddy(a)intel.com
> > >
> > > If this is in development, why are we pursuing a vfio specific
> > > temporary "private interconnect" here rather than building on that
> > > work? What are the gaps/barriers/timeline?
> >
> > I broadly don't expect to see an agreement on the above for probably
> Are you planning to post your SGT mapping type patches soon, so that we
> can start discussion on the design?
It is on my list, but probably not soon enough :\
I wanted to address the remarks given and I still have to conclude
some urgent things for this merge window.
> I went ahead and tested your patches and did not notice any regressions
> with my test-cases (after adding some minor fixups). I have also added/tested
> support for IOV mapping type based on your design:
> https://gitlab.freedesktop.org/Vivek/drm-tip/-/commits/dmabuf_iov_v1
Wow, that's great!
Jason
On Mon, Nov 17, 2025 at 08:36:20AM -0700, Alex Williamson wrote:
> On Tue, 11 Nov 2025 09:54:22 +0100
> Christian König <christian.koenig(a)amd.com> wrote:
>
> > On 11/10/25 21:42, Alex Williamson wrote:
> > > On Thu, 6 Nov 2025 16:16:45 +0200
> > > Leon Romanovsky <leon(a)kernel.org> wrote:
> > >
> > >> Changelog:
> > >> v7:
> > >> * Dropped restore_revoke flag and added vfio_pci_dma_buf_move
> > >> to reverse loop.
> > >> * Fixed spelling errors in documentation patch.
> > >> * Rebased on top of v6.18-rc3.
> > >> * Added include to stddef.h to vfio.h, to keep uapi header file independent.
> > >
> > > I think we're winding down on review comments. It'd be great to get
> > > p2pdma and dma-buf acks on this series. Otherwise it's been posted
> > > enough that we'll assume no objections. Thanks,
> >
> > Already have it on my TODO list to take a closer look, but no idea when that will be.
> >
> > This patch set is on place 4 or 5 on a rather long list of stuff to review/finish.
>
> Hi Christian,
>
> Gentle nudge. Leon posted v8[1] last week, which is not drawing any
> new comments. Do you foresee having time for review that I should
> still hold off merging for v6.19 a bit longer? Thanks,
I really want this merged this cycle, along with the iommufd part,
which means it needs to go into your tree by very early next week on a
shared branch so I can do the iommufd part on top.
It is the last blocking kernel piece needed to conclude the viommu support
roll-out into qemu for iommufd, which quite a lot of people have been
working on for years now.
IMHO there is nothing profound in the dmabuf patch: it was written by
the expert in the new DMA API operation and doesn't form any
troublesome API contracts. It is also the same basic code as in the
v1 from July, just moved into dmabuf .c files instead of vfio .c files at
Christoph's request.
My hope is DRM folks will pick up the baton and continue to improve
this to move other drivers away from dma_map_resource(). Simona told
me people have wanted DMA API improvements for ages, now we have them,
now is the time!
Any remarks after the fact can be addressed incrementally.
If there are no concrete technical remarks please take it. 6 months is
long enough to wait for feedback.
Thanks,
Jason
On Tue, Nov 18, 2025 at 07:59:20AM +0000, Ankit Agrawal wrote:
> + if (nvdev->resmem.memlength && region_index == RESMEM_REGION_INDEX) {
> + /*
> + * The P2P properties of the non-BAR memory is the same as the
> + * BAR memory, so just use the provider for index 0. Someday
> + * when CXL gets P2P support we could create CXLish providers
> + * for the non-BAR memory.
> + */
> + mem_region = &nvdev->resmem;
> + } else if (region_index == USEMEM_REGION_INDEX) {
> + /*
> + * This is actually cachable memory and isn't treated as P2P in
> + * the chip. For now we have no way to push cachable memory
> + * through everything and the Grace HW doesn't care what caching
> + * attribute is programmed into the SMMU. So use BAR 0.
> + */
> + mem_region = &nvdev->usemem;
> + }
> +
>
> Can we replace this with nvgrace_gpu_memregion()?
Yes, it looks like it.
But we need to preserve the comments above as well somehow.
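Something like the following might work; rough, untested sketch, and the
argument order of nvgrace_gpu_memregion() plus the fate of the other region
indices are assumptions:

	/*
	 * RESMEM: the P2P properties of the non-BAR memory are the same as
	 * the BAR memory, so just use the provider for index 0. Someday when
	 * CXL gets P2P support we could create CXLish providers for it.
	 * USEMEM: actually cachable memory that isn't treated as P2P in the
	 * chip. The Grace HW doesn't care what caching attribute is
	 * programmed into the SMMU, so BAR 0's provider is used here too.
	 */
	mem_region = nvgrace_gpu_memregion(region_index, nvdev);
	/* NULL means a plain BAR index and is handled by the existing path. */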
Jason
On 11/14/25 11:50, Tvrtko Ursulin wrote:
>> @@ -569,12 +577,12 @@ void dma_fence_release(struct kref *kref)
>> spin_unlock_irqrestore(fence->lock, flags);
>> }
>> - rcu_read_unlock();
>> -
>> - if (fence->ops->release)
>> - fence->ops->release(fence);
>> + ops = rcu_dereference(fence->ops);
>> + if (ops->release)
>> + ops->release(fence);
>> else
>> dma_fence_free(fence);
>> + rcu_read_unlock();
>
> The risk is that a spinlock taken in the release callback will trigger a warning on PREEMPT_RT. But at least the current code base does not have anything like that AFAICS, so I guess it is okay.
I don't think that this is a problem. When PREEMPT_RT is enabled both RCU and spinlocks become preemptible.
So as far as I know it is perfectly valid to grab a spinlock under an rcu read side critical section.
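A tiny illustration of the point (not from the patch): with PREEMPT_RT
spinlock_t becomes a sleeping lock and RCU readers are preemptible, so the
nesting below is fine in both configurations.

	rcu_read_lock();
	spin_lock_irqsave(fence->lock, flags);
	/* ... inspect or update fence state ... */
	spin_unlock_irqrestore(fence->lock, flags);
	rcu_read_unlock();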
>> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
>> index 64639e104110..77f07735f556 100644
>> --- a/include/linux/dma-fence.h
>> +++ b/include/linux/dma-fence.h
>> @@ -66,7 +66,7 @@ struct seq_file;
>> */
>> struct dma_fence {
>> spinlock_t *lock;
>> - const struct dma_fence_ops *ops;
>> + const struct dma_fence_ops __rcu *ops;
>> /*
>> * We clear the callback list on kref_put so that by the time we
>> * release the fence it is unused. No one should be adding to the
>> @@ -218,6 +218,10 @@ struct dma_fence_ops {
>> * timed out. Can also return other error values on custom implementations,
>> * which should be treated as if the fence is signaled. For example a hardware
>> * lockup could be reported like that.
>> + *
>> + * Implementing this callback prevents the BO from detaching after
>
> s/BO/fence/
>
>> + * signaling and so it is mandatory for the module providing the
>> + * dma_fence_ops to stay loaded as long as the dma_fence exists.
>> */
>> signed long (*wait)(struct dma_fence *fence,
>> bool intr, signed long timeout);
>> @@ -229,6 +233,13 @@ struct dma_fence_ops {
>> * Can be called from irq context. This callback is optional. If it is
>> * NULL, then dma_fence_free() is instead called as the default
>> * implementation.
>> + *
>> + * Implementing this callback prevents the BO from detaching after
>
> Ditto.
Both fixed, thanks.
>
>> + * signaling and so it is mandatory for the module providing the
>> + * dma_fence_ops to stay loaded as long as the dma_fence exists.
>> + *
>> + * If the callback is implemented the memory backing the dma_fence
>> + * object must be freed RCU safe.
>> */
>> void (*release)(struct dma_fence *fence);
>> @@ -418,13 +429,19 @@ const char __rcu *dma_fence_timeline_name(struct dma_fence *fence);
>> static inline bool
>> dma_fence_is_signaled_locked(struct dma_fence *fence)
>> {
>> + const struct dma_fence_ops *ops;
>> +
>> if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>> return true;
>> - if (fence->ops->signaled && fence->ops->signaled(fence)) {
>> + rcu_read_lock();
>> + ops = rcu_dereference(fence->ops);
>> + if (ops->signaled && ops->signaled(fence)) {
>> + rcu_read_unlock();
>> dma_fence_signal_locked(fence);
>> return true;
>> }
>> + rcu_read_unlock();
>> return false;
>> }
>> @@ -448,13 +465,19 @@ dma_fence_is_signaled_locked(struct dma_fence *fence)
>> static inline bool
>> dma_fence_is_signaled(struct dma_fence *fence)
>> {
>> + const struct dma_fence_ops *ops;
>> +
>> if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
>> return true;
>> - if (fence->ops->signaled && fence->ops->signaled(fence)) {
>> + rcu_read_lock();
>> + ops = rcu_dereference(fence->ops);
>> + if (ops->signaled && ops->signaled(fence)) {
>> + rcu_read_unlock();
>
> With the unlocked version two threads could race and one could make the fence->lock go away just around here, before the dma_fence_signal below will take it. It seems it is only safe to rcu_read_unlock before signaling if using the embedded fence (later in the series). Can you think of a downside to holding the rcu read lock until after signaling? That would make it safe, I think.
Well, it's good to talk about it, but I think it is not necessary to protect the lock in this particular case.
See, the RCU protection is only for the fence->ops pointer, but the lock can be taken way after the fence is already signaled.
That's why I came up with the patch to move the lock into the fence in the first place.
Regards,
Christian.
>
> Regards,
>
> Tvrtko
>
>> dma_fence_signal(fence);
>> return true;
>> }
>> + rcu_read_unlock();
>> return false;
>> }
>