WW mutexes and dma-resv objects, which embed them, typically have a
number of locks belonging to the same lock class. However,
code using them typically wants to verify locking at
object granularity, not lock-class granularity.
This series adds ww_mutex functions to facilitate that
(patch 1) and uses these functions in the dma-resv lock
checks.
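As a rough illustration of the direction (helper names here are
illustrative and need not match patch 1; note this only checks that the
lock is taken, not who holds it, whereas per-lock helpers would
presumably also verify the owner):

static inline bool ww_mutex_is_locked(struct ww_mutex *lock)
{
        /* a ww_mutex wraps a regular mutex in its ->base field */
        return mutex_is_locked(&lock->base);
}

/* hypothetical per-object check built on top of it */
#define dma_resv_assert_locked(obj) \
        WARN_ON_ONCE(!ww_mutex_is_locked(&(obj)->lock))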
Thomas Hellström (2):
kernel/locking/ww_mutex: Add per-lock lock-check helpers
dma-buf/dma-resv: Improve the dma-resv lockdep checks
include/linux/dma-resv.h | 7 +++++--
include/linux/ww_mutex.h | 18 ++++++++++++++++++
kernel/locking/mutex.c | 10 ++++++++++
3 files changed, 33 insertions(+), 2 deletions(-)
--
2.51.1
Changelog:
v9:
* Added Reviewed-by tags.
* Fixes to p2pdma documentation.
* Renamed dma_buf_map and unmap.
* Moved them to separate file.
* Used nvgrace_gpu_memregion() function instead of open-coded variant.
* Paired get_file_active() with fput().
v8: https://patch.msgid.link/20251111-dmabuf-vfio-v8-0-fd9aa5df478f@nvidia.com
* Fixed spelling errors in p2pdma documentation file.
* Added vdev->pci_ops check for NULL in vfio_pci_core_feature_dma_buf().
* Simplified the nvgrace_get_dmabuf_phys() function.
* Added extra check in pcim_p2pdma_provider() to catch missing call
to pcim_p2pdma_init().
v7: https://patch.msgid.link/20251106-dmabuf-vfio-v7-0-2503bf390699@nvidia.com
* Dropped restore_revoke flag and added vfio_pci_dma_buf_move
to reverse loop.
* Fixed spelling errors in documentation patch.
* Rebased on top of v6.18-rc3.
 * Added an include of stddef.h to vfio.h, to keep the uapi header file independent.
v6: https://patch.msgid.link/20251102-dmabuf-vfio-v6-0-d773cff0db9f@nvidia.com
* Fixed wrong error check from pcim_p2pdma_init().
* Documented pcim_p2pdma_provider() function.
* Improved commit messages.
* Added VFIO DMA-BUF selftest, not sent yet.
* Added __counted_by(nr_ranges) annotation to struct vfio_device_feature_dma_buf.
* Fixed error unwind when dma_buf_fd() fails.
* Document latest changes to p2pmem.
* Removed EXPORT_SYMBOL_GPL from pci_p2pdma_map_type.
* Moved DMA mapping logic to DMA-BUF.
* Removed types patch to avoid dependencies between subsystems.
 * Moved vfio_pci_dma_buf_move() into the err_undo block.
* Added nvgrace patch.
v5: https://lore.kernel.org/all/cover.1760368250.git.leon@kernel.org
* Rebased on top of v6.18-rc1.
* Added more validation logic to make sure that DMA-BUF length doesn't
overflow in various scenarios.
 * Hid the kernel config from users.
* Fixed type conversion issue. DMA ranges are exposed with u64 length,
but DMA-BUF uses "unsigned int" as a length for SG entries.
 * Added a check to prevent VFIO drivers that report a BAR size
   different from PCI from using the DMA-BUF functionality.
v4: https://lore.kernel.org/all/cover.1759070796.git.leon@kernel.org
 * Split pcim_p2pdma_provider() into two functions, one that initializes
   the array of providers and another that returns the right provider pointer.
v3: https://lore.kernel.org/all/cover.1758804980.git.leon@kernel.org
* Changed pcim_p2pdma_enable() to be pcim_p2pdma_provider().
* Cache provider in vfio_pci_dma_buf struct instead of BAR index.
* Removed misleading comment from pcim_p2pdma_provider().
* Moved MMIO check to be in pcim_p2pdma_provider().
v2: https://lore.kernel.org/all/cover.1757589589.git.leon@kernel.org/
 * Added extra patch which adds new CONFIG, so next patches can
   reuse it.
* Squashed "PCI/P2PDMA: Remove redundant bus_offset from map state"
into the other patch.
* Fixed revoke calls to be aligned with true->false semantics.
 * Extended p2pdma_providers to be per-BAR and not global to the
   whole device.
* Fixed possible race between dmabuf states and revoke.
* Moved revoke to PCI BAR zap block.
v1: https://lore.kernel.org/all/cover.1754311439.git.leon@kernel.org
* Changed commit messages.
* Reused DMA_ATTR_MMIO attribute.
 * Returned support for multiple DMA ranges per-DMABUF.
v0: https://lore.kernel.org/all/cover.1753274085.git.leonro@nvidia.com
---------------------------------------------------------------------------
Based on "[PATCH v6 00/16] dma-mapping: migrate to physical address-based API"
https://lore.kernel.org/all/cover.1757423202.git.leonro@nvidia.com/ series.
---------------------------------------------------------------------------
This series extends the VFIO PCI subsystem to support exporting MMIO
regions from PCI device BARs as dma-buf objects, enabling safe sharing of
non-struct page memory with controlled lifetime management. This allows RDMA
and other subsystems to import dma-buf FDs and build them into memory regions
for PCI P2P operations.
The series supports an SPDK use case where an NVMe device is owned by
SPDK through VFIO while interacting with an RDMA device. The RDMA
device may directly access the NVMe CMB or directly manipulate the NVMe
device's doorbell using PCI P2P.
However, as a general mechanism, it can support many other scenarios with
VFIO. This dmabuf approach is usable by iommufd as well for generic
and safe P2P mappings.
In addition to the SPDK use case mentioned above, the capability added
in this patch series can also be useful when a buffer (located in device
memory such as VRAM) needs to be shared between any two dGPU devices or
instances (assuming one of them is bound to VFIO PCI) as long as they
are P2P DMA compatible.
The implementation provides a revocable attachment mechanism using dma-buf
move operations. MMIO regions are normally pinned as BARs don't change
physical addresses, but access is revoked when the VFIO device is closed
or a PCI reset is issued. This ensures kernel self-defense against
potentially hostile userspace.
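In dma-buf terms, that revoke is the standard move mechanism. A minimal
sketch of the exporter-side flow (the function name is illustrative, not
the exact code from this series):

static void vfio_pci_dma_buf_revoke(struct dma_buf *dmabuf)
{
        dma_resv_lock(dmabuf->resv, NULL);
        /* runs every attachment's move_notify() so importers drop
         * their mappings before the MMIO goes away */
        dma_buf_move_notify(dmabuf);
        dma_resv_unlock(dmabuf->resv);
}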
The series includes significant refactoring of the PCI P2PDMA subsystem
to separate core P2P functionality from memory allocation features,
making it more modular and suitable for VFIO use cases that don't need
struct page support.
-----------------------------------------------------------------------
The series was originally based on
https://lore.kernel.org/all/20250307052248.405803-1-vivek.kasireddy@intel.c…
but has been heavily rewritten around the physical address-based DMA API.
-----------------------------------------------------------------------
The WIP branch can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=…
Thanks
---
Jason Gunthorpe (2):
PCI/P2PDMA: Document DMABUF model
vfio/nvgrace: Support get_dmabuf_phys
Leon Romanovsky (7):
PCI/P2PDMA: Separate the mmap() support from the core logic
PCI/P2PDMA: Simplify bus address mapping API
PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
dma-buf: provide phys_vec to scatter-gather mapping routine
vfio/pci: Enable peer-to-peer DMA transactions by default
vfio/pci: Add dma-buf export support for MMIO regions
Vivek Kasireddy (2):
vfio: Export vfio device get and put registration helpers
vfio/pci: Share the core device pointer while invoking feature functions
Documentation/driver-api/pci/p2pdma.rst | 97 +++++++---
block/blk-mq-dma.c | 2 +-
drivers/dma-buf/Makefile | 2 +-
drivers/dma-buf/dma-buf-mapping.c | 248 +++++++++++++++++++++++++
drivers/iommu/dma-iommu.c | 4 +-
drivers/pci/p2pdma.c | 186 ++++++++++++++-----
drivers/vfio/pci/Kconfig | 3 +
drivers/vfio/pci/Makefile | 1 +
drivers/vfio/pci/nvgrace-gpu/main.c | 52 ++++++
drivers/vfio/pci/vfio_pci.c | 5 +
drivers/vfio/pci/vfio_pci_config.c | 22 ++-
drivers/vfio/pci/vfio_pci_core.c | 53 ++++--
drivers/vfio/pci/vfio_pci_dmabuf.c | 316 ++++++++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_priv.h | 23 +++
drivers/vfio/vfio_main.c | 2 +
include/linux/dma-buf-mapping.h | 17 ++
include/linux/dma-buf.h | 11 ++
include/linux/pci-p2pdma.h | 120 +++++++-----
include/linux/vfio.h | 2 +
include/linux/vfio_pci_core.h | 42 +++++
include/uapi/linux/vfio.h | 28 +++
kernel/dma/direct.c | 4 +-
mm/hmm.c | 2 +-
23 files changed, 1101 insertions(+), 141 deletions(-)
---
base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa
change-id: 20251016-dmabuf-vfio-6cef732adf5a
Best regards,
--
Leon Romanovsky <leonro(a)nvidia.com>
Hi Pierre-Eric,
On Thu, 2025-11-13 at 17:05 +0100, Pierre-Eric Pelloux-Prayer wrote:
> Until now ttm stored a single pipelined eviction fence which means
> drivers had to use a single entity for these evictions.
>
> To lift this requirement, this commit allows up to 8 entities to
> be used.
>
> Ideally a dma_resv object would have been used as a container of
> the eviction fences, but the locking rules makes it complex.
> dma_resv objects all have the same ww_class, which means "Attempting to
> lock more mutexes after ww_acquire_done." is an error.
>
> One alternative considered was to introduce a 2nd ww_class for
> specific resv to hold a single "transient" lock (= the resv lock
> would only be held for a short period, without taking any other
> locks).
Wouldn't it be possible to use lockdep_set_class_and_name() to modify
the resv lock class for these particular resv objects after they are
allocated? Reusing the resv code certainly sounds attractive.
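Roughly something like the below, assuming the transient resv never
nests with ordinary resv locks (variable names are just for
illustration):

static struct lock_class_key transient_resv_key;

        dma_resv_init(&evict_resv);
        /* move this particular resv out of the shared ww_class
         * as far as lockdep is concerned */
        lockdep_set_class_and_name(&evict_resv.lock.base,
                                   &transient_resv_key, "transient_resv");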
Thanks,
Thomas
On Wed, Nov 19, 2025 at 12:02:02AM +0000, Tian, Kevin wrote:
> > From: Keith Busch <kbusch(a)kernel.org>
> > Sent: Wednesday, November 19, 2025 4:19 AM
> >
> > On Tue, Nov 18, 2025 at 07:18:36AM +0000, Tian, Kevin wrote:
> > > > From: Leon Romanovsky <leon(a)kernel.org>
> > > > Sent: Tuesday, November 11, 2025 5:58 PM
> > > >
> > > > From: Leon Romanovsky <leonro(a)nvidia.com>
> > >
> > > not required with only your own s-o-b
> >
> > That's automatically appended when the sender and signer don't match.
> > It's not uncommon for developers to send from a kernel.org email but
> > sign off with a corporate account, or the other way around.
>
> Good to know.
Yes. In addition, I tend to separate code authorship from my
open-source activity. The code belongs to my employer, which is why a
corporate address is used for authorship, while all emails and
communications come from my kernel.org account.
Thanks
On Wed, Nov 19, 2025 at 05:54:55AM +0000, Tian, Kevin wrote:
> > From: Leon Romanovsky <leon(a)kernel.org>
> > Sent: Tuesday, November 11, 2025 5:58 PM
> > +
> > + if (dma->state && dma_use_iova(dma->state)) {
> > + WARN_ON_ONCE(mapped_len != size);
>
> then "goto err_unmap_dma".
It should never happen; there is no need to provide error unwind for
something that can't occur.
>
> Reviewed-by: Kevin Tian <kevin.tian(a)intel.com>
Thanks
On Tue, Nov 18, 2025 at 04:06:11PM -0800, Nicolin Chen wrote:
> On Tue, Nov 11, 2025 at 11:57:48AM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro(a)nvidia.com>
> >
> > Add dma_buf_map() and dma_buf_unmap() helpers to convert an array of
> > MMIO physical address ranges into scatter-gather tables with proper
> > DMA mapping.
> >
> > These common functions are a starting point and support any PCI
> > drivers creating mappings from their BAR's MMIO addresses. VFIO is one
> > case, as shortly will be RDMA. We can review existing DRM drivers to
> > refactor them separately. We hope this will evolve into routines to
> > help common DRM that include mixed CPU and MMIO mappings.
> >
> > Compared to the dma_map_resource() abuse this implementation handles
> > the complicated PCI P2P scenarios properly, especially when an IOMMU
> > is enabled:
> >
> > - Direct bus address mapping without IOVA allocation for
> > PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
> > happens if the IOMMU is enabled but the PCIe switch ACS flags allow
> > transactions to avoid the host bridge.
> >
> > Further, this handles the slightly obscure case of MMIO with a
> > phys_addr_t that is different from the physical BAR programming
> > (bus offset). The phys_addr_t is converted to a dma_addr_t and
> > accommodates this effect. This enables certain real systems to
> > work, especially on ARM platforms.
> >
> > - Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
> > attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
> > This happens when the IOMMU is enabled and the ACS flags are forcing
> > all traffic to the IOMMU - ie for virtualization systems.
> >
> > - Cases where P2P is not supported through the host bridge/CPU. The
> > P2P subsystem is the proper place to detect this and block it.
> >
> > Helper functions fill_sg_entry() and calc_sg_nents() handle the
> > scatter-gather table construction, splitting large regions into
> > UINT_MAX-sized chunks to fit within sg->length field limits.
> >
> > Since the physical address based DMA API forbids use of the CPU list
> > of the scatterlist this will produce a mangled scatterlist that has
> > a fully zero-length and NULL'd CPU list. The list is 0 length,
> > all the struct page pointers are NULL and zero sized. This is stronger
> > and more robust than the existing mangle_sg_table() technique. It is
> > a future project to migrate DMABUF as a subsystem away from using
> > scatterlist for this data structure.
> >
> > Tested-by: Alex Mastro <amastro(a)fb.com>
> > Tested-by: Nicolin Chen <nicolinc(a)nvidia.com>
> > Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com>
>
> Reviewed-by: Nicolin Chen <nicolinc(a)nvidia.com>
>
> With a nit:
>
> > +err_unmap_dma:
> > + if (!i || !dma->state) {
> > + ; /* Do nothing */
> > + } else if (dma_use_iova(dma->state)) {
> > + dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
> > + DMA_ATTR_MMIO);
> > + } else {
> > + for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
> > + dma_unmap_phys(attach->dev, sg_dma_address(sgl),
> > + sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
>
> Would it be safer to skip dma_unmap_phys() for the range [i, nents)?
[i, nents) is not supposed to be in the SG list which we are iterating.
Thanks
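The UINT_MAX splitting described in the commit message above has roughly
this shape (an illustrative sketch, not the exact helpers from the
patch):

static unsigned int calc_sg_nents(u64 length)
{
        /* sg->length is an unsigned int, so cap each entry at UINT_MAX */
        return DIV_ROUND_UP_ULL(length, UINT_MAX);
}

static struct scatterlist *fill_sg_entry(struct scatterlist *sgl,
                                         u64 length, dma_addr_t addr)
{
        while (length) {
                unsigned int len = min_t(u64, length, UINT_MAX);

                /* DMA-only entries: no CPU addresses, no struct pages */
                sg_dma_address(sgl) = addr;
                sg_dma_len(sgl) = len;
                addr += len;
                length -= len;
                sgl = sg_next(sgl);
        }
        return sgl;
}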
On Thu, Nov 13, 2025 at 04:05:09PM -0800, Nicolin Chen wrote:
> > -struct iopt_pages *iopt_alloc_file_pages(struct file *file, unsigned long start,
> > +struct iopt_pages *iopt_alloc_file_pages(struct file *file,
> > + unsigned long start_byte,
> > + unsigned long start,
> > unsigned long length, bool writable);
>
> Passing in start_byte looks like a cleanup to me, aligning with
> what iopt_map_common() has.
> Since we are doing this cleanup, maybe we could follow the same
> sequence: xxx, start, length, start_byte, writable?
??
static int iopt_map_common(struct iommufd_ctx *ictx, struct io_pagetable *iopt,
struct iopt_pages *pages, unsigned long *iova,
unsigned long length, unsigned long start_byte,
int iommu_prot, unsigned int flags)
Those are not the same arguments; we don't pass start and start_byte there?
Jason