On 11/23/25 23:51, Pavel Begunkov wrote:
> Picking up the work on supporting dmabuf in the read/write path.
IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules, and after some private discussion it was considered not doable in general.
Or am I mixing something up here? Since I don't see any dma_fence implementation at all, that might actually be the case.
On the other hand, we have had direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
Regards,
Christian.
> There
> are two main changes. First, it doesn't pass a dma address directly but
> rather wraps it into an opaque structure, which is extended and
> understood by the target driver.
>
> The second big change is support for dynamic attachments, which added a
> good deal of complexity (see Patch 5). I kept the main machinery in nvme
> at first, but move_notify can ask to kill the dma mapping asynchronously,
> and any new IO would need to wait during submission, so it was moved
> to blk-mq. That also introduced an extra callback layer between the
> driver and blk-mq.
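>
> A minimal sketch of that interaction, with hypothetical names (the
> real machinery is in Patch 5 and the blk-mq layer):
>
>     struct demo_token {
>         struct work_struct kill_mapping_work;
>         wait_queue_head_t wq;
>         bool mapping_dead;
>     };
>
>     /* dma_buf move_notify: called with the reservation lock held,
>      * so don't block; tear the dma mapping down asynchronously */
>     static void demo_move_notify(struct dma_buf_attachment *attach)
>     {
>         struct demo_token *tok = attach->importer_priv;
>
>         WRITE_ONCE(tok->mapping_dead, true);
>         queue_work(system_wq, &tok->kill_mapping_work);
>     }
>
>     /* submission path: any new IO waits for a valid mapping */
>     static int demo_wait_for_mapping(struct demo_token *tok)
>     {
>         return wait_event_interruptible(tok->wq,
>                                         !READ_ONCE(tok->mapping_dead));
>     }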
>
> There are some rough corners, and I'm not perfectly happy with the
> complexity and layering. For v3 I'll try to move the waiting up the
> stack into io_uring, wrapped in library helpers.
>
> For now, I'm interested in the best way to test move_notify, and in how
> dma_resv_reserve_fences() errors should be handled in move_notify.
>
> The uapi didn't change; after registration it looks like a normal
> io_uring registered buffer and can be used as such. Only non-vectored
> fixed reads/writes are allowed. Pseudo code:
>
> // registration
> reg_buf_idx = 0;
> io_uring_update_buffer(ring, reg_buf_idx, { dma_buf_fd, file_fd });
>
> // request creation
> io_uring_prep_read_fixed(sqe, file_fd, buffer_offset,
> buffer_size, file_offset, reg_buf_idx);
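>
> A slightly fuller sketch (illustrative only: io_uring_update_buffer()
> is the new helper, shown as in the pseudo code above, the rest is
> stock liburing; fds, sizes and offsets are placeholders, error
> handling omitted):
>
>     struct io_uring ring;
>     struct io_uring_sqe *sqe;
>     struct io_uring_cqe *cqe;
>
>     io_uring_queue_init(8, &ring, 0);
>     io_uring_update_buffer(&ring, reg_buf_idx, { dma_buf_fd, file_fd });
>
>     sqe = io_uring_get_sqe(&ring);
>     /* the "addr" argument is an offset into the dmabuf, not a pointer */
>     io_uring_prep_read_fixed(sqe, file_fd, (void *)buffer_offset,
>                              buffer_size, file_offset, reg_buf_idx);
>     io_uring_submit(&ring);
>     io_uring_wait_cqe(&ring, &cqe);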
>
> As previously, a good chunk of code was taken from Keith's series [1].
>
> liburing based example:
>
> git: https://github.com/isilence/liburing.git dmabuf-rw
> link: https://github.com/isilence/liburing/tree/dmabuf-rw
>
> [1] https://lore.kernel.org/io-uring/20220805162444.3985535-1-kbusch@fb.com/
>
> Pavel Begunkov (11):
> file: add callback for pre-mapping dmabuf
> iov_iter: introduce iter type for pre-registered dma
> block: move around bio flagging helpers
> block: introduce dma token backed bio type
> block: add infra to handle dmabuf tokens
> nvme-pci: add support for dmabuf registration
> nvme-pci: implement dma_token backed requests
> io_uring/rsrc: add imu flags
> io_uring/rsrc: extended reg buffer registration
> io_uring/rsrc: add dmabuf-backed buffer registration
> io_uring/rsrc: implement dmabuf regbuf import
>
> block/Makefile | 1 +
> block/bdev.c | 14 ++
> block/bio.c | 21 +++
> block/blk-merge.c | 23 +++
> block/blk-mq-dma-token.c | 236 +++++++++++++++++++++++++++++++
> block/blk-mq.c | 20 +++
> block/blk.h | 3 +-
> block/fops.c | 3 +
> drivers/nvme/host/pci.c | 217 ++++++++++++++++++++++++++++
> include/linux/bio.h | 49 ++++---
> include/linux/blk-mq-dma-token.h | 60 ++++++++
> include/linux/blk-mq.h | 21 +++
> include/linux/blk_types.h | 8 +-
> include/linux/blkdev.h | 3 +
> include/linux/dma_token.h | 35 +++++
> include/linux/fs.h | 4 +
> include/linux/uio.h | 10 ++
> include/uapi/linux/io_uring.h | 13 +-
> io_uring/rsrc.c | 201 +++++++++++++++++++++++---
> io_uring/rsrc.h | 23 ++-
> io_uring/rw.c | 7 +-
> lib/iov_iter.c | 30 +++-
> 22 files changed, 948 insertions(+), 54 deletions(-)
> create mode 100644 block/blk-mq-dma-token.c
> create mode 100644 include/linux/blk-mq-dma-token.h
> create mode 100644 include/linux/dma_token.h
>
This series is the start of adding full DMABUF support to
iommufd. Currently it is limited to working only with VFIO's DMABUF exporter.
It sits on top of Leon's series to add a DMABUF exporter to VFIO:
https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.c…
The existing IOMMU_IOAS_MAP_FILE is enhanced to detect DMABUF fds, but
otherwise works the same as it does today for a memfd. The user can select
a slice of the FD to map into the ioas, and if the underlying alignment
requirements are met it will be placed in the iommu_domain.
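As an illustration, the call is the same as for a memfd today, just with
a dmabuf fd (struct per the existing iommufd uAPI; ids, offsets and
lengths are placeholders, error handling omitted):

	struct iommu_ioas_map_file map = {
		.size = sizeof(map),
		.flags = IOMMU_IOAS_MAP_READABLE | IOMMU_IOAS_MAP_WRITEABLE,
		.ioas_id = ioas_id,
		.fd = dmabuf_fd,	/* dmabuf fd instead of a memfd */
		.start = slice_offset,	/* slice of the FD to map */
		.length = slice_length,
	};

	ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map);
	/* map.iova holds the result unless IOMMU_IOAS_MAP_FIXED_IOVA */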
Though limited, it is enough to allow a VMM like QEMU to connect MMIO BAR
memory from VFIO to an iommu_domain controlled by iommufd. This is used
for PCI Peer to Peer support in VMs, and is the last feature of the VFIO
type 1 container that iommufd couldn't do.
The VFIO type1 version extracts raw PFNs from VMAs, which has no lifetime
control and is a use-after-free security problem.
Instead, iommufd relies on revocable DMABUFs. Whenever VFIO thinks there
should be no access to the MMIO it can shoot down the mapping in iommufd,
which will unmap it from the iommu_domain. There is no automatic remap;
this is a safety protocol so the kernel doesn't get stuck. Userspace is
expected to know it is doing something that will revoke the dmabuf and
map/unmap it around the activity. E.g. when QEMU goes to issue FLR it
should do the map/unmap to iommufd.
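Sketched as a userspace sequence with the existing ioctls (fds, iova and
length are placeholders, error handling omitted):

	/* 1. Drop the MMIO mapping before the revoking operation */
	struct iommu_ioas_unmap unmap = {
		.size = sizeof(unmap),
		.ioas_id = ioas_id,
		.iova = bar_iova,
		.length = bar_length,
	};
	ioctl(iommufd, IOMMU_IOAS_UNMAP, &unmap);

	/* 2. The operation that revokes the dmabuf, e.g. FLR */
	ioctl(device_fd, VFIO_DEVICE_RESET);

	/* 3. Re-establish the mapping; there is no automatic remap */
	ioctl(iommufd, IOMMU_IOAS_MAP_FILE, &map);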
Since DMABUF is missing some key general features for this use case, it
relies on a "private interconnect" between VFIO and iommufd via the
vfio_pci_dma_buf_iommufd_map() call.
The call confirms the DMABUF has revoke semantics and delivers a phys_addr
for the memory suitable for use with iommu_map().
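On the iommufd side this boils down to roughly the following (the
prototype of vfio_pci_dma_buf_iommufd_map() is assumed here for
illustration; the real one is in patch 1):

	phys_addr_t phys;
	int rc;

	/* confirms revoke semantics and returns the MMIO phys_addr */
	rc = vfio_pci_dma_buf_iommufd_map(attach, &phys);
	if (rc)
		return rc;

	return iommu_map(domain, iova, phys, length,
			 IOMMU_READ | IOMMU_WRITE | IOMMU_MMIO, GFP_KERNEL);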
Medium term there is a desire to expand the supported DMABUFs to include
GPU drivers to support DPDK/SPDK type use cases, so future series will
work to add a general concept of revoke and a general negotiation of
interconnect to remove vfio_pci_dma_buf_iommufd_map().
I also plan another series to modify iommufd's vfio_compat to
transparently pull a dmabuf out of a VFIO VMA to emulate more of the uAPI
of type1.
The latest series for interconnect negotiation to exchange a phys_addr is:
https://lore.kernel.org/r/20251027044712.1676175-1-vivek.kasireddy@intel.com
And the discussion for design of revoke is here:
https://lore.kernel.org/dri-devel/20250114173103.GE5556@nvidia.com/
This is on github: https://github.com/jgunthorpe/linux/commits/iommufd_dmabuf
v2:
- Rebase on Leon's v9
- Fix mislocking in an iopt_fill_domain() error path
- Revise the comments around how the sub page offset works
- Remove a useless WARN_ON in iopt_pages_rw_access()
- Fix a missed memory free in the selftest
v1: https://patch.msgid.link/r/0-v1-64bed2430cdb+31b-iommufd_dmabuf_jgg@nvidia.…
Jason Gunthorpe (9):
vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
iommufd: Add DMABUF to iopt_pages
iommufd: Do not map/unmap revoked DMABUFs
iommufd: Allow a DMABUF to be revoked
iommufd: Allow MMIO pages in a batch
iommufd: Have pfn_reader process DMABUF iopt_pages
iommufd: Have iopt_map_file_pages convert the fd to a file
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
iommufd/selftest: Add some tests for the dmabuf flow
drivers/iommu/iommufd/io_pagetable.c | 78 +++-
drivers/iommu/iommufd/io_pagetable.h | 54 ++-
drivers/iommu/iommufd/ioas.c | 8 +-
drivers/iommu/iommufd/iommufd_private.h | 14 +-
drivers/iommu/iommufd/iommufd_test.h | 10 +
drivers/iommu/iommufd/main.c | 10 +
drivers/iommu/iommufd/pages.c | 414 ++++++++++++++++--
drivers/iommu/iommufd/selftest.c | 143 ++++++
drivers/vfio/pci/vfio_pci_dmabuf.c | 34 ++
include/linux/vfio_pci_core.h | 4 +
tools/testing/selftests/iommu/iommufd.c | 43 ++
tools/testing/selftests/iommu/iommufd_utils.h | 44 ++
12 files changed, 786 insertions(+), 70 deletions(-)
base-commit: f836737ed56db9e2d5b047c56a31e05af0f3f116
--
2.43.0
On Thu, Nov 20, 2025 at 08:04:37AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, November 8, 2025 12:50 AM
> > +
> > +static int pfn_reader_fill_dmabuf(struct pfn_reader_dmabuf *dmabuf,
> > + struct pfn_batch *batch,
> > + unsigned long start_index,
> > + unsigned long last_index)
> > +{
> > + unsigned long start = dmabuf->start_offset + start_index * PAGE_SIZE;
> > +
> > + /*
> > + * This works in PAGE_SIZE indexes, if the dmabuf is sliced and
> > + * starts/ends at a sub page offset then the batch to domain code will
> > + * adjust it.
> > + */
>
> dmabuf->start_offset comes from pages->dmabuf.start, which is initialized as:
>
> pages->dmabuf.start = start - start_byte;
>
> so it's always page-aligned. Where is the sub-page offset coming from?
I need to go over this again to check it; this sub-page stuff is
a bit convoluted. start_offset should include the sub page offset
here...
> > @@ -1687,6 +1737,12 @@ static void __iopt_area_unfill_domain(struct
> > iopt_area *area,
> >
> > lockdep_assert_held(&pages->mutex);
> >
> > + if (iopt_is_dmabuf(pages)) {
> > + iopt_area_unmap_domain_range(area, domain, start_index,
> > + last_index);
> > + return;
> > + }
> > +
>
> this belongs to patch3?
This is part of programming the domain with the dmabuf; patch 3 was
about the revoke, which is a slightly different topic, though the two
are similar.
Thanks,
Jason
On Thu, Nov 20, 2025 at 07:55:04AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Saturday, November 8, 2025 12:50 AM
> >
> >
> > @@ -2031,7 +2155,10 @@ int iopt_pages_rw_access(struct iopt_pages
> > *pages, unsigned long start_byte,
> > if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
> > return -EPERM;
> >
> > - if (pages->type == IOPT_ADDRESS_FILE)
> > + if (iopt_is_dmabuf(pages))
> > + return -EINVAL;
> > +
>
> probably also add helpers for other types, e.g.:
>
> iopt_is_user()
> iopt_is_memfd()
The helper was to integrate the IS_ENABLED() check for DMABUF; there
are not many other uses, so I think we should leave it to not bloat
the patch.
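Roughly, the helper is just (a sketch; the exact config symbol and
enum name in the series may differ):

	static inline bool iopt_is_dmabuf(struct iopt_pages *pages)
	{
		return IS_ENABLED(CONFIG_DMA_SHARED_BUFFER) &&
		       pages->type == IOPT_ADDRESS_DMABUF;
	}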
> > + if (pages->type != IOPT_ADDRESS_USER)
> > return iopt_pages_rw_slow(pages, start_index, last_index,
> > start_byte % PAGE_SIZE, data,
> > length,
> > flags);
> > --
>
> then the following WARN_ON() becomes useless:
>
> if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
> WARN_ON(pages->type != IOPT_ADDRESS_USER))
> return -EINVAL;
Yep
Thanks,
Jason
On Thu, Nov 20, 2025 at 05:04:13PM -0700, Alex Williamson wrote:
> On Thu, 20 Nov 2025 11:28:29 +0200
> Leon Romanovsky <leon@kernel.org> wrote:
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 142b84b3f225..51a3bcc26f8b 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> ...
> > @@ -2487,8 +2500,11 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> >
> > err_undo:
> > list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
> > - vdev.dev_set_list)
> > + vdev.dev_set_list) {
> > + if (__vfio_pci_memory_enabled(vdev))
> > + vfio_pci_dma_buf_move(vdev, false);
> > up_write(&vdev->memory_lock);
> > + }
>
> I ran into a bug here. In the hot reset path we can have dev_sets
> where one or more devices are not opened by the user. The vconfig
> buffer for the device is established on open. However:
>
> bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
> {
> struct pci_dev *pdev = vdev->pdev;
> u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
> ...
>
> Leads to a NULL pointer dereference.
>
> I think the most straightforward fix is simply to test the open_count
> on the vfio_device, which is also protected by the dev_set->lock that
> we already hold here:
>
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -2501,7 +2501,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> err_undo:
> list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
> vdev.dev_set_list) {
> - if (__vfio_pci_memory_enabled(vdev))
> + if (vdev->vdev.open_count && __vfio_pci_memory_enabled(vdev))
> vfio_pci_dma_buf_move(vdev, false);
> up_write(&vdev->memory_lock);
> }
>
> Any other suggestions? This should be the only reset path with this
> nuance of affecting non-opened devices. Thanks,
It seems right to me.
Thanks
>
> Alex
Hi Barry,
On Fri, 21 Nov 2025 at 06:54, Barry Song <21cnbao@gmail.com> wrote:
>
> Hi Sumit,
>
> >
> > Using the micro-benchmark below, we see that mmap becomes
> > 3.5X faster:
>
>
> Marcin pointed out to me off-tree that it is actually 35x faster
> (roughly 6.98 s vs 0.198 s for the 512MB mmap), not 3.5x faster. Sorry
> for my poor math. I assume you can fix this when merging it?
Sure, I corrected this, and it is merged to drm-misc-next.
Thanks,
Sumit.
>
> >
> > W/ patch:
> >
> > ~ # ./a.out
> > mmap 512MB took 200266.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 198151.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 197069.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 196781.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 198102.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 195552.000 us, verify OK
> >
> > W/o patch:
> >
> > ~ # ./a.out
> > mmap 512MB took 6987470.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 6970739.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 6984383.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 6971311.000 us, verify OK
> > ~ # ./a.out
> > mmap 512MB took 6991680.000 us, verify OK
>
>
> Thanks
> Barry