Hi all,
This series is based on previous RFCs/discussions:
Tech topic: https://lore.kernel.org/linux-iommu/20250918214425.2677057-1-amastro@fb.com/
RFCv1: https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/
RFCv2: https://lore.kernel.org/kvm/20260312184613.3710705-1-mattev@meta.com/
The background/rationale is covered in more detail in the RFC cover
letters. The TL;DR is:
The goal is to enable userspace driver designs that use VFIO to export
DMABUFs representing subsets of PCI device BARs, and "vend" those
buffers from a primary process to other subordinate processes by fd.
These processes then mmap() the buffers and their access to the device
is isolated to the exported ranges. This is an improvement on sharing
the VFIO device fd with subordinate processes, which would allow them
unfettered access to the device.
This is achieved in two parts. First, mmap() of vfio-pci DMABUFs is
enabled. Second, a new ioctl()-based revocation mechanism is added to
allow the primary process to forcibly revoke access to
previously-shared BAR spans, even if the subordinate processes haven't
cleanly exited.
(The related topic of safe delegation of iommufd control to the
subordinate processes is not addressed here, and is follow-up work.)
As well as isolation and revocation, another advantage of accessing a
BAR through a VMA backed by a DMABUF is that it's straightforward to
create the buffer with access attributes, such as write-combining.
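To make the intended flow concrete, here is a minimal sketch of the
"vending" step: the primary has already exported a DMABUF fd covering a
BAR slice (via the VFIO DMABUF export feature, not shown) and hands it
to a subordinate over a Unix domain socket; the subordinate mmap()s it
directly. All names here are illustrative, not from the patches.

    #include <string.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Primary: pass an already-exported DMABUF fd to a subordinate process. */
    static int send_dmabuf_fd(int sock, int dmabuf_fd)
    {
            char dummy = 0;
            struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
            union {
                    char buf[CMSG_SPACE(sizeof(int))];
                    struct cmsghdr align;
            } u;
            struct msghdr msg = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
            };
            struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

            cmsg->cmsg_level = SOL_SOCKET;
            cmsg->cmsg_type = SCM_RIGHTS;
            cmsg->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(cmsg), &dmabuf_fd, sizeof(int));

            return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }

    /* Subordinate: map the received fd; access is limited to the exported
     * range and can later be revoked by the primary. */
    static void *map_vended_bar(int dmabuf_fd, size_t len)
    {
            return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                        dmabuf_fd, 0);
    }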
Notes on patches
================
Feedback on the RFCs requested that, instead of creating
DMABUF-specific vm_ops and .fault paths, I go the whole way and
migrate the existing VFIO PCI BAR mmap() to be backed by a DMABUF too,
resulting in a common vm_ops and fault handler for mmap()s of both the
VFIO device and explicitly-exported DMABUFs. This has been done for
vfio-pci, but not sub-drivers (nvgrace-gpu's special-case mappings are
unchanged).
vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put
A bug fix in a related area, whose context is a dependency for later
patches.
vfio/pci: Add a helper to look up PFNs for DMABUFs
vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
The first allows a DMABUF VMA fault handler to determine
arbitrary-sized PFNs from ranges in a DMABUF. The second refactors
DMABUF export for use by both the existing export feature and a new
helper that creates a DMABUF corresponding to a VFIO BAR mmap()
request.
vfio/pci: Convert BAR mmap() to use a DMABUF
The vfio-pci core mmap() creates a DMABUF with the helper, and the
vm_ops fault handler uses the other helper to resolve the fault.
Because this depends on DMABUF structs/code, CONFIG_VFIO_PCI_CORE
needs to depend on CONFIG_DMA_SHARED_BUFFER. The
CONFIG_VFIO_PCI_DMABUF still conditionally enables the export
support code.
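Roughly, the shared fault path described above looks like the
following sketch (not the actual patch code; the revocation and
memory-enable checks are omitted, and the PFN-lookup helper name is
illustrative):

    static vm_fault_t vfio_pci_dma_buf_fault(struct vm_fault *vmf)
    {
            struct vm_area_struct *vma = vmf->vma;
            struct vfio_pci_dma_buf *priv = vma->vm_private_data;
            unsigned long pfn;

            /* Resolve vmf->pgoff within the DMABUF's ranges to a PFN */
            if (vfio_pci_dma_buf_get_pfn(priv, vmf->pgoff, &pfn))
                    return VM_FAULT_SIGBUS;

            return vmf_insert_pfn(vma, vmf->address, pfn);
    }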
NOTE: The user mmap()s a device fd, but the resulting VMA's vm_file
becomes that of the DMABUF which takes ownership of the device and
puts it on release. This maintains the existing behaviour of a VMA
keeping the VFIO device open.
BAR zapping then happens via the existing vfio_pci_dma_buf_move()
path, which now needs to unmap PTEs in the DMABUF's address_space.
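i.e. something along these lines (a sketch; the field names are
illustrative):

    static void vfio_pci_dma_buf_zap(struct vfio_pci_dma_buf *priv)
    {
            struct dma_buf *dmabuf = priv->dmabuf;

            /* Tear down every PTE in every VMA mapping this DMABUF */
            unmap_mapping_range(dmabuf->file->f_mapping, 0, dmabuf->size, true);
    }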
vfio/pci: Provide a user-facing name for BAR mappings
There was a request for decent debug naming in /proc/<pid>/maps
etc. comparable to the existing VFIO names: since the VMAs are
DMABUFs, they have a "dmabuf:" prefix and can't be 100% identical
to before. This is a user-visible change, but this patch at least
now gives us extra info on the BDF & BAR being mapped.
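Purely for illustration (the exact string is defined by the patch), a
mapping might now show up as something like:

    fffff4000000-fffff4001000 rw-s 10000000000 00:0e 1553   dmabuf:0000:01:00.0-BAR0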
vfio/pci: Clean up BAR zap and revocation
In general (but see the NOTE below), vfio_pci_zap_bars() is now
obsolete, since it unmaps PTEs in the VFIO device address_space,
which is now unused. This patch consolidates all of its calls
(e.g. around reset) with the neighbouring vfio_pci_dma_buf_move()s
into new revoke-zap/unrevoke functions.
NOTE: the nvgrace-gpu driver continues to use its own private
vm_ops, fault handler, etc. for its special memregions, and these
DO still add PTEs to the VFIO device address_space. So, a
temporary flag, vdev->bar_needs_zap, maintains the old behaviour
for this use. At least this patch's consolidation makes it easy
to remove the remaining zap when this need goes away.
A FIXME is added: if nvgrace-gpu is converted to DMABUFs, remove
the flag and final zap.
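In outline, the consolidated helpers look roughly like this (a sketch,
with illustrative function names):

    static void vfio_pci_revoke_mappings(struct vfio_pci_core_device *vdev)
    {
            /* Revoke DMABUFs and zap PTEs in their address_spaces */
            vfio_pci_dma_buf_move(vdev, true);
            /* FIXME: drop once nvgrace-gpu's mappings are DMABUF-backed */
            if (vdev->bar_needs_zap)
                    vfio_pci_zap_bars(vdev);
    }

    static void vfio_pci_unrevoke_mappings(struct vfio_pci_core_device *vdev)
    {
            if (__vfio_pci_memory_enabled(vdev))
                    vfio_pci_dma_buf_move(vdev, false);
    }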
vfio/pci: Support mmap() of a VFIO DMABUF
Adds mmap() for a DMABUF fd exported from vfio-pci.
It was a goal to keep the VFIO device fd lifetime behaviour
unchanged with respect to the DMABUFs. An application can close
all device fds, and this will revoke/clean up all DMABUFs; no
mappings or other access can be performed thereafter. With mmap()
of the DMABUFs enabled, this means access through the VMA is also
revoked. This complicates the fault handler because, whilst the
DMABUF exists, there is no guarantee that the corresponding VFIO
device is still alive. Synchronisation is added to ensure the vdev
is available before vdev->memory_lock is touched.
(I decided against the alternative of preventing cleanup by holding
the VFIO device open if any DMABUFs exist, because it's both a
change of behaviour and less clean overall.)
I've added a chonky comment in place, happy to clarify more if you
have ideas.
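For reference, the DMABUF-side mmap op is roughly the following
sketch (the liveness/revocation handling lives in the fault path; the
vm_ops name is illustrative):

    static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf,
                                     struct vm_area_struct *vma)
    {
            struct vfio_pci_dma_buf *priv = dmabuf->priv;

            if (!(vma->vm_flags & VM_SHARED))
                    return -EINVAL;

            /* PTEs are installed lazily by the common fault handler */
            vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
            vma->vm_ops = &vfio_pci_dma_buf_vm_ops;
            vma->vm_private_data = priv;
            return 0;
    }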
vfio/pci: Permanently revoke a DMABUF on request
By weight, this is mostly a rename of 'revoked' to an enum,
'status'. There are now three states for a buffer: usable,
temporarily revoked, and permanently revoked. A new VFIO device
ioctl is added,
VFIO_DEVICE_PCI_DMABUF_REVOKE, which passes a DMABUF (exported from
that device) and permanently revokes it. Thus a userspace driver
can guarantee any downstream consumers of a shared fd are prevented
from accessing a BAR range, and that range can be reused.
The code doing revocation in vfio_pci_dma_buf_move() is moved,
unchanged, to a common function for use by _move() and the new
ioctl path.
Q: I can't think of a good reason to temporarily revoke/unrevoke
buffers from userspace, so didn't add a 'flags' field to the ioctl
struct. Easy to add if people think it's worthwhile for future
use.
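For illustration, userspace usage would look something like the
following; the struct layout here is a guess for illustration only,
see the patch for the real uAPI:

    struct vfio_device_pci_dmabuf_revoke revoke = {
            .argsz = sizeof(revoke),
            .fd = dmabuf_fd,        /* DMABUF previously exported from device_fd */
    };

    if (ioctl(device_fd, VFIO_DEVICE_PCI_DMABUF_REVOKE, &revoke))
            perror("VFIO_DEVICE_PCI_DMABUF_REVOKE");
    /* All existing mappings/importers of dmabuf_fd now permanently lose
     * access, and the covered BAR range can be re-exported. */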
vfio/pci: Add mmap() attributes to DMABUF feature
Reserves bits [31:28] in vfio_device_feature_dma_buf to allow a
(CPU) mapping attribute to be specified for an exported set of
ranges. The default is the current UC, and a new flag can specify
CPU access as WC.
Q: I've taken 4 bits; the intention is for this field to be a
scalar not a bitmap (i.e. mutually-exclusive access properties).
Perhaps 4 is a bit too many?
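Kernel-side, the attribute would presumably boil down to the usual
pgprot choice when the DMABUF is mapped; a sketch, where priv->map_wc
is an illustrative name:

    if (priv->map_wc)
            vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
    else
            vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);  /* UC default */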
Testing
=======
(The [RFC ONLY] userspace test program, for QEMU edu-plus, has been
dropped, but can be found in the GitHub branch below.)
This code has been tested with mapping DMABUFs of single/multiple
ranges, aliasing mmap()s, aliasing ranges across DMABUFs, vm_pgoff >
0, revocation, shutdown/cleanup scenarios, and hugepage mappings; all
of these seem to work correctly. I've also lightly tested WC mappings
(by observing that the resulting PTEs have the correct attributes). No
regressions
observed on the VFIO selftests, or on our internal vfio-pci
applications.
End
===
This is based on -next (next-20260414 but will merge earlier), as it
depends on Leon's series "vfio: Wait for dma-buf invalidation to
complete":
https://lore.kernel.org/linux-iommu/20260205-nocturnal-poetic-chamois-f566a…
These commits are on GitHub, along with "[RFC ONLY] selftests: vfio: Add
standalone vfio_dmabuf_mmap_test":
https://github.com/metamev/linux/compare/next-20260414...metamev:linux:dev/…
Thanks for reading,
Matt
================================================================================
Change log:
v1:
- Cleanup of the common DMABUF-aware VMA vm_ops fault handler and
export code.
- Fixed a lot of races, particularly faults racing with DMABUF
cleanup (if the VFIO device fds close, for example).
- Added nicer human-readable names for VFIO mmap() VMAs
RFCv2: Respin based on the feedback/suggestions:
https://lore.kernel.org/kvm/20260312184613.3710705-1-mattev@meta.com/
- Transform the existing VFIO BAR mmap path to also use DMABUFs
behind the scenes, and then simply share that code for
explicitly-mapped DMABUFs. Jason wanted to go that direction to
enable iommufd VFIO type 1 emulation to pick up a DMABUF for an IO
mapping.
- Revoke buffers using a VFIO device fd ioctl
RFCv1:
https://lore.kernel.org/all/20260226202211.929005-1-mattev@meta.com/
Matt Evans (9):
vfio/pci: Fix vfio_pci_dma_buf_cleanup() double-put
vfio/pci: Add a helper to look up PFNs for DMABUFs
vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
vfio/pci: Convert BAR mmap() to use a DMABUF
vfio/pci: Provide a user-facing name for BAR mappings
vfio/pci: Clean up BAR zap and revocation
vfio/pci: Support mmap() of a VFIO DMABUF
vfio/pci: Permanently revoke a DMABUF on request
vfio/pci: Add mmap() attributes to DMABUF feature
drivers/vfio/pci/Kconfig | 3 +-
drivers/vfio/pci/Makefile | 3 +-
drivers/vfio/pci/nvgrace-gpu/main.c | 5 +
drivers/vfio/pci/vfio_pci_config.c | 30 +-
drivers/vfio/pci/vfio_pci_core.c | 224 ++++++++++---
drivers/vfio/pci/vfio_pci_dmabuf.c | 500 +++++++++++++++++++++++-----
drivers/vfio/pci/vfio_pci_priv.h | 49 ++-
include/linux/vfio_pci_core.h | 1 +
include/uapi/linux/vfio.h | 42 ++-
9 files changed, 690 insertions(+), 167 deletions(-)
--
2.47.3
Most of this patch series has already been pushed upstream; this is just
the second half of the series that has not been pushed yet, plus some
additional changes which were required to implement what was requested
on the mailing list. This patch series is originally from Asahi and was
previously posted by Daniel Almeida.
The previous version of the patch series can be found here:
https://patchwork.freedesktop.org/series/164580/
A branch with the patches applied is available here, in case anyone
wants to make sure this builds:
https://gitlab.freedesktop.org/lyudess/linux/-/commits/rust/gem-shmem
This patch series applies on top of drm-rust-next
Lyude Paul (5):
rust: drm: gem: s/device::Device/Device/ for shmem.rs
drm/gem/shmem: Introduce __drm_gem_shmem_free_sgt_locked()
rust: drm: gem/shmem: Add DmaResvGuard helper
rust: drm: gem: Introduce shmem::SGTable
rust: drm: gem: Add vmap functions to shmem bindings
drivers/gpu/drm/drm_gem_shmem_helper.c | 32 +-
include/drm/drm_gem_shmem_helper.h | 1 +
rust/kernel/drm/gem/shmem.rs | 602 ++++++++++++++++++++++++-
3 files changed, 614 insertions(+), 21 deletions(-)
base-commit: d9a6809478f9815b6455a327aa001737ac7b2c09
--
2.54.0
On Thu, Apr 30, 2026 at 9:15 PM Barry Song <baohua(a)kernel.org> wrote:
>
> On Wed, Apr 22, 2026 at 3:10 PM Christian König
> <christian.koenig(a)amd.com> wrote:
> >
> > On 4/7/26 13:29, Barry Song wrote:
> > > On Tue, Apr 7, 2026 at 3:58 PM Christian König <christian.koenig(a)amd.com> wrote:
> > >>
> > >> On 4/6/26 23:49, Barry Song (Xiaomi) wrote:
> > >>> From: Xueyuan Chen <Xueyuan.chen21(a)gmail.com>
> > >>>
> > >>> Replace the heavy for_each_sgtable_page() iterator in system_heap_do_vmap()
> > >>> with a more efficient nested loop approach.
> > >>>
> > >>> Instead of iterating page by page, we now iterate through the scatterlist
> > >>> entries via for_each_sgtable_sg(). Because pages within a single sg entry
> > >>> are physically contiguous, we can populate the page array in an inner
> > >>> loop using simple pointer math. This saves a lot of time.
> > >>>
> > >>> The WARN_ON check is also pulled out of the loop to save branch
> > >>> instructions.
> > >>>
> > >>> Performance results mapping a 2GB buffer on Radxa O6:
> > >>> - Before: ~1440000 ns
> > >>> - After: ~232000 ns
> > >>> (~84% reduction in iteration time, or ~6.2x faster)
> > >>
> > >> Well real question is why do you care about the vmap performance?
> > >>
> > >> That should basically only be used for fbdev emulation (except for VMGFX) and we absolutely don't care about performance there.
> > >
> > > I agree that in mainline, dma_buf_vmap is not used very often.
> > > Here’s what I was able to find:
> > >
> > > 1 1638 drivers/dma-buf/dma-buf.c <<dma_buf_vmap_unlocked>>
> > > ret = dma_buf_vmap(dmabuf, map);
> > > 2 376 drivers/gpu/drm/drm_gem_shmem_helper.c
> > > <<drm_gem_shmem_vmap_locked>>
> > > ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> > > 3 85 drivers/gpu/drm/etnaviv/etnaviv_gem_prime.c
> > > <<etnaviv_gem_prime_vmap_impl>>
> > > ret = dma_buf_vmap(etnaviv_obj->base.import_attach->dmabuf, &map);
> > > 4 433 drivers/gpu/drm/vmwgfx/vmwgfx_blit.c <<map_external>>
> > > ret = dma_buf_vmap(bo->tbo.base.dma_buf, map);
> > > 5 88 drivers/gpu/drm/vmwgfx/vmwgfx_gem.c <<vmw_gem_vmap>>
> > > ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
> > >
> > > However, in the Android ecosystem, system_heap and similar heaps
> > > are widely used across camera, NPU, and media drivers. Many of these
> > > drivers are not in mainline but do use vmap() in real code paths.
> >
> > Well, out-of-tree drivers are not a justification for making upstream changes.
> >
> > Apart from a handful of workarounds which need CPU access as a fallback, DMA-buf vmap is only used to provide fbdev emulation.
> >
> > The vmap interface has already given us quite a headache in the first place and there are a couple of unresolved problems regarding synchronization and coherency.
> >
> > If a driver which makes such frequent use of dma_buf_vmap that it matters for performance were pushed upstream, I think there would be pushback, and the driver developer would need a very good explanation of why that is necessary.
> >
> > So for now I have to reject that patch.
>
> Well, it doesn’t seem to increase complexity, and the code is quite easy
> to understand.
I agree with this. This change introduces basically no downsides for
upstream, even if it primarily benefits a rare use case. Since
dma_buf_vmap is exported for driver use, why not enhance the
performance for all callers?
-T.J.
> It would be great if the community could be more welcoming
> to developers who are just getting involved, rather than discouraging them.
>
> Apparently, no one can control whether the source code of those kernel
> modules will be upstreamed except the vendors themselves, but products
> can still benefit from the common kernel.
>
> Best Regards
> Barry
On Thu, Apr 30, 2026 at 05:47:49PM +0100, Matt Evans wrote:
> > On Thu, Apr 16, 2026 at 06:17:46AM -0700, Matt Evans wrote:
> > > +int vfio_pci_core_mmap_prep_dmabuf(struct vfio_pci_core_device *vdev,
> > > + struct vm_area_struct *vma,
> > > + u64 phys_start, u64 req_len,
> > > + unsigned int res_index)
> > > +{
> > > + struct vfio_pci_dma_buf *priv;
> > > + const unsigned int nr_ranges = 1;
> > > + int ret;
> > > +
> > > + priv = kzalloc_obj(*priv);
> > > + if (!priv)
> > > + return -ENOMEM;
> > > +
> > > + priv->phys_vec = kzalloc_obj(*priv->phys_vec);
> > > + if (!priv->phys_vec) {
> > > + ret = -ENOMEM;
> > > + goto err_free_priv;
> > > + }
> > > +
> > > + /*
> > > + * The mmap() request's vma->vm_offs might be non-zero, but
> > > + * the DMABUF is created from _offset zero_ of the BAR. The
> > > + * portion between zero and the vm_offs is inaccessible
> > > + * through this VMA, but this approach keeps the
> > > + * /proc/<pid>/maps offset somewhat consistent with the
> > > + * pre-DMABUF code. Size includes the offset portion.
> >
> > I'm not sure I understand this comment?
> >
> > For the old path vm_pgoff for byte 0 of the bar starts at some large
> > offset
> >
> > For the new path vm_pgoff for byte 0 of the first range starts at 0
>
> Glad you asked. :)
>
> This is trying to achieve keeping /proc/<pid>/maps (or similar) somewhat
> as informative as pre-DMABUF BAR mmap, in terms of keeping the VMA
> vm_offs column useful. Before this patch, say you mmap() two slices A
> and B of the same BAR:
>
> struct vfio_region_info bar_region;
>
> vm_a = mmap(0, 0x1000, ..., device_fd, bar_region.offset + 0);
> vm_b = mmap(0, 0x1000, ..., device_fd, bar_region.offset + 0x4000);
>
> ...you'd see something like this in /proc/blah/maps:
>
> fffff4000000-fffff4001000 rw-s 10000000000 00:07 148 /dev/vfio/devices/vfio0
> fffff5000000-fffff5001000 rw-s 10000004000 00:07 148 /dev/vfio/devices/vfio0
> If instead the DMABUF were created starting at the requested offset,
> then the VMA's vm_offs would need to be thunked back down to 0 (since
> the fault handler would then treat vm_b + 0 as the first byte of the
> DMABUF). That works/adds up, but then the vm_offs of VMAs A & B would
> both be 0, and it's harder to differentiate them in /proc/blah/maps.
Yes, and that would be correct.
The VMA output of lspci should show the exact pgoff passed to mmap and
nothing else. Do not mangle it for "debugging".
pgoff is not to be used to show random internal FD details..
> We could possibly stash the original offset somewhere and then render it
> in the name string, but the name's already about the max size and using
> the existing vm_offs column is nicer IMO, doesn't need a new field, etc.
> I need to work on this comment then! What this is trying to say is that
> the DMABUF is made artificially larger than the part that is visible
> through the VMA.
Yuk, that's another reason not to do this.
Jason
Hi,
The recent introduction of heaps in the optee driver [1] made it
possible to create heaps as modules.
It's generally a good idea where possible, including for the
already-existing system and CMA heaps.
The system one is pretty trivial, and the CMA one is now easy too with
the reworks we got in 7.1-rc1.
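For reference, "heap as a module" essentially means registering through
dma_heap_add() from module_init(), along the lines of the sketch below
(ops body omitted, names illustrative). As noted in the changelog
below, heap modules can't be removed, so there is no module_exit():

    #include <linux/dma-heap.h>
    #include <linux/err.h>
    #include <linux/module.h>

    /* The real heap provides .allocate here */
    static const struct dma_heap_ops example_heap_ops;

    static int __init example_heap_init(void)
    {
            struct dma_heap_export_info exp_info = {
                    .name = "example",
                    .ops  = &example_heap_ops,
                    .priv = NULL,
            };

            return PTR_ERR_OR_ZERO(dma_heap_add(&exp_info));
    }
    module_init(example_heap_init);

    MODULE_DESCRIPTION("Example dma-buf heap built as a module");
    MODULE_LICENSE("GPL");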
Let me know what you think,
Maxime
1: https://lore.kernel.org/dri-devel/20250911135007.1275833-4-jens.wiklander@l…
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Changes in v5:
- Rebase on 7.1-rc1
- Add a patch to enable the heaps in arm64 defconfig
- Link to v4: https://lore.kernel.org/r/20260331-dma-buf-heaps-as-modules-v4-0-e18fda5044…
Changes in v4:
- Fix compilation failure
- Rework to take into account OF_RESERVED_MEM
- Fix regression making the default CMA area disappear if not created
through the DT
- Added some documentation and comments
- Link to v3: https://lore.kernel.org/r/20260303-dma-buf-heaps-as-modules-v3-0-24344812c7…
Changes in v3:
- Squashed cma_get_name and cma_alloc/release patches
- Fixed typo in Export dev_get_cma_area commit title
- Fixed compilation failure with DMA_CMA but not OF_RESERVED_MEM
- Link to v2: https://lore.kernel.org/r/20260227-dma-buf-heaps-as-modules-v2-0-454aee7e06…
Changes in v2:
- Collect tags
- Don't export dma_contiguous_default_area anymore, but export
dev_get_cma_area instead
- Mentioned that heap modules can't be removed
- Link to v1: https://lore.kernel.org/r/20260225-dma-buf-heaps-as-modules-v1-0-2109225a09…
---
Maxime Ripard (4):
dma-buf: heaps: Export mem_accounting parameter
dma-buf: heaps: cma: Turn the heap into a module
dma-buf: heaps: system: Turn the heap into a module
arm64: defconfig: Enable dma-buf heaps
arch/arm64/configs/defconfig | 3 +++
drivers/dma-buf/dma-heap.c | 1 +
drivers/dma-buf/heaps/Kconfig | 4 ++--
drivers/dma-buf/heaps/cma_heap.c | 3 +++
drivers/dma-buf/heaps/system_heap.c | 5 +++++
5 files changed, 14 insertions(+), 2 deletions(-)
---
base-commit: 5e9b7d093f3f77cb0af4409559e3d139babfb443
change-id: 20260225-dma-buf-heaps-as-modules-1034b3ec9f2a
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>