On Fri, Feb 27, 2026 at 01:52:15PM -0800, Alex Mastro wrote:
> On Fri, Feb 27, 2026 at 03:48:07PM -0400, Jason Gunthorpe wrote:
> > > > I actually would like to go the other way and have VFIO always have a
> > > > DMABUF under the VMA's it mmaps because that will make it easy to
> > > > finish the type1 emulation which requires finding dmabufs for the
> > > > VMAs.
> >
> > This is still a better idea since it avoids duplicating the VMA flow
> > into two parts.
>
> I suppose this would also compose with your idea to use dma-buf for
> iommufd_compat support of VFIO_IOMMU_MAP_DMA of vfio device fd-backed mmap()s
> [1]? Instead of needing to materialize a new dma-buf, you could use the existing
> backing one?
Yeah, that too
I think it is a fairly easy progression:
1) mmap_prepare() allocates a new dmabuf file * and sticks it in
desc->vm_file. Rework so all the vma_ops are using vm_file that is
a dmabuf. The allocated dmabuf has a singleton range
2) Teach the fault handlers to support full range semantics
3) Use dmabuf revoke variables/etc in the mmap fault handlers
4) Move the address space from the vfio to the dmabuf
5) Allow mmaping the dmabuf fd directly which is now only a couple lines
I forget how all the different mmap implementations in vfio interact
though - but I think the above is good for vfio-pci
Jason
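As an editorial illustration, step 1 of the progression above might be sketched roughly as follows. This is a hedged sketch only: vfio_pci_export_range_dmabuf(), the vfio_pci_dmabuf_vm_ops table, and the exact vm_area_desc field names are assumptions, not code from any posted patch.

```c
/*
 * Rough sketch of step 1 (hypothetical helper names, approximate
 * mmap_prepare API): allocate a singleton-range dmabuf and make it the
 * file backing the VMA, so every vma op sees a dmabuf vm_file.
 */
static int vfio_pci_mmap_prepare(struct vm_area_desc *desc)
{
	struct vfio_pci_core_device *vdev = desc->file->private_data;
	struct dma_buf *dmabuf;

	/* Export a dmabuf covering exactly the range being mmap()ed. */
	dmabuf = vfio_pci_export_range_dmabuf(vdev, desc->pgoff,
					      desc->end - desc->start);
	if (IS_ERR(dmabuf))
		return PTR_ERR(dmabuf);

	/* Back the VMA with the dmabuf's file instead of the VFIO fd. */
	desc->vm_file = dmabuf->file;
	desc->vm_ops = &vfio_pci_dmabuf_vm_ops;
	return 0;
}
```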
On Fri, Feb 27, 2026 at 07:42:08PM +0000, Matt Evans wrote:
> Hi Jason + Christian,
>
> On 27/02/2026 12:51, Jason Gunthorpe wrote:
> > On Fri, Feb 27, 2026 at 11:09:31AM +0100, Christian König wrote:
> >
> >> When a DMA-buf just represents a linear piece of BAR which is
> >> map-able through the VFIO FD anyway then the right approach is to
> >> just re-direct the mapping to this VFIO FD.
>
> We think limiting this to one range per DMABUF isn't enough,
> i.e. supporting multiple ranges will be a benefit.
>
> Bumping vm_pgoff to then reuse vfio_pci_mmap_ops is a really nice
> suggestion for the simplest case, but can't support multiple ranges;
> the .fault() needs to be aware of the non-linear DMABUF layout.
Sigh, yes that's right we have the non-linear thing, and if you need
that to work it can't use the existing code.
> > I actually would like to go the other way and have VFIO always have a
> > DMABUF under the VMA's it mmaps because that will make it easy to
> > finish the type1 emulation which requires finding dmabufs for the
> > VMAs.
This is still a better idea since it avoids duplicating the VMA flow
into two parts.
> Putting aside the above point of needing a new .fault() able to find a
> PFN for >1 range for a mo, how would the test of the revoked flag work
> w.r.t. synchronisation and protecting against a racing revoke? It's not
> safe to take memory_lock, test revoked, unlock, then hand over to the
> existing vfio_pci_mmap_*fault() -- which re-takes the lock. I'm not
> quite seeing how we could reuse the existing vfio_pci_mmap_*fault(),
> TBH. I did briefly consider refactoring that existing .fault() code,
> but that makes both paths uglier.
More reasons to do the above..
> > Possibly for this use case you can keep that and do a global unmap and
> > rely on fault to restore the mmaps that were not revoked.
>
> Hm, that'd be functional, but we should consider huge BARs with a lot of
> PTEs (even huge ones); zapping all BARs might noticeably disturb other
> clients. But please see my query below: if we could zap just the
> resource being reclaimed, that would be preferable.
Hurm. Otherwise you have to create a bunch of address spaces and
juggle them.
> >> Otherwise functions like vfio_pci_zap_bars() don't work correctly
> >> any more and that usually creates a huge bunch of problems.
>
> I'd reasoned it was OK for the DMABUF to have its own unique address
> space -- even though IIUC that means an unmap_mapping_range() by
> vfio_pci_core_device won't affect a DMABUF's mappings -- because
> anything that needs to zap a BAR _also_ must already plan to notify
> DMABUF importers via vfio_pci_dma_buf_move(). And then,
> vfio_pci_dma_buf_move() will zap the mappings.
That might be correct, but then it is yet another reason to do the
first point and remove the shared address_space fully.
Basically one mmap flow that always uses dma-buf and always uses a
per-dma-buf address space with a per-FD revoke and so on and so forth.
This way there is still one of everything, we just pay a bit of cost
to automatically create a dmabuf file * in the existing path.
> Are there paths that _don't_ always pair vfio_pci_zap_bars() with a
> vfio_pci_dma_buf_move()?
There should not be.
Jason
Hi Matt,
On 2/27/26 14:02, Matt Evans wrote:
> Hi Christian,
>
> On 27/02/2026 10:05, Christian König wrote:
>> On 2/26/26 21:22, Matt Evans wrote:
>>> Add a new dma-buf ioctl() op, DMA_BUF_IOCTL_REVOKE, connected to a new
>>> (optional) dma_buf_ops callback, revoke(). An exporter receiving this
>>> will _permanently_ revoke the DMABUF, meaning it can no longer be
>>> mapped/attached/mmap()ed. It also guarantees that existing
>>> importers have been detached (e.g. via move_notify) and all mappings
>>> made inaccessible.
>>>
>>> This is useful for lifecycle management in scenarios where a process
>>> has created a DMABUF representing a resource, then delegated it to
>>> a client process; access to the resource is revoked when the client is
>>> deemed "done", and the resource can be safely re-used elsewhere.
>>
>> Well that means revoking from the importer side. That absolutely doesn't make sense to me.
>>
>> Why would you do that?
>
> Well, it's for cleanup, but directed to a specific buffer.
>
> Elaborating on the original example, a userspace driver creates a DMABUF
> for parts of a BAR and then sends its fd to some other client process
> via SCM_RIGHTS. The client might then do all of:
>
> - Process mappings of the buffer
> - iommufd IO-mappings of it
> - other unrelated drivers import it
> - share the fd with more processes!
>
> i.e. poking a programming interface and orchestrating P2P DMA to it.
> Eventually the client completes and messages the driver to say goodbye,
> except the client is buggy: it hangs before it munmaps or requests other
> drivers to shut down/detach their imports.
>
> Now the original driver can't reuse any BAR ranges it shared out, as
> there might still be active mappings or even ongoing P2P DMA to them.
>
> The goal is to guarantee a point in time where resources corresponding
> to a previously-shared DMABUF fd _cannot_ be accessed anymore: CPUs,
> or other drivers/importers, or any other kind of P2P DMA. So yes, a
> revoke must detach importers, using the synchronous revocation flow
> Leon added in [0] ("dma-buf: Use revoke mechanism to invalidate shared
> buffers").
>
> (Apologies, I should really have just built this on top of a tree
> containing that series to make this need clearer.)
>
> But, it ultimately seems to have the same downstream effects as if one
> were to, say, shut down VFIO device fds and therefore trigger
> vfio_pci_dma_buf_cleanup(). It's just the reason to trigger revocation
> is different: a selective userspace-triggered revocation of a given
> buffer, instead of an exporter cleanup-triggered revocation of all
> buffers. In both cases the goals are identical too, of a synchronised
> point after which no more DMA/CPU access can happen.
>
> (If I've misunderstood your question please clarify, but I hope that
> answers it!)
Yeah, that makes it clear; Jason's answer also helped quite a bit in understanding what you want to do here.
First of all, your requirements sound reasonable, but an absolutely clear NAK to the way those patches approach implementing them. You completely mixed up the different DMA-buf roles and which is used for what.
See, the IOCTLs on the DMA-buf file descriptor are for the importer side to communicate with the exporter side. E.g. things like "I'm done writing with the CPU, please make that visible to yourself and other importers"...
But what you want to do here is just the other way around: the exporter side wants to signal to all importers that the buffer can't be used any more, correct?
If I understood that correctly, then my suggestion is that you add a new IOCTL on the VFIO fd you originally used to export the DMA-buf fd. This IOCTL takes the DMA-buf fd and, after double checking that it is indeed the exporter of that fd, revokes all importer access to it.
I'm certainly open to suggestions on how to improve the DMA-buf documentation to make that clearer in the future.
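A rough sketch of what such an exporter-side IOCTL handler could look like. VFIO_DEVICE_DMA_BUF_REVOKE, vfio_pci_is_dmabuf_exporter(), and vfio_pci_dma_buf_revoke() are hypothetical names invented for this sketch; only dma_buf_get()/dma_buf_put() are existing dma-buf API.

```c
/*
 * Sketch: VFIO device ioctl that takes a dma-buf fd, verifies this
 * device really exported it, and then revokes all importer access.
 */
static int vfio_pci_ioctl_dma_buf_revoke(struct vfio_pci_core_device *vdev,
					 int dmabuf_fd)
{
	struct dma_buf *dmabuf = dma_buf_get(dmabuf_fd);
	int ret;

	if (IS_ERR(dmabuf))
		return PTR_ERR(dmabuf);

	/* Only the exporter may revoke: check the dmabuf really came
	 * from this VFIO device before touching it. */
	if (!vfio_pci_is_dmabuf_exporter(vdev, dmabuf)) {
		ret = -EINVAL;
		goto out_put;
	}

	/* Tear down importer access (move_notify), zap CPU mappings,
	 * and mark the buffer permanently revoked. */
	ret = vfio_pci_dma_buf_revoke(vdev, dmabuf);
out_put:
	dma_buf_put(dmabuf);
	return ret;
}
```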
Regards,
Christian.
>
> Cheers,
>
>
> Matt
>
> [0] https://lore.kernel.org/linux-iommu/20260205-nocturnal-poetic-chamois-f566a…
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Signed-off-by: Matt Evans <mattev(a)meta.com>
>>> ---
>>> drivers/dma-buf/dma-buf.c | 5 +++++
>>> include/linux/dma-buf.h | 22 ++++++++++++++++++++++
>>> include/uapi/linux/dma-buf.h | 1 +
>>> 3 files changed, 28 insertions(+)
>>>
>>> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
>>> index edaa9e4ee4ae..b9b315317f2d 100644
>>> --- a/drivers/dma-buf/dma-buf.c
>>> +++ b/drivers/dma-buf/dma-buf.c
>>> @@ -561,6 +561,11 @@ static long dma_buf_ioctl(struct file *file,
>>> case DMA_BUF_IOCTL_IMPORT_SYNC_FILE:
>>> return dma_buf_import_sync_file(dmabuf, (const void __user *)arg);
>>> #endif
>>> + case DMA_BUF_IOCTL_REVOKE:
>>> + if (dmabuf->ops->revoke)
>>> + return dmabuf->ops->revoke(dmabuf);
>>> + else
>>> + return -EINVAL;
>>>
>>> default:
>>> return -ENOTTY;
>>> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
>>> index 0bc492090237..a68c9ad7aebd 100644
>>> --- a/include/linux/dma-buf.h
>>> +++ b/include/linux/dma-buf.h
>>> @@ -277,6 +277,28 @@ struct dma_buf_ops {
>>>
>>> int (*vmap)(struct dma_buf *dmabuf, struct iosys_map *map);
>>> void (*vunmap)(struct dma_buf *dmabuf, struct iosys_map *map);
>>> +
>>> + /**
>>> + * @revoke:
>>> + *
>>> + * This callback is invoked from a userspace
>>> + * DMA_BUF_IOCTL_REVOKE operation, and requests that access to
>>> + * the buffer is immediately and permanently revoked. On
>>> + * successful return, the buffer is not accessible through any
>>> + * mmap() or dma-buf import. The request fails if the buffer
>>> + * is pinned; otherwise, the exporter marks the buffer as
>>> + * inaccessible and uses the move_notify callback to inform
>>> + * importers of the change. The buffer is permanently
>>> + * disabled, and the exporter must refuse all map, mmap,
>>> + * attach, etc. requests.
>>> + *
>>> + * Returns:
>>> + *
>>> + * 0 on success, or a negative error code on failure:
>>> + * -ENODEV if the associated device no longer exists/is closed.
>>> + * -EBADFD if the buffer has already been revoked.
>>> + */
>>> + int (*revoke)(struct dma_buf *dmabuf);
>>> };
>>>
>>> /**
>>> diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
>>> index 5a6fda66d9ad..84bf2dd2d0f3 100644
>>> --- a/include/uapi/linux/dma-buf.h
>>> +++ b/include/uapi/linux/dma-buf.h
>>> @@ -178,5 +178,6 @@ struct dma_buf_import_sync_file {
>>> #define DMA_BUF_SET_NAME_B _IOW(DMA_BUF_BASE, 1, __u64)
>>> #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_export_sync_file)
>>> #define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_import_sync_file)
>>> +#define DMA_BUF_IOCTL_REVOKE _IO(DMA_BUF_BASE, 4)
>>>
>>> #endif
>>> --
>>> 2.47.3
>>>
>>
>
Hi,
The recent introduction of heaps in the optee driver [1] made possible
the creation of heaps as modules.
It's generally a good idea if possible, including for the already
existing system and CMA heaps.
The system one is pretty trivial, the CMA one is a bit more involved,
especially since we have a call from kernel/dma/contiguous.c to the CMA
heap code. This was solved by turning the logic around and making the
CMA heap call into the contiguous DMA code.
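Schematically, turning the logic around means the built-in contiguous code stops calling into the heap and instead exposes an entry point that the modular heap drives at module_init(). All names below are illustrative, not taken from the actual patches.

```c
/* kernel/dma/contiguous.c (built-in): expose the CMA areas instead of
 * registering heaps itself at boot (hypothetical iterator). */
int dma_contiguous_for_each_area(int (*fn)(struct cma *cma, void *data),
				 void *data);

/* drivers/dma-buf/heaps/cma_heap.c (now a module): registration moves
 * into the heap, which calls into the contiguous DMA code. */
static int __init cma_heap_init(void)
{
	return dma_contiguous_for_each_area(add_cma_heap, NULL);
}
module_init(cma_heap_init);
```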
Let me know what you think,
Maxime
1: https://lore.kernel.org/dri-devel/20250911135007.1275833-4-jens.wiklander@l…
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Maxime Ripard (7):
dma: contiguous: Turn heap registration logic around
mm: cma: Export cma_alloc and cma_release
mm: cma: Export cma_get_name
mm: cma: Export dma_contiguous_default_area
dma-buf: heaps: Export mem_accounting parameter
dma-buf: heaps: cma: Turn the heap into a module
dma-buf: heaps: system: Turn the heap into a module
drivers/dma-buf/dma-heap.c | 1 +
drivers/dma-buf/heaps/Kconfig | 4 ++--
drivers/dma-buf/heaps/cma_heap.c | 21 +++++----------------
drivers/dma-buf/heaps/system_heap.c | 5 +++++
include/linux/dma-map-ops.h | 5 +++++
kernel/dma/contiguous.c | 27 +++++++++++++++++++++++++--
mm/cma.c | 3 +++
7 files changed, 46 insertions(+), 20 deletions(-)
---
base-commit: 499a718536dc0e1c1d1b6211847207d58acd9916
change-id: 20260225-dma-buf-heaps-as-modules-1034b3ec9f2a
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>
Hi David,
On Thu, Feb 26, 2026 at 11:25:24AM +0100, David Hildenbrand (Arm) wrote:
> On 2/25/26 17:41, Maxime Ripard wrote:
> > The CMA dma-buf heap uses cma_alloc() and cma_release() to allocate and
> > free, respectively, its CMA buffers.
> >
> > However, these functions are not exported. Since we want to turn the CMA
> > heap into a module, let's export them both.
> >
> > Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
> > ---
> > mm/cma.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/cma.c b/mm/cma.c
> > index 94b5da468a7d719e5144d33b06bcc7619c0fbcc9..be142b473f3bd41b9c7d8ba4397f018f6993d962 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -949,10 +949,11 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> > if (page)
> > set_pages_refcounted(page, count);
> >
> > return page;
> > }
> > +EXPORT_SYMBOL_GPL(cma_alloc);
> >
> > static struct cma_memrange *find_cma_memrange(struct cma *cma,
> > const struct page *pages, unsigned long count)
> > {
> > struct cma_memrange *cmr = NULL;
> > @@ -1025,10 +1026,11 @@ bool cma_release(struct cma *cma, const struct page *pages,
> >
> > __cma_release_frozen(cma, cmr, pages, count);
> >
> > return true;
> > }
> > +EXPORT_SYMBOL_GPL(cma_release);
> >
> > bool cma_release_frozen(struct cma *cma, const struct page *pages,
> > unsigned long count)
> > {
> > struct cma_memrange *cmr;
> >
>
> I'm wondering whether we want to restrict all these exports to the
> dma-buf module only using EXPORT_SYMBOL_FOR_MODULES().
TIL about EXPORT_SYMBOL_FOR_MODULES, thanks.
> Especially dma_contiguous_default_area() (patch #4), I am not sure
> whether we want arbitrary modules to mess with that.
Yeah, I wasn't too fond of that one either. Alternatively, I guess we
could turn dev_get_cma_area into a non-inlined function and export that
instead?
Or we could do both.
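The alternative floated here would look roughly like this: the body matches the current inline dev_get_cma_area(), moved out of the header so only the accessor needs exporting while dma_contiguous_default_area stays file-private. A sketch, not the posted patch.

```c
/* kernel/dma/contiguous.c */
static struct cma *dma_contiguous_default_area;

struct cma *dev_get_cma_area(struct device *dev)
{
	/* Per-device CMA area wins; otherwise fall back to the
	 * default area, which remains private to this file. */
	if (dev && dev->cma_area)
		return dev->cma_area;
	return dma_contiguous_default_area;
}
EXPORT_SYMBOL_GPL(dev_get_cma_area);
```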
Maxime
Hi,
The recent introduction of heaps in the optee driver [1] made possible
the creation of heaps as modules.
It's generally a good idea if possible, including for the already
existing system and CMA heaps.
The system one is pretty trivial, the CMA one is a bit more involved,
especially since we have a call from kernel/dma/contiguous.c to the CMA
heap code. This was solved by turning the logic around and making the
CMA heap call into the contiguous DMA code.
Let me know what you think,
Maxime
1: https://lore.kernel.org/dri-devel/20250911135007.1275833-4-jens.wiklander@l…
Signed-off-by: Maxime Ripard <mripard(a)kernel.org>
---
Changes in v2:
- Collect tags
- Don't export dma_contiguous_default_area anymore, but export
dev_get_cma_area instead
- Mentioned that heap modules can't be removed
- Link to v1: https://lore.kernel.org/r/20260225-dma-buf-heaps-as-modules-v1-0-2109225a09…
---
Maxime Ripard (9):
dma: contiguous: Turn heap registration logic around
dma: contiguous: Make dev_get_cma_area() a proper function
dma: contiguous: Make dma_contiguous_default_area static
mm: cma: Export dev_get_cma_area()
mm: cma: Export cma_alloc and cma_release
mm: cma: Export cma_get_name
dma-buf: heaps: Export mem_accounting parameter
dma-buf: heaps: cma: Turn the heap into a module
dma-buf: heaps: system: Turn the heap into a module
drivers/dma-buf/dma-heap.c | 1 +
drivers/dma-buf/heaps/Kconfig | 4 ++--
drivers/dma-buf/heaps/cma_heap.c | 21 +++++----------------
drivers/dma-buf/heaps/system_heap.c | 5 +++++
include/linux/dma-map-ops.h | 14 ++++++--------
kernel/dma/contiguous.c | 37 ++++++++++++++++++++++++++++++++++---
mm/cma.c | 3 +++
7 files changed, 56 insertions(+), 29 deletions(-)
---
base-commit: 499a718536dc0e1c1d1b6211847207d58acd9916
change-id: 20260225-dma-buf-heaps-as-modules-1034b3ec9f2a
Best regards,
--
Maxime Ripard <mripard(a)kernel.org>
On 2/26/26 21:22, Matt Evans wrote:
> Add a new dma-buf ioctl() op, DMA_BUF_IOCTL_REVOKE, connected to a new
> (optional) dma_buf_ops callback, revoke(). An exporter receiving this
> will _permanently_ revoke the DMABUF, meaning it can no longer be
> mapped/attached/mmap()ed. It also guarantees that existing
> importers have been detached (e.g. via move_notify) and all mappings
> made inaccessible.
>
> This is useful for lifecycle management in scenarios where a process
> has created a DMABUF representing a resource, then delegated it to
> a client process; access to the resource is revoked when the client is
> deemed "done", and the resource can be safely re-used elsewhere.
Well that means revoking from the importer side. That absolutely doesn't make sense to me.
Why would you do that?
Regards,
Christian.
>
> Signed-off-by: Matt Evans <mattev(a)meta.com>
> ---
> drivers/dma-buf/dma-buf.c | 5 +++++
> include/linux/dma-buf.h | 22 ++++++++++++++++++++++
> include/uapi/linux/dma-buf.h | 1 +
> 3 files changed, 28 insertions(+)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index edaa9e4ee4ae..b9b315317f2d 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -561,6 +561,11 @@ static long dma_buf_ioctl(struct file *file,
> case DMA_BUF_IOCTL_IMPORT_SYNC_FILE:
> return dma_buf_import_sync_file(dmabuf, (const void __user *)arg);
> #endif
> + case DMA_BUF_IOCTL_REVOKE:
> + if (dmabuf->ops->revoke)
> + return dmabuf->ops->revoke(dmabuf);
> + else
> + return -EINVAL;
>
> default:
> return -ENOTTY;
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index 0bc492090237..a68c9ad7aebd 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -277,6 +277,28 @@ struct dma_buf_ops {
>
> int (*vmap)(struct dma_buf *dmabuf, struct iosys_map *map);
> void (*vunmap)(struct dma_buf *dmabuf, struct iosys_map *map);
> +
> + /**
> + * @revoke:
> + *
> + * This callback is invoked from a userspace
> + * DMA_BUF_IOCTL_REVOKE operation, and requests that access to
> + * the buffer is immediately and permanently revoked. On
> + * successful return, the buffer is not accessible through any
> + * mmap() or dma-buf import. The request fails if the buffer
> + * is pinned; otherwise, the exporter marks the buffer as
> + * inaccessible and uses the move_notify callback to inform
> + * importers of the change. The buffer is permanently
> + * disabled, and the exporter must refuse all map, mmap,
> + * attach, etc. requests.
> + *
> + * Returns:
> + *
> + * 0 on success, or a negative error code on failure:
> + * -ENODEV if the associated device no longer exists/is closed.
> + * -EBADFD if the buffer has already been revoked.
> + */
> + int (*revoke)(struct dma_buf *dmabuf);
> };
>
> /**
> diff --git a/include/uapi/linux/dma-buf.h b/include/uapi/linux/dma-buf.h
> index 5a6fda66d9ad..84bf2dd2d0f3 100644
> --- a/include/uapi/linux/dma-buf.h
> +++ b/include/uapi/linux/dma-buf.h
> @@ -178,5 +178,6 @@ struct dma_buf_import_sync_file {
> #define DMA_BUF_SET_NAME_B _IOW(DMA_BUF_BASE, 1, __u64)
> #define DMA_BUF_IOCTL_EXPORT_SYNC_FILE _IOWR(DMA_BUF_BASE, 2, struct dma_buf_export_sync_file)
> #define DMA_BUF_IOCTL_IMPORT_SYNC_FILE _IOW(DMA_BUF_BASE, 3, struct dma_buf_import_sync_file)
> +#define DMA_BUF_IOCTL_REVOKE _IO(DMA_BUF_BASE, 4)
>
> #endif
> --
> 2.47.3
>
On 2/26/26 21:21, Matt Evans wrote:
> A VFIO DMABUF can export a subset of a BAR to userspace by fd; add
> support for mmap() of this fd. This provides another route for a
> process to map BARs, except one where the process can only map a specific
> subset of a BAR represented by the exported DMABUF.
>
> mmap() support enables userspace driver designs that safely delegate
> access to BAR sub-ranges to other client processes by sharing a DMABUF
> fd, without having to share the (omnipotent) VFIO device fd with them.
>
> The mmap callback installs vm_ops callbacks for .fault and .huge_fault;
> they find a PFN by searching the DMABUF's physical ranges. That is,
> DMABUFs with multiple ranges are supported for mmap().
In general this sounds like a good idea, but this approach here doesn't look good at all.
Especially how you call unmap_mapping_range() from your DMA-buf cleanup path looks extremely questionable.
...
> +/*
> + * Similar to vfio_pci_core_mmap() for a regular VFIO device fd, but
> + * differs by pre-checks performed and ultimately the vm_ops installed.
> + */
> +static int vfio_pci_dma_buf_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
> +{
> + struct vfio_pci_dma_buf *priv = dmabuf->priv;
> + u64 req_len, req_start;
> +
> + if (!vfio_pci_dma_buf_is_mappable(dmabuf))
> + return -ENODEV;
> + if ((vma->vm_flags & VM_SHARED) == 0)
> + return -EINVAL;
> +
> + req_len = vma->vm_end - vma->vm_start;
> + req_start = vma->vm_pgoff << PAGE_SHIFT;
> +
> + if (req_start + req_len > priv->size)
> + return -EINVAL;
> +
> + vma->vm_private_data = priv;
> + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> + vma->vm_page_prot = pgprot_decrypted(vma->vm_page_prot);
> +
> + /*
> + * See comments in vfio_pci_core_mmap() re VM_ALLOW_ANY_UNCACHED.
> + *
> + * FIXME: get mapping attributes from dmabuf?
> + */
> + vm_flags_set(vma, VM_ALLOW_ANY_UNCACHED | VM_IO | VM_PFNMAP |
> + VM_DONTEXPAND | VM_DONTDUMP);
> + vma->vm_ops = &vfio_pci_dma_buf_mmap_ops;
> +
> + return 0;
Let's start with this here, it just looks horribly overcomplicated.
When a DMA-buf just represents a linear piece of BAR which is map-able through the VFIO FD anyway then the right approach is to just re-direct the mapping to this VFIO FD.
Roughly something like this here should do it:
vma->vm_pgoff += offset_which_your_dma_buf_represents;
vma_set_file(vma, core_dev->file);
vfio_pci_core_mmap(core_dev, vma);
It may be that you want additional checks (e.g. if the DMA-buf is revoked), in which case you would need to override the vma->vm_ops, but then just do the access checks and call the vfio_pci_mmap_ops to get the actual page fault handling done.
>+ unmap_mapping_range(priv->dmabuf->file->f_mapping,
>+ 0, priv->size, 1);
When you need to use unmap_mapping_range() then you usually share the address space object between the file descriptor exporting the DMA-buf and the DMA-buf fd itself.
Otherwise functions like vfio_pci_zap_bars() don't work correctly any more and that usually creates a huge bunch of problems.
Regards,
Christian.
Hi Maxime!
On Thu, Feb 26, 2026 at 11:12 AM Maxime Ripard <mripard(a)redhat.com> wrote:
>
> Hi Albert,
>
> Thanks for your patch!
>
> On Tue, Feb 24, 2026 at 08:57:33AM +0100, Albert Esteve wrote:
> > Add a dma-buf heap for DT coherent reserved-memory
> > (i.e., 'shared-dma-pool' without 'reusable' property),
> > exposing one heap per region for userspace buffers.
> >
> > The heap binds a synthetic platform device to each region
> > so coherent allocations use the correct dev->dma_mem,
> > and it defers registration until late_initcall when
> > normal allocators are available.
> >
> > This patch includes charging of the coherent heap
> > allocator to the dmem cgroup.
> >
> > Signed-off-by: Albert Esteve <aesteve(a)redhat.com>
> > ---
> > This patch introduces a new driver to expose DT coherent reserved-memory
> > regions as dma-buf heaps, allowing userspace buffers to be created.
> >
> > Since these regions are device-dependent, we bind a synthetic platform
> > device to each region so coherent allocations use the correct dev->dma_mem.
> >
> > Following Eric’s [1] and Maxime’s [2] work on charging DMA buffers
> > allocated from userspace to cgroups (dmem), this patch adds the same
> > charging pattern used by CMA heaps patch. Charging is done only through
> > the dma-buf heap interface so it can be attributed to a userspace allocator.
> >
> > This allows each device-specific reserved-memory region to enforce its
> > own limits.
> >
> > [1] https://lore.kernel.org/all/20260218-dmabuf-heap-cma-dmem-v2-0-b249886fb7b2…
> > [2] https://lore.kernel.org/all/20250310-dmem-cgroups-v1-0-2984c1bc9312@kernel.…
> > ---
> > drivers/dma-buf/heaps/Kconfig | 17 ++
> > drivers/dma-buf/heaps/Makefile | 1 +
> > drivers/dma-buf/heaps/coherent_heap.c | 485 ++++++++++++++++++++++++++++++++++
> > include/linux/dma-heap.h | 11 +
> > kernel/dma/coherent.c | 9 +
> > 5 files changed, 523 insertions(+)
> >
> > diff --git a/drivers/dma-buf/heaps/Kconfig b/drivers/dma-buf/heaps/Kconfig
> > index a5eef06c42264..93765dca164e3 100644
> > --- a/drivers/dma-buf/heaps/Kconfig
> > +++ b/drivers/dma-buf/heaps/Kconfig
> > @@ -12,3 +12,20 @@ config DMABUF_HEAPS_CMA
> > Choose this option to enable dma-buf CMA heap. This heap is backed
> > by the Contiguous Memory Allocator (CMA). If your system has these
> > regions, you should say Y here.
> > +
> > +config DMABUF_HEAPS_COHERENT
> > + bool "DMA-BUF Coherent Reserved-Memory Heap"
> > + depends on DMABUF_HEAPS && OF_RESERVED_MEM && DMA_DECLARE_COHERENT
> > + help
> > + Choose this option to enable coherent reserved-memory dma-buf heaps.
> > + This heap is backed by non-reusable DT "shared-dma-pool" regions.
> > + If your system defines coherent reserved-memory regions, you should
> > + say Y here.
> > +
> > +config COHERENT_AREAS_DEFERRED
> > + int "Max deferred coherent reserved-memory regions"
> > + depends on DMABUF_HEAPS_COHERENT
> > + default 16
> > + help
> > + Maximum number of coherent reserved-memory regions that can be
> > + deferred for later registration during early boot.
> > diff --git a/drivers/dma-buf/heaps/Makefile b/drivers/dma-buf/heaps/Makefile
> > index 974467791032f..96bda7a65f041 100644
> > --- a/drivers/dma-buf/heaps/Makefile
> > +++ b/drivers/dma-buf/heaps/Makefile
> > @@ -1,3 +1,4 @@
> > # SPDX-License-Identifier: GPL-2.0
> > obj-$(CONFIG_DMABUF_HEAPS_SYSTEM) += system_heap.o
> > obj-$(CONFIG_DMABUF_HEAPS_CMA) += cma_heap.o
> > +obj-$(CONFIG_DMABUF_HEAPS_COHERENT) += coherent_heap.o
> > diff --git a/drivers/dma-buf/heaps/coherent_heap.c b/drivers/dma-buf/heaps/coherent_heap.c
> > new file mode 100644
> > index 0000000000000..870b2b89aefcb
> > --- /dev/null
> > +++ b/drivers/dma-buf/heaps/coherent_heap.c
> > @@ -0,0 +1,485 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * DMABUF heap for coherent reserved-memory regions
> > + *
> > + * Copyright (C) 2026 Red Hat, Inc.
> > + * Author: Albert Esteve <aesteve(a)redhat.com>
> > + *
> > + */
> > +
> > +#include <linux/cgroup_dmem.h>
> > +#include <linux/dma-heap.h>
> > +#include <linux/dma-buf.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/err.h>
> > +#include <linux/highmem.h>
> > +#include <linux/iosys-map.h>
> > +#include <linux/of_reserved_mem.h>
> > +#include <linux/platform_device.h>
> > +#include <linux/scatterlist.h>
> > +#include <linux/slab.h>
> > +#include <linux/vmalloc.h>
> > +
> > +#define DEFERRED_AREAS_MAX CONFIG_COHERENT_AREAS_DEFERRED
>
> I'm not sure we need to make it configurable. Distros are going to set
> it to suit the user with the highest number of regions anyway. How about
> using MAX_RESERVED_REGIONS for now?
Makes sense, will do.
>
> >
> > [...]
> >
> > +struct coherent_heap {
> > + struct dma_heap *heap;
> > + struct reserved_mem *rmem;
> > + char *name;
> > + struct device *dev;
> > + struct platform_device *pdev;
> > +#if IS_ENABLED(CONFIG_CGROUP_DMEM)
> > + struct dmem_cgroup_region *cg;
> > +#endif
>
> We might want to leave the dmem accounting out for now so we can focus
> on the heap itself. That being said, it ended up being pretty trivial
> for CMA, so maybe it's not too much of a concern.
Sure. That allows us to follow the same patterns once the CMA series lands.
I will strip all dmem accounting parts for v2.
>
> >
> > [...]
> >
> > +static int __coherent_heap_register(struct reserved_mem *rmem)
> > +{
> > + struct dma_heap_export_info exp_info;
> > + struct coherent_heap *coh_heap;
> > +#if IS_ENABLED(CONFIG_CGROUP_DMEM)
> > + struct dmem_cgroup_region *region;
> > +#endif
> > + const char *rmem_name;
> > + int ret;
> > +
> > + if (!rmem)
> > + return -EINVAL;
> > +
> > + rmem_name = rmem->name ? rmem->name : "unknown";
>
> If the reserved region has no name, we probably shouldn't expose it to
> userspace at all. Using unknown will probably create some bugs if you
> have several, but also it's pretty likely to have a name at some point and
> thus we wouldn't have a stable name for the uAPI.
Agreed. I will return an error code if no name is present.
>
> > + coh_heap = kzalloc_obj(*coh_heap);
> > + if (!coh_heap)
> > + return -ENOMEM;
> > +
> > + coh_heap->name = kasprintf(GFP_KERNEL, "coherent_%s", rmem_name);
> > + if (!coh_heap->name) {
> > + ret = -ENOMEM;
> > + goto free_coherent_heap;
> > + }
>
> Similarly, we shouldn't use the coherent_ prefix for the heap name. If
> the backing allocator ever changes (and between contiguous and coherent,
> the difference is just a single property value in the DT), then the name
> would change and userspace would break. We should directly use the name
> of the region here.
>
> > + coh_heap->rmem = rmem;
> > +
> > + /* create a platform device per rmem and bind it */
> > + coh_heap->pdev = platform_device_register_simple("coherent-heap",
> > + PLATFORM_DEVID_AUTO,
> > + NULL, 0);
> > + if (IS_ERR(coh_heap->pdev)) {
> > + ret = PTR_ERR(coh_heap->pdev);
> > + goto free_name;
> > + }
>
> We probably should use a faux_device here instead of a platform_device,
> but more importantly, the heap itself already has a device allocated for
> it (dev_ret in dma_heap_add).
>
> It's not in struct dma_heap yet, but there's a patch that moves it there
> that we should probably carry:
> https://lore.kernel.org/r/20210120210937.15069-2-john.stultz@linaro.org/
Thanks for sharing the link! I will pick the patch.
>
> > + if (rmem->ops && rmem->ops->device_init) {
> > + ret = rmem->ops->device_init(rmem, &coh_heap->pdev->dev);
> > + if (ret)
> > + goto pdev_unregister;
> > + }
>
> I'm not really a fan of calling ops directly. Maybe we should create an
> of_reserved_mem_device_init_with_mem function that would do it for us
> (and would be called by of_reserved_mem_device_init_by_idx and the
> likes).
Agreed.
>
> > + coh_heap->dev = &coh_heap->pdev->dev;
> > +#if IS_ENABLED(CONFIG_CGROUP_DMEM)
> > + region = dmem_cgroup_register_region(rmem->size, "coh/%s", rmem_name);
> > + if (IS_ERR(region)) {
> > + ret = PTR_ERR(region);
> > + goto pdev_unregister;
> > + }
> > + coh_heap->cg = region;
> > +#endif
>
> Same comment than for CMA here: it should really be created by the
> coherent memory region itself.
>
> > + exp_info.name = coh_heap->name;
> > + exp_info.ops = &coherent_heap_ops;
> > + exp_info.priv = coh_heap;
> > +
> > + coh_heap->heap = dma_heap_add(&exp_info);
> > + if (IS_ERR(coh_heap->heap)) {
> > + ret = PTR_ERR(coh_heap->heap);
> > + goto cg_unregister;
> > + }
> > +
> > + return 0;
> > +
> > +cg_unregister:
> > +#if IS_ENABLED(CONFIG_CGROUP_DMEM)
> > + dmem_cgroup_unregister_region(coh_heap->cg);
> > +#endif
> > +pdev_unregister:
> > + platform_device_unregister(coh_heap->pdev);
> > + coh_heap->pdev = NULL;
> > +free_name:
> > + kfree(coh_heap->name);
> > +free_coherent_heap:
> > + kfree(coh_heap);
> > +
> > + return ret;
> > +}
> > +
> > +int dma_heap_coherent_register(struct reserved_mem *rmem)
> > +{
> > + int ret;
> > +
> > + ret = __coherent_heap_register(rmem);
> > + if (ret == -ENOMEM)
> > + return coherent_heap_add_deferred(rmem);
> > + return ret;
> > +}
>
> I appreciate you did it like we did for CMA, but if we ever want to make
> that heap a module you'll end up with a dependency from the core kernel
> on a module which doesn't work. The best here might be to wait a bit
> until we have somewhat of an agreement on
>
> https://lore.kernel.org/r/20260225-dma-buf-heaps-as-modules-v1-0-2109225a09…
>
> > +static int __init coherent_heap_register_deferred(void)
> > +{
> > +	unsigned int i;
> > +	int ret;
> > +
> > +	for (i = 0; i < rmem_areas_deferred_num; i++) {
> > +		struct reserved_mem *rmem = rmem_areas_deferred[i];
> > +
> > +		ret = __coherent_heap_register(rmem);
> > +		if (ret) {
> > +			pr_warn("Failed to add coherent heap %s\n",
> > +				rmem->name ? rmem->name : "unknown");
> > +			continue;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +late_initcall(coherent_heap_register_deferred);
>
> Why do you need a late_initcall here? Isn't module_init enough?
When I tested this initially, I relied on direct invocations from
cma/coherent.c to register new coherent heap areas. However, that
failed in the `kzalloc_obj` calls within the register function. Then I
read the article about boot-time memory management [1] and saw that
other drivers collect info for deferred initialization and replay it at
late_initcall() time, similar to what I tried to do here. I honestly
did not try module_init(). Since I will refactor this part to address
your previous comments, I will try module_init() then if possible.
[1] https://docs.kernel.org/core-api/boot-time-mm.html
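To make the deferred path above concrete, here is a self-contained
userspace model of the pattern I am using: registrations that arrive
before the allocators are up get parked in a static array and replayed
later (in the kernel, from the late_initcall() above). The array size,
coherent_heap_add_deferred(), and the replay helper are assumptions
mirroring the patch, not kernel API:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_DEFERRED_AREAS 16

struct reserved_mem { const char *name; };

/* Early registrations parked here until allocation is possible. */
static struct reserved_mem *rmem_areas_deferred[MAX_DEFERRED_AREAS];
static unsigned int rmem_areas_deferred_num;

/* Called when __coherent_heap_register() fails with -ENOMEM. */
static int coherent_heap_add_deferred(struct reserved_mem *rmem)
{
	if (rmem_areas_deferred_num >= MAX_DEFERRED_AREAS)
		return -ENOSPC;
	rmem_areas_deferred[rmem_areas_deferred_num++] = rmem;
	return 0;
}

/*
 * Replay hook: in the patch this loop lives in
 * coherent_heap_register_deferred() and runs at late_initcall() time.
 * Returns how many parked regions registered successfully.
 */
static unsigned int
coherent_heap_replay_deferred(int (*reg)(struct reserved_mem *))
{
	unsigned int i, ok = 0;

	for (i = 0; i < rmem_areas_deferred_num; i++)
		if (reg(rmem_areas_deferred[i]) == 0)
			ok++;
	return ok;
}

/* Stand-in for __coherent_heap_register() once allocation works. */
static int fake_register(struct reserved_mem *rmem)
{
	(void)rmem;
	return 0;
}
```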
>
> > +MODULE_DESCRIPTION("DMA-BUF heap for coherent reserved-memory regions");
> > diff --git a/include/linux/dma-heap.h b/include/linux/dma-heap.h
> > index 648328a64b27e..e894cfa1ecf1a 100644
> > --- a/include/linux/dma-heap.h
> > +++ b/include/linux/dma-heap.h
> > @@ -9,9 +9,11 @@
> > #ifndef _DMA_HEAPS_H
> > #define _DMA_HEAPS_H
> >
> > +#include <linux/errno.h>
> > #include <linux/types.h>
> >
> > struct dma_heap;
> > +struct reserved_mem;
> >
> > /**
> > * struct dma_heap_ops - ops to operate on a given heap
> > @@ -48,4 +50,13 @@ struct dma_heap *dma_heap_add(const struct dma_heap_export_info *exp_info);
> >
> > extern bool mem_accounting;
> >
> > +#if IS_ENABLED(CONFIG_DMABUF_HEAPS_COHERENT)
> > +int dma_heap_coherent_register(struct reserved_mem *rmem);
> > +#else
> > +static inline int dma_heap_coherent_register(struct reserved_mem *rmem)
> > +{
> > +	return -EOPNOTSUPP;
> > +}
> > +#endif
> > +
> > #endif /* _DMA_HEAPS_H */
> > diff --git a/kernel/dma/coherent.c b/kernel/dma/coherent.c
> > index 1147497bc512c..f49d13e460e4b 100644
> > --- a/kernel/dma/coherent.c
> > +++ b/kernel/dma/coherent.c
> > @@ -9,6 +9,7 @@
> > #include <linux/module.h>
> > #include <linux/dma-direct.h>
> > #include <linux/dma-map-ops.h>
> > +#include <linux/dma-heap.h>
> >
> > struct dma_coherent_mem {
> > 	void *virt_base;
> > @@ -393,6 +394,14 @@ static int __init rmem_dma_setup(struct reserved_mem *rmem)
> > 	rmem->ops = &rmem_dma_ops;
> > 	pr_info("Reserved memory: created DMA memory pool at %pa, size %ld MiB\n",
> > 		&rmem->base, (unsigned long)rmem->size / SZ_1M);
> > +
> > +	if (IS_ENABLED(CONFIG_DMABUF_HEAPS_COHERENT)) {
> > +		int ret = dma_heap_coherent_register(rmem);
> > +
> > +		if (ret)
> > +			pr_warn("Reserved memory: failed to register coherent heap for %s (%d)\n",
> > +				rmem->name ? rmem->name : "unknown", ret);
> > +	}
>
> I think this should be split into a patch of its own. It's going to be
> reviewed (and possibly merged) by another maintainer, through another
> tree.
That's fine, I will split it into a separate series then. I guess I can
send it with a Based-On: tag or similar to link the two series together?
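As a side note on the dma-heap.h hunk above: with the IS_ENABLED() stub,
callers compile against either the real dma_heap_coherent_register() or
the static inline returning -EOPNOTSUPP, so no #ifdef is needed at the
call site. A minimal userspace model of that compile-time switch
(COHERENT_ENABLED standing in for the Kconfig symbol is my assumption):

```c
#include <assert.h>
#include <errno.h>

/* Flip to 0 to model CONFIG_DMABUF_HEAPS_COHERENT being disabled. */
#define COHERENT_ENABLED 1

struct reserved_mem { const char *name; };

#if COHERENT_ENABLED
/* Real implementation compiled in when the feature is enabled. */
static int dma_heap_coherent_register(struct reserved_mem *rmem)
{
	(void)rmem;
	return 0;	/* heap registration would happen here */
}
#else
/* Stub used when the feature is compiled out. */
static inline int dma_heap_coherent_register(struct reserved_mem *rmem)
{
	(void)rmem;
	return -EOPNOTSUPP;
}
#endif
```

Callers like rmem_dma_setup() then just check the return value, and the
disabled case costs nothing at runtime.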
BR,
Albert
>
> Maxime
On Fri, Feb 27, 2026 at 08:45:29AM +0200, Daniel Baluta wrote:
> On 2/26/26 20:20, Conor Dooley wrote:
> [..]
> >> +  - |
> >> +    #include <dt-bindings/interrupt-controller/arm-gic.h>
> >> +    #include <dt-bindings/interrupt-controller/irq.h>
> >> +
> >> +    bus {
> >> +        #address-cells = <2>;
> >> +        #size-cells = <2>;
> >> +
> >> +        neutron@4ab00000 {
> > "neutron" is not a generic node name. This should be something like
> > "accelerator" or similar.
> >
> The only dts nodes I could find using the accel subsystem are from Rockchip, and they use npu@.
>
> e.g:
>
> 	rknn_core_0: npu@fdab0000 {
> 		compatible = "rockchip,rk3588-rknn-core";
>
> Also, Ethos-U64 introduced by Rob with [1] is using npu@.
>
> So, I think we should go with that. I haven't seen any document standardizing the naming.
accelerator, npu, makes no difference to me, so sure.