Changelog:
v5:
* Documented the DMA-BUF expectations around DMA unmap.
* Added wait support in VFIO for DMA unmap.
* Reordered patches.
* Improved commit messages to document even more.
v4: https://lore.kernel.org/all/20260121-dmabuf-revoke-v4-0-d311cbc8633d@nvidia…
* Changed DMA_RESV_USAGE_KERNEL to DMA_RESV_USAGE_BOOKKEEP.
* Made .invalidate_mapping() truly optional.
* Added patch which renames dma_buf_move_notify() to be
dma_buf_invalidate_mappings().
* Restored dma_buf_attachment_is_dynamic() function.
v3: https://lore.kernel.org/all/20260120-dmabuf-revoke-v3-0-b7e0b07b8214@nvidia…
* Used Jason's wordings for commits and cover letter.
* Removed IOMMUFD patch.
* Renamed dma_buf_attachment_is_revoke() to be dma_buf_attach_revocable().
* Added patch to remove CONFIG_DMABUF_MOVE_NOTIFY.
* Added Reviewed-by tags.
* Called dma_resv_wait_timeout() after dma_buf_move_notify() in VFIO.
* Added dma_buf_attach_revocable() check to VFIO DMABUF attach function.
* Slightly changed commit messages.
v2: https://patch.msgid.link/20260118-dmabuf-revoke-v2-0-a03bb27c0875@nvidia.com
* Changed series to document the revoke semantics instead of
implementing it.
v1: https://patch.msgid.link/20260111-dmabuf-revoke-v1-0-fb4bcc8c259b@nvidia.com
-------------------------------------------------------------------------
This series is based on the latest VFIO fix, which will be sent to Linus
very soon.
https://lore.kernel.org/all/20260121-vfio-add-pin-v1-1-4e04916b17f1@nvidia.…
Thanks
-------------------------------------------------------------------------
This series documents a dma-buf “revoke” mechanism: it allows a dma-buf
exporter to explicitly invalidate (“kill”) a shared buffer after it has
been distributed to importers, so that further CPU and device access is
prevented and importers reliably observe failure.
The change in this series is to properly document and use the existing core
“revoked” state on the dma-buf object and a corresponding exporter-triggered
revoke operation.
dma-buf has quietly allowed calling move_notify on pinned dma-bufs, even
though legacy importers using dma_buf_attach() would simply ignore
these calls.
The intention was that move_notify() would tell the importer to expedite
its unmapping process, and once the importer is fully finished with DMA it
would unmap the dma-buf, which finally signals that the importer is no
longer ever going to touch the memory again. Importers that touch past
their unmap() call can trigger IOMMU errors, AER and beyond; however,
read-and-discard access between move_notify() and unmap is allowed.
Thus, we can define the exporter's revoke sequence for a pinned dma-buf as:

	dma_resv_lock(dmabuf->resv, NULL);
	// Prevent new mappings from being established
	priv->revoked = true;
	// Tell all importers to eventually unmap
	dma_buf_invalidate_mappings(dmabuf);
	// Wait for any in-progress fences on the old mapping
	dma_resv_wait_timeout(dmabuf->resv,
			      DMA_RESV_USAGE_BOOKKEEP, false,
			      MAX_SCHEDULE_TIMEOUT);
	dma_resv_unlock(dmabuf->resv);
	// Wait for all importers to complete unmap
	wait_for_completion(&priv->unmap_comp);
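For completeness, the matching importer side could look roughly like the
sketch below; all my_importer_* names are made up for illustration, only
the dma-buf and dma_resv calls are real:

	/*
	 * .invalidate_mapping() (the renamed move_notify()) only kicks off
	 * teardown; the dma_resv lock is already held by the exporter.
	 */
	static void my_importer_invalidate(struct dma_buf_attachment *attach)
	{
		struct my_importer *imp = attach->importer_priv;

		dma_resv_assert_held(attach->dmabuf->resv);
		my_importer_quiesce_hw(imp);	/* stop issuing new DMA */
	}

	/*
	 * Later, from the importer's own teardown context, the final unmap
	 * tells the exporter the device will never touch the memory again.
	 */
	static void my_importer_teardown(struct my_importer *imp)
	{
		dma_resv_lock(imp->attach->dmabuf->resv, NULL);
		dma_buf_unmap_attachment(imp->attach, imp->sgt, DMA_BIDIRECTIONAL);
		dma_resv_unlock(imp->attach->dmabuf->resv);
	}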
However, dma-buf also supports importers that don't do anything on
move_notify(), and will not unmap the buffer in bounded time.
Since such importers would cause the above sequence to hang, a new
mechanism is needed to detect incompatible importers.
Introduce dma_buf_attach_revocable(), which, when true, indicates that the
above sequence is safe to use and will complete in kernel-only bounded time
for this attachment.
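A minimal sketch of how such an exporter could use it at attach time
(assuming dma_buf_attach_revocable() takes the attachment and returns bool;
the error code is only an example):

	static int my_exporter_attach(struct dma_buf *dmabuf,
				      struct dma_buf_attachment *attach)
	{
		/* Refuse importers that cannot honor the revoke sequence above */
		if (!dma_buf_attach_revocable(attach))
			return -EOPNOTSUPP;

		return 0;
	}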
Unfortunately dma_buf_attach_revocable() is going to fail for the popular
RDMA pinned importer, which means we cannot introduce it to existing
places using pinned move_notify() without potentially breaking existing
userspace flows.
Existing exporters that only trigger this flow for RAS errors should not
call dma_buf_attach_revocable() and will suffer an unbounded block on the
final completion, hoping that userspace will notice the RAS event and clean
things up. Without revoke support in the RDMA pinned importers, no other
non-breaking option currently seems possible.
For new exporters, like VFIO and RDMA, that have userspace-triggered
revoke events, the unbounded sleep would not be acceptable. They can call
dma_buf_attach_revocable() and will not work with the RDMA pinned importer
from day 0, preventing regressions.
In the process add documentation explaining the above details.
Thanks
Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com>
---
Leon Romanovsky (8):
dma-buf: Rename .move_notify() callback to a clearer identifier
dma-buf: Rename dma_buf_move_notify() to dma_buf_invalidate_mappings()
dma-buf: Always build with DMABUF_MOVE_NOTIFY
vfio: Wait for dma-buf invalidation to complete
dma-buf: Make .invalidate_mapping() truly optional
dma-buf: Add dma_buf_attach_revocable()
vfio: Permit VFIO to work with pinned importers
iommufd: Add dma_buf_pin()
drivers/dma-buf/Kconfig | 12 ----
drivers/dma-buf/dma-buf.c | 69 +++++++++++++++++-----
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 14 ++---
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 2 +-
drivers/gpu/drm/amd/amdkfd/Kconfig | 2 +-
drivers/gpu/drm/virtio/virtgpu_prime.c | 2 +-
drivers/gpu/drm/xe/tests/xe_dma_buf.c | 7 +--
drivers/gpu/drm/xe/xe_bo.c | 2 +-
drivers/gpu/drm/xe/xe_dma_buf.c | 14 ++---
drivers/infiniband/core/umem_dmabuf.c | 13 -----
drivers/infiniband/hw/mlx5/mr.c | 2 +-
drivers/iommu/iommufd/pages.c | 11 +++-
drivers/iommu/iommufd/selftest.c | 2 +-
drivers/vfio/pci/vfio_pci_dmabuf.c | 90 +++++++++++++++++++++++------
include/linux/dma-buf.h | 17 +++---
15 files changed, 164 insertions(+), 95 deletions(-)
---
base-commit: 61ceaf236115f20f4fdd7cf60f883ada1063349a
change-id: 20251221-dmabuf-revoke-b90ef16e4236
Best regards,
--
Leon Romanovsky <leonro(a)nvidia.com>
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
> Bear with me if it's an ignorant question.
>
> The commit msg of patch6 says that VFIO doesn't tolerate unbounded
> wait, which is the reason behind the 2nd timeout wait here.
As far as I understand dmabuf design a fence wait should complete
eventually under kernel control, because these sleeps are
sprinkled all around the kernel today.
I suspect that is not actually true for every HW, probably something
like "shader programs can run forever technically".
We can argue if those cases should not report revocable either, but at
least this will work "correctly" even if it takes a huge amount of
time.
I wouldn't mind seeing a shorter timeout and print on the fence too
just in case.
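Roughly something like this fragment, reusing the names from the quoted
VFIO function (the timeout value and message are placeholders, not part of
the posted series):

	long ret;

	ret = dma_resv_wait_timeout(priv->dmabuf->resv,
				    DMA_RESV_USAGE_BOOKKEEP, false,
				    secs_to_jiffies(30));
	if (ret <= 0)
		dev_warn(&vdev->pdev->dev,
			 "dma-buf fences did not complete before revoke\n");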
Jason
On Thu, Jan 29, 2026 at 08:13:18AM +0000, Tian, Kevin wrote:
> > From: Leon Romanovsky <leon(a)kernel.org>
> > Sent: Thursday, January 29, 2026 3:34 PM
> >
> > On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg(a)ziepe.ca>
> > > > Sent: Wednesday, January 28, 2026 12:28 AM
> > > >
> > > > On Tue, Jan 27, 2026 at 10:58:35AM +0200, Leon Romanovsky wrote:
> > > > > > > @@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct
> > > > vfio_pci_core_device *vdev, bool revoked)
> > > > > > > dma_resv_lock(priv->dmabuf->resv, NULL);
> > > > > > > priv->revoked = revoked;
> > > > > > > dma_buf_invalidate_mappings(priv-
> > >dmabuf);
> > > > > > > + dma_resv_wait_timeout(priv->dmabuf->resv,
> > > > > > > +
> > DMA_RESV_USAGE_BOOKKEEP,
> > > > false,
> > > > > > > +
> > MAX_SCHEDULE_TIMEOUT);
> > > > > > > dma_resv_unlock(priv->dmabuf->resv);
> > > > > > > + if (revoked) {
> > > > > > > + kref_put(&priv->kref,
> > > > vfio_pci_dma_buf_done);
> > > > > > > + /* Let's wait till all DMA unmap are
> > > > completed. */
> > > > > > > + wait = wait_for_completion_timeout(
> > > > > > > + &priv->comp,
> > secs_to_jiffies(1));
> > > > > >
> > > > > > Is the 1-second constant sufficient for all hardware, or should the
> > > > > > invalidate_mappings() contract require the callback to block until
> > > > > > speculative reads are strictly fenced? I'm wondering about a case
> > where
> > > > > > a device's firmware has a high response latency, perhaps due to
> > internal
> > > > > > management tasks like error recovery or thermal and it exceeds the
> > 1s
> > > > > > timeout.
> > > > > >
> > > > > > If the device is in the middle of a large DMA burst and the firmware is
> > > > > > slow to flush the internal pipelines to a fully "quiesced"
> > > > > > read-and-discard state, reclaiming the memory at exactly 1.001
> > seconds
> > > > > > risks triggering platform-level faults..
> > > > > >
> > > > > > Since we explicitly permit these speculative reads until unmap is
> > > > > > complete, relying on a hardcoded timeout in the exporter seems to
> > > > > > introduce a hardware-dependent race condition that could
> > compromise
> > > > > > system stability via IOMMU errors or AER faults.
> > > > > >
> > > > > > Should the importer instead be required to guarantee that all
> > > > > > speculative access has ceased before the invalidation call returns?
> > > > >
> > > > > It is guaranteed by the dma_resv_wait_timeout() call above. That call
> > > > ensures
> > > > > that the hardware has completed all pending operations. The 1‑second
> > > > delay is
> > > > > meant to catch cases where an in-kernel DMA unmap call is missing,
> > which
> > > > should
> > > > > not trigger any DMA activity at that point.
> > > >
> > > > Christian may know actual examples, but my general feeling is he was
> > > > worrying about drivers that have pushed the DMABUF to visibility on
> > > > the GPU and the move notify & fences only shoot down some access. So
> > > > it has to wait until the DMABUF is finally unmapped.
> > > >
> > > > Pranjal's example should be covered by the driver adding a fence and
> > > > then the unbounded fence wait will complete it.
> > > >
> > >
> > > Bear with me if it's an ignorant question.
> > >
> > > The commit msg of patch6 says that VFIO doesn't tolerate unbounded
> > > wait, which is the reason behind the 2nd timeout wait here.
> >
> > It is not accurate. A second timeout is present both in the
> > description of patch 6 and in the VFIO implementation. The difference is
> > that the timeout is enforced within VFIO.
> >
> > >
> > > Then why is "the unbounded fence wait" not a problem in the same
> > > code path? the use of MAX_SCHEDULE_TIMEOUT imply a worst-case
> > > timeout in hundreds of years...
> >
> > "An unbounded fence wait" is a different class of wait. It indicates broken
> > hardware that continues to issue DMA transactions even after it has been
> > told to
> > stop.
> >
> > The second wait exists to catch software bugs or misuse, where the dma-buf
> > importer has misrepresented its capabilities.
> >
>
> Okay I see.
>
> > >
> > > and it'd be helpful to put some words in the code based on what's
> > > discussed here.
> >
> > We've documented as much as we can in dma_buf_attach_revocable() and
> > dma_buf_invalidate_mappings(). Do you have any suggestions on what else
> > should be added here?
> >
>
> the selection of 1s?
It is indirectly written in the description of the WARN_ON(), but let's
add more. What about the following?
diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
index 93795ad2e025..948ba75288c6 100644
--- a/drivers/vfio/pci/vfio_pci_dmabuf.c
+++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
@@ -357,7 +357,13 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
dma_resv_unlock(priv->dmabuf->resv);
if (revoked) {
kref_put(&priv->kref, vfio_pci_dma_buf_done);
- /* Let's wait till all DMA unmap are completed. */
+ /*
+ * Let's wait for 1 second till all DMA unmap
+ * are completed. It is supposed to catch dma-buf
+ * importers which lied about their support
+ * of dmabuf revoke. See dma_buf_invalidate_mappings()
+ * for the expected behaviour.
+ */
wait = wait_for_completion_timeout(
&priv->comp, secs_to_jiffies(1));
/*
>
> then,
>
> Reviewed-by: Kevin Tian <kevin.tian(a)intel.com>
Thanks
On Thu, Jan 29, 2026 at 07:06:37AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg(a)ziepe.ca>
> > Sent: Wednesday, January 28, 2026 12:28 AM
> >
> > On Tue, Jan 27, 2026 at 10:58:35AM +0200, Leon Romanovsky wrote:
> > > > > @@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct
> > vfio_pci_core_device *vdev, bool revoked)
> > > > > dma_resv_lock(priv->dmabuf->resv, NULL);
> > > > > priv->revoked = revoked;
> > > > > dma_buf_invalidate_mappings(priv->dmabuf);
> > > > > + dma_resv_wait_timeout(priv->dmabuf->resv,
> > > > > + DMA_RESV_USAGE_BOOKKEEP,
> > false,
> > > > > + MAX_SCHEDULE_TIMEOUT);
> > > > > dma_resv_unlock(priv->dmabuf->resv);
> > > > > + if (revoked) {
> > > > > + kref_put(&priv->kref,
> > vfio_pci_dma_buf_done);
> > > > > + /* Let's wait till all DMA unmap are
> > completed. */
> > > > > + wait = wait_for_completion_timeout(
> > > > > + &priv->comp, secs_to_jiffies(1));
> > > >
> > > > Is the 1-second constant sufficient for all hardware, or should the
> > > > invalidate_mappings() contract require the callback to block until
> > > > speculative reads are strictly fenced? I'm wondering about a case where
> > > > a device's firmware has a high response latency, perhaps due to internal
> > > > management tasks like error recovery or thermal and it exceeds the 1s
> > > > timeout.
> > > >
> > > > If the device is in the middle of a large DMA burst and the firmware is
> > > > slow to flush the internal pipelines to a fully "quiesced"
> > > > read-and-discard state, reclaiming the memory at exactly 1.001 seconds
> > > > risks triggering platform-level faults..
> > > >
> > > > Since we explicitly permit these speculative reads until unmap is
> > > > complete, relying on a hardcoded timeout in the exporter seems to
> > > > introduce a hardware-dependent race condition that could compromise
> > > > system stability via IOMMU errors or AER faults.
> > > >
> > > > Should the importer instead be required to guarantee that all
> > > > speculative access has ceased before the invalidation call returns?
> > >
> > > It is guaranteed by the dma_resv_wait_timeout() call above. That call
> > ensures
> > > that the hardware has completed all pending operations. The 1‑second
> > delay is
> > > meant to catch cases where an in-kernel DMA unmap call is missing, which
> > should
> > > not trigger any DMA activity at that point.
> >
> > Christian may know actual examples, but my general feeling is he was
> > worrying about drivers that have pushed the DMABUF to visibility on
> > the GPU and the move notify & fences only shoot down some access. So
> > it has to wait until the DMABUF is finally unmapped.
> >
> > Pranjal's example should be covered by the driver adding a fence and
> > then the unbounded fence wait will complete it.
> >
>
> Bear with me if it's an ignorant question.
>
> The commit msg of patch6 says that VFIO doesn't tolerate unbounded
> wait, which is the reason behind the 2nd timeout wait here.
It is not accurate. A second timeout is present both in the
description of patch 6 and in the VFIO implementation. The difference is
that the timeout is enforced within VFIO.
>
> Then why is "the unbounded fence wait" not a problem in the same
> code path? the use of MAX_SCHEDULE_TIMEOUT imply a worst-case
> timeout in hundreds of years...
"An unbounded fence wait" is a different class of wait. It indicates broken
hardware that continues to issue DMA transactions even after it has been told to
stop.
The second wait exists to catch software bugs or misuse, where the dma-buf
importer has misrepresented its capabilities.
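To put the two waits side by side (code as in the patch, with comments
added here only to summarize this discussion):

	/* Wait 1: bounded only by the fences themselves; it blocks "forever"
	 * only if broken hardware never signals its fences. */
	dma_resv_wait_timeout(priv->dmabuf->resv, DMA_RESV_USAGE_BOOKKEEP,
			      false, MAX_SCHEDULE_TIMEOUT);

	/* Wait 2: 1s cap enforced by VFIO; it catches importers that claimed
	 * revoke support but never unmap after dma_buf_invalidate_mappings(). */
	wait = wait_for_completion_timeout(&priv->comp, secs_to_jiffies(1));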
>
> and it'd be helpful to put some words in the code based on what's
> discussed here.
We've documented as much as we can in dma_buf_attach_revocable() and
dma_buf_invalidate_mappings(). Do you have any suggestions on what else
should be added here?
Thanks
On Mon, Jan 26, 2026 at 08:53:57PM +0000, Pranjal Shrivastava wrote:
> On Sat, Jan 24, 2026 at 09:14:16PM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro(a)nvidia.com>
> >
> > dma-buf invalidation is handled asynchronously by the hardware, so VFIO
> > must wait until all affected objects have been fully invalidated.
> >
> > In addition, the dma-buf exporter is expecting that all importers unmap any
> > buffers they previously mapped.
> >
> > Fixes: 5d74781ebc86 ("vfio/pci: Add dma-buf export support for MMIO regions")
> > Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com>
> > ---
> > drivers/vfio/pci/vfio_pci_dmabuf.c | 71 ++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 68 insertions(+), 3 deletions(-)
<...>
> > @@ -333,7 +359,37 @@ void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> > dma_resv_lock(priv->dmabuf->resv, NULL);
> > priv->revoked = revoked;
> > dma_buf_invalidate_mappings(priv->dmabuf);
> > + dma_resv_wait_timeout(priv->dmabuf->resv,
> > + DMA_RESV_USAGE_BOOKKEEP, false,
> > + MAX_SCHEDULE_TIMEOUT);
> > dma_resv_unlock(priv->dmabuf->resv);
> > + if (revoked) {
> > + kref_put(&priv->kref, vfio_pci_dma_buf_done);
> > + /* Let's wait till all DMA unmap are completed. */
> > + wait = wait_for_completion_timeout(
> > + &priv->comp, secs_to_jiffies(1));
>
> Is the 1-second constant sufficient for all hardware, or should the
> invalidate_mappings() contract require the callback to block until
> speculative reads are strictly fenced? I'm wondering about a case where
> a device's firmware has a high response latency, perhaps due to internal
> management tasks like error recovery or thermal and it exceeds the 1s
> timeout.
>
> If the device is in the middle of a large DMA burst and the firmware is
> slow to flush the internal pipelines to a fully "quiesced"
> read-and-discard state, reclaiming the memory at exactly 1.001 seconds
> risks triggering platform-level faults..
>
> Since we explicitly permit these speculative reads until unmap is
> complete, relying on a hardcoded timeout in the exporter seems to
> introduce a hardware-dependent race condition that could compromise
> system stability via IOMMU errors or AER faults.
>
> Should the importer instead be required to guarantee that all
> speculative access has ceased before the invalidation call returns?
It is guaranteed by the dma_resv_wait_timeout() call above. That call ensures
that the hardware has completed all pending operations. The 1‑second delay is
meant to catch cases where an in-kernel DMA unmap call is missing, which should
not trigger any DMA activity at that point.
So yes, one second is more than sufficient.
Thanks
>
> Thanks
> Praan
>
> > + /*
> > + * If you see this WARN_ON, it means that
> > + * importer didn't call unmap in response to
> > + * dma_buf_invalidate_mappings() which is not
> > + * allowed.
> > + */
> > + WARN(!wait,
> > + "Timed out waiting for DMABUF unmap, importer has a broken invalidate_mapping()");
> > + } else {
> > + /*
> > + * Kref is initialize again, because when revoke
> > + * was performed the reference counter was decreased
> > + * to zero to trigger completion.
> > + */
> > + kref_init(&priv->kref);
> > + /*
> > + * There is no need to wait as no mapping was
> > + * performed when the previous status was
> > + * priv->revoked == true.
> > + */
> > + reinit_completion(&priv->comp);
> > + }
> > }
> > fput(priv->dmabuf->file);
> > }
> > @@ -346,6 +402,8 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> >
> > down_write(&vdev->memory_lock);
> > list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> > + unsigned long wait;
> > +
> > if (!get_file_active(&priv->dmabuf->file))
> > continue;
> >
> > @@ -354,7 +412,14 @@ void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> > priv->vdev = NULL;
> > priv->revoked = true;
> > dma_buf_invalidate_mappings(priv->dmabuf);
> > + dma_resv_wait_timeout(priv->dmabuf->resv,
> > + DMA_RESV_USAGE_BOOKKEEP, false,
> > + MAX_SCHEDULE_TIMEOUT);
> > dma_resv_unlock(priv->dmabuf->resv);
> > + kref_put(&priv->kref, vfio_pci_dma_buf_done);
> > + wait = wait_for_completion_timeout(&priv->comp,
> > + secs_to_jiffies(1));
> > + WARN_ON(!wait);
> > vfio_device_put_registration(&vdev->vdev);
> > fput(priv->dmabuf->file);
> > }
> >
> > --
> > 2.52.0
> >
> >
>
Hi everyone,
dma_fences have always lived under the tyranny dictated by the module
lifetime of their issuer, leading to crashes should anybody still hold
a reference to a dma_fence when the module of the issuer is unloaded.
The basic problem is that when buffers are shared between drivers,
dma_fence objects can leak into external drivers and stay there even
after they are signaled. The dma_resv object, for example, only lazily
releases dma_fences.
So what happens is that when the module which originally created the
dma_fence unloads, the dma_fence_ops function table becomes unavailable as
well, and so any attempt to release the fence crashes the system.
Previously various approaches have been discussed, including changing the
locking semantics of the dma_fence callbacks (by me) as well as using the
drm scheduler as an intermediate layer (by Sima) to disconnect dma_fences
from their actual users, but none of them actually solve all problems.
Tvrtko did some really nice prerequisite work by protecting the returned
strings of the dma_fence_ops with RCU. This way dma_fence creators were
able to just wait for an RCU grace period after fence signaling before it
was safe to free those data structures.
Now this patch set goes a step further and protects the whole
dma_fence_ops structure with RCU, so that after the fence signals the
pointer to the dma_fence_ops is set to NULL when neither a wait nor a
release callback is given. All functionality which uses the dma_fence_ops
reference is put inside an RCU critical section, except for the
deprecated issuer-specific wait and of course the optional release
callback.
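To give a rough idea of the consumer side (illustrative only; the exact
annotations and helpers in the series may differ), fence->ops becomes an
RCU-protected pointer that can go NULL once the fence has signaled, so
users dereference it inside an RCU read-side critical section:

	static void example_print_driver(struct dma_fence *fence)
	{
		const struct dma_fence_ops *ops;

		rcu_read_lock();
		ops = rcu_dereference(fence->ops);
		if (ops)
			pr_info("fence from %s\n", ops->get_driver_name(fence));
		else
			pr_info("fence already signaled, ops released\n");
		rcu_read_unlock();
	}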
In addition to the RCU changes, the lock protecting the dma_fence state
previously had to be allocated externally. This set now changes the
functionality to make that external lock optional and allows dma_fences
to use an inline lock and be self-contained.
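For reference, today's pattern looks roughly like the sketch below (the
my_fence names are made up): the spinlock must be supplied to
dma_fence_init() by the issuer; with this set that external lock becomes
optional so the fence can be fully self-contained:

	struct my_fence {
		struct dma_fence base;
		spinlock_t lock;	/* currently has to be passed in explicitly */
	};

	static void my_fence_setup(struct my_fence *f,
				   const struct dma_fence_ops *ops,
				   u64 context, u64 seqno)
	{
		spin_lock_init(&f->lock);
		dma_fence_init(&f->base, ops, &f->lock, context, seqno);
	}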
v4:
Rebased the whole set on upstream changes, especially the cleanup
from Philip in patch "drm/amdgpu: independence for the amdkfd_fence!".
Added two patches which bring the DMA-fence self tests up to date.
The first selftest change removes the mock_wait and so actually starts
testing the default behavior instead of some hacky implementation in the
test. This one was upstreamed independently of this set.
The second drops the mock_fence as well and tests the new RCU and inline
spinlock functionality.
v5:
Rebased on top of drm-misc-next instead of drm-tip; left out all driver
changes for now since those should go through the driver-specific paths
anyway.
Addressed a few more review comments, especially some rebase mess and
typos. And finally fixed one more bug found by AMD's CI system.
v6:
Minor style changes, re-ordered patch #1, dropped the scheduler fence
change for now
Please review and comment,
Christian.
On Mon, Jan 26, 2026 at 08:38:44PM +0000, Pranjal Shrivastava wrote:
> I noticed that Patch 5 removes the invalidate_mappings stub from
> umem_dmabuf.c, effectively making the callback NULL for an RDMA
> importer. Consequently, dma_buf_attach_revocable() (introduced here)
> will return false for these importers.
Yes, that is the intention.
> Since the cover letter mentions that VFIO will use
> dma_buf_attach_revocable() to prevent unbounded waits, this appears to
> effectively block paths like the VFIO-export -> RDMA-import path..
It remains usable with the ODP path and people are using that right
now.
> Given that RDMA is a significant consumer of dma-bufs, are there plans
> to implement proper revocation support in the IB/RDMA core (umem_dmabuf)?
This depends on each HW; they need a way to implement the revoke
semantic. I can't guess what is possible, but I would hope that most
HW could at least do a revoke on a real MR.
E.g. an MR rereg operation to a kernel-owned empty PD is an effective
"revoke", and MR rereg is at least defined by standards, so HW should
implement it.
> It would be good to know if there's a plan for bringing such importers
> into compliance with the new revocation semantics so they can interop
> with VFIO OR are we completely ruling out users like RDMA / IB importing
> any DMABUFs exported by VFIO?
It will be driver-dependent; there is no one-shot update here.
Jason
On Thu, Jan 08, 2026 at 01:11:14PM +0200, Edward Srouji wrote:
> From: Yishai Hadas <yishaih(a)nvidia.com>
>
> Expose DMABUF functionality to userspace through the uverbs interface,
> enabling InfiniBand/RDMA devices to export PCI based memory regions
> (e.g. device memory) as DMABUF file descriptors. This allows
> zero-copy sharing of RDMA memory with other subsystems that support the
> dma-buf framework.
>
> A new UVERBS_OBJECT_DMABUF object type and allocation method were
> introduced.
>
> During allocation, uverbs invokes the driver to supply the
> rdma_user_mmap_entry associated with the given page offset (pgoff).
>
> Based on the returned rdma_user_mmap_entry, uverbs requests the driver
> to provide the corresponding physical-memory details as well as the
> driver’s PCI provider information.
>
> Using this information, dma_buf_export() is called; if it succeeds,
> uobj->object is set to the underlying file pointer returned by the
> dma-buf framework.
>
> The file descriptor number follows the standard uverbs allocation flow,
> but the file pointer comes from the dma-buf subsystem, including its own
> fops and private data.
>
> Because of this, alloc_begin_fd_uobject() must handle cases where
> fd_type->fops is NULL, and both alloc_commit_fd_uobject() and
> alloc_abort_fd_uobject() must account for whether filp->private_data
> exists, since it is only populated after a successful dma_buf_export().
>
> When an mmap entry is removed, uverbs iterates over its associated
> DMABUFs, marks them as revoked, and calls dma_buf_move_notify() so that
> their importers are notified.
>
> The same procedure applies during the disassociate flow; final cleanup
> occurs when the application closes the file.
>
> Signed-off-by: Yishai Hadas <yishaih(a)nvidia.com>
> Signed-off-by: Edward Srouji <edwards(a)nvidia.com>
> ---
> drivers/infiniband/core/Makefile | 1 +
> drivers/infiniband/core/device.c | 2 +
> drivers/infiniband/core/ib_core_uverbs.c | 19 +++
> drivers/infiniband/core/rdma_core.c | 63 ++++----
> drivers/infiniband/core/rdma_core.h | 1 +
> drivers/infiniband/core/uverbs.h | 10 ++
> drivers/infiniband/core/uverbs_std_types_dmabuf.c | 172 ++++++++++++++++++++++
> drivers/infiniband/core/uverbs_uapi.c | 1 +
> include/rdma/ib_verbs.h | 9 ++
> include/rdma/uverbs_types.h | 1 +
> include/uapi/rdma/ib_user_ioctl_cmds.h | 10 ++
> 11 files changed, 263 insertions(+), 26 deletions(-)
<...>
> +static struct sg_table *
> +uverbs_dmabuf_map(struct dma_buf_attachment *attachment,
> + enum dma_data_direction dir)
> +{
> + struct ib_uverbs_dmabuf_file *priv = attachment->dmabuf->priv;
> +
> + dma_resv_assert_held(priv->dmabuf->resv);
> +
> + if (priv->revoked)
> + return ERR_PTR(-ENODEV);
> +
> + return dma_buf_phys_vec_to_sgt(attachment, priv->provider,
> + &priv->phys_vec, 1, priv->phys_vec.len,
> + dir);
> +}
> +
> +static void uverbs_dmabuf_unmap(struct dma_buf_attachment *attachment,
> + struct sg_table *sgt,
> + enum dma_data_direction dir)
> +{
> + dma_buf_free_sgt(attachment, sgt, dir);
> +}
Unfortunately, it is not enough. Exporters should count their
map<->unmap calls and make sure that they are equal.
See this VFIO change https://lore.kernel.org/kvm/20260124-dmabuf-revoke-v5-4-f98fca917e96@nvidia…
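For illustration only, a minimal counting scheme could look like this
(all my_* names are hypothetical; the VFIO patch above is the real
reference): the revoke path drops the base reference and then waits on
priv->unmapped until every importer has actually unmapped:

	struct my_export {
		struct kref maps;		/* one base ref + one per live mapping */
		struct completion unmapped;
		bool revoked;
	};

	static void my_last_unmap(struct kref *kref)
	{
		struct my_export *priv = container_of(kref, struct my_export, maps);

		complete(&priv->unmapped);
	}

	static struct sg_table *my_map(struct dma_buf_attachment *attach,
				       enum dma_data_direction dir)
	{
		struct my_export *priv = attach->dmabuf->priv;
		struct sg_table *sgt;

		dma_resv_assert_held(attach->dmabuf->resv);
		if (priv->revoked)
			return ERR_PTR(-ENODEV);

		sgt = my_build_sgt(attach, dir);	/* driver specific */
		if (!IS_ERR(sgt))
			kref_get(&priv->maps);
		return sgt;
	}

	static void my_unmap(struct dma_buf_attachment *attach,
			     struct sg_table *sgt,
			     enum dma_data_direction dir)
	{
		struct my_export *priv = attach->dmabuf->priv;

		my_free_sgt(attach, sgt, dir);		/* driver specific */
		kref_put(&priv->maps, my_last_unmap);
	}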
Thanks
On Fri, Oct 31, 2025 at 05:15:32AM +0000, Kasireddy, Vivek wrote:
> > So the next steps would be to make all the exporters directly declare
> > a SGT and then remove the SGT related ops from dma_ops itself and
> > remove the compat sgt in the attach logic. This is not hard, it is all
> > simple mechanical work.
> IMO, this SGT compatibility stuff should ideally be a separate follow-on
> effort (and patch series) that would also probably include updates to
> various drivers to add the SGT mapping type.
I've been working on this idea and have updated my github here:
https://github.com/jgunthorpe/linux/commits/dmabuf_map_type/
I still need to run it through what testing I can do here, but it goes
all the way and converts everything into SGT mapping type, all
drivers. I think this shows the idea works.
I'm hoping to post it next week if the revoke thing settles down and I
can complete some more checking.
We can discuss how to break it up, along with getting feedback on whether
people are happy with the idea.
It looks like it turns out fairly well, I didn't find anything
surprising along the way at least.
Thanks,
Jason
Changelog:
v3:
* Used Jason's wordings for commits and cover letter.
* Removed IOMMUFD patch.
* Renamed dma_buf_attachment_is_revoke() to be dma_buf_attach_revocable().
* Added patch to remove CONFIG_DMABUF_MOVE_NOTIFY.
* Added Reviewed-by tags.
* Called dma_resv_wait_timeout() after dma_buf_move_notify() in VFIO.
* Added dma_buf_attach_revocable() check to VFIO DMABUF attach function.
* Slightly changed commit messages.
v2: https://patch.msgid.link/20260118-dmabuf-revoke-v2-0-a03bb27c0875@nvidia.com
* Changed series to document the revoke semantics instead of
implementing it.
v1: https://patch.msgid.link/20260111-dmabuf-revoke-v1-0-fb4bcc8c259b@nvidia.com
-------------------------------------------------------------------------
This series documents a dma-buf “revoke” mechanism: it allows a dma-buf
exporter to explicitly invalidate (“kill”) a shared buffer after it has
been distributed to importers, so that further CPU and device access is
prevented and importers reliably observe failure.
The change in this series is to properly document and use the existing core
“revoked” state on the dma-buf object and a corresponding exporter-triggered
revoke operation.
dma-buf has quietly allowed calling move_notify on pinned dma-bufs, even
though legacy importers using dma_buf_attach() would simply ignore
these calls.
RDMA saw this and needed to use allow_peer2peer=true, so it implemented a
new-style pinned importer with an explicitly non-working move_notify()
callback.
This has been tolerable because the existing exporters are thought to
only call move_notify() on a pinned DMABUF under RAS events and we
have been willing to tolerate the UAF that results by allowing the
importer to continue to use the mapping in this rare case.
VFIO wants to implement a pin-supporting exporter that will issue a
revoking move_notify() around FLRs and a few other user-triggerable
operations. Since this is much more common, we are not willing to
tolerate the security UAF caused by interworking with non-move_notify()
supporting drivers. Thus, until now, VFIO has required dynamic importers,
even though it never actually moves the buffer location.
To allow VFIO to work with pinned importers, according to how dma-buf
was intended, we need to allow VFIO to detect if an importer is legacy
or RDMA and does not actually implement move_notify().
Introduce a new function that exporters can call to detect these less
capable importers. VFIO can then refuse to accept them during attach.
In theory all exporters that call move_notify() on pinned dma-bufs
should call this function; however, that would break a number of widely
used NIC/GPU flows. Thus, for now, do not spread this further than VFIO
until we can understand how much of RDMA can implement the full
semantic.
In the process clarify how move_notify is intended to be used with
pinned dma-bufs.
Thanks
Signed-off-by: Leon Romanovsky <leonro(a)nvidia.com>
---
Leon Romanovsky (7):
dma-buf: Rename .move_notify() callback to a clearer identifier
dma-buf: Always build with DMABUF_MOVE_NOTIFY
dma-buf: Document RDMA non-ODP invalidate_mapping() special case
dma-buf: Add check function for revoke semantics
iommufd: Pin dma-buf importer for revoke semantics
vfio: Wait for dma-buf invalidation to complete
vfio: Validate dma-buf revocation semantics
drivers/dma-buf/Kconfig | 12 -----
drivers/dma-buf/dma-buf.c | 69 +++++++++++++++++++++++------
drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 14 +++---
drivers/gpu/drm/amd/amdkfd/Kconfig | 2 +-
drivers/gpu/drm/virtio/virtgpu_prime.c | 2 +-
drivers/gpu/drm/xe/tests/xe_dma_buf.c | 7 ++-
drivers/gpu/drm/xe/xe_dma_buf.c | 14 +++---
drivers/infiniband/core/umem_dmabuf.c | 13 +-----
drivers/infiniband/hw/mlx5/mr.c | 2 +-
drivers/iommu/iommufd/pages.c | 11 ++++-
drivers/vfio/pci/vfio_pci_dmabuf.c | 8 ++++
include/linux/dma-buf.h | 9 ++--
12 files changed, 96 insertions(+), 67 deletions(-)
---
base-commit: 9ace4753a5202b02191d54e9fdf7f9e3d02b85eb
change-id: 20251221-dmabuf-revoke-b90ef16e4236
Best regards,
--
Leon Romanovsky <leonro(a)nvidia.com>