Hi everybody,
We have documented here https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#dma-fence-cr… that dma_fence objects must signal in a reasonable amount of time, while at the same time noting that drivers might have different ideas of what "reasonable" means.
Recently I realized that this is actually not a good idea. The background is that general kernel features can end up depending on this wall-clock timeout; for example, the OOM killer might have to wait for the full timeout before it can terminate a process and reclaim the memory it uses.
Some drivers and fence implementations used 10 seconds, which raised complaints from end users. So at least amdgpu recently switched to 2 seconds, which triggered an internal discussion about it.
This patch set now adds a define to the dma_fence header which specifies 2 seconds as a reasonable amount of time. SW-sync is modified to always taint the kernel (since it doesn't have a timeout at all), VGEM is switched over to the new define, and the scheduler warns and taints the kernel if a driver uses a timeout longer than that.
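Roughly, the idea looks like this (just a sketch to illustrate, the names here are not necessarily the ones the patches use):

    #define DMA_FENCE_REASONABLE_TIMEOUT	(2 * HZ)

    /* e.g. in the scheduler, when a driver hands in its job timeout: */
    if (timeout > DMA_FENCE_REASONABLE_TIMEOUT) {
            pr_warn("job timeout exceeds the reasonable dma_fence timeout\n");
            add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
    }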
I don't have much intention of actually committing the patches (maybe except the SW-sync one), but the question is: are 2 seconds reasonable?
Regards,
Christian.
On 11/24/25 12:30, Pavel Begunkov wrote:
> On 11/24/25 10:33, Christian König wrote:
>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>> Picking up the work on supporting dmabuf in the read/write path.
>>
>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>
>> Or am I mixing something up here?
>
> The time gap is purely due to me being busy. I wasn't CC'ed to those private
> discussions you mentioned, but the v1 feedback was to use dynamic attachments
> and avoid passing dma address arrays directly.
>
> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
>
> I'm lost on what part is not doable. Can you elaborate on the core
> dma-fence dma-buf rules?
I most likely mixed that up; in other words, that was a different discussion.
When you use dma_fences to indicate async completion of events you need to be super careful to only do this for in-flight events, to create the fence in the right order, etc.
For example, once the fence is created you can't make any memory allocations any more; that's why we have this dance of reserving fence slots, then creating the fence, and then adding it.
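Roughly, the dance looks like this (sketch only, error handling trimmed, and my_create_job_fence() is just a placeholder for whatever the driver does):

    dma_resv_lock(resv, NULL);

    /* May allocate memory, so it must happen before the fence exists. */
    ret = dma_resv_reserve_fences(resv, 1);
    if (ret)
            goto unlock;

    /* Create the fence; from here on the signaling path must not
     * depend on any memory allocation. */
    fence = my_create_job_fence();

    /* Cannot fail, the slot was reserved above. */
    dma_resv_add_fence(resv, fence, DMA_RESV_USAGE_WRITE);

    dma_resv_unlock(resv);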
>> Since I don't see any dma_fence implementation at all that might actually be the case.
>
> See Patch 5, struct blk_mq_dma_fence. It's used in the move_notify
> callback and is signaled when all inflight IO using the current
> mapping are complete. All new IO requests will try to recreate the
> mapping, and hence potentially wait with dma_resv_wait_timeout().
Without looking at the code that approach sounds more or less correct to me.
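For reference, the wait on the submission side usually has roughly this shape (sketch only, my_remap_attachment() is just a placeholder, not from the series):

    dma_resv_lock(dmabuf->resv, NULL);

    /* Wait for the exporter's fences, e.g. the move itself... */
    ret = dma_resv_wait_timeout(dmabuf->resv, DMA_RESV_USAGE_KERNEL,
                                true, MAX_SCHEDULE_TIMEOUT);
    if (ret > 0)
            /* ...and then re-create the mapping under the resv lock. */
            ret = my_remap_attachment(attach);

    dma_resv_unlock(dmabuf->resv);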
>> On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
>
> Have any reference?
There is a WIP feature in AMD's GPU driver package for ROCm.
But that can't be used as a general purpose DMA-buf approach, because it relies on internal knowledge of how the GPU driver uses the backing store.
BTW, when you use DMA addresses from a DMA-buf, always keep in mind that this memory can be written to by others at the same time; e.g. you can't do things like compute a CRC first, then write to the backing store, and finally compare the CRC.
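In other words, something like this is inherently racy on a DMA-buf (only to illustrate the point, write_to_storage() is made up):

    /* Racy: the exporter or another importer may write to the buffer
     * between any of these steps, so the CRCs can legitimately differ. */
    crc = crc32_le(0, dmabuf_vaddr, len);
    write_to_storage(dmabuf_vaddr, len);
    WARN_ON(crc32_le(0, dmabuf_vaddr, len) != crc);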
Regards,
Christian.
On 11/23/25 23:51, Pavel Begunkov wrote:
> Picking up the work on supporting dmabuf in the read/write path.
IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
Or am I mixing something up here? Since I don't see any dma_fence implementation at all that might actually be the case.
On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
Regards,
Christian.
> There
> are two main changes. First, it doesn't pass a dma address directly but
> rather wraps it into an opaque structure, which is extended and
> understood by the target driver.
>
> The second big change is support for dynamic attachments, which added a
> good part of complexity (see Patch 5). I kept the main machinery in nvme
> at first, but move_notify can ask to kill the dma mapping asynchronously,
> and any new IO would need to wait during submission, thus it was moved
> to blk-mq. That also introduced an extra callback layer b/w driver and
> blk-mq.
>
> There are some rough corners, and I'm not perfectly happy about the
> complexity and layering. For v3 I'll try to move the waiting up in the
> stack to io_uring wrapped into library helpers.
>
> For now, I'm interested what is the best way to test move_notify? And
> how dma_resv_reserve_fences() errors should be handled in move_notify?
>
> The uapi didn't change, after registration it looks like a normal
> io_uring registered buffer and can be used as such. Only non-vectored
> fixed reads/writes are allowed. Pseudo code:
>
> // registration
> reg_buf_idx = 0;
> io_uring_update_buffer(ring, reg_buf_idx, { dma_buf_fd, file_fd });
>
> // request creation
> io_uring_prep_read_fixed(sqe, file_fd, buffer_offset,
> buffer_size, file_offset, reg_buf_idx);
>
> And as previously, a good bunch of code was taken from Keith's series [1].
>
> liburing based example:
>
> git: https://github.com/isilence/liburing.git dmabuf-rw
> link: https://github.com/isilence/liburing/tree/dmabuf-rw
>
> [1] https://lore.kernel.org/io-uring/20220805162444.3985535-1-kbusch@fb.com/
>
> Pavel Begunkov (11):
> file: add callback for pre-mapping dmabuf
> iov_iter: introduce iter type for pre-registered dma
> block: move around bio flagging helpers
> block: introduce dma token backed bio type
> block: add infra to handle dmabuf tokens
> nvme-pci: add support for dmabuf registration
> nvme-pci: implement dma_token backed requests
> io_uring/rsrc: add imu flags
> io_uring/rsrc: extended reg buffer registration
> io_uring/rsrc: add dmabuf-backed buffer registration
> io_uring/rsrc: implement dmabuf regbuf import
>
> block/Makefile | 1 +
> block/bdev.c | 14 ++
> block/bio.c | 21 +++
> block/blk-merge.c | 23 +++
> block/blk-mq-dma-token.c | 236 +++++++++++++++++++++++++++++++
> block/blk-mq.c | 20 +++
> block/blk.h | 3 +-
> block/fops.c | 3 +
> drivers/nvme/host/pci.c | 217 ++++++++++++++++++++++++++++
> include/linux/bio.h | 49 ++++---
> include/linux/blk-mq-dma-token.h | 60 ++++++++
> include/linux/blk-mq.h | 21 +++
> include/linux/blk_types.h | 8 +-
> include/linux/blkdev.h | 3 +
> include/linux/dma_token.h | 35 +++++
> include/linux/fs.h | 4 +
> include/linux/uio.h | 10 ++
> include/uapi/linux/io_uring.h | 13 +-
> io_uring/rsrc.c | 201 +++++++++++++++++++++++---
> io_uring/rsrc.h | 23 ++-
> io_uring/rw.c | 7 +-
> lib/iov_iter.c | 30 +++-
> 22 files changed, 948 insertions(+), 54 deletions(-)
> create mode 100644 block/blk-mq-dma-token.c
> create mode 100644 include/linux/blk-mq-dma-token.h
> create mode 100644 include/linux/dma_token.h
>
On Thu, Nov 20, 2025 at 08:04:37AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg(a)nvidia.com>
> > Sent: Saturday, November 8, 2025 12:50 AM
> > +
> > +static int pfn_reader_fill_dmabuf(struct pfn_reader_dmabuf *dmabuf,
> > + struct pfn_batch *batch,
> > + unsigned long start_index,
> > + unsigned long last_index)
> > +{
> > + unsigned long start = dmabuf->start_offset + start_index * PAGE_SIZE;
> > +
> > + /*
> > + * This works in PAGE_SIZE indexes, if the dmabuf is sliced and
> > + * starts/ends at a sub page offset then the batch to domain code will
> > + * adjust it.
> > + */
>
> dmabuf->start_offset comes from pages->dmabuf.start, which is initialized as:
>
> pages->dmabuf.start = start - start_byte;
>
> so it's always page-aligned. Where is the sub-page offset coming from?
I need to go over this again to check it; this sub-page stuff is
a bit convoluted. start_offset should include the sub-page offset
here.
> > @@ -1687,6 +1737,12 @@ static void __iopt_area_unfill_domain(struct
> > iopt_area *area,
> >
> > lockdep_assert_held(&pages->mutex);
> >
> > + if (iopt_is_dmabuf(pages)) {
> > + iopt_area_unmap_domain_range(area, domain, start_index,
> > + last_index);
> > + return;
> > + }
> > +
>
> this belongs to patch3?
This is part of programming the domain with the dmabuf. Patch 3 was
about the revoke, which is a slightly different topic, though the two
are similar.
Thanks,
Jason
On Thu, Nov 20, 2025 at 07:55:04AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg(a)nvidia.com>
> > Sent: Saturday, November 8, 2025 12:50 AM
> >
> >
> > @@ -2031,7 +2155,10 @@ int iopt_pages_rw_access(struct iopt_pages
> > *pages, unsigned long start_byte,
> > if ((flags & IOMMUFD_ACCESS_RW_WRITE) && !pages->writable)
> > return -EPERM;
> >
> > - if (pages->type == IOPT_ADDRESS_FILE)
> > + if (iopt_is_dmabuf(pages))
> > + return -EINVAL;
> > +
>
> probably also add helpers for other types, e.g.:
>
> iopt_is_user()
> iopt_is_memfd()
The helper was there to integrate the IS_ENABLED() check for DMABUF. There
are not so many other uses, so I think we should leave it as-is to not bloat the patch.
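I.e. the helper is mostly there to fold in the config check, roughly like this (not copying the exact patch here, the Kconfig symbol and enum name may differ):

    static inline bool iopt_is_dmabuf(struct iopt_pages *pages)
    {
            if (!IS_ENABLED(CONFIG_DMA_SHARED_BUFFER))
                    return false;
            return pages->type == IOPT_ADDRESS_DMABUF;
    }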
> > + if (pages->type != IOPT_ADDRESS_USER)
> > return iopt_pages_rw_slow(pages, start_index, last_index,
> > start_byte % PAGE_SIZE, data,
> > length,
> > flags);
> > --
>
> then the following WARN_ON() becomes useless:
>
> if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
> WARN_ON(pages->type != IOPT_ADDRESS_USER))
> return -EINVAL;
Yep
Thanks,
Jason