On 11/25/25 11:56, Philipp Stanner wrote:
>>>>
>>>> The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
>>>>
>>>> The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
>>>
>>> That's a question only the drivers can answer, which is why I like to
>>> think that setting global constants constraining all parties is not the
>>> right thing to do.
>>
>> Exactly, that's the reason why I bring that up. I think that letting drivers be in charge of timeouts is the wrong approach.
>>
>> See, the reason why we have the timeout (and documented that it is a must-have) is that both core memory management and desktop responsiveness depend on it.
>
> Good and well, but then patch 4 becomes even more problematic:
>
> So we'd just have drivers fire warnings, and then they would still have
> the freedom to set timeouts for drm/sched, as long as those timeouts
> are smaller than your new global constant.
>
> Why then not remove drm/sched's timeout parameter API completely and
> always use your maximum value internally in drm/sched? Or maybe
> truncate it with a warning?
I have considered that as well, but then thought that we should at least give end users the possibility to override the timeout while still tainting the kernel so that we know about this in bug reports, core dumps etc...
> "Maximum timeout parameter exceeded, truncating to %ld.\n"
>
> I suppose some drivers want even higher responsiveness than those 2
> seconds.
As far as I know some medical use cases, for example, have timeouts like 100-200ms. But again, that comes from the use case and not from the driver.
> I do believe that more of the driver folks should be made aware of this
> intended change.
I have no real intention of actually pushing those patches, at least not as they are. I just wanted to kick off some discussion.
>>
>>> What is even your motivation? What problem does this solve? Is the OOM
>>> killer currently hanging for anyone? Can you link a bug report?
>>
>> I'm not sure if we have an external bug report (we have an internal one), but for amdgpu there were customer complaints that 10 seconds is too long.
>>
>> So we changed it to 2 seconds for amdgpu, and now there are complaints from internal AMD teams that 2 seconds is too short.
>>
>> While working on that I realized that the timeout is actually not driver dependent at all.
>>
>> What can maybe be argued is that a desktop system should have a shorter timeout than some server, but that one driver needs a different timeout than another doesn't really make sense to me.
>>
>> I mean, what is actually HW-dependent about the requirement that I need a responsive desktop system?
>
> I suppose some drivers are indeed only used for server hardware. And
> for compute you might not care about responsiveness as long as your
> result drops off at some point. But there's cloud gaming, too..
Good point with the cloud gaming.
> I agree that distinguishing the use case that way is not ideal.
> However, who has the knowledge of how the hardware is being used by
> customers / users, if not the driver?
Well, the end user.
Maybe we should move the whole timeout topic into the DRM layer or the scheduler component.
Something like a 2 second default (which BTW is the default on Windows as well), which can be overridden on a global, per-device or per-queue-name basis.
And a 10 second maximum: a non-default timeout up to that only gets a warning that a non-default timeout is in use, while everything above 10 seconds taints the kernel and should really only be used for testing/debugging.
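A rough sketch of how such a policy could look; all names here are hypothetical and TAINT_USER is just one possible choice of taint flag:

#include <linux/jiffies.h>
#include <linux/panic.h>
#include <linux/printk.h>

/* Hypothetical knobs for the proposal above. */
#define DRM_JOB_TIMEOUT_DEFAULT	msecs_to_jiffies(2000)
#define DRM_JOB_TIMEOUT_MAX	msecs_to_jiffies(10000)

/* Validate a timeout overridden on a global/device/queue basis. */
static long drm_job_timeout_validate(long timeout)
{
	if (timeout == DRM_JOB_TIMEOUT_DEFAULT)
		return timeout;

	pr_warn("drm: non-default job timeout of %u ms in use\n",
		jiffies_to_msecs(timeout));

	if (timeout > DRM_JOB_TIMEOUT_MAX) {
		pr_warn("drm: job timeout above 10 seconds, tainting kernel\n");
		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
	}
	return timeout;
}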
Thoughts?
Regards,
Christian.
>
>
> P.
On 11/25/25 14:52, Pavel Begunkov wrote:
> On 11/24/25 14:17, Christian König wrote:
>> On 11/24/25 12:30, Pavel Begunkov wrote:
>>> On 11/24/25 10:33, Christian König wrote:
>>>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>>>> Picking up the work on supporting dmabuf in the read/write path.
>>>>
>>>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>>>
>>>> Or am I mixing something up here?
>>>
>>> The time gap is purely due to me being busy. I wasn't CC'ed to those private
>>> discussions you mentioned, but the v1 feedback was to use dynamic attachments
>>> and avoid passing dma address arrays directly.
>>>
>>> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
>>>
>>> I'm lost on what part is not doable. Can you elaborate on the core
>>> dma-fence dma-buf rules?
>>
>> I most likely mixed that up, in other words that was a different discussion.
>>
>> When you use dma_fences to indicate async completion of events you need to be super careful to only do this for in-flight events, to create the fences in the right order, etc...
>
> I'm curious, what can happen if there is new IO using a
> move_notify()ed mapping, but let's say it's guaranteed to complete
> strictly before dma_buf_unmap_attachment() and the fence is signaled?
> Is there some loss of data or corruption that can happen?
The problem is that you can't guarantee that because you run into deadlocks.
As soon as a dma_fence is created and published by calling dma_resv_add_fence() it can happen that memory management loops back and depends on that fence.
So you actually can't issue any new IO which might block the unmap operation.
>
> sg_table = map_attach() |
> move_notify() |
> -> add_fence(fence) |
> | issue_IO(sg_table)
> | // IO completed
> unmap_attachment(sg_table) |
> signal_fence(fence) |
>
>> For example, once the fence is created you can't make any memory allocations any more; that's why we have this dance of reserving fence slots, creating the fence and then adding it.
>
> Looks I have some terminology gap here. By "memory allocations" you
> don't mean kmalloc, right? I assume it's about new users of the
> mapping.
kmalloc() as well as get_free_page() are exactly what is meant here.
You can't make any memory allocations any more after creating/publishing a dma_fence.
The usual flow is the following:
1. Lock dma_resv object
2. Prepare I/O operation, make all memory allocations etc...
3. Allocate dma_fence object
4. Push I/O operation to the HW, making sure that you don't allocate memory any more.
5. Call dma_resv_add_fence() with the fence allocated in #3.
6. Unlock dma_resv object
If you stray from that you most likely end up in a deadlock sooner or later.
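To make that flow concrete, here is a minimal sketch; struct my_job and the my_job_*() helpers are hypothetical, only the dma_resv/dma_fence calls are the real API:

#include <linux/dma-fence.h>
#include <linux/dma-resv.h>
#include <linux/err.h>

static int my_submit_job(struct dma_resv *resv, struct my_job *job)
{
	struct dma_fence *fence;
	int ret;

	/* 1. Lock the dma_resv object. */
	ret = dma_resv_lock(resv, NULL);
	if (ret)
		return ret;

	/* 2. Prepare the I/O operation, do all memory allocations now. */
	ret = my_job_prepare(job);
	if (ret)
		goto unlock;

	/* Reserve the fence slot while allocating is still allowed. */
	ret = dma_resv_reserve_fences(resv, 1);
	if (ret)
		goto unlock;

	/* 3. Allocate and initialize the dma_fence object. */
	fence = my_job_create_fence(job);
	if (IS_ERR(fence)) {
		ret = PTR_ERR(fence);
		goto unlock;
	}

	/* 4. Push the I/O to the HW; no memory allocations from here on. */
	my_job_push_to_hw(job, fence);

	/* 5. Add the fence from #3; cannot fail thanks to the reservation. */
	dma_resv_add_fence(resv, fence, DMA_RESV_USAGE_WRITE);
	dma_fence_put(fence);
	ret = 0;

unlock:
	/* 6. Unlock the dma_resv object. */
	dma_resv_unlock(resv);
	return ret;
}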
Regards,
Christian.
>>>> Since I don't see any dma_fence implementation at all that might actually be the case.
>>>
>>> See Patch 5, struct blk_mq_dma_fence. It's used in the move_notify
>>> callback and is signaled when all inflight IO using the current
>>> mapping are complete. All new IO requests will try to recreate the
>>> mapping, and hence potentially wait with dma_resv_wait_timeout().
>>
>> Without looking at the code that approach sounds more or less correct to me.
>>
>>>> On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
>>>
>>> Have any reference?
>>
>> There is a WIP feature in AMDs GPU driver package for ROCm.
>>
>> But that can't be used as a general-purpose DMA-buf approach, because it makes use of internal knowledge about how the GPU driver is using the backing store.
>
> Got it
>
>> BTW when you use DMA addresses from DMA-buf always keep in mind that this memory can be written by others at the same time, e.g. you can't do things like compute a CRC first, then write to backing store and finally compare CRC.
>
> Right. The direct IO path also works with user pages, so the
> constraints are similar in this regard.
>
On 11/25/25 09:13, Philipp Stanner wrote:
> On Tue, 2025-11-25 at 09:03 +0100, Christian König wrote:
>> On 11/25/25 08:55, Philipp Stanner wrote:
>>>>
>>>> +/**
>>>> + * define DMA_FENCE_MAX_REASONABLE_TIMEOUT - max reasonable signaling timeout
>>>> + *
>>>> + * The dma_fence object has a deep inter dependency with core memory
>>>> + * management, for a detailed explanation see section DMA Fences under
>>>> + * Documentation/driver-api/dma-buf.rst.
>>>> + *
>>>> + * Because of this all dma_fence implementations must guarantee that each fence
>>>> + * completes in a finite time. This define here now gives a reasonable value for
>>>> + * the timeout to use. It is possible to use a longer timeout in an
>>>> + * implementation but that should taint the kernel.
>>>> + */
>>>> +#define DMA_FENCE_MAX_REASONABLE_TIMEOUT (2*HZ)
>>>
>>> HZ can change depending on the config. Is that really a good choice? I
>>> could see racy situations arising in some configs vs others
>>
>> 2*HZ is always two seconds expressed in number of jiffies; I can use msecs_to_jiffies(2000) to make that more obvious.
>
> On AMD64 maybe. What about the other architectures?
HZ is defined as jiffies per second, so even if it changes to 10, 100 or 1000 depending on the architecture, 2*HZ is always two seconds expressed in jiffies.
The HZ define is actually there to make it architecture-independent.
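For illustration, both spellings evaluate to the same duration no matter what CONFIG_HZ is:

#include <linux/jiffies.h>

/* Both express two seconds in jiffies, independent of CONFIG_HZ: */
unsigned long a = 2 * HZ;                  /* at HZ=250: 500 jiffies */
unsigned long b = msecs_to_jiffies(2000);  /* at HZ=250: also 500 jiffies */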
>>
>> The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
>>
>> The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
>
> That's a question only the drivers can answer, which is why I like to
> think that setting global constants constraining all parties is not the
> right thing to do.
Exactly, that's the reason why I bring that up. I think that letting drivers be in charge of timeouts is the wrong approach.
See, the reason why we have the timeout (and documented that it is a must-have) is that both core memory management and desktop responsiveness depend on it.
> What is even your motivation? What problem does this solve? Is the OOM
> killer currently hanging for anyone? Can you link a bug report?
I'm not sure if we have an external bug report (we have an internal one), but for amdgpu there were customer complaints that 10 seconds is too long.
So we changed it to 2 seconds for amdgpu, and now there are complaints from internal AMD teams that 2 seconds is too short.
While working on that I realized that the timeout is actually not driver dependent at all.
What can maybe be argued is that a desktop system should have a shorter timeout than some server, but that one driver needs a different timeout than another doesn't really make sense to me.
I mean, what is actually HW-dependent about the requirement that I need a responsive desktop system?
>>
>> Some AMD internal team is pushing for 10 seconds, but that also means that for example we wait 10 seconds for the OOM killer to do something. That sounds like way too long.
>>
>
> Nouveau has timeout = 10 seconds. AFAIK we've never seen bugs because
> of that. Have you seen some?
Thanks for that info. And to answer the question, yes certainly.
Regards,
Christian.
>
>
> P.
On 11/25/25 08:55, Philipp Stanner wrote:
> On Thu, 2025-11-20 at 15:41 +0100, Christian König wrote:
>> Add a define implementations can use as reasonable maximum signaling
>> timeout. Document that if this timeout is exceeded by config options
>> implementations should taint the kernel.
>>
>> Tainting the kernel is important for bug reports to detect that end
>> users might be using a problematic configuration.
>>
>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>> ---
>> include/linux/dma-fence.h | 14 ++++++++++++++
>> 1 file changed, 14 insertions(+)
>>
>> diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h
>> index 64639e104110..b31dfa501c84 100644
>> --- a/include/linux/dma-fence.h
>> +++ b/include/linux/dma-fence.h
>> @@ -28,6 +28,20 @@ struct dma_fence_ops;
>> struct dma_fence_cb;
>> struct seq_file;
>>
>> +/**
>> + * define DMA_FENCE_MAX_REASONABLE_TIMEOUT - max reasonable signaling timeout
>> + *
>> + * The dma_fence object has a deep inter dependency with core memory
>> + * management, for a detailed explanation see section DMA Fences under
>> + * Documentation/driver-api/dma-buf.rst.
>> + *
>> + * Because of this all dma_fence implementations must guarantee that each fence
>> + * completes in a finite time. This define here now gives a reasonable value for
>> + * the timeout to use. It is possible to use a longer timeout in an
>> + * implementation but that should taint the kernel.
>> + */
>> +#define DMA_FENCE_MAX_REASONABLE_TIMEOUT (2*HZ)
>
> HZ can change depending on the config. Is that really a good choice? I
> could see racy situations arising in some configs vs others
2*HZ is always two seconds expressed in number of jiffies; I can use msecs_to_jiffies(2000) to make that more obvious.
The GPU scheduler has a very similar define, MAX_WAIT_SCHED_ENTITY_Q_EMPTY which is currently just 1 second.
The real question is what is the maximum amount of time we can wait for the HW before we should trigger a timeout?
Some AMD internal team is pushing for 10 seconds, but that also means that for example we wait 10 seconds for the OOM killer to do something. That sounds like way too long.
Regards,
Christian.
>
> P.
Hi everybody,
we have documented here https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#dma-fence-cr… that dma_fence objects must signal in a reasonable amount of time, but at the same time note that drivers might have a different idea of what reasonable means.
Recently I realized that this is actually not a good idea. The background is that the wall-clock timeout means that, for example, the OOM killer might have to wait for this timeout before it can terminate a process and reclaim the memory used. And that is just one example of how general kernel features might depend on it.
Some drivers and fence implementations used 10 seconds, and that raised complaints from end users. So at least amdgpu recently switched to 2 seconds, which triggered an internal discussion about it.
This patch set now adds a define to the dma_fence header which gives 2 seconds as a reasonable amount of time. SW-sync is modified to always taint the kernel (since it doesn't have a timeout), VGEM is switched over to the new define, and the scheduler warns and taints the kernel if a driver uses a timeout longer than that.
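Roughly, the scheduler-side check could look like the following sketch; the exact placement in drm_sched_init() and the message text are assumptions, only the define comes from this series:

#include <linux/dma-fence.h>
#include <linux/jiffies.h>
#include <linux/panic.h>
#include <linux/printk.h>

/* Sketch of the drm/sched check described above; where exactly it sits
 * in drm_sched_init() and the exact message are assumptions. */
static void drm_sched_check_timeout(long timeout)
{
	if (timeout <= DMA_FENCE_MAX_REASONABLE_TIMEOUT)
		return;

	pr_warn("drm/sched: %u ms timeout exceeds the reasonable maximum, tainting kernel\n",
		jiffies_to_msecs(timeout));
	add_taint(TAINT_USER, LOCKDEP_STILL_OK);
}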
I don't have much intention of actually committing the patches (except maybe the SW-sync one), but the question is: are 2 seconds reasonable?
Regards,
Christian.
On 11/24/25 12:30, Pavel Begunkov wrote:
> On 11/24/25 10:33, Christian König wrote:
>> On 11/23/25 23:51, Pavel Begunkov wrote:
>>> Picking up the work on supporting dmabuf in the read/write path.
>>
>> IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
>>
>> Or am I mixing something up here?
>
> The time gap is purely due to me being busy. I wasn't CC'ed to those private
> discussions you mentioned, but the v1 feedback was to use dynamic attachments
> and avoid passing dma address arrays directly.
>
> https://lore.kernel.org/all/cover.1751035820.git.asml.silence@gmail.com/
>
> I'm lost on what part is not doable. Can you elaborate on the core
> dma-fence dma-buf rules?
I most likely mixed that up, in other words that was a different discussion.
When you use dma_fences to indicate async completion of events you need to be super careful to only do this for in-flight events, to create the fences in the right order, etc...
For example, once the fence is created you can't make any memory allocations any more; that's why we have this dance of reserving fence slots, creating the fence and then adding it.
>> Since I don't see any dma_fence implementation at all that might actually be the case.
>
> See Patch 5, struct blk_mq_dma_fence. It's used in the move_notify
> callback and is signaled when all inflight IO using the current
> mapping are complete. All new IO requests will try to recreate the
> mapping, and hence potentially wait with dma_resv_wait_timeout().
Without looking at the code that approach sounds more or less correct to me.
>> On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
>
> Have any reference?
There is a WIP feature in AMDs GPU driver package for ROCm.
But that can't be used as a general-purpose DMA-buf approach, because it makes use of internal knowledge about how the GPU driver is using the backing store.
BTW when you use DMA addresses from DMA-buf always keep in mind that this memory can be written by others at the same time, e.g. you can't do things like compute a CRC first, then write to backing store and finally compare CRC.
Regards,
Christian.
On 11/23/25 23:51, Pavel Begunkov wrote:
> Picking up the work on supporting dmabuf in the read/write path.
IIRC that work was completely stopped because it violated core dma_fence and DMA-buf rules and after some private discussion was considered not doable in general.
Or am I mixing something up here? Since I don't see any dma_fence implementation at all that might actually be the case.
On the other hand we have direct I/O from DMA-buf working for quite a while, just not upstream and without io_uring support.
Regards,
Christian.
> There
> are two main changes. First, it doesn't pass a dma address directly but
> rather wraps it into an opaque structure, which is extended and
> understood by the target driver.
>
> The second big change is support for dynamic attachments, which added a
> good part of complexity (see Patch 5). I kept the main machinery in nvme
> at first, but move_notify can ask to kill the dma mapping asynchronously,
> and any new IO would need to wait during submission, thus it was moved
> to blk-mq. That also introduced an extra callback layer b/w driver and
> blk-mq.
>
> There are some rough corners, and I'm not perfectly happy about the
> complexity and layering. For v3 I'll try to move the waiting up in the
> stack to io_uring wrapped into library helpers.
>
> For now, I'm interested what is the best way to test move_notify? And
> how dma_resv_reserve_fences() errors should be handled in move_notify?
>
> The uapi didn't change, after registration it looks like a normal
> io_uring registered buffer and can be used as such. Only non-vectored
> fixed reads/writes are allowed. Pseudo code:
>
> // registration
> reg_buf_idx = 0;
> io_uring_update_buffer(ring, reg_buf_idx, { dma_buf_fd, file_fd });
>
> // request creation
> io_uring_prep_read_fixed(sqe, file_fd, buffer_offset,
> buffer_size, file_offset, reg_buf_idx);
>
> And as previously, a good bunch of code was taken from Keith's series [1].
>
> liburing based example:
>
> git: https://github.com/isilence/liburing.git dmabuf-rw
> link: https://github.com/isilence/liburing/tree/dmabuf-rw
>
> [1] https://lore.kernel.org/io-uring/20220805162444.3985535-1-kbusch@fb.com/
>
> Pavel Begunkov (11):
> file: add callback for pre-mapping dmabuf
> iov_iter: introduce iter type for pre-registered dma
> block: move around bio flagging helpers
> block: introduce dma token backed bio type
> block: add infra to handle dmabuf tokens
> nvme-pci: add support for dmabuf registration
> nvme-pci: implement dma_token backed requests
> io_uring/rsrc: add imu flags
> io_uring/rsrc: extended reg buffer registration
> io_uring/rsrc: add dmabuf-backed buffer registration
> io_uring/rsrc: implement dmabuf regbuf import
>
> block/Makefile | 1 +
> block/bdev.c | 14 ++
> block/bio.c | 21 +++
> block/blk-merge.c | 23 +++
> block/blk-mq-dma-token.c | 236 +++++++++++++++++++++++++++++++
> block/blk-mq.c | 20 +++
> block/blk.h | 3 +-
> block/fops.c | 3 +
> drivers/nvme/host/pci.c | 217 ++++++++++++++++++++++++++++
> include/linux/bio.h | 49 ++++---
> include/linux/blk-mq-dma-token.h | 60 ++++++++
> include/linux/blk-mq.h | 21 +++
> include/linux/blk_types.h | 8 +-
> include/linux/blkdev.h | 3 +
> include/linux/dma_token.h | 35 +++++
> include/linux/fs.h | 4 +
> include/linux/uio.h | 10 ++
> include/uapi/linux/io_uring.h | 13 +-
> io_uring/rsrc.c | 201 +++++++++++++++++++++++---
> io_uring/rsrc.h | 23 ++-
> io_uring/rw.c | 7 +-
> lib/iov_iter.c | 30 +++-
> 22 files changed, 948 insertions(+), 54 deletions(-)
> create mode 100644 block/blk-mq-dma-token.c
> create mode 100644 include/linux/blk-mq-dma-token.h
> create mode 100644 include/linux/dma_token.h
>