On Tue, 2026-06-23 at 12:37 +0100, André Draszik wrote:
Hi,
On Thu, 2026-06-18 at 17:56 +0200, Philipp Stanner wrote:
+Cc Danilo
On Thu, 2026-06-18 at 15:03 +0100, André Draszik wrote:
Since commit 541c8f2468b9 ("dma-buf: detach fence ops on signal v3"), I'm seeing the BUG_ON() triggering in drm_crtc's fence_to_crtc() via drm_crtc_fence_get_driver_name() regularly:
Call trace: panic+0x58/0x5c die+0x160/0x178 bug_brk_handler+0x70/0xa4 call_el1_break_hook+0x3c/0x1a0 do_el1_brk64+0x24/0x74 el1_brk64+0x34/0x54 el1h_64_sync_handler+0x80/0xfc el1h_64_sync+0x84/0x88 drm_crtc_fence_get_driver_name+0x60/0x68 (P) sync_file_get_name+0x184/0x45c sync_file_ioctl+0x404/0xf70 __arm64_sys_ioctl+0x124/0x1dc
This looks to be caused by a code flow similar to the following:
+++ snip +++ thread A thread B
ioctl(SYNC_IOC_FILE_INFO) sync_file_ioctl() sync_file_get_name() dma_fence_signal_timestamp_locked() dma_fence_driver_name() ops = rcu_dereference(fence->ops) if (!dma_fence_test_signaled_flag()) ops->get_driver_name(fence) i.e. drm_crtc_fence_get_driver_name() test_and_set_bit(SIGNALED) RCU_INIT_POINTER(fence->ops, NULL) drm_crtc_fence_get_driver_name() BUG_ON(rcu_access_pointer(fence->ops) != &drm_crtc_fence_ops)
Now this looks like a very similar problem that I have recently been concerned with:
https://lore.kernel.org/dri-devel/20260612104251.2264707-2-phasta@kernel.org...
https://lore.kernel.org/dri-devel/fa0dc9757bf8343516c4b156a2b70ec91b64ef8f.c...
I continue to believe because of bugs like this and the ones I have quoted in the threads above the robustness of the kernel could be greatly improved if we could get dma_fence fully synchronized with its lock.
On top of that, sashiko highlighted (via my other patch) that the existing code is missing some memory barriers:
https://sashiko.dev/#/patchset/20260618-linux-drm_crtc_fix-v1-1-801f29c9853d...
I believe Lock synchronization would resolve that (as would adding explicit memory barriers).
That is being discussed in the thread I linked, where Gary lists which barriers you would need for (presumably correct) lockless magic.
However, if my issue were to be solved with barriers, the test_and_set_bit() in dma_fence_signal_timestamp_locked() would have to be replaced with the more weakly ordered test_bit() and set_bit(), maybe creating other pitfalls.
The ordering issue in the get_*_name() functions plays into that. Setting the bit would then be done after setting the ops-pointer to NULL. So one would have to try to move the NULL set, too.
Long story short, this is painful and subtle.
But I think what we are realizing over and over again is that dma_fence has many subtleties to its API contract, and the implementation's sparring use of spinlocks leads to workarounds where people take locks manually or have to do an RCU dance.
Note that Christian is strongly opposed to guarding everything with locks, in part for supposedly occuring deadlocks in the fence callbacks when the driver needs to take its own locks.
The community discussion regarding that problem is currently in some sort of dead end, where none of us seems to know what the correct path forward is.
drm_sched users (and future users in Rust) use intermediate fences which decouple e.g. userspace from the actual hardware fences. So one path forward might be to question the callbacks in general and think about some sort of replacement for them.
[...]
Does the CRTC or DRM device need to be kept alive for the RCU grace period, or should the fence hold a proper reference to prevent the use-after-free when get_driver_name() and get_timeline_name() access the freed CRTC structure?
Do you guys have any preference on that? It appears the use-after-free should be resolved before merging the removal of the BUG_ON(), and I'd like to progress on this.
My understanding of the current situation is that as an issuer of dma_fence's you, in general, should wait for a grace period until you perform operations like driver unload, or, more generally, have fence- related resources and such being accessed through callbacks go away.
Danilo has recently mentioned some life-time inconsistencies between wider kernel device model and DRM device model that might be related to that discussion, and which made him object against some RCU requirements.
Maybe he's got the time to share some details with you that are relevant to your work.
P.
Cheers, Andre'
linaro-mm-sig@lists.linaro.org