Re: [PATCH] drm/drm_crtc: fix race with dma_fence_signal() in ::get_driver_name() - Linaro-mm-sig

24 Jun 2026


On Tue, 2026-06-23 at 15:33 +0100, André Draszik wrote:
...
...
However, if my issue were to be solved with barriers, the
test_and_set_bit() in dma_fence_signal_timestamp_locked() would have to
be replaced with the more weakly ordered test_bit() and set_bit(),
maybe creating other pitfalls.
For the avoidance of doubt, I'm not saying that all the issues you raised
can be solved by barriers instead of appropriate locks (I don't know enough
about the code and issues in general here).
I'm not saying that you're saying that. I'm just cautioning you that
this change could be tricky.
...
I do think however that appropriate locks will fix the ordering issue
highlighted by sashiko (i.e. +1 for your argument). Barriers would fix this
specific issue, too, but that is not a statement about any wider issues.
...
The ordering issue in the get_*_name() functions plays into that.
Setting the bit would then be done after setting the ops-pointer to
NULL. So one would have to try to move the NULL set, too.
Long story short, this is painful and subtle.
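
To make the ordering concern a bit more concrete, here is a minimal
userspace sketch with C11 atomics. It is not kernel code: every model_*
name is made up for illustration, the atomic_exchange() merely stands in
for test_and_set_bit(), and it assumes the ops pointer is cleared on
signaling. It only shows one ordering under which a reader never
dereferences a NULL ops pointer; it says nothing about object lifetime,
which the kernel handles separately.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only; every model_* name is hypothetical. */
struct model_fence;

struct model_fence_ops {
	const char *(*get_driver_name)(struct model_fence *f);
};

struct model_fence {
	atomic_bool signaled;                        /* stands in for the signaled bit */
	_Atomic(const struct model_fence_ops *) ops; /* cleared once signaled */
};

static const char model_driver_name[] = "model-driver";
static const char model_detached_name[] = "detached-driver";

static const char *model_name_cb(struct model_fence *f)
{
	(void)f;
	return model_driver_name;
}

static const struct model_fence_ops model_ops = {
	.get_driver_name = model_name_cb,
};

static void model_fence_init(struct model_fence *f)
{
	atomic_init(&f->signaled, false);
	atomic_init(&f->ops, &model_ops);
}

/* Writer: set the flag first, then publish ops = NULL with release. */
static bool model_fence_signal(struct model_fence *f)
{
	/* Sequentially consistent RMW, standing in for test_and_set_bit(). */
	if (atomic_exchange(&f->signaled, true))
		return false; /* already signaled */

	atomic_store_explicit(&f->ops, NULL, memory_order_release);
	return true;
}

/* Reader: snapshot ops once (acquire), then test the flag. */
static const char *model_fence_driver_name(struct model_fence *f)
{
	const struct model_fence_ops *ops =
		atomic_load_explicit(&f->ops, memory_order_acquire);

	if (!atomic_load_explicit(&f->signaled, memory_order_relaxed)) {
		/* A NULL snapshot would imply the flag is already visible. */
		assert(ops != NULL);
		return ops->get_driver_name(f);
	}
	return model_detached_name;
}
```

If the flag were set only after the NULL store, or if the reader tested
the flag before loading the pointer, the reader could observe "not
signaled" together with a NULL pointer, which is why both sides would
have to change together.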
But I think what we are realizing over and over again is that dma_fence
has many subtleties to its API contract, and the implementation's
sparing use of spinlocks leads to workarounds where people take locks
manually or have to do an RCU dance.
Note that Christian is strongly opposed to guarding everything with
locks, in part because of deadlocks that supposedly occur in the fence
callbacks when the driver needs to take its own locks.
ww_mutex could help against deadlocks, but might affect performance if
these are all critical code paths (IDK).
You can't use sleepable locks in fences. They fire in interrupt context
left and right ;)
Besides, that wouldn't even solve the reported problem.
The tl;dr is:
there is fence_ops->enable_signaling(), which is currently being called
with the fence lock held. So the driver, in that callback, cannot take
a driver-specific lock IF there is another driver path (like an IRQ
handler) that takes the driver lock first and then the fence lock.
Which is why Christian König wants to remove the fence lock being held
in enable_signaling().
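
The inversion described above can be sketched with a tiny lock-order
tracker in plain C. This is a single-threaded userspace model, loosely in
the spirit of what lockdep checks; all model_* names are hypothetical and
nothing here is taken from a real driver. It records which lock was held
while another was taken and flags the case where both orders have been
seen.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; every model_* name is hypothetical. */
enum model_lock { MODEL_FENCE_LOCK, MODEL_DRIVER_LOCK, MODEL_NR_LOCKS };

struct model_lockdep {
	bool held[MODEL_NR_LOCKS];
	/* before[a][b]: lock a was held at some point while b was taken */
	bool before[MODEL_NR_LOCKS][MODEL_NR_LOCKS];
	bool inversion;
};

static void model_acquire(struct model_lockdep *ld, enum model_lock l)
{
	for (int h = 0; h < MODEL_NR_LOCKS; h++) {
		if (!ld->held[h])
			continue;
		if (ld->before[l][h])
			ld->inversion = true; /* opposite order seen earlier */
		ld->before[h][l] = true;
	}
	ld->held[l] = true;
}

static void model_release(struct model_lockdep *ld, enum model_lock l)
{
	assert(ld->held[l]);
	ld->held[l] = false;
}

/* Core calls enable_signaling() with the fence lock held; the callback
 * then wants the driver lock: fence lock -> driver lock. */
static void model_enable_signaling_path(struct model_lockdep *ld)
{
	model_acquire(ld, MODEL_FENCE_LOCK);
	model_acquire(ld, MODEL_DRIVER_LOCK);
	model_release(ld, MODEL_DRIVER_LOCK);
	model_release(ld, MODEL_FENCE_LOCK);
}

/* IRQ path takes the driver lock first and then the fence lock:
 * driver lock -> fence lock. */
static void model_irq_path(struct model_lockdep *ld)
{
	model_acquire(ld, MODEL_DRIVER_LOCK);
	model_acquire(ld, MODEL_FENCE_LOCK);
	model_release(ld, MODEL_FENCE_LOCK);
	model_release(ld, MODEL_DRIVER_LOCK);
}
```

Either path alone is fine; the tracker only complains once both orders
have been recorded, which is the ABBA pattern that can deadlock when the
two paths run concurrently.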
One reason why that, supposedly, is currently not a problem is that
without fence->inline_lock, you can protect the fctx with the same lock
and do fctx list manipulations in enable_signaling() with lock
protection.
If you have a big bowl of popcorn available, you could check out this
thread:
https://lore.kernel.org/dri-devel/20260608142436.265820-2-phasta@kernel.org/
;p
My own thinking is:
If everyone used inline_lock, and if we could rely on everyone being
able to do the necessary work in enable_signaling() without said lock-
inversion, then we could perfectly synchronize all actions related to
dma_fence, including driver and, thus, fence_ops unload.
The only real blocker might be enable_signaling() (the other
callbacks already take the lock). The more difficult question would be
how to implement that in a backwards compatible manner, i.e., for those
who don't have inline_lock.
Another idea for the distant future might be to question the existence
of those callbacks. Userspace often is sort of decoupled from the
hardware fences through intermediate fences already.
...
...
The community discussion regarding that problem is currently in some
sort of dead end, where none of us seems to know what the correct path
forward is.
Please ignore if the following doesn't make sense, I'm just a bystander :-)
How about at least adding the required barriers and related changes, and
taking it from there? This would solve some immediate and easy-to-hit
issues on Arm64? If they turn out to be insufficient, code can still
be changed.
I am in support of that, which is why I posted that RFC for feedback
about the appropriate memory barriers.
...
BTW, thanks Philipp for all these details, much appreciated.
You're welcome. If you find a clever solution, probably everyone
would be happy.
P.
...
Cheers,
A.