[PATCH] drm/drm_crtc: fix race with dma_fence_signal() in ::get_driver_name()

List overview All Threads
Download

newer

older

[PATCH] drm/sched: Protect...

Dive into the World of io Games:...

André Draszik

18 Jun 2026 18 Jun '26

2:03 p.m.

Since commit 541c8f2468b9 ("dma-buf: detach fence ops on signal v3"), I'm seeing the BUG_ON() triggering in drm_crtc's fence_to_crtc() via drm_crtc_fence_get_driver_name() regularly:

Call trace: panic+0x58/0x5c die+0x160/0x178 bug_brk_handler+0x70/0xa4 call_el1_break_hook+0x3c/0x1a0 do_el1_brk64+0x24/0x74 el1_brk64+0x34/0x54 el1h_64_sync_handler+0x80/0xfc el1h_64_sync+0x84/0x88 drm_crtc_fence_get_driver_name+0x60/0x68 (P) sync_file_get_name+0x184/0x45c sync_file_ioctl+0x404/0xf70 __arm64_sys_ioctl+0x124/0x1dc

This looks to be caused by a code flow similar to the following:

+++ snip +++ thread A thread B

ioctl(SYNC_IOC_FILE_INFO) sync_file_ioctl() sync_file_get_name() dma_fence_signal_timestamp_locked() dma_fence_driver_name() ops = rcu_dereference(fence->ops) if (!dma_fence_test_signaled_flag()) ops->get_driver_name(fence) i.e. drm_crtc_fence_get_driver_name() test_and_set_bit(SIGNALED) RCU_INIT_POINTER(fence->ops, NULL) drm_crtc_fence_get_driver_name() BUG_ON(rcu_access_pointer(fence->ops) != &drm_crtc_fence_ops) +++ snap +++

I see two ways to resolve this: a) simply drop the BUG_ON(). It can not work anymore since above commit, as it is racy now. b) pass the original 'ops' pointer obtained in dma_fence_driver_name() to all callees.

This patch implements option a), as because: * I don't see much benefit in passing the extra pointer just for this BUG_ON() to work. * Requiring the dma_fence_ops in those callbacks is an implementation detail of the drm_crtc driver, and therefore upper layers shouldn't have to care about that. * The existence of the BUG_ON() doesn't appear to be consistent with implementations of ::get_driver_name() or ::get_timeline_name() in the majority of other DRM drivers in the first place. Those that do have a similar BUG_ON() (i915, xe) probably also need an update similar to this patch here but I'm not in a position to test those.

Note that the adjacent drm_crtc_fence_get_timeline_name() has the same problem and is fixed by this patch as well.

Fixes: 541c8f2468b9 ("dma-buf: detach fence ops on signal v3") Signed-off-by: André Draszik andre.draszik@linaro.org --- drivers/gpu/drm/drm_crtc.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/drm_crtc.c b/drivers/gpu/drm/drm_crtc.c index 63ead8ba6756..31c8636e7467 100644 --- a/drivers/gpu/drm/drm_crtc.c +++ b/drivers/gpu/drm/drm_crtc.c @@ -73,6 +73,9 @@ * &drm_mode_config_funcs.atomic_check. */

+#define fence_to_crtc(f) container_of((f)->extern_lock, \ + struct drm_crtc, fence_lock) + /** * drm_crtc_from_index - find the registered CRTC at an index * @dev: DRM device @@ -154,14 +157,6 @@ static void drm_crtc_crc_fini(struct drm_crtc *crtc) #endif }

-static const struct dma_fence_ops drm_crtc_fence_ops; - -static struct drm_crtc *fence_to_crtc(struct dma_fence *fence) -{ - BUG_ON(rcu_access_pointer(fence->ops) != &drm_crtc_fence_ops); - return container_of(fence->extern_lock, struct drm_crtc, fence_lock); -} - static const char *drm_crtc_fence_get_driver_name(struct dma_fence *fence) { struct drm_crtc *crtc = fence_to_crtc(fence);

--- base-commit: e2cae00c05d196491c318196792297f2dfbaa02c change-id: 20260618-linux-drm_crtc_fix2-23a7c354a412

Best regards,

-- André Draszik andre.draszik@linaro.org

Show replies by date

Jani Nikula

18 Jun 18 Jun

2:53 p.m.

On Thu, 18 Jun 2026, André Draszik andre.draszik@linaro.org wrote:

...

Since commit 541c8f2468b9 ("dma-buf: detach fence ops on signal v3"), I'm seeing the BUG_ON() triggering in drm_crtc's fence_to_crtc() via drm_crtc_fence_get_driver_name() regularly:
Call trace:
 panic+0x58/0x5c
 die+0x160/0x178
 bug_brk_handler+0x70/0xa4
 call_el1_break_hook+0x3c/0x1a0
 do_el1_brk64+0x24/0x74
 el1_brk64+0x34/0x54
 el1h_64_sync_handler+0x80/0xfc
 el1h_64_sync+0x84/0x88
 drm_crtc_fence_get_driver_name+0x60/0x68 (P)
 sync_file_get_name+0x184/0x45c
 sync_file_ioctl+0x404/0xf70
 __arm64_sys_ioctl+0x124/0x1dc
This looks to be caused by a code flow similar to the following:

+++ snip +++ thread A thread B
                                 ioctl(SYNC_IOC_FILE_INFO)
                                 sync_file_ioctl()
                                 sync_file_get_name()
dma_fence_signal_timestamp_locked() dma_fence_driver_name() ops = rcu_dereference(fence->ops) if (!dma_fence_test_signaled_flag()) ops->get_driver_name(fence) i.e. drm_crtc_fence_get_driver_name() test_and_set_bit(SIGNALED) RCU_INIT_POINTER(fence->ops, NULL) drm_crtc_fence_get_driver_name() BUG_ON(rcu_access_pointer(fence->ops) != &drm_crtc_fence_ops) +++ snap +++

I see two ways to resolve this: a) simply drop the BUG_ON(). It can not work anymore since above commit, as it is racy now. b) pass the original 'ops' pointer obtained in dma_fence_driver_name() to all callees.

This patch implements option a), as because:

I don't see much benefit in passing the extra pointer just for this BUG_ON() to work.

Requiring the dma_fence_ops in those callbacks is an implementation detail of the drm_crtc driver, and therefore upper layers shouldn't have to care about that.

The existence of the BUG_ON() doesn't appear to be consistent with implementations of ::get_driver_name() or ::get_timeline_name() in the majority of other DRM drivers in the first place. Those that do have a similar BUG_ON() (i915, xe) probably also need an update similar to this patch here but I'm not in a position to test those.

Note that the adjacent drm_crtc_fence_get_timeline_name() has the same problem and is fixed by this patch as well.

Fixes: 541c8f2468b9 ("dma-buf: detach fence ops on signal v3") Signed-off-by: André Draszik andre.draszik@linaro.org

drivers/gpu/drm/drm_crtc.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/drm_crtc.c b/drivers/gpu/drm/drm_crtc.c index 63ead8ba6756..31c8636e7467 100644 --- a/drivers/gpu/drm/drm_crtc.c +++ b/drivers/gpu/drm/drm_crtc.c @@ -73,6 +73,9 @@

&drm_mode_config_funcs.atomic_check.

*/ +#define fence_to_crtc(f) container_of((f)->extern_lock, \
		      struct drm_crtc, fence_lock)
/**

drm_crtc_from_index - find the registered CRTC at an index

@dev: DRM device

@@ -154,14 +157,6 @@ static void drm_crtc_crc_fini(struct drm_crtc *crtc) #endif } -static const struct dma_fence_ops drm_crtc_fence_ops;

-static struct drm_crtc *fence_to_crtc(struct dma_fence *fence) -{

BUG_ON(rcu_access_pointer(fence->ops) != &drm_crtc_fence_ops);

Whether removing the BUG_ON() turns out to be the right choice or not, I couldn't say, but please don't turn this function into a macro, at least not without rationale. (I can't think of any.)

BR, Jani.

...

return container_of(fence->extern_lock, struct drm_crtc, fence_lock);

-}

static const char *drm_crtc_fence_get_driver_name(struct dma_fence *fence) { struct drm_crtc *crtc = fence_to_crtc(fence);

base-commit: e2cae00c05d196491c318196792297f2dfbaa02c change-id: 20260618-linux-drm_crtc_fix2-23a7c354a412

Best regards,

-- Jani Nikula, Intel

Philipp Stanner

3:56 p.m.

+Cc Danilo

On Thu, 2026-06-18 at 15:03 +0100, André Draszik wrote:

...

Since commit 541c8f2468b9 ("dma-buf: detach fence ops on signal v3"), I'm seeing the BUG_ON() triggering in drm_crtc's fence_to_crtc() via drm_crtc_fence_get_driver_name() regularly:

Call trace:      panic+0x58/0x5c      die+0x160/0x178      bug_brk_handler+0x70/0xa4      call_el1_break_hook+0x3c/0x1a0      do_el1_brk64+0x24/0x74      el1_brk64+0x34/0x54      el1h_64_sync_handler+0x80/0xfc      el1h_64_sync+0x84/0x88      drm_crtc_fence_get_driver_name+0x60/0x68 (P)      sync_file_get_name+0x184/0x45c      sync_file_ioctl+0x404/0xf70      __arm64_sys_ioctl+0x124/0x1dc

This looks to be caused by a code flow similar to the following:

+++ snip +++ thread A                             thread B

ioctl(SYNC_IOC_FILE_INFO)                                      sync_file_ioctl()                                      sync_file_get_name() dma_fence_signal_timestamp_locked() dma_fence_driver_name()                                        ops = rcu_dereference(fence->ops)                                        if (!dma_fence_test_signaled_flag())                                          ops->get_driver_name(fence) i.e.                                          drm_crtc_fence_get_driver_name() test_and_set_bit(SIGNALED) RCU_INIT_POINTER(fence->ops, NULL)                                      drm_crtc_fence_get_driver_name()                                        BUG_ON(rcu_access_pointer(fence->ops)                                               != &drm_crtc_fence_ops)

Now this looks like a very similar problem that I have recently been concerned with:

https://lore.kernel.org/dri-devel/20260612104251.2264707-2-phasta@kernel.org...

https://lore.kernel.org/dri-devel/fa0dc9757bf8343516c4b156a2b70ec91b64ef8f.c...

I continue to believe because of bugs like this and the ones I have quoted in the threads above the robustness of the kernel could be greatly improved if we could get dma_fence fully synchronized with its lock.

That said:

...

+++ snap +++

I see two ways to resolve this: a) simply drop the BUG_ON(). It can not work anymore since above commit, as it is racy now. b) pass the original 'ops' pointer obtained in dma_fence_driver_name() to all callees.

This patch implements option a), as because:

I don't see much benefit in passing the extra pointer just for this

BUG_ON() to work.

Requiring the dma_fence_ops in those callbacks is an implementation

detail of the drm_crtc driver, and therefore upper layers shouldn't have to care about that.

The existence of the BUG_ON() doesn't appear to be consistent with

implementations of ::get_driver_name() or ::get_timeline_name() in the majority of other DRM drivers in the first place. Those that do have a similar BUG_ON() (i915, xe) probably also need an update similar to this patch here but I'm not in a position to test those.

Note that the adjacent drm_crtc_fence_get_timeline_name() has the same problem and is fixed by this patch as well.

Fixes: 541c8f2468b9 ("dma-buf: detach fence ops on signal v3") Signed-off-by: André Draszik andre.draszik@linaro.org

drivers/gpu/drm/drm_crtc.c | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/drm_crtc.c b/drivers/gpu/drm/drm_crtc.c index 63ead8ba6756..31c8636e7467 100644 --- a/drivers/gpu/drm/drm_crtc.c +++ b/drivers/gpu/drm/drm_crtc.c @@ -73,6 +73,9 @@ * &drm_mode_config_funcs.atomic_check. */ +#define fence_to_crtc(f) container_of((f)->extern_lock, \
		      struct drm_crtc, fence_lock)

I agree that macros should be avoided if possible.

...

/** * drm_crtc_from_index - find the registered CRTC at an index * @dev: DRM device @@ -154,14 +157,6 @@ static void drm_crtc_crc_fini(struct drm_crtc *crtc) #endif } -static const struct dma_fence_ops drm_crtc_fence_ops;

-static struct drm_crtc *fence_to_crtc(struct dma_fence *fence) -{

BUG_ON(rcu_access_pointer(fence->ops) != &drm_crtc_fence_ops);

BUG_ON is more or less deprecated and should not be used anymore. There needs to be bombastic justification for shooting down the entire kernel.

...

return container_of(fence->extern_lock, struct drm_crtc, fence_lock);

-}

static const char *drm_crtc_fence_get_driver_name(struct dma_fence *fence) { struct drm_crtc *crtc = fence_to_crtc(fence);

base-commit: e2cae00c05d196491c318196792297f2dfbaa02c change-id: 20260618-linux-drm_crtc_fix2-23a7c354a412

Best regards,

André Draszik

23 Jun 23 Jun

11:37 a.m.

Hi,

On Thu, 2026-06-18 at 17:56 +0200, Philipp Stanner wrote:

...

+Cc Danilo

On Thu, 2026-06-18 at 15:03 +0100, André Draszik wrote:

...
Since commit 541c8f2468b9 ("dma-buf: detach fence ops on signal v3"), I'm seeing the BUG_ON() triggering in drm_crtc's fence_to_crtc() via drm_crtc_fence_get_driver_name() regularly:

Call trace:      panic+0x58/0x5c      die+0x160/0x178      bug_brk_handler+0x70/0xa4      call_el1_break_hook+0x3c/0x1a0      do_el1_brk64+0x24/0x74      el1_brk64+0x34/0x54      el1h_64_sync_handler+0x80/0xfc      el1h_64_sync+0x84/0x88      drm_crtc_fence_get_driver_name+0x60/0x68 (P)      sync_file_get_name+0x184/0x45c      sync_file_ioctl+0x404/0xf70      __arm64_sys_ioctl+0x124/0x1dc

This looks to be caused by a code flow similar to the following:

+++ snip +++ thread A                             thread B

ioctl(SYNC_IOC_FILE_INFO)                                      sync_file_ioctl()                                      sync_file_get_name() dma_fence_signal_timestamp_locked() dma_fence_driver_name()                                        ops = rcu_dereference(fence->ops)                                        if (!dma_fence_test_signaled_flag())                                          ops->get_driver_name(fence) i.e.                                          drm_crtc_fence_get_driver_name() test_and_set_bit(SIGNALED) RCU_INIT_POINTER(fence->ops, NULL)                                      drm_crtc_fence_get_driver_name()                                        BUG_ON(rcu_access_pointer(fence->ops)                                               != &drm_crtc_fence_ops)

Now this looks like a very similar problem that I have recently been concerned with:

https://lore.kernel.org/dri-devel/20260612104251.2264707-2-phasta@kernel.org...

https://lore.kernel.org/dri-devel/fa0dc9757bf8343516c4b156a2b70ec91b64ef8f.c...

I continue to believe because of bugs like this and the ones I have quoted in the threads above the robustness of the kernel could be greatly improved if we could get dma_fence fully synchronized with its lock.

On top of that, sashiko highlighted (via my other patch) that the existing code is missing some memory barriers:

https://sashiko.dev/#/patchset/20260618-linux-drm_crtc_fix-v1-1-801f29c9853d...

I believe Lock synchronization would resolve that (as would adding explicit memory barriers).

[...]

...

...
diff --git a/drivers/gpu/drm/drm_crtc.c b/drivers/gpu/drm/drm_crtc.c index 63ead8ba6756..31c8636e7467 100644 --- a/drivers/gpu/drm/drm_crtc.c +++ b/drivers/gpu/drm/drm_crtc.c @@ -73,6 +73,9 @@ * &drm_mode_config_funcs.atomic_check. */ +#define fence_to_crtc(f) container_of((f)->extern_lock, \
		      struct drm_crtc, fence_lock)
I agree that macros should be avoided if possible.

No problem, I'll change that.

...

...

/** * drm_crtc_from_index - find the registered CRTC at an index * @dev: DRM device @@ -154,14 +157,6 @@ static void drm_crtc_crc_fini(struct drm_crtc *crtc) #endif } -static const struct dma_fence_ops drm_crtc_fence_ops;

-static struct drm_crtc *fence_to_crtc(struct dma_fence *fence) -{

BUG_ON(rcu_access_pointer(fence->ops) != &drm_crtc_fence_ops);

+1

BUG_ON is more or less deprecated and should not be used anymore. There needs to be bombastic justification for shooting down the entire kernel.

Yes, I meant to mention that in my commit message as well.

Now, you might have seen that sashiko highlighted that the existing BUG_ON() is masking a potential use-after-free during driver removal which I believe to be a correct observation. It suggests two alternative ways to resolve it:

https://sashiko.dev/#/patchset/20260618-linux-drm_crtc_fix2-v1-1-c03e77b36f3...

...

Does the CRTC or DRM device need to be kept alive for the RCU grace period, or should the fence hold a proper reference to prevent the use-after-free when get_driver_name() and get_timeline_name() access the freed CRTC structure?

Do you guys have any preference on that? It appears the use-after-free should be resolved before merging the removal of the BUG_ON(), and I'd like to progress on this.

Cheers, Andre'

Philipp Stanner

11:58 a.m.

On Tue, 2026-06-23 at 12:37 +0100, André Draszik wrote:

...

Hi,

On Thu, 2026-06-18 at 17:56 +0200, Philipp Stanner wrote:

...
+Cc Danilo

On Thu, 2026-06-18 at 15:03 +0100, André Draszik wrote:

...
Since commit 541c8f2468b9 ("dma-buf: detach fence ops on signal v3"), I'm seeing the BUG_ON() triggering in drm_crtc's fence_to_crtc() via drm_crtc_fence_get_driver_name() regularly:

Call trace:      panic+0x58/0x5c      die+0x160/0x178      bug_brk_handler+0x70/0xa4      call_el1_break_hook+0x3c/0x1a0      do_el1_brk64+0x24/0x74      el1_brk64+0x34/0x54      el1h_64_sync_handler+0x80/0xfc      el1h_64_sync+0x84/0x88      drm_crtc_fence_get_driver_name+0x60/0x68 (P)      sync_file_get_name+0x184/0x45c      sync_file_ioctl+0x404/0xf70      __arm64_sys_ioctl+0x124/0x1dc

This looks to be caused by a code flow similar to the following:

+++ snip +++ thread A                             thread B

ioctl(SYNC_IOC_FILE_INFO)                                      sync_file_ioctl()                                      sync_file_get_name() dma_fence_signal_timestamp_locked() dma_fence_driver_name()                                        ops = rcu_dereference(fence->ops)                                        if (!dma_fence_test_signaled_flag())                                          ops->get_driver_name(fence) i.e.                                          drm_crtc_fence_get_driver_name() test_and_set_bit(SIGNALED) RCU_INIT_POINTER(fence->ops, NULL)                                      drm_crtc_fence_get_driver_name()                                        BUG_ON(rcu_access_pointer(fence->ops)                                               != &drm_crtc_fence_ops)

Now this looks like a very similar problem that I have recently been concerned with:

https://lore.kernel.org/dri-devel/20260612104251.2264707-2-phasta@kernel.org...

https://lore.kernel.org/dri-devel/fa0dc9757bf8343516c4b156a2b70ec91b64ef8f.c...

I continue to believe because of bugs like this and the ones I have quoted in the threads above the robustness of the kernel could be greatly improved if we could get dma_fence fully synchronized with its lock.

On top of that, sashiko highlighted (via my other patch) that the existing code is missing some memory barriers:

https://sashiko.dev/#/patchset/20260618-linux-drm_crtc_fix-v1-1-801f29c9853d...

I believe Lock synchronization would resolve that (as would adding explicit memory barriers).

That is being discussed in the thread I linked, where Gary lists which barriers you would need for (presumably correct) lockless magic.

However, if my issue were to be solved with barriers, the test_and_set_bit() in dma_fence_signal_timestamp_locked() would have to be replaced with the more weakly ordered test_bit() and set_bit(), maybe creating other pitfalls.

The ordering issue in the get_*_name() functions plays into that. Setting the bit would then be done after setting the ops-pointer to NULL. So one would have to try to move the NULL set, too.

Long story short, this is painful and subtle.

But I think what we are realizing over and over again is that dma_fence has many subtleties to its API contract, and the implementation's sparring use of spinlocks leads to workarounds where people take locks manually or have to do an RCU dance.

Note that Christian is strongly opposed to guarding everything with locks, in part for supposedly occuring deadlocks in the fence callbacks when the driver needs to take its own locks.

The community discussion regarding that problem is currently in some sort of dead end, where none of us seems to know what the correct path forward is.

drm_sched users (and future users in Rust) use intermediate fences which decouple e.g. userspace from the actual hardware fences. So one path forward might be to question the callbacks in general and think about some sort of replacement for them.

...

[...]

...
...
...
Does the CRTC or DRM device need to be kept alive for the RCU grace period, or should the fence hold a proper reference to prevent the use-after-free when get_driver_name() and get_timeline_name() access the freed CRTC structure?

Do you guys have any preference on that? It appears the use-after-free should be resolved before merging the removal of the BUG_ON(), and I'd like to progress on this.

My understanding of the current situation is that as an issuer of dma_fence's you, in general, should wait for a grace period until you perform operations like driver unload, or, more generally, have fence- related resources and such being accessed through callbacks go away.

Danilo has recently mentioned some life-time inconsistencies between wider kernel device model and DRM device model that might be related to that discussion, and which made him object against some RCU requirements.

Maybe he's got the time to share some details with you that are relevant to your work.

...

Cheers, Andre'

André Draszik

2:33 p.m.

Hi Philipp,

On Tue, 2026-06-23 at 13:58 +0200, Philipp Stanner wrote:

...

On Tue, 2026-06-23 at 12:37 +0100, André Draszik wrote:

...
On Thu, 2026-06-18 at 17:56 +0200, Philipp Stanner wrote:

...
I continue to believe because of bugs like this and the ones I have quoted in the threads above the robustness of the kernel could be greatly improved if we could get dma_fence fully synchronized with its lock.

On top of that, sashiko highlighted (via my other patch) that the existing code is missing some memory barriers:

https://sashiko.dev/#/patchset/20260618-linux-drm_crtc_fix-v1-1-801f29c9853d...

I believe Lock synchronization would resolve that (as would adding explicit memory barriers).

That is being discussed in the thread I linked, where Gary lists which barriers you would need for (presumably correct) lockless magic.

Having read Gary's suggestion, that aligns with what I had in mind.

...

However, if my issue were to be solved with barriers, the test_and_set_bit() in dma_fence_signal_timestamp_locked() would have to be replaced with the more weakly ordered test_bit() and set_bit(), maybe creating other pitfalls.

For the avoidance of doubts, I'm not saying that all the issues you raised can be solved by barriers instead of appropriate locks (I don't know enough about the code and issues in general here).

I do think however that appropriate locks will fix the ordering issue highlighted by sashiko (i.e. +1 for your argument). Barriers would fix this specific issue, too, but that is not a statement about any wider issues.

...

The ordering issue in the get_*_name() functions plays into that. Setting the bit would then be done after setting the ops-pointer to NULL. So one would have to try to move the NULL set, too.

Long story short, this is painful and subtle.

But I think what we are realizing over and over again is that dma_fence has many subtleties to its API contract, and the implementation's sparring use of spinlocks leads to workarounds where people take locks manually or have to do an RCU dance.

Note that Christian is strongly opposed to guarding everything with locks, in part for supposedly occuring deadlocks in the fence callbacks when the driver needs to take its own locks.

ww_mutex could help against deadlocks, but might affect performance, in case these are all critical code paths (IDK),

...

The community discussion regarding that problem is currently in some sort of dead end, where none of us seems to know what the correct path forward is.

Please ignore if the following doesn't make sense, I'm just a bystander :-) How about at least adding the required barriers and related changes, and taking it from there? This would solve some immediate and easy to hit issues on Arm64? If they turn out to be insufficient, code can still be changed.

...

...
...
[...] My understanding of the current situation is that as an issuer of dma_fence's you, in general, should wait for a grace period until you perform operations like driver unload, or, more generally, have fence- related resources and such being accessed through callbacks go away.

If I understand correctly, simply waiting for a grace period in the driver's unbind should be the way to go.

...

Danilo ... Maybe he's got the time to share some details with you that are relevant to your work.

Will wait a little :-)

BTW, thanks Philipp for all these details, much appreciated.

Cheers, A.

Philipp Stanner

24 Jun 24 Jun

7:33 a.m.

On Tue, 2026-06-23 at 15:33 +0100, André Draszik wrote:

...

...
However, if my issue were to be solved with barriers, the test_and_set_bit() in dma_fence_signal_timestamp_locked() would have to be replaced with the more weakly ordered test_bit() and set_bit(), maybe creating other pitfalls.

For the avoidance of doubts, I'm not saying that all the issues you raised can be solved by barriers instead of appropriate locks (I don't know enough about the code and issues in general here).

I'm not saying that you're saying that. I'm just cautioning you that this change could be tricky.

...

I do think however that appropriate locks will fix the ordering issue highlighted by sashiko (i.e. +1 for your argument). Barriers would fix this specific issue, too, but that is not a statement about any wider issues.

...
The ordering issue in the get_*_name() functions plays into that. Setting the bit would then be done after setting the ops-pointer to NULL. So one would have to try to move the NULL set, too.

Long story short, this is painful and subtle.

But I think what we are realizing over and over again is that dma_fence has many subtleties to its API contract, and the implementation's sparring use of spinlocks leads to workarounds where people take locks manually or have to do an RCU dance.

Note that Christian is strongly opposed to guarding everything with locks, in part for supposedly occuring deadlocks in the fence callbacks when the driver needs to take its own locks.

ww_mutex could help against deadlocks, but might affect performance, in case these are all critical code paths (IDK),

You can't use sleepable locks in fences. They fire in interrupt context left and right ;)

Despite, that wouldn't even solve the reported problem.

The tl;dr is:

there is fence_ops->enable_signaling(), which is currently being called with the fence lock held. So the driver, in that callback, cannot take a driver-specific lock IF there is another driver party (like an IRQ) taking first the driver lock and then the fence lock.

Which is why Christian König wants to remove the fence lock being held in enable_signaling().

One reason why that, supposedly, is currently not a problem is that without fence->inline_lock, you can protect the fctx with the same lock and do fctx list manipulations in enable_signaling() with lock protection.

If you have a big bowl of popcorn available, you could checkout this thread:

https://lore.kernel.org/dri-devel/20260608142436.265820-2-phasta@kernel.org/

My own thinking is: If everyone used inline_lock, and if we could rely on everyone being able to do the necessary work in enable_signaling() without said lock- inversion, then we could perfectly synchronize all actions related to dma_fence, including driver and, thus, fence_ops unload.

The only thing blocking really might be enable_signaling (the other callbacks already take the lock). The more difficult question would be how to implement that in a backwards compatible manner, i.e., for those who don't have inline_lock.

Another idea for the distant future might be to question the existence of those callbacks. Userspace often is sort of decoupled from the hardware fences through intermediate fences already.

...

...
The community discussion regarding that problem is currently in some sort of dead end, where none of us seems to know what the correct path forward is.

Please ignore if the following doesn't make sense, I'm just a bystander :-) How about at least adding the required barriers and related changes, and taking it from there? This would solve some immediate and easy to hit issues on Arm64? If they turn out to be insufficient, code can still be changed.

I am in support of that, which is why I posted that RFC for feedback about the appropriate memory barriers.

...

BTW, thanks Philipp for all these details, much appreciated.

You're welcome. If you'd find a clever solution, probably everyone would be happy.

...

Cheers, A.

days inactive

days old

linaro-mm-sig@lists.linaro.org

6 comments

participants

tags (0)

participants (3)

André Draszik
Jani Nikula
Philipp Stanner