On Fri, 2025-04-11 at 13:05 +0200, Christian König wrote:
Am 11.04.25 um 11:29 schrieb Philipp Stanner:
[SNIP] It could be, however, that at the same moment nouveau_fence_signal() is removing that entry, holding the appropriate lock.
So we have a race. Again.
Ah, yes of course. If signaled is called with or without the lock is actually undetermined.
You see, fixing things in Nouveau is difficult :) It gets more difficult if you want to clean it up "properly", so it conforms to rules such as those from dma_fence.
I have now provided two fixes that both work, but you are not satisfied with from the dma_fence-maintainer's perspective. I understand that, but please also understand that it's actually not my primary task to work on Nouveau. I just have to fix this bug to move on with my scheduler work.
Well I'm happy with whatever solution as long as it works, but as far as I can see the approach with the callback simply doesn't. You just can't drop the fence reference for the list from the callback.
So if you have another idea, feel free to share it. But I'd like to know how we can go on here.
Well the fence code actually works, doesn't it? The problem is rather that setting the error throws a warning because it doesn't expect signaled fences on the pending list. Maybe we should fix that instead.
The fence code works as the author intended, but I would be happy if it were more explicitly documented.
Regarding the WARN_ON: It occurs in dma_fence_set_error() because there is an attempt to set an error code on a signaled fence. I don't think that should be "fixed", it works as intended: You must not set an error code of a fence that was already signaled.
The reason seems to be that once a fence is signaled, a third party might evaluate the error code.
But I think this wasn't wat you meant with "fix".
In any case, there must not be signaled fences in nouveau's pending- list. They must be removed immediately once they signal, and this must not race.
I'm running out of ideas. What I'm wondering if we couldn't just remove performance hacky fastpath functions such as nouveau_fence_is_signaled() completely. It seems redundant to me.
That would work for me as well.
I'll test this approach. Seems a bit like the nuclear approach, but if it works we'd at least clean up a lot of this mess.
P.
Or we might add locking to it, but IDK what was achieved with RCU here. In any case it's definitely bad that Nouveau has so many redundant and half-redundant mechanisms.
Yeah, agree messing with the locks even more won't help us here. Regards, Christian.
P.
P.
Regards, Christian.
P.
Regards, Christian.
> > Replace the call to dma_fence_is_signaled() with > nouveau_fence_base_is_signaled(). > > Cc: stable@vger.kernel.org # 4.10+, precise commit not > to > be > determined > Signed-off-by: Philipp Stanner phasta@kernel.org > --- > drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c > b/drivers/gpu/drm/nouveau/nouveau_fence.c > index 7cc84472cece..33535987d8ed 100644 > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c > @@ -274,7 +274,7 @@ nouveau_fence_done(struct > nouveau_fence > *fence) > nvif_event_block(&fctx->event); > spin_unlock_irqrestore(&fctx->lock, > flags); > } > - return dma_fence_is_signaled(&fence->base); > + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, > &fence- > > > > > base.flags); > > > > } > > static long >
Am 11.04.25 um 14:44 schrieb Philipp Stanner:
On Fri, 2025-04-11 at 13:05 +0200, Christian König wrote:
Am 11.04.25 um 11:29 schrieb Philipp Stanner:
[SNIP] It could be, however, that at the same moment nouveau_fence_signal() is removing that entry, holding the appropriate lock.
So we have a race. Again.
Ah, yes of course. If signaled is called with or without the lock is actually undetermined.
You see, fixing things in Nouveau is difficult :) It gets more difficult if you want to clean it up "properly", so it conforms to rules such as those from dma_fence.
I have now provided two fixes that both work, but you are not satisfied with from the dma_fence-maintainer's perspective. I understand that, but please also understand that it's actually not my primary task to work on Nouveau. I just have to fix this bug to move on with my scheduler work.
Well I'm happy with whatever solution as long as it works, but as far as I can see the approach with the callback simply doesn't. You just can't drop the fence reference for the list from the callback.
So if you have another idea, feel free to share it. But I'd like to know how we can go on here.
Well the fence code actually works, doesn't it? The problem is rather that setting the error throws a warning because it doesn't expect signaled fences on the pending list. Maybe we should fix that instead.
The fence code works as the author intended, but I would be happy if it were more explicitly documented.
Regarding the WARN_ON: It occurs in dma_fence_set_error() because there is an attempt to set an error code on a signaled fence. I don't think that should be "fixed", it works as intended: You must not set an error code of a fence that was already signaled.
The reason seems to be that once a fence is signaled, a third party might evaluate the error code.
Yeah, more or less correct. The idea is you can't declare an operation as having an error after the operation has already completed.
Because everyone will just wait for the completion and nobody checks the status again after that.
But I think this wasn't wat you meant with "fix".
The idea was to avoid calling dma_fence_set_error() on already signaled fences. Something like this:
@@ -90,7 +90,7 @@ nouveau_fence_context_kill(struct nouveau_fence_chan *fctx, int error) while (!list_empty(&fctx->pending)) { fence = list_entry(fctx->pending.next, typeof(*fence), head); - if (error) + if (error & !dma_fence_is_signaled_locked(&fence->base)) dma_fence_set_error(&fence->base, error); if (nouveau_fence_signal(fence))
That would also improve the handling quite a bit since we now don't set errors on fences which are already completed even if we haven't realized that they are already completed yet.
In any case, there must not be signaled fences in nouveau's pending- list. They must be removed immediately once they signal, and this must not race.
Why actually? As far as I can see the pending list is not for the unsignaled fences, but rather the pending interrupt processing.
So having signaled fences on the pending list is perfectly possible.
Regards, Christian.
I'm running out of ideas. What I'm wondering if we couldn't just remove performance hacky fastpath functions such as nouveau_fence_is_signaled() completely. It seems redundant to me.
That would work for me as well.
I'll test this approach. Seems a bit like the nuclear approach, but if it works we'd at least clean up a lot of this mess.
P.
Or we might add locking to it, but IDK what was achieved with RCU here. In any case it's definitely bad that Nouveau has so many redundant and half-redundant mechanisms.
Yeah, agree messing with the locks even more won't help us here. Regards, Christian.
P.
P.
Regards, Christian.
P.
> > Regards, > Christian. > > >> >> Replace the call to dma_fence_is_signaled() with >> nouveau_fence_base_is_signaled(). >> >> Cc: stable@vger.kernel.org # 4.10+, precise commit not >> to >> be >> determined >> Signed-off-by: Philipp Stanner phasta@kernel.org >> --- >> drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c >> b/drivers/gpu/drm/nouveau/nouveau_fence.c >> index 7cc84472cece..33535987d8ed 100644 >> --- a/drivers/gpu/drm/nouveau/nouveau_fence.c >> +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c >> @@ -274,7 +274,7 @@ nouveau_fence_done(struct >> nouveau_fence >> *fence) >> nvif_event_block(&fctx->event); >> spin_unlock_irqrestore(&fctx->lock, >> flags); >> } >> - return dma_fence_is_signaled(&fence->base); >> + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, >> &fence- >> >>> >>> base.flags); >>> >> >> } >> >> static long >> >
On Fri, 2025-04-11 at 15:06 +0200, Christian König wrote:
Am 11.04.25 um 14:44 schrieb Philipp Stanner:
On Fri, 2025-04-11 at 13:05 +0200, Christian König wrote:
Am 11.04.25 um 11:29 schrieb Philipp Stanner:
[SNIP] It could be, however, that at the same moment nouveau_fence_signal() is removing that entry, holding the appropriate lock.
So we have a race. Again.
Ah, yes of course. If signaled is called with or without the lock is actually undetermined.
You see, fixing things in Nouveau is difficult :) It gets more difficult if you want to clean it up "properly", so it conforms to rules such as those from dma_fence.
I have now provided two fixes that both work, but you are not satisfied with from the dma_fence-maintainer's perspective. I understand that, but please also understand that it's actually not my primary task to work on Nouveau. I just have to fix this bug to move on with my scheduler work.
Well I'm happy with whatever solution as long as it works, but as far as I can see the approach with the callback simply doesn't. You just can't drop the fence reference for the list from the callback.
So if you have another idea, feel free to share it. But I'd like to know how we can go on here.
Well the fence code actually works, doesn't it? The problem is rather that setting the error throws a warning because it doesn't expect signaled fences on the pending list. Maybe we should fix that instead.
The fence code works as the author intended, but I would be happy if it were more explicitly documented.
Regarding the WARN_ON: It occurs in dma_fence_set_error() because there is an attempt to set an error code on a signaled fence. I don't think that should be "fixed", it works as intended: You must not set an error code of a fence that was already signaled.
The reason seems to be that once a fence is signaled, a third party might evaluate the error code.
Yeah, more or less correct. The idea is you can't declare an operation as having an error after the operation has already completed.
Because everyone will just wait for the completion and nobody checks the status again after that.
But I think this wasn't wat you meant with "fix".
The idea was to avoid calling dma_fence_set_error() on already signaled fences. Something like this:
@@ -90,7 +90,7 @@ nouveau_fence_context_kill(struct nouveau_fence_chan *fctx, int error) while (!list_empty(&fctx->pending)) { fence = list_entry(fctx->pending.next, typeof(*fence), head); - if (error) + if (error & !dma_fence_is_signaled_locked(&fence-
base))
dma_fence_set_error(&fence->base, error); if (nouveau_fence_signal(fence))
That would also improve the handling quite a bit since we now don't set errors on fences which are already completed even if we haven't realized that they are already completed yet.
In any case, there must not be signaled fences in nouveau's pending- list. They must be removed immediately once they signal, and this must not race.
Why actually? As far as I can see the pending list is not for the unsignaled fences, but rather the pending interrupt processing.
That's a list of fences that are "in the air", i.e., whose jobs are currently being processed by the hardware. Once a job is done, its fence must be removed.
So having signaled fences on the pending list is perfectly possible.
It is possible, and that is a bug. The list is used by nouveau_fence_context_kill() to kill still pending jobs. It shall not try to kill and set error codes for fences that are already signaled.
Anyways, forget about the "remove callbacks solution" it actually causes a MASSIVE performance regression. No idea why, AFAICS the fast path is only ever evaluated in nouveau_fence_done(), but maybe I missed something.
Will re-iterate next week…
P.
Regards, Christian.
I'm running out of ideas. What I'm wondering if we couldn't just remove performance hacky fastpath functions such as nouveau_fence_is_signaled() completely. It seems redundant to me.
That would work for me as well.
I'll test this approach. Seems a bit like the nuclear approach, but if it works we'd at least clean up a lot of this mess.
P.
Or we might add locking to it, but IDK what was achieved with RCU here. In any case it's definitely bad that Nouveau has so many redundant and half-redundant mechanisms.
Yeah, agree messing with the locks even more won't help us here. Regards, Christian.
P.
P.
Regards, Christian.
> > P. > > > > > > > > Regards, > > Christian. > > > > > > > > > > Replace the call to dma_fence_is_signaled() with > > > nouveau_fence_base_is_signaled(). > > > > > > Cc: stable@vger.kernel.org # 4.10+, precise commit > > > not > > > to > > > be > > > determined > > > Signed-off-by: Philipp Stanner phasta@kernel.org > > > --- > > > drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c > > > b/drivers/gpu/drm/nouveau/nouveau_fence.c > > > index 7cc84472cece..33535987d8ed 100644 > > > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c > > > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c > > > @@ -274,7 +274,7 @@ nouveau_fence_done(struct > > > nouveau_fence > > > *fence) > > > nvif_event_block(&fctx- > > > >event); > > > spin_unlock_irqrestore(&fctx->lock, > > > flags); > > > } > > > - return dma_fence_is_signaled(&fence->base); > > > + return test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, > > > &fence- > > > > > > > > > > > base.flags); > > > > > > > > > > } > > > > > > static long > > > > > >
On Fri, 2025-04-11 at 16:10 +0200, Philipp Stanner wrote:
On Fri, 2025-04-11 at 15:06 +0200, Christian König wrote:
Am 11.04.25 um 14:44 schrieb Philipp Stanner:
On Fri, 2025-04-11 at 13:05 +0200, Christian König wrote:
Am 11.04.25 um 11:29 schrieb Philipp Stanner:
[SNIP] It could be, however, that at the same moment nouveau_fence_signal() is removing that entry, holding the appropriate lock.
So we have a race. Again.
Ah, yes of course. If signaled is called with or without the lock is actually undetermined.
You see, fixing things in Nouveau is difficult :) It gets more difficult if you want to clean it up "properly", so it conforms to rules such as those from dma_fence.
I have now provided two fixes that both work, but you are not satisfied with from the dma_fence-maintainer's perspective. I understand that, but please also understand that it's actually not my primary task to work on Nouveau. I just have to fix this bug to move on with my scheduler work.
Well I'm happy with whatever solution as long as it works, but as far as I can see the approach with the callback simply doesn't. You just can't drop the fence reference for the list from the callback.
So if you have another idea, feel free to share it. But I'd like to know how we can go on here.
Well the fence code actually works, doesn't it? The problem is rather that setting the error throws a warning because it doesn't expect signaled fences on the pending list. Maybe we should fix that instead.
The fence code works as the author intended, but I would be happy if it were more explicitly documented.
Regarding the WARN_ON: It occurs in dma_fence_set_error() because there is an attempt to set an error code on a signaled fence. I don't think that should be "fixed", it works as intended: You must not set an error code of a fence that was already signaled.
The reason seems to be that once a fence is signaled, a third party might evaluate the error code.
Yeah, more or less correct. The idea is you can't declare an operation as having an error after the operation has already completed.
Because everyone will just wait for the completion and nobody checks the status again after that.
But I think this wasn't wat you meant with "fix".
The idea was to avoid calling dma_fence_set_error() on already signaled fences. Something like this:
@@ -90,7 +90,7 @@ nouveau_fence_context_kill(struct nouveau_fence_chan *fctx, int error) while (!list_empty(&fctx->pending)) { fence = list_entry(fctx->pending.next, typeof(*fence), head); - if (error) + if (error & !dma_fence_is_signaled_locked(&fence-
base))
dma_fence_set_error(&fence->base, error); if (nouveau_fence_signal(fence))
That would also improve the handling quite a bit since we now don't set errors on fences which are already completed even if we haven't realized that they are already completed yet.
In any case, there must not be signaled fences in nouveau's pending- list. They must be removed immediately once they signal, and this must not race.
Why actually? As far as I can see the pending list is not for the unsignaled fences, but rather the pending interrupt processing.
That's a list of fences that are "in the air", i.e., whose jobs are currently being processed by the hardware. Once a job is done, its fence must be removed.
So having signaled fences on the pending list is perfectly possible.
It is possible, and that is a bug. The list is used by nouveau_fence_context_kill() to kill still pending jobs. It shall not try to kill and set error codes for fences that are already signaled.
@Danilo: We have now 2 possible solutions for the firing WARN_ON floating.
Version A (Christian) Check in nouveau_fence_context_kill() whether a fence is already signaled before setting an error.
Version B (Me) This patch series here. Make sure that in Nouveau, only nouveau_fence_signal() signals fences.
Both should do the trick. Please share a maintainer-preference so I can move on here.
Thx P.
Anyways, forget about the "remove callbacks solution" it actually causes a MASSIVE performance regression. No idea why, AFAICS the fast path is only ever evaluated in nouveau_fence_done(), but maybe I missed something.
Will re-iterate next week…
P.
Regards, Christian.
I'm running out of ideas. What I'm wondering if we couldn't just remove performance hacky fastpath functions such as nouveau_fence_is_signaled() completely. It seems redundant to me.
That would work for me as well.
I'll test this approach. Seems a bit like the nuclear approach, but if it works we'd at least clean up a lot of this mess.
P.
Or we might add locking to it, but IDK what was achieved with RCU here. In any case it's definitely bad that Nouveau has so many redundant and half-redundant mechanisms.
Yeah, agree messing with the locks even more won't help us here. Regards, Christian.
P.
P.
> > Regards, > Christian. > > > > > > P. > > > > > > > > > > > > > > Regards, > > > Christian. > > > > > > > > > > > > > > Replace the call to dma_fence_is_signaled() with > > > > nouveau_fence_base_is_signaled(). > > > > > > > > Cc: stable@vger.kernel.org # 4.10+, precise > > > > commit > > > > not > > > > to > > > > be > > > > determined > > > > Signed-off-by: Philipp Stanner phasta@kernel.org > > > > --- > > > > drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git > > > > a/drivers/gpu/drm/nouveau/nouveau_fence.c > > > > b/drivers/gpu/drm/nouveau/nouveau_fence.c > > > > index 7cc84472cece..33535987d8ed 100644 > > > > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c > > > > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c > > > > @@ -274,7 +274,7 @@ nouveau_fence_done(struct > > > > nouveau_fence > > > > *fence) > > > > nvif_event_block(&fctx- > > > > > event); > > > > spin_unlock_irqrestore(&fctx- > > > > >lock, > > > > flags); > > > > } > > > > - return dma_fence_is_signaled(&fence- > > > > >base); > > > > + return > > > > test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, > > > > &fence- > > > > > > > > > > > > > > base.flags); > > > > > > > > > > > > > } > > > > > > > > static long > > > > > > > > > >
On Mon, Apr 14, 2025 at 10:54:25AM +0200, Philipp Stanner wrote:
@Danilo: We have now 2 possible solutions for the firing WARN_ON floating.
Version A (Christian) Check in nouveau_fence_context_kill() whether a fence is already signaled before setting an error.
Version B (Me) This patch series here. Make sure that in Nouveau, only nouveau_fence_signal() signals fences.
Both should do the trick. Please share a maintainer-preference so I can move on here.
Thanks for working on this Philipp.
If you don't want to rework things entirely, A seems to be superior, since it also catches the case when someone else would call dma_fence_is_signaled() on a nouveau fence (which could happen at any time). This doesn't seem to be caught by B, right?
Am 14.04.25 um 16:27 schrieb Danilo Krummrich:
On Mon, Apr 14, 2025 at 10:54:25AM +0200, Philipp Stanner wrote:
@Danilo: We have now 2 possible solutions for the firing WARN_ON floating.
Version A (Christian) Check in nouveau_fence_context_kill() whether a fence is already signaled before setting an error.
Version B (Me) This patch series here. Make sure that in Nouveau, only nouveau_fence_signal() signals fences.
Both should do the trick. Please share a maintainer-preference so I can move on here.
Thanks for working on this Philipp.
If you don't want to rework things entirely, A seems to be superior, since it also catches the case when someone else would call dma_fence_is_signaled() on a nouveau fence (which could happen at any time). This doesn't seem to be caught by B, right?
Correct, yes. I would also keep it as simple as possible for backporting this bug fix.
On the other hand a rework is certainly appropriate including both nouveau as well as the DMA-fence calling rules. Especially that the DMA-fence framework calls the signaled callback with inconsistent locking is something we should fix.
Regards, Christian.
On Tue, 2025-04-15 at 11:56 +0200, Christian König wrote:
Am 14.04.25 um 16:27 schrieb Danilo Krummrich:
On Mon, Apr 14, 2025 at 10:54:25AM +0200, Philipp Stanner wrote:
@Danilo: We have now 2 possible solutions for the firing WARN_ON floating.
Version A (Christian) Check in nouveau_fence_context_kill() whether a fence is already signaled before setting an error.
Version B (Me) This patch series here. Make sure that in Nouveau, only nouveau_fence_signal() signals fences.
Both should do the trick. Please share a maintainer-preference so I can move on here.
Thanks for working on this Philipp.
If you don't want to rework things entirely, A seems to be superior, since it also catches the case when someone else would call dma_fence_is_signaled() on a nouveau fence (which could happen at any time). This doesn't seem to be caught by B, right?
Correct, yes. I would also keep it as simple as possible for backporting this bug fix.
On the other hand a rework is certainly appropriate including both nouveau as well as the DMA-fence calling rules. Especially that the DMA-fence framework calls the signaled callback with inconsistent locking is something we should fix.
Do you have a suggestion where to start?
I btw would still be interested in adding some sort of centralized mechanism in dma_fence that the driver could use to do some cleanup stuff once a fence gets signaled ^_^
P.
Regards, Christian.
linux-stable-mirror@lists.linaro.org