On Mon, Jul 24, 2017 at 11:51 AM, Christian König deathsimple@vodafone.de wrote:
On 24.07.2017 at 10:33, Daniel Vetter wrote:
On Fri, Jul 21, 2017 at 06:20:01PM +0200, Christian König wrote:
From: Christian König christian.koenig@amd.com
With hardware resets in mind it is possible that all shared fences are signaled, but the exclusive isn't. Fix waiting for everything in this situation.
How did you end up with both shared and exclusive fences on the same reservation object? At least I thought the point of exclusive was that it's exclusive (and has an implicit barrier on all previous shared fences). Same for shared fences, they need to wait for the exclusive one (and replace it).
Is this fallout from the amdgpu trickery where by default you do all shared fences? I thought we've aligned semantics a while back ...
No, that is perfectly normal even for other drivers. Take a look at the reservation code.
The exclusive fence replaces all shared fences, but adding a shared fence doesn't replace the exclusive fence. That actually makes sense, because when you want to add more shared fences those need to wait for the last exclusive fence as well.
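To illustrate the semantics, here is a deliberately simplified sketch; the real code lives in drivers/dma-buf/reservation.c and keeps the shared fences in an RCU-protected, dynamically sized list, so the struct and helper names below are made up for illustration only:

#include <linux/dma-fence.h>

/*
 * Simplified sketch of the reservation object semantics described above.
 * NOT the actual implementation from drivers/dma-buf/reservation.c;
 * names and layout are illustrative only.
 */
struct sketch_resv {
	struct dma_fence *excl;        /* exclusive (write) fence, may be NULL */
	struct dma_fence *shared[16];  /* shared (read) fences */
	unsigned int shared_count;
};

/* Adding an exclusive fence replaces all shared fences: the new exclusive
 * operation is ordered after everything previously in the object. */
static void sketch_add_excl(struct sketch_resv *resv, struct dma_fence *fence)
{
	unsigned int i;

	for (i = 0; i < resv->shared_count; i++)
		dma_fence_put(resv->shared[i]);
	resv->shared_count = 0;

	dma_fence_put(resv->excl);
	resv->excl = dma_fence_get(fence);
}

/* Adding a shared fence does not replace the exclusive fence: the shared
 * operation is expected to wait for the exclusive one by itself, but the
 * object keeps remembering both. */
static void sketch_add_shared(struct sketch_resv *resv, struct dma_fence *fence)
{
	resv->shared[resv->shared_count++] = dma_fence_get(fence);
}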
Hm right.
Now normally I would agree that when you have shared fences it is sufficient to wait for all of them cause those operations can't start before the exclusive one finishes. But with GPU reset and/or the ability to abort already submitted operations it is perfectly possible that you end up with an exclusive fence which isn't signaled and a shared fence which is signaled in the same reservation object.
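In other words, a waiter has to cover the exclusive fence explicitly, roughly like this; again only a sketch on top of the hypothetical structure above, not the actual reservation_object_wait_timeout_rcu() implementation:

/* Wait for everything in the (sketch) reservation object.  After a GPU
 * reset a shared fence may be force-completed while the exclusive fence
 * is still pending, so waiting only for the shared fences is not enough. */
static long sketch_resv_wait_all(struct sketch_resv *resv, long timeout)
{
	unsigned int i;

	for (i = 0; i < resv->shared_count && timeout > 0; i++)
		timeout = dma_fence_wait_timeout(resv->shared[i], false, timeout);

	/* Also wait for the exclusive fence, even when shared fences exist. */
	if (resv->excl && timeout > 0)
		timeout = dma_fence_wait_timeout(resv->excl, false, timeout);

	return timeout;
}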
How does that work? The batch(es) with the shared fence are all supposed to wait for the exclusive fence before they start, which means even if you gpu reset and restart/cancel certain things, they shouldn't be able to complete out of order.
If you outright cancel a fence then you're supposed to first call dma_fence_set_error(-EIO) and then complete it. Note that atm that part might be slightly overengineered and I'm not sure about how we expose stuff to userspace, e.g. dma_fence_set_error(-EAGAIN) is (or soon, has been) used by i915 for its internal book-keeping, which might not be the best to leak to other consumers. But completing fences (at least exported ones, where userspace or other drivers can get at them) shouldn't be possible.
-Daniel
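For reference, that cancellation pattern looks roughly like this; dma_fence_set_error() and dma_fence_signal() are the existing dma-fence helpers, while the wrapper function is just an illustration:

#include <linux/dma-fence.h>
#include <linux/errno.h>

/* Cancel a fence: record that it will never carry a valid result, then
 * signal it so waiters are released instead of blocking forever.  The
 * error must be set before the fence is signaled. */
static void cancel_fence_example(struct dma_fence *fence)
{
	dma_fence_set_error(fence, -EIO);
	dma_fence_signal(fence);
}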
On July 24, 2017 at 19:57, Daniel Vetter wrote:
How does that work? The batch(es) with the shared fence are all supposed to wait for the exclusive fence before they start, which means even if you gpu reset and restart/cancel certain things, they shouldn't be able to complete out of order.
Hi Daniel,
Do you mean the exclusive fence must be signalled before any shared fence? Where could I find this restriction?
Thanks, David Zhou
On Tue, Jul 25, 2017 at 02:16:55PM +0800, zhoucm1 wrote:
Hi Daniel,
Do you mean the exclusive fence must be signalled before any shared fence? Where could I find this restriction?
Yes, Christian also described it above. Could be that we should have better kerneldoc to document this ... -Daniel
On July 25, 2017 at 14:50, Daniel Vetter wrote:
Yes, Christian also described it above. Could be that we should have better kerneldoc to document this ...
Is that a known assumption? If so, it doesn't hurt if we always wait for the exclusive fence anyway, right? It's just one more line of checking.
Thanks, David Zhou
On Tue, Jul 25, 2017 at 02:55:14PM +0800, zhoucm1 wrote:
Is that a known assumption? If so, it doesn't hurt if we always wait for the exclusive fence anyway, right? It's just one more line of checking.
The problem is that amdgpu breaks that assumption over gpu reset, and that might have implications _everywhere_, not just in this code here. Are you sure this case won't pull the i915 driver over the table when sharing dma-bufs with it? Did you audit the code (plus all the other drivers too, of course)?
-Daniel
On July 25, 2017 at 15:02, Daniel Vetter wrote:
The problem is that amdgpu breaks that assumption over gpu reset, and that might have implications _everywhere_, not just in this code here. Are you sure this case won't pull the i915 driver over the table when sharing dma-bufs with it?
Ah, I finally see what you mean. So as you mentioned before, we should at least describe this assumption better; the best place would be a comment in reservation.c, so that everyone new learns about it when reading the code for the first time.
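Something along these lines could go into reservation.c; the wording below is only a suggestion, not an existing kernel comment:

/*
 * Implicit synchronization rules for reservation objects:
 *
 * - Adding an exclusive fence replaces all shared fences; the exclusive
 *   operation must not start before all previously added fences have
 *   signaled.
 * - Adding a shared fence does not replace the exclusive fence; the
 *   shared operation must not start before the exclusive fence has
 *   signaled.
 *
 * Normally this means the exclusive fence signals before any shared
 * fence.  After a GPU reset (or when already submitted work is aborted)
 * shared fences may be force-completed while the exclusive fence is
 * still pending, so waiters must not assume that waiting for the shared
 * fences implicitly covers the exclusive one.
 */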
Thanks, David Zhou