Am 23.02.20 um 12:56 schrieb Pan, Xinhui:
If shared fence list is not empty, even we want to test all fences, excl fence is ignored. That is abviously wrong, so fix it.
Yeah that is a known issue and I completely agree with you, but other disagree.
See the shared fences are meant to depend on the exclusive fence. So all shared fences must finish only after the exclusive one has finished as well.
The problem now is that for error handling this isn't necessary true. In other words when a shared fence completes with an error it is perfectly possible that he does this before the exclusive fence is finished.
I'm trying to convince Daniel that this is a problem for years :)
Regards, Christian.
Signed-off-by: xinhui pan xinhui.pan@amd.com
drivers/dma-buf/dma-resv.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c index 4264e64788c4..44dc64c547c6 100644 --- a/drivers/dma-buf/dma-resv.c +++ b/drivers/dma-buf/dma-resv.c @@ -632,14 +632,14 @@ static inline int dma_resv_test_signaled_single(struct dma_fence *passed_fence) */ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all) {
- unsigned seq, shared_count;
- unsigned int seq, shared_count, left; int ret;
rcu_read_lock(); retry: ret = true; shared_count = 0;
- seq = read_seqcount_begin(&obj->seq);
- left = seq = read_seqcount_begin(&obj->seq);
if (test_all) { unsigned i; @@ -647,7 +647,7 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all) struct dma_resv_list *fobj = rcu_dereference(obj->fence); if (fobj)
shared_count = fobj->shared_count;
left = shared_count = fobj->shared_count;
for (i = 0; i < shared_count; ++i) { struct dma_fence *fence = rcu_dereference(fobj->shared[i]); @@ -657,13 +657,14 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all) goto retry; else if (!ret) break;
}left--;
if (read_seqcount_retry(&obj->seq, seq)) goto retry; }
- if (!shared_count) {
- if (!left) { struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
if (fence_excl) {
On Sun, Feb 23, 2020 at 01:04:15PM +0100, Christian König wrote:
Am 23.02.20 um 12:56 schrieb Pan, Xinhui:
If shared fence list is not empty, even we want to test all fences, excl fence is ignored. That is abviously wrong, so fix it.
Yeah that is a known issue and I completely agree with you, but other disagree.
See the shared fences are meant to depend on the exclusive fence. So all shared fences must finish only after the exclusive one has finished as well.
The problem now is that for error handling this isn't necessary true. In other words when a shared fence completes with an error it is perfectly possible that he does this before the exclusive fence is finished.
I'm trying to convince Daniel that this is a problem for years :)
I thought the consensus is that reasonable gpu schedulers and gpu reset code should try to make really, really sure it only completes stuff in sequence? That's at least my take away from the syncobj timeline discussion, where you convinced me we shouldn't just crash&burn.
I think as long as your scheduler is competent and your gpu reset tries to limit damage (i.e. kill offending ctx terminally, mark everything else that didn't complete for re-running) we should end up with everything completing in sequence. I guess if you do kill a lot more stuff, then you'd have to push these through your scheduler as dummy jobs, i.e. they still wait for their dependencies, but then all they do is set the dma_fence error and complete it. Maybe something the common scheduler could do. -Daniel
Regards, Christian.
Signed-off-by: xinhui pan xinhui.pan@amd.com
drivers/dma-buf/dma-resv.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c index 4264e64788c4..44dc64c547c6 100644 --- a/drivers/dma-buf/dma-resv.c +++ b/drivers/dma-buf/dma-resv.c @@ -632,14 +632,14 @@ static inline int dma_resv_test_signaled_single(struct dma_fence *passed_fence) */ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all) {
- unsigned seq, shared_count;
- unsigned int seq, shared_count, left; int ret; rcu_read_lock(); retry: ret = true; shared_count = 0;
- seq = read_seqcount_begin(&obj->seq);
- left = seq = read_seqcount_begin(&obj->seq); if (test_all) { unsigned i;
@@ -647,7 +647,7 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all) struct dma_resv_list *fobj = rcu_dereference(obj->fence); if (fobj)
shared_count = fobj->shared_count;
for (i = 0; i < shared_count; ++i) { struct dma_fence *fence = rcu_dereference(fobj->shared[i]);left = shared_count = fobj->shared_count;
@@ -657,13 +657,14 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all) goto retry; else if (!ret) break;
} if (read_seqcount_retry(&obj->seq, seq)) goto retry; }left--;
- if (!shared_count) {
- if (!left) { struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl); if (fence_excl) {
Am 25.02.20 um 18:23 schrieb Daniel Vetter:
On Sun, Feb 23, 2020 at 01:04:15PM +0100, Christian König wrote:
Am 23.02.20 um 12:56 schrieb Pan, Xinhui:
If shared fence list is not empty, even we want to test all fences, excl fence is ignored. That is abviously wrong, so fix it.
Yeah that is a known issue and I completely agree with you, but other disagree.
See the shared fences are meant to depend on the exclusive fence. So all shared fences must finish only after the exclusive one has finished as well.
The problem now is that for error handling this isn't necessary true. In other words when a shared fence completes with an error it is perfectly possible that he does this before the exclusive fence is finished.
I'm trying to convince Daniel that this is a problem for years :)
I thought the consensus is that reasonable gpu schedulers and gpu reset code should try to make really, really sure it only completes stuff in sequence? That's at least my take away from the syncobj timeline discussion, where you convinced me we shouldn't just crash&burn.
I think as long as your scheduler is competent and your gpu reset tries to limit damage (i.e. kill offending ctx terminally, mark everything else that didn't complete for re-running) we should end up with everything completing in sequence. I guess if you do kill a lot more stuff, then you'd have to push these through your scheduler as dummy jobs, i.e. they still wait for their dependencies, but then all they do is set the dma_fence error and complete it. Maybe something the common scheduler could do.
Yes, that's exactly how we currently implement it. But I still think that this is not necessary the best approach :)
Anyway Xinhui's problem turned out to be deeper. We somehow add an old stale fence to the dma_resv object sometimes and that can result in quite a bunch of problems.
I'm currently trying to hunt down what's going wrong here in more detail.
Regards, Christian.
-Daniel
Regards, Christian.
Signed-off-by: xinhui pan xinhui.pan@amd.com
drivers/dma-buf/dma-resv.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c index 4264e64788c4..44dc64c547c6 100644 --- a/drivers/dma-buf/dma-resv.c +++ b/drivers/dma-buf/dma-resv.c @@ -632,14 +632,14 @@ static inline int dma_resv_test_signaled_single(struct dma_fence *passed_fence) */ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all) {
- unsigned seq, shared_count;
- unsigned int seq, shared_count, left; int ret; rcu_read_lock(); retry: ret = true; shared_count = 0;
- seq = read_seqcount_begin(&obj->seq);
- left = seq = read_seqcount_begin(&obj->seq); if (test_all) { unsigned i;
@@ -647,7 +647,7 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all) struct dma_resv_list *fobj = rcu_dereference(obj->fence); if (fobj)
shared_count = fobj->shared_count;
left = shared_count = fobj->shared_count; for (i = 0; i < shared_count; ++i) { struct dma_fence *fence = rcu_dereference(fobj->shared[i]);
@@ -657,13 +657,14 @@ bool dma_resv_test_signaled_rcu(struct dma_resv *obj, bool test_all) goto retry; else if (!ret) break;
}left--; } if (read_seqcount_retry(&obj->seq, seq)) goto retry;
- if (!shared_count) {
- if (!left) { struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl); if (fence_excl) {
linaro-mm-sig@lists.linaro.org