Quoting Christian König (2017-09-04 14:27:33)
From: Christian König christian.koenig@amd.com
The logic is buggy and unnecessary complex. When dma_fence_get_rcu() fails to acquire a reference it doesn't necessary mean that there is no fence at all.
It usually mean that the fence was replaced by a new one and in this situation we certainly want to have the new one as result and *NOT* NULL.
Which is not guaranteed by the code you wrote either.
The point of the comment is that the mb is only inside the successful kref_atomic_inc_unless_zero, and that only after that mb do you know whether or not you have the current fence.
You can argue that you want to replace the if (!dma_fence_get_rcu()) return NULL with if (!dma_fence_get_rcu() continue; but it would be incorrect to say that by simply ignoring the post-condition check that you do have the right fence. -Chris
Sorry for the delayed response, but your mail somehow ended up in the Spam folder.
Am 04.09.2017 um 15:40 schrieb Chris Wilson:
Quoting Christian König (2017-09-04 14:27:33)
From: Christian König christian.koenig@amd.com
The logic is buggy and unnecessary complex. When dma_fence_get_rcu() fails to acquire a reference it doesn't necessary mean that there is no fence at all.
It usually mean that the fence was replaced by a new one and in this situation we certainly want to have the new one as result and *NOT* NULL.
Which is not guaranteed by the code you wrote either.
The point of the comment is that the mb is only inside the successful kref_atomic_inc_unless_zero, and that only after that mb do you know whether or not you have the current fence.
You can argue that you want to replace the if (!dma_fence_get_rcu()) return NULL with if (!dma_fence_get_rcu() continue; but it would be incorrect to say that by simply ignoring the post-condition check that you do have the right fence.
You are completely missing the point here.
It is irrelevant if you have the current fence or not when you return. You can only guarantee that it is the current fence when you take a look and that is exactly what we want to avoid.
So the existing code is complete nonsense. Instead what we need to guarantee is that we return *ANY* fence which we can grab a reference for.
See the usual life of a fence* variable looks like this: 1. assigning pointer to fence A; 2. assigning pointer to fence B; 3. assigning pointer to fence C; ....
When dma_fence_get_rcu_safe() is called between step #1 and step #2 for example it is perfectly valid to just return either fence A or fence B.
But it is invalid to return NULL because that suggests that we don't need to sync at all.
Regards, Christian.
Quoting Christian König (2017-09-11 09:50:40)
Sorry for the delayed response, but your mail somehow ended up in the Spam folder.
Am 04.09.2017 um 15:40 schrieb Chris Wilson:
Quoting Christian König (2017-09-04 14:27:33)
From: Christian König christian.koenig@amd.com
The logic is buggy and unnecessary complex. When dma_fence_get_rcu() fails to acquire a reference it doesn't necessary mean that there is no fence at all.
It usually mean that the fence was replaced by a new one and in this situation we certainly want to have the new one as result and *NOT* NULL.
Which is not guaranteed by the code you wrote either.
The point of the comment is that the mb is only inside the successful kref_atomic_inc_unless_zero, and that only after that mb do you know whether or not you have the current fence.
You can argue that you want to replace the if (!dma_fence_get_rcu()) return NULL with if (!dma_fence_get_rcu() continue; but it would be incorrect to say that by simply ignoring the post-condition check that you do have the right fence.
You are completely missing the point here.
It is irrelevant if you have the current fence or not when you return. You can only guarantee that it is the current fence when you take a look and that is exactly what we want to avoid.
So the existing code is complete nonsense. Instead what we need to guarantee is that we return *ANY* fence which we can grab a reference for.
Not quite. We can grab a reference on a fence that was already freed and reused between the rcu_dereference() and dma_fence_get_rcu(). -Chris
Am 11.09.2017 um 10:59 schrieb Chris Wilson:
Quoting Christian König (2017-09-11 09:50:40)
Sorry for the delayed response, but your mail somehow ended up in the Spam folder.
Am 04.09.2017 um 15:40 schrieb Chris Wilson:
Quoting Christian König (2017-09-04 14:27:33)
From: Christian König christian.koenig@amd.com
The logic is buggy and unnecessary complex. When dma_fence_get_rcu() fails to acquire a reference it doesn't necessary mean that there is no fence at all.
It usually mean that the fence was replaced by a new one and in this situation we certainly want to have the new one as result and *NOT* NULL.
Which is not guaranteed by the code you wrote either.
The point of the comment is that the mb is only inside the successful kref_atomic_inc_unless_zero, and that only after that mb do you know whether or not you have the current fence.
You can argue that you want to replace the if (!dma_fence_get_rcu()) return NULL with if (!dma_fence_get_rcu() continue; but it would be incorrect to say that by simply ignoring the post-condition check that you do have the right fence.
You are completely missing the point here.
It is irrelevant if you have the current fence or not when you return. You can only guarantee that it is the current fence when you take a look and that is exactly what we want to avoid.
So the existing code is complete nonsense. Instead what we need to guarantee is that we return *ANY* fence which we can grab a reference for.
Not quite. We can grab a reference on a fence that was already freed and reused between the rcu_dereference() and dma_fence_get_rcu().
Reusing a memory structure before the RCU grace period is completed is illegal, otherwise the whole RCU approach won't work.
So that case can't happen.
Regards, Christian.
-Chris
Quoting Christian König (2017-09-11 10:06:50)
Am 11.09.2017 um 10:59 schrieb Chris Wilson:
Quoting Christian König (2017-09-11 09:50:40)
Sorry for the delayed response, but your mail somehow ended up in the Spam folder.
Am 04.09.2017 um 15:40 schrieb Chris Wilson:
Quoting Christian König (2017-09-04 14:27:33)
From: Christian König christian.koenig@amd.com
The logic is buggy and unnecessary complex. When dma_fence_get_rcu() fails to acquire a reference it doesn't necessary mean that there is no fence at all.
It usually mean that the fence was replaced by a new one and in this situation we certainly want to have the new one as result and *NOT* NULL.
Which is not guaranteed by the code you wrote either.
The point of the comment is that the mb is only inside the successful kref_atomic_inc_unless_zero, and that only after that mb do you know whether or not you have the current fence.
You can argue that you want to replace the if (!dma_fence_get_rcu()) return NULL with if (!dma_fence_get_rcu() continue; but it would be incorrect to say that by simply ignoring the post-condition check that you do have the right fence.
You are completely missing the point here.
It is irrelevant if you have the current fence or not when you return. You can only guarantee that it is the current fence when you take a look and that is exactly what we want to avoid.
So the existing code is complete nonsense. Instead what we need to guarantee is that we return *ANY* fence which we can grab a reference for.
Not quite. We can grab a reference on a fence that was already freed and reused between the rcu_dereference() and dma_fence_get_rcu().
Reusing a memory structure before the RCU grace period is completed is illegal, otherwise the whole RCU approach won't work.
RCU only protects that the pointer remains valid. If you use SLAB_TYPESAFE_BY_RCU, it is possible to reuse the pointer within a grace period. It does happen and that is the point the comment is trying to make. -Chris
Am 11.09.2017 um 11:23 schrieb Chris Wilson:
Quoting Christian König (2017-09-11 10:06:50)
Am 11.09.2017 um 10:59 schrieb Chris Wilson:
Quoting Christian König (2017-09-11 09:50:40)
Sorry for the delayed response, but your mail somehow ended up in the Spam folder.
Am 04.09.2017 um 15:40 schrieb Chris Wilson:
Quoting Christian König (2017-09-04 14:27:33)
From: Christian König christian.koenig@amd.com
The logic is buggy and unnecessary complex. When dma_fence_get_rcu() fails to acquire a reference it doesn't necessary mean that there is no fence at all.
It usually mean that the fence was replaced by a new one and in this situation we certainly want to have the new one as result and *NOT* NULL.
Which is not guaranteed by the code you wrote either.
The point of the comment is that the mb is only inside the successful kref_atomic_inc_unless_zero, and that only after that mb do you know whether or not you have the current fence.
You can argue that you want to replace the if (!dma_fence_get_rcu()) return NULL with if (!dma_fence_get_rcu() continue; but it would be incorrect to say that by simply ignoring the post-condition check that you do have the right fence.
You are completely missing the point here.
It is irrelevant if you have the current fence or not when you return. You can only guarantee that it is the current fence when you take a look and that is exactly what we want to avoid.
So the existing code is complete nonsense. Instead what we need to guarantee is that we return *ANY* fence which we can grab a reference for.
Not quite. We can grab a reference on a fence that was already freed and reused between the rcu_dereference() and dma_fence_get_rcu().
Reusing a memory structure before the RCU grace period is completed is illegal, otherwise the whole RCU approach won't work.
RCU only protects that the pointer remains valid. If you use SLAB_TYPESAFE_BY_RCU, it is possible to reuse the pointer within a grace period. It does happen and that is the point the comment is trying to make.
Yeah, but that is illegal with a fence objects.
When anybody allocates fences this way it breaks at least reservation_object_get_fences_rcu(), reservation_object_wait_timeout_rcu() and reservation_object_test_signaled_single().
Cause all of them rely on dma_fence_get() to return NULL when the fence isn't valid any more to restart the operation.
When dma_fence_get_rcu() returns a reallocated fence the operation wouldn't correctly restart and the end result most likely not be correct at all.
Using SLAB_TYPESAFE_BY_RCU is only valid if you can ensure that you have the right object using a second criteria and that is not the case with fences.
Regards, Christian.
Quoting Christian König (2017-09-11 10:57:57)
Am 11.09.2017 um 11:23 schrieb Chris Wilson:
Quoting Christian König (2017-09-11 10:06:50)
Am 11.09.2017 um 10:59 schrieb Chris Wilson:
Quoting Christian König (2017-09-11 09:50:40)
Sorry for the delayed response, but your mail somehow ended up in the Spam folder.
Am 04.09.2017 um 15:40 schrieb Chris Wilson:
Quoting Christian König (2017-09-04 14:27:33) > From: Christian König christian.koenig@amd.com > > The logic is buggy and unnecessary complex. When dma_fence_get_rcu() fails to > acquire a reference it doesn't necessary mean that there is no fence at all. > > It usually mean that the fence was replaced by a new one and in this situation > we certainly want to have the new one as result and *NOT* NULL. Which is not guaranteed by the code you wrote either.
The point of the comment is that the mb is only inside the successful kref_atomic_inc_unless_zero, and that only after that mb do you know whether or not you have the current fence.
You can argue that you want to replace the if (!dma_fence_get_rcu()) return NULL with if (!dma_fence_get_rcu() continue; but it would be incorrect to say that by simply ignoring the post-condition check that you do have the right fence.
You are completely missing the point here.
It is irrelevant if you have the current fence or not when you return. You can only guarantee that it is the current fence when you take a look and that is exactly what we want to avoid.
So the existing code is complete nonsense. Instead what we need to guarantee is that we return *ANY* fence which we can grab a reference for.
Not quite. We can grab a reference on a fence that was already freed and reused between the rcu_dereference() and dma_fence_get_rcu().
Reusing a memory structure before the RCU grace period is completed is illegal, otherwise the whole RCU approach won't work.
RCU only protects that the pointer remains valid. If you use SLAB_TYPESAFE_BY_RCU, it is possible to reuse the pointer within a grace period. It does happen and that is the point the comment is trying to make.
Yeah, but that is illegal with a fence objects.
When anybody allocates fences this way it breaks at least reservation_object_get_fences_rcu(), reservation_object_wait_timeout_rcu() and reservation_object_test_signaled_single().
Many, many months ago I sent patches to fix them all. -Chris
Am 11.09.2017 um 12:01 schrieb Chris Wilson:
[SNIP]
Yeah, but that is illegal with a fence objects.
When anybody allocates fences this way it breaks at least reservation_object_get_fences_rcu(), reservation_object_wait_timeout_rcu() and reservation_object_test_signaled_single().
Many, many months ago I sent patches to fix them all.
Found those after a bit a searching. Yeah, those patches where proposed more than a year ago, but never pushed upstream.
Not sure if we really should go this way. dma_fence objects are shared between drivers and since we can't judge if it's the correct fence based on a criteria in the object (only the read counter which is outside) all drivers need to be correct for this.
I would rather go the way and change dma_fence_release() to wrap fence->ops->release into call_rcu() to keep the whole RCU handling outside of the individual drivers.
Regards, Christian.
On Mon, Sep 11, 2017 at 01:06:32PM +0200, Christian König wrote:
Am 11.09.2017 um 12:01 schrieb Chris Wilson:
[SNIP]
Yeah, but that is illegal with a fence objects.
When anybody allocates fences this way it breaks at least reservation_object_get_fences_rcu(), reservation_object_wait_timeout_rcu() and reservation_object_test_signaled_single().
Many, many months ago I sent patches to fix them all.
Found those after a bit a searching. Yeah, those patches where proposed more than a year ago, but never pushed upstream.
Not sure if we really should go this way. dma_fence objects are shared between drivers and since we can't judge if it's the correct fence based on a criteria in the object (only the read counter which is outside) all drivers need to be correct for this.
I would rather go the way and change dma_fence_release() to wrap fence->ops->release into call_rcu() to keep the whole RCU handling outside of the individual drivers.
Hm, I entirely dropped the ball on this, I kinda assumed that we managed to get some agreement on this between i915 and dma_fence. Adding a pile more people.
Joonas, Tvrtko, I guess we need to fix this one way or the other. -Daniel
Am 20.09.2017 um 20:20 schrieb Daniel Vetter:
On Mon, Sep 11, 2017 at 01:06:32PM +0200, Christian König wrote:
Am 11.09.2017 um 12:01 schrieb Chris Wilson:
[SNIP]
Yeah, but that is illegal with a fence objects.
When anybody allocates fences this way it breaks at least reservation_object_get_fences_rcu(), reservation_object_wait_timeout_rcu() and reservation_object_test_signaled_single().
Many, many months ago I sent patches to fix them all.
Found those after a bit a searching. Yeah, those patches where proposed more than a year ago, but never pushed upstream.
Not sure if we really should go this way. dma_fence objects are shared between drivers and since we can't judge if it's the correct fence based on a criteria in the object (only the read counter which is outside) all drivers need to be correct for this.
I would rather go the way and change dma_fence_release() to wrap fence->ops->release into call_rcu() to keep the whole RCU handling outside of the individual drivers.
Hm, I entirely dropped the ball on this, I kinda assumed that we managed to get some agreement on this between i915 and dma_fence. Adding a pile more people.
For the meantime I've send a v2 of this patch to fix at least the buggy return of NULL when we fail to grab the RCU reference but keeping the extra checking for now.
Can I get an rb on this please so that we fix at least the bug at hand?
Thanks, Christian.
Joonas, Tvrtko, I guess we need to fix this one way or the other. -Daniel
Op 21-09-17 om 09:00 schreef Christian König:
Am 20.09.2017 um 20:20 schrieb Daniel Vetter:
On Mon, Sep 11, 2017 at 01:06:32PM +0200, Christian König wrote:
Am 11.09.2017 um 12:01 schrieb Chris Wilson:
[SNIP]
Yeah, but that is illegal with a fence objects.
When anybody allocates fences this way it breaks at least reservation_object_get_fences_rcu(), reservation_object_wait_timeout_rcu() and reservation_object_test_signaled_single().
Many, many months ago I sent patches to fix them all.
Found those after a bit a searching. Yeah, those patches where proposed more than a year ago, but never pushed upstream.
Not sure if we really should go this way. dma_fence objects are shared between drivers and since we can't judge if it's the correct fence based on a criteria in the object (only the read counter which is outside) all drivers need to be correct for this.
I would rather go the way and change dma_fence_release() to wrap fence->ops->release into call_rcu() to keep the whole RCU handling outside of the individual drivers.
Hm, I entirely dropped the ball on this, I kinda assumed that we managed to get some agreement on this between i915 and dma_fence. Adding a pile more people.
For the meantime I've send a v2 of this patch to fix at least the buggy return of NULL when we fail to grab the RCU reference but keeping the extra checking for now.
Can I get an rb on this please so that we fix at least the bug at hand?
Thanks, Christian.
Done.
On Wed, 2017-09-20 at 20:20 +0200, Daniel Vetter wrote:
On Mon, Sep 11, 2017 at 01:06:32PM +0200, Christian König wrote:
Am 11.09.2017 um 12:01 schrieb Chris Wilson:
[SNIP]
Yeah, but that is illegal with a fence objects.
When anybody allocates fences this way it breaks at least reservation_object_get_fences_rcu(), reservation_object_wait_timeout_rcu() and reservation_object_test_signaled_single().
Many, many months ago I sent patches to fix them all.
Found those after a bit a searching. Yeah, those patches where proposed more than a year ago, but never pushed upstream.
Not sure if we really should go this way. dma_fence objects are shared between drivers and since we can't judge if it's the correct fence based on a criteria in the object (only the read counter which is outside) all drivers need to be correct for this.
I would rather go the way and change dma_fence_release() to wrap fence->ops->release into call_rcu() to keep the whole RCU handling outside of the individual drivers.
Hm, I entirely dropped the ball on this, I kinda assumed that we managed to get some agreement on this between i915 and dma_fence. Adding a pile more people.
Joonas, Tvrtko, I guess we need to fix this one way or the other.
I definitely didn't get the memo or notice this before. Tvrtko/Chris?
Regars, Joonas
linaro-mm-sig@lists.linaro.org