Quoting Christian König (2019-08-21 13:31:45)
@@ -117,17 +120,10 @@ i915_gem_busy_ioctl(struct drm_device *dev, void *data,
 		busy_check_writer(rcu_dereference(obj->base.resv->fence_excl));

 	/* Translate shared fences to READ set of engines */
-	list = rcu_dereference(obj->base.resv->fence);
-	if (list) {
-		unsigned int shared_count = list->shared_count, i;
-
-		for (i = 0; i < shared_count; ++i) {
-			struct dma_fence *fence =
-				rcu_dereference(list->shared[i]);
-
-			args->busy |= busy_check_reader(fence);
-		}
-	}
+	readers = dma_resv_fences_get_rcu(&obj->base.resv->readers);
+	dma_fence_array_for_each(fence, cursor, readers)
+		args->busy |= busy_check_reader(fence);
+	dma_fence_put(readers);
That's underwhelming; the full-mb shows up in scaling tests (I'll test the impact of this series later). Something like,
do {
	read = 0;
	fences = dma_resv_fences_get_deref(&obj->base.resv->readers);
	dma_fence_array_for_each(fence, cursor, fences)
		read |= busy_check_reader(fence);
	smp_rmb();
} while (dma_resv_fences_get_deref(obj->readers) != fences)

do {
	fences = dma_resv_fences_get_deref(&obj->base.resv->fences);
	write = busy_check_writer(fences);
	smp_rmb();
} while (dma_resv_fences_get_deref(obj->writes) != fences)

args->busy = write | read;
Perhaps? -Chris
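As a concrete model of the retry scheme sketched above, here is a minimal plain-C11 userspace sketch (not kernel code; the `published` pointer, the `fence_set` struct and the C11 atomics are stand-ins for the series' fence pointer, the dma_fence_array and smp_rmb()). The pattern is: snapshot the published pointer without taking a reference, read through it, then only accept the result if the pointer is still the same, retrying otherwise.

#include <stdatomic.h>
#include <stdio.h>

struct fence_set {
	int busy[2];		/* stand-in for per-fence busy flags */
	int count;
};

static struct fence_set set_a = { .busy = { 1, 0 }, .count = 2 };
static _Atomic(struct fence_set *) published = &set_a;	/* ~ resv->readers */

int main(void)
{
	struct fence_set *snap;
	int busy;

	do {
		busy = 0;
		/* snapshot the published pointer, no reference taken */
		snap = atomic_load_explicit(&published, memory_order_acquire);
		for (int i = 0; i < snap->count; i++)
			busy |= snap->busy[i];
		/* ~ smp_rmb(): order the reads above against the recheck */
		atomic_thread_fence(memory_order_acquire);
	} while (snap != atomic_load_explicit(&published, memory_order_acquire));

	printf("busy = %d\n", busy);
	return 0;
}

The point of the scheme is that the reader never touches a reference count, so the read side pays no full barrier or atomic RMW; the cost is a retry whenever a writer races with the read.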
Quoting Chris Wilson (2019-08-21 16:24:22)
Quoting Christian König (2019-08-21 13:31:45)
@@ -117,17 +120,10 @@ i915_gem_busy_ioctl(struct drm_device *dev, void *data,
 		busy_check_writer(rcu_dereference(obj->base.resv->fence_excl));

 	/* Translate shared fences to READ set of engines */
-	list = rcu_dereference(obj->base.resv->fence);
-	if (list) {
-		unsigned int shared_count = list->shared_count, i;
-
-		for (i = 0; i < shared_count; ++i) {
-			struct dma_fence *fence =
-				rcu_dereference(list->shared[i]);
-
-			args->busy |= busy_check_reader(fence);
-		}
-	}
+	readers = dma_resv_fences_get_rcu(&obj->base.resv->readers);
+	dma_fence_array_for_each(fence, cursor, readers)
+		args->busy |= busy_check_reader(fence);
+	dma_fence_put(readers);
That's underwhelming; the full-mb shows up in scaling tests (I'll test the impact of this series later). Something like,
To put some numbers to it, adding the full-mb adds 5ns to a single thread on Kabylake and 20ns under contention. -Chris
On 21.08.19 at 19:35, Chris Wilson wrote:
Quoting Chris Wilson (2019-08-21 16:24:22)
Quoting Christian König (2019-08-21 13:31:45)
@@ -117,17 +120,10 @@ i915_gem_busy_ioctl(struct drm_device *dev, void *data,
 		busy_check_writer(rcu_dereference(obj->base.resv->fence_excl));

 	/* Translate shared fences to READ set of engines */
-	list = rcu_dereference(obj->base.resv->fence);
-	if (list) {
-		unsigned int shared_count = list->shared_count, i;
-
-		for (i = 0; i < shared_count; ++i) {
-			struct dma_fence *fence =
-				rcu_dereference(list->shared[i]);
-
-			args->busy |= busy_check_reader(fence);
-		}
-	}
+	readers = dma_resv_fences_get_rcu(&obj->base.resv->readers);
+	dma_fence_array_for_each(fence, cursor, readers)
+		args->busy |= busy_check_reader(fence);
+	dma_fence_put(readers);
That's underwhelming; the full-mb shows up in scaling tests (I'll test the impact of this series later). Something like,
To put some numbers to it, adding the full-mb adds 5ns to a single thread on Kabylake and 20ns under contention.
The question is whether that's the use case we want to optimize for.
Querying a buffer for busyness is something we do only rarely on amdgpu; e.g., IIRC we even grab the full reservation lock for this.
But adding new fences comes with every command submission.
What could maybe work is the "do { } while (fence has changed);" loop you suggested earlier in this mail thread, but I need to double-check whether that would really work or clash with recycling dma_fence_arrays().
Christian.
-Chris
On 22.08.19 at 10:37, Christian König wrote:
On 21.08.19 at 19:35, Chris Wilson wrote:
Quoting Chris Wilson (2019-08-21 16:24:22)
Quoting Christian König (2019-08-21 13:31:45)
@@ -117,17 +120,10 @@ i915_gem_busy_ioctl(struct drm_device *dev, void *data,
 		busy_check_writer(rcu_dereference(obj->base.resv->fence_excl));

 	/* Translate shared fences to READ set of engines */
-	list = rcu_dereference(obj->base.resv->fence);
-	if (list) {
-		unsigned int shared_count = list->shared_count, i;
-
-		for (i = 0; i < shared_count; ++i) {
-			struct dma_fence *fence =
-				rcu_dereference(list->shared[i]);
-
-			args->busy |= busy_check_reader(fence);
-		}
-	}
+	readers = dma_resv_fences_get_rcu(&obj->base.resv->readers);
+	dma_fence_array_for_each(fence, cursor, readers)
+		args->busy |= busy_check_reader(fence);
+	dma_fence_put(readers);
That's underwhelming; the full-mb shows up in scaling tests (I'll test the impact of this series later). Something like,
To put some numbers to it, adding the full-mb adds 5ns to a single thread on Kabylake and 20ns under contention.
The question is whether that's the use case we want to optimize for.
Querying a buffer for busyness is something we do only rarely on amdgpu; e.g., IIRC we even grab the full reservation lock for this.
But adding new fences comes with every command submission.
What could maybe work is the "do { } while (fence has changed);" loop you suggested earlier in this mail thread, but I need to double-check whether that would really work or clash with recycling dma_fence_arrays().
Ok, thinking about it some more, that won't work either.
See, the dma_fence_array is only guaranteed not to change while you hold a reference to it. Otherwise we don't guarantee anything.
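To make the failure mode concrete, here is a minimal userspace model of the hazard (plain C, not kernel code; the struct and names are invented for the example, and the writer's interleaving is simulated sequentially so the program stays deterministic): without holding a reference, the array object can be recycled in place between the reader's accesses, and the final pointer-equality recheck cannot detect it.

#include <stdio.h>

/* Stand-in for a dma_fence_array: just two integer "fences". */
struct fake_fence_array {
	int fences[2];
	int num_fences;
};

/* One backing object that gets recycled in place, plus the published pointer. */
static struct fake_fence_array slot;
static struct fake_fence_array *readers = &slot;

int main(void)
{
	/* Initial contents, as published by the last writer. */
	slot = (struct fake_fence_array){ .fences = { 1, 2 }, .num_fences = 2 };

	/* Reader: snapshot the pointer WITHOUT taking a reference and read
	 * the first fence (sees the old contents). */
	struct fake_fence_array *snap = readers;
	int first = snap->fences[0];

	/* Writer interleaves: the array's refcount drops to zero, the object
	 * is recycled with unrelated fences and republished at the SAME
	 * address. */
	slot = (struct fake_fence_array){ .fences = { 7, 8 }, .num_fences = 2 };
	readers = &slot;

	/* Reader: read the second fence (now sees the new contents). */
	int second = snap->fences[1];

	/* Reader: the "did the pointer change?" recheck still passes, so the
	 * torn { old, new } snapshot is wrongly accepted. */
	if (snap == readers)
		printf("recheck passed, snapshot = { %d, %d }\n", first, second);

	return 0;
}

Holding a real reference (as the dma_resv_fences_get_rcu() path above does) closes this window, which is exactly the cost being debated in this thread.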
Christian.
Christian.
-Chris