Currently we install a callback for performing poll on a dma-buf, irrespective of the timeout. This involves taking a spinlock, as well as unnecessary work, and greatly reduces scaling of poll(.timeout=0) across multiple threads.
We can query whether the poll will block prior to installing the callback to make the busy-query fast.
Single thread: 60% faster
8 threads on 4 (+4 HT) cores: 600% faster
Still not quite the perfect scaling we get with a native busy ioctl, but poll(dmabuf) is faster due to the quicker lookup of the object and avoiding drm_ioctl().
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Sumit Semwal sumit.semwal@linaro.org
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
---
 drivers/dma-buf/dma-buf.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index cf04d249a6a4..c7a7bc579941 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -156,6 +156,18 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
 	if (!events)
 		return 0;
 
+	if (poll_does_not_wait(poll)) {
+		if (events & POLLOUT &&
+		    !reservation_object_test_signaled_rcu(resv, true))
+			events &= ~(POLLOUT | POLLIN);
+
+		if (events & POLLIN &&
+		    !reservation_object_test_signaled_rcu(resv, false))
+			events &= ~POLLIN;
+
+		return events;
+	}
+
 retry:
 	seq = read_seqcount_begin(&resv->seq);
 	rcu_read_lock();
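As a rough illustration of the userspace busy query this patch targets, the sketch below wraps poll() with a zero timeout. The helper names fd_is_busy() and demo_with_pipe() are made up for this example, and a pipe stands in for a dma-buf fd so that the snippet runs anywhere. The POLLIN/POLLOUT mapping in the comment is read from the hunk above, not from any documentation.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Nonblocking busy query: poll() with a zero timeout never sleeps, so the
 * answer only reflects the state at the moment of the call.
 *
 * Assumption taken from the hunk above: on a dma-buf fd, POLLIN is cleared
 * while the exclusive fence is unsignaled, and POLLOUT is cleared while any
 * fence is unsignaled.
 */
static bool fd_is_busy(int fd, short events)
{
	struct pollfd pfd = { .fd = fd, .events = events, .revents = 0 };
	int ret;

	assert(fd >= 0);

	ret = poll(&pfd, 1, 0);	/* timeout == 0: return immediately */
	if (ret < 0)
		return true;	/* treat errors as busy; real callers should check errno */

	return (pfd.revents & events) == 0;
}

/* Stand-in demo: the read end of a pipe is "busy" until data is written. */
static int demo_with_pipe(void)
{
	int p[2];
	int ok;

	if (pipe(p) < 0)
		return -1;

	ok = fd_is_busy(p[0], POLLIN);		/* empty pipe: nothing to read */
	ok = ok && write(p[1], "x", 1) == 1;
	ok = ok && !fd_is_busy(p[0], POLLIN);	/* data queued: readable */

	close(p[0]);
	close(p[1]);
	return ok ? 0 : -1;
}
```

With a real dma-buf, a caller would pass the exported fd instead of the pipe. Whether the result can be trusted without signaling enabled is exactly what the replies in this thread dispute.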
If we are being polled with a timeout of zero, a nonblocking busy query, we don't need to install any fence callbacks as we will not be waiting. As we only install the callback once, the overhead comes from the atomic bit test that also causes serialisation between threads.
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Sumit Semwal sumit.semwal@linaro.org
Cc: Gustavo Padovan gustavo@padovan.org
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
---
 drivers/dma-buf/sync_file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/dma-buf/sync_file.c b/drivers/dma-buf/sync_file.c
index 486d29c1a830..abb5fdab75fd 100644
--- a/drivers/dma-buf/sync_file.c
+++ b/drivers/dma-buf/sync_file.c
@@ -306,7 +306,8 @@ static unsigned int sync_file_poll(struct file *file, poll_table *wait)
 
 	poll_wait(file, &sync_file->wq, wait);
 
-	if (!test_and_set_bit(POLL_ENABLED, &sync_file->fence->flags)) {
+	if (!poll_does_not_wait(wait) &&
+	    !test_and_set_bit(POLL_ENABLED, &sync_file->fence->flags)) {
 		if (fence_add_callback(sync_file->fence, &sync_file->cb,
 				       fence_check_cb_func) < 0)
 			wake_up_all(&sync_file->wq);
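The ordering in this hunk can be modelled in userspace with C11 atomics. This is only a sketch: should_install_callback() and POLL_ENABLED_BIT are hypothetical stand-ins, and atomic_fetch_or() plays the role of the kernel's test_and_set_bit(). The point it tries to show is that the cheap check runs first, so a nonblocking caller never performs the atomic read-modify-write that threads would otherwise contend on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define POLL_ENABLED_BIT (1u << 0)	/* hypothetical stand-in for POLL_ENABLED */

/*
 * Returns true if the caller should install the wake-up callback.
 * The atomic update is skipped entirely when the poll will not wait.
 */
static bool should_install_callback(bool poll_will_not_wait, atomic_uint *flags)
{
	unsigned int old;

	assert(flags != NULL);

	if (poll_will_not_wait)
		return false;	/* busy query: leave the shared flag untouched */

	/* Atomically set the bit and learn whether it was already set. */
	old = atomic_fetch_or(flags, POLL_ENABLED_BIT);
	return !(old & POLL_ENABLED_BIT);
}
```

In this model only the first waiting caller is told to install the callback. Later waiters and all nonblocking callers get false, and nonblocking callers do so without writing to the shared word.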
Hi Chris,
2016-08-29 Chris Wilson chris@chris-wilson.co.uk:
If we are being polled with a timeout of zero, a nonblocking busy query, we don't need to install any fence callbacks as we will not be waiting. As we only install the callback once, the overhead comes from the atomic bit test that also causes serialisation between threads.
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Sumit Semwal sumit.semwal@linaro.org
Cc: Gustavo Padovan gustavo@padovan.org
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
drivers/dma-buf/sync_file.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Indeed, we can shortcut this.
Reviewed-by: Gustavo Padovan gustavo.padovan@collabora.co.uk
Gustavo
Hi Chris,
On 29 August 2016 at 23:56, Gustavo Padovan gustavo@padovan.org wrote:
Hi Chris,
2016-08-29 Chris Wilson chris@chris-wilson.co.uk:
If we are being polled with a timeout of zero, a nonblocking busy query, we don't need to install any fence callbacks as we will not be waiting. As we only install the callback once, the overhead comes from the atomic bit test that also causes serialisation between threads.
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Sumit Semwal sumit.semwal@linaro.org
Cc: Gustavo Padovan gustavo@padovan.org
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
drivers/dma-buf/sync_file.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Indeed, we can shortcut this.
Reviewed-by: Gustavo Padovan gustavo.padovan@collabora.co.uk
Gustavo
Thanks; pushed to drm-misc.
Best, Sumit.
On Mon, Aug 29, 2016 at 08:08:34AM +0100, Chris Wilson wrote:
Currently we install a callback for performing poll on a dma-buf, irrespective of the timeout. This involves taking a spinlock, as well as unnecessary work, and greatly reduces scaling of poll(.timeout=0) across multiple threads.
We can query whether the poll will block prior to installing the callback to make the busy-query fast.
Single thread: 60% faster
8 threads on 4 (+4 HT) cores: 600% faster
Still not quite the perfect scaling we get with a native busy ioctl, but poll(dmabuf) is faster due to the quicker lookup of the object and avoiding drm_ioctl().
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Sumit Semwal sumit.semwal@linaro.org
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
Need to strike the r-b here, since Christian König pointed out that objects won't magically switch signalling on.
-Daniel
drivers/dma-buf/dma-buf.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index cf04d249a6a4..c7a7bc579941 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -156,6 +156,18 @@ static unsigned int dma_buf_poll(struct file *file, poll_table *poll)
 	if (!events)
 		return 0;
 
+	if (poll_does_not_wait(poll)) {
+		if (events & POLLOUT &&
+		    !reservation_object_test_signaled_rcu(resv, true))
+			events &= ~(POLLOUT | POLLIN);
+
+		if (events & POLLIN &&
+		    !reservation_object_test_signaled_rcu(resv, false))
+			events &= ~POLLIN;
+
+		return events;
+	}
+
 retry:
 	seq = read_seqcount_begin(&resv->seq);
 	rcu_read_lock();
-- 
2.9.3
On Fri, Sep 23, 2016 at 03:50:44PM +0200, Daniel Vetter wrote:
On Mon, Aug 29, 2016 at 08:08:34AM +0100, Chris Wilson wrote:
Currently we install a callback for performing poll on a dma-buf, irrespective of the timeout. This involves taking a spinlock, as well as unnecessary work, and greatly reduces scaling of poll(.timeout=0) across multiple threads.
We can query whether the poll will block prior to installing the callback to make the busy-query fast.
Single thread: 60% faster
8 threads on 4 (+4 HT) cores: 600% faster
Still not quite the perfect scaling we get with a native busy ioctl, but poll(dmabuf) is faster due to the quicker lookup of the object and avoiding drm_ioctl().
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Sumit Semwal sumit.semwal@linaro.org
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
Need to strike the r-b here, since Christian König pointed out that objects won't magically switch signalling on.
The point being here that we don't even want to switch signaling on! :)
Christian's point was that not all fences guarantee forward progress irrespective of whether signaling is enabled or not, and fences are not required to guarantee forward progress without signaling even if they provide an ops->signaled().
-Chris
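To make that concern concrete, here is a deliberately simplified toy model. It is not the kernel's struct fence, and every name in it is hypothetical. It assumes a fence implementation that only publishes completion once somebody has enabled signaling, which is the kind of behaviour the objection describes. Real drivers differ in the details.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: all names here are hypothetical. */
struct toy_fence {
	bool hw_done;		/* the work has actually finished */
	bool signaling_enabled;	/* somebody asked to be notified */
	bool signaled;		/* state visible to a passive query */
};

/* The "hardware" finishes; completion is published only if requested. */
static void toy_hw_complete(struct toy_fence *f)
{
	f->hw_done = true;
	if (f->signaling_enabled)
		f->signaled = true;
}

/* Passive busy query: never enables signaling. */
static bool toy_test_signaled(const struct toy_fence *f)
{
	return f->signaled;
}

/* What installing a callback implies: enable signaling and catch up. */
static void toy_enable_signaling(struct toy_fence *f)
{
	assert(f != NULL);

	f->signaling_enabled = true;
	if (f->hw_done)
		f->signaled = true;
}
```

In this model, a caller that only ever performs the passive query sees the fence as busy indefinitely, even after the work has completed. The state only becomes visible once some path enables signaling, which a zero-timeout poll that skips the callback never does.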
On Fri, Sep 23, 2016 at 03:50:44PM +0200, Daniel Vetter wrote:
On Mon, Aug 29, 2016 at 08:08:34AM +0100, Chris Wilson wrote:
Currently we install a callback for performing poll on a dma-buf, irrespective of the timeout. This involves taking a spinlock, as well as unnecessary work, and greatly reduces scaling of poll(.timeout=0) across multiple threads.
We can query whether the poll will block prior to installing the callback to make the busy-query fast.
Single thread: 60% faster
8 threads on 4 (+4 HT) cores: 600% faster
Still not quite the perfect scaling we get with a native busy ioctl, but poll(dmabuf) is faster due to the quicker lookup of the object and avoiding drm_ioctl().
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Sumit Semwal sumit.semwal@linaro.org
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
Need to strike the r-b here, since Christian König pointed out that objects won't magically switch signalling on.
Propagating a flag through to sync_file is trivial, but propagating it through to the dma_buf->resv is not. It looks like dma-buf will be without a fast busy query, which I guess in the grand scheme of things (i.e. dma-buf itself is not intended to be used as a fence) is not that important.
-Chris
On Fri, Sep 23, 2016 at 03:50:44PM +0200, Daniel Vetter wrote:
On Mon, Aug 29, 2016 at 08:08:34AM +0100, Chris Wilson wrote:
Currently we install a callback for performing poll on a dma-buf, irrespective of the timeout. This involves taking a spinlock, as well as unnecessary work, and greatly reduces scaling of poll(.timeout=0) across multiple threads.
We can query whether the poll will block prior to installing the callback to make the busy-query fast.
Single thread: 60% faster
8 threads on 4 (+4 HT) cores: 600% faster
Still not quite the perfect scaling we get with a native busy ioctl, but poll(dmabuf) is faster due to the quicker lookup of the object and avoiding drm_ioctl().
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Sumit Semwal sumit.semwal@linaro.org
Cc: linux-media@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linaro-mm-sig@lists.linaro.org
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
Need to strike the r-b here, since Christian König pointed out that objects won't magically switch signalling on.
Oh, it also means that
commit fb8b7d2b9d80e1e71f379e57355936bd2b024be9
Author: Jammy Zhou Jammy.Zhou@amd.com
Date: Wed Jan 21 18:35:47 2015 +0800
reservation: wait only with non-zero timeout specified (v3)
When the timeout value passed to reservation_object_wait_timeout_rcu is zero, no wait should be done if the fences are not signaled.
Return '1' for idle and '0' for busy if the specified timeout is '0' to keep consistent with the case of non-zero timeout.
v2: call fence_put if not signaled in the case of timeout==0
v3: switch to reservation_object_test_signaled_rcu
Signed-off-by: Jammy Zhou Jammy.Zhou@amd.com
Reviewed-by: Christian König christian.koenig@amd.com
Reviewed-by: Alex Deucher alexander.deucher@amd.com
Reviewed-By: Maarten Lankhorst maarten.lankhorst@canonical.com
Signed-off-by: Sumit Semwal sumit.semwal@linaro.org
is wrong. And reservation_object_test_signaled_rcu() is unreliable.
-Chris