(changed subject and decoupled from the udmabuf thread)
On Wed, Apr 11, 2018 at 08:59:32AM +0300, Oleksandr Andrushchenko wrote:
> On 04/10/2018 08:26 PM, Dongwon Kim wrote:
> >On Tue, Apr 10, 2018 at 09:37:53AM +0300, Oleksandr Andrushchenko wrote:
> >>On 04/06/2018 09:57 PM, Dongwon Kim wrote:
> >>>On Fri, Apr 06, 2018 at 03:36:03PM +0300, Oleksandr Andrushchenko wrote:
> >>>>On 04/06/2018 02:57 PM, Gerd Hoffmann wrote:
> >>>>> Hi,
> >>>>>
> >>>>>>>I fail to see any common ground for xen-zcopy and udmabuf ...
> >>>>>>Does the above mean you can assume that xen-zcopy and udmabuf
> >>>>>>can co-exist as two different solutions?
> >>>>>Well, udmabuf route isn't fully clear yet, but yes.
> >>>>>
> >>>>>See also gvt (intel vgpu), where the hypervisor interface is abstracted
> >>>>>away into a separate kernel module even though most of the actual vgpu
> >>>>>emulation code is common.
> >>>>Thank you for your input, I'm just trying to figure out
> >>>>which of the three z-copy solutions intersect and how much they overlap.
> >>>>>>And what about hyper-dmabuf?
> >>>The xen z-copy solution is fundamentally pretty similar to hyper_dmabuf
> >>>in terms of these core sharing features:
> >>>
> >>>1. the sharing process - import prime/dmabuf from the producer -> extract
> >>>underlying pages and get those shared -> return references for shared pages
> >Another thing is danvet was kind of against the idea of importing an existing
> >dmabuf/prime buffer and forwarding it to the other domain due to synchronization
> >issues. He proposed to make hyper_dmabuf work only as an exporter so that it
> >can have full control over the buffer. I think we need to talk about this
> >further as well.
> Yes, I saw this. But this limits the use-cases so much.
I agree. Our current approach is a lot more flexible. You can find very
similar feedback in my reply to those review messages. However, I also
understand Daniel's concern. I believe we need more discussion
regarding this matter.
> For instance, running Android as a Guest (which uses ION to allocate
> buffers) means that ultimately the HW composer will import the dma-buf into
> the DRM driver. Then, in case of xen-front for example, it needs to be
> shared with the backend (Host side). Of course, we can change user-space
> to make xen-front allocate the buffers (make it the exporter), but what we try
> to avoid is changing user-space which would otherwise have remained
> unchanged.
> So, I do think we have to support this use-case and just have to understand
> the complexity.
>
> >
> >danvet, can you comment on this topic?
> >
> >>>2. the page sharing mechanism - it uses the Xen grant table (see the sketch below).
> >>>
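For readers less familiar with either driver, here is a minimal sketch of what
steps 1 and 2 above boil down to on the exporting side. It is purely
illustrative and not the actual code of xen-zcopy or hyper_dmabuf; function and
variable names are invented, and error handling, cleanup and allocation of
refs[] are omitted.

#include <linux/dma-buf.h>
#include <linux/scatterlist.h>
#include <xen/grant_table.h>
#include <xen/page.h>

/* Illustrative only: import a dma-buf, walk its backing pages and grant
 * each page to a remote Xen domain.
 */
static int share_dmabuf_pages(struct device *dev, int prime_fd,
			      domid_t remote_domid, grant_ref_t *refs)
{
	struct dma_buf *dbuf = dma_buf_get(prime_fd);
	struct dma_buf_attachment *att = dma_buf_attach(dbuf, dev);
	struct sg_table *sgt = dma_buf_map_attachment(att, DMA_BIDIRECTIONAL);
	struct sg_page_iter iter;
	int i = 0;

	/* step 1: extract the pages backing the imported buffer */
	for_each_sg_page(sgt->sgl, &iter, sgt->nents, 0) {
		struct page *page = sg_page_iter_page(&iter);

		/* step 2: one grant reference per shared page */
		refs[i++] = gnttab_grant_foreign_access(remote_domid,
						xen_page_to_gfn(page), 0);
	}
	return i;	/* refs[] is what the importing domain receives */
}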
> >>>And to give you a quick summary of differences as far as I understand
> >>>between the two implementations (please correct me if I am wrong, Oleksandr.)
> >>>
> >>>1. xen-zcopy is DRM-specific - it can import only DRM PRIME buffers
> >>>while hyper_dmabuf can export any dmabuf regardless of its originator
> >>Well, this is true. And at the same time this is just a matter
> >>of extending the API: xen-zcopy is a helper driver designed for
> >>the xen-front/back use-case, which is why it only has a DRM PRIME API
> >>>2. xen-zcopy doesn't seem to have dma-buf synchronization between the two VMs,
> >>>while hyper_dmabuf (what danvet called remote dma-buf API sharing) sends
> >>>out synchronization messages to the exporting VM.
> >>This is true. Again, this is because of the use-cases it covers.
> >>But having synchronization for a generic solution seems to be a good idea.
> >Yeah, understood, xen-zcopy works OK with your use-case. But I am just curious
> >whether it is OK not to have any inter-domain synchronization in this sharing model.
> The synchronization is done with the displif protocol [1]
> >The buffer being shared is technically a dma-buf and the originator needs to be
> >able to keep track of it.
> As I am working in DRM terms, the tracking is done for me by the DRM core
> for free. (This might be one of the reasons Daniel sees a DRM-based
> implementation as a very good fit from a code-reuse POV.)
Yeah, but once you have a DRM object (whether it's a dmabuf or not) in a remote
domain, it is a totally new object and out of sync (correct me if I am wrong)
with the original DRM PRIME object, isn't it? How could these two different
objects, backed by the same pages, be synchronized?
> >
> >>>3. 1-level references - when using the grant table for sharing pages, there will
> >>>be the same # of refs (each 8 byte)
> >>To be precise, grant ref is 4 bytes
> >You are right. Thanks for correction.;)
> >
> >>>as the # of shared pages, which is passed to
> >>>userspace to be shared with the importing VM in the case of xen-zcopy.
> >>The reason for that is that xen-zcopy is a helper driver, e.g.
> >>the grant references come from the display backend [1], which implements
> >>Xen display protocol [2]. So, effectively the backend extracts references
> >>from frontend's requests and passes those to xen-zcopy as an array
> >>of refs.
> >>> Compared
> >>>to this, hyper_dmabuf does multi-level addressing to generate only one
> >>>reference id that represents all shared pages.
> >>In the protocol [2] only one reference to the gref directory is passed
> >>between VMs (and the gref directory is a singly-linked list of shared pages
> >>containing all of the grefs of the buffer).
> >ok, good to know. I will look into its implementation in more detail, but is
> >this gref directory (chained grefs) something that can be used for any general
> >memory sharing use-case or is it just for xen-display (in the current code base)?
> Not to mislead you: one grant ref is passed via displif protocol,
> but the page it's referencing contains the rest of the grant refs.
>
I checked displif.h. I like the concept of chaining 2nd-level grefs.
As you have probably realized, our multi-level addressing is almost
identical to the gref directory, except that we defined another level on top
to address multiple 2nd-level grefs instead of creating a linked list. And I
see there would be an advantage in terms of memory saving in your method.
Now I wonder why this should remain just a displif feature. I think
we could expand it to any type of large-buffer sharing use-case in Xen
(possibly as an extension to the grant-table driver?)
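For reference, the gref directory that displif/sndif use is conceptually a
page-sized node of a singly-linked list, roughly like the sketch below. This is
simplified; the struct and field names may differ slightly from the actual
displif.h definition.

#include <xen/interface/grant_table.h>	/* grant_ref_t */

struct xen_page_directory {
	grant_ref_t gref_dir_next_page;	/* grant ref of the next directory
					 * page; 0 terminates the chain */
	grant_ref_t gref[1];		/* grant refs of the shared buffer
					 * pages, filling the rest of this
					 * 4 KiB page */
};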
> As to whether this can be used for any memory: yes. It is the same for
> the sndif and displif Xen protocols, but defined twice since, strictly speaking,
> sndif and displif are two separate protocols.
>
> While reviewing your RFC v2 one of the comments I had [2] was whether we
> could start by defining such a generic protocol for hyper-dmabuf.
> It could be a header file, which not only has the description part
> (which then becomes part of a Documentation/...rst file), but also defines
> all the required constants for requests and responses, message formats,
> state diagrams etc. all in one place. Of course this protocol must not be
> Xen-specific, but OS/hypervisor-agnostic.
> Having that will trigger a new round of discussion, so we have it all
> designed and discussed before we start implementing.
>
> Besides the protocol we have to design the UAPI part as well and make sure
> hyper-dmabuf is not only accessible from user-space, but also usable by a
> number of kernel-space users.
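Purely as an illustration of what such a generic, hypervisor-agnostic header
could contain: all names, constants and the message layout below are made up
for this example and are not a proposal of the actual protocol.

#include <linux/types.h>

/* Hypothetical wire-protocol sketch: only plain integer types, request/
 * response constants and a fixed message layout, so any backend (Xen,
 * KVM, ...) can implement the transport underneath.
 */
#define HDMABUF_REQ_EXPORT	0x01	/* exporter announces a new buffer   */
#define HDMABUF_REQ_UNEXPORT	0x02	/* exporter revokes a shared buffer  */
#define HDMABUF_REQ_SYNC	0x03	/* dma-buf operation synchronization */
#define HDMABUF_RESP_OK		0x80
#define HDMABUF_RESP_ERROR	0x81

struct hdmabuf_msg {
	__u32 cmd;	/* HDMABUF_REQ_* or HDMABUF_RESP_*       */
	__u32 seq;	/* matches a response to its request     */
	__u64 buf_id;	/* hypervisor-agnostic buffer id         */
	__u32 data[8];	/* command-specific payload (priv data)  */
};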
> >
> >>>4. inter-VM messaging (hyper_dmabuf only) - hyper_dmabuf has inter-VM msg
> >>>communication defined for dmabuf synchronization and private data (the meta
> >>>info that Matt Roper mentioned) exchange.
> >>This is true, xen-zcopy has no means for inter VM sync and meta-data,
> >>simply because it doesn't have any code for inter VM exchange in it,
> >>e.g. the inter VM protocol is handled by the backend [1].
> >>>5. driver-to-driver notification (hyper_dmabuf only) - the importing VM gets
> >>>notified when a new dmabuf is exported from another VM - a uevent can optionally
> >>>be generated when this happens.
> >>>
> >>>6. structure - hyper_dmabuf is targeting a generic solution for
> >>>inter-domain dmabuf sharing for most hypervisors, which is why it has two
> >>>layers as mattrope mentioned: a front-end that contains the standard API and a
> >>>backend that is specific to the hypervisor.
> >>Again, xen-zcopy is decoupled from inter VM communication
> >>>>>No idea, didn't look at it in detail.
> >>>>>
> >>>>>Looks pretty complex from a distant view. Maybe because it tries to
> >>>>>build a communication framework using dma-bufs instead of a simple
> >>>>>dma-buf passing mechanism.
> >>>we started with simple dma-buf sharing but realized there are many
> >>>things we need to consider in real use-cases, so we added communication,
> >>>notification and dma-buf synchronization, then re-structured it into
> >>>front-end and back-end (this made things more complicated..) since Xen
> >>>was not our only target. Also, we thought passing the reference for the
> >>>buffer (hyper_dmabuf_id) around was not secure, so we added the uevent mechanism later.
> >>>
> >>>>Yes, I am looking at it now, trying to figure out the full story
> >>>>and its implementation. BTW, Intel guys were about to share some
> >>>>test application for hyper-dmabuf, maybe I have missed one.
> >>>>It could probably better explain the use-cases and the complexity
> >>>>they have in hyper-dmabuf.
> >>>One example is actually on github. If you want to take a look at it, please
> >>>visit:
> >>>
> >>>https://github.com/downor/linux_hyper_dmabuf_test/tree/xen/simple_export
> >>Thank you, I'll have a look
> >>>>>Like xen-zcopy it seems to depend on the idea that since the hypervisor
> >>>>>manages all memory it is easy for guests to share pages with the help of
> >>>>>the hypervisor.
> >>>>So, for xen-zcopy we were not trying to make it generic,
> >>>>it just solves display (dumb) zero-copying use-cases for Xen.
> >>>>We implemented it as a DRM helper driver because we can't see any
> >>>>other use-cases as of now.
> >>>>For example, we also have Xen para-virtualized sound driver, but
> >>>>its buffer memory usage is not comparable to what display wants
> >>>>and it works somewhat differently (e.g. there is no "frame done"
> >>>>event, so one can't tell when the sound buffer can be "flipped").
> >>>>At the same time, we do not use virtio-gpu, so this could probably
> >>>>be one more candidate for shared dma-bufs some day.
> >>>>> Which simply isn't the case on kvm.
> >>>>>
> >>>>>hyper-dmabuf and xen-zcopy could maybe share code, or hyper-dmabuf build
> >>>>>on top of xen-zcopy.
> >>>>Hm, I can imagine that: xen-zcopy could be library code for hyper-dmabuf
> >>>>in terms of implementing all that page sharing fun in multiple directions,
> >>>>e.g. Host->Guest, Guest->Host, Guest<->Guest.
> >>>>But I'll let Matt and Dongwon comment on that.
> >>>I think we can definitely collaborate. Especially since maybe we are using an
> >>>outdated sharing/grant-table mechanism in our Xen backend (thanks
> >>>for bringing that up, Oleksandr). However, the question is, once we collaborate
> >>>somehow, can xen-zcopy's use-case use the standard API that hyper_dmabuf
> >>>provides? I don't think we need different IOCTLs that do the same thing in the
> >>>final solution.
> >>>
> >>If you think of xen-zcopy as a library (which implements the Xen
> >>grant reference mangling) and a DRM PRIME wrapper on top of that
> >>library, we can probably define a proper API for that library,
> >>so both xen-zcopy and hyper-dmabuf can use it. What is more, I am
> >>about to start upstreaming the Xen para-virtualized sound device driver soon,
> >>which also uses similar code and the same gref passing mechanism [3].
> >>(Actually, I was about to upstream drm/xen-front, drm/xen-zcopy and
> >>snd/xen-front and then propose a Xen helper library for sharing big buffers,
> >>so the common code of the above drivers can be shared w/o
> >>duplication.)
> >I think it is possible to use your functions for the memory sharing part in
> >hyper_dmabuf's backend (here 'backend' means the layer that does page sharing
> >and inter-VM communication in a Xen-specific way), so why don't we work on the
> >"Xen helper library for sharing big buffers" first while we continue our
> >discussion on the common API layer that can cover any dmabuf sharing case.
> >
> Well, I would love for us to reuse the code that I have, but I also
> understand that it is limited by my use-cases. So, I do not
> insist we have to ;)
> If we start designing and discussing hyper-dmabuf protocol we of course
> can work on this helper library in parallel.
> >>Thank you,
> >>Oleksandr
> >>
> >>P.S. All, is it a good idea to move this out of udmabuf thread into a
> >>dedicated one?
> >Either way is fine with me.
> So, if you can start designing the protocol we may have a dedicated mail
> thread for that. I will try to help with the protocol as much as I can
Sure, thanks. We can talk about it. Just FYI, I have prepared an application
note that contains the definitions of the hyper_dmabuf messages included in the
RFC v2 patch. That would be a great starting point. It would be great if you
could review it.
>
> >>>>>cheers,
> >>>>> Gerd
> >>>>>
> >>>>Thank you,
> >>>>Oleksandr
> >>>>
> >>>>P.S. Sorry for making your original mail thread discuss things much
> >>>>broader than your RFC...
> >>>>
> >>[1] https://github.com/xen-troops/displ_be
> >>[2] https://elixir.bootlin.com/linux/v4.16-rc7/source/include/xen/interface/io/…
> >>[3] https://elixir.bootlin.com/linux/v4.16-rc7/source/include/xen/interface/io/…
> >>
> [1] https://elixir.bootlin.com/linux/v4.16-rc7/source/include/xen/interface/io/…
> [2]
> https://lists.xenproject.org/archives/html/xen-devel/2018-04/msg00685.html
This patch series contains the implementation of a new device driver, the
hyper_DMABUF driver, which provides a way to expand the boundary of
Linux DMA-BUF sharing across different VM instances in a multi-OS platform
enabled by a hypervisor (e.g. Xen).
This version 2 series is basically a refactored version of the old series starting
with "[RFC PATCH 01/60] hyper_dmabuf: initial working version of hyper_dmabuf
drv".
Implementation details of this driver are described in the reference guide
added by the second patch, "[RFC PATCH v2 2/5] hyper_dmabuf: architecture
specification and reference guide".
Attaching 'Overview' section here as a quick summary.
------------------------------------------------------------------------------
Section 1. Overview
------------------------------------------------------------------------------
The Hyper_DMABUF driver is a Linux device driver running on multiple Virtual
Machines (VMs), which expands the DMA-BUF sharing capability to the VM environment
where multiple different OS instances need to share the same physical data without
copying it across VMs.
To share a DMA_BUF across VMs, an instance of the Hyper_DMABUF driver on the
exporting VM (the so-called "exporter") imports a local DMA_BUF from the original
producer of the buffer, then re-exports it to the importing VM (the so-called
"importer") with a unique ID, hyper_dmabuf_id, for the buffer.
Another instance of the Hyper_DMABUF driver on the importer registers
the hyper_dmabuf_id together with reference information for the shared physical
pages associated with the DMA_BUF in its database when the export happens.
The actual mapping of the DMA_BUF on the importer’s side is done by
the Hyper_DMABUF driver when user space issues the IOCTL command to access
the shared DMA_BUF. The Hyper_DMABUF driver works as both an importing and
exporting driver as is, that is, no special configuration is required.
Consequently, only a single module per VM is needed to enable cross-VM DMA_BUF
exchange.
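As a rough picture of the flow described above, here is a hedged user-space
sketch. The ioctl names and structures are invented for illustration only and
do not match the driver's real UAPI, which is defined in
include/uapi/linux/hyper_dmabuf.h in this series.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Illustrative structures/ioctls only, not the driver's actual UAPI. */
struct ex_remote { int dmabuf_fd; int remote_domain; uint64_t hid; };
struct ex_fd     { uint64_t hid; int dmabuf_fd; };
#define IOCTL_EXPORT_REMOTE _IOWR('H', 0, struct ex_remote)
#define IOCTL_EXPORT_FD     _IOWR('H', 1, struct ex_fd)

/* exporting VM: local dma-buf fd -> hyper_dmabuf_id */
static uint64_t export_side(int drv_fd, int dmabuf_fd, int importer_domid)
{
	struct ex_remote er = { dmabuf_fd, importer_domid, 0 };

	ioctl(drv_fd, IOCTL_EXPORT_REMOTE, &er);
	return er.hid;	/* sent to the importing VM out of band */
}

/* importing VM: hyper_dmabuf_id -> local dma-buf fd; the shared pages
 * are mapped by the driver when this ioctl is issued */
static int import_side(int drv_fd, uint64_t hid)
{
	struct ex_fd ef = { hid, -1 };

	ioctl(drv_fd, IOCTL_EXPORT_FD, &ef);
	return ef.dmabuf_fd;
}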
------------------------------------------------------------------------------
There is a git repository at github.com where this series of patches are all
integrated in Linux kernel tree based on the commit:
commit ae64f9bd1d3621b5e60d7363bc20afb46aede215
Author: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Sun Dec 3 11:01:47 2017 -0500
Linux 4.15-rc2
https://github.com/downor/linux_hyper_dmabuf.git hyper_dmabuf_integration_v4
Dongwon Kim, Mateusz Polrola (9):
hyper_dmabuf: initial upload of hyper_dmabuf drv core framework
hyper_dmabuf: architecture specification and reference guide
MAINTAINERS: adding Hyper_DMABUF driver section in MAINTAINERS
hyper_dmabuf: user private data attached to hyper_DMABUF
hyper_dmabuf: hyper_DMABUF synchronization across VM
hyper_dmabuf: query ioctl for retreiving various hyper_DMABUF info
hyper_dmabuf: event-polling mechanism for detecting a new hyper_DMABUF
hyper_dmabuf: threaded interrupt in Xen-backend
hyper_dmabuf: default backend for XEN hypervisor
Documentation/hyper-dmabuf-sharing.txt | 734 ++++++++++++++++
MAINTAINERS | 11 +
drivers/dma-buf/Kconfig | 2 +
drivers/dma-buf/Makefile | 1 +
drivers/dma-buf/hyper_dmabuf/Kconfig | 50 ++
drivers/dma-buf/hyper_dmabuf/Makefile | 44 +
.../backends/xen/hyper_dmabuf_xen_comm.c | 944 +++++++++++++++++++++
.../backends/xen/hyper_dmabuf_xen_comm.h | 78 ++
.../backends/xen/hyper_dmabuf_xen_comm_list.c | 158 ++++
.../backends/xen/hyper_dmabuf_xen_comm_list.h | 67 ++
.../backends/xen/hyper_dmabuf_xen_drv.c | 46 +
.../backends/xen/hyper_dmabuf_xen_drv.h | 53 ++
.../backends/xen/hyper_dmabuf_xen_shm.c | 525 ++++++++++++
.../backends/xen/hyper_dmabuf_xen_shm.h | 46 +
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_drv.c | 410 +++++++++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_drv.h | 122 +++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_event.c | 122 +++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_event.h | 38 +
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_id.c | 135 +++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_id.h | 53 ++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_ioctl.c | 794 +++++++++++++++++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_ioctl.h | 52 ++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_list.c | 295 +++++++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_list.h | 73 ++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_msg.c | 416 +++++++++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_msg.h | 89 ++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_ops.c | 415 +++++++++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_ops.h | 34 +
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_query.c | 174 ++++
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_query.h | 36 +
.../hyper_dmabuf/hyper_dmabuf_remote_sync.c | 324 +++++++
.../hyper_dmabuf/hyper_dmabuf_remote_sync.h | 32 +
.../dma-buf/hyper_dmabuf/hyper_dmabuf_sgl_proc.c | 257 ++++++
.../dma-buf/hyper_dmabuf/hyper_dmabuf_sgl_proc.h | 43 +
drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_struct.h | 143 ++++
include/uapi/linux/hyper_dmabuf.h | 134 +++
36 files changed, 6950 insertions(+)
create mode 100644 Documentation/hyper-dmabuf-sharing.txt
create mode 100644 drivers/dma-buf/hyper_dmabuf/Kconfig
create mode 100644 drivers/dma-buf/hyper_dmabuf/Makefile
create mode 100644 drivers/dma-buf/hyper_dmabuf/backends/xen/hyper_dmabuf_xen_comm.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/backends/xen/hyper_dmabuf_xen_comm.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/backends/xen/hyper_dmabuf_xen_comm_list.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/backends/xen/hyper_dmabuf_xen_comm_list.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/backends/xen/hyper_dmabuf_xen_drv.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/backends/xen/hyper_dmabuf_xen_drv.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/backends/xen/hyper_dmabuf_xen_shm.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/backends/xen/hyper_dmabuf_xen_shm.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_drv.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_drv.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_event.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_event.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_id.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_id.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_ioctl.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_ioctl.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_list.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_list.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_msg.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_msg.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_ops.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_ops.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_query.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_query.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_remote_sync.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_remote_sync.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_sgl_proc.c
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_sgl_proc.h
create mode 100644 drivers/dma-buf/hyper_dmabuf/hyper_dmabuf_struct.h
create mode 100644 include/uapi/linux/hyper_dmabuf.h
--
2.16.1
Most of the other cross-driver gfx infrastructure (dma_buf, dma_fence)
also gets cross posted to all the relevant gfx/memory lists. Doing the
same for ION means people won't miss relevant patches.
Cc: Laura Abbott <labbott(a)redhat.com>
Cc: Sumit Semwal <sumit.semwal(a)linaro.org>
Cc: devel(a)driverdev.osuosl.org
Cc: dri-devel(a)lists.freedesktop.org
Cc: linaro-mm-sig(a)lists.linaro.org (moderated for non-subscribers)
Cc: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Signed-off-by: Daniel Vetter <daniel.vetter(a)intel.com>
---
MAINTAINERS | 2 ++
1 file changed, 2 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 555db72d4eb7..d43cdfca3eb5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -902,6 +902,8 @@ ANDROID ION DRIVER
M: Laura Abbott <labbott(a)redhat.com>
M: Sumit Semwal <sumit.semwal(a)linaro.org>
L: devel(a)driverdev.osuosl.org
+L: dri-devel(a)lists.freedesktop.org
+L: linaro-mm-sig(a)lists.linaro.org (moderated for non-subscribers)
S: Supported
F: drivers/staging/android/ion
F: drivers/staging/android/uapi/ion.h
--
2.16.2
On Thu, Mar 29, 2018 at 09:58:54PM -0400, Jerome Glisse wrote:
> dma_map_resource() is the right API (though its current implementation
> is full of x86 assumptions). So I would argue that an arch can decide to
> implement it or simply return a DMA error address, which triggers the fallback
> path in the caller (at least for GPU drivers). An SG variant can be added
> on top.
It isn't in general. It doesn't integrate with scatterlists (see my
comment to page one), and it doesn't integrate with all the subsystems
that also need a kernel virtual address.
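For context, a minimal sketch (illustrative only, function name and error
handling invented) of how dma_map_resource() gets used to map part of a peer
PCI device's BAR for DMA, including the error check that triggers the fallback
path Jerome mentions:

#include <linux/dma-mapping.h>
#include <linux/pci.h>

/* Map 'size' bytes at 'offset' into 'bar' of a peer PCI device so that
 * 'dev' can DMA to it.  On architectures without support the mapping
 * fails and the caller is expected to take a fallback path. */
static dma_addr_t map_peer_bar(struct device *dev, struct pci_dev *peer,
			       int bar, resource_size_t offset, size_t size)
{
	phys_addr_t phys = pci_resource_start(peer, bar) + offset;
	dma_addr_t addr = dma_map_resource(dev, phys, size,
					   DMA_BIDIRECTIONAL, 0);

	if (dma_mapping_error(dev, addr))
		return 0;	/* caller falls back (e.g. bounce via RAM) */
	return addr;
}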
Am 29.03.2018 um 18:25 schrieb Logan Gunthorpe:
>
> On 29/03/18 10:10 AM, Christian König wrote:
>> Why not? I mean the dma_map_resource() function is for P2P while other
>> dma_map_* functions are only for system memory.
> Oh, hmm, I wasn't aware dma_map_resource was exclusively for mapping
> P2P. Though it's a bit odd seeing we've been working under the
> assumption that PCI P2P is different as it has to translate the PCI bus
> address, whereas P2P for devices on other buses is a big unknown.
Yeah, completely agree. On my TODO list (but rather far down) is
actually supporting P2P with USB devices.
And no, I don't have the slightest idea how to do this at the moment.
>>> And this is necessary to
>>> check if the DMA ops in use support it or not. We can't have the
>>> dma_map_X() functions do the wrong thing because they don't support it yet.
>> Well that sounds like we should just return an error from
>> dma_map_resource() when an architecture doesn't support P2P yet, as Alex
>> suggested.
> Yes, well except in our patch-set we can't easily use
> dma_map_resource() as we either have SGLs to deal with or we need to
> create whole new interfaces to a number of subsystems.
Agree as well. I was also clearly in favor of extending the SGLs to have a
flag for this instead of the dma_map_resource() interface, but for some
reason that didn't make it into the kernel.
>> You don't seem to understand the implications: The devices do have a
>> common upstream bridge! In other words your code would currently claim
>> that P2P is supported, but in practice it doesn't work.
> Do they? They don't on any of the Intel machines I'm looking at. The
> previous version of the patchset not only required a common upstream
> bridge but two layers of upstream bridges on both devices which would
> effectively limit transfers to PCIe switches only. But Bjorn did not
> like this.
At least to me that sounds like a good idea, it would at least disable
(the incorrect) auto detection of P2P for such devices.
>> You need to include both drivers which participate in the P2P
>> transaction to make sure that both support this and give them the
>> opportunity to chicken out and, in the case of AMD APUs, even redirect the
>> request to another location (e.g. participate in the DMA translation).
> I don't think it's the driver's responsibility to reject P2P. The
> topology is what governs support or not. The discussions we had with
> Bjorn settled on: if the devices are all behind the same bridge they can
> communicate with each other. This is essentially guaranteed by the PCI spec.
Well it is not only about rejecting P2P; see, the devices I need to worry about
are essentially part of the CPU. Their resources look like a PCI BAR to
the BIOS and OS, but are actually backed by stolen system memory.
So, as crazy as it sounds, what you get is an operation which starts as
P2P, but then the GPU driver sees it and says: Hey, please don't write
that to my PCIe BAR, but rather to system memory location X.
>> DMA-buf fortunately seems to handle all this already, that's why we
>> choose it as base for our implementation.
> Well, unfortunately DMA-buf doesn't help for the drivers we are working
> with as neither the block layer nor the RDMA subsystem have any
> interfaces for it.
A fact that gives me quite some sleepless nights as well. I think we
sooner or later need to extend those interfaces to work with DMA-bufs as
well.
I will try to give your patch set a review when I'm back from vacation
and rebase my DMA-buf work on top of that.
Regards,
Christian.
>
> Logan
Am 29.03.2018 um 17:45 schrieb Logan Gunthorpe:
>
> On 29/03/18 05:44 AM, Christian König wrote:
>> Am 28.03.2018 um 21:53 schrieb Logan Gunthorpe:
>>> On 28/03/18 01:44 PM, Christian König wrote:
>>>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>>>> I can see it makes sure IOMMU is aware of the access route and
>>>> translates a CPU address into a PCI Bus address.
>>>> I'm using that with the AMD IOMMU driver and at least there it works
>>>> perfectly fine.
>>> Yes, it would be nice, but no arch has implemented this yet. We are just
>>> lucky in the x86 case because that arch is simple and doesn't need to do
>>> anything for P2P (partially due to the Bus and CPU addresses being the
>>> same). But in the general case, you can't rely on it.
>> Well, that an arch hasn't implemented it doesn't mean that we don't have
>> the right interface to do it.
> Yes, but right now we don't have a performant way to check if we are
> doing P2P or not in the dma_map_X() wrappers.
Why not? I mean the dma_map_resource() function is for P2P while other
dma_map_* functions are only for system memory.
> And this is necessary to
> check if the DMA ops in use support it or not. We can't have the
> dma_map_X() functions do the wrong thing because they don't support it yet.
Well that sounds like we should just return an error from
dma_map_resource() when an architecture doesn't support P2P yet, as Alex
suggested.
>> Devices integrated in the CPU usually only "claim" to be PCIe devices.
>> In reality their memory request path goes directly through the integrated
>> north bridge. The reason for this is simple: better throughput/latency.
> These are just more reasons why our patchset restricts to devices behind
> a switch. And more mess for someone to deal with if they need to relax
> that restriction.
You don't seem to understand the implications: The devices do have a
common upstream bridge! In other words your code would currently claim
that P2P is supported, but in practice it doesn't work.
You need to include both drivers which participate in the P2P
transaction to make sure that both support this and give them the
opportunity to chicken out and, in the case of AMD APUs, even redirect the
request to another location (e.g. participate in the DMA translation).
DMA-buf fortunately seems to handle all this already, that's why we
choose it as base for our implementation.
Regards,
Christian.
Sorry, didn't mean to drop the lists here. re-adding.
On Wed, Mar 28, 2018 at 4:05 PM, Alex Deucher <alexdeucher(a)gmail.com> wrote:
> On Wed, Mar 28, 2018 at 3:53 PM, Logan Gunthorpe <logang(a)deltatee.com> wrote:
>>
>>
>> On 28/03/18 01:44 PM, Christian König wrote:
>>> Well, isn't that exactly what dma_map_resource() is good for? As far as
>>> I can see it makes sure IOMMU is aware of the access route and
>>> translates a CPU address into a PCI Bus address.
>>
>>> I'm using that with the AMD IOMMU driver and at least there it works
>>> perfectly fine.
>>
>> Yes, it would be nice, but no arch has implemented this yet. We are just
>> lucky in the x86 case because that arch is simple and doesn't need to do
>> anything for P2P (partially due to the Bus and CPU addresses being the
>> same). But in the general case, you can't rely on it.
>
> Could we do something for the arches where it works? I feel like peer
> to peer has dragged out for years because everyone is trying to boil
> the ocean for all arches. There are a huge number of use cases for
> peer to peer on these "simple" architectures which actually represent
> a good deal of the users that want this.
>
> Alex
>
>>
>>>>> Yeah, but not for ours. See if you want to do real peer 2 peer you need
>>>>> to keep both the operation as well as the direction into account.
>>>> Not sure what you are saying here... I'm pretty sure we are doing "real"
>>>> peer 2 peer...
>>>>
>>>>> For example when you can do writes between A and B that doesn't mean
>>>>> that writes between B and A work. And reads are generally less likely to
>>>>> work than writes. etc...
>>>> If both devices are behind a switch then the PCI spec guarantees that A
>>>> can both read and write B and vice versa.
>>>
>>> Sorry to say that, but I know a whole bunch of PCI devices which
>>> horribly ignore that.
>>
>> Can you elaborate? As far as the device is concerned it shouldn't know
>> whether a request comes from a peer or from the host. If it does do
>> crazy stuff like that it's well out of spec. It's up to the switch (or
>> root complex if good support exists) to route the request to the device
>> and it's the root complex that tends to be what drops the load requests
>> which causes the asymmetries.
>>
>> Logan
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx(a)lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx