From: Rob Clark <robdclark@chromium.org>
Inspired by https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffw... it seemed like a good idea to get rid of memory allocation in the job_run() fence-signaling path, and to use lockdep annotations to yell at us about anything that could deadlock against shrinker/reclaim. Anything that can trigger reclaim, or block on any other thread that has triggered reclaim, can block the GPU shrinker from releasing memory if it is waiting for the job to complete, causing deadlock.
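For anyone who hasn't seen danvet's annotations: a driver brackets its fence-signaling critical section with begin/end markers, and lockdep then reports anything inside that section that could wait on reclaim. Roughly, and only as an illustration (not code from this series; 'foo' is a placeholder):

	bool cookie;

	/*
	 * Lockdep records that reclaim may end up waiting on everything
	 * between begin/end, so a GFP_KERNEL allocation here (which may
	 * itself recurse into reclaim) becomes an immediate lockdep splat
	 * instead of a rare runtime deadlock.
	 */
	cookie = dma_fence_begin_signalling();

	foo = kmalloc(sizeof(*foo), GFP_KERNEL);   /* <-- lockdep complains */

	dma_fence_end_signalling(cookie);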
The first patch pre-allocates the hw_fence, splitting allocation from initialization, to avoid allocation in the job_run() path. The next eight patches decouple the obj lock from job_run(), as the obj lock is required to pin/unpin backing pages (ie. holding an obj lock in job_run() could deadlock the shrinker by blocking forward progress towards pinned buffers becoming idle). These are followed by two patches that switch to idr_preload(), to avoid memory allocations under locks indirectly connected to the shrinker path.
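The idr_preload() pattern those two patches enable looks roughly like this (a sketch using msm's fence_idr as the example; the exact call site differs):

	idr_preload(GFP_KERNEL);            /* preallocate outside the lock */
	spin_lock(&queue->idr_lock);

	/* GFP_NOWAIT cannot recurse into reclaim under the spinlock */
	ret = idr_alloc(&queue->fence_idr, fence, 0, INT_MAX, GFP_NOWAIT);

	spin_unlock(&queue->idr_lock);
	idr_preload_end();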
Next are three patches that decouple initialization (where allocations are needed) from GPU runpm and devfreq, to avoid allocations in the fence-signaling path. These are followed by various PM devfreq/QoS and interconnect locking fixes that likewise decouple initialization (allocation) from runtime.
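The recurring shape of those fixes is to move allocation out from under any mutex that reclaim may (transitively) depend on. Schematically, using the PM QoS case (a sketch, not the actual hunk):

	/*
	 * Before: kzalloc(GFP_KERNEL) happened under dev_pm_qos_mtx, so the
	 * allocation could recurse into reclaim while holding a lock the
	 * reclaim side may also need.  After: allocate first, then take the
	 * mutex only long enough to publish the preallocated request.
	 */
	req = kzalloc(sizeof(*req), GFP_KERNEL);
	if (!req)
		return -ENOMEM;

	mutex_lock(&dev_pm_qos_mtx);
	/* ... install the preallocated request ... */
	mutex_unlock(&dev_pm_qos_mtx);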
And finally, the last patch is a modified version of danvet's patch to add lockdep annotations to the GPU scheduler, but it does so conditionally so that drivers can opt in.
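The opt-in could look something like the following in the scheduler's main loop; the 'fence_signalling' flag name is illustrative, not necessarily what the patch actually adds:

	static int drm_sched_main(void *param)
	{
		struct drm_gpu_scheduler *sched = param;
		bool fence_cookie = false;

		/* only drivers whose run_job()/free_job() paths are
		 * reclaim-safe opt in to the annotation */
		if (sched->fence_signalling)
			fence_cookie = dma_fence_begin_signalling();

		/* ... existing main loop: wait for and run jobs ... */

		if (sched->fence_signalling)
			dma_fence_end_signalling(fence_cookie);
		return 0;
	}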
v2: Switch from embedding the hw_fence in the submit/job object to pre-allocating it. Rework the "fenced unpin" locking to drop the obj lock from the fence-signaling path (ie. the part that was still WIP in the first iteration of the patchset). Add the final patch to enable fence-signaling annotations now that job_run() and job_free() are safe. The PM devfreq/QoS and interconnect patches are unchanged.
Rob Clark (23):
  drm/msm: Pre-allocate hw_fence
  drm/msm: Move submit bo flags update from obj lock
  drm/msm/gem: Tidy up VMA API
  drm/msm: Decouple vma tracking from obj lock
  drm/msm/gem: Simplify vmap vs LRU tracking
  drm/gem: Export drm_gem_lru_move_tail_locked()
  drm/msm/gem: Move update_lru()
  drm/msm/gem: Protect pin_count/madv by LRU lock
  drm/msm/gem: Avoid obj lock in job_run()
  drm/msm: Switch idr_lock to spinlock
  drm/msm: Use idr_preload()
  drm/msm/gpu: Move fw loading out of hw_init() path
  drm/msm/gpu: Move BO allocation out of hw_init
  drm/msm/a6xx: Move ioremap out of hw_init path
  PM / devfreq: Drop unneed locking to appease lockdep
  PM / devfreq: Teach lockdep about locking order
  PM / QoS: Fix constraints alloc vs reclaim locking
  PM / QoS: Decouple request alloc from dev_pm_qos_mtx
  PM / QoS: Teach lockdep about dev_pm_qos_mtx locking order
  soc: qcom: smd-rpm: Use GFP_ATOMIC in write path
  interconnect: Fix locking for runpm vs reclaim
  interconnect: Teach lockdep about icc_bw_lock order
  drm/sched: Add (optional) fence signaling annotation
 drivers/base/power/qos.c                   |  83 +++++++++---
 drivers/devfreq/devfreq.c                  |  52 ++++----
 drivers/gpu/drm/drm_gem.c                  |  11 +-
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c      |  48 ++++---
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c      |  18 ++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c      |  46 ++++---
 drivers/gpu/drm/msm/adreno/adreno_device.c |   6 +
 drivers/gpu/drm/msm/adreno/adreno_gpu.c    |   9 +-
 drivers/gpu/drm/msm/msm_drv.c              |   6 +-
 drivers/gpu/drm/msm/msm_fence.c            |  12 +-
 drivers/gpu/drm/msm/msm_fence.h            |   3 +-
 drivers/gpu/drm/msm/msm_gem.c              | 145 ++++++++++++++-------
 drivers/gpu/drm/msm/msm_gem.h              |  29 +++--
 drivers/gpu/drm/msm/msm_gem_submit.c       |  27 ++--
 drivers/gpu/drm/msm/msm_gem_vma.c          |  91 ++++++++++---
 drivers/gpu/drm/msm/msm_gpu.h              |   8 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c       |   9 +-
 drivers/gpu/drm/msm/msm_submitqueue.c      |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c     |   9 ++
 drivers/interconnect/core.c                |  18 ++-
 drivers/soc/qcom/smd-rpm.c                 |   2 +-
 include/drm/drm_gem.h                      |   1 +
 include/drm/gpu_scheduler.h                |   2 +
 23 files changed, 416 insertions(+), 221 deletions(-)
From: Rob Clark <robdclark@chromium.org>
Avoid allocating memory in job_run() by pre-allocating the hw_fence.
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/msm_fence.c      | 12 +++++++++---
 drivers/gpu/drm/msm/msm_fence.h      |  3 ++-
 drivers/gpu/drm/msm/msm_gem_submit.c |  7 +++++++
 drivers/gpu/drm/msm/msm_ringbuffer.c |  2 +-
 4 files changed, 19 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/msm/msm_fence.c b/drivers/gpu/drm/msm/msm_fence.c
index 56641408ea74..bab3d84f1686 100644
--- a/drivers/gpu/drm/msm/msm_fence.c
+++ b/drivers/gpu/drm/msm/msm_fence.c
@@ -99,7 +99,7 @@ static const struct dma_fence_ops msm_fence_ops = {
 };
 
 struct dma_fence *
-msm_fence_alloc(struct msm_fence_context *fctx)
+msm_fence_alloc(void)
 {
 	struct msm_fence *f;
 
@@ -107,10 +107,16 @@ msm_fence_alloc(struct msm_fence_context *fctx)
 	if (!f)
 		return ERR_PTR(-ENOMEM);
 
+	return &f->base;
+}
+
+void
+msm_fence_init(struct dma_fence *fence, struct msm_fence_context *fctx)
+{
+	struct msm_fence *f = to_msm_fence(fence);
+
 	f->fctx = fctx;
 
 	dma_fence_init(&f->base, &msm_fence_ops, &fctx->spinlock,
 		       fctx->context, ++fctx->last_fence);
-
-	return &f->base;
 }
diff --git a/drivers/gpu/drm/msm/msm_fence.h b/drivers/gpu/drm/msm/msm_fence.h
index 7f1798c54cd1..f913fa22d8fe 100644
--- a/drivers/gpu/drm/msm/msm_fence.h
+++ b/drivers/gpu/drm/msm/msm_fence.h
@@ -61,7 +61,8 @@ void msm_fence_context_free(struct msm_fence_context *fctx);
 bool msm_fence_completed(struct msm_fence_context *fctx, uint32_t fence);
 void msm_update_fence(struct msm_fence_context *fctx, uint32_t fence);
 
-struct dma_fence * msm_fence_alloc(struct msm_fence_context *fctx);
+struct dma_fence * msm_fence_alloc(void);
+void msm_fence_init(struct dma_fence *fence, struct msm_fence_context *fctx);
 
 static inline bool fence_before(uint32_t a, uint32_t b)
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index be4bf77103cd..2570c018b0cb 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -41,6 +41,13 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
 	if (!submit)
 		return ERR_PTR(-ENOMEM);
 
+	submit->hw_fence = msm_fence_alloc();
+	if (IS_ERR(submit->hw_fence)) {
+		ret = PTR_ERR(submit->hw_fence);
+		kfree(submit);
+		return ERR_PTR(ret);
+	}
+
 	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
 	if (ret) {
 		kfree(submit);
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 57a8e9564540..a62b45e5a8c3 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -18,7 +18,7 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
 	struct msm_gpu *gpu = submit->gpu;
 	int i;
 
-	submit->hw_fence = msm_fence_alloc(fctx);
+	msm_fence_init(submit->hw_fence, fctx);
 
 	for (i = 0; i < submit->nr_bos; i++) {
 		struct drm_gem_object *obj = &submit->bos[i].obj->base;
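Taken together, the hunks above confine the allocation to submit creation, leaving only initialization in the signaling path — the resulting call flow, for illustration:

	/*
	 * ioctl path (allocation allowed):
	 *   submit_create()
	 *     submit->hw_fence = msm_fence_alloc();     // GFP_KERNEL ok
	 *
	 * fence-signaling path (no allocation):
	 *   msm_job_run()
	 *     msm_fence_init(submit->hw_fence, fctx);
	 */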
On 20.03.23 15:43, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> Avoid allocating memory in job_run() by pre-allocating the hw_fence.
>
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/msm_fence.c      | 12 +++++++++---
>  drivers/gpu/drm/msm/msm_fence.h      |  3 ++-
>  drivers/gpu/drm/msm/msm_gem_submit.c |  7 +++++++
>  drivers/gpu/drm/msm/msm_ringbuffer.c |  2 +-
>  4 files changed, 19 insertions(+), 5 deletions(-)
>
[...]
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index be4bf77103cd..2570c018b0cb 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -41,6 +41,13 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>  	if (!submit)
>  		return ERR_PTR(-ENOMEM);
>
> +	submit->hw_fence = msm_fence_alloc();
> +	if (IS_ERR(submit->hw_fence)) {
> +		ret = PTR_ERR(submit->hw_fence);
> +		kfree(submit);
> +		return ERR_PTR(ret);
> +	}
> +
>  	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
>  	if (ret) {
>  		kfree(submit);

You probably need some error handling here or otherwise leak
submit->hw_fence.

Apart from that looks good to me.

Christian.

[...]
On Mon, Mar 20, 2023 at 9:52 AM Christian König
<christian.koenig@amd.com> wrote:
>
> On 20.03.23 15:43, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > Avoid allocating memory in job_run() by pre-allocating the hw_fence.
> >
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
[...]
> > +	submit->hw_fence = msm_fence_alloc();
> > +	if (IS_ERR(submit->hw_fence)) {
> > +		ret = PTR_ERR(submit->hw_fence);
> > +		kfree(submit);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> >  	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
> >  	if (ret) {
> >  		kfree(submit);
>
> You probably need some error handling here or otherwise leak
> submit->hw_fence.

ah, right.. thx

BR,
-R

> Apart from that looks good to me.
>
> Christian.

[...]
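For context, the fix agreed on above would presumably just free the pre-allocated fence in submit_create()'s existing error path — a sketch of the likely shape, not the actual follow-up hunk:

	ret = drm_sched_job_init(&submit->base, queue->entity, queue);
	if (ret) {
		/* hw_fence is not dma_fence_init()ed yet, plain kfree is ok */
		kfree(submit->hw_fence);
		kfree(submit);
		return ERR_PTR(ret);
	}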
From: Rob Clark <robdclark@chromium.org>
Now that everything that controls which LRU an obj lives in *except* the backing pages is protected by the LRU lock, add a special path to unpin in the job_run() path, where we are assured that we already have backing pages and will not be racing against eviction (because the GEM object's dma_resv contains the fence that will be signaled when the submit/job completes).
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/msm_gem.c        | 44 +++++++++++++++++++++++-----
 drivers/gpu/drm/msm/msm_gem.h        |  1 +
 drivers/gpu/drm/msm/msm_ringbuffer.c |  4 +--
 3 files changed, 39 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/msm/msm_gem.c b/drivers/gpu/drm/msm/msm_gem.c
index d0ac3e704b66..9628e8d8dd02 100644
--- a/drivers/gpu/drm/msm/msm_gem.c
+++ b/drivers/gpu/drm/msm/msm_gem.c
@@ -61,18 +61,14 @@ static void sync_for_cpu(struct msm_gem_object *msm_obj)
 	dma_unmap_sgtable(dev, msm_obj->sgt, DMA_BIDIRECTIONAL, 0);
 }
 
-static void update_lru_locked(struct drm_gem_object *obj)
+static void update_lru_active(struct drm_gem_object *obj)
 {
 	struct msm_drm_private *priv = obj->dev->dev_private;
 	struct msm_gem_object *msm_obj = to_msm_bo(obj);
 
-	msm_gem_assert_locked(&msm_obj->base);
-
-	if (!msm_obj->pages) {
-		GEM_WARN_ON(msm_obj->pin_count);
+	GEM_WARN_ON(!msm_obj->pages);
 
-		drm_gem_lru_move_tail_locked(&priv->lru.unbacked, obj);
-	} else if (msm_obj->pin_count) {
+	if (msm_obj->pin_count) {
 		drm_gem_lru_move_tail_locked(&priv->lru.pinned, obj);
 	} else if (msm_obj->madv == MSM_MADV_WILLNEED) {
 		drm_gem_lru_move_tail_locked(&priv->lru.willneed, obj);
@@ -83,6 +79,22 @@ static void update_lru_locked(struct drm_gem_object *obj)
 	}
 }
 
+static void update_lru_locked(struct drm_gem_object *obj)
+{
+	struct msm_drm_private *priv = obj->dev->dev_private;
+	struct msm_gem_object *msm_obj = to_msm_bo(obj);
+
+	msm_gem_assert_locked(&msm_obj->base);
+
+	if (!msm_obj->pages) {
+		GEM_WARN_ON(msm_obj->pin_count);
+
+		drm_gem_lru_move_tail_locked(&priv->lru.unbacked, obj);
+	} else {
+		update_lru_active(obj);
+	}
+}
+
 static void update_lru(struct drm_gem_object *obj)
 {
 	struct msm_drm_private *priv = obj->dev->dev_private;
@@ -489,6 +501,24 @@ void msm_gem_unpin_locked(struct drm_gem_object *obj)
 	mutex_unlock(&priv->lru.lock);
 }
 
+/* Special unpin path for use in fence-signaling path, avoiding the need
+ * to hold the obj lock by only depending on things that are protected by
+ * the LRU lock.  In particular we know that we already have backing
+ * and that the object's dma_resv has the fence for the current
+ * submit/job which will prevent us racing against page eviction.
+ */
+void msm_gem_unpin_active(struct drm_gem_object *obj)
+{
+	struct msm_drm_private *priv = obj->dev->dev_private;
+	struct msm_gem_object *msm_obj = to_msm_bo(obj);
+
+	mutex_lock(&priv->lru.lock);
+	msm_obj->pin_count--;
+	GEM_WARN_ON(msm_obj->pin_count < 0);
+	update_lru_active(obj);
+	mutex_unlock(&priv->lru.lock);
+}
+
 struct msm_gem_vma *msm_gem_get_vma_locked(struct drm_gem_object *obj,
 					   struct msm_gem_address_space *aspace)
 {
diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index 0057e8e8fa13..2bd6846c83a9 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -128,6 +128,7 @@ struct msm_gem_object {
 uint64_t msm_gem_mmap_offset(struct drm_gem_object *obj);
 int msm_gem_pin_vma_locked(struct drm_gem_object *obj, struct msm_gem_vma *vma);
 void msm_gem_unpin_locked(struct drm_gem_object *obj);
+void msm_gem_unpin_active(struct drm_gem_object *obj);
 struct msm_gem_vma *msm_gem_get_vma_locked(struct drm_gem_object *obj,
 					   struct msm_gem_address_space *aspace);
 int msm_gem_get_iova(struct drm_gem_object *obj,
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 31b4fbf96c36..b60199184409 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -24,9 +24,7 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
 		struct drm_gem_object *obj = &submit->bos[i].obj->base;
 
 		msm_gem_vma_unpin_fenced(submit->bos[i].vma, fctx);
-		msm_gem_lock(obj);
-		msm_gem_unpin_locked(obj);
-		msm_gem_unlock(obj);
+		msm_gem_unpin_active(obj);
 
 		submit->bos[i].flags &= ~(BO_VMA_PINNED | BO_OBJ_PINNED);
 	}
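Background for the "will not be racing against eviction" claim: eviction only considers an object idle once every fence in its dma_resv has signaled, and the in-flight job's hw_fence is attached there at submit time. Roughly, on the shrinker side (a sketch, not msm's exact code):

	/* the in-flight job's fence keeps this false until completion */
	if (!dma_resv_test_signaled(obj->resv, dma_resv_usage_rw(true)))
		return false;   /* still active, do not evict */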
On Mon, 20 Mar 2023 07:43:22 -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> Inspired by https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffw...
> it seemed like a good idea to get rid of memory allocation in the
> job_run() fence-signaling path, and to use lockdep annotations to yell
> at us about anything that could deadlock against shrinker/reclaim.
> Anything that can trigger reclaim, or block on any other thread that
> has triggered reclaim, can block the GPU shrinker from releasing
> memory if it is waiting for the job to complete, causing deadlock.
[...]
Applied, thanks!
[20/23] soc: qcom: smd-rpm: Use GFP_ATOMIC in write path
        commit: 5808c532ca0a983d643319caca44f2bcb148298f
Best regards,