Hi,
This patch series fixes the issue introduced in cf586021642d80 ("drm/i915/gt: Pipelined page migration") where, as reported by Matt, in a chain of requests an error is reported only if it happens in the last request.
However, Chris noticed that without ensuring exclusivity in the locking we might end up in a deadlock. That's why Chris' patch throttles for the ringspace in order to make sure that no one is holding it.
Version 1 of this patch has been reviewed by Matt and this version adds Chris' exclusive locking.
Thanks Chris for this work.
Andi
Changelog
=========
v4 -> v5
 - add timeline locking also in the copy operation, which was forgotten in v4.
 - rearrange the patches in order to avoid a bisect break.
v3 -> v4
 - In v3 the timeline was being locked, but I forgot that request_create() and request_add() also lock the timeline: the former does the locking, the latter does the unlocking. In order to avoid this extra lock/unlock, we need the "_locked" version of said functions.
v2 -> v3
 - Really lock the timeline before generating all the requests until the last.
v1 -> v2
 - Add patch 1 for ensuring exclusive locking of the timeline
 - Reword git commit of patch 2.
Andi Shyti (4):
  drm/i915/gt: Add intel_context_timeline_is_locked helper
  drm/i915: Create the locked version of the request create
  drm/i915: Create the locked version of the request add
  drm/i915/gt: Make sure that errors are propagated through request chains
Chris Wilson (1):
  drm/i915: Throttle for ringspace prior to taking the timeline mutex
 drivers/gpu/drm/i915/gt/intel_context.c | 41 ++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_context.h |  8 ++++
 drivers/gpu/drm/i915/gt/intel_migrate.c | 51 +++++++++++++++++------
 drivers/gpu/drm/i915/i915_request.c     | 55 +++++++++++++++++++------
 drivers/gpu/drm/i915/i915_request.h     |  3 ++
 5 files changed, 133 insertions(+), 25 deletions(-)
We have:
- intel_context_timeline_lock()
- intel_context_timeline_unlock()
In the next patches we will also need:
- intel_context_timeline_is_locked()
Add it.
Signed-off-by: Andi Shyti andi.shyti@linux.intel.com
Cc: stable@vger.kernel.org
Reviewed-by: Nirmoy Das nirmoy.das@intel.com
---
 drivers/gpu/drm/i915/gt/intel_context.h | 6 ++++++
 1 file changed, 6 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index 48f888c3da083..f2f79ff0dfd1d 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -270,6 +270,12 @@ static inline void intel_context_timeline_unlock(struct intel_timeline *tl)
 	mutex_unlock(&tl->mutex);
 }
 
+static inline void intel_context_assert_timeline_is_locked(struct intel_timeline *tl)
+	__must_hold(&tl->mutex)
+{
+	lockdep_assert_held(&tl->mutex);
+}
+
 int intel_context_prepare_remote_request(struct intel_context *ce,
 					 struct i915_request *rq);
On 12.04.2023 13:33, Andi Shyti wrote:
We have:
- intel_context_timeline_lock()
- intel_context_timeline_unlock()
In the next patches we will also need:
- intel_context_timeline_is_locked()
Add it.
Signed-off-by: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Reviewed-by: Nirmoy Das nirmoy.das@intel.com
Reviewed-by: Andrzej Hajda andrzej.hajda@intel.com
Regards Andrzej
Make a version of the request creation that doesn't take any lock.
Signed-off-by: Andi Shyti andi.shyti@linux.intel.com
Cc: stable@vger.kernel.org
Reviewed-by: Nirmoy Das nirmoy.das@intel.com
---
 drivers/gpu/drm/i915/i915_request.c | 38 +++++++++++++++++++++--------
 drivers/gpu/drm/i915/i915_request.h |  2 ++
 2 files changed, 30 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 630a732aaecca..58662360ac34e 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1028,15 +1028,11 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
 	return ERR_PTR(ret);
 }
 
-struct i915_request *
-i915_request_create(struct intel_context *ce)
+static struct i915_request *
+__i915_request_create_locked(struct intel_context *ce)
 {
+	struct intel_timeline *tl = ce->timeline;
 	struct i915_request *rq;
-	struct intel_timeline *tl;
-
-	tl = intel_context_timeline_lock(ce);
-	if (IS_ERR(tl))
-		return ERR_CAST(tl);
 
 	/* Move our oldest request to the slab-cache (if not in use!) */
 	rq = list_first_entry(&tl->requests, typeof(*rq), link);
@@ -1046,16 +1042,38 @@ i915_request_create(struct intel_context *ce)
 	intel_context_enter(ce);
 	rq = __i915_request_create(ce, GFP_KERNEL);
 	intel_context_exit(ce); /* active reference transferred to request */
+
 	if (IS_ERR(rq))
-		goto err_unlock;
+		return rq;
 
 	/* Check that we do not interrupt ourselves with a new request */
 	rq->cookie = lockdep_pin_lock(&tl->mutex);
 
 	return rq;
+}
+
+struct i915_request *
+i915_request_create_locked(struct intel_context *ce)
+{
+	intel_context_assert_timeline_is_locked(ce->timeline);
+
+	return __i915_request_create_locked(ce);
+}
+
+struct i915_request *
+i915_request_create(struct intel_context *ce)
+{
+	struct i915_request *rq;
+	struct intel_timeline *tl;
+
+	tl = intel_context_timeline_lock(ce);
+	if (IS_ERR(tl))
+		return ERR_CAST(tl);
+
+	rq = __i915_request_create_locked(ce);
+	if (IS_ERR(rq))
+		intel_context_timeline_unlock(tl);
 
-err_unlock:
-	intel_context_timeline_unlock(tl);
 	return rq;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index f5e1bb5e857aa..bb48bd4605c03 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -374,6 +374,8 @@ struct i915_request * __must_check
 __i915_request_create(struct intel_context *ce, gfp_t gfp);
 struct i915_request * __must_check
 i915_request_create(struct intel_context *ce);
+struct i915_request * __must_check
+i915_request_create_locked(struct intel_context *ce);
 
 void __i915_request_skip(struct i915_request *rq);
 bool i915_request_set_error_once(struct i915_request *rq, int error);
On 12.04.2023 13:33, Andi Shyti wrote:
Make a version of the request creation that doesn't take any lock.
Signed-off-by: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Reviewed-by: Nirmoy Das nirmoy.das@intel.com
drivers/gpu/drm/i915/i915_request.c | 38 +++++++++++++++++++++-------- drivers/gpu/drm/i915/i915_request.h | 2 ++ 2 files changed, 30 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 630a732aaecca..58662360ac34e 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1028,15 +1028,11 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
 	return ERR_PTR(ret);
 }
 
-struct i915_request *
-i915_request_create(struct intel_context *ce)
+static struct i915_request *
+__i915_request_create_locked(struct intel_context *ce)
 {
+	struct intel_timeline *tl = ce->timeline;
 	struct i915_request *rq;
-	struct intel_timeline *tl;
-
-	tl = intel_context_timeline_lock(ce);
-	if (IS_ERR(tl))
-		return ERR_CAST(tl);
 
 	/* Move our oldest request to the slab-cache (if not in use!) */
 	rq = list_first_entry(&tl->requests, typeof(*rq), link);
@@ -1046,16 +1042,38 @@ i915_request_create(struct intel_context *ce)
 	intel_context_enter(ce);
 	rq = __i915_request_create(ce, GFP_KERNEL);
 	intel_context_exit(ce); /* active reference transferred to request */
+
 	if (IS_ERR(rq))
-		goto err_unlock;
+		return rq;
 
 	/* Check that we do not interrupt ourselves with a new request */
 	rq->cookie = lockdep_pin_lock(&tl->mutex);
 
 	return rq;
+}
+
+struct i915_request *
+i915_request_create_locked(struct intel_context *ce)
+{
+	intel_context_assert_timeline_is_locked(ce->timeline);
+
+	return __i915_request_create_locked(ce);
+}
I wonder if we really need to have such granularity? Leaving only i915_request_create_locked and removing __i915_request_create_locked would simplify the code a little bit. I guess the cost of calling intel_context_assert_timeline_is_locked twice sometimes is not big, or maybe it can be re-arranged - up to you.
Reviewed-by: Andrzej Hajda andrzej.hajda@intel.com
Regards Andrzej
Hi Andrzej,
Make a version of the request creation that doesn't take any lock.
Signed-off-by: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Reviewed-by: Nirmoy Das nirmoy.das@intel.com
drivers/gpu/drm/i915/i915_request.c | 38 +++++++++++++++++++++-------- drivers/gpu/drm/i915/i915_request.h | 2 ++ 2 files changed, 30 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 630a732aaecca..58662360ac34e 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1028,15 +1028,11 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
 	return ERR_PTR(ret);
 }
 
-struct i915_request *
-i915_request_create(struct intel_context *ce)
+static struct i915_request *
+__i915_request_create_locked(struct intel_context *ce)
 {
+	struct intel_timeline *tl = ce->timeline;
 	struct i915_request *rq;
-	struct intel_timeline *tl;
-
-	tl = intel_context_timeline_lock(ce);
-	if (IS_ERR(tl))
-		return ERR_CAST(tl);
 
 	/* Move our oldest request to the slab-cache (if not in use!) */
 	rq = list_first_entry(&tl->requests, typeof(*rq), link);
@@ -1046,16 +1042,38 @@ i915_request_create(struct intel_context *ce)
 	intel_context_enter(ce);
 	rq = __i915_request_create(ce, GFP_KERNEL);
 	intel_context_exit(ce); /* active reference transferred to request */
+
 	if (IS_ERR(rq))
-		goto err_unlock;
+		return rq;
 
 	/* Check that we do not interrupt ourselves with a new request */
 	rq->cookie = lockdep_pin_lock(&tl->mutex);
 
 	return rq;
+}
+
+struct i915_request *
+i915_request_create_locked(struct intel_context *ce)
+{
+	intel_context_assert_timeline_is_locked(ce->timeline);
+
+	return __i915_request_create_locked(ce);
+}
I wonder if we really need to have such granularity? Leaving only i915_request_create_locked and removing __i915_request_create_locked would simplify the code a little bit. I guess the cost of calling intel_context_assert_timeline_is_locked twice sometimes is not big, or maybe it can be re-arranged - up to you.
There is some usage of such granularity in patch 4, where I am adding the throttle on the timeline. I am adding it in the "_locked" version to avoid potential deadlocks coming from selftests (and from the real world?).
Here I'd love to have some comments from Chris and Matt.
I might still add this in the commit message:
"i915_request_create_locked() is now empty but will be used in later commits where a throttle on the ringspace will be executed to ensure exclusive ownership of the timeline."
Reviewed-by: Andrzej Hajda andrzej.hajda@intel.com
Thanks!
Andi
i915_request_add() assumes that the timeline is locked whtn the function is called. Before exiting it releases the lock. But in the next commit we have one case where releasing the timeline mutex is not necessary and we don't want that.
Make a new i915_request_add_locked() version of the function where the lock is not released.
Signed-off-by: Andi Shyti andi.shyti@linux.intel.com
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/i915_request.c | 14 +++++++++++---
 drivers/gpu/drm/i915/i915_request.h |  1 +
 2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 58662360ac34e..21032b3b9d330 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1852,13 +1852,13 @@ void __i915_request_queue(struct i915_request *rq,
 	local_bh_enable(); /* kick tasklets */
 }
 
-void i915_request_add(struct i915_request *rq)
+void i915_request_add_locked(struct i915_request *rq)
 {
 	struct intel_timeline * const tl = i915_request_timeline(rq);
 	struct i915_sched_attr attr = {};
 	struct i915_gem_context *ctx;
 
-	lockdep_assert_held(&tl->mutex);
+	intel_context_assert_timeline_is_locked(tl);
 	lockdep_unpin_lock(&tl->mutex, rq->cookie);
 
 	trace_i915_request_add(rq);
@@ -1873,7 +1873,15 @@ void i915_request_add(struct i915_request *rq)
 
 	__i915_request_queue(rq, &attr);
 
-	mutex_unlock(&tl->mutex);
+}
+
+void i915_request_add(struct i915_request *rq)
+{
+	struct intel_timeline * const tl = i915_request_timeline(rq);
+
+	i915_request_add_locked(rq);
+
+	intel_context_timeline_unlock(tl);
 }
 
 static unsigned long local_clock_ns(unsigned int *cpu)
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index bb48bd4605c03..29e3a37c300a7 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -425,6 +425,7 @@ int i915_request_await_deps(struct i915_request *rq, const struct i915_deps *dep
 int i915_request_await_execution(struct i915_request *rq,
 				 struct dma_fence *fence);
 
+void i915_request_add_locked(struct i915_request *rq);
 void i915_request_add(struct i915_request *rq);
 
 bool __i915_request_submit(struct i915_request *request);
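For reference, here is a condensed sketch of the usage pattern the next commit relies on, adapted from the migrate changes later in this thread (build_chain() and more_work() are made-up placeholders, and error handling is trimmed):

/* Sketch only: build a chain of requests without dropping the timeline lock. */
static int build_chain(struct intel_context *ce)
{
	struct i915_request *rq;
	int err = 0;

	mutex_lock(&ce->timeline->mutex);

	do {
		rq = i915_request_create_locked(ce);
		if (IS_ERR(rq)) {
			err = PTR_ERR(rq);
			break;
		}

		/* ... emit the commands for this chunk ... */

		i915_request_add_locked(rq); /* queues rq, timeline stays locked */
	} while (more_work(ce)); /* placeholder loop condition */

	mutex_unlock(&ce->timeline->mutex);

	return err;
}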
On 12.04.2023 13:33, Andi Shyti wrote:
i915_request_add() assumes that the timeline is locked whtn the
*when
function is called. Before exiting it releases the lock. But in the next commit we have one case where releasing the timeline mutex is not necessary and we don't want that.
Make a new i915_request_add_locked() version of the function where the lock is not released.
Signed-off-by: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org
Have you looked for other potential users of these new helpers?
Reviewed-by: Andrzej Hajda andrzej.hajda@intel.com
Regards Andrzej
Hi Andrzej,
On Wed, Apr 12, 2023 at 03:06:42PM +0200, Andrzej Hajda wrote:
On 12.04.2023 13:33, Andi Shyti wrote:
i915_request_add() assumes that the timeline is locked whtn the
*when
function is called. Before exiting it releases the lock. But in the next commit we have one case where releasing the timeline mutex is not necessary and we don't want that.
Make a new i915_request_add_locked() version of the function where the lock is not released.
Signed-off-by: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org
Have you looked for other potential users of these new helpers?
not yet, will do!
Reviewed-by: Andrzej Hajda andrzej.hajda@intel.com
Thanks!
Andi
From: Chris Wilson chris@chris-wilson.co.uk
Before taking exclusive ownership of the ring for emitting the request, wait for space in the ring to become available. This allows others to take the timeline->mutex to make forward progress while userspace is blocked.
In particular, this allows regular clients to issue requests on the kernel context, potentially filling the ring, but allow the higher priority heartbeats and pulses to still be submitted without being blocked by the less critical work.
Signed-off-by: Chris Wilson chris.p.wilson@linux.intel.com
Cc: Maciej Patelczyk maciej.patelczyk@intel.com
Cc: stable@vger.kernel.org
Signed-off-by: Andi Shyti andi.shyti@linux.intel.com
Reviewed-by: Andrzej Hajda andrzej.hajda@intel.com
---
 drivers/gpu/drm/i915/gt/intel_context.c | 41 +++++++++++++++++++++++++
 drivers/gpu/drm/i915/gt/intel_context.h |  2 ++
 drivers/gpu/drm/i915/i915_request.c     |  3 ++
 3 files changed, 46 insertions(+)
diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index 2aa63ec521b89..59cd612a23561 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -626,6 +626,47 @@ bool intel_context_revoke(struct intel_context *ce)
 	return ret;
 }
 
+int intel_context_throttle(const struct intel_context *ce)
+{
+	const struct intel_ring *ring = ce->ring;
+	const struct intel_timeline *tl = ce->timeline;
+	struct i915_request *rq;
+	int err = 0;
+
+	if (READ_ONCE(ring->space) >= SZ_1K)
+		return 0;
+
+	rcu_read_lock();
+	list_for_each_entry_reverse(rq, &tl->requests, link) {
+		if (__i915_request_is_complete(rq))
+			break;
+
+		if (rq->ring != ring)
+			continue;
+
+		/* Wait until there will be enough space following that rq */
+		if (__intel_ring_space(rq->postfix,
+				       ring->emit,
+				       ring->size) < ring->size / 2) {
+			if (i915_request_get_rcu(rq)) {
+				rcu_read_unlock();
+
+				if (i915_request_wait(rq,
+						      I915_WAIT_INTERRUPTIBLE,
+						      MAX_SCHEDULE_TIMEOUT) < 0)
+					err = -EINTR;
+
+				rcu_read_lock();
+				i915_request_put(rq);
+			}
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return err;
+}
+
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_context.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index f2f79ff0dfd1d..c0db00ac6b950 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -233,6 +233,8 @@ static inline void intel_context_exit(struct intel_context *ce)
 		ce->ops->exit(ce);
 }
 
+int intel_context_throttle(const struct intel_context *ce);
+
 static inline struct intel_context *intel_context_get(struct intel_context *ce)
 {
 	kref_get(&ce->ref);
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 21032b3b9d330..0b7c6aede0c6b 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -1057,6 +1057,9 @@ i915_request_create_locked(struct intel_context *ce)
 {
 	intel_context_assert_timeline_is_locked(ce->timeline);
 
+	if (intel_context_throttle(ce))
+		return ERR_PTR(-EINTR);
+
 	return __i915_request_create_locked(ce);
 }
Currently, when we perform operations such as clearing or copying large blocks of memory, we generate multiple requests that are executed in a chain.
However, if one of these requests fails, we may not realize it unless it happens to be the last request in the chain. This is because errors are not properly propagated.
For this we need to keep propagating the chain of fence notification in order to always reach the final fence associated to the final request.
To address this issue, we need to ensure that the chain of fence notifications is always propagated so that we can reach the final fence associated with the last request. By doing so, we will be able to detect any memory operation failures and determine whether the memory is still invalid.
On copy and clear migration signal fences upon completion.
On copy and clear migration, signal fences upon request completion to ensure that we have a reliable perpetuation of the operation outcome.
Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration")
Reported-by: Matthew Auld matthew.auld@intel.com
Suggested-by: Chris Wilson chris@chris-wilson.co.uk
Signed-off-by: Andi Shyti andi.shyti@linux.intel.com
Cc: stable@vger.kernel.org
Reviewed-by: Matthew Auld matthew.auld@intel.com
Acked-by: Nirmoy Das nirmoy.das@intel.com
---
 drivers/gpu/drm/i915/gt/intel_migrate.c | 51 +++++++++++++++++++------
 1 file changed, 39 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 3f638f1987968..668c95af8cbcf 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -742,13 +742,19 @@ intel_context_migrate_copy(struct intel_context *ce,
 		dst_offset = 2 * CHUNK_SZ;
 	}
 
+	/*
+	 * While building the chain of requests, we need to ensure
+	 * that no one can sneak into the timeline unnoticed.
+	 */
+	mutex_lock(&ce->timeline->mutex);
+
 	do {
 		int len;
 
-		rq = i915_request_create(ce);
+		rq = i915_request_create_locked(ce);
 		if (IS_ERR(rq)) {
 			err = PTR_ERR(rq);
-			goto out_ce;
+			break;
 		}
 
 		if (deps) {
@@ -878,10 +884,14 @@ intel_context_migrate_copy(struct intel_context *ce,
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
-		if (*out)
+		i915_sw_fence_await(&rq->submit);
+		i915_request_get(rq);
+		i915_request_add_locked(rq);
+		if (*out) {
+			i915_sw_fence_complete(&(*out)->submit);
 			i915_request_put(*out);
-		*out = i915_request_get(rq);
-		i915_request_add(rq);
+		}
+		*out = rq;
 
 		if (err)
 			break;
@@ -905,7 +915,10 @@ intel_context_migrate_copy(struct intel_context *ce,
 		cond_resched();
 	} while (1);
 
-out_ce:
+	mutex_unlock(&ce->timeline->mutex);
+
+	if (*out)
+		i915_sw_fence_complete(&(*out)->submit);
 	return err;
 }
 
@@ -999,13 +1012,19 @@ intel_context_migrate_clear(struct intel_context *ce,
 	if (HAS_64K_PAGES(i915) && is_lmem)
 		offset = CHUNK_SZ;
 
+	/*
+	 * While building the chain of requests, we need to ensure
+	 * that no one can sneak into the timeline unnoticed.
+	 */
+	mutex_lock(&ce->timeline->mutex);
+
 	do {
 		int len;
 
-		rq = i915_request_create(ce);
+		rq = i915_request_create_locked(ce);
 		if (IS_ERR(rq)) {
 			err = PTR_ERR(rq);
-			goto out_ce;
+			break;
 		}
 
 		if (deps) {
@@ -1056,17 +1075,25 @@ intel_context_migrate_clear(struct intel_context *ce,
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
-		if (*out)
+		i915_sw_fence_await(&rq->submit);
+		i915_request_get(rq);
+		i915_request_add_locked(rq);
+		if (*out) {
+			i915_sw_fence_complete(&(*out)->submit);
 			i915_request_put(*out);
-		*out = i915_request_get(rq);
-		i915_request_add(rq);
+		}
+		*out = rq;
+
 		if (err || !it.sg || !sg_dma_len(it.sg))
 			break;
 
 		cond_resched();
 	} while (1);
 
-out_ce:
+	mutex_unlock(&ce->timeline->mutex);
+
+	if (*out)
+		i915_sw_fence_complete(&(*out)->submit);
 	return err;
 }
On 12/04/2023 12:33, Andi Shyti wrote:
Currently, when we perform operations such as clearing or copying large blocks of memory, we generate multiple requests that are executed in a chain.
However, if one of these requests fails, we may not realize it unless it happens to be the last request in the chain. This is because errors are not properly propagated.
For this we need to keep propagating the chain of fence notification in order to always reach the final fence associated to the final request.
To address this issue, we need to ensure that the chain of fence notifications is always propagated so that we can reach the final fence associated with the last request. By doing so, we will be able to detect any memory operation failures and determine whether the memory is still invalid.
Above two paragraphs seems to have redundancy in the message they convey.
On copy and clear migration signal fences upon completion.
On copy and clear migration, signal fences upon request completion to ensure that we have a reliable perpetuation of the operation outcome.
These two too. So I think commit message can be a bit polished.
Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration") Reported-by: Matthew Auld matthew.auld@intel.com Suggested-by: Chris Wilson chris@chris-wilson.co.uk Signed-off-by: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Reviewed-by: Matthew Auld matthew.auld@intel.com Acked-by: Nirmoy Das nirmoy.das@intel.com
drivers/gpu/drm/i915/gt/intel_migrate.c | 51 +++++++++++++++++++------ 1 file changed, 39 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 3f638f1987968..668c95af8cbcf 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -742,13 +742,19 @@ intel_context_migrate_copy(struct intel_context *ce,
 		dst_offset = 2 * CHUNK_SZ;
 	}
 
+	/*
+	 * While building the chain of requests, we need to ensure
+	 * that no one can sneak into the timeline unnoticed.
+	 */
+	mutex_lock(&ce->timeline->mutex);
+
 	do {
 		int len;
 
-		rq = i915_request_create(ce);
+		rq = i915_request_create_locked(ce);
 		if (IS_ERR(rq)) {
 			err = PTR_ERR(rq);
-			goto out_ce;
+			break;
 		}
 
 		if (deps) {
@@ -878,10 +884,14 @@ intel_context_migrate_copy(struct intel_context *ce,
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
-		if (*out)
+		i915_sw_fence_await(&rq->submit);
+		i915_request_get(rq);
+		i915_request_add_locked(rq);
+		if (*out) {
+			i915_sw_fence_complete(&(*out)->submit);
 			i915_request_put(*out);
Could you help me understand this please. I have a few questions - first, what are the actual mechanics of fence error transfer here? I see the submit fence is being blocked until the next request is submitted - effectively previous request is only allowed to get on the hardware after the next one has been queued up. But I don't immediately see what that does in practice.
Second question relates to the need to hold the timeline mutex throughout. Presumably this is so two copy or migrate operations on the same context do not interleave, which can otherwise happen?
Would the error propagation be doable without the lock held by chaining on the previous request _completion_ fence? If so I am sure that would have a performance impact, because chunk by chunk would need a GPU<->CPU round trip to schedule. How much of an impact I don't know. Maybe enlarging CHUNK_SZ to compensate is an option?
Or if the perf hit would be bearable for stable backports only (much smaller patch) and then for tip we can do this full speed solution.
But yes, I would first want to understand the actual error propagation mechanism because sadly my working knowledge is a bit rusty.
-		*out = i915_request_get(rq);
-		i915_request_add(rq);
+		}
+		*out = rq;
 
 		if (err)
 			break;
@@ -905,7 +915,10 @@ intel_context_migrate_copy(struct intel_context *ce,
 		cond_resched();
 	} while (1);
 
-out_ce:
+	mutex_unlock(&ce->timeline->mutex);
+
+	if (*out)
+		i915_sw_fence_complete(&(*out)->submit);
 	return err;
 }
@@ -999,13 +1012,19 @@ intel_context_migrate_clear(struct intel_context *ce,
 	if (HAS_64K_PAGES(i915) && is_lmem)
 		offset = CHUNK_SZ;
 
+	/*
+	 * While building the chain of requests, we need to ensure
+	 * that no one can sneak into the timeline unnoticed.
+	 */
+	mutex_lock(&ce->timeline->mutex);
+
 	do {
 		int len;
 
-		rq = i915_request_create(ce);
+		rq = i915_request_create_locked(ce);
 		if (IS_ERR(rq)) {
 			err = PTR_ERR(rq);
-			goto out_ce;
+			break;
 		}
 
 		if (deps) {
@@ -1056,17 +1075,25 @@ intel_context_migrate_clear(struct intel_context *ce,
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
-		if (*out)
+		i915_sw_fence_await(&rq->submit);
+		i915_request_get(rq);
+		i915_request_add_locked(rq);
+		if (*out) {
+			i915_sw_fence_complete(&(*out)->submit);
 			i915_request_put(*out);
-		*out = i915_request_get(rq);
-		i915_request_add(rq);
+		}
+		*out = rq;
Btw if all else fails perhaps these two blocks can be consolidated by something like __chain_requests(rq, out) with all these operations in it. Not sure how much that would save in the grand total.
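For illustration, such a helper could look roughly like the sketch below, collecting the operations from the two out_rq blocks above (the name __chain_requests and its exact signature are hypothetical):

static void __chain_requests(struct i915_request *rq, struct i915_request **out)
{
	/* Keep rq from being submitted until its successor is queued. */
	i915_sw_fence_await(&rq->submit);
	i915_request_get(rq);
	i915_request_add_locked(rq);

	if (*out) {
		/* Release the previous link now that rq is queued behind it. */
		i915_sw_fence_complete(&(*out)->submit);
		i915_request_put(*out);
	}
	*out = rq;
}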
Regards,
Tvrtko
On 13/04/2023 12:56, Tvrtko Ursulin wrote:
On 12/04/2023 12:33, Andi Shyti wrote:
Currently, when we perform operations such as clearing or copying large blocks of memory, we generate multiple requests that are executed in a chain.
However, if one of these requests fails, we may not realize it unless it happens to be the last request in the chain. This is because errors are not properly propagated.
For this we need to keep propagating the chain of fence notification in order to always reach the final fence associated to the final request.
To address this issue, we need to ensure that the chain of fence notifications is always propagated so that we can reach the final fence associated with the last request. By doing so, we will be able to detect any memory operation failures and determine whether the memory is still invalid.
Above two paragraphs seems to have redundancy in the message they convey.
On copy and clear migration signal fences upon completion.
On copy and clear migration, signal fences upon request completion to ensure that we have a reliable perpetuation of the operation outcome.
These two too. So I think commit message can be a bit polished.
Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration") Reported-by: Matthew Auld matthew.auld@intel.com Suggested-by: Chris Wilson chris@chris-wilson.co.uk Signed-off-by: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Reviewed-by: Matthew Auld matthew.auld@intel.com Acked-by: Nirmoy Das nirmoy.das@intel.com
drivers/gpu/drm/i915/gt/intel_migrate.c | 51 +++++++++++++++++++------ 1 file changed, 39 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 3f638f1987968..668c95af8cbcf 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -742,13 +742,19 @@ intel_context_migrate_copy(struct intel_context *ce,
 		dst_offset = 2 * CHUNK_SZ;
 	}
 
+	/*
+	 * While building the chain of requests, we need to ensure
+	 * that no one can sneak into the timeline unnoticed.
+	 */
+	mutex_lock(&ce->timeline->mutex);
+
 	do {
 		int len;
 
-		rq = i915_request_create(ce);
+		rq = i915_request_create_locked(ce);
 		if (IS_ERR(rq)) {
 			err = PTR_ERR(rq);
-			goto out_ce;
+			break;
 		}
 
 		if (deps) {
@@ -878,10 +884,14 @@ intel_context_migrate_copy(struct intel_context *ce,
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
-		if (*out)
+		i915_sw_fence_await(&rq->submit);
+		i915_request_get(rq);
+		i915_request_add_locked(rq);
+		if (*out) {
+			i915_sw_fence_complete(&(*out)->submit);
 			i915_request_put(*out);
Could you help me understand this please. I have a few questions - first, what are the actual mechanics of fence error transfer here? I see the submit fence is being blocked until the next request is submitted - effectively previous request is only allowed to get on the hardware after the next one has been queued up. But I don't immediately see what that does in practice.
Second question relates to the need to hold the timeline mutex throughout. Presumably this is so two copy or migrate operations on the same context do not interleave, which can otherwise happen?
Would the error propagation be doable without the lock held by chaining on the previous request _completion_ fence? If so I am sure that would have a performance impact, because chunk by chunk would need a GPU<->CPU round trip to schedule. How much of an impact I don't know. Maybe enlarging CHUNK_SZ to compensate is an option?
Or if the perf hit would be bearable for stable backports only (much smaller patch) and then for tip we can do this full speed solution.
But yes, I would first want to understand the actual error propagation mechanism because sadly my working knowledge is a bit rusty.
Another option - maybe - is this related to revert of fence error propagation? If it is and having that would avoid the need for this invasive fix, maybe we unrevert 3761baae908a7b5012be08d70fa553cc2eb82305 with edits to limit to special contexts? If doable..
Regards,
Tvrtko
Hi Tvrtko,
Another option - maybe - is this related to revert of fence error propagation? If it is and having that would avoid the need for this invasive fix, maybe we unrevert 3761baae908a7b5012be08d70fa553cc2eb82305 with edits to limit to special contexts? If doable..
I think that is not enough, as we want to get to the last request and fence submitted anyway. Right?
I guess this commit should be reverted anyway.
Andi
Hi Tvrtko,
sorry for the very late reply, it's about time to bring this patch up.
On Thu, Apr 13, 2023 at 12:56:00PM +0100, Tvrtko Ursulin wrote:
On 12/04/2023 12:33, Andi Shyti wrote:
Currently, when we perform operations such as clearing or copying large blocks of memory, we generate multiple requests that are executed in a chain.
However, if one of these requests fails, we may not realize it unless it happens to be the last request in the chain. This is because errors are not properly propagated.
For this we need to keep propagating the chain of fence notification in order to always reach the final fence associated to the final request.
To address this issue, we need to ensure that the chain of fence notifications is always propagated so that we can reach the final fence associated with the last request. By doing so, we will be able to detect any memory operation failures and determine whether the memory is still invalid.
Above two paragraphs seems to have redundancy in the message they convey.
On copy and clear migration signal fences upon completion.
On copy and clear migration, signal fences upon request completion to ensure that we have a reliable perpetuation of the operation outcome.
These two too. So I think commit message can be a bit polished.
In my intent to be very explicit I might have exaggerated. I know that this kind of patch might bring some controversy.
I will review the commit.
Fixes: cf586021642d80 ("drm/i915/gt: Pipelined page migration") Reported-by: Matthew Auld matthew.auld@intel.com Suggested-by: Chris Wilson chris@chris-wilson.co.uk Signed-off-by: Andi Shyti andi.shyti@linux.intel.com Cc: stable@vger.kernel.org Reviewed-by: Matthew Auld matthew.auld@intel.com Acked-by: Nirmoy Das nirmoy.das@intel.com
drivers/gpu/drm/i915/gt/intel_migrate.c | 51 +++++++++++++++++++------ 1 file changed, 39 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_migrate.c b/drivers/gpu/drm/i915/gt/intel_migrate.c
index 3f638f1987968..668c95af8cbcf 100644
--- a/drivers/gpu/drm/i915/gt/intel_migrate.c
+++ b/drivers/gpu/drm/i915/gt/intel_migrate.c
@@ -742,13 +742,19 @@ intel_context_migrate_copy(struct intel_context *ce,
 		dst_offset = 2 * CHUNK_SZ;
 	}
 
+	/*
+	 * While building the chain of requests, we need to ensure
+	 * that no one can sneak into the timeline unnoticed.
+	 */
+	mutex_lock(&ce->timeline->mutex);
+
 	do {
 		int len;
 
-		rq = i915_request_create(ce);
+		rq = i915_request_create_locked(ce);
 		if (IS_ERR(rq)) {
 			err = PTR_ERR(rq);
-			goto out_ce;
+			break;
 		}
 
 		if (deps) {
@@ -878,10 +884,14 @@ intel_context_migrate_copy(struct intel_context *ce,
 
 		/* Arbitration is re-enabled between requests. */
 out_rq:
-		if (*out)
+		i915_sw_fence_await(&rq->submit);
+		i915_request_get(rq);
+		i915_request_add_locked(rq);
+		if (*out) {
+			i915_sw_fence_complete(&(*out)->submit);
 			i915_request_put(*out);
Could you help me understand this please. I have a few questions - first, what are the actual mechanics of fence error transfer here? I see the submit fence is being blocked until the next request is submitted - effectively previous request is only allowed to get on the hardware after the next one has been queued up. But I don't immediately see what that does in practice.
This is the basis of the error propagation. Without this serialization, for big operations like migrate and copy, we would only catch an error if it happened in the last rq.
Second question relates to the need to hold the timeline mutex throughout. Presumably this is so two copy or migrate operations on the same context do not interleave, which can otherwise happen?
Would the error propagation be doable without the lock held by chaining on the previous request _completion_ fence? If so I am sure that would have a performance impact, because chunk by chunk would need a GPU<->CPU round trip to schedule. How much of an impact I don't know. Maybe enlarging CHUNK_SZ to compensate is an option?
The need for a mutex lock comes from adding the throttle during request creation, which ensures no pending requests are being served.
I will copy-paste from Chris' review, which was missed on the mailing list:
Adding a large throttle before the mutex makes the race less likely, but to overcome that just increase the number of simultaneous clients fighting for ring space.
If we hold the lock while constructing the chain, no one else may inject themselves between links in our chain. If we do not, we may end up with
ABCDEFGHI
^head   ^tail
Then in order for A to submit its next request it has to wait upon its previous request. But since we are holding the submit fence for A, it will not be executed until after we complete our submission. Boom.
Andi