Move the LNL scheduling WA to xe_device.h so it can be used in other places without needing to repeat the same comment about the future removal of this WA. The WA, which flushes work or workqueues, is now wrapped in macros and can be reused wherever needed.
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com cc: stable@vger.kernel.org # v6.11+ Suggested-by: John Harrison John.C.Harrison@Intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com --- drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++ drivers/gpu/drm/xe/xe_guc_ct.c | 11 +---------- 2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 4c3f0ebe78a9..f1fbfe916867 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -191,4 +191,18 @@ void xe_device_declare_wedged(struct xe_device *xe);
 struct xe_file *xe_file_get(struct xe_file *xef);
 void xe_file_put(struct xe_file *xef);

+/*
+ * Occasionally it is seen that the G2H worker starts running after a delay of more than
+ * a second even after being queued and activated by the Linux workqueue subsystem. This
+ * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
+ * Lunarlake Hybrid CPU. Issue disappears if we disable Lunarlake atom cores from BIOS
+ * and this is beyond xe kmd.
+ *
+ * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
+ */
+#define LNL_FLUSH_WORKQUEUE(wq__) \
+	flush_workqueue(wq__)
+#define LNL_FLUSH_WORK(wrk__) \
+	flush_work(wrk__)
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 1b5d8fb1033a..703b44b257a7 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1018,17 +1018,8 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,

 	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);

-	/*
-	 * Occasionally it is seen that the G2H worker starts running after a delay of more than
-	 * a second even after being queued and activated by the Linux workqueue subsystem. This
-	 * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
-	 * Lunarlake Hybrid CPU. Issue dissappears if we disable Lunarlake atom cores from BIOS
-	 * and this is beyond xe kmd.
-	 *
-	 * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
-	 */
 	if (!ret) {
-		flush_work(&ct->g2h_worker);
+		LNL_FLUSH_WORK(&ct->g2h_worker);
 		if (g2h_fence.done) {
 			xe_gt_warn(gt, "G2H fence %u, action %04x, done\n",
 				   g2h_fence.seqno, action[0]);
Flush the xe ordered_wq in case of a ufence timeout, which is observed on LNL and points to the recent scheduling issue with E-cores.
This is similar to the recent fix: commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout") and should be removed once there is an E-core scheduling fix for LNL.
v2: Add platform check (Himal)
    s/__flush_workqueue/flush_workqueue (Jani)
v3: Remove gfx platform check as the issue is related to the CPU platform (John)
v4: Use the common macro (John) and print when the flush resolves the timeout (Matt B)
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: John Harrison John.C.Harrison@Intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754 Suggested-by: Matthew Brost matthew.brost@intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com --- drivers/gpu/drm/xe/xe_wait_user_fence.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..5b4264ea38bd 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -155,6 +155,13 @@ int xe_wait_user_fence_ioctl(struct drm_device *dev, void *data,
 	}

 	if (!timeout) {
+		LNL_FLUSH_WORKQUEUE(xe->ordered_wq);
+		err = do_compare(addr, args->value, args->mask,
+				 args->op);
+		if (err <= 0) {
+			drm_dbg(&xe->drm, "LNL_FLUSH_WORKQUEUE resolved ufence timeout\n");
+			break;
+		}
 		err = -ETIME;
 		break;
 	}
On 29/10/2024 12:01, Nirmoy Das wrote:
Flush the xe ordered_wq in case of a ufence timeout, which is observed on LNL and points to the recent scheduling issue with E-cores.
This is similar to the recent fix: commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout") and should be removed once there is an E-core scheduling fix for LNL.
v2: Add platform check (Himal)
    s/__flush_workqueue/flush_workqueue (Jani)
v3: Remove gfx platform check as the issue is related to the CPU platform (John)
v4: Use the common macro (John) and print when the flush resolves the timeout (Matt B)
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: John Harrison John.C.Harrison@Intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754 Suggested-by: Matthew Brost matthew.brost@intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
Reviewed-by: Matthew Auld matthew.auld@intel.com
On 10/29/2024 05:01, Nirmoy Das wrote:
Flush the xe ordered_wq in case of a ufence timeout, which is observed on LNL and points to the recent scheduling issue with E-cores.
This is similar to the recent fix: commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout") and should be removed once there is an E-core scheduling fix for LNL.
v2: Add platform check (Himal)
    s/__flush_workqueue/flush_workqueue (Jani)
v3: Remove gfx platform check as the issue is related to the CPU platform (John)
v4: Use the common macro (John) and print when the flush resolves the timeout (Matt B)
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: John Harrison John.C.Harrison@Intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2754 Suggested-by: Matthew Brost matthew.brost@intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
drivers/gpu/drm/xe/xe_wait_user_fence.c | 7 +++++++ 1 file changed, 7 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_wait_user_fence.c b/drivers/gpu/drm/xe/xe_wait_user_fence.c
index f5deb81eba01..5b4264ea38bd 100644
--- a/drivers/gpu/drm/xe/xe_wait_user_fence.c
+++ b/drivers/gpu/drm/xe/xe_wait_user_fence.c
@@ -155,6 +155,13 @@ int xe_wait_user_fence_ioctl(struct drm_device *dev, void *data,
 	}

 	if (!timeout) {
+		LNL_FLUSH_WORKQUEUE(xe->ordered_wq);
+		err = do_compare(addr, args->value, args->mask,
+				 args->op);
+		if (err <= 0) {
+			drm_dbg(&xe->drm, "LNL_FLUSH_WORKQUEUE resolved ufence timeout\n");
Should this not be a warning as well? To match the one in the G2H code, with the reason being that we want CI to track how often this is occurring.
John.
+			break;
+		}
 		err = -ETIME;
 		break;
 	}
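For illustration, a minimal sketch of the change John suggests above, reusing the code from the quoted hunk; the only difference is swapping drm_dbg() for drm_warn(), and the message wording is kept from the patch rather than being a confirmed final form:

	if (!timeout) {
		LNL_FLUSH_WORKQUEUE(xe->ordered_wq);
		err = do_compare(addr, args->value, args->mask,
				 args->op);
		if (err <= 0) {
			/* warn (not dbg) so CI logs capture how often the flush is needed */
			drm_warn(&xe->drm, "LNL_FLUSH_WORKQUEUE resolved ufence timeout\n");
			break;
		}
		err = -ETIME;
		break;
	}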
Flush the g2h worker explicitly if a TLB timeout happens, which is observed on LNL and points to the recent scheduling issue with E-cores on LNL.
This is similar to the recent fix: commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout") and should be removed once there is an E-core scheduling fix.
v2: Add platform check (Himal)
v3: Remove gfx platform check as the issue is related to the CPU platform (John)
    Use the common WA macro (John) and print when the flush resolves the timeout (Matt B)
v4: Remove the resolves log and do the flush before taking pending_lock (Matt A)
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: John Harrison John.C.Harrison@Intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687 Signed-off-by: Nirmoy Das nirmoy.das@intel.com --- drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
index 773de1f08db9..3cb228c773cd 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
@@ -72,6 +72,8 @@ static void xe_gt_tlb_fence_timeout(struct work_struct *work)
 	struct xe_device *xe = gt_to_xe(gt);
 	struct xe_gt_tlb_invalidation_fence *fence, *next;

+	LNL_FLUSH_WORK(&gt->uc.guc.ct.g2h_worker);
+
 	spin_lock_irq(&gt->tlb_invalidation.pending_lock);
 	list_for_each_entry_safe(fence, next,
 				 &gt->tlb_invalidation.pending_fences, link) {
On 29/10/2024 12:01, Nirmoy Das wrote:
Flush the g2h worker explicitly if a TLB timeout happens, which is observed on LNL and points to the recent scheduling issue with E-cores on LNL.
This is similar to the recent fix: commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout") and should be removed once there is an E-core scheduling fix.
v2: Add platform check (Himal)
v3: Remove gfx platform check as the issue is related to the CPU platform (John)
    Use the common WA macro (John) and print when the flush resolves the timeout (Matt B)
v4: Remove the resolves log and do the flush before taking pending_lock (Matt A)
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: John Harrison John.C.Harrison@Intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687 Signed-off-by: Nirmoy Das nirmoy.das@intel.com
Reviewed-by: Matthew Auld matthew.auld@intel.com
On 10/29/2024 05:01, Nirmoy Das wrote:
Flush the g2h worker explicitly if a TLB timeout happens, which is observed on LNL and points to the recent scheduling issue with E-cores on LNL.
This is similar to the recent fix: commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout") and should be removed once there is an E-core scheduling fix.
v2: Add platform check (Himal)
v3: Remove gfx platform check as the issue is related to the CPU platform (John)
    Use the common WA macro (John) and print when the flush resolves the timeout (Matt B)
v4: Remove the resolves log and do the flush before taking pending_lock (Matt A)
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: John Harrison John.C.Harrison@Intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687 Signed-off-by: Nirmoy Das nirmoy.das@intel.com
drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
index 773de1f08db9..3cb228c773cd 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
@@ -72,6 +72,8 @@ static void xe_gt_tlb_fence_timeout(struct work_struct *work)
 	struct xe_device *xe = gt_to_xe(gt);
 	struct xe_gt_tlb_invalidation_fence *fence, *next;

+	LNL_FLUSH_WORK(&gt->uc.guc.ct.g2h_worker);
Do we not want some kind of 'success required flush' message here as per the other instances?
John.
 	spin_lock_irq(&gt->tlb_invalidation.pending_lock);
 	list_for_each_entry_safe(fence, next,
 				 &gt->tlb_invalidation.pending_fences, link) {
On 11/1/2024 7:51 PM, John Harrison wrote:
On 10/29/2024 05:01, Nirmoy Das wrote:
Flush the g2h worker explicitly if a TLB timeout happens, which is observed on LNL and points to the recent scheduling issue with E-cores on LNL.
This is similar to the recent fix: commit e51527233804 ("drm/xe/guc/ct: Flush g2h worker in case of g2h response timeout") and should be removed once there is an E-core scheduling fix.
v2: Add platform check (Himal)
v3: Remove gfx platform check as the issue is related to the CPU platform (John)
    Use the common WA macro (John) and print when the flush resolves the timeout (Matt B)
v4: Remove the resolves log and do the flush before taking pending_lock (Matt A)
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: John Harrison John.C.Harrison@Intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com Cc: stable@vger.kernel.org # v6.11+ Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687 Signed-off-by: Nirmoy Das nirmoy.das@intel.com
drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
index 773de1f08db9..3cb228c773cd 100644
--- a/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
+++ b/drivers/gpu/drm/xe/xe_gt_tlb_invalidation.c
@@ -72,6 +72,8 @@ static void xe_gt_tlb_fence_timeout(struct work_struct *work)
 	struct xe_device *xe = gt_to_xe(gt);
 	struct xe_gt_tlb_invalidation_fence *fence, *next;

+	LNL_FLUSH_WORK(&gt->uc.guc.ct.g2h_worker);
Do we not want some kind of 'success required flush' message here as per the other instances?
This flush resolves the TLB timeout, but in this function I can't think of a simple way to detect that and log a message.
Regards,
Nirmoy
John.
 	spin_lock_irq(&gt->tlb_invalidation.pending_lock);
 	list_for_each_entry_safe(fence, next,
 				 &gt->tlb_invalidation.pending_fences, link) {
ping!
On 10/29/2024 1:01 PM, Nirmoy Das wrote:
Move the LNL scheduling WA to xe_device.h so it can be used in other places without needing to repeat the same comment about the future removal of this WA. The WA, which flushes work or workqueues, is now wrapped in macros and can be reused wherever needed.
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com cc: stable@vger.kernel.org # v6.11+ Suggested-by: John Harrison John.C.Harrison@Intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++ drivers/gpu/drm/xe/xe_guc_ct.c | 11 +---------- 2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 4c3f0ebe78a9..f1fbfe916867 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -191,4 +191,18 @@ void xe_device_declare_wedged(struct xe_device *xe);
 struct xe_file *xe_file_get(struct xe_file *xef);
 void xe_file_put(struct xe_file *xef);

+/*
+ * Occasionally it is seen that the G2H worker starts running after a delay of more than
+ * a second even after being queued and activated by the Linux workqueue subsystem. This
+ * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
+ * Lunarlake Hybrid CPU. Issue disappears if we disable Lunarlake atom cores from BIOS
+ * and this is beyond xe kmd.
+ *
+ * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
+ */
+#define LNL_FLUSH_WORKQUEUE(wq__) \
+	flush_workqueue(wq__)
+#define LNL_FLUSH_WORK(wrk__) \
+	flush_work(wrk__)
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 1b5d8fb1033a..703b44b257a7 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1018,17 +1018,8 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,

 	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);

-	/*
-	 * Occasionally it is seen that the G2H worker starts running after a delay of more than
-	 * a second even after being queued and activated by the Linux workqueue subsystem. This
-	 * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
-	 * Lunarlake Hybrid CPU. Issue dissappears if we disable Lunarlake atom cores from BIOS
-	 * and this is beyond xe kmd.
-	 *
-	 * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
-	 */
 	if (!ret) {
-		flush_work(&ct->g2h_worker);
+		LNL_FLUSH_WORK(&ct->g2h_worker);
 		if (g2h_fence.done) {
 			xe_gt_warn(gt, "G2H fence %u, action %04x, done\n",
 				   g2h_fence.seqno, action[0]);
On 29/10/2024 12:01, Nirmoy Das wrote:
Move the LNL scheduling WA to xe_device.h so it can be used in other places without needing to repeat the same comment about the future removal of this WA. The WA, which flushes work or workqueues, is now wrapped in macros and can be reused wherever needed.
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com cc: stable@vger.kernel.org # v6.11+ Suggested-by: John Harrison John.C.Harrison@Intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
Reviewed-by: Matthew Auld matthew.auld@intel.com
On 11/1/2024 2:21 PM, Matthew Auld wrote:
On 29/10/2024 12:01, Nirmoy Das wrote:
Move the LNL scheduling WA to xe_device.h so it can be used in other places without needing to repeat the same comment about the future removal of this WA. The WA, which flushes work or workqueues, is now wrapped in macros and can be reused wherever needed.
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com cc: stable@vger.kernel.org # v6.11+ Suggested-by: John Harrison John.C.Harrison@Intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
Reviewed-by: Matthew Auld matthew.auld@intel.com
Thanks for reviewing the series, Matt!
On Fri, Nov 01, 2024 at 02:44:13PM +0100, Nirmoy Das wrote:
On 11/1/2024 2:21 PM, Matthew Auld wrote:
On 29/10/2024 12:01, Nirmoy Das wrote:
Move the LNL scheduling WA to xe_device.h so it can be used in other places without needing to repeat the same comment about the future removal of this WA. The WA, which flushes work or workqueues, is now wrapped in macros and can be reused wherever needed.
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com cc: stable@vger.kernel.org # v6.11+ Suggested-by: John Harrison John.C.Harrison@Intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
Reviewed-by: Matthew Auld matthew.auld@intel.com
Thanks for reviewing the series, Matt!
and pushed to drm-xe-next, thanks.
Lucas De Marchi
On 10/29/2024 05:01, Nirmoy Das wrote:
Move the LNL scheduling WA to xe_device.h so it can be used in other places without needing to repeat the same comment about the future removal of this WA. The WA, which flushes work or workqueues, is now wrapped in macros and can be reused wherever needed.
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com cc: stable@vger.kernel.org # v6.11+ Suggested-by: John Harrison John.C.Harrison@Intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++ drivers/gpu/drm/xe/xe_guc_ct.c | 11 +---------- 2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 4c3f0ebe78a9..f1fbfe916867 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -191,4 +191,18 @@ void xe_device_declare_wedged(struct xe_device *xe);
 struct xe_file *xe_file_get(struct xe_file *xef);
 void xe_file_put(struct xe_file *xef);

+/*
+ * Occasionally it is seen that the G2H worker starts running after a delay of more than
+ * a second even after being queued and activated by the Linux workqueue subsystem. This
+ * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
+ * Lunarlake Hybrid CPU. Issue disappears if we disable Lunarlake atom cores from BIOS
+ * and this is beyond xe kmd.
+ *
+ * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
+ */
+#define LNL_FLUSH_WORKQUEUE(wq__) \
+	flush_workqueue(wq__)
+#define LNL_FLUSH_WORK(wrk__) \
+	flush_work(wrk__)
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 1b5d8fb1033a..703b44b257a7 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1018,17 +1018,8 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,

 	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);

-	/*
-	 * Occasionally it is seen that the G2H worker starts running after a delay of more than
-	 * a second even after being queued and activated by the Linux workqueue subsystem. This
-	 * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
-	 * Lunarlake Hybrid CPU. Issue dissappears if we disable Lunarlake atom cores from BIOS
-	 * and this is beyond xe kmd.
-	 *
-	 * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
-	 */
 	if (!ret) {
-		flush_work(&ct->g2h_worker);
+		LNL_FLUSH_WORK(&ct->g2h_worker);
 		if (g2h_fence.done) {
 			xe_gt_warn(gt, "G2H fence %u, action %04x, done\n",
 				   g2h_fence.seqno, action[0]);
This message is still wrong.
We have a warning that says 'job completed successfully'! That is misleading. It needs to say "done after flush" or "done but flush was required" or something along those lines.
John.
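For illustration, a minimal sketch of the kind of wording John is asking for, keeping the surrounding code from the quoted hunk; the phrase "done after flush" comes from his suggestion and everything else is unchanged:

	if (!ret) {
		LNL_FLUSH_WORK(&ct->g2h_worker);
		if (g2h_fence.done) {
			/* make it explicit that completion only happened after the WA flush */
			xe_gt_warn(gt, "G2H fence %u, action %04x, done after flush\n",
				   g2h_fence.seqno, action[0]);
		}
	}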
On 11/1/2024 7:48 PM, John Harrison wrote:
On 10/29/2024 05:01, Nirmoy Das wrote:
Move the LNL scheduling WA to xe_device.h so it can be used in other places without needing to repeat the same comment about the future removal of this WA. The WA, which flushes work or workqueues, is now wrapped in macros and can be reused wherever needed.
Cc: Badal Nilawar badal.nilawar@intel.com Cc: Matthew Auld matthew.auld@intel.com Cc: Matthew Brost matthew.brost@intel.com Cc: Himal Prasad Ghimiray himal.prasad.ghimiray@intel.com Cc: Lucas De Marchi lucas.demarchi@intel.com cc: stable@vger.kernel.org # v6.11+ Suggested-by: John Harrison John.C.Harrison@Intel.com Signed-off-by: Nirmoy Das nirmoy.das@intel.com
drivers/gpu/drm/xe/xe_device.h | 14 ++++++++++++++ drivers/gpu/drm/xe/xe_guc_ct.c | 11 +---------- 2 files changed, 15 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
index 4c3f0ebe78a9..f1fbfe916867 100644
--- a/drivers/gpu/drm/xe/xe_device.h
+++ b/drivers/gpu/drm/xe/xe_device.h
@@ -191,4 +191,18 @@ void xe_device_declare_wedged(struct xe_device *xe);
 struct xe_file *xe_file_get(struct xe_file *xef);
 void xe_file_put(struct xe_file *xef);

+/*
+ * Occasionally it is seen that the G2H worker starts running after a delay of more than
+ * a second even after being queued and activated by the Linux workqueue subsystem. This
+ * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
+ * Lunarlake Hybrid CPU. Issue disappears if we disable Lunarlake atom cores from BIOS
+ * and this is beyond xe kmd.
+ *
+ * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
+ */
+#define LNL_FLUSH_WORKQUEUE(wq__) \
+	flush_workqueue(wq__)
+#define LNL_FLUSH_WORK(wrk__) \
+	flush_work(wrk__)
+
 #endif
diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
index 1b5d8fb1033a..703b44b257a7 100644
--- a/drivers/gpu/drm/xe/xe_guc_ct.c
+++ b/drivers/gpu/drm/xe/xe_guc_ct.c
@@ -1018,17 +1018,8 @@ static int guc_ct_send_recv(struct xe_guc_ct *ct, const u32 *action, u32 len,

 	ret = wait_event_timeout(ct->g2h_fence_wq, g2h_fence.done, HZ);

-	/*
-	 * Occasionally it is seen that the G2H worker starts running after a delay of more than
-	 * a second even after being queued and activated by the Linux workqueue subsystem. This
-	 * leads to G2H timeout error. The root cause of issue lies with scheduling latency of
-	 * Lunarlake Hybrid CPU. Issue dissappears if we disable Lunarlake atom cores from BIOS
-	 * and this is beyond xe kmd.
-	 *
-	 * TODO: Drop this change once workqueue scheduling delay issue is fixed on LNL Hybrid CPU.
-	 */
 	if (!ret) {
-		flush_work(&ct->g2h_worker);
+		LNL_FLUSH_WORK(&ct->g2h_worker);
 		if (g2h_fence.done) {
 			xe_gt_warn(gt, "G2H fence %u, action %04x, done\n",
 				   g2h_fence.seqno, action[0]);
This message is still wrong.
I see that this is open from the previous patch, https://patchwork.freedesktop.org/patch/620189/#comment_1128218.
Sent out a patch to improve the message.
Thanks,
Nirmoy
We have a warning that says 'job completed successfully'! That is misleading. It needs to say "done after flush" or "done but flush was required" or something along those lines.
John.