[PATCH] drm/xe/client: Better correlate exec_queue and GT timestamps

List overview All Threads
Download

newer

older

[PATCH v2 2/9] btrfs: fix double...

[PATCH 1/5] KVM: e500: retry if no...

Lucas De Marchi

12 Dec 2024 12 Dec '24

5:34 p.m.

This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec queue file lock usage."). While it's desired to have the mutex to protect only the reference to the exec queue, getting and dropping each mutex and then later getting the GPU timestamp, doesn't produce a correct result: it introduces multiple opportunities for the task to be scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the GPU, disable preemption so they can be updated without allowing the task to be scheduled out. We leave interrupts enabled as that shouldn't be enough disturbance for the deltas to matter to userspace.

Test scenario:

* IGT'S `xe_drm_fdinfo --r --r utilization-single-full-load` * Platform: LNL, where CI occasionally reports failures * `stress -c $(nproc)` running in parallel to disturb the system

This brings a first failure from "after ~150 executions" to "never occurs after 1000 attempts".

Cc: stable@vger.kernel.org # v6.11+ Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3512 Signed-off-by: Lucas De Marchi lucas.demarchi@intel.com --- drivers/gpu/drm/xe/xe_drm_client.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c index 298a587da7f17..e307b4d6bab5a 100644 --- a/drivers/gpu/drm/xe/xe_drm_client.c +++ b/drivers/gpu/drm/xe/xe_drm_client.c @@ -338,15 +338,12 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)

/* Accumulate all the exec queues from this client */ mutex_lock(&xef->exec_queue.lock); - xa_for_each(&xef->exec_queue.xa, i, q) { - xe_exec_queue_get(q); - mutex_unlock(&xef->exec_queue.lock); + preempt_disable();

+ xa_for_each(&xef->exec_queue.xa, i, q) xe_exec_queue_update_run_ticks(q);

- mutex_lock(&xef->exec_queue.lock); - xe_exec_queue_put(q); - } + preempt_enable(); mutex_unlock(&xef->exec_queue.lock);

gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

-- 2.47.0

Show replies by date

Lucas De Marchi

12 Dec 12 Dec

9:53 p.m.

On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:

...

This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec queue file lock usage."). While it's desired to have the mutex to protect only the reference to the exec queue, getting and dropping each mutex and then later getting the GPU timestamp, doesn't produce a correct result: it introduces multiple opportunities for the task to be scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the GPU, disable preemption so they can be updated without allowing the task to be scheduled out. We leave interrupts enabled as that shouldn't be enough disturbance for the deltas to matter to userspace.

Test scenario:

IGT'S `xe_drm_fdinfo --r --r utilization-single-full-load`

Platform: LNL, where CI occasionally reports failures

`stress -c $(nproc)` running in parallel to disturb the system

This brings a first failure from "after ~150 executions" to "never occurs after 1000 attempts".

so it fails after 2704 runs, but it seems like it's for a reason other than what's fixed here that was previously failing at ~150 runs. See below. Relevant lines from the log are marked

####################################### 2705/10000 IGT-Version: 1.29-g19f996f8b (x86_64) (Linux: 6.13.0-rc2-xe+ x86_64) Using IGT_SRANDOM=1734033293 for randomisation Opened device: /dev/dri/card0 Starting subtest: utilization-single-full-load (xe_drm_fdinfo:40147) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:528: (xe_drm_fdinfo:40147) CRITICAL: Failed assertion: percent < 105.0 (xe_drm_fdinfo:40147) CRITICAL: error: 105.811345 >= 105.000000 Stack trace: #0 ../lib/igt_core.c:2051 __igt_fail_assert() #1 ../tests/intel/xe_drm_fdinfo.c:520 check_results() #2 ../tests/intel/xe_drm_fdinfo.c:464 utilization_single() #3 ../tests/intel/xe_drm_fdinfo.c:852 __igt_unique____real_main806() #4 ../tests/intel/xe_drm_fdinfo.c:806 main() #5 ../sysdeps/nptl/libc_start_call_main.h:74 __libc_start_call_main() #6 ../csu/libc-start.c:128 __libc_start_main@@GLIBC_2.34() #7 [_start+0x25] Subtest utilization-single-full-load failed. **** DEBUG **** (xe_drm_fdinfo:40147) DEBUG: rcs: spinner started (xe_drm_fdinfo:40147) DEBUG: rcs: spinner ended (timestamp=19184279) (xe_drm_fdinfo:40147) DEBUG: rcs: sample 1: cycles 0, total_cycles 325162895781 (xe_drm_fdinfo:40147) DEBUG: rcs: sample 2: cycles 19184523, total_cycles 325182184120 (xe_drm_fdinfo:40147) DEBUG: rcs: percent: 99.461763 (xe_drm_fdinfo:40147) DEBUG: bcs: spinner started (xe_drm_fdinfo:40147) DEBUG: bcs: spinner ended (timestamp=19193059) (xe_drm_fdinfo:40147) DEBUG: bcs: sample 1: cycles 0, total_cycles 325182330134 (xe_drm_fdinfo:40147) DEBUG: bcs: sample 2: cycles 19193168, total_cycles 325201552996 (xe_drm_fdinfo:40147) DEBUG: bcs: percent: 99.845522 (xe_drm_fdinfo:40147) DEBUG: ccs: spinner started (xe_drm_fdinfo:40147) DEBUG: ccs: spinner ended (timestamp=19200849) (xe_drm_fdinfo:40147) DEBUG: ccs: sample 1: cycles 0, total_cycles 325201742269 (xe_drm_fdinfo:40147) DEBUG: ccs: sample 2: cycles 19201013, total_cycles 325220974694 (xe_drm_fdinfo:40147) DEBUG: ccs: percent: 99.836666 (xe_drm_fdinfo:40147) DEBUG: vcs: spinner started (xe_drm_fdinfo:40147) DEBUG: vcs: spinner ended (timestamp=19281420) (xe_drm_fdinfo:40147) DEBUG: vcs: sample 1: cycles 0, total_cycles 325221246644 (xe_drm_fdinfo:40147) DEBUG: vcs: sample 2: cycles 19281506, total_cycles 325240467813 (xe_drm_fdinfo:40147) DEBUG: vcs: percent: 100.313904 (xe_drm_fdinfo:40147) DEBUG: vecs: spinner started [0]---> (xe_drm_fdinfo:40147) DEBUG: vecs: spinner ended (timestamp=20348520) [1]---> (xe_drm_fdinfo:40147) DEBUG: vecs: sample 1: cycles 0, total_cycles 325242482039 [2]---> (xe_drm_fdinfo:40147) DEBUG: vecs: sample 2: cycles 20348601, total_cycles 325261713058 (xe_drm_fdinfo:40147) DEBUG: vecs: percent: 105.811345 (xe_drm_fdinfo:40147) CRITICAL: Test assertion failure function check_results, file ../tests/intel/xe_drm_fdinfo.c:528: (xe_drm_fdinfo:40147) CRITICAL: Failed assertion: percent < 105.0 (xe_drm_fdinfo:40147) CRITICAL: error: 105.811345 >= 105.000000

It fails because total_cyles is smaller than the reported delta. However [0] shows the timestamp recorded by the GPU itself, which is reasonably close to the delta cycles (there are schedule gpu schedule latencies that are accounted in one vs the other, so it's not exact). So, it indeed executed for (at least) 20348520 cycles, but the reported delta for total_cycles is 325261713058 - 325242482039 == 19231019

For me the reason for the failure is the first sample. It reads cycles == 0, which means the ctx wasn't scheduled out yet to update the counter. So, depending of how fast the CPU can read the first sample, it may read a total_cycles that is actually more than the timestamp it should have been: cycles may remain the same for some time while total_cycles is still ticking.

I still think this patch here makes sense from a user perspective: we should try to get the gpu timestamp and client timestamp close together if user is expected to calculate a ratio between them.

From a testing perspective, maybe it'd better to check if the timestamp reported by the GPU ([0]) closely matches the one reported via client info ([1]) rather than calculating any percentage, which means leaving the total_cycles out of the equation for igt. Thoughts?

Lucas De Marchi

...

Cc: stable@vger.kernel.org # v6.11+ Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3512 Signed-off-by: Lucas De Marchi lucas.demarchi@intel.com

drivers/gpu/drm/xe/xe_drm_client.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c index 298a587da7f17..e307b4d6bab5a 100644 --- a/drivers/gpu/drm/xe/xe_drm_client.c +++ b/drivers/gpu/drm/xe/xe_drm_client.c @@ -338,15 +338,12 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)

/* Accumulate all the exec queues from this client */ mutex_lock(&xef->exec_queue.lock);
xa_for_each(&xef->exec_queue.xa, i, q) {
xe_exec_queue_get(q);
mutex_unlock(&xef->exec_queue.lock);
preempt_disable();

xa_for_each(&xef->exec_queue.xa, i, q) xe_exec_queue_update_run_ticks(q);
mutex_lock(&xef->exec_queue.lock);
xe_exec_queue_put(q);
}
preempt_enable(); mutex_unlock(&xef->exec_queue.lock);

gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

-- 2.47.0

Umesh Nerlige Ramappa

19 Dec 19 Dec

8:49 p.m.

On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:

...

This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec queue file lock usage."). While it's desired to have the mutex to protect only the reference to the exec queue, getting and dropping each mutex and then later getting the GPU timestamp, doesn't produce a correct result: it introduces multiple opportunities for the task to be scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the GPU, disable preemption so they can be updated without allowing the task to be scheduled out. We leave interrupts enabled as that shouldn't be enough disturbance for the deltas to matter to userspace.

Like I said in the past, this is not trivial to solve and I would hate to add anything in the KMD to do so.

For IGT, why not just take 4 samples for the measurement (separate out the 2 counters)

1. get gt timestamp in the first sample 2. get run ticks in the second sample 3. get run ticks in the third sample 4. get gt timestamp in the fourth sample

Rely on 1 and 4 for gt timestamp delta and on 2 and 3 for run ticks delta.

A user can always sample them together, but rather than focus on few values, they should just normalize the utilization over a longer period of time to get smoother stats.

Thanks, Umesh

...

Test scenario:

IGT'S `xe_drm_fdinfo --r --r utilization-single-full-load`

Platform: LNL, where CI occasionally reports failures

`stress -c $(nproc)` running in parallel to disturb the system

This brings a first failure from "after ~150 executions" to "never occurs after 1000 attempts".

Cc: stable@vger.kernel.org # v6.11+ Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3512 Signed-off-by: Lucas De Marchi lucas.demarchi@intel.com

drivers/gpu/drm/xe/xe_drm_client.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c index 298a587da7f17..e307b4d6bab5a 100644 --- a/drivers/gpu/drm/xe/xe_drm_client.c +++ b/drivers/gpu/drm/xe/xe_drm_client.c @@ -338,15 +338,12 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)

/* Accumulate all the exec queues from this client */ mutex_lock(&xef->exec_queue.lock);
xa_for_each(&xef->exec_queue.xa, i, q) {
xe_exec_queue_get(q);
mutex_unlock(&xef->exec_queue.lock);
preempt_disable();

xa_for_each(&xef->exec_queue.xa, i, q) xe_exec_queue_update_run_ticks(q);
mutex_lock(&xef->exec_queue.lock);
xe_exec_queue_put(q);
}
preempt_enable(); mutex_unlock(&xef->exec_queue.lock);

gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

-- 2.47.0

Lucas De Marchi

20 Dec 20 Dec

6:32 p.m.

On Thu, Dec 19, 2024 at 12:49:16PM -0800, Umesh Nerlige Ramappa wrote:

...

On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:

...
This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec queue file lock usage."). While it's desired to have the mutex to protect only the reference to the exec queue, getting and dropping each mutex and then later getting the GPU timestamp, doesn't produce a correct result: it introduces multiple opportunities for the task to be scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the GPU, disable preemption so they can be updated without allowing the task to be scheduled out. We leave interrupts enabled as that shouldn't be enough disturbance for the deltas to matter to userspace.

Like I said in the past, this is not trivial to solve and I would hate to add anything in the KMD to do so.

I think the best we can do in the kernel side is to try to guarantee the correlated counters are sampled together... And that is already very good per my tests. Also, it'd not only be good from a testing perspective, but for any userspace trying to make sense of the 2 counters.

Note that this is not much different from how e.g. perf samples group events:

The unit of scheduling in perf is not an individual event, but rather an event group, which may contain one or more events (potentially on different PMUs). The notion of an event group is useful for ensuring that a set of mathematically related events are all simultaneously measured for the same period of time. For example, the number of L1 cache misses should not be larger than the number of L2 cache accesses. Otherwise, it may happen that the events get multiplexed and their measurements would no longer be comparable, making the analysis more difficult.

See __perf_event_read() that will call pmu->read() on all sibling events while disabling preemption:

perf_event_read() { ... preempt_disable(); event_cpu = __perf_event_read_cpu(event, event_cpu); ... (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1); preempt_enable(); ... }

so... at least there's prior art for that... for the same reason that userspace should see the values sampled together.

...

For IGT, why not just take 4 samples for the measurement (separate out the 2 counters)

get gt timestamp in the first sample

get run ticks in the second sample

get run ticks in the third sample

get gt timestamp in the fourth sample

Rely on 1 and 4 for gt timestamp delta and on 2 and 3 for run ticks delta.

this won't fix it for the general case: you get rid of the > 100% case, you make the < 100% much worse.

For a testing perspective I think the non-flaky solution is to stop calculating percentages and rather check that the execution timestamp recorded by the GPU very closely matches (minus gpu scheduling delays) the one we got via fdinfo once the fence signals and we wait for the job completion.

Lucas De Marchi

...

A user can always sample them together, but rather than focus on few values, they should just normalize the utilization over a longer period of time to get smoother stats.

Thanks, Umesh

...
Test scenario:

IGT'S `xe_drm_fdinfo --r --r utilization-single-full-load`

Platform: LNL, where CI occasionally reports failures

`stress -c $(nproc)` running in parallel to disturb the system

This brings a first failure from "after ~150 executions" to "never occurs after 1000 attempts".

Cc: stable@vger.kernel.org # v6.11+ Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3512 Signed-off-by: Lucas De Marchi lucas.demarchi@intel.com

drivers/gpu/drm/xe/xe_drm_client.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c index 298a587da7f17..e307b4d6bab5a 100644 --- a/drivers/gpu/drm/xe/xe_drm_client.c +++ b/drivers/gpu/drm/xe/xe_drm_client.c @@ -338,15 +338,12 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)

/* Accumulate all the exec queues from this client */ mutex_lock(&xef->exec_queue.lock);
xa_for_each(&xef->exec_queue.xa, i, q) {
xe_exec_queue_get(q);
mutex_unlock(&xef->exec_queue.lock);
preempt_disable();

xa_for_each(&xef->exec_queue.xa, i, q) xe_exec_queue_update_run_ticks(q);
mutex_lock(&xef->exec_queue.lock);
xe_exec_queue_put(q);
}
preempt_enable(); mutex_unlock(&xef->exec_queue.lock);

gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

-- 2.47.0

Umesh Nerlige Ramappa

21 Dec 21 Dec

12:32 a.m.

On Fri, Dec 20, 2024 at 12:32:16PM -0600, Lucas De Marchi wrote:

...

On Thu, Dec 19, 2024 at 12:49:16PM -0800, Umesh Nerlige Ramappa wrote:

...
On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:

...
This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec queue file lock usage."). While it's desired to have the mutex to protect only the reference to the exec queue, getting and dropping each mutex and then later getting the GPU timestamp, doesn't produce a correct result: it introduces multiple opportunities for the task to be scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the GPU, disable preemption so they can be updated without allowing the task to be scheduled out. We leave interrupts enabled as that shouldn't be enough disturbance for the deltas to matter to userspace.

Like I said in the past, this is not trivial to solve and I would hate to add anything in the KMD to do so.

I think the best we can do in the kernel side is to try to guarantee the correlated counters are sampled together... And that is already very good per my tests. Also, it'd not only be good from a testing perspective, but for any userspace trying to make sense of the 2 counters.

Note that this is not much different from how e.g. perf samples group events:

The unit of scheduling in perf is not an individual event, but rather an event group, which may contain one or more events (potentially on different PMUs). The notion of an event group is useful for ensuring that a set of mathematically related events are all simultaneously measured for the same period of time. For example, the number of L1 cache misses should not be larger than the number of L2 cache accesses. Otherwise, it may happen that the events get multiplexed and their measurements would no longer be comparable, making the analysis more difficult.

See __perf_event_read() that will call pmu->read() on all sibling events while disabling preemption:

perf_event_read() { ... preempt_disable(); event_cpu = __perf_event_read_cpu(event, event_cpu); ... (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1); preempt_enable(); ... }

so... at least there's prior art for that... for the same reason that userspace should see the values sampled together.

Well, I have used the preempt_disable/enable when fixing some selftest (i915), but was not happy that there were still some rare failures. If reducing error rates is the intention, then it's fine. In my mind, the issue still exists and once in a while we would end up assessing such a failure. Maybe, in addition, fixing up the IGTs like you suggest below is a worthwhile option.

...

...
For IGT, why not just take 4 samples for the measurement (separate out the 2 counters)

get gt timestamp in the first sample

get run ticks in the second sample

get run ticks in the third sample

get gt timestamp in the fourth sample

Rely on 1 and 4 for gt timestamp delta and on 2 and 3 for run ticks delta.

this won't fix it for the general case: you get rid of the > 100% case, you make the < 100% much worse.

yeah, that's quite possible.

...

For a testing perspective I think the non-flaky solution is to stop calculating percentages and rather check that the execution timestamp recorded by the GPU very closely matches (minus gpu scheduling delays) the one we got via fdinfo once the fence signals and we wait for the job completion.

Agree, we should change how we validate the counters in IGT.

Thanks, Umesh

...

Lucas De Marchi

...
A user can always sample them together, but rather than focus on few values, they should just normalize the utilization over a longer period of time to get smoother stats.

Thanks, Umesh

...
Test scenario:

IGT'S `xe_drm_fdinfo --r --r utilization-single-full-load`

Platform: LNL, where CI occasionally reports failures

`stress -c $(nproc)` running in parallel to disturb the system

This brings a first failure from "after ~150 executions" to "never occurs after 1000 attempts".

Cc: stable@vger.kernel.org # v6.11+ Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3512 Signed-off-by: Lucas De Marchi lucas.demarchi@intel.com

drivers/gpu/drm/xe/xe_drm_client.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c index 298a587da7f17..e307b4d6bab5a 100644 --- a/drivers/gpu/drm/xe/xe_drm_client.c +++ b/drivers/gpu/drm/xe/xe_drm_client.c @@ -338,15 +338,12 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)

/* Accumulate all the exec queues from this client */ mutex_lock(&xef->exec_queue.lock);
xa_for_each(&xef->exec_queue.xa, i, q) {
xe_exec_queue_get(q);
mutex_unlock(&xef->exec_queue.lock);
preempt_disable();

xa_for_each(&xef->exec_queue.xa, i, q) xe_exec_queue_update_run_ticks(q);
mutex_lock(&xef->exec_queue.lock);
xe_exec_queue_put(q);
}
preempt_enable(); mutex_unlock(&xef->exec_queue.lock);

gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

-- 2.47.0

Lucas De Marchi

12:42 a.m.

On Fri, Dec 20, 2024 at 04:32:09PM -0800, Umesh Nerlige Ramappa wrote:

...

On Fri, Dec 20, 2024 at 12:32:16PM -0600, Lucas De Marchi wrote:

...
On Thu, Dec 19, 2024 at 12:49:16PM -0800, Umesh Nerlige Ramappa wrote:

...
On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:

...
This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec queue file lock usage."). While it's desired to have the mutex to protect only the reference to the exec queue, getting and dropping each mutex and then later getting the GPU timestamp, doesn't produce a correct result: it introduces multiple opportunities for the task to be scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the GPU, disable preemption so they can be updated without allowing the task to be scheduled out. We leave interrupts enabled as that shouldn't be enough disturbance for the deltas to matter to userspace.

Like I said in the past, this is not trivial to solve and I would hate to add anything in the KMD to do so.

I think the best we can do in the kernel side is to try to guarantee the correlated counters are sampled together... And that is already very good per my tests. Also, it'd not only be good from a testing perspective, but for any userspace trying to make sense of the 2 counters.

Note that this is not much different from how e.g. perf samples group events:

The unit of scheduling in perf is not an individual event, but rather an event group, which may contain one or more events (potentially on different PMUs). The notion of an event group is useful for ensuring that a set of mathematically related events are all simultaneously measured for the same period of time. For example, the number of L1 cache misses should not be larger than the number of L2 cache accesses. Otherwise, it may happen that the events get multiplexed and their measurements would no longer be comparable, making the analysis more difficult.

See __perf_event_read() that will call pmu->read() on all sibling events while disabling preemption:

perf_event_read() { ... preempt_disable(); event_cpu = __perf_event_read_cpu(event, event_cpu); ... (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1); preempt_enable(); ... }

so... at least there's prior art for that... for the same reason that userspace should see the values sampled together.

Well, I have used the preempt_disable/enable when fixing some selftest (i915), but was not happy that there were still some rare failures. If reducing error rates is the intention, then it's fine. In my mind, the issue still exists and once in a while we would end up assessing such a failure. Maybe, in addition, fixing up the IGTs like you suggest below is a worthwhile option.

for me this fix is not targeted at tests, even if it improves them a lot. It's more for consistent userspace behavior.

...

...
...
For IGT, why not just take 4 samples for the measurement (separate out the 2 counters)

get gt timestamp in the first sample

get run ticks in the second sample

get run ticks in the third sample

get gt timestamp in the fourth sample

Rely on 1 and 4 for gt timestamp delta and on 2 and 3 for run ticks delta.

this won't fix it for the general case: you get rid of the > 100% case, you make the < 100% much worse.

yeah, that's quite possible.

...
For a testing perspective I think the non-flaky solution is to stop calculating percentages and rather check that the execution timestamp recorded by the GPU very closely matches (minus gpu scheduling delays) the one we got via fdinfo once the fence signals and we wait for the job completion.

Agree, we should change how we validate the counters in IGT.

I have a wip patch to cleanup and submit to igt. I will submit it soon.

Lucas De Marchi

4 Jan 4 Jan

7:19 a.m.

On Fri, Dec 20, 2024 at 06:42:09PM -0600, Lucas De Marchi wrote:

...

On Fri, Dec 20, 2024 at 04:32:09PM -0800, Umesh Nerlige Ramappa wrote:

...
On Fri, Dec 20, 2024 at 12:32:16PM -0600, Lucas De Marchi wrote:

...
On Thu, Dec 19, 2024 at 12:49:16PM -0800, Umesh Nerlige Ramappa wrote:

...
On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:

...
This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec queue file lock usage."). While it's desired to have the mutex to protect only the reference to the exec queue, getting and dropping each mutex and then later getting the GPU timestamp, doesn't produce a correct result: it introduces multiple opportunities for the task to be scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the GPU, disable preemption so they can be updated without allowing the task to be scheduled out. We leave interrupts enabled as that shouldn't be enough disturbance for the deltas to matter to userspace.

Like I said in the past, this is not trivial to solve and I would hate to add anything in the KMD to do so.

I think the best we can do in the kernel side is to try to guarantee the correlated counters are sampled together... And that is already very good per my tests. Also, it'd not only be good from a testing perspective, but for any userspace trying to make sense of the 2 counters.

Note that this is not much different from how e.g. perf samples group events:

The unit of scheduling in perf is not an individual event, but rather an event group, which may contain one or more events (potentially on different PMUs). The notion of an event group is useful for ensuring that a set of mathematically related events are all simultaneously measured for the same period of time. For example, the number of L1 cache misses should not be larger than the number of L2 cache accesses. Otherwise, it may happen that the events get multiplexed and their measurements would no longer be comparable, making the analysis more difficult.

See __perf_event_read() that will call pmu->read() on all sibling events while disabling preemption:

perf_event_read() { ... preempt_disable(); event_cpu = __perf_event_read_cpu(event, event_cpu); ... (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1); preempt_enable(); ... }

so... at least there's prior art for that... for the same reason that userspace should see the values sampled together.

Well, I have used the preempt_disable/enable when fixing some selftest (i915), but was not happy that there were still some rare failures. If reducing error rates is the intention, then it's fine. In my mind, the issue still exists and once in a while we would end up assessing such a failure. Maybe, in addition, fixing up the IGTs like you suggest below is a worthwhile option.

for me this fix is not targeted at tests, even if it improves them a lot. It's more for consistent userspace behavior.

...
...
...
For IGT, why not just take 4 samples for the measurement (separate out the 2 counters)

get gt timestamp in the first sample

get run ticks in the second sample

get run ticks in the third sample

get gt timestamp in the fourth sample

Rely on 1 and 4 for gt timestamp delta and on 2 and 3 for run ticks delta.

this won't fix it for the general case: you get rid of the > 100% case, you make the < 100% much worse.

yeah, that's quite possible.

...
For a testing perspective I think the non-flaky solution is to stop calculating percentages and rather check that the execution timestamp recorded by the GPU very closely matches (minus gpu scheduling delays) the one we got via fdinfo once the fence signals and we wait for the job completion.

Agree, we should change how we validate the counters in IGT.

I have a wip patch to cleanup and submit to igt. I will submit it soon.

Just submitted that as the last patch in the series: https://lore.kernel.org/igt-dev/20250104071548.737612-8-lucas.demarchi@intel...

but I'd also like to apply this one in the kernel and still looking for a review.

thanks Lucas De Marchi

Matthew Brost

9 Jan 9 Jan

4:50 a.m.

On Sat, Jan 04, 2025 at 01:19:59AM -0600, Lucas De Marchi wrote:

...

On Fri, Dec 20, 2024 at 06:42:09PM -0600, Lucas De Marchi wrote:

...
On Fri, Dec 20, 2024 at 04:32:09PM -0800, Umesh Nerlige Ramappa wrote:

...
On Fri, Dec 20, 2024 at 12:32:16PM -0600, Lucas De Marchi wrote:

...
On Thu, Dec 19, 2024 at 12:49:16PM -0800, Umesh Nerlige Ramappa wrote:

...
On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:

...
This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec queue file lock usage."). While it's desired to have the mutex to protect only the reference to the exec queue, getting and dropping each mutex and then later getting the GPU timestamp, doesn't produce a correct result: it introduces multiple opportunities for the task to be scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the GPU, disable preemption so they can be updated without allowing the task to be scheduled out. We leave interrupts enabled as that shouldn't be enough disturbance for the deltas to matter to userspace.

Like I said in the past, this is not trivial to solve and I would hate to add anything in the KMD to do so.

I think the best we can do in the kernel side is to try to guarantee the correlated counters are sampled together... And that is already very good per my tests. Also, it'd not only be good from a testing perspective, but for any userspace trying to make sense of the 2 counters.

Note that this is not much different from how e.g. perf samples group events:

The unit of scheduling in perf is not an individual event, but rather an event group, which may contain one or more events (potentially on different PMUs). The notion of an event group is useful for ensuring that a set of mathematically related events are all simultaneously measured for the same period of time. For example, the number of L1 cache misses should not be larger than the number of L2 cache accesses. Otherwise, it may happen that the events get multiplexed and their measurements would no longer be comparable, making the analysis more difficult.

See __perf_event_read() that will call pmu->read() on all sibling events while disabling preemption:

perf_event_read() { ... preempt_disable(); event_cpu = __perf_event_read_cpu(event, event_cpu); ... (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1); preempt_enable(); ... }

so... at least there's prior art for that... for the same reason that userspace should see the values sampled together.

Well, I have used the preempt_disable/enable when fixing some selftest (i915), but was not happy that there were still some rare failures. If reducing error rates is the intention, then it's fine. In my mind, the issue still exists and once in a while we would end up assessing such a failure. Maybe, in addition, fixing up the IGTs like you suggest below is a worthwhile option.

IMO, we should strive to avoid using low-level calls like preempt_disable and preempt_enable, as they lead to unmaintainable nonsense, as seen in the i915.

Sure, in Umesh's example, this is pretty clear and not an unreasonable usage. However, I’m more concerned that this sets a precedent in Xe that doing this is acceptable.

Not a blocker, just expressing concerns.

Matt

...

...
for me this fix is not targeted at tests, even if it improves them a lot. It's more for consistent userspace behavior.

...
...
...
For IGT, why not just take 4 samples for the measurement (separate out the 2 counters)

get gt timestamp in the first sample

get run ticks in the second sample

get run ticks in the third sample

get gt timestamp in the fourth sample

Rely on 1 and 4 for gt timestamp delta and on 2 and 3 for run ticks delta.

this won't fix it for the general case: you get rid of the > 100% case, you make the < 100% much worse.

yeah, that's quite possible.

...
For a testing perspective I think the non-flaky solution is to stop calculating percentages and rather check that the execution timestamp recorded by the GPU very closely matches (minus gpu scheduling delays) the one we got via fdinfo once the fence signals and we wait for the job completion.

Agree, we should change how we validate the counters in IGT.

I have a wip patch to cleanup and submit to igt. I will submit it soon.

Just submitted that as the last patch in the series: https://lore.kernel.org/igt-dev/20250104071548.737612-8-lucas.demarchi@intel...

but I'd also like to apply this one in the kernel and still looking for a review.

thanks Lucas De Marchi

Lucas De Marchi

5:19 a.m.

On Wed, Jan 08, 2025 at 08:50:43PM -0800, Matthew Brost wrote:

...

On Sat, Jan 04, 2025 at 01:19:59AM -0600, Lucas De Marchi wrote:

...
On Fri, Dec 20, 2024 at 06:42:09PM -0600, Lucas De Marchi wrote:

...
On Fri, Dec 20, 2024 at 04:32:09PM -0800, Umesh Nerlige Ramappa wrote:

...
On Fri, Dec 20, 2024 at 12:32:16PM -0600, Lucas De Marchi wrote:

...
On Thu, Dec 19, 2024 at 12:49:16PM -0800, Umesh Nerlige Ramappa wrote:

...
On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote: > This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec > queue file lock usage."). While it's desired to have the mutex to > protect only the reference to the exec queue, getting and dropping each > mutex and then later getting the GPU timestamp, doesn't produce a > correct result: it introduces multiple opportunities for the task to be > scheduled out and thus wrecking havoc the deltas reported to userspace. > > Also, to better correlate the timestamp from the exec queues with the > GPU, disable preemption so they can be updated without allowing the task > to be scheduled out. We leave interrupts enabled as that shouldn't be > enough disturbance for the deltas to matter to userspace.

Like I said in the past, this is not trivial to solve and I would hate to add anything in the KMD to do so.

I think the best we can do in the kernel side is to try to guarantee the correlated counters are sampled together... And that is already very good per my tests. Also, it'd not only be good from a testing perspective, but for any userspace trying to make sense of the 2 counters.

Note that this is not much different from how e.g. perf samples group events:

The unit of scheduling in perf is not an individual event, but rather an event group, which may contain one or more events (potentially on different PMUs). The notion of an event group is useful for ensuring that a set of mathematically related events are all simultaneously measured for the same period of time. For example, the number of L1 cache misses should not be larger than the number of L2 cache accesses. Otherwise, it may happen that the events get multiplexed and their measurements would no longer be comparable, making the analysis more difficult.

See __perf_event_read() that will call pmu->read() on all sibling events while disabling preemption:

perf_event_read() { ... preempt_disable(); event_cpu = __perf_event_read_cpu(event, event_cpu); ... (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1); preempt_enable(); ... }

so... at least there's prior art for that... for the same reason that userspace should see the values sampled together.

Well, I have used the preempt_disable/enable when fixing some selftest (i915), but was not happy that there were still some rare failures. If reducing error rates is the intention, then it's fine. In my mind, the issue still exists and once in a while we would end up assessing such a failure. Maybe, in addition, fixing up the IGTs like you suggest below is a worthwhile option.

IMO, we should strive to avoid using low-level calls like preempt_disable and preempt_enable, as they lead to unmaintainable nonsense, as seen in the i915.

Sure, in Umesh's example, this is pretty clear and not an unreasonable usage. However, I’m more concerned that this sets a precedent in Xe that doing this is acceptable.

each such usage needs to be carefully reviewed, but there are cases in which it's useful and we shouldn't ban it. In my early reply on wdcrw3du2ssykmsrda3mvwjhreengeovwasikmixdiowqsfnwj@lsputsgtmha4 I even showed how the same construct is used by perf when reading counters that should be sampled together.

I will post a new version, but I will delay in a week or so merging it. That's because I want to check the result of the other patch series that changes the test in igt and afaics should give all green results all the time: https://patchwork.freedesktop.org/series/143204/

Lucas De Marchi

...

Not a blocker, just expressing concerns.

Matt

...
...
for me this fix is not targeted at tests, even if it improves them a lot. It's more for consistent userspace behavior.

...
...
...
For IGT, why not just take 4 samples for the measurement (separate out the 2 counters)

get gt timestamp in the first sample

get run ticks in the second sample

get run ticks in the third sample

get gt timestamp in the fourth sample

Rely on 1 and 4 for gt timestamp delta and on 2 and 3 for run ticks delta.

this won't fix it for the general case: you get rid of the > 100% case, you make the < 100% much worse.

yeah, that's quite possible.

...
For a testing perspective I think the non-flaky solution is to stop calculating percentages and rather check that the execution timestamp recorded by the GPU very closely matches (minus gpu scheduling delays) the one we got via fdinfo once the fence signals and we wait for the job completion.

Agree, we should change how we validate the counters in IGT.

I have a wip patch to cleanup and submit to igt. I will submit it soon.

Just submitted that as the last patch in the series: https://lore.kernel.org/igt-dev/20250104071548.737612-8-lucas.demarchi@intel...

but I'd also like to apply this one in the kernel and still looking for a review.

thanks Lucas De Marchi

Matthew Brost

8:21 p.m.

On Wed, Jan 08, 2025 at 11:19:52PM -0600, Lucas De Marchi wrote:

...

On Wed, Jan 08, 2025 at 08:50:43PM -0800, Matthew Brost wrote:

...
On Sat, Jan 04, 2025 at 01:19:59AM -0600, Lucas De Marchi wrote:

...
On Fri, Dec 20, 2024 at 06:42:09PM -0600, Lucas De Marchi wrote:

...
On Fri, Dec 20, 2024 at 04:32:09PM -0800, Umesh Nerlige Ramappa wrote:

...
On Fri, Dec 20, 2024 at 12:32:16PM -0600, Lucas De Marchi wrote:

...
On Thu, Dec 19, 2024 at 12:49:16PM -0800, Umesh Nerlige Ramappa wrote: > On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote: > > This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec > > queue file lock usage."). While it's desired to have the mutex to > > protect only the reference to the exec queue, getting and dropping each > > mutex and then later getting the GPU timestamp, doesn't produce a > > correct result: it introduces multiple opportunities for the task to be > > scheduled out and thus wrecking havoc the deltas reported to userspace. > > > > Also, to better correlate the timestamp from the exec queues with the > > GPU, disable preemption so they can be updated without allowing the task > > to be scheduled out. We leave interrupts enabled as that shouldn't be > > enough disturbance for the deltas to matter to userspace. > > Like I said in the past, this is not trivial to solve and I > would hate to add anything in the KMD to do so.

I think the best we can do in the kernel side is to try to guarantee the correlated counters are sampled together... And that is already very good per my tests. Also, it'd not only be good from a testing perspective, but for any userspace trying to make sense of the 2 counters.

Note that this is not much different from how e.g. perf samples group events:

The unit of scheduling in perf is not an individual event, but rather an event group, which may contain one or more events (potentially on different PMUs). The notion of an event group is useful for ensuring that a set of mathematically related events are all simultaneously measured for the same period of time. For example, the number of L1 cache misses should not be larger than the number of L2 cache accesses. Otherwise, it may happen that the events get multiplexed and their measurements would no longer be comparable, making the analysis more difficult.

See __perf_event_read() that will call pmu->read() on all sibling events while disabling preemption:

perf_event_read() { ... preempt_disable(); event_cpu = __perf_event_read_cpu(event, event_cpu); ... (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1); preempt_enable(); ... }

so... at least there's prior art for that... for the same reason that userspace should see the values sampled together.

Well, I have used the preempt_disable/enable when fixing some selftest (i915), but was not happy that there were still some rare failures. If reducing error rates is the intention, then it's fine. In my mind, the issue still exists and once in a while we would end up assessing such a failure. Maybe, in addition, fixing up the IGTs like you suggest below is a worthwhile option.

IMO, we should strive to avoid using low-level calls like preempt_disable and preempt_enable, as they lead to unmaintainable nonsense, as seen in the i915.

Sure, in Umesh's example, this is pretty clear and not an unreasonable usage. However, I’m more concerned that this sets a precedent in Xe that doing this is acceptable.

each such usage needs to be carefully reviewed, but there are cases

+1 to carefully reviewing each usage. That goes for any low-level kernel calls which typically shouldn't be used by drivers.

...

in which it's useful and we shouldn't ban it. In my early reply on wdcrw3du2ssykmsrda3mvwjhreengeovwasikmixdiowqsfnwj@lsputsgtmha4 I even showed how the same construct is used by perf when reading counters that should be sampled together.

Yea, this usage looks fine. AMD seems to do something similar to programming / reading HW registers which are tightly coupled.

Matt

...

I will post a new version, but I will delay in a week or so merging it. That's because I want to check the result of the other patch series that changes the test in igt and afaics should give all green results all the time: https://patchwork.freedesktop.org/series/143204/

Lucas De Marchi

...
Not a blocker, just expressing concerns.

Matt

...
...
for me this fix is not targeted at tests, even if it improves them a lot. It's more for consistent userspace behavior.

...
...
> > For IGT, why not just take 4 samples for the measurement > (separate out the 2 counters) > > 1. get gt timestamp in the first sample > 2. get run ticks in the second sample > 3. get run ticks in the third sample > 4. get gt timestamp in the fourth sample > > Rely on 1 and 4 for gt timestamp delta and on 2 and 3 for > run ticks delta.

this won't fix it for the general case: you get rid of the > 100% case, you make the < 100% much worse.

yeah, that's quite possible.

...
For a testing perspective I think the non-flaky solution is to stop calculating percentages and rather check that the execution timestamp recorded by the GPU very closely matches (minus gpu scheduling delays) the one we got via fdinfo once the fence signals and we wait for the job completion.

Agree, we should change how we validate the counters in IGT.

I have a wip patch to cleanup and submit to igt. I will submit it soon.

Just submitted that as the last patch in the series: https://lore.kernel.org/igt-dev/20250104071548.737612-8-lucas.demarchi@intel...

but I'd also like to apply this one in the kernel and still looking for a review.

thanks Lucas De Marchi

Umesh Nerlige Ramappa

6 Jan 6 Jan

10:44 p.m.

On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:

...

This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec queue file lock usage."). While it's desired to have the mutex to protect only the reference to the exec queue, getting and dropping each mutex and then later getting the GPU timestamp, doesn't produce a correct result: it introduces multiple opportunities for the task to be scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the GPU, disable preemption so they can be updated without allowing the task to be scheduled out. We leave interrupts enabled as that shouldn't be enough disturbance for the deltas to matter to userspace.

Assuming

- timestamp from exec queues = xe_exec_queue_update_run_ticks() - GPU timestamp = xe_hw_engine_read_timestamp()

shouldn't you also move the xe_hw_engine_read_timestamp() within the preempt_disable/preempt_enable section?

Something like this ..

preempt_disable();

xa_for_each(&xef->exec_queue.xa, i, q) xe_exec_queue_update_run_ticks(q);

gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

preempt_enable();

Thanks, Umesh

...

Test scenario:

IGT'S `xe_drm_fdinfo --r --r utilization-single-full-load`

Platform: LNL, where CI occasionally reports failures

`stress -c $(nproc)` running in parallel to disturb the system

This brings a first failure from "after ~150 executions" to "never occurs after 1000 attempts".

Cc: stable@vger.kernel.org # v6.11+ Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3512 Signed-off-by: Lucas De Marchi lucas.demarchi@intel.com

drivers/gpu/drm/xe/xe_drm_client.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c index 298a587da7f17..e307b4d6bab5a 100644 --- a/drivers/gpu/drm/xe/xe_drm_client.c +++ b/drivers/gpu/drm/xe/xe_drm_client.c @@ -338,15 +338,12 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)

/* Accumulate all the exec queues from this client */ mutex_lock(&xef->exec_queue.lock);
xa_for_each(&xef->exec_queue.xa, i, q) {
xe_exec_queue_get(q);
mutex_unlock(&xef->exec_queue.lock);
preempt_disable();

xa_for_each(&xef->exec_queue.xa, i, q) xe_exec_queue_update_run_ticks(q);
mutex_lock(&xef->exec_queue.lock);
xe_exec_queue_put(q);
}
preempt_enable(); mutex_unlock(&xef->exec_queue.lock);

gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

-- 2.47.0

Lucas De Marchi

8 Jan 8 Jan

10:57 p.m.

On Mon, Jan 06, 2025 at 02:44:27PM -0800, Umesh Nerlige Ramappa wrote:

...

On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:

...
This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec queue file lock usage."). While it's desired to have the mutex to protect only the reference to the exec queue, getting and dropping each mutex and then later getting the GPU timestamp, doesn't produce a correct result: it introduces multiple opportunities for the task to be scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the GPU, disable preemption so they can be updated without allowing the task to be scheduled out. We leave interrupts enabled as that shouldn't be enough disturbance for the deltas to matter to userspace.

Assuming

timestamp from exec queues = xe_exec_queue_update_run_ticks()

GPU timestamp = xe_hw_engine_read_timestamp()

shouldn't you also move the xe_hw_engine_read_timestamp() within the preempt_disable/preempt_enable section?

this is what I thought I had done, but it seems I messed up.

...

Something like this ..

preempt_disable();

xa_for_each(&xef->exec_queue.xa, i, q) xe_exec_queue_update_run_ticks(q);

gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

I'd move the gpu_timestamp to be done before the exec queue, so we don't call it with the exec queue mutex still taken. AFAIR this is what I was doing in the test machine, but sent the wrong version of the patch. Let me double check locally here and resend.

thanks Lucas De Marchi

...

preempt_enable();

Thanks, Umesh

...
Test scenario:

IGT'S `xe_drm_fdinfo --r --r utilization-single-full-load`

Platform: LNL, where CI occasionally reports failures

`stress -c $(nproc)` running in parallel to disturb the system

This brings a first failure from "after ~150 executions" to "never occurs after 1000 attempts".

Cc: stable@vger.kernel.org # v6.11+ Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3512 Signed-off-by: Lucas De Marchi lucas.demarchi@intel.com

drivers/gpu/drm/xe/xe_drm_client.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c index 298a587da7f17..e307b4d6bab5a 100644 --- a/drivers/gpu/drm/xe/xe_drm_client.c +++ b/drivers/gpu/drm/xe/xe_drm_client.c @@ -338,15 +338,12 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)

/* Accumulate all the exec queues from this client */ mutex_lock(&xef->exec_queue.lock);
xa_for_each(&xef->exec_queue.xa, i, q) {
xe_exec_queue_get(q);
mutex_unlock(&xef->exec_queue.lock);
preempt_disable();

xa_for_each(&xef->exec_queue.xa, i, q) xe_exec_queue_update_run_ticks(q);
mutex_lock(&xef->exec_queue.lock);
xe_exec_queue_put(q);
}
preempt_enable(); mutex_unlock(&xef->exec_queue.lock);

gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

-- 2.47.0

Lucas De Marchi

9 Jan 9 Jan

8:07 p.m.

On Wed, Jan 08, 2025 at 04:57:40PM -0600, Lucas De Marchi wrote:

...

On Mon, Jan 06, 2025 at 02:44:27PM -0800, Umesh Nerlige Ramappa wrote:

...
On Thu, Dec 12, 2024 at 09:34:32AM -0800, Lucas De Marchi wrote:

...
This partially reverts commit fe4f5d4b6616 ("drm/xe: Clean up VM / exec queue file lock usage."). While it's desired to have the mutex to protect only the reference to the exec queue, getting and dropping each mutex and then later getting the GPU timestamp, doesn't produce a correct result: it introduces multiple opportunities for the task to be scheduled out and thus wrecking havoc the deltas reported to userspace.

Also, to better correlate the timestamp from the exec queues with the GPU, disable preemption so they can be updated without allowing the task to be scheduled out. We leave interrupts enabled as that shouldn't be enough disturbance for the deltas to matter to userspace.

Assuming

timestamp from exec queues = xe_exec_queue_update_run_ticks()

GPU timestamp = xe_hw_engine_read_timestamp()

shouldn't you also move the xe_hw_engine_read_timestamp() within the preempt_disable/preempt_enable section?

this is what I thought I had done, but it seems I messed up.

...
Something like this ..

preempt_disable();

xa_for_each(&xef->exec_queue.xa, i, q) xe_exec_queue_update_run_ticks(q);

gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

I'd move the gpu_timestamp to be done before the exec queue, so we don't call it with the exec queue mutex still taken. AFAIR this is what I was doing in the test machine, but sent the wrong version of the patch. Let me double check locally here and resend.

just re-tested that and submitted a v2. With and without the pending igt patch series I can run the test for 1000 times while running cpu stress and not have any failure.

Lucas De Marchi

...

thanks Lucas De Marchi

...
preempt_enable();

Thanks, Umesh

...
Test scenario:

IGT'S `xe_drm_fdinfo --r --r utilization-single-full-load`

Platform: LNL, where CI occasionally reports failures

`stress -c $(nproc)` running in parallel to disturb the system

This brings a first failure from "after ~150 executions" to "never occurs after 1000 attempts".

Cc: stable@vger.kernel.org # v6.11+ Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3512 Signed-off-by: Lucas De Marchi lucas.demarchi@intel.com

drivers/gpu/drm/xe/xe_drm_client.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c index 298a587da7f17..e307b4d6bab5a 100644 --- a/drivers/gpu/drm/xe/xe_drm_client.c +++ b/drivers/gpu/drm/xe/xe_drm_client.c @@ -338,15 +338,12 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)

/* Accumulate all the exec queues from this client */ mutex_lock(&xef->exec_queue.lock);
xa_for_each(&xef->exec_queue.xa, i, q) {
xe_exec_queue_get(q);
mutex_unlock(&xef->exec_queue.lock);
preempt_disable();

xa_for_each(&xef->exec_queue.xa, i, q) xe_exec_queue_update_run_ticks(q);
mutex_lock(&xef->exec_queue.lock);
xe_exec_queue_put(q);
}
preempt_enable(); mutex_unlock(&xef->exec_queue.lock);

gpu_timestamp = xe_hw_engine_read_timestamp(hwe);

-- 2.47.0

291

days inactive

319

days old

linux-stable-mirror@lists.linaro.org

12 comments

participants

tags (0)

participants (3)

Lucas De Marchi
Matthew Brost
Umesh Nerlige Ramappa