Hi Sergey,
On 11/21/25 03:55, Sergey Senozhatsky wrote:
Hi Christian,
On (25/11/20 10:15), Christian Loehle wrote:
On 11/20/25 04:45, Sergey Senozhatsky wrote:
Hi,
We are observing a performance regression on one of our arm64 boards. We tracked it down to the linux-6.6.y commit ada8d7fa0ad4 ("sched/cpufreq: Rework schedutil governor performance estimation").
UI speedometer benchmark:
w/commit: 395 +/-38
w/o commit: 439 +/-14
Hi Sergey, Would be nice to get some details. What board?
It's an MT8196 chromebook.
What do the OPPs look like?
How do I find that out?
Does this system use uclamp during the benchmark? How?
How do I find that out?
Given how large the stddev given by speedometer (version 3?) itself is, can we get the stats of a few runs?
v2.1
w/o patch     w/ patch
440 +/-30     406 +/-11
440 +/-14     413 +/-16
444 +/-12     403 +/-14
442 +/-12     412 +/-15
Maybe traces of cpu_frequency for both w/ and w/o?
trace-cmd record -e power:cpu_frequency attached.
"base" is with ada8d7fa0ad4 "revert" is ada8d7fa0ad4 reverted.
I did some analysis based on your trace files. I have been playing some time ago with speedometer performance issues so that's why I'm curious about your report here.
I've filtered your trace purely based on cpu7 (the single biggest cpu). Then I have cut the data from the 'warm-up' phase in both traces, to have similar start point (I think).
It looks like the 2 traces can show similar 'pattern' of that benchmark which is good for analysis. If you align the timestamp: 176.051s and 972.465s then both plots (frequency changes in time) look similar.
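A minimal sketch of the filtering and alignment steps described above. It assumes the traces were dumped to text with `trace-cmd report` and that each event line looks like the sample lines below; the regular expression, the helper name `load_cpu_freq` and the demo data are illustrative assumptions, not the script actually used for this analysis.

```python
import re

# Assumed `trace-cmd report` line layout; adjust the pattern if yours differs.
EVENT_RE = re.compile(
    r"\s(\d+\.\d+):\s+cpu_frequency:\s+state=(\d+)\s+cpu_id=(\d+)"
)

def load_cpu_freq(lines, cpu, t_start):
    """Return [(seconds since t_start, freq_khz)] for one CPU, dropping warm-up."""
    samples = []
    for line in lines:
        m = EVENT_RE.search(line)
        if not m:
            continue  # not a cpu_frequency event
        ts, freq, cpu_id = float(m.group(1)), int(m.group(2)), int(m.group(3))
        if cpu_id == cpu and ts >= t_start:
            samples.append((ts - t_start, freq))
    return samples

# Hypothetical sample lines, not taken from the attached traces.
demo = [
    "  <idle>-0     [007]   175.900000: cpu_frequency:        state=2000000 cpu_id=7",
    "  <idle>-0     [007]   176.051000: cpu_frequency:        state=2000000 cpu_id=7",
    "  <idle>-0     [004]   176.060000: cpu_frequency:        state=1800000 cpu_id=4",
    "  <idle>-0     [007]   176.131000: cpu_frequency:        state=3626000 cpu_id=7",
]

# Keep only cpu7 and rebase time so the chosen alignment point becomes t = 0.
aligned = load_cpu_freq(demo, cpu=7, t_start=176.051)
```

Running the same helper on the second trace with `t_start=972.465` would put both series on a common time axis for plotting.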
There are some differences, though:
1. there are more dips in the frequency over time, so you pay the extra ramp-up penalty more often
2. some of the ramp-up phases are a bit longer: ~100ms instead of ~80ms going from 2GHz to 3.6GHz
There are idle phases missing in the trace, so we have to be careful when e.g. comparing avg frequency, because that might not be a real indication of the delivered computation and might not reflect the gap in the score.
Here are the stats:
1. revert:
       frequency
count  1.318000e+03
mean   2.932240e+06
std    5.434045e+05
min    2.000000e+06
50%    3.000000e+06
85%    3.600000e+06
90%    3.626000e+06
95%    3.626000e+06
99%    3.626000e+06
max    3.626000e+06
2. base:
       frequency
count  1.551000e+03
mean   2.809391e+06
std    5.369750e+05
min    2.000000e+06
50%    2.800000e+06
85%    3.500000e+06
90%    3.600000e+06
95%    3.626000e+06
99%    3.626000e+06
max    3.626000e+06
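For reference, a rough sketch of how per-event statistics like these can be computed with the Python standard library. The function name and the input values are hypothetical. Note that each frequency-change event counts once here, however long that frequency was held, which is one reason such numbers need careful interpretation.

```python
import statistics

def event_stats(freqs_khz):
    """Per-event summary: every frequency change counts once, whatever its duration."""
    return {
        "count": len(freqs_khz),
        "mean": statistics.fmean(freqs_khz),
        "std": statistics.stdev(freqs_khz),  # sample standard deviation
        "min": min(freqs_khz),
        "50%": statistics.median(freqs_khz),
        "max": max(freqs_khz),
    }

# Hypothetical values in kHz, not from the attached traces.
stats = event_stats([2000000, 2800000, 3600000, 3626000])
```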
A better indication in this case would be a comparison of the frequency residency in time, especially for the max freq:
1. revert: 11.92s
2. base: 9.11s
So the revert spends 2.8s longer at fmax (even though the Speedometer 2 test takes longer to finish on 'base').
Here is some detail about that run*:
+---------------+---------------------+---------------+----------------+
| Trace         | Total Trace         | Time at Max   | % of Total     |
|               | Duration (s)        | Freq (s)      | Time           |
+---------------+---------------------+---------------+----------------+
| Base Trace    | 24.72               | 9.11          | 36.9%          |
| Revert Trace  | 22.88               | 11.92         | 52.1%          |
+---------------+---------------------+---------------+----------------+
*We don't know the idle periods which might happen for those frequencies
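The residency numbers above can be approximated along these lines. This is only a sketch: it assumes a sorted list of (time in seconds, frequency in kHz) change events for one CPU, the helper names and demo events are made up, and, as the footnote says, any idle time is silently attributed to the last frequency that was set.

```python
from collections import defaultdict

def residency(samples, t_end):
    """Seconds spent at each frequency, given sorted (time_s, freq_khz) change events."""
    seconds = defaultdict(float)
    # Each frequency is held until the next change event (or the end of the trace).
    for (t, freq), (t_next, _) in zip(samples, samples[1:] + [(t_end, None)]):
        seconds[freq] += t_next - t
    return dict(seconds)

def share_at_max(seconds):
    """Return (fmax, seconds at fmax, percent of total time at fmax)."""
    fmax = max(seconds)
    total = sum(seconds.values())
    return fmax, seconds[fmax], 100.0 * seconds[fmax] / total

# Hypothetical events, not from the attached traces.
demo = [(0.0, 2000000), (1.0, 3626000), (3.0, 2000000), (3.5, 3626000)]
per_freq = residency(demo, t_end=5.0)
fmax, at_max, pct = share_at_max(per_freq)

# Cross-check of the percentages in the table, using the figures from this thread.
base_pct = 100.0 * 9.11 / 24.72
revert_pct = 100.0 * 11.92 / 22.88
```

Residency weights each frequency by how long it was held, so it is less sensitive to the number of transitions than a plain mean over events.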
I wonder if you had a fix patch for the util_est in your kernel... That fix has been recently backported to 6.6 stable [1].
You might want to try that patch as well, w/ or w/o this revert. IMHO it might be worth having it on top. It might help the main Chrome task ('CrRendererMain') to stay longer on the biggest cpu, since the util_est would be higher. You can read the discussion that I had back then with PeterZ and VincentG [2].
Regards, Lukasz
[1] https://lore.kernel.org/stable/20251121130232.828187990@linuxfoundation.org/
[2] https://lore.kernel.org/lkml/20230912142821.GA22166@noisy.programming.kicks-...
Hi Lukasz,
On Tue, Nov 25, 2025 at 5:45 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
There are some differences, though:
- there are more dips in the frequency over time, so you pay the extra ramp-up penalty more often
- some of the ramp-up phases are a bit longer: ~100ms instead of ~80ms going from 2GHz to 3.6GHz
Agree. From the visualized frequency changes in the Perfetto traces, it's more obvious that the ramp-up from 2GHz to 3.6GHz becomes much slower and a bit unstable in v6.6.99, and it's also easier to go down to a low frequency after a short idle.
I wonder if you had a fix patch for the util_est in your kernel... That fix has been recently backported to 6.6 stable [1].
You might want to try that patch as well, w/ or w/o this revert. IMHO it might be worth to have it on top. It might help the main Chrome task ('CrRendererMain') to stay longer on the biggest cpu, since the util_est would be higher. You can read the discussion that I had back then with PeterZ and VincentG [2].
No, the util_est fix isn't in our kernel yet. It looks like after cherry-picking the fix, without the revert, the Speedometer 2.0 score becomes even slightly higher than that on v6.6.88 (450 ~ 460 vs 435 ~ 440). On the other hand, with both the fix and the revert, the Speedometer score becomes about 475 ~ 480, which is almost the same as using the performance governor (i.e. pinning at the maximum frequency).
It looks like more tasks that originally run on the little cores are migrated to the middle and big cores more often, which also makes CPU7 more likely to stay at a higher frequency during some short idle in the main thread.
I've also attached the Perfetto traces for both of them:
fix without revert: https://ui.perfetto.dev/#%21/?s=ff4d10bd58982555eada61648786adf6f7187ac3
fix with revert: https://ui.perfetto.dev/#%21/?s=05da3cedfb3851ad694f523ef59d3cd1092d74ae
Best regards, Yu-Che
Hi Yu-Che,
On 11/25/25 13:01, Yu-Che Cheng wrote:
Hi Lukasz,
[snip]
There are some differences, though:
- there are more dips in the frequency over time, so you pay the extra ramp-up penalty more often
- some of the ramp-up phases are a bit longer: ~100ms instead of ~80ms going from 2GHz to 3.6GHz
Agree. From the visualized frequency changes in the Perfetto traces, it's more obvious that the ramp-up from 2GHz to 3.6GHz becomes much slower and a bit unstable in v6.6.99, and it's also easier to go down to a low frequency after a short idle.
[snip]
I wonder if you had a fix patch for the util_est in your kernel... That fix has been recently backported to 6.6 stable [1].
You might want to try that patch as well, w/ or w/o this revert. IMHO it might be worth to have it on top. It might help the main Chrome task ('CrRendererMain') to stay longer on the biggest cpu, since the util_est would be higher. You can read the discussion that I had back then with PeterZ and VincentG [2].
No, the util_est fix isn't in our kernel yet. It looks like after cherry-picking the fix, without the revert, the Speedometer 2.0 score becomes even slightly higher than that on v6.6.88 (450 ~ 460 vs 435 ~ 440). On the other hand, with both the fix and the revert, the Speedometer score becomes about 475 ~ 480, which is almost the same as using the performance governor (i.e. pinning at the maximum frequency).
It's really good to see such a score.
It looks like more tasks that originally run on the little cores are migrated to the middle and big cores more often, which also makes CPU7 more likely to stay at a higher frequency during some short idle in the main thread.
Yes, that's the desired behavior.
I've also attached the Perfetto traces for both of them:
fix without revert: https://ui.perfetto.dev/#%21/?s=ff4d10bd58982555eada61648786adf6f7187ac3
fix with revert: https://ui.perfetto.dev/#%21/?s=05da3cedfb3851ad694f523ef59d3cd1092d74ae
Thanks for the traces, there are idle periods there as well - cool.
I will link your email with these results in the thread for that stable patch backport, for the record.
Thanks for sharing those tests' scores. Community works :)
Regards, Lukasz
linux-stable-mirror@lists.linaro.org