Hello,
I did some comparisons of PELT and WALT and have some very interesting performance results that I wanted to share with all of you. I don't have any power numbers, as I don't have a setup for measuring them.
Key points:
- All the tests were done on Hikey960, with a 5V fan placed over the SoC to cool it down.
- The HDMI port was disconnected while running the tests.
- CONFIG_SCHED_TUNE was configured out to keep things simple.
- Only the PCMark bench was tested, with the help of Workload Automation.
- The numbers below are the average of 3 runs, performed during a single kernel boot cycle.
- Pelt 8/16/32 are the half-life periods.
- While testing Pelt, CONFIG_WALT was disabled.
+------------------+----------+------------+------------+------------+
| Test name        | WALT     | Pelt 8 ms  | Pelt 16 ms | Pelt 32 ms |
+------------------+----------+------------+------------+------------+
| DataManipulation | 5341     | 5561       | 5453       | 5400       |
| PhotoEditingV2   | 9015     | 8577       | 7911       | 6043       |
| VideoEditing     | 0        | 4291       | 3746       | 3755       |
| WebV2            | 6202     | 6448       | 5465       | 4648       |
| Workv2           | 0        | 5697       | 5069       | 4517       |
| WritingV2        | 4302     | 4549       | 3811       | 3306       |
+------------------+----------+------------+------------+------------+
As you can see in the results Pelt 8 is very much comparable to the Walt results now. Hurray ? :)
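To put "comparable" into numbers, the relative difference of each Pelt 8 ms score against WALT can be worked out from the table. A minimal sketch; the two tests where WALT shows 0 are skipped because a zero baseline gives no meaningful ratio (treating 0 as "no valid score" is an assumption here, not something stated in the results):

```python
# Scores copied from the results table: (WALT, Pelt 8 ms).
scores = {
    "DataManipulation": (5341, 5561),
    "PhotoEditingV2": (9015, 8577),
    "VideoEditing": (0, 4291),
    "WebV2": (6202, 6448),
    "Workv2": (0, 5697),
    "WritingV2": (4302, 4549),
}


def relative_diff_percent(walt, pelt):
    """Return how far the PELT score is from the WALT score, in percent."""
    return (pelt - walt) * 100.0 / walt


diffs = {}
for name, (walt, pelt) in scores.items():
    if walt == 0:
        # Assumption: a 0 score means no valid WALT result, so skip it.
        continue
    diffs[name] = relative_diff_percent(walt, pelt)

for name, diff in diffs.items():
    print(f"{name:18s} {diff:+.1f}%")
```

With these inputs Pelt 8 ms lands within roughly 6% of WALT on the four comparable tests, and PhotoEditingV2 is the one test where WALT stays ahead.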
A detailed report is present here with some more useful numbers:
How to replicate setup:
- Android kernel tree: https://git.linaro.org/people/vireshk/mylinux.git android-4.9-hikey
This has several patches over latest 4.9-hikey aosp tree.
- Some patches to reduce disturbances, which Vincent shared earlier with a document.
- "thermal: Add debugfs support for cooling devices" and "cpufreq: stats: New sysfs attribute for clearing statistics" are used to read some more data from userspace after the tests are done. That data can be used to draw conclusions about how pelt/walt work and how they behave differently.
For example, we can know the amount of time we spent at individual cpu frequencies while the test was running, and also the time for which cpu-cooling and devfreq (ddr) have throttled some frequencies.
- Pelt 16 and pelt 8 patches.
The below changes are required to capture the extra data that I have captured in my sheet above.
I have attached pelt_walt.sh script, which you need to push to /data:
$ adb push pelt_walt.sh /data
And I have updated the pcmark plugin file to run the script and collect data. That is attached as well.
Happy testing !!
I heard from Vincent earlier that ARM did similar testing earlier on but never found anything significant. Why ? I may have an answer to that, not sure though.
I found a patch from Juri which someone is using:
https://android.googlesource.com/kernel/msm/+/b52bb1f248e4cef65edaece54a68c6...
and one of the problems here is that the patch hasn't updated the __accumulated_sum_N32 array, but only runnable_avg_yN_inv and runnable_avg_yN_sum.
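For reference, the constants that depend on the half-life can be regenerated for any period, in the spirit of Documentation/scheduler/sched-pelt.c. The sketch below is a Python approximation of that idea, not the kernel tool itself: the function names are made up for illustration, and it assumes the usual PELT model where the per-period decay factor y satisfies y^N = 0.5 for a half-life of N periods (about 1 ms each).

```python
def decay_factor(halflife):
    """Per-period decay factor y, chosen so that y ** halflife == 0.5."""
    return 0.5 ** (1.0 / halflife)


def yN_inv_table(halflife):
    """Fixed-point (32-bit) values of y**i for i in [0, halflife).

    Hypothetical helper mirroring the role of runnable_avg_yN_inv.
    """
    y = decay_factor(halflife)
    return [int(((1 << 32) - 1) * y ** i) for i in range(halflife)]


def converged_max(halflife):
    """Value an always-running signal converges to, adding 1024 per period.

    Closed form of the geometric series 1024 * (1 + y + y**2 + ...);
    the kernel uses an integer iteration, so its constant differs slightly.
    """
    y = decay_factor(halflife)
    return 1024.0 / (1.0 - y)


for halflife in (8, 16, 32):
    table = yN_inv_table(halflife)
    print(halflife, len(table), hex(table[0]), round(converged_max(halflife)))
```

The point of the sketch is that every table sized or scaled by the half-life has to be regenerated together; changing only some of them leaves the signal inconsistent.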
That's pretty much it. Thanks for reading.
-- viresh
Hi Viresh,
On Fri, Jan 05, 2018 at 04:13:20PM +0530, Viresh Kumar wrote:
I heard from Vincent earlier that ARM did similar testing earlier on but never found anything significant. Why ? I may have an answer to that, not sure though.
I may not be the right person to answer this, so I'll just give some background info:
- I remember Dietmar used Jankbench for comparing WALT and PELT; the scheduler signals have quite different effects on different test cases. I think Jankbench's workload is more about responsiveness, whereas from my understanding the PCMark test case is more of a sustained workload?
Earlier I even observed that the PELT signal (32ms period) might give better performance than WALT for a sustained workload; this is because the PELT signal has a much longer decay time, so it's more stable than WALT in some special cases.
- I saw you have disabled SchedTune for testing, but SchedTune is quite critical for Jankbench (or Uibench) testing. 'prefer_idle' and the boost margin have quite an important effect; even the WALT signal needs to rely heavily on these knobs for Jankbench tuning. E.g. 'prefer_idle' is critical for reducing janks in UI cases.
Thanks, Leo Yan
On 08-01-18, 12:56, Leo Yan wrote:
I may not be the right person to answer this, so I'll just give some background info:
I remember Dietmar used Jankbench for comparing WALT and PELT; the scheduler signals have quite different effects on different test cases. I think Jankbench's workload is more about responsiveness, whereas from my understanding the PCMark test case is more of a sustained workload?
Earlier I even observed that the PELT signal (32ms period) might give better performance than WALT for a sustained workload; this is because the PELT signal has a much longer decay time, so it's more stable than WALT in some special cases.
Sure, and the same reason can be used to argue against using WALT as that will have the same (bad) effects ?
And I am not arguing on what's the best one for us here, but rather wanted to show that PELT can be modified with trivial changes to make it perform like WALT and maybe remove WALT support later on from Android as that is never going to be upstreamed.
- I saw you have disabled SchedTune for testing, but SchedTune is quite critical for Jankbench (or Uibench) testing. 'prefer_idle' and the boost margin have quite an important effect; even the WALT signal needs to rely heavily on these knobs for Jankbench tuning. E.g. 'prefer_idle' is critical for reducing janks in UI cases.
I didn't want any special effects to skew my results and wanted the least number of variables. I do believe that we will continue to use schedtune and that it can be tuned to make WALT and PELT (8) behave in a similar way ?
-- viresh
On Mon, Jan 08, 2018 at 12:21:17PM +0530, Viresh Kumar wrote:
Sure, and the same reason can be used to argue against using WALT as that will have the same (bad) effects ?
I think so.
And I am not arguing on what's the best one for us here, but rather wanted to show that PELT can be modified with trivial changes to make it perform like WALT and maybe remove WALT support later on from Android as that is never going to be upstreamed.
Understood. I totally agree your testing is reasonable. WALT is mainly used to optimize UI responsiveness for a better user experience; for this purpose there has been a lot of profiling on Uibench and Jankbench, so if the PELT optimization can achieve similar performance to WALT on these test cases, IMHO it will be more convincing.
- I saw you have disabled SchedTune for testing, but SchedTune is quite critical for Jankbench (or Uibench) testing. 'prefer_idle' and the boost margin have quite an important effect; even the WALT signal needs to rely heavily on these knobs for Jankbench tuning. E.g. 'prefer_idle' is critical for reducing janks in UI cases.
I didn't want any special effects to skew my results and wanted the least number of variables. I do believe that we will continue to use schedtune and that it can be tuned to make WALT and PELT (8) behave in a similar way ?
Agreed. Before the SFO17 Connect I profiled on Hikey960 with kernel 4.4: WALT with a boost margin of 10% can get < 5% janks. PELT needs the boost margin increased to 15% or 20% to get similar performance to WALT.
So if PELT can achieve the same performance as WALT with the same boost margin, and if PELT can save power at the same time, this will be very nice :)
Just want to remind you that besides CPU frequency as one important metric, "over-utilization" is another important metric. I can see that PELT (8ms) improves the performance results significantly, but I am not sure whether this is because the CPU util can easily reach the 80% tipping point, which triggers "over-utilized" and finally boosts performance by spreading tasks with SMP load balancing.
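The tipping point mentioned here can be sketched as a simple comparison. This assumes the check used in Android EAS kernels of that time, where a CPU counts as over-utilized once its utilization exceeds its capacity scaled by 1024/1280 (roughly 80%); the names and the margin value below are assumptions for illustration and should be checked against the tree under test.

```python
SCHED_CAPACITY_SCALE = 1024
CAPACITY_MARGIN = 1280  # assumed margin: 1024 / 1280 = 80% tipping point


def cpu_overutilized(util, capacity=SCHED_CAPACITY_SCALE,
                     margin=CAPACITY_MARGIN):
    """True when util is above ~80% of the CPU's capacity."""
    return capacity * SCHED_CAPACITY_SCALE < util * margin


# Utilization just below and just above the 80% point of a full-capacity CPU.
for util in (800, 819, 820, 900):
    print(util, cpu_overutilized(util))
```

A signal that ramps up faster crosses this threshold sooner, which is one way a shorter half-life could end up spreading tasks earlier rather than only raising the frequency.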
Thanks, Leo Yan
On Mon, Jan 08, 2018 at 03:46:47PM +0800, Leo Yan wrote:
[...]
Agreed. Before the SFO17 Connect I profiled on Hikey960 with kernel 4.4: WALT with a boost margin of 10% can get < 5% janks. PELT needs the boost margin increased to 15% or 20% to get similar performance to WALT.
Just a reminder: Android kernel 4.9 has a performance regression for Jankbench on Hikey960 (not root-caused yet), but kernel 4.4 performs well. Please keep this in mind when profiling UI cases.
Thanks, Leo Yan
On 08-01-18, 16:16, Leo Yan wrote:
Just a reminder: Android kernel 4.9 has a performance regression for Jankbench on Hikey960 (not root-caused yet), but kernel 4.4 performs well. Please keep this in mind when profiling UI cases.
I hate to go back to 4.4 :)
Will 4.9 be good enough for comparing WALT/PELT? I mean, the regression should be there for both of them, shouldn't it ?
I am planning to do Jankbench tests sometime soon.
-- viresh
On 05/01/2018 11:43, Viresh Kumar wrote:
As you can see in the results Pelt 8 is very much comparable to the Walt results now. Hurray ? :)
Modulating the PELT period to approach the WALT results is a good approach to removing the WALT code, IMO.
To make sure this is going in the right direction, maybe you can do the same with benchmarks measuring latencies / responsiveness ?
A detailed report is present here with some more useful numbers:
Why is the amplitude of the differences in the TIS so high between different rounds ? I mean the delta between min and max.
On 08-01-18, 09:00, Daniel Lezcano wrote:
To make sure this is going in the right direction, maybe you can do the same with benchmarks measuring latencies / responsiveness ?
Yeah, I am going to run Jankbench now.
Why is the amplitude of the differences in the TIS so high between different rounds ? I mean the delta between min and max.
Ahh, so it's only for the cpufreq numbers and not for the thermal ones. The reason is very straightforward though: I somehow failed to reset the stats between individual runs. The second run is always almost twice the first one, and the third one is thrice :)
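Since the counters were never cleared, the per-run figures can still be recovered by subtracting consecutive snapshots. A minimal sketch, assuming each snapshot is the text of a cpufreq stats time_in_state file (one "<frequency> <time>" pair per line) read at the end of each run; the sample values are invented for illustration:

```python
def parse_time_in_state(text):
    """Parse '<freq> <time>' lines into a {freq: time} dict."""
    stats = {}
    for line in text.strip().splitlines():
        freq, time = line.split()
        stats[int(freq)] = int(time)
    return stats


def per_run_deltas(snapshots):
    """Turn cumulative snapshots into per-run values.

    The first snapshot is taken as-is, which assumes the counters
    were zero when the first run started.
    """
    deltas = []
    previous = {}
    for snap in snapshots:
        deltas.append({f: t - previous.get(f, 0) for f, t in snap.items()})
        previous = snap
    return deltas


# Invented cumulative readings after run 1, 2 and 3.
raw = [
    "533000 100\n1844000 400",
    "533000 210\n1844000 790",
    "533000 300\n1844000 1200",
]
runs = per_run_deltas([parse_time_in_state(r) for r in raw])
for i, run in enumerate(runs, 1):
    print(i, run)
```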
-- viresh
On 01/05/2018 11:43 AM, Viresh Kumar wrote:
I heard from Vincent earlier that ARM did similar testing earlier on but never found anything significant. Why ? I may have an answer to that, not sure though.
I found a patch from Juri which someone is using:
https://android.googlesource.com/kernel/msm/+/b52bb1f248e4cef65edaece54a68c6...
and one of the problems here is that the patch hasn't updated the __accumulated_sum_N32 array, but only runnable_avg_yN_inv and runnable_avg_yN_sum.
We are aware of the fact that reducing PELT's half-life periods can make the system more responsive. I remember that earlier product versions actually shipped with 16ms. But being more responsive is only one side of the problem. Energy consumption is the other big problem. And by making PELT too aggressive (making half-life period smaller) you risk overshooting of the signal and you reduce the effect of history, two things which could be fatal for energy savings.
IIRC, for newer product kernel, we went back to 32ms.
Driving this idea of reducing PELT's half-life period further ... you end up not using PELT at all but instantaneous load(/util) (se->load.weight, cfs_rq->load.weight).
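How much more aggressive a shorter half-life is can be estimated with the continuous approximation of PELT: a task that starts running flat out from zero reaches a fraction 1 - 0.5^(t/halflife) of full utilization after t ms. The sketch below uses that approximation (it ignores the period granularity and frequency/CPU invariance) to estimate the time needed to reach a given utilization level:

```python
import math


def rampup_time_ms(target_fraction, halflife_ms):
    """Time for an always-running task to reach target_fraction of max util.

    Solves 1 - 0.5 ** (t / halflife) = target_fraction for t.
    """
    return halflife_ms * math.log2(1.0 / (1.0 - target_fraction))


for halflife in (32, 16, 8):
    t80 = rampup_time_ms(0.8, halflife)
    print(f"half-life {halflife:2d} ms: ~{t80:.1f} ms to reach 80%")
```

In this model the ramp-up time scales linearly with the half-life, so going from 32 ms to 8 ms makes the signal react four times faster, with correspondingly less history behind each value.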
So the energy consumption values wltests spits out or some tests are also very important to have. And then there is still this issue that h960 (especially close to mainline) is not mature enough to show the same results as when the tests run on a production device.
-- Dietmar
[...]
On 05-Jan 16:13, Viresh Kumar wrote:
A detailed report is present here with some more useful numbers:
I don't have access to this report... I just sent a request.
Pelt 16 and pelt 8 patches.
Are those the patches I've shared few weeks ago, on top of util_est?
http://www.linux-arm.org/git?p=linux-pb.git%3Ba=shortlog%3Bh=refs/heads/eas/...
There are two main observations regarding PELT speedups:
1) Faster decay time: by speeding up the ramp-up we also get faster decay times, which ultimately makes PELT even more different from WALT, where utilization instead never decays. This can benefit benchmarks but can affect other interactive use-cases.
2) The constants you change affect LOAD too; do we know what the side effects are in this case?
Moreover, as Leo pointed out, speeding up PELT can also have side effects on overutilization, thus reducing the time we run in energy aware mode.
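The first observation can be illustrated with the idealized PELT model, in which the utilization of a task that stops running halves every half-life period. A minimal sketch under that assumption (no period granularity, utilization expressed on the 0..1024 scale):

```python
def decayed_util(util, idle_ms, halflife_ms):
    """Utilization left after idle_ms of not running."""
    return util * 0.5 ** (idle_ms / halflife_ms)


# A task at full utilization (1024) that then sleeps for 32 ms.
for halflife in (32, 16, 8):
    left = decayed_util(1024, 32, halflife)
    print(f"half-life {halflife:2d} ms: util after 32 ms idle = {left:.0f}")
```

With an 8 ms half-life the same 32 ms sleep leaves only a small fraction of the signal, which is the behaviour that may hurt interactive tasks that run in short bursts.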
All that considered, IMHO evaluating a PELT speed-up requires a much more extensive set of tests than just comparing 3 runs of PCMark. Energy must definitely be one of the metrics, and a more comprehensive set of workloads is also required to get the full picture.
That's why we spent a month creating a simple and reproducible workflow based on LISA/WA, which allows us to collect a complete set of measurements and share them easily.
The below changes are required to capture the extra data that I have captured in my sheet above.
I have attached pelt_walt.sh script, which you need to push to /data:
$ adb push pelt_walt.sh /data
And I have updated the pcmark plugin file to run the script and collect data. That is attached as well.
Happy testing !!
Do we need another one? Can't you share wltest results instead? That's also what Google ultimately wants to see as experimental evaluation of proposed scheduler modifications.
I heard from Vincent earlier that ARM did similar testing earlier on but never found anything significant. Why ? I may have an answer to that, not sure though.
That's not completely true, we did testing and we are doing testing. The branch above is part of the testing we are doing, on both PELT speed-ups and util_est, which we still consider as part of the same story to have a more WALT-like PELT.
Maybe it's just that for us testing requires more time to run all of them? ;-)
I found a patch from Juri which someone is using:
https://android.googlesource.com/kernel/msm/+/b52bb1f248e4cef65edaece54a68c6...
and one of the problems here is that the patch hasn't updated the __accumulated_sum_N32 array, but only runnable_avg_yN_inv and runnable_avg_yN_sum.
That patch did not update __accumulated_sum_N32 because it was not used in that kernel, a 3.18-based codebase, where PELT was updated using a different set of support data structures: the ones modified by the patch.
Regarding the results, however, there were benefits, and that's why Pixel phones have been released with a 16ms PELT.
On 08-01-18, 16:11, Patrick Bellasi wrote:
On 05-Jan 16:13, Viresh Kumar wrote:
I don't have access to this report... I just sent a request.
I thought I had shared it correctly earlier, but it looks like it was set to "Anyone with link from Linaro". Fixed it now.
Pelt 16 and pelt 8 patches.
Are those the patches I've shared few weeks ago, on top of util_est?
http://www.linux-arm.org/git?p=linux-pb.git%3Ba=shortlog%3Bh=refs/heads/eas/...
Ah no. I never saw yours and created my own :)
I even uploaded a very similar patch for inclusion in the Android tree yesterday.
https://android-review.googlesource.com/c/kernel/common/+/581203
Just reinvented the wheel it seems :(
There are two main observations regarding PELT speedups:
1) Faster decay time: by speeding up the ramp-up we also get faster decay times, which ultimately makes PELT even more different from WALT, where utilization instead never decays. This can benefit benchmarks but can affect other interactive use-cases.
2) The constants you change affect LOAD too; do we know what the side effects are in this case?
Moreover, as Leo pointed out, speeding up PELT can also have side effects on overutilization, thus reducing the time we run in energy aware mode.
I agree.
All that considered, IMHO evaluating a PELT speed-up requires a much more extensive set of tests than just comparing 3 runs of PCMark.
Sure. I just wanted to start an initial discussion to get some feedback on the direction of the work I was doing.
Do we need another one? Can't you share instead wltest results?
I have used only WA for these tests, but yeah wltest/lisa would be better.
That's also what Google ultimately wants to see as experimental evaluation of proposed scheduler modifications.
I heard from Vincent earlier that ARM did similar testing earlier on but never found anything significant. Why ? I may have an answer to that, not sure though.
That's not completely true, we did testing and we are doing testing. The branch above is part of the testing we are doing, on both PELT speed-ups and util_est, which we still consider as part of the same story to have a more WALT-like PELT.
Hmm, I had a chat about this with Vincent long back and was under the impression that ARM hadn't found much value in playing with the half-life period. Of course, I didn't know about the above stuff.
Regarding the results, however, there were benefits, and that's why Pixel phones have been released with a 16ms PELT.
Aren't the Pixel phones using WALT for the cpu and task util stuff ?
-- viresh
On 01/08/2018 11:04 PM, Viresh Kumar wrote:
On 08-01-18, 16:11, Patrick Bellasi wrote:
Are those the patches I've shared few weeks ago, on top of util_est?
http://www.linux-arm.org/git?p=linux-pb.git%3Ba=shortlog%3Bh=refs/heads/eas/...
Ah no. I never saw yours and created my own :)
I even uploaded a very similar patch for inclusion in the Android tree yesterday.
https://android-review.googlesource.com/c/kernel/common/+/581203
Just reinvented the wheel it seems :(
Kinda offtopic, but chiming in because I've had enough people rant about this to me:
I know this isn't an intentional reinvention. But even for an unintentional case, I would recommend going back to Patrick's patches for future profiling instead of reinventing similar functionality. I've had several people rant to me that upstream maintainers don't pick up the patches people send and instead rewrite it themselves. We shouldn't continue propagating that notion. The kernel will lose devs if that keeps happening.
Thanks, Saravana
On 12-01-18, 15:37, Saravana Kannan wrote:
Kinda offtopic, but chiming in because I've had enough people rant about this to me:
:)
I know this isn't an intentional reinvention. But even for an unintentional case, I would recommend going back to Patrick's patches for future profiling instead of reinventing similar functionality. I've had several people rant to me that upstream maintainers don't pick up the patches people send and instead rewrite it themselves. We shouldn't continue propagating that notion. The kernel will lose devs if that keeps happening.
Well, this is something which is targeted just for Android right now and this isn't going to make me earn any BitCoins :). I wouldn't mind whose name gets on this patch and I am quite sure Patrick wouldn't either. It was too trivial of a patch.
Anyway, here is a short summary of events:
- Most of the content of this patch is obtained from Documentation/scheduler/sched-pelt.c; it's too trivial otherwise.
- I believe the first version of this patch (in any form) was to move the period to 16 ms instead of 32 ms and that was done by Juri few years ago (while he was at ARM).
- I started playing with this stuff around 9th November and both me and Vincent had an internal patch we were sharing on this since then.
- Not sure when Patrick wrote his version of the patch, the date from his tree goes back to 21st November though.
Now even if both me and Patrick had our versions of this patch, anyone can come up and send his own patch for it to the lists or android tree and we can't object to that because: "We never posted our patch for review".
Anyway, @Patrick please feel free to post the patch for Android inclusion and I will review/verify/ack it and drop mine :)
-- viresh
On 08-01-18, 16:11, Patrick Bellasi wrote:
http://www.linux-arm.org/git?p=linux-pb.git%3Ba=shortlog%3Bh=refs/heads/eas/...
My numbers with the util-est patches were terrible, and I found out that the hunk below is required to make it work, especially as I was testing without sched tune.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b5b2aa40654c..ab00d6b653a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6303,7 +6303,7 @@ boosted_cpu_util(int cpu)
 	util = max(util, cpu_rq(cpu)->cfs.util_est_runnable);

 	margin = schedtune_cpu_margin(util, cpu);
-	util = min_t(long, (margin), SCHED_CAPACITY_SCALE);
+	util = min_t(long, util + margin, SCHED_CAPACITY_SCALE);

 	trace_sched_boost_cpu(cpu, util, margin);
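The effect of the one-line change is easiest to see with SchedTune compiled out, where the margin is presumably always 0. A small Python model of the two variants (not kernel code; the function names are invented and the margin is passed in directly rather than computed by schedtune_cpu_margin()):

```python
SCHED_CAPACITY_SCALE = 1024


def boosted_util_before(util, margin):
    """Behaviour of the original line: the utilization itself is dropped."""
    return min(margin, SCHED_CAPACITY_SCALE)


def boosted_util_after(util, margin):
    """Behaviour with the fix: the margin is added on top of util."""
    return min(util + margin, SCHED_CAPACITY_SCALE)


# (util, margin) pairs; margin 0 models a kernel without SchedTune.
for util, margin in ((600, 0), (600, 42), (1000, 100)):
    print(util, margin,
          boosted_util_before(util, margin),
          boosted_util_after(util, margin))
```

In this model a zero margin makes the unfixed variant report a boosted utilization of 0 no matter how busy the CPU is, which would be consistent with the terrible numbers seen without the hunk.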
-- viresh
Hi Viresh,
On 17-Jan 11:45, Viresh Kumar wrote:
On 08-01-18, 16:11, Patrick Bellasi wrote:
http://www.linux-arm.org/git?p=linux-pb.git%3Ba=shortlog%3Bh=refs/heads/eas/...
My numbers with the util-est patches were terrible, and I found out that the hunk below is required to make them work, especially since I was testing without sched-tune.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b5b2aa40654c..ab00d6b653a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6303,7 +6303,7 @@ boosted_cpu_util(int cpu)
 	util = max(util, cpu_rq(cpu)->cfs.util_est_runnable);
 	margin = schedtune_cpu_margin(util, cpu);
-	util = min_t(long, (margin), SCHED_CAPACITY_SCALE);
+	util = min_t(long, util + margin, SCHED_CAPACITY_SCALE);
You're right, I noticed the same on the patches I've backported for testing on Pixel 2... I'll update the branch on linux-arm.
Without this fix both energy and performance are really crappy on Pixel too :-/
trace_sched_boost_cpu(cpu, util, margin);
Cheers Patrick
-- #include <best/regards.h>
Patrick Bellasi
On 05-01-18, 16:13, Viresh Kumar wrote:
Hello,
I did some comparisons of Pelt and Walt and have some very interesting performance results that I wanted to share with all of you. I haven't got any power numbers as I don't have setup for that.
Key points:
- All the tests were done on Hikey960, with a 5V Fan placed over the SoC to cool it down.
- HDMI port was disconnected while running tests.
- CONFIG_SCHED_TUNE was configured out to keep things simple.
- Only the PCmark bench was tested, with help of workload automation.
- Below number shows the average out of 3 runs, performed during a single kernel boot cycle.
- Pelt 8/16/32 are the half-life periods.
- While testing Pelt, CONFIG_WALT was disabled.
+------------------+----------+------------+------------+------------+
| Test name        | WALT     | Pelt 8 ms  | Pelt 16 ms | Pelt 32 ms |
+------------------+----------+------------+------------+------------+
| DataManipulation | 5341     | 5561       | 5453       | 5400       |
| PhotoEditingV2   | 9015     | 8577       | 7911       | 6043       |
| VideoEditing     | 0        | 4291       | 3746       | 3755       |
| WebV2            | 6202     | 6448       | 5465       | 4648       |
| Workv2           | 0        | 5697       | 5069       | 4517       |
| WritingV2        | 4302     | 4549       | 3811       | 3306       |
+------------------+----------+------------+------------+------------+
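As a rough illustration of what the half-life period means, the sketch below models the utilization of a task that runs continuously starting from zero. It assumes a plain geometric series with one period per millisecond (the kernel's periods are 1024 us) and ignores frequency and CPU invariance; the function names are hypothetical.

```python
import math

SCHED_CAPACITY_SCALE = 1024


def decay_factor(half_life_periods):
    """Per-period decay y, chosen so that y ** half_life_periods == 0.5."""
    return 0.5 ** (1.0 / half_life_periods)


def util_after(periods_running, half_life_periods):
    """Utilization of an always-running task after the given number of periods."""
    y = decay_factor(half_life_periods)
    return SCHED_CAPACITY_SCALE * (1.0 - y ** periods_running)


def periods_to_reach(fraction, half_life_periods):
    """Periods an always-running task needs to reach `fraction` of full scale."""
    return math.log(1.0 - fraction) / math.log(decay_factor(half_life_periods))


if __name__ == "__main__":
    for half_life in (8, 16, 32):
        print(half_life, round(periods_to_reach(0.8, half_life), 1))
```

In this simplified model the ramp-up time scales linearly with the half-life, so an 8 ms half-life reaches a given utilization four times sooner than a 32 ms one.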
So I tried Jankbench (the apk that Patrick once provided in one of the emails) and Pelt (8) seems to outperform Walt there as well. Of course there are still no power numbers, and I haven't connected HDMI either. Tested with the 4.9 kernel without Sched-Tune.
I ran the Jankbench tests from the sched-evaluation-full.yaml agenda (attached).
Results updated in: https://goo.gl/eCx4Pk
Average across 30 iterations:
+------------------+----------+------+--------------------+
| Test             | Pelt (8) | Walt | Pelt Improvement % |
+------------------+----------+------+--------------------+
| list_view        | 8        | 17   | 52.94              |
| image_list_view  | 11       | 24   | 54.17              |
| shadow_grid      | 18       | 27   | 33.33              |
| low_hitrate_text | 39       | 45   | 13.33              |
| edit_text        | 5        | 7    | 28.57              |
+------------------+----------+------+--------------------+
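The improvement column is consistent with a relative reduction of the Pelt figure against the Walt figure, i.e. lower values are treated as better. A short sketch that reproduces the column from the two measured columns (the helper name improvement_pct is made up):

```python
def improvement_pct(pelt, walt):
    """Relative reduction of the Pelt figure versus the Walt figure, in percent."""
    return round((walt - pelt) / walt * 100.0, 2)


# (Pelt 8, Walt) pairs copied from the Jankbench table above.
RESULTS = {
    "list_view": (8, 17),
    "image_list_view": (11, 24),
    "shadow_grid": (18, 27),
    "low_hitrate_text": (39, 45),
    "edit_text": (5, 7),
}

if __name__ == "__main__":
    for test, (pelt, walt) in RESULTS.items():
        print(test, improvement_pct(pelt, walt))
```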
-- viresh
Hello Viresh,
The Pixel devices are using Walt. Are you able to share your full patch set so that I can implement it on my Pixel 2 tree? I would like to see real-world daily usage with Pelt 8 ms vs Walt.
Regards,
Matthew
_______________________________________________ eas-dev mailing list eas-dev@lists.linaro.org https://lists.linaro.org/mailman/listinfo/eas-dev
On 09-01-18, 13:05, Matthew Alex wrote:
Hello Viresh,
The Pixel devices are using Walt. Are you able to share your full patch set so that I can implement it on my Pixel 2 tree? I would like to see real-world daily usage with Pelt 8 ms vs Walt.
Hi,
I already shared my setup in the initial email; the tree is linked below. You just need to disable CONFIG_SCHED_WALT and CONFIG_SCHED_TUNE in the config for your platform to make your setup similar to mine.
https://git.linaro.org/people/vireshk/mylinux.git android-4.9-hikey
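A small sketch for double-checking a built kernel's .config against the two options named above. It relies only on the standard .config conventions ("CONFIG_FOO=y" for enabled, "# CONFIG_FOO is not set" for disabled); the helper names and the sample text are made up for illustration.

```python
import re

# Options that should be off to mirror the setup described above.
WANTED_OFF = ("CONFIG_SCHED_WALT", "CONFIG_SCHED_TUNE")


def option_enabled(config_text, option):
    """True if `option` is set to y or m in the given .config text."""
    pattern = re.compile(r"^%s=(y|m)$" % re.escape(option), re.MULTILINE)
    return pattern.search(config_text) is not None


def still_enabled(config_text):
    """Return the options from WANTED_OFF that are still enabled."""
    return [opt for opt in WANTED_OFF if option_enabled(config_text, opt)]


if __name__ == "__main__":
    # Made-up sample, not taken from any real tree.
    sample = "CONFIG_SCHED_WALT=y\n# CONFIG_SCHED_TUNE is not set\n"
    print(still_enabled(sample))
```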
-- viresh
Hi Viresh,
On Tue, Jan 9, 2018 at 12:12 PM, Viresh Kumar viresh.kumar@linaro.org wrote:
On 05-01-18, 16:13, Viresh Kumar wrote:
Hello,
I did some comparisons of Pelt and Walt and have some very interesting performance results that I wanted to share with all of you. I haven't got any power numbers as I don't have setup for that.
I ran the Jankbench tests from the sched-evaluation-full.yaml agenda (attached).
Results updated in: https://goo.gl/eCx4Pk
Average across 30 iterations:
+------------------+----------+------+--------------------+
| Test             | Pelt (8) | Walt | Pelt Improvement % |
+------------------+----------+------+--------------------+
| list_view        | 8        | 17   | 52.94              |
| image_list_view  | 11       | 24   | 54.17              |
| shadow_grid      | 18       | 27   | 33.33              |
| low_hitrate_text | 39       | 45   | 13.33              |
| edit_text        | 5        | 7    | 28.57              |
+------------------+----------+------+--------------------+
--
What is the WALT window size (walt_ravg_window) in your setup? I see it is 20 msec here [1], but just want to double confirm. Reducing WALT window size makes it more responsive. Have you tried with different window size like 10 msec?
[1] https://git.linaro.org/people/vireshk/mylinux.git/tree/kernel/sched/walt.c?h...
-- Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
On 11-Jan 16:44, Pavan Kondeti wrote:
What is the WALT window size (walt_ravg_window) in your setup? I see it is 20 msec here [1], but just want to double confirm. Reducing WALT window size makes it more responsive. Have you tried with different window size like 10 msec?
[1] https://git.linaro.org/people/vireshk/mylinux.git/tree/kernel/sched/walt.c?h...
Good point, 10ms is actually the default value on Pixel2 devices too.
-- #include <best/regards.h>
Patrick Bellasi
On 11-01-18, 16:44, Pavan Kondeti wrote:
What is the WALT window size (walt_ravg_window) in your setup? I see it is 20 msec here [1], but just want to double confirm. Reducing WALT window size makes it more responsive. Have you tried with different window size like 10 msec?
Hi Pavan,
Sorry for the delayed response, I wanted to get back only after running some tests with the 10 ms window.
- So yes, the earlier tests were with the 20 ms window size. I have tested again with 10 ms and the results have improved significantly. No power numbers yet.
- Another thing to notice is that the PCMark VideoEditing and Workv2 tests report proper numbers after this change; earlier they were failing and giving 0 as the result.
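A toy model of why a smaller window reacts faster. It is only loosely inspired by window-based tracking: busy time is accounted per fixed window and the demand is refreshed at each window rollover as the maximum of the most recent window and the average of the last few. The history length of 5, the max(recent, average) policy and all names are assumptions made for this sketch, not details confirmed in this thread.

```python
SCALE = 1024   # full utilization
HIST = 5       # assumed number of windows kept in the history


def demand_at_rollovers(window_ms, busy_from_ms, total_ms):
    """Return (time_ms, demand) at each window rollover for a task that is
    idle until busy_from_ms and runs continuously afterwards."""
    history = [0] * HIST
    out = []
    t = window_ms
    while t <= total_ms:
        win_start = t - window_ms
        # Busy time that falls inside the window that just ended.
        busy = min(max(0, t - max(win_start, busy_from_ms)), window_ms)
        history = history[1:] + [busy * SCALE // window_ms]
        out.append((t, max(history[-1], sum(history) // HIST)))
        t += window_ms
    return out


def first_full_demand_ms(window_ms, busy_from_ms, total_ms=200):
    """Delay between the task becoming busy and the demand reaching SCALE."""
    for t, demand in demand_at_rollovers(window_ms, busy_from_ms, total_ms):
        if demand == SCALE:
            return t - busy_from_ms
    return None


if __name__ == "__main__":
    for window in (10, 20):
        print(window, first_full_demand_ms(window, busy_from_ms=3))
```

In this model the delay before the full demand becomes visible is bounded by roughly two window lengths, so halving the window roughly halves the reaction time.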
A few observations from the Pelt vs Walt tests I performed:
- The cpufreq statistics show that with Walt the big cluster runs at the highest OPP for significantly longer than with Pelt. This plays a significant role in boosting Walt's performance numbers.
This may happen due to one of two reasons (or both):
- The slope of the signal is higher and so cpu_util() increases rapidly.
- We put (more) tasks on the bigger CPU quickly with Walt.
My tests show that the main reason is the second one; it isn't really about the slope of the signal.
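For reference, per-frequency residency of this kind can be derived from the cpufreq stats file time_in_state (for example /sys/devices/system/cpu/cpu<N>/cpufreq/stats/time_in_state), which lists one "<frequency in kHz> <time>" pair per line. A minimal sketch that turns such text into percentages; the function name and the sample numbers are made up and are not measurements from these tests.

```python
def residency_pct(time_in_state_text):
    """Map frequency (kHz) -> share of the total accounted time, in percent."""
    times = {}
    for line in time_in_state_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        freq, spent = line.split()
        times[int(freq)] = int(spent)
    total = sum(times.values())
    return {f: (100.0 * t / total if total else 0.0) for f, t in times.items()}


if __name__ == "__main__":
    # Made-up sample content, for illustration only.
    sample = "903000 100\n1421000 100\n2362000 200\n"
    for freq, pct in sorted(residency_pct(sample).items()):
        print(freq, pct)
```

Comparing the share of the highest frequency between a Walt run and a Pelt run gives the kind of residency difference described above.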
- I also ran the Walt tests without the walt_cpu_high_irqload() check and the results weren't significantly different from Walt with it. So that isn't playing a big role either.
So far my tests show that Pelt 8 or 16 with util-est (from Patrick) gives very good results (just a bit lower than Walt in specific cases). The cpufreq stat figures are also equivalent to Walt's, so we should not be doing badly energy-wise.
I am quite sure that by using sched-tune to boost things, we will move tasks to the big cluster more quickly with Pelt 8/16, and that may help us get results equivalent to Walt's. Will do that going forward.
Thanks.
-- viresh
Hi Viresh,
On Fri, Jan 19, 2018 at 11:28:16AM +0530, Viresh Kumar wrote:
On 11-01-18, 16:44, Pavan Kondeti wrote:
What is the WALT window size (walt_ravg_window) in your setup? I see it is 20 msec here [1], but just want to double confirm. Reducing WALT window size makes it more responsive. Have you tried with different window size like 10 msec?
Hi Pavan,
Sorry for the delayed response, I wanted to make sure I get back after running some tests with the 10 ms thing.
- So yes, the earlier tests were with the 20 ms window size. I have tested it again with 10 ms and the results have surely improved significantly. No power numbers yet.
Good to know :-) btw what is HZ in your setup? Is it 300?
- Another thing to notice is that the PCMark Video-decoding and Work-V2 tests show proper numbers after this change, which were failing earlier and giving 0 as the results.
It is not clear how WALT window size is breaking these tests. I have not seen any issues (on internal platform) with 20 msec WALT window size.
Few observations from pelt vs walt tests I performed:
The cpufreq statistics show that with walt the big cluster runs for a significantly longer period of time at the highest OPP compared to pelt. Which plays a significant role boosting Walt's performance numbers.
This may happen due to one of the two reasons (or both):
- The slope of the signal is higher and so cpu_util() increases rapidly.
- We put (more) tasks on the bigger CPU quickly with Walt.
My test shows that the main reason is the second one and it isn't really about the slope of the signal thing.
The results I presented in the WALT vs PELT session were obtained with schedtune.boost = 10 for top-app. So the main thread of PCMark (AsyncTask) always runs on the BIG cluster for both PELT and WALT, and the benefit comes purely from WALT's fast frequency ramp-up. The gap is reduced from ~23% to ~10% with Patrick's util_est patches.
If you don't have boost and the util_est feature, you would see better BIG cluster residency with WALT since PELT forgets the history. The main thread runs for about 500 msec and sleeps for ~900 msec in many subscores.
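A back-of-the-envelope sketch of "PELT forgets the history" for the ~900 msec sleep mentioned above. It assumes a pure exponential decay with the given half-life while the task sleeps, and models util_est simply as a remembered value combined with max(), mirroring the max() visible in the boosted_cpu_util() hunk earlier in the thread; the real util_est patches are more involved, and the function names here are hypothetical.

```python
def util_after_sleep(util, sleep_ms, half_life_ms):
    """Utilization left after sleeping, assuming pure exponential decay."""
    return util * 0.5 ** (sleep_ms / half_life_ms)


def effective_util(util_avg, util_est):
    """Simplified util_est behaviour: never report less than the estimate."""
    return max(util_avg, util_est)


if __name__ == "__main__":
    for half_life in (8, 16, 32):
        left = util_after_sleep(1024, 900, half_life)
        # 900 is an arbitrary remembered estimate, for illustration only.
        print(half_life, left, effective_util(left, 900))
```

Even with the 32 ms half-life, a 900 msec sleep spans about 28 half-lives, so practically nothing of the previous utilization is left when the thread wakes up unless an estimate is kept.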
- I also performed Walt tests without walt_cpu_high_irqload() thing and the results weren't significantly different than walt with it. So that isn't playing a big role either.
walt_cpu_high_irqload() does not come into the picture at all for PCMark or for any other CPU benchmark. It is meant to avoid placing tasks on CPUs that are busy with IRQs and softirqs, for example in use cases involving high-rate WiFi and data transfers.
Thanks, Pavan
-- Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
On 19-01-18, 19:16, Pavan Kondeti wrote:
Good to know :-) btw what is HZ in your setup? Is it 300?
250
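The thread does not say why HZ matters here; one possible reading is that it determines how a WALT window lines up with scheduler ticks. Under that assumption only, the arithmetic for the two HZ values mentioned looks like this (helper names are made up):

```python
def tick_ms(hz):
    """Length of one scheduler tick in milliseconds."""
    return 1000.0 / hz


def ticks_per_window(window_ms, hz):
    """How many ticks fit into one window of the given length."""
    return window_ms * hz / 1000.0


if __name__ == "__main__":
    for hz in (250, 300):
        for window in (10, 20):
            print(hz, tick_ms(hz), window, ticks_per_window(window, hz))
```

At HZ=250 a 20 ms window covers a whole number of ticks while a 10 ms window covers two and a half; at HZ=300 both window sizes cover a whole number of ticks.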
- Another thing to notice is that the PCMark Video-decoding and Work-V2 tests show proper numbers after this change, which were failing earlier and giving 0 as the results.
It is not clear how WALT window size is breaking these tests. I have not seen any issues (on internal platform) with 20 msec WALT window size.
Somehow the 20 msec window makes things slow enough that video decoding does not complete.
-- viresh