Hi all,
Below are my results from profiling EAS; please help review, and any suggestions or questions are welcome.
* Purpose
This is the first round of profiling for the EASv5 patches on the Hikey board. Profiling these patches on Hikey gives the following information and feedback for EAS development:
- Created the profiling environment for ARM64
- Collected results after applying the EASv5 patches on an SoC with two CA53 clusters
- I cannot measure hardware power consumption, so currently I _ONLY_ check CPU duty cycle to compare scheduler behavior
* Hardware Environment
- Platform: 96boards Hikey
- SoC: Hi6220, 2 clusters, 4xCA53 CPUs in each cluster
- CPU clock: the 2 clusters (8 CPUs) share a coupled clock source and support 208MHz/432MHz/729MHz/960MHz/1200MHz
- Supports CPU and cluster level low power modes
* Software Environment
- Kernel: 4.2rc4 + EAS RFCv5
- ARM-TF: [1]
- Enabled CPUIdle with PSCI
- Enabled CPUFreq with cpufreq-dt driver
* Profiling Data
CLS0:   Cluster 0
CLS1:   Cluster 1
CPU_PD: CPU power down state
CLS_PD: Cluster power down state
- Case Sysbench: sysbench --test=cpu --cpu-max-prime=20000 run
Response Time   Mainline (ndm)   noEAS (ndm)   EAS (ndm)   EAS (sched)
min             4.19ms           +00.00ms      +00.00ms    +00.00ms
avg             4.21ms           +00.00ms      +00.00ms    -00.01ms
max             6.86ms           +00.09ms      +00.04ms    +13.61ms
approx 95%      4.23ms           +00.00ms      +00.00ms    -00.02ms
Idle Dutycycle   Mainline (ndm)   noEAS (ndm)   EAS (ndm)   EAS (sched)
CLS0: WFI        000.20ms         +000.35ms     +000.34ms   +000.89ms
CLS0: CPU_PD     001.58ms         +000.76ms     +000.88ms   +001.40ms
CLS0: CLS_PD     001.82ms         -000.41ms     -000.12ms   +001.10ms
CLS1: WFI        n.a.             n.a.          +2.9s       +000.07ms
CLS1: CPU_PD     n.a.             n.a.          +001.30ms   +6.7s
CLS1: CLS_PD     42.11s           +003.8ms      -2.9s       -6.8s
- Case MP3: ./idlestat --trace -f mp3_trace.log -t 30 -p -c -w -o mp3_report.log -- rt-app ./doc/examples/mp3-long.json
Idle Dutycycle   Mainline (ndm)   noEAS (ndm)   EAS (ndm)   EAS (sched)
CLS0: WFI        067.31ms         -022.10ms     -022.20ms   +019.70ms
CLS0: CPU_PD     887.74ms         +919.80ms     -292.40ms   -316.90ms
CLS0: CLS_PD     17.08s           -444.80ms     +895.30ms   +5.3s
CLS1: WFI        000.59ms         +002.30ms     +000.28ms   +196.70ms
CLS1: CPU_PD     n.a.             +000.26ms     n.a.        +004.00ms
CLS1: CLS_PD     28.80s           +189.10ms     -269.40ms   -10.6s
- Case rt-app 6%: ./idlestat --trace -f trace.log -t 30 -p -c -w -o report.log -- rt-app ./doc/examples/rt-app-6.json
Idle Dutycycle   Mainline (ndm)   noEAS (ndm)   EAS (ndm)   EAS (sched)
CLS0: WFI        7.82s            +037.40ms     -7.1s       -7.3s
CLS0: CPU_PD     4.26s            -154.20ms     -3.5s       -4.2s
CLS0: CLS_PD     005.76ms         n.a.          +17.6s      +18.9s
CLS1: WFI        6.46s            +2.0s         +2.4s       -155.90ms
CLS1: CPU_PD     6.01s            -1.9s         -5.3s       -6.0s
CLS1: CLS_PD     123.76ms         -118.50ms     -121.90ms   +1.2s
- Case rt-app 13%: ./idlestat --trace -f trace.log -t 30 -p -c -w -o report.log -- rt-app ./doc/examples/rt-app-13.json
Idle Dutycycle   Mainline (ndm)   noEAS (ndm)   EAS (ndm)   EAS (sched)
CLS0: WFI        9.26s            -304.70ms     -8.8s       -4.2s
CLS0: CPU_PD     1.11s            -695.20ms     -275.2ms    -1.1s
CLS0: CLS_PD     003.49ms         -001.30ms     +18.3s      -613us
CLS1: WFI        8.32s            -8.3s         -3.4s       -8.3s
CLS1: CPU_PD     2.65s            -2.6s         -2.6s       -2.6s
CLS1: CLS_PD     123.07ms         +19.9s        -121.6ms    +20.3s
- Case rt-app 19%: ./idlestat --trace -f trace.log -t 30 -p -c -w -o report.log -- rt-app ./doc/examples/rt-app-19.json
Idle Dutycycle   Mainline (ndm)   noEAS (ndm)   EAS (ndm)   EAS (sched)
CLS0: WFI        8.91s            -256.70ms     -473.4ms    -2.7s
CLS0: CPU_PD     428.42ms         +000.86ms     -425.5ms    -356.20ms
CLS0: CLS_PD     002.28ms         +000.45ms     +1.1ms      n.a.
CLS1: WFI        6.20s            +224.80ms     -6.2s       -5.2s
CLS1: CPU_PD     4.02s            -193.20ms     -4.0s       -4.0s
CLS1: CLS_PD     073.93ms         +1.3s         +20.0s      +039.80ms
- Case rt-app 25%: ./idlestat --trace -f trace.log -t 30 -p -c -w -o report.log -- rt-app ./doc/examples/rt-app-25.json
Idle Dutycycle   Mainline (ndm)   noEAS (ndm)   EAS (ndm)   EAS (sched)
CLS0: WFI        9.60s            -170.90ms     -3.5s       -3.4s
CLS0: CPU_PD     025.02ms         -018.90ms     -023.20ms   -023.50ms
CLS0: CLS_PD     004.43ms         -001.80ms     -003.50ms   +004.40ms
CLS1: WFI        9.72s            -1.1s         -9.6s       -9.2s
CLS1: CPU_PD     239.13ms         +1.7s         -237.90ms   -152.00ms
CLS1: CLS_PD     075.45ms         +001.00ms     +19.9s      +20.1s
- Case rt-app 31%: ./idlestat --trace -f trace.log -t 30 -p -c -w -o report.log -- rt-app ./doc/examples/rt-app-31.json
Idle Dutycycle   Mainline (ndm)   noEAS (ndm)   EAS (ndm)   EAS (sched)
CLS0: WFI        8.54s            +108.10ms     -5.6s       -189.20ms
CLS0: CPU_PD     001.97ms         +001.70ms     -001.40ms   +005.50ms
CLS0: CLS_PD     003.24ms         -002.30ms     +003.00ms   +001.90ms
CLS1: WFI        8.56s            +108.60ms     -8.6s       -260.00ms
CLS1: CPU_PD     n.a.             +001.70ms     n.a.        +199.60ms
CLS1: CLS_PD     189.15ms         -000.28ms     +19.9s      +449.60ms
- Case rt-app 38%: ./idlestat --trace -f trace.log -t 30 -p -c -w -o report.log -- rt-app ./doc/examples/rt-app-38.json
Idle Dutycycle   Mainline (ndm)   noEAS (ndm)   EAS (ndm)   EAS (sched)
CLS0: WFI        8.77s            +695.00ms     -7.0s       -4.9s
CLS0: CPU_PD     000.96ms         +001.50ms     +001.70ms   -000.01ms
CLS0: CLS_PD     003.06ms         -000.59ms     -002.50ms   +002.30ms
CLS1: WFI        8.79s            +913.70ms     -8.8s       -8.8s
CLS1: CPU_PD     001.71ms         +151.30ms     -001.70ms   -000.45ms
CLS1: CLS_PD     123.32ms         -120.60ms     +19.6s      +20.4s
- Case rt-app 44%: ./idlestat --trace -f trace.log -t 30 -p -c -w -o report.log -- rt-app ./doc/examples/rt-app-44.json
Idle Dutycycle   Mainline (ndm)   noEAS (ndm)   EAS (ndm)   EAS (sched)
CLS0: WFI        5.76s            +101.20ms     -4.6s       -4.5s
CLS0: CPU_PD     003.07ms         -000.18ms     -001.70ms   -001.80ms
CLS0: CLS_PD     002.02ms         -001.20ms     +000.23ms   +001.30ms
CLS1: WFI        6.01s            +108.60ms     -5.9s       -5.9s
CLS1: CPU_PD     001.14ms         +001.20ms     +000.07ms   +000.65ms
CLS1: CLS_PD     190.75ms         -115.50ms     +19.6s      +19.6s
- Case rt-app 50%: ./idlestat --trace -f trace.log -t 30 -p -c -w -o report.log -- rt-app ./doc/examples/rt-app-50.json
Idle Dutycycle   Mainline (ndm)   noEAS (ndm)   EAS (ndm)   EAS (sched)
CLS0: WFI        6.62s            -267.20ms     -6.6s       -6.6s
CLS0: CPU_PD     n.a.             +001.80ms     +001.50ms   +000.72ms
CLS0: CLS_PD     001.47ms         +001.30ms     -000.33ms   +005.10ms
CLS1: WFI        6.89s            -348.20ms     -6.9s       -6.9s
CLS1: CPU_PD     001.63ms         -001.50ms     -001.60ms   -001.60ms
CLS1: CLS_PD     001.87ms         +123.50ms     +19.9s      +20.2s
* Summary
- For this case of two identical clusters, EAS (ndm) gives the best result of all the configurations; we get the most benefit in CPU/cluster duty cycle with EAS (ndm). The reason is that, in almost all cases, EAS (ndm) greatly increases the time spent in cluster power down, which means the scheduler keeps load balanced within one cluster rather than spreading tasks across both clusters.
- EAS (sched) is not consistent across all cases: for the MP3 case it is even worse than mainline; for the rt-app 6%/13%/25%/38%/44%/50% cases it behaves almost the same as EAS (ndm), but it is not stable for the rt-app 19%/31% cases.
- EAS (sched) introduces very high latency for sysbench: the max response time reaches 20.47ms (6.86ms + 13.61ms), much higher than the other three configurations.
- EAS (sched) and EAS (ndm) affect idle state selection in CPUIdle: they increase the likelihood of entering a CPU-level state rather than a cluster-level state. This result comes from the sysbench case.
[1] https://github.com/Leo-Yan/arm-trusted-firmware/tree/hikey_enable_low_power_...
Thanks, Leo Yan
Hi Leo,
Interesting analysis. You didn't explain what EAS (ndm) and EAS (sched) stand for. :)
Also, regarding the clock topology, can you confirm that each cluster of 4 CPUs can be scaled (DVFS) independently?
Regards, Amit
On Thu, Aug 13, 2015 at 10:26 AM, Leo Yan leo.yan@linaro.org wrote:
[...]
Hi Amit,
Thanks for the review; please see my comments below.
On Thu, Aug 20, 2015 at 10:27:39AM -0700, Amit Kucheria wrote:
Hi Leo,
Interesting analysis. You didn't explain what EAS (ndm) and EAS (sched) stand for. :)
I took these conventions from Morten's patch series; they stand for:
noEAS (ndm) = EAS patches applied but the EAS feature disabled, with the cpufreq ondemand governor at a 20ms sampling rate;
EAS (ndm)   = EAS enabled, cpufreq ondemand governor at a 20ms sampling rate;
EAS (sched) = EAS enabled, scheduler-driven DVFS
Also, regarding the clock topology, can you confirm that each cluster of 4 CPUs can be scaled (DVFS) independently?
No, Hi6220 is a special case: the two clusters share a coupled clock source, so the frequency of all CPUs changes at the same time.
Thanks, Leo Yan
[...]
On 08/20/2015 07:14 PM, Leo Yan wrote:
[...]
On Thu, Aug 13, 2015 at 10:26 AM, Leo Yan leo.yan@linaro.org wrote:
Hi all,
Below are my results from profiling EAS; please help review, and any suggestions or questions are welcome.
Thanks for the test data!
[...]
Hardware Environment
- Platform: 96boards Hikey
- SoC: Hi6220, 2 clusters, 4xCA53 CPUs in each cluster
- CPU clock: the 2 clusters (8 CPUs) share a coupled clock source and support 208MHz/432MHz/729MHz/960MHz/1200MHz
- Supports CPU and cluster level low power modes
Software Environment
- Kernel: 4.2rc4 + EAS RFCv5
Could you share the platform adaptation code [especially arch/arm64/kernel/topology.c]? I'm interested in the Energy Model you're using and where you put the SD_SHARE_CAP_STATES sd flag :-)
- ARM-TF: [1]
- Enabled CPUIdle with PSCI
- Enabled CPUFreq with cpufreq-dt driver
[...]
Hi Dietmar,
On Thu, Aug 20, 2015 at 11:02:17PM -0700, Dietmar Eggemann wrote:
[...]
Thanks for the test data!
[...]
Hardware Environment
- Platform: 96boards Hikey
- SoC: Hi6220, 2 clusters, 4xCA53 CPUs in each cluster
- CPU clock: the 2 clusters (8 CPUs) share a coupled clock source and support 208MHz/432MHz/729MHz/960MHz/1200MHz
- Supports CPU and cluster level low power modes
Software Environment
- Kernel: 4.2rc4 + EAS RFCv5
Could you share the platform adaptation code [especially arch/arm64/kernel/topology.c]? I'm interested in the Energy Model you're using and where you put the SD_SHARE_CAP_STATES sd flag :-)
Thanks for the reminder; I directly used the same energy model as Juno, which Juri shared with me. I realized I need to change the code like below so that it best matches my case; I will do a second round of profiling and update you.
static inline int cpu_corepower_flags(void)
{
	return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN |
	       SD_SHARE_CAP_STATES;
}

static inline int cpu_cluster_flags(void)
{
	return SD_SHARE_CAP_STATES;
}

static struct sched_domain_topology_level arm64_topology[] = {
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, cpu_cluster_flags, cpu_cluster_energy, SD_INIT_NAME(DIE) },
	{ NULL, },
};
Thanks, Leo Yan
On 08/21/2015 12:57 AM, Leo Yan wrote:
[...]
Software Environment
- Kernel: 4.2rc4 + EAS RFCv5
Could you share the platform adaptation code [especially arch/arm64/kernel/topology.c]? I'm interested in the Energy Model you're using and where you put the SD_SHARE_CAP_STATES sd flag :-)
Thanks for the reminder; I directly used the same energy model as Juno, which Juri shared with me. I realized I need to change the code like below so that it best matches my case; I will do a second round of profiling and update you.
That's true for this platform. The direct dependency is in sched_group_energy():
	/*
	 * Is the group utilization affected by cpus outside this
	 * sched_group?
	 */
	sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
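For context, highest_flag_domain() walks the scheduling domain hierarchy bottom-up and returns the highest domain that still has the flag set, so the level at which you place SD_SHARE_CAP_STATES in the topology table decides how wide a CPU span sched_group_energy() considers. Roughly, from kernel/sched/sched.h:

static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
{
	struct sched_domain *sd, *hsd = NULL;

	/* Walk from the lowest domain upwards; stop at the first
	 * domain that does not carry the flag. */
	for_each_domain(cpu, sd) {
		if (!(sd->flags & flag))
			break;
		hsd = sd;
	}

	return hsd;
}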
Since you can't measure power on this board yet, you can't create your own EM. Using the Juno one is fine, but you have to adapt the 'struct capacity_state' arrays for the A53 and the clusters.
You have '208MHz/432MHz/729MHz/960MHz/1200MHz' whereas the A53 on Juno R0 has '450000 575000 700000 775000 850000'
So both have 5 OPPs. That means you could just take the values from Juno, except that the highest A53 OPP does not have a .cap value of 1024, because on Juno you have the big cluster as well. I would rescale the .cap values so that the highest OPP is 1024, just to be safe. The A53 at 1200000 is the highest capacity available, so it should be 1024. Maybe you did this already ...
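As a minimal sketch of what that rescaling could look like for Hi6220 (my illustration, assuming .cap scales linearly with OPP frequency, i.e. cap = 1024 * freq / 1200; the .power values are placeholders that still need real numbers, e.g. adapted from Juno's A53 energy model):

/* Hypothetical Hi6220 A53 capacity states, highest OPP rescaled to 1024 */
static struct capacity_state cap_states_core_hi6220[] = {
	{ .cap =  177, .power = 0, },	/*  208 MHz, power TBD */
	{ .cap =  369, .power = 0, },	/*  432 MHz, power TBD */
	{ .cap =  622, .power = 0, },	/*  729 MHz, power TBD */
	{ .cap =  819, .power = 0, },	/*  960 MHz, power TBD */
	{ .cap = 1024, .power = 0, },	/* 1200 MHz, power TBD */
};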
Another thing ... Did you adapt the rt-app-X.json files to 8 cpus?
static inline int cpu_corepower_flags(void)
{
	return SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN |
	       SD_SHARE_CAP_STATES;
}

static inline int cpu_cluster_flags(void)
{
	return SD_SHARE_CAP_STATES;
}

static struct sched_domain_topology_level arm64_topology[] = {
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, cpu_cluster_flags, cpu_cluster_energy, SD_INIT_NAME(DIE) },
	{ NULL, },
};
Thanks, Leo Yan
Hi Dietmar,
On Fri, Aug 21, 2015 at 07:09:09AM -0700, Dietmar Eggemann wrote:
On 08/21/2015 12:57 AM, Leo Yan wrote:
[...]
Software Environment
- Kernel: 4.2rc4 + EAS RFCv5
Could you share the platform adaptation code [especially arch/arm64/kernel/topology.c]? I'm interested in the Energy Model you're using and where you put the SD_SHARE_CAP_STATES sd flag :-)
Thanks for the reminder; I directly used the same energy model as Juno, which Juri shared with me. I realized I need to change the code like below so that it best matches my case; I will do a second round of profiling and update you.
That's true for this platform. The direct dependency is in sched_group_energy():
	/*
	 * Is the group utilization affected by cpus outside this
	 * sched_group?
	 */
	sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
Since you can't measure power on this board yet, you can't create your own EM. Using the Juno one is fine, but you have to adapt the 'struct capacity_state' arrays for the A53 and the clusters.
You have '208MHz/432MHz/729MHz/960MHz/1200MHz' whereas the A53 on Juno R0 has '450000 575000 700000 775000 850000'
So both have 5 OPPs. That means you could just take the values from Juno, except that the highest A53 OPP does not have a .cap value of 1024, because on Juno you have the big cluster as well. I would rescale the .cap values so that the highest OPP is 1024, just to be safe. The A53 at 1200000 is the highest capacity available, so it should be 1024. Maybe you did this already ...
No, I will adjust them.
Another thing ... Did you adapt the rt-app-X.json files to 8 cpus?
No, it still only creates 5 threads, so I need to create 8 threads instead.
Looks like my first round of profiling was quite rough; I will do a second round with more careful parameters for the energy model. Your detailed review is very helpful :)
BTW, I saw Morten's patch series has some performance comparisons for the sysbench and rt-app cases, like the "prf" column in the table below; could you help explain how to get these performance results?
Energy       Mainline (ndm)  noEAS (ndm)  EAS (ndm)    EAS (sched)
             nrg    prf      nrg    prf   nrg    prf   nrg    prf
sysbench     100    100      107    105   108    105   107    105

rt-app mp3   100    n.a.     101    n.a.   45    n.a.   43    n.a.

rt-app 6%    100     85      103     85     31     60    33     59
rt-app 13%   100     76      102     76     39     46    41     50
rt-app 19%   100     64      102     64     93     54    93     54
rt-app 25%   100     53      102     53     93     43    96     45
rt-app 31%   100     44      102     43    115     35   145     43
rt-app 38%   100     35      116     32    113      2   140     29
rt-app 44%   100   -40k      142    -9k    141    -9k   145    -1k
rt-app 50%   100  -100k      133   -21k    131   -22k   131    -4k
Thanks, Leo Yan
On 08/21/2015 07:44 AM, Leo Yan wrote:
Hi Dietmar,
On Fri, Aug 21, 2015 at 07:09:09AM -0700, Dietmar Eggemann wrote:
On 08/21/2015 12:57 AM, Leo Yan wrote:
[...]
BTW, I saw Morten's patch series has some performance comparisons for the sysbench and rt-app cases, like the "prf" column in the table below; could you help explain how to get these performance results?
Energy       Mainline (ndm)  noEAS (ndm)  EAS (ndm)    EAS (sched)
             nrg    prf      nrg    prf   nrg    prf   nrg    prf
sysbench     100    100      107    105   108    105   107    105
Run 'sysbench --test=cpu --num-threads=1 --max-time=30 run'
and take the number of events as your performance score:
# sysbench --test=cpu --num-threads=1 --max-time=30 run
sysbench 0.4.12: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

...

Test execution summary:
    total time:                          30.0069s
    total number of events:              986
...
The numbers are normalized against mainline for comparison.
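As a sketch of the arithmetic (my illustration, not code from the patch series), a run completing e.g. 1035 events against the 986-event mainline baseline above would score about 105:

#include <stdio.h>

/* Hypothetical helper: event count relative to mainline (mainline = 100),
 * rounded to the nearest integer. */
static unsigned int normalized_score(unsigned int events,
				     unsigned int mainline_events)
{
	return (events * 100 + mainline_events / 2) / mainline_events;
}

int main(void)
{
	printf("%u\n", normalized_score(1035, 986));	/* prints 105 */
	return 0;
}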
rt-app mp3   100    n.a.     101    n.a.   45    n.a.   43    n.a.
Performance numbers are only applicable to periodic tasks.
rt-app 6%    100     85      103     85     31     60    33     59
rt-app 13%   100     76      102     76     39     46    41     50
rt-app 19%   100     64      102     64     93     54    93     54
rt-app 25%   100     53      102     53     93     43    96     45
rt-app 31%   100     44      102     43    115     35   145     43
rt-app 38%   100     35      116     32    113      2   140     29
rt-app 44%   100   -40k      142    -9k    141    -9k   145    -1k
rt-app 50%   100  -100k      133   -21k    131   -22k   131    -4k
You need a special version of rt-app (referenced in the EAS RFCv5 cover letter under [4]) which gives you additional columns (slack c_run c_period wu_lat) in the per-task log file.
From the EAS RFCv5 cover letter: 'The performance metric expresses the average time left from completion of the run period until the next activation, normalized to best case: 100 is best case (not achievable in practice), meaning the busy period ended as fast as possible; 0 means on average we just finished in time before the next activation; negative means we continued running past the next activation.

task performance = slack/(c_period - c_run) * 100

We calculate the average of all per-run-period 'task performance' values for the individual task, and the average of the task values for the test run (equal workload if only 1 test run).'
The rt-app performance numbers are not normalized against mainline.
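To make the metric concrete, here is a small sketch of the per-activation score (my own illustration, not rt-app code), with an example value:

#include <stdio.h>

/* Per-activation performance score as defined above: 100 = best case,
 * 0 = finished just in time, negative = overran into the next period.
 * All arguments in the same time unit, e.g. microseconds. */
static double task_performance(double slack, double c_run, double c_period)
{
	return slack / (c_period - c_run) * 100.0;
}

int main(void)
{
	/* 10ms period, 3ms busy run, 5ms slack -> ~71.4 */
	printf("%.1f\n", task_performance(5000.0, 3000.0, 10000.0));
	return 0;
}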
[...]