Hi all,
This is the third round of profiling for the EASv5 patches on the Hikey board (6th Sept). Comments and suggestions are welcome.
* Overview
- 3rd round vs 2nd round:
Added two patches on top of EASv5. The first patch selects the smaller group/CPU when groups have the same capacity, so tasks are placed onto the first cluster in the LITTLE.LITTLE case; the second patch fixes the case where the sched domain is already the highest level, in which case its group must be used directly to calculate the shared capacity and energy difference.
These two patches are also enclosed for review.
- 2nd round vs 1st round:
According to review comments from the eas-dev mailing list, refined the energy model for Hikey and developed several Python/shell scripts for automated analysis.
The profiling data is divided into three parts:
- C-state profiling data
- P-state profiling data
- Scheduler performance profiling data
* Hardware Environment
- Platform: 96boards Hikey
- SoC: Hi6220, 2 clusters, 4xCA53 CPUs in each cluster
- CPU clock: the 2 clusters (8 CPUs) share the same clock source and support 208MHz/432MHz/729MHz/960MHz/1200MHz
- Supports CPU and cluster level low power modes
* Software Environment
- Kernel (4.2 + EAS RFCv5) + two extra patches [1]
- ARM-TF [2]
- Enable CPUIdle with PSCI
- Enable CPUFreq with cpufreq-dt driver
- Profiling scripts (a sketch of the idle-diff calculation follows this list):
  calc_idle_diff.py [3]: calculates C-state differences between configurations
  calc_pstate_time.py [4]: calculates P-state differences between configurations
  calc_sched_preformance.py [5]: calculates scheduler performance
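To illustrate what calc_idle_diff.py computes, below is a minimal Python sketch of the per-state duty-cycle comparison. It is not the actual script, and the dictionary input format is an assumption (the real script parses idlestat comparison reports):

  # Sketch: diff per-state idle residencies (test - baseline), as shown
  # in the C-state tables below. Input format is an assumption: both
  # arguments map (cluster, c_state) -> residency in ms.
  def idle_duty_cycle_diff(baseline, test):
      states = set(baseline) | set(test)
      return {s: test.get(s, 0.0) - baseline.get(s, 0.0) for s in states}

  # Example with the MP3 case's M2 numbers:
  baseline = {("clusterA", "M2"): 26450.0, ("clusterB", "M2"): 19520.0}
  eas_ndm  = {("clusterA", "M2"): 21950.0, ("clusterB", "M2"): 29920.0}
  for (cluster, state), diff in sorted(idle_duty_cycle_diff(baseline, eas_ndm).items()):
      print("%s: %s %+.1fms" % (cluster, state, diff))
  # clusterA: M2 -4500.0ms
  # clusterB: M2 +10400.0ms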
* Conventions
Below are some conventions used in the following tables:
CLS0: Cluster 0
CLS1: Cluster 1
WFI:  CPU WFI state
C2:   CPU power down state
M2:   Cluster power down state
DC:   Duty cycle
Configuration  | Mainline | EASv5 | Enable       | CPUFreq  | CPUFreq
               |          |       | ENERGY_AWARE | ondemand | sched
-------------- | -------- | ----- | ------------ | -------- | -------
Mainline (ndm) | Yes      | No    | No           | Yes      | No
noEAS (ndm)    | Yes      | Yes   | No           | Yes      | No
EAS (ndm)      | Yes      | Yes   | Yes          | Yes      | No
EAS (sched)    | Yes      | Yes   | Yes          | No       | Yes
* Profiling: C-state & P-state
The detailed profiling results have been uploaded to GitHub [6].
- Case MP3:
C-state DC     Mainline (ndm)  noEAS (ndm)  EAS (ndm)  EAS (sched)
clusterA: WFI  104.68ms        -35.3ms      +36.6ms    -65.0ms
clusterA: C2   2.11s           -2.0s        -1.8s      -973.0ms
clusterA: M2   26.45s          -6.7s        -4.5s      -15.7s
clusterB: WFI  3.07ms          -3.1ms       -3.1ms     -3.1ms
clusterB: C2   98.88ms         -56.4ms      -98.9ms    -88.8ms
clusterB: M2   19.52s          +9.0s        +10.4s     +10.3s
P-State Statistics

Cluster Level (ms)
Item      |    208MHz |    432MHz |    729MHz |    960MHz |   1.2GHz |      Cycles
----------|-----------|-----------|-----------|-----------|----------|------------
MAINLINE  |    870.39 |   4872.34 |     93.79 |     27.43 |  5880.00 |  9436597.71
EAS DIS   |    +2.87% |    -6.05% |   -14.43% |    -1.53% |    0.00% |      -1.40%
EAS NDM   |   -91.80% |   -78.45% |  +146.91% | +1384.29% |   -0.16% |     -14.46%
EAS SCHED | +1771.55% |   -79.00% |  -100.00% |  -100.00% |  -84.15% |     -47.56%

CPU Level (ms)
Item      |    208MHz |    432MHz |    729MHz |    960MHz |   1.2GHz |      Cycles
----------|-----------|-----------|-----------|-----------|----------|------------
MAINLINE  |    919.34 |   4882.11 |     95.01 |     27.20 |  5888.56 |  9461934.73
EAS DIS   |    +6.26% |    -3.38% |   -11.19% |    +0.18% |   +0.94% |      -0.01%
EAS NDM   |   -88.31% |   -78.24% |  +155.38% | +1374.49% |   -0.23% |     -14.47%
EAS SCHED | +1684.96% |   -78.96% |  -100.00% |  -100.00% |  -84.14% |     -47.40%
- Case rt-app 6%:
C-state DC     Mainline (ndm)  noEAS (ndm)  EAS (ndm)  EAS (sched)
clusterA: WFI  8.99s           -425.4ms     -4.9s      -462.3ms
clusterA: C2   2.18s           +810.8ms     -2.2s      -2.2s
clusterA: M2   9.20ms          -5.6ms       -6.0ms     -4.4ms
clusterB: WFI  8.69s           +135.2ms     -8.6s      -8.7s
clusterB: C2   1.46s           -229.9ms     -1.5s      -1.4s
clusterB: M2   191.43ms        -304us       +19.8s     +21.2s
P-State Statistics

Cluster Level (ms)
Item      |    208MHz |    432MHz |     729MHz |       960MHz |   1.2GHz |      Cycles
----------|-----------|-----------|------------|--------------|----------|------------
MAINLINE  |     16.73 |  18640.00 |      42.41 |         9.79 |   176.25 |  8307775.13
EAS DIS   |    -2.99% |    -1.29% |    -76.22% |       +1.71% |   +0.16% |      -1.53%
EAS NDM   |   -26.18% |   -99.72% |    +35.65% |  +161267.72% |  -62.97% |     +84.31%
EAS SCHED | +2939.63% |  -100.00% | +26837.28% |     +150.46% | +486.62% |     +16.74%

CPU Level (ms)
Item      |    208MHz |    432MHz |     729MHz |       960MHz |   1.2GHz |      Cycles
----------|-----------|-----------|------------|--------------|----------|------------
MAINLINE  |     56.28 |  42010.00 |      76.46 |         9.86 |   181.47 | 18442998.06
EAS DIS   |    -4.92% |    -1.33% |    -85.96% |       -2.70% |   +0.50% |      -1.57%
EAS NDM   |   -59.28% |   -99.66% |    +14.01% |  +160382.48% |  -61.11% |     -16.44%
EAS SCHED |  +879.42% |  -100.00% | +28572.37% |     +355.98% | +489.88% |      -5.52%
- Case rt-app 13%:
C-state DC     Mainline (ndm)  noEAS (ndm)  EAS (ndm)  EAS (sched)
clusterA: WFI  7.08s           -25.3ms      -3.3s      -2.3s
clusterA: C2   4.11s           -15.6ms      -4.1s      -4.1s
clusterA: M2   6.80ms          -3.4ms       -695us     +11.0ms
clusterB: WFI  6.47s           -329.8ms     -6.5s      -6.5s
clusterB: C2   4.13s           +497.4ms     -4.1s      -4.1s
clusterB: M2   71.32ms         +121.8ms     +20.0s     +21.3s
P-State Statistics

Cluster Level (ms)
Item      |    208MHz |    432MHz |    729MHz |        960MHz |    1.2GHz |      Cycles
----------|-----------|-----------|-----------|---------------|-----------|------------
MAINLINE  |     16.74 |   9730.00 |   8510.00 |          0.00 |     58.70 | 10481071.92
EAS DIS   |    -2.03% |   -10.89% |   +10.69% |     +1864.00% |  +190.19% |      +3.41%
EAS NDM   |    +0.78% |   -99.77% |   -98.47% |  +1602036.00% |  +222.13% |     +49.94%
EAS SCHED | +3304.36% |   -81.57% |   -79.87% |  +1004144.00% | +4555.06% |     +43.69%

CPU Level (ms)
Item      |    208MHz |    432MHz |    729MHz |        960MHz |    1.2GHz |      Cycles
----------|-----------|-----------|-----------|---------------|-----------|------------
MAINLINE  |     52.98 |  32160.00 |  24740.00 |          0.00 |     64.87 | 32017443.84
EAS DIS   |    +3.59% |    -6.69% |   +10.02% |     +3186.00% |  +164.31% |      +3.24%
EAS NDM   |    +3.68% |   -99.78% |   -98.58% |  +3109608.80% |  +275.04% |      -4.92%
EAS SCHED | +1057.87% |   -78.78% |   -75.17% |  +1954311.10% | +5900.74% |      -3.22%
- Case rt-app 19%:
C-state DC     Mainline (ndm)  noEAS (ndm)  EAS (ndm)  EAS (sched)
clusterA: WFI  8.59s           +730.0ms     -5.2s      -4.6s
clusterA: C2   795.08ms        -479.8ms     -792.0ms   -791.9ms
clusterA: M2   6.67ms          +1.4ms       -5.0ms     -510us
clusterB: WFI  6.78s           +1.7s        -6.7s      -6.8s
clusterB: C2   2.77s           -2.1s        -2.8s      -2.8s
clusterB: M2   73.12ms         +2.9ms       +20.0s     +21.3s
P-State Statistics

Cluster Level (ms)
Item      |    208MHz |    432MHz |    729MHz |    960MHz |    1.2GHz |      Cycles
----------|-----------|-----------|-----------|-----------|-----------|------------
MAINLINE  |     16.91 |   6330.00 |  14440.00 |    321.43 |     58.54 | 13643658.08
EAS DIS   |    -1.54% |   -42.34% |   +21.88% |  -100.00% |    -0.07% |      +6.14%
EAS NDM   |    -3.08% |  -100.00% |   -99.19% | +5041.93% |  +206.22% |     +18.52%
EAS SCHED | +3458.49% |   -98.23% |   -80.94% | +3764.90% | +2791.53% |     +18.27%

CPU Level (ms)
Item      |    208MHz |    432MHz |    729MHz |    960MHz |    1.2GHz |      Cycles
----------|-----------|-----------|-----------|-----------|-----------|------------
MAINLINE  |     55.47 |  23900.00 |  45480.00 |    944.74 |     64.38 | 44475464.16
EAS DIS   |    +0.25% |   -40.50% |    +9.78% |  -100.00% |    +2.95% |      -4.14%
EAS NDM   |    -3.62% |  -100.00% |   -99.21% | +4529.65% |  +185.78% |      -4.48%
EAS SCHED | +1466.81% |   -98.46% |   -78.07% | +3391.27% | +4333.30% |      -4.00%
- Case rt-app 25%:
C-state DC     Mainline (ndm)  noEAS (ndm)  EAS (ndm)  EAS (sched)
clusterA: WFI  9.41s           +80.7ms      -5.2s      -3.1s
clusterA: C2   6.62ms          +2.5ms       -2.6ms     -386us
clusterA: M2   9.02ms          -5.5ms       -1.8ms     +12.0ms
clusterB: WFI  9.35s           +72.1ms      -9.3s      -9.3s
clusterB: C2   11.49ms         +208us       +2.0ms     +40.2ms
clusterB: M2   121.57ms        +67.8ms      +19.8s     +20.4s
P-State Statistics

Cluster Level (ms)
Item      |    208MHz |    432MHz |    729MHz |       960MHz |     1.2GHz |      Cycles
----------|-----------|-----------|-----------|--------------|------------|------------
MAINLINE  |     16.86 |      0.00 |  21240.00 |         9.90 |     109.76 | 15628682.88
EAS DIS   |    -5.69% | +1592.00% |    -0.85% |      +95.05% |    +64.24% |      -0.20%
EAS NDM   |    +3.80% |     0.00% |   -62.71% |   +27272.12% |  +4797.52% |      -5.11%
EAS SCHED | +3999.88% |   +62.70% |   -97.79% |  +132936.06% |    -57.80% |     -15.63%

CPU Level (ms)
Item      |    208MHz |    432MHz |    729MHz |       960MHz |     1.2GHz |      Cycles
----------|-----------|-----------|-----------|--------------|------------|------------
MAINLINE  |     55.41 |      0.00 |  76010.00 |         9.64 |     122.04 | 55578519.60
EAS DIS   |    -3.66% | +6163.00% |    -2.84% |     +318.59% |    +66.44% |      -2.56%
EAS NDM   |    +0.43% |     0.00% |   -59.18% |   +91362.04% | +16748.26% |      +0.34%
EAS SCHED | +1258.89% |  +159.70% |   -99.03% |  +136755.93% |    -34.05% |     -75.79%
- Case rt-app 31%:
C-state DC     Mainline (ndm)  noEAS (ndm)  EAS (ndm)  EAS (sched)
clusterA: WFI  8.64s           -193.4ms     -9.0ms     -6.2s
clusterA: C2   2.23ms          -1.0ms       +3.8ms     +2.2ms
clusterA: M2   8.42ms          +1.2ms       -1.4ms     -5.0ms
clusterB: WFI  8.64s           -210.4ms     -11.2ms    -7.7s
clusterB: C2   n.a.            +1.6ms       +4.9ms     +79.7ms
clusterB: M2   190.32ms        -439us       -119.2ms   +18.9s
P-State Statistics

Cluster Level (ms)
Item      |   208MHz |   432MHz |   729MHz |     960MHz |    1.2GHz |      Cycles
----------|----------|----------|----------|------------|-----------|------------
MAINLINE  |    16.49 |     0.00 | 22710.00 |      18.56 |    181.44 | 16794565.52
EAS DIS   |   +0.12% |    0.00% |   +1.81% |    -46.77% |    +0.23% |      +1.73%
EAS NDM   |   +2.55% |    0.00% |   +0.18% |   -100.00% |   -61.71% |      -0.73%
EAS SCHED | +320.07% |    0.00% |  -95.70% |  +4179.96% | +9229.78% |     +29.82%

CPU Level (ms)
Item      |   208MHz |   432MHz |   729MHz |     960MHz |     1.2GHz |      Cycles
----------|----------|----------|----------|------------|------------|------------
MAINLINE  |    55.15 |     0.00 | 89550.00 |       9.60 |     205.88 | 65549695.12
EAS DIS   |   -3.25% |    0.00% |   +1.82% |     +0.33% |     +0.66% |      +1.81%
EAS NDM   |   -0.25% |    0.00% |   +0.45% |   -100.00% |    -51.82% |      +0.24%
EAS SCHED | +102.30% |    0.00% |  -96.24% | +30805.83% | +23883.23% |      -1.48%
- Case rt-app 38%:
C-state DC     Mainline (ndm)  noEAS (ndm)  EAS (ndm)  EAS (sched)
clusterA: WFI  7.38s           -68.4ms      -3.2s      -4.7s
clusterA: C2   3.81ms          +3.0ms       +7.7ms     +4.4ms
clusterA: M2   8.30ms          -6.5ms       -5.4ms     -2.6ms
clusterB: WFI  7.38s           -6.7ms       -2.6s      -3.6s
clusterB: C2   655us           -655us       +191.1ms   +447.0ms
clusterB: M2   71.73ms         +52.2ms      +8.7s      +10.5s
P-State Statistics

Cluster Level (ms)
Item      |    208MHz |   432MHz |   729MHz |    960MHz |     1.2GHz |      Cycles
----------|-----------|----------|----------|-----------|------------|------------
MAINLINE  |     16.43 |     0.00 | 13740.00 |  11510.00 |      70.25 | 21153777.44
EAS DIS   |    +1.77% |    0.00% |   +0.44% |    +0.09% |    +70.23% |      +0.53%
EAS NDM   |    -0.79% |    0.00% |  -90.94% |   -72.37% | +25593.95% |     +21.13%
EAS SCHED | +2751.25% |    0.00% |  -74.16% |   -99.45% | +29707.83% |     +31.77%

CPU Level (ms)
Item      |    208MHz |   432MHz |   729MHz |    960MHz |     1.2GHz |      Cycles
----------|-----------|----------|----------|-----------|------------|------------
MAINLINE  |     55.33 |     0.00 | 53790.00 |  44650.00 |     103.91 | 82213110.64
EAS DIS   |    +0.56% |    0.00% |    0.00% |    -1.81% |    +50.12% |      -0.87%
EAS NDM   |    -1.66% |    0.00% |  -91.35% |   -75.88% | +53561.82% |      -1.89%
EAS SCHED |  +821.76% |    0.00% |  -77.45% |   -99.53% | +55525.06% |      -4.51%
- Case rt-app 44%:
C-state DC     Mainline (ndm)  noEAS (ndm)  EAS (ndm)  EAS (sched)
clusterA: WFI  5.62s           +123.1ms     -4.0s      -1.7s
clusterA: C2   4.22ms          +933us       -606us     -1.6ms
clusterA: M2   3.77ms          +86us        -697us     +2.6ms
clusterB: WFI  5.61s           +87.6ms      -46.5ms    -859.0ms
clusterB: C2   1.36ms          -1.4ms       +48.7ms    +80.8ms
clusterB: M2   71.19ms         +10.7ms      +4.1s      +2.4s
P-State Statistics

Cluster Level (ms)
Item      |   208MHz |    432MHz |   729MHz |   960MHz |    1.2GHz |      Cycles
----------|----------|-----------|----------|----------|-----------|------------
MAINLINE  |    15.51 |     19.62 | 15660.00 | 12720.00 |    451.61 | 24180973.92
EAS DIS   |   +5.74% |  -100.00% |   +0.19% |   -5.82% |  +115.02% |      -0.30%
EAS NDM   |  +13.73% |  -100.00% |  -74.27% |  +44.65% | +1343.72% |     +17.57%
EAS SCHED | +210.57% |  -100.00% |  +30.46% |  -90.64% | +1664.80% |      +5.91%

CPU Level (ms)
Item      |   208MHz |    432MHz |   729MHz |   960MHz |    1.2GHz |      Cycles
----------|----------|-----------|----------|----------|-----------|------------
MAINLINE  |    51.45 |     76.43 | 61390.00 | 50060.00 |   1482.15 | 94633209.36
EAS DIS   |   +6.03% |  -100.00% |   -1.24% |   -6.07% |  +129.76% |      -1.26%
EAS NDM   |  +12.05% |  -100.00% |  -74.51% |  +15.60% | +1314.79% |      -2.64%
EAS SCHED |  +72.01% |  -100.00% |  +27.51% |  -91.65% | +1547.61% |      -4.47%
- Case rt-app 50%:
C-state DC     Mainline (ndm)  noEAS (ndm)  EAS (ndm)  EAS (sched)
clusterA: WFI  6.34s           +227.7ms     +67.6ms    +304.1ms
clusterA: C2   3.35ms          -1.5ms       -2.0ms     -1.7ms
clusterA: M2   7.73ms          -1.6ms       -1.9ms     -3.1ms
clusterB: WFI  6.34s           +231.2ms     +69.2ms    +308.0ms
clusterB: C2   n.a.            n.a.         +2.0ms     +2.2ms
clusterB: M2   188.71ms        -64.3ms      -120.2ms   +1.3s
P-State Statistics

Cluster Level (ms)
Item      |    208MHz |   432MHz |   729MHz |    960MHz |   1.2GHz |       Cycles
----------|-----------|----------|----------|-----------|----------|-------------
MAINLINE  |     16.81 |     0.00 |     9.95 |  27340.00 |   169.39 |  26460416.57
EAS DIS   |    +0.54% |    0.00% | +208.91% |    -1.79% |  -26.08% |       -1.92%
EAS NDM   |    -3.51% |    0.00% |  +98.23% |    -0.55% |  -63.03% |       -1.00%
EAS SCHED | +2926.71% |    0.00% | -100.00% |    -2.93% | +587.69% |       +1.97%

CPU Level (ms)
Item      |    208MHz |   432MHz |   729MHz |    960MHz |   1.2GHz |       Cycles
----------|-----------|----------|----------|-----------|----------|-------------
MAINLINE  |     55.44 |     0.00 |    10.62 | 108060.00 |   170.45 | 103961414.23
EAS DIS   |    +0.41% |    0.00% | +743.42% |    -1.43% |   +2.25% |       -1.37%
EAS NDM   |    -1.75% |    0.00% |  +90.32% |    -0.92% |  -50.47% |       -1.01%
EAS SCHED |  +894.37% |    0.00% | -100.00% |    -3.08% | +865.46% |       -1.28%
* Profiling: performance
sysbench --test=cpu --num-threads=1 --max-time=10 run
rt-app performance is calculated with the formula below:

  task performance = slack / (c_period - c_run) * 1024
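As a reference for how this index behaves, here is a minimal Python sketch (the function name is ours; slack, c_period and c_run are the per-thread values from the rt-app log, all in the same time unit):

  # Minimal sketch of the performance index above.
  def task_performance(slack, c_period, c_run):
      # Slack remaining in a period, scaled to the idle budget
      # (c_period - c_run) and normalized to 1024; negative slack
      # (a missed deadline) yields a negative index.
      return slack / float(c_period - c_run) * 1024

  # Example with the rt-app 6% parameters (run=120us, period=2000us):
  print(task_performance(1500, 2000, 120))   # ~817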
            mainline (ndm)  noeas (ndm)  eas (ndm)  eas (sched)
            prf             prf          prf        prf
sysbench    100             100          100         92
rt-app  6%  662             665          393        615
rt-app 13%  648             645          465        394
rt-app 19%  610             648          479         57
rt-app 25%  649             664          306        518
rt-app 31%  600             585          596        366
rt-app 38%  576             584          259       -166
rt-app 44%  466             487           30       -349
rt-app 50%  583             602          598        612
* Summary
- After applying the two extra patches, the profiling results are consistent and stable for EAS (ndm) and EAS (sched). Tasks are placed onto the first cluster for LITTLE.LITTLE, so EAS (ndm) and EAS (sched) give much better cluster-level idle duty cycles compared with noEAS (ndm).
- If tasks are placed on only one cluster, the CPU-level cycles do not change much, but the cluster-level cycles increase considerably with EAS (ndm) and EAS (sched); so after packing tasks onto one cluster, the cluster level runs for a longer time.
- With "sched" governor, it is more aggressive than "ondemand" governor, the CPUs will easily run at high OPPs (1.2GHz) and low OPP (208MHz); With "ondemand" governor, CPUs have many chances run at middle OPPs (729MHz or 960MHz).
- Need to investigate the rt-app 31% case; its CPU/cluster idle duty cycle is abnormal.
- Need to investigate the rt-app 25% case; it is abnormal for the "ondemand" governor, which stays at the 1.2GHz OPP much longer than in the other configurations.
[1] https://github.com/Leo-Yan/linux/tree/profile_easv5_hikey_round3
[2] https://github.com/96boards/arm-trusted-firmware/tree/hikey
[3] https://github.com/Leo-Yan/utility/blob/master/profile_eas/calc_idle_diff.py
[4] https://github.com/Leo-Yan/utility/blob/master/profile_eas/calc_pstate_time....
[5] https://github.com/Leo-Yan/utility/blob/master/profile_eas/calc_sched_prefor...
[6] https://github.com/Leo-Yan/utility/tree/master/profile_eas/hikey_easv5_round...
Thanks, Leo Yan
On 07/09/15 06:50, Leo Yan wrote:
Hi all,
[...]
Also have enclosed these two patches for review.
Let's discuss these patches on LKML since you have sent out emails to LKML discussing these changes.
[...]
Software Environment
Kernel (4.2 + EAS RFCv5) + extra two patches [1]
ARM-TF [2]
Enable CPUIdle with PSCI
Enable CPUFreq with cpufreq-dt driver
Profiling scripts:
  calc_idle_diff.py [3]: calculates C-state differences between configurations
  calc_pstate_time.py [4]: calculates P-state differences between configurations
  calc_sched_preformance.py [5]: calculates scheduler performance
I saw that you saved an x86_64 idlestat binary on your github utility/profile_eas project. I thought so far we have to run idlestat on the target so it can retrieve the target idle state names like WFI, C2 or M2?
There is this energy model (EM) feature in idlestat (-e energy_model_file) which calculates energy consumption per trace file.
example on TC2:
# idlestat --trace -f trace.dat -t T -e energy_model_arm_tc2
Parsed energy model file successfully
...
ClusterA Energy Caps   22027 (2.202654e+04)
ClusterA Energy Idle      57 (5.740462e+01)
ClusterA Energy Index  22084 (2.208395e+04)
ClusterB Energy Caps    3236 (3.235515e+03)
ClusterB Energy Idle      40 (4.041970e+01)
ClusterB Energy Index   3276 (3.275935e+03)
Total Energy Index 25360 (2.535988e+04)
The current idlestat code has only this ARM TC2 specific EM file, but it should be easy for you to create one for your Hikey board.
[...]
Profiling: performance
sysbench --test=cpu --num-threads=1 --max-time=10 run
rt-app performance is calculated with the formula below:

  task performance = slack / (c_period - c_run) * 1024
            mainline (ndm)  noeas (ndm)  eas (ndm)  eas (sched)
            prf             prf          prf        prf
sysbench    100             100          100         92
rt-app  6%  662             665          393        615
rt-app 13%  648             645          465        394
rt-app 19%  610             648          479         57
rt-app 25%  649             664          306        518
rt-app 31%  600             585          596        366
rt-app 38%  576             584          259       -166
rt-app 44%  466             487           30       -349
rt-app 50%  583             602          598        612
Seeing these performance numbers, have you calibrated your json files for your hikey board?
ARM TC2 example, calibrated against A15:
# cat wl_test.json | grep calibration
"calibration": 141,
Summary
- After applying the two extra patches, the profiling results are consistent and stable for EAS (ndm) and EAS (sched). Tasks are placed onto the first cluster for LITTLE.LITTLE, so EAS (ndm) and EAS (sched) give much
Shouldn't we call it an SMP system instead of LITTLE.LITTLE?
[...]
-- Dietmar
On 17/09/15 18:09, Dietmar Eggemann wrote:
On 07/09/15 06:50, Leo Yan wrote:
Hi all,
[...]
Also have enclosed these two patches for review.
Let's discuss these patches on LKML since you have sent out emails to LKML discussing these changes.
[...]
Software Environment
Kernel (4.2 + EAS RFCv5) + extra two patches [1]
ARM-TF [2]
Enable CPUIdle with PSCI
Enable CPUFreq with cpufreq-dt driver
Profiling scripts:
  calc_idle_diff.py [3]: calculates C-state differences between configurations
  calc_pstate_time.py [4]: calculates P-state differences between configurations
  calc_sched_preformance.py [5]: calculates scheduler performance
I saw that you saved an x86_64 idlestat binary on your github utility/profile_eas project. I thought so far we have to run idlestat on the target so it can retrieve the target idle state names like WFI, C2 or M2?
There is this energy model (EM) feature in idlestat (-e energy_model_file) which calculates energy consumption per trace file.
example on TC2:
# idlestat --trace -f trace.dat -t T -e energy_model_arm_tc2
Parsed energy model file successfully
...
ClusterA Energy Caps   22027 (2.202654e+04)
ClusterA Energy Idle      57 (5.740462e+01)
ClusterA Energy Index  22084 (2.208395e+04)
ClusterB Energy Caps    3236 (3.235515e+03)
ClusterB Energy Idle      40 (4.041970e+01)
ClusterB Energy Index   3276 (3.275935e+03)
Total Energy Index 25360 (2.535988e+04)
The current idlestat code has only this ARM TC2 specific EM file, but it should be easy for you to create one for your Hikey board.
[...]
Profiling: performance
sysbench --test=cpu --num-threads=1 --max-time=10 run
rt-app performance is calculated with the formula below:

  task performance = slack / (c_period - c_run) * 1024
            mainline (ndm)  noeas (ndm)  eas (ndm)  eas (sched)
            prf             prf          prf        prf
sysbench    100             100          100         92
rt-app  6%  662             665          393        615
rt-app 13%  648             645          465        394
rt-app 19%  610             648          479         57
rt-app 25%  649             664          306        518
rt-app 31%  600             585          596        366
rt-app 38%  576             584          259       -166
rt-app 44%  466             487           30       -349
rt-app 50%  583             602          598        612
Seeing these performance numbers, have you calibrated your json files for your hikey board?
ARM TC2 example, calibrated against A15:
# cat wl_test.json | grep calibration
"calibration": 141,
You're multiplying w/ 1024 whereas we use 100 :-)
Have you found the reason for these crazy outliers (e.g. 'eas (sched)' 19%, 38%, 44% or 'eas (ndm)' 44%)?
Morten just mentioned that even if you calibrated your system correctly, you will not stress the hikey board as much as we stress the TC2, especially with rt-app 38%, 44% and 50%. We calibrate against a big cpu on TC2, which means that starting with a run/period ratio of 38% we start to saturate the 3 little cpus of the 5 TC2 cpus (we're using # rt-app threads eq. # logical cpus). So with the current setup you should never see negative performance numbers on hikey (SMP), but the performance numbers should decrease with higher rt-app percentage values.
[...]
On Fri, Sep 18, 2015 at 12:23:32PM +0100, Dietmar Eggemann wrote:
On 17/09/15 18:09, Dietmar Eggemann wrote:
On 07/09/15 06:50, Leo Yan wrote:
Hi all,
[...]
Also have enclosed these two patches for review.
Let's discuss these patches on LKML since you have sent out emails to LKML discussing these changes.
[...]
Software Environment
Kernel (4.2 + EAS RFCv5) + extra two patches [1]
ARM-TF [2]
Enable CPUIdle with PSCI
Enable CPUFreq with cpufreq-dt driver
Profiling scripts:
  calc_idle_diff.py [3]: calculates C-state differences between configurations
  calc_pstate_time.py [4]: calculates P-state differences between configurations
  calc_sched_preformance.py [5]: calculates scheduler performance
I saw that you saved an x86_64 idlestat binary on your github utility/profile_eas project. I thought so far we have to run idlestat on the target so it can retrieve the target idle state names like WFI, C2 or M2?
There is this energy model (EM) feature in idlestat (-e energy_model_file) which calculates energy consumption per trace file.
example on TC2:
# idlestat --trace -f trace.dat -t T -e energy_model_arm_tc2
Parsed energy model file successfully
...
ClusterA Energy Caps   22027 (2.202654e+04)
ClusterA Energy Idle      57 (5.740462e+01)
ClusterA Energy Index  22084 (2.208395e+04)
ClusterB Energy Caps    3236 (3.235515e+03)
ClusterB Energy Idle      40 (4.041970e+01)
ClusterB Energy Index   3276 (3.275935e+03)
Total Energy Index 25360 (2.535988e+04)
The current idlestat code has only this ARM TC2 specific EM file, but it should be easy for you to create one for your Hikey board.
[...]
Profiling: performance
sysbench --test=cpu --num-threads=1 --max-time=10 run
rt-app performance is calculated with the formula below:

  task performance = slack / (c_period - c_run) * 1024
            mainline (ndm)  noeas (ndm)  eas (ndm)  eas (sched)
            prf             prf          prf        prf
sysbench    100             100          100         92
rt-app  6%  662             665          393        615
rt-app 13%  648             645          465        394
rt-app 19%  610             648          479         57
rt-app 25%  649             664          306        518
rt-app 31%  600             585          596        366
rt-app 38%  576             584          259       -166
rt-app 44%  466             487           30       -349
rt-app 50%  583             602          598        612
Seeing these performance numbers, have you calibrated your json files for your hikey board?
ARM TC2 example, calibrated against A15:
# cat wl_test.json | grep calibration "calibration": 141,
You're multiplying w/ 1024 whereas we use 100 :-)
Yes.
Have you found the reason for these crazy outliers (e.g. 'eas (sched)' 19%, 38%, 44% or 'eas (ndm)' 44%)?
Not yet; I will dig into this issue.
Morten just mentioned that even if you calibrated your system correctly, you will not stress the hikey board as much as we stress the TC2, especially with rt-app 38%, 44% and 50%. We calibrate against a big cpu on TC2, which means that starting with a run/period ratio of 38% we start to saturate the 3 little cpus of the 5 TC2 cpus (we're using # rt-app threads eq. # logical cpus).
Please help review the json file for Hikey which I pasted in another email; it launches 8 threads for the 8 CPUs.
So with the current setup you should never see negative performance numbers on hikey (SMP) but the performance numbers should decrease with higher rt-app percentage values.
Thanks, Leo Yan
Hi Dietmar,
Thanks a lot for reviewing; please see my comments below.
On Thu, Sep 17, 2015 at 06:09:43PM +0100, Dietmar Eggemann wrote:
On 07/09/15 06:50, Leo Yan wrote:
Hi all,
[...]
Also have enclosed these two patches for review.
Let's discuss these patches on LKML since you have sent out emails to LKML discussing these changes.
Yeah, will look into Morten's comments for related patches.
[...]
Software Environment
Kernel (4.2 + EAS RFCv5) + extra two patches [1]
ARM-TF [2]
Enable CPUIdle with PSCI
Enable CPUFreq with cpufreq-dt driver
Profiling scripts:
  calc_idle_diff.py [3]: calculates C-state differences between configurations
  calc_pstate_time.py [4]: calculates P-state differences between configurations
  calc_sched_preformance.py [5]: calculates scheduler performance
I saw that you saved an x86_64 idlestat binary on your github utility/profile_eas project.
I use the x86_64 idlestat to compare trace logs and get the difference in idle duty cycles. For example, I take the rt-app 6% trace log files for mainline and EAS (ndm), then use the command below on the host PC to get the difference in the CPUs' idle duty cycles:
./idlestat --import -f eas_ndm_trace.log -b mainline_trace.log -r comparison >> idlestat_compare.txt
So finally I can summarize the idle duty cycles for the different configurations.
I thought so far we have to run idlestat on the target so it can retrieve the target idle state names like WFI, C2 or M2?
Yes, I run idlestat on the host with the commands below:
./idlestat --trace -f ./result/mp3/trace.log -t 30 -p -c -w -o ./result/mp3/report.log -- ./rt-app ./doc/examples/mp3-long.json
./idlestat --trace -f ./result/rt-app-6/trace.log -t 30 -p -c -w -o ./result/rt-app-6/report.log -- ./rt-app ./doc/examples/rt-app-6.json
There is this energy model (EM) feature in idlestat (-e energy_model_file) which calculates energy consumption per trace file.
example on TC2:
# idlestat --trace -f trace.dat -t T -e energy_model_arm_tc2
Parsed energy model file successfully
...
ClusterA Energy Caps   22027 (2.202654e+04)
ClusterA Energy Idle      57 (5.740462e+01)
ClusterA Energy Index  22084 (2.208395e+04)
ClusterB Energy Caps    3236 (3.235515e+03)
ClusterB Energy Idle      40 (4.041970e+01)
ClusterB Energy Index   3276 (3.275935e+03)
Total Energy Index 25360 (2.535988e+04)
The current idlestat code has only this ARM TC2 specific EM file, but it should be easy for you to create one for your Hikey board.
Thanks for pointing this out; I didn't know about it before. I just did a quick try and it works on my side. But I found that if I add the "WFI" state, it reports the error below; if I remove the "WFI" state, the error goes away. The "WFI" state should not be ignored, so I will check idlestat's source code.
Error: parse_energy_model: too many C states specified for cluster in energy_model_hikey
can't parse energy model file
Profiling: performance
sysbench --test=cpu --num-threads=1 --max-time=10 run
rt-app performance is calculated with the formula below:

  task performance = slack / (c_period - c_run) * 1024
            mainline (ndm)  noeas (ndm)  eas (ndm)  eas (sched)
            prf             prf          prf        prf
sysbench    100             100          100         92
rt-app  6%  662             665          393        615
rt-app 13%  648             645          465        394
rt-app 19%  610             648          479         57
rt-app 25%  649             664          306        518
rt-app 31%  600             585          596        366
rt-app 38%  576             584          259       -166
rt-app 44%  466             487           30       -349
rt-app 50%  583             602          598        612
Seeing these performance numbers, have you calibrated your json files for your hikey board?
ARM TC2 example, calibrated against A15:
# cat wl_test.json | grep calibration
"calibration": 141,
No, still use "CPU0" for calibration. So below are my rt-app-6.json file, could you check if there still have other things i missed?
{ "tasks": { "thread0": { "instance": 5, "loop": -1, "run": 120, "sleep": 0, "timer": { "ref": "unique", "period": 2000 } } }, "global": { "duration": 20, "calibration": "CPU0", "default_policy": "SCHED_OTHER", "pi_enabled": false, "lock_pages": false, "logdir": "./", "log_basename": "rt-app-6", "gnuplot": true } }
Summary
- After applying the two extra patches, the profiling results are consistent and stable for EAS (ndm) and EAS (sched). Tasks are placed onto the first cluster for LITTLE.LITTLE, so EAS (ndm) and EAS (sched) give much
Shouldn't we call it an SMP system instead of LITTLE.LITTLE?
From the CPU topology perspective, LITTLE.LITTLE is somewhat different from an SMP system with only one cluster. Later I will directly use "SMP".
Thanks, Leo Yan
On Fri, Sep 18, 2015 at 11:36:28PM +0800, Leo Yan wrote:
Hi Dietmar,
Thanks a lot for reviewing; please see my comments below.
On Thu, Sep 17, 2015 at 06:09:43PM +0100, Dietmar Eggemann wrote:
On 07/09/15 06:50, Leo Yan wrote:
Hi all,
[...]
Also have enclosed these two patches for review.
Let's discuss these patches on LKML since you have sent out emails to LKML discussing these changes.
Yeah, will look into Morten's comments for related patches.
[...]
Software Environment
Kernel (4.2 + EAS RFCv5) + extra two patches [1]
ARM-TF [2]
Enable CPUIdle with PSCI
Enable CPUFreq with cpufreq-dt driver
Profiling scripts:
  calc_idle_diff.py [3]: calculates C-state differences between configurations
  calc_pstate_time.py [4]: calculates P-state differences between configurations
  calc_sched_preformance.py [5]: calculates scheduler performance
I saw that you saved an x86_64 idlestat binary on your github utility/profile_eas project.
I use the x86_64 idlestat to compare trace logs and get the difference in idle duty cycles. For example, I take the rt-app 6% trace log files for mainline and EAS (ndm), then use the command below on the host PC to get the difference in the CPUs' idle duty cycles:
./idlestat --import -f eas_ndm_trace.log -b mainline_trace.log -r comparison >> idlestat_compare.txt
So finally I can summarize the idle duty cycles for the different configurations.
I thought so far we have to run idlestat on the target so it can retrieve the target idle state names like WFI, C2 or M2?
Yes, I run idlestat on the host with the commands below:
./idlestat --trace -f ./result/mp3/trace.log -t 30 -p -c -w -o ./result/mp3/report.log -- ./rt-app ./doc/examples/mp3-long.json
./idlestat --trace -f ./result/rt-app-6/trace.log -t 30 -p -c -w -o ./result/rt-app-6/report.log -- ./rt-app ./doc/examples/rt-app-6.json
There is this energy model (EM) feature in idlestat (-e energy_model_file) which calculates energy consumption per trace file.
example on TC2:
# idlestat --trace -f trace.dat -t T -e energy_model_arm_tc2
Parsed energy model file successfully
...
ClusterA Energy Caps   22027 (2.202654e+04)
ClusterA Energy Idle      57 (5.740462e+01)
ClusterA Energy Index  22084 (2.208395e+04)
ClusterB Energy Caps    3236 (3.235515e+03)
ClusterB Energy Idle      40 (4.041970e+01)
ClusterB Energy Index   3276 (3.275935e+03)
Total Energy Index 25360 (2.535988e+04)
The current idlestat code has only this ARM TC2 specific EM file, but it should be easy for you to create one for your Hikey board.
Thanks for pointing this out; I didn't know about it before. I just did a quick try and it works on my side. But I found that if I add the "WFI" state, it reports the error below; if I remove the "WFI" state, the error goes away. The "WFI" state should not be ignored, so I will check idlestat's source code.
Error: parse_energy_model: too many C states specified for cluster in energy_model_hikey
can't parse energy model file
Profiling: performance
sysbench --test=cpu --num-threads=1 --max-time=10 run
rt-app performance is calculated with the formula below:

  task performance = slack / (c_period - c_run) * 1024
            mainline (ndm)  noeas (ndm)  eas (ndm)  eas (sched)
            prf             prf          prf        prf
sysbench    100             100          100         92
rt-app  6%  662             665          393        615
rt-app 13%  648             645          465        394
rt-app 19%  610             648          479         57
rt-app 25%  649             664          306        518
rt-app 31%  600             585          596        366
rt-app 38%  576             584          259       -166
rt-app 44%  466             487           30       -349
rt-app 50%  583             602          598        612
Seeing these performance numbers, have you calibrated your json files for your hikey board?
ARM TC2 example, calibrated against A15:
# cat wl_test.json | grep calibration
"calibration": 141,
No, still use "CPU0" for calibration. So below are my rt-app-6.json file, could you check if there still have other things i missed?
Sorry, I gave the wrong json file. Please see the code below, which is what I used for profiling the rt-app 6% case:
{ "tasks": { "thread0": { "instance": 8, "loop": -1, "run": 120, "sleep": 0, "timer": { "ref": "unique", "period": 2000 } } }, "global": { "duration": 20, "calibration": "CPU0", "default_policy": "SCHED_OTHER", "pi_enabled": false, "lock_pages": false, "logdir": "./", "log_basename": "rt-app-6", "gnuplot": true } }
Summary
- After applying the two extra patches, the profiling results are consistent and stable for EAS (ndm) and EAS (sched). Tasks are placed onto the first cluster for LITTLE.LITTLE, so EAS (ndm) and EAS (sched) give much
Shouldn't we call it an SMP system instead of LITTLE.LITTLE?
From the CPU topology perspective, LITTLE.LITTLE is somewhat different from an SMP system with only one cluster. Later I will directly use "SMP".
Thanks, Leo Yan
On 18/09/15 17:10, Leo Yan wrote:
[...]
No, still use "CPU0" for calibration. So below are my rt-app-6.json file, could you check if there still have other things i missed?
Sorry, I gave the wrong json file. Please see the code below, which is what I used for profiling the rt-app 6% case:
{ "tasks": { "thread0": { "instance": 8, "loop": -1, "run": 120, "sleep": 0, "timer": { "ref": "unique", "period": 2000 } } }, "global": { "duration": 20, "calibration": "CPU0", "default_policy": "SCHED_OTHER", "pi_enabled": false, "lock_pages": false, "logdir": "./", "log_basename": "rt-app-6", "gnuplot": true } }
I assume here you run the tests with '"calibration": "CPU0"' ...
This is what you have to use in the calibration step prior to the actual tests. We always do this using the highest OPP on the big cpu on TC2 or JUNO so a 50% periodic task has run/period = 0.5 on a big cpu on the highest OPP.
This will give you a line like:
[rt-app] <notice> pLoad = 334ns : calib_cpu 0
Then you have to replace "CPU0" with 334 in your json files for the actual test step.
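For instance, the "global" section of the rt-app json above would then look like this (a sketch reusing the example pLoad value of 334 from the calibration output; your board will report its own value):

  "global": {
      "duration": 20,
      "calibration": 334,
      "default_policy": "SCHED_OTHER",
      "pi_enabled": false,
      "lock_pages": false,
      "logdir": "./",
      "log_basename": "rt-app-6",
      "gnuplot": true
  }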
[...]