Hi,
I tried out Sebastian's patches and got results broadly similar to Kim's, but with a couple of differences and some interesting findings if you look at the disassembly of the resulting routines.
As per the AutoFDO instructions, I built a sort program with no optimisation and with debug info:

gcc -g sort.c -o sort

I then profiled this on Juno with 3000 iterations.
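For reference, the trace capture used perf's CoreSight ETM support, something along these lines (the exact event spec and CPU pinning here are illustrative; on Juno the sink typically needs to be named with the @<sink> syntax, which is omitted here):

# perf record -e cs_etm//u --per-thread taskset -c 2 ./sort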
The resulting disassembly of the bubble_sort routine is in bubble-sorts-disass.txt, and the dumped gcov profile is below:

--------------------------------
bubble_sort total:33987051 head:0
  0: 0
  1: 0
  2: 2839
  3: 2839
  4: 2839
  4.1: 8522673
  4.2: 8519834
  5: 8517035
  6: 2104748
  7: 2104748
  8: 2104748
  9: 2104748
  13: 0
-------------------------------

So in my view the swap lines (6:-9:, see the attached sort.c) are run fewer times than the enclosing loop (2:-4:, 4.1:-5:), which is what Kim observed with the Intel version. The synthesized LBR records also looked reasonable when compared against the disassembly.
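(The gcov dump above was produced with AutoFDO's dump_gcov tool, something like the following; the exact gcov filename here is illustrative:

$ ~/git/autofdo/dump_gcov -gcov_version=1 ./sort.gcov
)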
Building the -O3 and -O3-autofdo versions from this profile resulted in the plain -O3 binary running marginally faster than the AutoFDO one, though both were faster than the unoptimised debug build.
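For clarity, the two binaries were built roughly as follows (the gcov filename is illustrative rather than the exact one used):

$ gcc -g -O3 ./sort.c -o ./sort-O3
$ gcc -g -O3 -fauto-profile=./sort.gcov ./sort.c -o ./sort-O3-autofdo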
Now look at the disassemblies of the -O3 and -O3-autofdo versions of the sort routine (bubble-sorts-disass.txt again). Both appear to define a bubble_sort routine, but also inline the same or similar code into sort_array. Unsurprisingly, the -O3 version is considerably more compact, hence it runs faster. I have no idea what the AutoFDO version is up to, but I cannot see how the massive expansion of the routine with compare and jump tables is going to help.
So perhaps:

1) The LBR stacks are still not correct, though code and dump inspection might suggest otherwise. Are there features in the Intel LBR that we are not yet synthesizing?
2) There is some adverse interaction between the profiles we are generating and the AutoFDO code generation.
3) The amount of coverage in the file is limiting the process: looking at the gcov above, we only have trace from the bubble_sort routine. I did reduce the number of iterations to get more of the program captured in coverage, but this did not seem to make a difference.

Mike
On 25 May 2017 at 05:12, Kim Phillips kim.phillips@arm.com wrote:
On Wed, 24 May 2017 12:48:04 -0500 Sebastian Pop sebpop@gmail.com wrote:
On Wed, May 24, 2017 at 11:36 AM, Mathieu Poirier mathieu.poirier@linaro.org wrote:
Are the instructions in the autoFDO section of the HOWTO.md on GitHub sufficient to test this, or is there another way?
Here is how I tested it: (supposing that perf.data contains an ETM trace)
# perf inject -i perf.data -o inj --itrace=il64 --strip
# perf report -i inj -D &> dump
and I checked the addresses from the last branch stacks in the output dump against the addresses of the disassembled program from:
# objdump -d sort
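A quick way to pull just the synthesized branch stacks out of that dump for the comparison, assuming perf's usual -D formatting, is something along the lines of:

# grep -A8 'branch stack' dump | less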
Re-running the AutoFDO process with these two patches continues to make the resulting executable perform worse, however:
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5306 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5304 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5851 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5889 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5888 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5318 ms
The gcov file generated from the inj.data (no matter whether it's --itrace=il64 or --itrace=i100usle) still looks wrong:
$ ~/git/autofdo/dump_gcov -gcov_version=1 sort-O3.gcov
sort_array total:19309128 head:0
  0: 0
  1: 0
  5: 0
  6: 0
  7.1: 0
  7.3: 0
  8.3: 0
  15: 2
  16: 2
  17: 2
  10: start total:0
    1: 0
  11: bubble_sort total:19309119
    2: 1566
    4: 6266668
    5: 6071341
    7: 6266668
    9: 702876
  12: stop total:3
    2: 0
    3: 1
    4: 1
    5: 1
main total:1 head:0
  0: 0
  2: 0
  4: 1
  1: cmd_line total:0
    3: 0
    4: 0
    5: 0
    6: 0
Whereas the one generated from the intel-pt run looks correct, showing the swap (11: bubble_sort 7, 8) as executed fewer times:
kim@juno sort-etm$ ~/git/autofdo/dump_gcov -gcov_version=1 ../sort-O3.gcov
sort_array total:105658 head:0
  0: 0
  5: 0
  6: 0
  7.1: 0
  7.3: 0
  8.3: 0
  16: 0
  17: 0
  1: printf total:0
    2: 0
  10: start total:0
    1: 0
  11: bubble_sort total:105658
    2: 14
    4: 28740
    5: 28628
    7: 9768
    8: 9768
    9: 28740
  12: stop total:0
    2: 0
    3: 0
    4: 0
  5: printf total:0
    2: 0
  15: printf total:0
    2: 0
I have to run the 'perf inject' on the x86 host because of the aforementioned:
0x350 [0x50]: failed to process type: 1
problem when trying to run it natively on the aarch64 target.
However, it doesn't matter whether I run create_gcov - like so, by the way:
~/git/autofdo/create_gcov --binary=sort-O3 --profile=inj.data --gcov=sort-O3.gcov -gcov_version=1
on the x86 host or the aarch64 target: I still get the same (negative performance) results.
As Sebastian asked, if I take the intel-pt-sourced, inject-generated .gcov onto the target and rebuild sort, the performance improves:
$ gcc -g -O3 -fauto-profile=../sort-O3.gcov ./sort.c -o ./sort-O3-autofdo
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5309 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5310 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
4443 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
4443 ms
And if I take the ETM-generated gcov and use that to build a new x86_64 binary, it indeed performs worse on x86_64 also:
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1502 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1500 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1501 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1907 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1893 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1907 ms
Kim
_______________________________________________
CoreSight mailing list
CoreSight@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/coresight