On 05/15/2017 12:15 PM, Kim Phillips wrote:
I see this gcov generated with the intel-pt trace data:
$ dump_gcov ./sort-O3.gcov -gcov_version=1
<snip>
  11: bubble_sort total:101485
    2: 10
    4: 28064
    5: 27985
    7: 8681
    8: 8681
    9: 28064
<snip>
which makes an improvement:
$ taskset -c 2 ./sort-O3 30000
Bubble sorting array of 30000 elements 1452 ms
$ taskset -c 2 ./sort-autofdo 30000
Bubble sorting array of 30000 elements 1356 ms
but the aarch64 version does not - in fact it makes things worse:
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements 5302 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements 6484 ms
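To put the two sets of timings on the same scale, here is a small sketch that computes the relative change in run time from the numbers quoted above. The function name relative_change is only illustrative, and each figure comes from a single run, so the percentages are rough.

```python
def relative_change(baseline_ms, autofdo_ms):
    """Fractional change in run time; negative means the AutoFDO build is faster."""
    return (autofdo_ms - baseline_ms) / baseline_ms

# Timings copied from the runs quoted above (one run each).
x86_change = relative_change(1452, 1356)
aarch64_change = relative_change(5302, 6484)

print(f"x86_64:  {x86_change:+.1%}")
print(f"aarch64: {aarch64_change:+.1%}")
```

With these inputs the x86_64 build comes out about 6.6% faster and the aarch64 build about 22.3% slower.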
This may be a problem with how the aarch64 optimizations are tuned in the compiler.
Here are the instructions I used to make the aarch64 version:
x86host$ perf inject -i perf.data -o inj.data --itrace=i100usl --strip
x86host$ create_gcov --binary=sort-O3 --profile=inj.data --gcov=sort-O3.gcov -gcov_version=1
x86host$ aarch64-linux-gnu-gcc -O3 -fauto-profile=./sort-O3.gcov ./sort.c -o ./sort-O3-autofdo
and here is the equivalent snippet of the gcov dump:
x86host$ dump_gcov sort-O3.gcov -gcov_version=1
<snip>
  11: bubble_sort total:11570765
    2: 2090
    4: 2089
    5: 0
    7: 5436878
    9: 6129708
<snip>
I wonder why the profile you get on aarch64 looks so different from the x86_64 one. Could you please use the x86_64 profile when optimizing with autoFDO for aarch64? Could you also use the aarch64 profile to optimize the x86_64 program, and report the execution times for both?
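One way to make the difference concrete is to normalize each dump so that every entry is a fraction of the bubble_sort total. The sketch below does that with the counts copied from the two snippets above. It assumes each "N: count" pair is a source line offset followed by its sample count, which is my reading of the dump_gcov output, and the helper name shares is only illustrative.

```python
# Counts copied from the two dump_gcov snippets in this thread.
# Assumption: each entry maps a line offset in bubble_sort to a sample count.
x86_64 = {2: 10, 4: 28064, 5: 27985, 7: 8681, 8: 8681, 9: 28064}
aarch64 = {2: 2090, 4: 2089, 5: 0, 7: 5436878, 9: 6129708}

def shares(profile):
    """Return each line offset's fraction of the profile's total count."""
    total = sum(profile.values())
    return {line: count / total for line, count in profile.items()}

x86_share = shares(x86_64)
arm_share = shares(aarch64)

print("line  x86_64  aarch64")
for line in sorted(set(x86_64) | set(aarch64)):
    # Offsets missing from one dump (offset 8 on aarch64) are shown as 0.
    x = x86_share.get(line, 0.0)
    a = arm_share.get(line, 0.0)
    print(f"{line:>4}  {x:6.1%}  {a:7.1%}")
```

Normalized this way, the x86_64 profile puts roughly 28% of its weight on each of offsets 4, 5 and 9, while the aarch64 profile puts about 47% on offset 7 and 53% on offset 9, with almost nothing on offsets 4 and 5.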
Also, what version of gcc are you using?
Thanks,
Sebastian