On Mon, 15 May 2017 13:21:09 -0500 Sebastian Pop s.pop@samsung.com wrote:
On 05/15/2017 12:15 PM, Kim Phillips wrote:
I see this gcov generated with the intel-pt trace data:
$ dump_gcov ./sort-O3.gcov -gcov_version=1
<snip>
11: bubble_sort total:101485
2: 10
4: 28064
5: 27985
7: 8681
8: 8681
9: 28064
<snip>
which makes an improvement:
$ taskset -c 2 ./sort-O3 30000
Bubble sorting array of 30000 elements 1452 ms
$ taskset -c 2 ./sort-autofdo 30000
Bubble sorting array of 30000 elements 1356 ms
but the aarch64 version does not - in fact it makes things worse:
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements 5302 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements 6484 ms
This may be a problem in tuning of the aarch64 optimizations in the compiler.
The arch-independent gcov file merely suggests which paths are more or less likely, and I've seen the aarch64 compiler backend honour a likely branch when given a coverage file derived from software instrumentation. On juno r2:
$ gcc -O3 -o sort-O3 sort.c
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements 5303 ms
$ gcc -O3 -o sort-O3-profgen -fprofile-generate sort.c
$ taskset -c 2 ./sort-O3-profgen 3000
Bubble sorting array of 3000 elements 44 ms
$ gcc -O3 -o sort-O3-profopt -fprofile-use sort.c
$ taskset -c 2 ./sort-O3-profopt
Bubble sorting array of 30000 elements 3884 ms
so it's not that the aarch64 backend isn't doing the right thing wrt branch optimization.
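To put the timings quoted so far side by side, here is a minimal sketch that converts each before/after pair into a relative runtime change. The helper name `relative_change` and the labels are illustrative only; the millisecond values are copied from the runs in this thread.

```python
def relative_change(before_ms, after_ms):
    """Return the runtime change in percent; negative means faster."""
    return (after_ms - before_ms) / before_ms * 100.0


# Timings copied from this thread (milliseconds, 30000 elements).
RUNS = {
    "x86_64 AutoFDO": (1452, 1356),
    "aarch64 AutoFDO": (5302, 6484),
    "aarch64 -fprofile-use": (5303, 3884),
}

for label, (before, after) in RUNS.items():
    print(f"{label}: {relative_change(before, after):+.1f}%")
# x86_64 AutoFDO: -6.6%
# aarch64 AutoFDO: +22.3%
# aarch64 -fprofile-use: -26.8%
```

On these numbers the instrumented profile gives aarch64 a larger gain than AutoFDO gives x86_64, while the AutoFDO-built aarch64 binary is the only one that gets slower.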
Here are the instructions I used to make the aarch64 version:
x86host$ perf inject -i perf.data -o inj.data --itrace=i100usl --strip
x86host$ create_gcov --binary=sort-O3 --profile=inj.data --gcov=sort-O3.gcov -gcov_version=1
x86host$ aarch64-linux-gnu-gcc -O3 -fauto-profile=./sort-O3.gcov ./sort.c -o ./sort-O3-autofdo
and here is the equivalent snippet of the gcov dump:
x86host$ dump_gcov sort-O3.gcov -gcov_version=1
<snip>
11: bubble_sort total:11570765
2: 2090
4: 2089
5: 0
7: 5436878
9: 6129708
<snip>
I wonder why the profile you get on aarch64 looks so different from the x86_64 one. Could you please use the x86_64 profile when optimizing with autoFDO for aarch64? Also could you please use the aarch64 profile to optimize the x86_64 program, and report the execution time numbers?
I have to pack my juno now, so I'll have to get to that later, but I don't think it'll tell us anything new: the x86 gcov is good whereas the arm one is bad, and they are arch-independent. Your results (from the documentation) also suggest the compiler(s) is/are working properly.
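One way to make the difference between the two dumps concrete is to normalise each count by the function total. The sketch below is illustrative only: it assumes each `N: count` pair in the `dump_gcov` output is a source-line offset followed by its sample count, and the helper names `parse_function_profile` and `line_shares` are made up for this example. The two dump strings are copied from the snippets quoted in this thread.

```python
import re

# bubble_sort entries copied from the two dump_gcov snippets in this thread.
X86_DUMP = "11: bubble_sort total:101485 2: 10 4: 28064 5: 27985 7: 8681 8: 8681 9: 28064"
AARCH64_DUMP = "11: bubble_sort total:11570765 2: 2090 4: 2089 5: 0 7: 5436878 9: 6129708"


def parse_function_profile(text):
    """Split one function entry into (name, total, {offset: count})."""
    header = re.search(r"(\d+):\s*(\w+)\s+total:(\d+)", text)
    if header is None:
        raise ValueError("no function header found")
    name, total = header.group(2), int(header.group(3))
    # Everything after the header is assumed to be 'offset: count' pairs.
    pairs = re.findall(r"(\d+):\s*(\d+)", text[header.end():])
    return name, total, {int(off): int(cnt) for off, cnt in pairs}


def line_shares(total, counts):
    """Return each offset's share of the function total."""
    return {off: cnt / total for off, cnt in sorted(counts.items())}


for label, dump in (("x86_64", X86_DUMP), ("aarch64", AARCH64_DUMP)):
    name, total, counts = parse_function_profile(dump)
    shares = ", ".join(f"{off}: {s:.1%}" for off, s in line_shares(total, counts).items())
    print(f"{label} {name}: {shares}")
```

In both dumps the per-line counts add up exactly to the reported total, so the shares are directly comparable. The x86_64 samples are spread over offsets 4, 5, 7, 8 and 9, whereas the aarch64 samples fall almost entirely on offsets 7 and 9, with offset 5 at zero and offset 8 missing.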
Also what version of gcc are you using?
native target/juno: gcc (Debian 6.3.0-14) 6.3.0 20170415
host/x86: aarch64-linux-gnu-gcc (Ubuntu 6.2.0-5ubuntu12) 6.2.0 20161005
Thanks,
Kim