I have also seen a slowdown when the ETM trace is generated while sorting 3000 elements and that profile is then used to optimize the sort of 30000 elements. I do see some speedup when perf.data is generated with the same number of elements as the baseline:
+ gcc -O3 -g3 sort.c -o sort_30k-base -DARRAY_LEN=30000
+ export LD_LIBRARY_PATH=/root/etm/decoder-may/decoder/tests/bin/linux-arm64/dbg
+ LD_LIBRARY_PATH=/root/etm/decoder-may/decoder/tests/bin/linux-arm64/dbg
+ /root/etm/OpenCSD-may/tools/perf/perf record -e cs_etm/@20070000.etr/u --per-thread ./sort_30k-base
Bubble sorting array of 30000 elements
8686 ms
[ perf record: Woken up 10 times to write data ]
Warning: AUX data lost 7 times out of 14!
[ perf record: Captured and wrote 22.132 MB perf.data ]
+ /root/etm/OpenCSD-may/tools/perf/perf inject -f -i perf.data -o inj --itrace=i100usl --strip
+ create_gcov --binary=./sort_30k-base --profile=inj --gcov=sort_30k-base.gcov -gcov_version=1
+ dump_gcov sort_30k-base.gcov -gcov_version=1
sort_array total:44830 head:1
  0: 1
  3.3: 4981
  4.3: 4981
  7: 4981
  1: printf total:4981
    2: 4981
  6: bubble_sort total:24905
    2: 4981
    4: 4981
    5: 4981
    7: 4981
    9: 4981
+ gcc -O3 -fauto-profile=sort_30k-base.gcov sort.c -o sort_autofdo1 -DARRAY_LEN=30000
+ taskset -c 2 ./sort_autofdo1
Bubble sorting array of 30000 elements
5687 ms
# taskset -c 2 ./sort_30k-base
Bubble sorting array of 30000 elements
5910 ms
# taskset -c 2 ./sort_30k-base
Bubble sorting array of 30000 elements
5911 ms
# taskset -c 2 ./sort_autofdo1
Bubble sorting array of 30000 elements
5686 ms
# taskset -c 2 ./sort_autofdo1
Bubble sorting array of 30000 elements
5685 ms
The output of dump_gcov looks wrong: every sampled line has the same count (4981). The resulting speedup is also far from what the compiler-generated profile gives:
# gcc sort.c -O3 -g -fprofile-generate -o sort_30k-gen-profile -DARRAY_LEN=30000
# taskset -c 2 ./sort_30k-gen-profile
Bubble sorting array of 30000 elements
5877 ms
# gcc sort.c -O3 -g -fprofile-use -o sort_30k-profile-use -DARRAY_LEN=30000
# taskset -c 2 ./sort_30k-profile-use
Bubble sorting array of 30000 elements
4332 ms
# taskset -c 2 ./sort_30k-profile-use
Bubble sorting array of 30000 elements
4328 ms
# taskset -c 2 ./sort_30k-profile-use
Bubble sorting array of 30000 elements
4331 ms
It looks like there is a problem with the profile we generate from the ETM trace: the compiler can generate much better code when it is given a correct profile. A profile from x86 also gives better performance, although it does not match the compiler-generated profile:
# taskset -c 2 ./sort_autofdo_x86
Bubble sorting array of 30000 elements
4971 ms
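To put the numbers above side by side, here is a small sketch that averages the quoted timings and expresses each build as a percentage improvement over the -O3 baseline. The timings are copied from the runs in this message; the labels, the helper name speedup_percent and the choice of a simple mean are my own assumptions, and the 8686 ms run is left out because it was measured under perf record.

```python
from statistics import mean

# Wall-clock times in ms, copied from the runs quoted above
# (all of them pinned with "taskset -c 2"). Labels are illustrative only.
timings_ms = {
    "base (-O3)": [5910, 5911],
    "AutoFDO, ETM profile": [5687, 5686, 5685],
    "AutoFDO, x86 profile": [4971],
    "-fprofile-use": [4332, 4328, 4331],
}


def speedup_percent(baseline_ms, candidate_ms):
    """Time saved by the candidate, as a percentage of the baseline time."""
    return 100.0 * (baseline_ms - candidate_ms) / baseline_ms


baseline = mean(timings_ms["base (-O3)"])

for name, runs in timings_ms.items():
    avg = mean(runs)
    gain = speedup_percent(baseline, avg)
    print(f"{name:22s} {avg:8.1f} ms  {gain:5.1f}% faster than base")
```

With these inputs the ETM-derived profile comes out at roughly 4% faster than the baseline, the x86 profile at roughly 16%, and the instrumented -fprofile-use build at roughly 27%. That gap is what makes the ETM profile look suspect.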