Hi,
I tried out Sebastian's patches and got results broadly similar to Kim's, but with a couple of differences and some interesting findings if you look at the disassembly of the resulting routines.
As per the AutoFDO instructions, I built a sort program with no optimisation and with debug info:

gcc -g sort.c -o sort

I then profiled this on Juno with 3000 iterations.
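For reference, the trace capture used perf's CoreSight ETM support, something along these lines (the exact event spec and CPU pinning here are illustrative; on Juno the sink typically needs to be named with the @<sink> syntax, which is omitted here):

# perf record -e cs_etm//u --per-thread taskset -c 2 ./sort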
The resulting disassembly of the bubble_sort routine is in bubble-sorts-disass.txt, and the dumped gcov profile is below:

--------------------------------
bubble_sort total:33987051 head:0
  0: 0
  1: 0
  2: 2839
  3: 2839
  4: 2839
  4.1: 8522673
  4.2: 8519834
  5: 8517035
  6: 2104748
  7: 2104748
  8: 2104748
  9: 2104748
  13: 0
-------------------------------

So in my view the swap lines (6:-9:, see the attached sort.c) are run fewer times than the enclosing loop (2:-4:, 4.1:-5:), which is what Kim observed with the Intel version. The synthesized LBR records also looked reasonable when compared against the disassembly.
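(The gcov dump above was produced with AutoFDO's dump_gcov tool, something like the following; the exact gcov filename here is illustrative:

$ ~/git/autofdo/dump_gcov -gcov_version=1 ./sort.gcov
)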
Building the -O3 and -O3-autofdo versions from this profile resulted in the plain -O3 binary running marginally faster than the AutoFDO one, though both were faster than the unoptimised debug build.
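For clarity, the two binaries were built roughly as follows (the gcov filename is illustrative rather than the exact one used):

$ gcc -g -O3 ./sort.c -o ./sort-O3
$ gcc -g -O3 -fauto-profile=./sort.gcov ./sort.c -o ./sort-O3-autofdo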
Now look at the disassemblies of the -O3 and -O3-autofdo versions of the sort routine (bubble-sorts-disass.txt again). Both appear to define a bubble_sort routine, but also inline the same or similar code into sort_array. Unsurprisingly, the -O3 version is considerably more compact, hence it runs faster. I have no idea what the AutoFDO version is up to, but I cannot see how the massive expansion of the routine with compare and jump tables is going to help.
So perhaps:

1) The LBR stacks are still not correct, though code and dump inspection might suggest otherwise. Are there features in the Intel LBR that we are not yet synthesizing?
2) There is some adverse interaction between the profiles we are generating and the AutoFDO code generation.
3) The amount of coverage in the file is limiting the process: looking at the gcov above, we only have trace from the bubble_sort routine. I did reduce the number of iterations to get more of the program captured in coverage, but this did not seem to make a difference.

Mike
On 25 May 2017 at 05:12, Kim Phillips kim.phillips@arm.com wrote:
On Wed, 24 May 2017 12:48:04 -0500 Sebastian Pop sebpop@gmail.com wrote:
On Wed, May 24, 2017 at 11:36 AM, Mathieu Poirier mathieu.poirier@linaro.org wrote:
Are the instructions in the autoFDO section of the HOWTO.md on GitHub sufficient to test this, or is there another way?
Here is how I tested it: (supposing that perf.data contains an ETM trace)
# perf inject -i perf.data -o inj --itrace=il64 --strip
# perf report -i inj -D &> dump
and I checked the addresses from the last branch stacks in the output dump against the addresses of the disassembled program from:
# objdump -d sort
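A quick way to pull just the synthesized branch stacks out of that dump for the comparison, assuming perf's usual -D formatting, is something along the lines of:

# grep -A8 'branch stack' dump | less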
Re-running the AutoFDO process with these two patches continues to make the resulting executable perform worse, however:
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5306 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5304 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5851 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5889 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5888 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5318 ms
The gcov file generated from the inj.data (no matter whether it's --itrace=il64 or --itrace=i100usle) still looks wrong:
$ ~/git/autofdo/dump_gcov -gcov_version=1 sort-O3.gcov
sort_array total:19309128 head:0
  0: 0
  1: 0
  5: 0
  6: 0
  7.1: 0
  7.3: 0
  8.3: 0
  15: 2
  16: 2
  17: 2
  10: start total:0
    1: 0
  11: bubble_sort total:19309119
    2: 1566
    4: 6266668
    5: 6071341
    7: 6266668
    9: 702876
  12: stop total:3
    2: 0
    3: 1
    4: 1
    5: 1
main total:1 head:0
  0: 0
  2: 0
  4: 1
  1: cmd_line total:0
    3: 0
    4: 0
    5: 0
    6: 0
Whereas the one generated from the intel-pt run looks correct, showing the swap (11: bubble_sort 7, 8) as executed fewer times:
kim@juno sort-etm$ ~/git/autofdo/dump_gcov -gcov_version=1 ../sort-O3.gcov
sort_array total:105658 head:0
  0: 0
  5: 0
  6: 0
  7.1: 0
  7.3: 0
  8.3: 0
  16: 0
  17: 0
  1: printf total:0
    2: 0
  10: start total:0
    1: 0
  11: bubble_sort total:105658
    2: 14
    4: 28740
    5: 28628
    7: 9768
    8: 9768
    9: 28740
  12: stop total:0
    2: 0
    3: 0
    4: 0
  5: printf total:0
    2: 0
  15: printf total:0
    2: 0
I have to run the 'perf inject' on the x86 host because of the aforementioned:
0x350 [0x50]: failed to process type: 1
problem when trying to run it natively on the aarch64 target.
However, it doesn't matter whether I run create_gcov - like so, by the way:
~/git/autofdo/create_gcov --binary=sort-O3 --profile=inj.data --gcov=sort-O3.gcov -gcov_version=1
on the x86 host or the aarch64 target: I still get the same (negative performance) results.
As Sebastian asked, if I take the intel-pt-sourced, inject-generated .gcov onto the target and rebuild sort, the performance improves:
$ gcc -g -O3 -fauto-profile=../sort-O3.gcov ./sort.c -o ./sort-O3-autofdo
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5309 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5310 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
4443 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
4443 ms
And if I take the ETM-generated gcov and use that to build a new x86_64 binary, it indeed performs worse on x86_64 also:
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1502 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1500 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1501 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1907 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1893 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1907 ms
Kim
_______________________________________________
CoreSight mailing list
CoreSight@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/coresight