On Fri, 26 May 2017 14:12:21 +0100 Mike Leach wrote:
Hi,
I tried out Sebastian's patches and got results broadly similar to Kim's, but with a couple of differences, and some interesting findings if you look at the disassembly of the resulting routines.
So as per the AutoFDO instructions I built the sort program with no optimisations and with debug info:

$ gcc -g sort.c -o sort

This I profiled on Juno with 3000 iterations.
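For reference, the capture step would be something along the lines of the following - note the CoreSight sink name (the on-board ETR here) is an assumption specific to a Juno setup, so adjust it to match your board:

$ perf record -e cs_etm/@20070000.etr/u --per-thread ./sort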
The resulting disassembly of the bubble_sort routine is in bubble-sorts-disass.txt and the dump gcov profile is below...
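(For clarity, the profile was produced and dumped with the same inject / create_gcov / dump_gcov steps Sebastian and Kim describe further down - roughly the following, with file names purely illustrative:)

$ perf inject -i perf.data -o inj.data --itrace=il64 --strip
$ ~/git/autofdo/create_gcov --binary=./sort --profile=inj.data --gcov=sort.gcov -gcov_version=1
$ ~/git/autofdo/dump_gcov -gcov_version=1 sort.gcov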
bubble_sort total:33987051 head:0
  0: 0
  1: 0
  2: 2839
  3: 2839
  4: 2839
  4.1: 8522673
  4.2: 8519834
  5: 8517035
  6: 2104748
  7: 2104748
  8: 2104748
  9: 2104748
  13: 0
So in my view the swap lines (6:-9:) - see attached sort.c - are run fewer times than the enclosing loop (2:-4:, 4.1:-5:), which is what Kim observed with the Intel version. The synthesized LBR records looked reasonable when compared against the disassembly too.
Trying out the -O3 and -O3-autofdo builds from this profile: plain -O3 ran marginally faster than the AutoFDO build, but both were faster than the unoptimised debug build.
So now look at the disassemblies of the -O3 and -O3-autofdo versions of the sort routine [bubble-sorts-disass.txt again]. Both appear to define a bubble_sort routine, but embed the same or similar code into sort_array. Unsurprisingly the -O3 version is considerably more compact - hence it runs faster. I have no idea what the autofdo version is up to, but I cannot see how the massive expansion of the routine with compare and jump tables is going to help.
So perhaps:
1) The LBR stacks are still not correct - though code and dump inspection might suggest otherwise. Are there features in the Intel LBR we are not yet synthesizing?
2) There is some adverse interaction between the profiles we are generating and the AutoFDO code generation.
3) The amount of coverage in the file is affecting the process - looking at the gcov above, we only have trace for the bubble_sort routine. I did reduce the number of iterations to get more of the program captured in coverage, but this did not seem to make a difference.

Mike
Apologies for the delay in replying to this.
Some further thoughts on this.
1) This is not an apples-to-apples comparison. The baseline code will most likely have different optimizations applied for x86-64, which will give rise to different code paths and so different profiles. Also, is someone here able to comment on the extent to which the optimizations applied by the "autofdo-O3" compilation are machine independent?
I assume that the work done to create that flow has been done on an x86 version of the compiler, and it might be that regressions exist in the A64 compiler that do not exist in x86: I don't know. For example, the unrolling done for the sort.c example might not be a suitable optimization for the target CPU.
This isn't a real-world code example. Bubble sort is sorting random data, so at its heart is an unpredictable compare-and-swap check and a small inner loop. The unrolled code, on the other hand, contains many unpredictable branches. It would be better to reproduce this experiment, if not on real-world code, then at least on a more sensible benchmark.
2) AIUI, "perf inject --itrace" on the ETM uses systematic block-based sampling to break the trace into LBR records. (That is, after N trace block records it creates a sample with an LBR attached, where a trace block represents a sequence of instructions between two waypoints.) E.g. "perf inject --itrace=il64"
Conversely, also AIUI, the reference method for doing this with Intel PT samples based on a reconstructed view of time. (That is, every N reconstructed clock periods, it creates a sample with an LBR attached.) E.g. "perf inject --itrace=i100usle".
Time-based sampling will generate more samples from code hot spots, where a hot spot is defined as where *time* is spent in the program. The ETM flow will also favour hot spots, obviously, because these will appear more in the trace. However, because the sampling is not time-based, each *range* is as likely to be sampled as any other range.
E.g. if there is a short code sequence that executes in 10 clock periods and a long sequence that executes in 100 clock periods, and both appear equally often in the code, then using time-based sampling the former will appear 10x less often than the latter, but using systematic block-based sampling they appear at the same rate.
Furthermore, from a cursory look at the Intel PT code, it looks to me like the Intel PT perf driver walks through each block, instruction by instruction. If I understand this correctly, that means that even if sampling were systematic and instruction-based rather than time-based (e.g. would "--itrace=i64i" do this on PT?), the population being sampled would be instructions rather than blocks, and again wouldn't match what cs-etm.c is doing.
E.g. if the short code sequence is 10 instructions and the long sequence is 100 instructions, then with systematic instruction-based sampling the former block will appear 10x less often in the samples, whereas with systematic block-based sampling they appear at the same rate.
One could hack the Intel PT inject tool to implement the same kind of block-based sampling, and see what effect this has (assuming there is a good reason why the ETM inject doesn't implement the time-based sampling -- I've not investigated this). If you have such a sample you can also use the profile_diff tool from AutoFDO to compare the shape of the samples.
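Something along these lines, I would guess - I haven't checked the exact profile_diff options, so treat the flag names below as assumptions and check the tool's --help output:

$ ~/git/autofdo/profile_diff --profile1=sort-pt.gcov --profile2=sort-etm.gcov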
Now, the extent to which this affects the compiler I do not know. E.g. both sampling schemes are OK for telling a compiler which branches are taken, but if the compiler thinks the samples are time-based and so represent code hotspots, then systematic block-based sampling would be misleading.
Mike.
On 25 May 2017 at 05:12, Kim Phillips <kim.phillips at arm.com> wrote:
On Wed, 24 May 2017 12:48:04 -0500 Sebastian Pop <sebpop at gmail.com> wrote:
On Wed, May 24, 2017 at 11:36 AM, Mathieu Poirier <mathieu.poirier at linaro.org> wrote:
Are the instructions in the AutoFDO section of the HOWTO.md on GitHub sufficient to test this, or is there another way?
Here is how I tested it (supposing that perf.data contains an ETM trace):
# perf inject -i perf.data -o inj --itrace=il64 --strip
# perf report -i inj -D &> dump
and I compared the addresses from the last branch stack in the output dump with the addresses of the disassembled program from:
# objdump -d sort
Re-running the AutoFDO process with these two patches still produces an executable that performs worse, however:
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5306 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5304 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5851 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5889 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5888 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5318 ms
The gcov file generated from the inj.data (no matter whether it's --itrace=il64 or --itrace=i100usle) still looks wrong:
$ ~/git/autofdo/dump_gcov -gcov_version=1 sort-O3.gcov
sort_array total:19309128 head:0
  0: 0
  1: 0
  5: 0
  6: 0
  7.1: 0
  7.3: 0
  8.3: 0
  15: 2
  16: 2
  17: 2
  10: start total:0
    1: 0
  11: bubble_sort total:19309119
    2: 1566
    4: 6266668
    5: 6071341
    7: 6266668
    9: 702876
  12: stop total:3
    2: 0
    3: 1
    4: 1
    5: 1
main total:1 head:0
  0: 0
  2: 0
  4: 1
  1: cmd_line total:0
    3: 0
    4: 0
    5: 0
    6: 0
Whereas the one generated by the Intel PT run looks correct, showing the swap (lines 7: and 8: under 11: bubble_sort) as executed fewer times:
kim at juno sort-etm$ ~/git/autofdo/dump_gcov -gcov_version=1 ../sort-O3.gcov
sort_array total:105658 head:0
  0: 0
  5: 0
  6: 0
  7.1: 0
  7.3: 0
  8.3: 0
  16: 0
  17: 0
  1: printf total:0
    2: 0
  10: start total:0
    1: 0
  11: bubble_sort total:105658
    2: 14
    4: 28740
    5: 28628
    7: 9768
    8: 9768
    9: 28740
  12: stop total:0
    2: 0
    3: 0
    4: 0
    5: printf total:0
      2: 0
  15: printf total:0
    2: 0
I have to run the 'perf inject' on the x86 host because of the aforementioned:
0x350 [0x50]: failed to process type: 1
problem when trying to run it natively on the aarch64 target.
However, it doesn't matter whether I run the create_gcov - like so btw:
~/git/autofdo/create_gcov --binary=sort-O3 --profile=inj.data --gcov=sort-O3.gcov -gcov_version=1
on the x86 host or the aarch64 target: I still get the same (negative performance) results.
As Sebastian asked, if I take the .gcov generated from the Intel PT-sourced inject onto the target and rebuild sort, the performance improves:
$ gcc -g -O3 -fauto-profile=../sort-O3.gcov ./sort.c -o ./sort-O3-autofdo
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5309 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5310 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
4443 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
4443 ms
And if I take the ETM-generated gcov and use that to build a new x86_64 binary, it indeed performs worse on x86_64 also:
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1502 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1500 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1501 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1907 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1893 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1907 ms
Kim
--
Mike Leach
Principal Engineer, ARM Ltd.
Blackburn Design Centre. UK
<snip>