On Fri, 26 May 2017 14:12:21 +0100 Mike Leach wrote:
Hi,
I tried out Sebastian's patches and got results broadly similar to Kim's, but with a couple of differences, and some interesting findings if you look at the disassembly of the resulting routines.
So as per the AutoFDO instructions I built the sort program with no optimisations and with debug info:

$ gcc -g sort.c -o sort

This I profiled on Juno with 3000 iterations.
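For reference, the capture step would be something along the lines of the following - note the CoreSight sink name (the on-board ETR here) is an assumption specific to a Juno setup, so adjust it to match your board:

$ perf record -e cs_etm/@20070000.etr/u --per-thread ./sort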
The resulting disassembly of the bubble_sort routine is in bubble-sorts-disass.txt and the dump gcov profile is below...
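(For clarity, the profile was produced and dumped with the same inject / create_gcov / dump_gcov steps Sebastian and Kim describe further down - roughly the following, with file names purely illustrative:)

$ perf inject -i perf.data -o inj.data --itrace=il64 --strip
$ ~/git/autofdo/create_gcov --binary=./sort --profile=inj.data --gcov=sort.gcov -gcov_version=1
$ ~/git/autofdo/dump_gcov -gcov_version=1 sort.gcov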
bubble_sort total:33987051 head:0
  0: 0
  1: 0
  2: 2839
  3: 2839
  4: 2839
  4.1: 8522673
  4.2: 8519834
  5: 8517035
  6: 2104748
  7: 2104748
  8: 2104748
  9: 2104748
  13: 0
So in my view the swap lines (6:-9:) - see attached sort.c - are run fewer times than the enclosing loop (2:-4:, 4.1:-5:), which is what Kim observed with the Intel version. The synthesized LBR records looked reasonable when compared against the disassembly too.
Trying out the -O3 and -O3-autofdo builds from this profile: plain -O3 ran marginally faster than the AutoFDO build, but both were faster than the unoptimised debug build.
So now look at the disassemblies of the -O3 and -O3-autofdo versions of the sort routine [bubble-sorts-disass.txt again]. Both appear to define a bubble_sort routine, but embed the same or similar code into sort_array. Unsurprisingly the -O3 version is considerably more compact - hence it runs faster. I have no idea what the autofdo version is up to, but I cannot see how the massive expansion of the routine with compare and jump tables is going to help.
So perhaps:
1) The LBR stacks are still not correct - though code and dump inspection might suggest otherwise. Are there features in the Intel LBR we are not yet synthesizing?
2) There is some adverse interaction between the profiles we are generating and the AutoFDO code generation.
3) The amount of coverage in the file is affecting the process - looking at the gcov above, we only have trace for the bubble_sort routine. I did reduce the number of iterations to get more of the program captured in coverage, but this did not seem to make a difference.

Mike
Apologies for the delay in replying to this.
Some further thoughts on this.
1) This is not an apples-to-apples comparison. The baseline code will most likely have different optimizations applied for x86-64, which will give rise to different code paths and so different profiles. Also, is someone here able to comment on the extent to which the optimizations applied by the "autofdo-O3" compilation are machine independent?
I assume that the work done to create that flow has been done on an x86 version of the compiler, and it might be that regressions exist in the A64 compiler that do not exist in x86: I don't know. For example, the unrolling done for the sort.c example might not be a suitable optimization for the target CPU.
This isn't a real-world code example. Bubble sort is sorting random data, so at its heart is an unpredictable compare-and-swap check and a small inner loop. The unrolled code, on the other hand, contains many unpredictable branches. It would be better to reproduce this experiment, if not on real-world code, then at least on a more sensible benchmark.
2) AIUI, "perf inject --itrace" on the ETM uses systematic block-based sampling to break the trace into LBR records. (That is, after N trace block records it creates a sample with an LBR attached, where a trace block represents a sequence of instructions between two waypoints.) E.g. "perf inject --itrace=il64"
Conversely, also AIUI, the reference method for doing this with Intel PT samples based on a reconstructed view of time. (That is, every N reconstructed clock periods, it creates a sample with an LBR attached.) E.g. "perf inject --itrace=i100usle".
Time-based sampling will generate more samples from code hot spots, where a hot spot is defined as where *time* is spent in the program. The ETM flow will also favour hot spots, obviously, because these will appear more in the trace. However, because the sampling is not time-based, each *range* is as likely to be sampled as any other range.
E.g. if there is a short code sequence that executes in 10 clock periods and a long sequence that executes in 100 clock periods, and both appear equally often in the code, then using time-based sampling the former will appear 10x less often than the latter, but using systematic block-based sampling they appear at the same rate.
Furthermore, from a cursory look at the Intel PT code, it looks to me like the Intel PT perf driver walks through each block, instruction by instruction. If I understand this correctly, that means that even if sampling were systematic and instruction-based rather than time-based (e.g. would "--itrace=i64i" do this on PT?), the population being sampled would be instructions rather than blocks, and again wouldn't match what cs-etm.c is doing.
E.g. if the short code sequence is 10 instructions and the long sequence is 100 instructions, then with systematic instruction-based sampling the former block will appear 10x less often in the samples, whereas with systematic block-based sampling they appear at the same rate.
One could hack the Intel PT inject tool to implement the same kind of block-based sampling, and see what effect this has (assuming there is a good reason why the ETM inject doesn't implement the time-based sampling -- I've not investigated this). If you have such a sample you can also use the profile_diff tool from AutoFDO to compare the shape of the samples.
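Something along these lines, I would guess - I haven't checked the exact profile_diff options, so treat the flag names below as assumptions and check the tool's --help output:

$ ~/git/autofdo/profile_diff --profile1=sort-pt.gcov --profile2=sort-etm.gcov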
Now, the extent to which this affects the compiler I do not know. E.g. both sampling schemes are OK for telling a compiler which branches are taken, but if the compiler thinks the samples are time-based and so represent code hotspots, then systematic block-based sampling would be misleading.
Mike.
On 25 May 2017 at 05:12, Kim Phillips <kim.phillips at arm.com> wrote:
On Wed, 24 May 2017 12:48:04 -0500 Sebastian Pop <sebpop at gmail.com> wrote:
On Wed, May 24, 2017 at 11:36 AM, Mathieu Poirier <mathieu.poirier at linaro.org> wrote:
Are the instructions in the AutoFDO section of the HOWTO.md on GitHub sufficient to test this, or is there another way?
Here is how I tested it (supposing that perf.data contains an ETM trace):
# perf inject -i perf.data -o inj --itrace=il64 --strip
# perf report -i inj -D &> dump
and I compared the addresses from the last branch stack in the output dump with the addresses of the disassembled program from:
# objdump -d sort
Re-running the AutoFDO process with these two patches still produces an executable that performs worse, however:
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5306 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5304 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5851 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5889 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
5888 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5318 ms
The gcov file generated from the inj.data (no matter whether it's --itrace=il64 or --itrace=i100usle) still looks wrong:
$ ~/git/autofdo/dump_gcov -gcov_version=1 sort-O3.gcov
sort_array total:19309128 head:0
  0: 0
  1: 0
  5: 0
  6: 0
  7.1: 0
  7.3: 0
  8.3: 0
  15: 2
  16: 2
  17: 2
  10: start total:0
    1: 0
  11: bubble_sort total:19309119
    2: 1566
    4: 6266668
    5: 6071341
    7: 6266668
    9: 702876
  12: stop total:3
    2: 0
    3: 1
    4: 1
    5: 1
main total:1 head:0
  0: 0
  2: 0
  4: 1
  1: cmd_line total:0
    3: 0
    4: 0
    5: 0
    6: 0
Whereas the one generated by the Intel PT run looks correct, showing the swap (lines 7: and 8: under 11: bubble_sort) as executed fewer times:
kim at juno sort-etm$ ~/git/autofdo/dump_gcov -gcov_version=1 ../sort-O3.gcov
sort_array total:105658 head:0
  0: 0
  5: 0
  6: 0
  7.1: 0
  7.3: 0
  8.3: 0
  16: 0
  17: 0
  1: printf total:0
    2: 0
  10: start total:0
    1: 0
  11: bubble_sort total:105658
    2: 14
    4: 28740
    5: 28628
    7: 9768
    8: 9768
    9: 28740
  12: stop total:0
    2: 0
    3: 0
    4: 0
    5: printf total:0
      2: 0
  15: printf total:0
    2: 0
I have to run the 'perf inject' on the x86 host because of the aforementioned:
0x350 [0x50]: failed to process type: 1
problem when trying to run it natively on the aarch64 target.
However, it doesn't matter whether I run the create_gcov - like so btw:
~/git/autofdo/create_gcov --binary=sort-O3 --profile=inj.data --gcov=sort-O3.gcov -gcov_version=1
on the x86 host or the aarch64 target: I still get the same (negative performance) results.
As Sebastian asked, if I take the .gcov generated from the Intel PT-sourced inject onto the target and rebuild sort, the performance improves:
$ gcc -g -O3 -fauto-profile=../sort-O3.gcov ./sort.c -o ./sort-O3-autofdo
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5309 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
5310 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
4443 ms
$ taskset -c 2 ./sort-O3-autofdo
Bubble sorting array of 30000 elements
4443 ms
And if I take the ETM-generated gcov and use that to build a new x86_64 binary, it indeed performs worse on x86_64 also:
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1502 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1500 ms
$ taskset -c 2 ./sort-O3
Bubble sorting array of 30000 elements
1501 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1907 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1893 ms
$ taskset -c 2 ./sort-O3-autofdo-etmgcov
Bubble sorting array of 30000 elements
1907 ms
Kim
--
Mike Leach
Principal Engineer, ARM Ltd.
Blackburn Design Centre. UK
<snip>