Hi Mathieu,
On 16/11/17 18:37, Mathieu Poirier wrote:
Hello Robert,
On 15 November 2017 at 04:55, Robert Walker robert.walker@arm.com wrote:
-----Original Message-----
From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
Sent: 14 November 2017 17:56
To: Robert Walker Robert.Walker@arm.com
Cc: CoreSight@lists.linaro.org
Subject: Re: [PATCH 0/2] perf inject branch stack fixes
Hi Robert and thanks for the code.
On 13 November 2017 at 08:11, Robert Walker robert.walker@arm.com wrote:
These patches fix some issues with the branch stacks generated from CoreSight ETM trace.
The main issues addressed are:
- The branch stack should only contain taken branches.
- The instruction samples are generated using the period specified by the --itrace option to perf inject. Currently, the period can only be specified as an instruction count - further work is required to specify the period as a cycle count or time interval.
- The ordering of the branch stack should have newest branch first.
- Some minor fixes to the address calculations.
With these fixes, the branch stacks are more similar to the last branch records produced by 'perf record -b' and Intel-PT on x86. There are similar improvements in the autofdo profiles generated from
these traces.
I'm a little confused. Here you mention that reverting d3fa0f70b7e8 makes records look more similar to Intel-PT, but the changelog in d3fa0f70b7e8 claims the same thing. We obviously have two diverging points of view and I'd like to have a better understanding of the situation. Is there any way I can test this on my side?
Thanks, Mathieu
I used the attached test program to test this - main() calls f1() which then calls f2(), which calls f3().
On an x86 PC, I recorded last branch records with: perf record -b ./call_chain
That worked on x86 but blew up on Juno:
root@linaro-nano:~/rwalker# perf record -b ./call_chain
Error: PMU Hardware doesn't support sampling/overflow-interrupts.
Have you also seen this?
Arm cores don't have support for last branch records - this is why we need to record the execution trace and reconstruct the branch records with perf inject.
And Intel-PT with:
perf record -e intel_pt//u ./call_chain
perf inject -i perf.data -o inj.data --itrace=i10000il --strip
A couple of questions here:
- The command "perf record -e intel_pt//u ./call_chain" will trigger
a cpu-wide trace session, something that is not (yet) supported on CS. Did you add the --per-thread switch when doing this on Juno?
Yes, I did add the --per-thread switch when recording on Arm (I'm using HiKey960)
- The command "perf inject -i perf.data -o inj.data --itrace=i10000il --strip" doesn't affect the content of "perf.data". Since "inj.data" isn't referenced below, what is the purpose of that command?
The perf inject command decodes the execution trace and constructs instruction samples containing branch records at the requested interval (10000 instructions in the example above) - these are written to inj.data. When viewing branch records created by perf inject, inj.data should be passed to perf report instead of perf.data.
In each case, perf report -D is used to view the instruction samples with branch stacks.
The attached script, addr_mapper.py makes it easier to see what's going on in the branch stack by annotating addresses with their offsets from symbols.
objdump -d ./call_chain > call_chain.dump
perf report -D -i perf.data | ./addr_mapper.py call_chain.dump | less
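The attached addr_mapper.py is not reproduced in this thread. As a rough illustration only, a filter of this kind could look something like the sketch below. It is not the attached script: the function names (load_symbols, annotate), the regular expressions and the max_offset cut-off are all assumptions made for the example. It reads symbol start addresses from "objdump -d" output and appends [symbol+offset] (offset in hex) after each 16-digit hex address on stdin, which is the form used in the branch stack lines. Unlike the real script, it does not annotate shorter values such as the 0x400608 sample IP.

```python
#!/usr/bin/env python3
# Hypothetical sketch of an address annotator; NOT the attached addr_mapper.py.
import bisect
import re
import sys

# Symbol headers in "objdump -d" output look like "00000000004005c9 <f1>:"
SYMBOL_RE = re.compile(r"^([0-9a-fA-F]+) <([^>]+)>:\s*$")
# Branch stack addresses in "perf report -D" are printed as 16 hex digits
ADDR_RE = re.compile(r"\b[0-9a-fA-F]{16}\b")


def load_symbols(lines):
    """Return a sorted list of (start_address, name) from objdump -d lines."""
    symbols = []
    for line in lines:
        m = SYMBOL_RE.match(line)
        if m:
            symbols.append((int(m.group(1), 16), m.group(2)))
    symbols.sort()
    return symbols


def annotate(line, symbols, max_offset=0x10000):
    """Append [symbol+offset] after each 16-digit hex address in line.

    max_offset is an assumed cut-off so that addresses far beyond the last
    known symbol (e.g. shared libraries) are left unannotated.
    """
    starts = [addr for addr, _ in symbols]

    def repl(match):
        addr = int(match.group(0), 16)
        i = bisect.bisect_right(starts, addr) - 1  # nearest symbol at or below
        if i < 0:
            return match.group(0)
        base, name = symbols[i]
        offset = addr - base
        if offset > max_offset:
            return match.group(0)
        label = name if offset == 0 else f"{name}+{offset:x}"
        return f"{match.group(0)} [{label}]"

    return ADDR_RE.sub(repl, line)


def main():
    with open(sys.argv[1]) as dump:
        symbols = load_symbols(dump)
    for line in sys.stdin:
        sys.stdout.write(annotate(line, symbols))


if __name__ == "__main__" and len(sys.argv) == 2:
    main()
```

Saved under a name of your choice, it would be used in the same position in the pipeline as addr_mapper.py above, with the objdump output file as its only argument.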
Doing this on x86 and feeding it the "perf.data" file resulting from "perf record -b ./call_chain", I was able to get something [1] that somewhat resembles the output below. On my snapshot function names don't come out as neatly and addresses are in the higher part of the memory.
The high addresses will be the shared libraries linked to the program - the first few samples are when the C runtime is preparing to start the program (i.e. before main() is reached). I found I had to skip the first few samples (around 8-10) before the main program is reached and I then get the branch stacks as found in the previous mail.
Regards
Rob
This results in a branch stack like this:
13 12877531596650324 [main+1287753159624fd3e] 0x4528 [0x640]: PERF_RECORD_SAMPLE(IP, 0x2): 17548/17548: 0x400608 [main+22] period: 10000 addr: 0
... branch stack: nr:64
.....  0: 000000000040061d [main+37] -> 0000000000400605 [main+1f] 0 cycles 0
.....  1: 00000000004005e5 [f1+1c] -> 000000000040060f [main+29] 0 cycles 0
.....  2: 00000000004005c8 [f2+1c] -> 00000000004005de [f1+15] 0 cycles 0
.....  3: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles 0
.....  4: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles 0
.....  5: 00000000004005d9 [f1+10] -> 00000000004005ac [f2] 0 cycles 0
.....  6: 000000000040060a [main+24] -> 00000000004005c9 [f1] 0 cycles 0
.....  7: 000000000040061d [main+37] -> 0000000000400605 [main+1f] 0 cycles 0
.....  8: 00000000004005e5 [f1+1c] -> 000000000040060f [main+29] 0 cycles 0
.....  9: 00000000004005c8 [f2+1c] -> 00000000004005de [f1+15] 0 cycles 0
..... 10: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles 0
..... 11: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles 0
..... 12: 00000000004005d9 [f1+10] -> 00000000004005ac [f2] 0 cycles 0
..... 13: 000000000040060a [main+24] -> 00000000004005c9 [f1] 0 cycles 0
Entry 13 is the call from main() to f1(), entry 12 is the call from f1() to f2(), entry 11 is the call from f2() to f3(). Then entries 10, 9 and 8 are the returns from f3() to f2(), f2() to f1() and f1() to main().
Without the reversion of d3fa0f7, the Arm trace produced the reverse stack, so that the call from main() to f1() appeared at the top, f1() to f2() as the 2nd entry, f2() to f3() as the 3rd and so on. With d3fa0f7 reverted, the Arm stacks match the order of the Intel stacks.
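To make the ordering concrete, here is a minimal Python sketch of a fixed-size branch stack that keeps the newest branch at index 0, as in the x86 output above. It is only an illustration of the ordering, not the code in cs-etm.c; the names make_branch_stack and record_branch are invented for the example.

```python
from collections import deque


def make_branch_stack(max_entries):
    """Create a bounded branch stack (hypothetical helper for illustration)."""
    # With maxlen set, appendleft() on a full deque drops the rightmost
    # (oldest) entry, so the stack always holds the most recent branches.
    return deque(maxlen=max_entries)


def record_branch(stack, from_addr, to_addr):
    """Record a taken branch so that the newest branch is entry 0."""
    stack.appendleft((from_addr, to_addr))


# Branches in the order they executed (addresses from the stack above):
stack = make_branch_stack(64)
record_branch(stack, 0x40060A, 0x4005C9)  # main+24 -> f1
record_branch(stack, 0x4005D9, 0x4005AC)  # f1+10   -> f2
record_branch(stack, 0x4005BC, 0x40059D)  # f2+10   -> f3

for i, (src, dst) in enumerate(stack):
    print(f"{i}: {src:016x} -> {dst:016x}")
```

Here entry 0 is the f2() to f3() call, the most recent branch, and the main() to f1() call is the last entry. Appending to the other end of the deque would give the chronological order described for the un-reverted d3fa0f7.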
Hope this helps.
Regards
Rob
The patches apply to the autoFDO branch of https://github.com/Linaro/perf-opencsd.git (d3fa0f7)
Regards
Robert Walker
Robert Walker (2):
  Revert "perf inject: record branches in chronological order"
  perf: Fix branch stack records from CoreSight ETM decode

 tools/perf/util/cs-etm-decoder/cs-etm-decoder.c |   4 +-
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.h |   2 +-
 tools/perf/util/cs-etm.c                        | 134 +++++++++++++-----------
 3 files changed, 73 insertions(+), 67 deletions(-)
-- 1.9.1