Re: [PATCH 0/2] perf inject branch stack fixes

16 Nov 2017


Hello Robert,
On 15 November 2017 at 04:55, Robert Walker robert.walker@arm.com wrote:
...
...
-----Original Message-----
From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
Sent: 14 November 2017 17:56
To: Robert Walker Robert.Walker@arm.com
Cc: CoreSight@lists.linaro.org
Subject: Re: [PATCH 0/2] perf inject branch stack fixes
Hi Robert and thanks for the code.
On 13 November 2017 at 08:11, Robert Walker robert.walker@arm.com
wrote:
...
These patches fix some issues with the branch stacks generated from
CoreSight ETM trace.
The main issues addressed are:

The branch stack should only contain taken branches.
The instruction samples are generated using the period specified by the
--itrace option to perf inject.  Currently, the period can only be
specified as an instruction count - further work is required to specify
the period as a cycle count or time interval.
The ordering of the branch stack should have newest branch first.
Some minor fixes to the address calculations.

With these fixes, the branch stacks are more similar to the last
branch records produced by 'perf record -b' and Intel-PT on x86.
There are similar improvements in the autofdo profiles generated from
these traces.
I'm a little confused.  Here you mention that reverting d3fa0f70b7e8 makes
records look more similar to Intel-PT, but the changelog in
d3fa0f70b7e8 claims the same thing.  We obviously have two diverging points
of view and I'd like to have a better understanding of the situation.  Is there
any way I can test this on my side?
Thanks,
Mathieu
I used the attached test program to test this - main() calls f1() which then calls f2(), which calls f3().
On an x86 PC, I recorded last branch records with:
  perf record -b ./call_chain
That worked on x86 but blew up on Juno:
root@linaro-nano:~/rwalker# perf record -b ./call_chain
Error:
PMU Hardware doesn't support sampling/overflow-interrupts.
Have you also seen this?
...
And Intel-PT with:
  perf record -e intel_pt//u ./call_chain
  perf inject -i perf.data -o inj.data --itrace=i10000il --strip
A couple of questions here:
1) The command "perf record -e intel_pt//u ./call_chain" will trigger
a cpu-wide trace session, something that is not (yet) supported on CS.
Did you add the --per-thread switch when doing this on Juno?
2) The command "perf inject -i perf.data -o inj.data --itrace=i10000il
--strip" doesn't affect the content of "perf.data".  Since "inj.data"
isn't referenced below, what is the purpose of that command?
...
In each case, perf report -D  is used to view the instruction samples with branch stacks.
The attached script, addr_mapper.py makes it easier to see what's going on in the branch stack by annotating addresses with their offsets from symbols.
objdump -d ./call_chain > call_chain.dump
perf report -D -i perf.data | ./addr_mapper.py call_chain.dump | less
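For readers without the attachment, here is a minimal sketch of what such an address-annotating filter might look like. It is not the attached addr_mapper.py: the function names (load_symbols, annotate) and the regular expressions are assumptions made for illustration. It reads the symbol header lines from "objdump -d" output and appends [symbol+offset] (offset in hex) after every 16-digit hex address in a line of text.

```python
import bisect
import re

# Symbol header lines in "objdump -d" output look like:
#   00000000004005c9 <f1>:
SYMBOL_RE = re.compile(r"^([0-9a-f]+) <([^>]+)>:$")
# 16-digit hex addresses, as printed in the branch stack lines.
ADDR_RE = re.compile(r"\b[0-9a-f]{16}\b")


def load_symbols(dump_lines):
    """Return a sorted list of (start_address, name) from objdump -d lines."""
    symbols = []
    for line in dump_lines:
        m = SYMBOL_RE.match(line.strip())
        if m:
            symbols.append((int(m.group(1), 16), m.group(2)))
    symbols.sort()
    return symbols


def annotate(line, symbols):
    """Append [symbol+offset] after each address found in line.

    Assumption: an address belongs to the nearest symbol at or below it.
    Symbol sizes are not checked, so addresses past the last symbol are
    still attributed to it.
    """
    starts = [addr for addr, _ in symbols]

    def repl(match):
        addr = int(match.group(0), 16)
        i = bisect.bisect_right(starts, addr) - 1
        if i < 0:
            return match.group(0)  # below the first known symbol
        base, name = symbols[i]
        offset = addr - base
        label = name if offset == 0 else f"{name}+{offset:x}"
        return f"{match.group(0)} [{label}]"

    return ADDR_RE.sub(repl, line)
```

To use it as a filter like the one above, load the symbols from call_chain.dump once and pass each line of the "perf report -D" output through annotate().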
Doing this on x86 and feeding it the "perf.data" file resulting from
"perf record -b ./call_chain", I was able to get something [1] that
somewhat resembles the output below.  On my snapshot function names
don't come out as neatly and addresses are in the higher part of the
memory.
[1]. https://pastebin.com/pEQT4aP0
...
This results in a branch stack like this:
13 12877531596650324 [main+1287753159624fd3e] 0x4528 [0x640]: PERF_RECORD_SAMPLE(IP, 0x2): 17548/17548: 0x400608 [main+22] period: 10000 addr: 0
... branch stack: nr:64
.....  0: 000000000040061d [main+37] -> 0000000000400605 [main+1f] 0 cycles      0
.....  1: 00000000004005e5 [f1+1c] -> 000000000040060f [main+29] 0 cycles      0
.....  2: 00000000004005c8 [f2+1c] -> 00000000004005de [f1+15] 0 cycles      0
.....  3: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles      0
.....  4: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles      0
.....  5: 00000000004005d9 [f1+10] -> 00000000004005ac [f2] 0 cycles      0
.....  6: 000000000040060a [main+24] -> 00000000004005c9 [f1] 0 cycles      0
.....  7: 000000000040061d [main+37] -> 0000000000400605 [main+1f] 0 cycles      0
.....  8: 00000000004005e5 [f1+1c] -> 000000000040060f [main+29] 0 cycles      0
.....  9: 00000000004005c8 [f2+1c] -> 00000000004005de [f1+15] 0 cycles      0
..... 10: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles      0
..... 11: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles      0
..... 12: 00000000004005d9 [f1+10] -> 00000000004005ac [f2] 0 cycles      0
..... 13: 000000000040060a [main+24] -> 00000000004005c9 [f1] 0 cycles      0
Entry 13 is the call from main() to f1(), entry 12 is the call from f1() to f2(), entry 11 is the call from f2() to f3().  Then entries 10, 9 & 8 are the returns from f3(), f2(), f1() to main().
Without the reversion of d3fa0f7, the Arm trace produced the reverse stack, so that the call from main() to f1() appeared at the top, f1() to f2() as the 2nd entry, f2() to f3() as the 3rd and so on.  With d3fa0f7 reverted, the Arm stacks match the order of the Intel stacks.
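The ordering being described can be illustrated with a small sketch. This is not the cs-etm.c implementation; the class name LastBranchBuffer and its methods are hypothetical. It models a fixed-size last-branch buffer in which entry 0 is always the most recently taken branch and the oldest entry is dropped once the buffer is full.

```python
from collections import deque


class LastBranchBuffer:
    """Fixed-size record of taken branches, newest first (illustrative only)."""

    def __init__(self, size):
        # With maxlen set, appendleft() on a full deque discards the
        # entry at the right end, i.e. the oldest branch.
        self._entries = deque(maxlen=size)

    def record(self, src, dst):
        """Record a taken branch from address src to address dst."""
        self._entries.appendleft((src, dst))

    def snapshot(self):
        """Return the branch stack with index 0 as the newest branch."""
        return list(self._entries)


buf = LastBranchBuffer(size=64)
buf.record(0x40060A, 0x4005C9)  # main+24 -> f1
buf.record(0x4005D9, 0x4005AC)  # f1+10   -> f2
buf.record(0x4005BC, 0x40059D)  # f2+10   -> f3
stack = buf.snapshot()          # stack[0] is the f2 -> f3 call
```

In this model the call from main() to f1() ends up at the highest index, as in the x86 output above; the chronological ordering would put it at index 0 instead.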
Hope this helps.
Regards
Rob
...
...
The patches apply to the autoFDO branch of
https://github.com/Linaro/perf-opencsd.git (d3fa0f7)
Regards
Robert Walker
Robert Walker (2):
  Revert "perf inject: record branches in chronological order"
  perf: Fix branch stack records from CoreSight ETM decode
tools/perf/util/cs-etm-decoder/cs-etm-decoder.c |   4 +-
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.h |   2 +-
 tools/perf/util/cs-etm.c                        | 134 +++++++++++++-----------
 3 files changed, 73 insertions(+), 67 deletions(-)
--
1.9.1

CoreSight mailing list
CoreSight@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/coresight
