Just curious: what's the profiling overhead for this?
Thanks, Dehao
On Thu, May 18, 2017 at 3:09 AM, Mike Leach mike.leach@linaro.org wrote:
Hi Sebastian,
You should also be aware that, because of the way program flow trace works, every branch results in a trace range, but not every trace range ends in a branch. The input packets from the decoder [struct ocsd_generic_trace_elem in trc_gen_elem_types.h] contain the trace range, as you have seen, plus information on the last instruction in that range: a) whether it was a branch or not, and b) whether it was executed (i.e. branch taken or not).
However, this additional information is not transferred into the internal cs_etm_packet structure used in cs_etm__update_last_branch_rb().
So unless I have misinterpreted what is required, building a correctly formatted last branch record series is going to need this additional information (we don't want non-branch ranges?), meaning an update to the cs-etm-decoder.c code is also required.
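To make the point about non-branch ranges concrete, here is a minimal sketch of one way they could be folded away once the extra information is available. It is written in Python rather than against the perf sources, and the TraceRange type, its field names and merge_fallthrough_ranges() are all hypothetical stand-ins, not the real cs_etm_packet or OpenCSD definitions. It assumes that a range which does not end in a taken branch simply falls through into the next range when the addresses are contiguous.

```python
from dataclasses import dataclass


@dataclass
class TraceRange:
    """Hypothetical model of one decoded range plus last-instruction info."""
    start: int            # first executed address (inclusive)
    end: int              # address following the last executed instruction (exclusive)
    last_is_branch: bool  # the last instruction in the range is a branch
    last_executed: bool   # that last instruction executed (branch taken)


def ends_in_taken_branch(r: TraceRange) -> bool:
    return r.last_is_branch and r.last_executed


def merge_fallthrough_ranges(ranges):
    """Join a range onto its predecessor when the predecessor did not end
    in a taken branch and the two ranges are contiguous in memory."""
    merged = []
    for r in ranges:
        prev = merged[-1] if merged else None
        if prev is not None and not ends_in_taken_branch(prev) and prev.end == r.start:
            # Fall-through: extend the previous range and adopt the
            # last-instruction info of the newer range.
            merged[-1] = TraceRange(prev.start, r.end, r.last_is_branch, r.last_executed)
        else:
            merged.append(r)
    return merged
```

With this, a range ending in a not-taken branch followed by a contiguous range ending in a taken branch collapses into a single range that ends in a taken branch. Ranges that end for other reasons and are not contiguous with the next one are left untouched.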
Regards
Mike
On 18 May 2017 at 03:10, Sebastian Pop sebpop@gmail.com wrote:
On Wed, May 17, 2017 at 11:35 AM, Mike Leach mike.leach@linaro.org wrote:
Hi,
The OpenCSD decoder outputs executed instruction ranges - which do look very much like the branch stack ranges below. These ranges are inclusive to exclusive addresses.
I'm not familiar with the required format of the branch stack, but assuming this is a range output by the decoder....
..... 29: 00000000004005a0 -> 00000000004005b0 0 cycles P 0
it means that we traced execution starting at the instruction @ 4005a0 to the instruction _before_ address 4005b0 (i.e. the instruction @ 4005ac if aarch32 / aarch64, or the instruction @ 4005ae if Thumb16).
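The arithmetic above can be written out as a small sketch. It is Python for illustration only; last_insn_addr() and the two size constants are hypothetical names, and the sketch assumes the caller already knows the size of the last instruction in the range (4 bytes for A32/A64, 2 bytes for a 16-bit Thumb instruction).

```python
A32_A64_INSN_SIZE = 4  # fixed-width A32/A64 instruction, in bytes
THUMB16_INSN_SIZE = 2  # 16-bit Thumb instruction, in bytes


def last_insn_addr(end_addr: int, insn_size: int) -> int:
    """Address of the last executed instruction in a range whose end
    address is exclusive, given the size of that last instruction."""
    return end_addr - insn_size


print(hex(last_insn_addr(0x4005b0, A32_A64_INSN_SIZE)))  # 0x4005ac
print(hex(last_insn_addr(0x4005b0, THUMB16_INSN_SIZE)))  # 0x4005ae
```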
On 17 May 2017 at 16:43, Dehao Chen dehao@google.com wrote:
I think there is something wrong with the branch_stack:
e.g.
..... 29: 00000000004005a0 -> 00000000004005b0 0 cycles P 0
..... 30: 0000000000400888 -> 000000000040088c 0 cycles P 0
..... 31: 000000000040088c -> 0000000000400898 0 cycles P 0
from the objdump:
  400884: 97ffff3f bl 400580 __printf_chk@plt
  400888: 97ffff46 bl 4005a0 rand@plt
  40088c: b8004660 str w0, [x19],#4
  400890: eb14027f cmp x19, x20
  400894: 54ffffa1 b.ne 400888 <sort_array+0x50>
  400898: d285e280 mov x0, #0x2f14
So taking a bit more of the stack from above e-mail.....
..... 30: 0000000000400888 -> 000000000040088c 0 cycles P 0
..... 31: 000000000040088c -> 0000000000400898 0 cycles P 0
(snip the stuff in high memory ....)
..... 45: 00000000004005a0 -> 00000000004005b0 0 cycles P 0
..... 46: 0000000000400888 -> 000000000040088c 0 cycles P 0
..... 47: 000000000040088c -> 0000000000400898 0 cycles P 0
interpreting this as I would the decoder output....
@29 is the execution of the code @4005a0
@30 executes the bl 4005a0
@31 runs from 40088c to the b.ne 400888 @ 400894, thus looping round again.
So perhaps the interpretation of the output from the decoder in building the branch stack could be the issue here?
Mike, I think you are right: my interpretation of the ETM trace was wrong.
The ETM trace contains the boundaries of the executed basic blocks, and not the start and end addresses of the branch instructions as in the LBR branch stack. The conversion from the ETM to LBR events is not a 1-to-1 translation as currently implemented. I will submit a patch to fix this in perf inject.
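As a rough illustration of the conversion (not the patch itself), the sketch below turns a chronological list of executed ranges into LBR-style from/to pairs: the branch source is the last instruction of one range and the branch target is the start of the next. The function name ranges_to_branches() is hypothetical. The sketch assumes fixed 4-byte instructions and that every range ends in a taken branch, which Mike's note says is not guaranteed. The input uses the addresses from this thread, ordered oldest first on the assumption that the highest-numbered stack entry is the oldest.

```python
def ranges_to_branches(ranges, insn_size=4):
    """Convert chronological (start, end_exclusive) ranges into
    (from_addr, to_addr) branch records.

    Assumes each range ends in a taken branch whose size is insn_size.
    """
    branches = []
    for (_, end), (next_start, _) in zip(ranges, ranges[1:]):
        # Branch source: last instruction of this range.
        # Branch target: first instruction of the following range.
        branches.append((end - insn_size, next_start))
    return branches


# Ranges from the thread, oldest first (ordering is an assumption).
ranges = [
    (0x40088c, 0x400898),  # str ... b.ne 400888
    (0x400888, 0x40088c),  # bl 4005a0
    (0x4005a0, 0x4005b0),  # code at 4005a0
]

for src, dst in ranges_to_branches(ranges):
    print(f"{src:#x} -> {dst:#x}")
# 0x400894 -> 0x400888
# 0x400888 -> 0x4005a0
```

The two records line up with the objdump: the b.ne at 400894 branching back to 400888, and the bl at 400888 calling 4005a0. The last range yields no record because its branch target is only known once the following range has been decoded.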
Thanks Kim, Dehao, and Mike for your help on identifying this issue.
Sebastian
-- Mike Leach Principal Engineer, ARM Ltd. Blackburn Design Centre. UK