Just curious: what's the profiling overhead for this?

Thanks,
Dehao

On Thu, May 18, 2017 at 3:09 AM, Mike Leach <mike.leach@linaro.org> wrote:
Hi Sebastian,

You should also be aware that, due to the way program flow trace
works, all branches will result in a trace range, but not all trace
ranges end in a branch.
The input packets from the decoder [struct ocsd_generic_trace_elem
@trc_gen_elem_types.h] contain the trace range as you have seen plus
information on the last instruction in that range:-
a) if it was a branch or not
b) if it was executed (i.e. branch taken or not).

However, this additional information is not transferred into the
internal cs_etm_packet structure used in
cs_etm__update_last_branch_rb().

So unless I have misinterpreted what is required, building a
correctly formatted last branch record series is going to need this
additional information (we don't want non-branch ranges?), meaning
an update to the cs-etm-decoder.c code is also required.
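
As a rough illustration of the conversion I have in mind, here is a
minimal sketch in C. It assumes each decoded range carries its
inclusive start address, its exclusive end address, the size of its
last instruction, and the two flags from a) and b) above. The
struct range_pkt, struct branch_rec and ranges_to_branches() names are
hypothetical; they are not the actual OpenCSD or perf definitions. A
branch record is emitted only when a range ends in a taken branch: the
"from" address is that of the last instruction in the range, and the
"to" address is the start of the next range.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one decoded trace range. */
struct range_pkt {
	uint64_t start_addr;      /* first instruction executed (inclusive) */
	uint64_t end_addr;        /* address after the last instruction (exclusive) */
	uint8_t  last_instr_size; /* e.g. 4 for A32/A64, 2 for a 16-bit Thumb instruction */
	bool     last_is_branch;  /* a) last instruction was a branch */
	bool     last_executed;   /* b) that branch was taken */
};

/* Hypothetical LBR-style record: branch instruction -> branch target. */
struct branch_rec {
	uint64_t from;
	uint64_t to;
};

/*
 * Walk the ranges in execution order and emit one record for every
 * range that ends in a taken branch and has a following range.
 * Returns the number of records written to out (at most max_out).
 */
size_t ranges_to_branches(const struct range_pkt *r, size_t n,
			  struct branch_rec *out, size_t max_out)
{
	size_t count = 0;

	for (size_t i = 0; i + 1 < n && count < max_out; i++) {
		/* Skip ranges that do not end in a taken branch. */
		if (!r[i].last_is_branch || !r[i].last_executed)
			continue;

		/* end_addr is exclusive, so step back one instruction. */
		out[count].from = r[i].end_addr - r[i].last_instr_size;
		out[count].to = r[i + 1].start_addr;
		count++;
	}

	return count;
}
```

Fed the ranges 40088c -> 400898, 400888 -> 40088c and 4005a0 -> 4005b0
in that order, with each last instruction marked as a taken branch,
this yields the records 400894 -> 400888 and 400888 -> 4005a0, which
match the b.ne and bl in the objdump quoted below.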

Regards

Mike


On 18 May 2017 at 03:10, Sebastian Pop <sebpop@gmail.com> wrote:
> On Wed, May 17, 2017 at 11:35 AM, Mike Leach <mike.leach@linaro.org> wrote:
>> Hi,
>>
>> The OpenCSD decoder outputs executed instruction ranges - which do
>> look very much like the branch stack ranges below.
>> These ranges are inclusive to exclusive addresses.
>>
>> I'm not familiar with the required format of the branch stack,
>> but assuming this is a range output by the decoder....
>>
>> ..... 29: 00000000004005a0 -> 00000000004005b0 0 cycles  P   0
>>
>> it means that we traced execution starting at the instruction @
>> 4005a0 to the instruction _before_ address 4005b0 (i.e. the
>> instruction @ 4005ac if aarch32 / aarch64, or the instruction @ 4005ae
>> if Thumb16).
>>
>>
>> On 17 May 2017 at 16:43, Dehao Chen <dehao@google.com> wrote:
>>> I think there is something wrong with the branch_stack:
>>>
>>> e.g.
>>> ..... 29: 00000000004005a0 -> 00000000004005b0 0 cycles  P   0
>>> ..... 30: 0000000000400888 -> 000000000040088c 0 cycles  P   0
>>> ..... 31: 000000000040088c -> 0000000000400898 0 cycles  P   0
>>>
>>> from the objdump:
>>>   400884:       97ffff3f        bl      400580 <__printf_chk@plt>
>>>   400888:       97ffff46        bl      4005a0 <rand@plt>
>>>   40088c:       b8004660        str     w0, [x19],#4
>>>   400890:       eb14027f        cmp     x19, x20
>>>   400894:       54ffffa1        b.ne    400888 <sort_array+0x50>
>>>   400898:       d285e280        mov     x0, #0x2f14
>>>
>>
>> So taking a bit more of the stack from above e-mail.....
>>
>>
>> ..... 30: 0000000000400888 -> 000000000040088c 0 cycles  P   0
>> ..... 31: 000000000040088c -> 0000000000400898 0 cycles  P   0
>> (snip the stuff in high memory ....)
>> ..... 45: 00000000004005a0 -> 00000000004005b0 0 cycles  P   0
>> ..... 46: 0000000000400888 -> 000000000040088c 0 cycles  P   0
>> ..... 47: 000000000040088c -> 0000000000400898 0 cycles  P   0
>>
>> interpreting this as I would the decoder output....
>>
>> @29 is the execution of the code @4005a0
>> @30 executes the bl 4005a0
>> @31 runs from 40088c to the b.ne 400888 @ 400894 thus looping round again.
>>
>> So perhaps the interpretation of the output from the decoder in
>> building the branch stack could be the issue here?
>
> Mike, I think you are right: my interpretation of the ETM trace was wrong.
> The ETM trace contains the boundaries of the executed basic blocks,
> and not the start and end addresses of the branch instructions as in
> the LBR branch stack.  The conversion from the ETM to LBR events
> is not a 1-to-1 translation as currently implemented.
> I will submit a patch to fix this in perf inject.
>
> Thanks Kim, Dehao, and Mike for your help on identifying this issue.
>
> Sebastian



--
Mike Leach
Principal Engineer, ARM Ltd.
Blackburn Design Centre. UK