On Tue, 30 Apr 2019 at 10:22, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Mathieu,
-----Original Message-----
From: Mathieu Poirier mathieu.poirier@linaro.org
Sent: Monday, April 29, 2019 11:49 PM
To: Wojciech Żmuda wzmuda@n7space.com
Cc: Leo Yan leo.yan@linaro.org; coresight@lists.linaro.org; Michał Kurowski mkurowski@n7space.com; Michal Mosdorf mmosdorf@n7space.com
Subject: Re: [RFC] Trace-based execution time profiling (timestamp in perf-script?)
On Mon, 29 Apr 2019 at 11:24, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Leo,
-----Original Message-----
From: Leo Yan leo.yan@linaro.org
Sent: Monday, April 29, 2019 10:19 AM
To: Wojciech Żmuda wzmuda@n7space.com
Cc: coresight@lists.linaro.org; Michał Kurowski mkurowski@n7space.com; Michal Mosdorf mmosdorf@n7space.com
Subject: Re: [RFC] Trace-based execution time profiling (timestamp in perf-script?)
Hi Wojciech,
On Fri, Apr 26, 2019 at 10:36:37AM +0000, Wojciech Żmuda wrote:
Hello,
I'm trying to design a solution that uses CoreSight for measuring application execution time, with the granularity of specific instruction ranges.
I have an idea of how this may be achieved and I'd like to know your opinion.
Great inspiration comes from this patch set by Leo Yan, especially from
the disassembly script:
https://lists.linaro.org/pipermail/coresight/2018-May/001325.html
The related work was initiated by Tor (TI) and Mathieu :)
This was a very long time ago. The arm-cs-trace-disasm.py script may be used for reference purposes but nothing more, and it would certainly not work nowadays.
Could you please elaborate some more? I've already noticed that this script is not going to be upstreamed, but I'm still happy I managed to dig it out on the mailing list. It works for me (not perfectly and quite slowly, but still).
Given the serious amount of code influx in the perf tools subsystem, coupled with the fact that we haven't maintained that script for years, I was expecting it to blow up on you. You are quite lucky that it does anything for you.
Having a simple way to reconstruct the actual program flow is a pretty neat feature. I know that for Intel PT, perf has xed integration to produce disassembly. Are there any plans to come up with an official way to do it for CoreSight? Or is it just left for the users to come up with their scripts?
I don't know of any plans to work on this - Tor came up with the script to highlight the potential and power yielded by the "perf script" command. From there I endeavoured to keep making it work with each new kernel release, but it quickly came to consume too much time.
Analyzing this, I learned that perf-script is capable of understanding the perf.data AUXTRACE section and parsing some of the trace elements into branch samples, which illustrate how the IP moved around.
These pieces of information are available to the built-in Python interpreter, so we can script it to get assembly from the program image.
If I understand perf-script in its current shape correctly, it ignores all the non-branching events (so everything that's not an ATOM, EXCEPTION or TRACE_ON packet) - specifically, timestamping is lost during the process. I'd like to modify perf-script to generate samples on such timing events,
The CoreSight trace data is saved into perf.data (it is compressed data) and we need to use OpenCSD to decode the trace data and output the different kinds of packets.
Based on these packets, perf-script (and perf-report) can generate branch samples; it can also generate instruction samples and last branch stack samples if we specify the flags 'i' and 'l' for the '--itrace' option, but by default perf-script will only generate branch samples if we don't specify any flags for '--itrace'.
Looking at the disassembly Python script, I thought branch samples are enough to reconstruct the program flow. Since it is possible to get instruction samples as well (with --itrace=il - skipping 'l' here gives me a segfault), do you see any usage for this sample type? It was not mentioned in the Linux/CoreSight documents I've read, so I'm not quite sure how I can use this feature.
If we use the command 'perf script -F time' it should output samples with the timestamp field. But from my testing, this command fails; I am not sure if this is caused by the reason mentioned above, i.e. that the timestamping is lost during the process. If it is, how about fixing the issue for 'perf script -F time'?
This is the first time I see the "script -F time" option, but admittedly I never went looking for it. We never output time samples because we had no use for them. Doing so is probably not hard if there is already a packet type for timestamps.
Actually my current idea is to populate the 'time' field in branch samples, not to generate a separate sample for time. I'm not trying to be nit-picky here, just clarifying, so we're on the same page.
Ok
Basically, I think that taking these TIMESTAMP packets we have in the decoded output and correlating them with ATOM/EXCEPTION/some-other-flow-related packets will make it easy to programmatically measure execution time:
So, knowing this:
Idx:1802; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbed
Idx:1813; ID:14; I_ATOM_F1 : Atom format 1.; N
Idx:1814; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbef
Idx:1816; ID:14; I_ATOM_F1 : Atom format 1.; E
Idx:1817; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbf1
Idx:1819; ID:14; I_ATOM_F1 : Atom format 1.; N
Idx:1820; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbf2
Idx:1822; ID:14; I_ATOM_F1 : Atom format 1.; E
Idx:1824; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbf5
Idx:1826; ID:14; I_ATOM_F1 : Atom format 1.; N
Idx:1827; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbf6
and knowing how to turn ATOMs into program flow (which already comes with perf-script), I'd like to be able to say "execution of instruction block from 0xdeadbeef to 0xdeadffff started with timestamp of value 0xaaaa and lasted until value 0xbbbb, which means the execution lasted for approx. 1234 microseconds of real-life time", or something similar.
The thing to keep in mind here is that timestamps are emitted *after* traces have been generated - more on that below.
On the topic of timestamps, be mindful the clock used to generate timestamps is completely disjoint from the CPU clock and the system clock.
That may be the missing piece. Currently my understanding is that the timestamp generator is clocked at F [Hz], which means the value in its output register increments F times per second. This is effectively limited by how fast we sample this register and how many of these values actually end up in the sink and later in perf.data. Apart from that, I don't see other requirements w.r.t. the clock. Could you please confirm my understanding?
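Just to make my mental model concrete, here is a tiny sketch of the conversion I have in mind. The generator frequency below is purely a made-up, assumed value; on real hardware it would have to be obtained from the platform rather than from the trace itself.

TS_GEN_HZ = 100_000_000  # assumed 100 MHz timestamp generator (hypothetical value)

def delta_us(ts_start, ts_end, freq_hz=TS_GEN_HZ):
    # Convert the difference between two timestamp counter values to microseconds.
    return (ts_end - ts_start) / freq_hz * 1e6

# With two of the counter values from the decoded trace shown earlier:
print(delta_us(0x1be8df9dbed, 0x1be8df9dbf6))  # 9 ticks -> 0.09 us at the assumed 100 MHz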
The only thing we have access to is the information in the timestamp packets and, as Al pointed out, how many times they are generated. At this time the configuration is set to generate as many packets as possible.
I confirm that '-F time' does not work. This does not seem odd, since the time field is empty, but I don't understand the error. Despite perf-script generating branch samples, it complains about 'dummy' samples:

# perf script -F time
Samples for 'dummy:u' event do not have TIME attribute set. Cannot print 'time' field.
I browsed the sample generation code (cs-etm.c) and I can't see code producing this type of samples. Anyway, I think this may be a bug in the printing part of perf-script, since I actually managed to populate the time field and access it in Python (see below).
so later I can have them in between assembly instructions to calculate deltas and be able to tell either:
- how much time and/or how many CPU cycles have been spent between two arbitrary instructions (ideally; see the sketch below), or
- what instructions have been executed between timestamp T and T+1 (this seems to be more in line with how timestamping in CS works, I think).
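For the first case, the post-processing I have in mind could look roughly like this. The helper and its input format are hypothetical, not existing perf code: I assume the samples have already been flattened into (timestamp, start_addr, end_addr) tuples describing executed instruction ranges, the way the disassembly script derives ranges from consecutive branch samples.

def ticks_between(ranges, first_insn, last_insn):
    # Return the timestamp delta (in generator ticks) between the range
    # containing first_insn and the last range containing last_insn,
    # or None if either instruction was never seen in the trace.
    t_start = t_end = None
    for ts, start, end in ranges:
        if t_start is None and start <= first_insn < end:
            t_start = ts
        if start <= last_insn < end:
            t_end = ts
    if t_start is None or t_end is None:
        return None
    return t_end - t_start

# Made-up example values:
ranges = [(0x1be8df9dbed, 0x400a80, 0x400abc),
          (0x1be8df9dbf1, 0x4008dc, 0x400910)]
print(ticks_between(ranges, 0x400a84, 0x4008f0))  # -> 4 ticks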
A brief analysis of tools/perf/util/cs-etm.c and cs-etm-decoder/cs-etm-decoder.c suggests that timestamp events are not turned into packets, but merely recorded as a packet queue parameter (I'm not sure why this is needed, though). The cycacc event is not processed further at all, besides being later decoded to plaintext by OpenCSD. I think it may be worth giving them both a dedicated `enum cs_etm_sample_type` value and packet generator functions.
I'd like to leave this part for Mike/Mathieu for comments.
Wojciech, I have a new patchset waiting to be released that makes use of the cycle accurate information to correlate time between sample packets when working in CPU-wide scenarios [1]. It also rearranges a lot of the code you have been looking at. Patches prefixed with "perf tools:" will be sent out as soon as we have a 5.2-rc1.
[1]. https://git.linaro.org/people/mathieu.poirier/coresight.git (branch 5.1-rc3-cpu-wide-v3).
Mike, Mathieu, I'll be grateful if you joined this discussion and shared your knowledge. Thanks!
Meanwhile, my achievement today in this field is the discovery of the cs_etm__synth_branch_sample() and cs_etm__synth_instruction_sample() functions. They look like good places to populate the 'time' field.
With the following change it is indeed possible to get the timestamp value in the sample. While it is still not visible in 'perf script -F time' (because of the error mentioned above), it is enough to make it visible in Python:
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..34ca6749fc94 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1157,6 +1157,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet_queue.timestamp;
Hmm... Have a look at what I did in [2], found in the tree I indicated above. In there you will find an algorithm that uses timestamp packets to do pretty much what you want to do in the above code snippet. Mike and I have talked about it recently and concluded it was the best way to proceed.
[2]. 311207df4a71 perf tools: Add notion of time to decoding code
Thank you for this reference. Actually I'm working with v1 of this patch set on my devboard. I took a look at v3 and I don't see major differences around the commit you mentioned. Anyway, I examined it closely and noticed that my previous solution does not make much sense, since the decoder updates the timestamp value cached in the packet queue much more quickly than samples are produced, which resulted in samples having the same timestamp value over and over again (from the last decoding pass) until the queue is drained.
I'm glad you noticed this set. It introduces a lot of modifications to the perf tools and it would have been a shame for you to have to rebase all your code. You are also correct that regarding the user space solution, not much has changed between V1 and V3. The current solution strives to peg a timestamp to each range packet that is received, so that their execution can be correlated with range packets executed on other processors.
I came up with another approach and it seems to work fine. I extended the packet structure with a timestamp field, so each branch-related packet is timestamped with the last observed timestamp value. Then, each sample may have this timestamp as well.
I'm still not quite sure whether the timestamps require some clever adjustments, since branch samples are composed from two consecutive packets and I simply assign one of them to the sample. As you suggested, I tried to understand your patch [2], but I don't quite get the relation between timestamp, next_timestamp and adjusting them by instr_count.
A typical trace will look as follows:
Sync_packet
Range_packet_A(3 instructions)
Range_packet_B(2 instructions)
Timestamp(x)
Range_packet_C(5 instructions)
Range_packet_D(4 instructions)
Range_packet_E(2 instructions)
Timestamp(y)
...
...
As I mentioned above, timestamps are generated for instructions that have *already* executed. So timestamp X was generated after the instructions in range packets A and B have executed. The best we can do is estimate the execution time with one clock cycle per instruction. Given the above we can estimate that range packet A started at time X - 5 and range packet B at X - 2.
Once we have received a timestamp we can anchor the next range packets on it. So range packet C started at time X, range packet D started at X + 5 and range packet E at X + 9. As soon as a new timestamp is received, i.e. Y, we can readjust the hard timestamp to avoid incurring too much deviation.
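In pseudo-Python, the estimation described above could look like the sketch below. This is only an illustration of the idea (one tick per executed instruction, back-estimation before the first timestamp, re-anchoring on every new timestamp), not the actual perf implementation.

def anchor(packets):
    # packets: list of ('range', n_instructions) or ('timestamp', value)
    out = []          # (estimated_start, n_instructions) per range packet
    pending = []      # range packets seen before the first timestamp
    ts = None         # last timestamp seen (the "hard" anchor)
    offset = 0        # instructions executed since that timestamp
    for kind, value in packets:
        if kind == 'range':
            if ts is None:
                pending.append(value)          # no anchor yet, estimate later
            else:
                out.append((ts + offset, value))
                offset += value
        else:  # 'timestamp'
            if ts is None:
                # Back-estimate packets that executed before this timestamp.
                start = value - sum(pending)
                for n in pending:
                    out.append((start, n))
                    start += n
                pending = []
            ts, offset = value, 0              # readjust the hard anchor
    return out

pkts = [('range', 3), ('range', 2), ('timestamp', 100),
        ('range', 5), ('range', 4), ('range', 2), ('timestamp', 120)]
print(anchor(pkts))
# [(95, 3), (98, 2), (100, 5), (105, 4), (109, 2)]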
I hope this helps a little more. Let me know if there is something I wasn't clear on.
Mathieu
The change looks like this:
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
index 33e975c8d11b..1c052ae9aaaa 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
@@ -372,6 +372,7 @@ cs_etm_decoder__buffer_packet(struct cs_etm_packet_queue *packet_queue,
 	packet_queue->packet_buffer[et].flags = 0;
 	packet_queue->packet_buffer[et].exception_number = UINT32_MAX;
 	packet_queue->packet_buffer[et].trace_chan_id = trace_chan_id;
+	packet_queue->packet_buffer[et].timestamp = packet_queue->next_timestamp;

 	if (packet_queue->packet_count == CS_ETM_PACKET_MAX_BUFFER - 1)
 		return OCSD_RESP_WAIT;
@@ -424,6 +425,7 @@ cs_etm_decoder__buffer_range(struct cs_etm_queue *etmq,
 	case OCSD_INSTR_BR:
 	case OCSD_INSTR_BR_INDIRECT:
 		packet->last_instr_taken_branch = elem->last_instr_exec;
+		packet->timestamp = packet_queue->next_timestamp;
 		break;
 	case OCSD_INSTR_ISB:
 	case OCSD_INSTR_DSB_DMB:
@@ -483,6 +485,7 @@ cs_etm_decoder__buffer_exception(struct cs_etm_packet_queue *queue,

 	packet = &queue->packet_buffer[queue->tail];
 	packet->exception_number = elem->exception_number;
+	packet->timestamp = queue->next_timestamp;

 	return ret;
 }
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..aaffb9026af6 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1157,6 +1157,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet->timestamp;

 	/*
 	 * perf report cannot handle events without a branch stack
diff --git a/tools/perf/util/cs-etm.h b/tools/perf/util/cs-etm.h
index 1f6ec7babe70..a14b0220eaa7 100644
--- a/tools/perf/util/cs-etm.h
+++ b/tools/perf/util/cs-etm.h
@@ -122,6 +122,7 @@ struct cs_etm_packet {
 	enum cs_etm_isa isa;
 	u64 start_addr;
 	u64 end_addr;
+	u64 timestamp;
 	u32 instr_count;
 	u32 last_instr_type;
 	u32 last_instr_subtype;
Thanks, Wojciech
With the following change, the time field is populated in the param_dict['sample']['time'] field (sample = param_dict['sample']):
diff --git a/arm-cs-trace-disasm.py b/arm-cs-trace-disasm.py
index 1239ab4b6a58..da5cb4d53b4a 100644
--- a/arm-cs-trace-disasm.py
+++ b/arm-cs-trace-disasm.py
@@ -232,4 +232,5 @@ def process_event(param_dict):
 		print "Address range [ %s .. %s ]: isn't in same dso" % (start_addr, stop_addr)
 		return

+	print('Timestamp 0x%lx' % sample['time'])
 	dump_disam(prev_dso, start_addr, stop_addr)
Then, applying the script to perf-script prints timestamps along with
disassembly:
root@zynq:~# perf script -s arm-cs-trace-disasm.py -- -d objdump | less
ARM CoreSight Trace Data Assembler Dump
kallsyms cannot be used to dump assembler
Timestamp 0x53e56a31002
0000000000400a80 <__libc_csu_init>:
  400a80:  a9bc7bfd  stp  x29, x30, [sp, #-64]!
  400a84:  910003fd  mov  x29, sp
  400a88:  a901d7f4  stp  x20, x21, [sp, #24]
  400a8c:  90000094  adrp x20, 410000 <__FRAME_END__+0xf3f0>
  400a90:  90000095  adrp x21, 410000 <__FRAME_END__+0xf3f0>
  400a94:  9137c294  add  x20, x20, #0xdf0
  400a98:  9137a2b5  add  x21, x21, #0xde8
  400a9c:  a902dff6  stp  x22, x23, [sp, #40]
  400aa0:  cb150294  sub  x20, x20, x21
  400aa4:  f9001ff8  str  x24, [sp, #56]
  400aa8:  2a0003f6  mov  w22, w0
  400aac:  aa0103f7  mov  x23, x1
  400ab0:  9343fe94  asr  x20, x20, #3
  400ab4:  aa0203f8  mov  x24, x2
  400ab8:  97fffece  bl   4005f0 <_init>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a3100e
00000000004008dc <main>:
  4008dc:  a9b37bfd  stp  x29, x30, [sp, #-208]!
  4008e0:  910003fd  mov  x29, sp
  4008e4:  b9001fa0  str  w0, [x29, #28]
  4008e8:  f9000ba1  str  x1, [x29, #16]
  4008ec:  f9400ba0  ldr  x0, [x29, #16]
  4008f0:  91002000  add  x0, x0, #0x8
  4008f4:  f9400001  ldr  x1, [x0]
  4008f8:  f9400ba0  ldr  x0, [x29, #16]
  4008fc:  91004000  add  x0, x0, #0x10
  400900:  f9400000  ldr  x0, [x0]
  400904:  aa0003e2  mov  x2, x0
  400908:  b9401fa0  ldr  w0, [x29, #28]
  40090c:  97ffffde  bl   400884 <opt_validate>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a31924
0000000000400910 <main+0x34>:
  400910:  7100001f  cmp  w0, #0x0
  400914:  540000c0  b.eq 40092c <main+0x50>  // b.none
Timestamp 0x53e56a31924
000000000040092c <main+0x50>:
  40092c:  f9400ba0  ldr  x0, [x29, #16]
  400930:  91004000  add  x0, x0, #0x10
  400934:  f9400000  ldr  x0, [x0]
  400938:  39400000  ldrb w0, [x0]
  40093c:  7101941f  cmp  w0, #0x65
  400940:  1a9f17e0  cset w0, eq  // eq = none
  400944:  12001c00  and  w0, w0, #0xff
  400948:  b9002fa0  str  w0, [x29, #44]
  40094c:  f9400ba0  ldr  x0, [x29, #16]
  400950:  91002000  add  x0, x0, #0x8
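As a possible next step for the script (purely hypothetical, not part of the change above), the timestamp could also be used directly in process_event() to print the delta since the previous sample:

prev_time = [None]  # previous sample's timestamp, kept across process_event() calls

def print_time_delta(sample):
    # Annotate each disassembled block with the number of generator ticks
    # elapsed since the previous block.
    cur = sample['time']
    if prev_time[0] is not None and cur >= prev_time[0]:
        print('Timestamp 0x%x (+%d ticks since previous sample)' % (cur, cur - prev_time[0]))
    else:
        print('Timestamp 0x%x' % cur)
    prev_time[0] = cur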
I need to investigate it further to make sure getting timestamps from this source is a good idea - I'm not convinced that the timestamp stored as a packet queue parameter is refreshed frequently enough to keep up with the timestamp packets actually emitted.
Look at [2] above, it should give you what you need.
Then, I think, it should be possible to generate samples (not sure what type, though; perhaps not 'branch' this time) for timestamp/cycacc packets, analogously to what has been done for TRACE_ON https://lists.linaro.org/pipermail/coresight/2018-May/001327.html and then expose them in the Python interface.
I'd be grateful for any opinion on this idea, especially about the usefulness of such a feature for the general audience, as well as any possible compatibility issues. If you are aware of another approach to achieve timestamp correlation with branch samples, it would also be very welcome.
Could you review Mathieu's patches 'perf tools: Configure timestamp generation in CPU-wide mode' [1] and 'coresight: etm4x: Configure tracers to emit timestamps' [2]?
Thanks for pointing them out. I did review them; however, they didn't help much.
They are pretty much just configuration of the timestamping module, which works well for me.
You are correct, they just deal with the mechanic of enabling the generation of timestamp packets. How I use them is in [2] above.
I hope the idea is not completely pointless. I'm still making my way through the perf subsystem, so I might have missed some crucial
details.
I don't think the idea is pointless. You simply have a new use case and from my point of view it is enough to receive consideration.
The idea looks good to me. Actually I am very curious about the question below:
What benefit can we get from enabling timestamps for CoreSight branch events that cannot be fulfilled by perf's cpu-clock/task-clock events and the PMU cpu-cycle event?
I tried to research cpu-clock and task-clock and it looks like they are based on the wall clock, while CS timestamping is CPU-independent. Measurement with CS may help to narrow down instruction stalls, which, I believe, would be hidden otherwise.
CPU cycles seem like a good measurement in this case, but I'm not sure if correlating PMU events with a specific instruction range wouldn't be harder than extracting the timestamps we already have in the stream.
Best regards, Wojciech
Thanks, Leo Yan
[1] https://lists.linaro.org/pipermail/coresight/2019-March/002259.html [2] https://lists.linaro.org/pipermail/coresight/2019-March/002232.html