Hello,
I'm trying to design a solution to use CoreSight for measuring application execution time, with granularity of specific ranges of instructions. I have some idea how this may be achieved and I'd like to know your opinion.
Great inspiration comes from this patch set by Leo Yan, especially from the disassembly script: https://lists.linaro.org/pipermail/coresight/2018-May/001325.html Analyzing this, I learned that perf-script is capable of understanding perf.data AUXTRACE section and parsing some of the trace elements to branch samples, which illustrate how the IP moved around. These pieces of information are available for the built-in python interpreter, so we can script it to get assembly from the program image.
If I understand perf-script in its current shape correctly, it ignores all the non-branching events (so everything that's not an ATOM, EXCEPTION or TRACE_ON packet) - specifically, timestamping is lost during the process. I'd like to modify perf-script to generate samples on such timing events, so later I can have them in between assembly instructions to calculate deltas and be able to tell either:
- how much time and/or CPU cycles have been spent between two arbitrary instructions (ideally), or
- what instructions have been executed between timestamp T and T+1 (this seems to be more in-line with how timestamping in CS works, I think)
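To make the goal concrete, here is a rough post-processing sketch of the first variant. The sample data and the helper name are made up for illustration; they are not actual perf output or API, and they assume branch samples already carry a raw CoreSight timestamp counter value in a 'time' field:

```python
# Sketch only: hypothetical branch samples, each with an instruction
# pointer and a raw CoreSight timestamp counter value. Addresses and
# counter values below are invented.
samples = [
    {"ip": 0x400A80, "time": 0x53E56A31002},
    {"ip": 0x4008DC, "time": 0x53E56A3100E},
    {"ip": 0x400910, "time": 0x53E56A31924},
]

def delta_between(samples, start_ip, end_ip):
    """Return the raw counter delta between the first sample at
    start_ip and the first later sample at end_ip, or None."""
    start = None
    for s in samples:
        if start is None and s["ip"] == start_ip:
            start = s["time"]
        elif start is not None and s["ip"] == end_ip:
            return s["time"] - start
    return None

# Delta is in timestamp generator ticks, not seconds.
print(hex(delta_between(samples, 0x400A80, 0x400910)))
```

The conversion from ticks to real time would then only need the timestamp generator frequency.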
Brief analysis of tools/perf/util/cs-etm.c and cs-etm-decoder/cs-etm-decoder.c suggests that timestamp events are not turned into packets, but merely recorded as a packet queue parameter (I'm not sure why this is needed, though). The cycacc event is not processed further at all, besides being later decoded to plaintext by OpenCSD. I think it may be worth giving them both a dedicated `enum cs_etm_sample_type` value and packet generator functions.
Then, I think, it should be possible to generate samples (not sure what type though, perhaps not 'branch' this time) for timestamp/cycacc packets, analogously to what has been done for TRACE_ON https://lists.linaro.org/pipermail/coresight/2018-May/001327.html and then expose it in the python interface.
I'd be grateful for any opinion about this idea, especially about the usefulness of such a feature for the general audience, as well as any possible compatibility issues. If you are aware of another approach to achieve timestamp correlation with branch samples, it would also be very welcome.
I hope the idea is not completely pointless. I'm still making my way through the perf subsystem, so I might have missed some crucial details.
Thank you for your time, Wojciech
Hi Wojciech,
On Fri, Apr 26, 2019 at 10:36:37AM +0000, Wojciech Żmuda wrote:
Hello,
I'm trying to design a solution to use CoreSight for measuring application execution time, with granularity of specific ranges of instructions. I have some idea how this may be achieved and I'd like to know your opinion.
Great inspiration comes from this patch set by Leo Yan, especially from the disassembly script: https://lists.linaro.org/pipermail/coresight/2018-May/001325.html
The related work was initiated by Tor (TI) and Mathieu :)
Analyzing this, I learned that perf-script is capable of understanding perf.data AUXTRACE section and parsing some of the trace elements to branch samples, which illustrate how the IP moved around. These pieces of information are available for the built-in python interpreter, so we can script it to get assembly from the program image.
If I understand perf-script in its current shape correctly, it ignores all the non-branching events (so everything that's not an ATOM, EXCEPTION or TRACE_ON packet) - specifically, timestamping is lost during the process. I'd like to modify perf-script to generate samples on such timing events,
The CoreSight trace data is saved into perf.data (it is compressed data) and we need to use OpenCSD to decode the trace data and output different kinds of packets.
Based on these packets, perf-script (and perf-report) can generate branch samples; it can also generate instruction samples and last branch stack samples if we specify the flags 'i' and 'l' for the option '--itrace'; but by default perf-script will only generate branch samples if we don't specify any flags for '--itrace'.
If we use the command 'perf script -F time' it should output samples with the timestamp field. But from my testing, this command fails; I am not sure if this is caused by the issue you mentioned, that timestamping is lost during the process. If it is, how about fixing the issue for 'perf script -F time'?
so later I can have them in between assembly instructions to calculate deltas and be able to tell either:
- how much time and/or CPU cycles have been spent between two arbitrary instructions (ideally), or
- what instructions have been executed between timestamp T and T+1 (this seems to be more
in-line with how timestamping in CS works, I think)
Brief analysis of tools/perf/util/cs-etm.c and cs-etm-decoder/cs-etm-decoder.c suggests that timestamp events are not turned into packets, but merely recorded as a packet queue parameter (I'm not sure why this is needed, though). The cycacc event is not processed further at all, besides being later decoded to plaintext by OpenCSD. I think it may be worth giving them both a dedicated `enum cs_etm_sample_type` value and packet generator functions.
I'd like to leave this part for Mike/Mathieu for comments.
Then, I think, it should be possible to generate samples (not sure what type though, perhaps not 'branch' this time) for timestamp/cycacc packets, analogously to what has been done for TRACE_ON https://lists.linaro.org/pipermail/coresight/2018-May/001327.html and then expose it in the python interface.
I'd be grateful for any opinion about this idea, especially about the usefulness of such a feature for the general audience, as well as any possible compatibility issues. If you are aware of another approach to achieve timestamp correlation with branch samples, it would also be very welcome.
Could you review Mathieu's patches 'perf tools: Configure timestamp generation in CPU-wide mode' [1] and 'coresight: etm4x: Configure tracers to emit timestamps' [2]?
I hope the idea is not completely pointless. I'm still making my way through the perf subsystem, so I might have missed some crucial details.
The idea looks good to me. Actually I am very curious about the question below:
What is the benefit we can get from enabling timestamps for CoreSight branch events that cannot be fulfilled by Perf's cpu-clock/task-clock events and the PMU cpu cycle event?
Thanks, Leo Yan
[1] https://lists.linaro.org/pipermail/coresight/2019-March/002259.html
[2] https://lists.linaro.org/pipermail/coresight/2019-March/002232.html
Hi Leo,
-----Original Message-----
From: Leo Yan leo.yan@linaro.org
Sent: Monday, April 29, 2019 10:19 AM
To: Wojciech Żmuda wzmuda@n7space.com
Cc: coresight@lists.linaro.org; Michał Kurowski mkurowski@n7space.com; Michal Mosdorf mmosdorf@n7space.com
Subject: Re: [RFC] Trace-based execution time profiling (timestamp in perf-script?)
Hi Wojciech,
On Fri, Apr 26, 2019 at 10:36:37AM +0000, Wojciech Żmuda wrote:
Hello,
I'm trying to design a solution to use CoreSight for measuring application execution time, with granularity of specific ranges of
instructions.
I have some idea how this may be achieved and I'd like to know your
opinion.
Great inspiration comes from this patch set by Leo Yan, especially from
the disassembly script:
https://lists.linaro.org/pipermail/coresight/2018-May/001325.html
The related work was initiated by Tor (TI) and Mathieu :)
Analyzing this, I learned that perf-script is capable of understanding perf.data AUXTRACE section and parsing some of the trace elements to
branch samples, which illustrate how the IP moved around.
These pieces of information are available for the built-in python interpreter, so we can script it to get assembly from the program image.
If I understand perf-script in its current shape correctly, it ignores all the non-branching events (so everything that's not an ATOM, EXCEPTION or TRACE_ON packet) - specifically, timestamping is lost during the process. I'd like to modify perf-script to generate samples on such timing events,
The CoreSight trace data is saved into perf.data (it is compressed data) and we need to use OpenCSD to decode the trace data and output different kinds of packets.
Based on these packets, perf-script (and perf-report) can generate branch samples; it can also generate instruction samples and last branch stack samples if we specify the flags 'i' and 'l' for the option '--itrace'; but by default perf-script will only generate branch samples if we don't specify any flags for '--itrace'.
Looking at the disassembly python script, I thought branch samples are enough to reconstruct the program flow. Since it is possible to get instruction samples as well (with --itrace=il - skipping 'l' here gives me a segfault), do you see any usage for this sample type? It was not mentioned in the Linux/CoreSight documents I've read, so I'm not quite sure how I can use this feature.
If we use the command 'perf script -F time' it should output samples with the timestamp field. But from my testing, this command fails; I am not sure if this is caused by the issue you mentioned, that timestamping is lost during the process. If it is, how about fixing the issue for 'perf script -F time'?
I confirm that '-F time' does not work. This does not seem odd, since the time field is empty, but I don't understand the error. Even though perf-script generated branch samples, it complains about 'dummy' samples:
# perf script -F time
Samples for 'dummy:u' event do not have TIME attribute set. Cannot print 'time' field.
I browsed the sample generation code (cs-etm.c) and I can't see code producing this type of samples. Anyway, I think this may be a bug in the printing part of perf-script, since I actually managed to populate the time field and access it in python (see below).
so later I can have them in between assembly instructions to calculate
deltas and be able to tell either:
- how much time and/or CPU cycles have been spent between two
arbitrary instructions (ideally), or
- what instructions have been executed between timestamp T and T+1
(this seems to be more in-line with how timestamping in CS works, I think)
Brief analysis of tools/perf/util/cs-etm.c and cs-etm-decoder/cs-etm-decoder.c suggests that timestamp events are not turned into packets, but merely recorded as a packet queue parameter (I'm not sure why this is needed, though). The cycacc event is not processed further at all, besides being later decoded to plaintext by
OpenCSD. I think it may be worth giving them both a dedicated `enum cs_etm_sample_type` value and packet generator functions.
I'd like to leave this part for Mike/Mathieu for comments.
Mike, Mathieu, I'll be grateful if you join this discussion and share your knowledge. Thanks!
Meanwhile, today's achievement in this field is the discovery of the cs_etm__synth_branch_sample() and cs_etm__synth_instruction_sample() functions. They look like good places to populate the 'time' field.
With the following change it is indeed possible to get the timestamp value in the sample. While it is still not visible in 'perf script -F time' (because of the error mentioned above), it is enough to get this visible in python:
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..34ca6749fc94 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1157,6 +1157,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet_queue.timestamp;
With the following change, the time field is populated in the param_dict['sample']['time'] field (sample = param_dict['sample']):
diff --git a/arm-cs-trace-disasm.py b/arm-cs-trace-disasm.py
index 1239ab4b6a58..da5cb4d53b4a 100644
--- a/arm-cs-trace-disasm.py
+++ b/arm-cs-trace-disasm.py
@@ -232,4 +232,5 @@ def process_event(param_dict):
 		print "Address range [ %s .. %s ]: isn't in same dso" % (start_addr, stop_addr)
 		return
+	print('Timestamp 0x%lx' % sample['time'])
 	dump_disam(prev_dso, start_addr, stop_addr)
Then, applying the script to perf-script prints timestamps along with disassembly:
root@zynq:~# perf script -s arm-cs-trace-disasm.py -- -d objdump | less
ARM CoreSight Trace Data Assembler Dump
kallsyms cannot be used to dump assembler
Timestamp 0x53e56a31002
0000000000400a80 <__libc_csu_init>:
  400a80:   a9bc7bfd    stp   x29, x30, [sp, #-64]!
  400a84:   910003fd    mov   x29, sp
  400a88:   a901d7f4    stp   x20, x21, [sp, #24]
  400a8c:   90000094    adrp  x20, 410000 <__FRAME_END__+0xf3f0>
  400a90:   90000095    adrp  x21, 410000 <__FRAME_END__+0xf3f0>
  400a94:   9137c294    add   x20, x20, #0xdf0
  400a98:   9137a2b5    add   x21, x21, #0xde8
  400a9c:   a902dff6    stp   x22, x23, [sp, #40]
  400aa0:   cb150294    sub   x20, x20, x21
  400aa4:   f9001ff8    str   x24, [sp, #56]
  400aa8:   2a0003f6    mov   w22, w0
  400aac:   aa0103f7    mov   x23, x1
  400ab0:   9343fe94    asr   x20, x20, #3
  400ab4:   aa0203f8    mov   x24, x2
  400ab8:   97fffece    bl    4005f0 <_init>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a3100e
00000000004008dc <main>:
  4008dc:   a9b37bfd    stp   x29, x30, [sp, #-208]!
  4008e0:   910003fd    mov   x29, sp
  4008e4:   b9001fa0    str   w0, [x29, #28]
  4008e8:   f9000ba1    str   x1, [x29, #16]
  4008ec:   f9400ba0    ldr   x0, [x29, #16]
  4008f0:   91002000    add   x0, x0, #0x8
  4008f4:   f9400001    ldr   x1, [x0]
  4008f8:   f9400ba0    ldr   x0, [x29, #16]
  4008fc:   91004000    add   x0, x0, #0x10
  400900:   f9400000    ldr   x0, [x0]
  400904:   aa0003e2    mov   x2, x0
  400908:   b9401fa0    ldr   w0, [x29, #28]
  40090c:   97ffffde    bl    400884 <opt_validate>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a31924
0000000000400910 <main+0x34>:
  400910:   7100001f    cmp   w0, #0x0
  400914:   540000c0    b.eq  40092c <main+0x50>  // b.none
Timestamp 0x53e56a31924
000000000040092c <main+0x50>:
  40092c:   f9400ba0    ldr   x0, [x29, #16]
  400930:   91004000    add   x0, x0, #0x10
  400934:   f9400000    ldr   x0, [x0]
  400938:   39400000    ldrb  w0, [x0]
  40093c:   7101941f    cmp   w0, #0x65
  400940:   1a9f17e0    cset  w0, eq  // eq = none
  400944:   12001c00    and   w0, w0, #0xff
  400948:   b9002fa0    str   w0, [x29, #44]
  40094c:   f9400ba0    ldr   x0, [x29, #16]
  400950:   91002000    add   x0, x0, #0x8
I need to investigate it further to make sure getting timestamps from this source is a good idea - I'm not convinced the timestamp kept as a packet queue parameter is refreshed frequently enough to keep up with the timestamp packets actually emitted.
Then, I think, it should be possible to generate samples (not sure what type though, perhaps not 'branch' this time) for timestamp/cycacc packets, analogously to what has been done for TRACE_ON https://lists.linaro.org/pipermail/coresight/2018-May/001327.html and then expose it in the python interface.
I'd be grateful for any opinion about this idea, especially about the usefulness of such a feature for the general audience, as well as any possible compatibility issues. If you are aware of another approach to achieve timestamp correlation with branch samples, it would also be very
welcome.
Could you review Mathieu's patches 'perf tools: Configure timestamp generation in CPU-wide mode' [1] and 'coresight: etm4x: Configure tracers to emit timestamps' [2]?
Thanks for pointing these out. I did review them, however they didn't help much. They are pretty much just configuration of the timestamping module, which works well for me.
I hope the idea is not completely pointless. I'm still making my way through the perf subsystem, so I might have missed some crucial details.
The idea looks good to me. Actually I am very curious about the question below:
What is the benefit we can get from enabling timestamps for CoreSight branch events that cannot be fulfilled by Perf's cpu-clock/task-clock events and the PMU cpu cycle event?
I tried to research cpu-clock and task-clock and it looks like they are based on wall clock, while CS timestamping is CPU-independent. Measurement with CS may help to narrow down instruction stalls, which, I believe, would be hidden otherwise. CPU cycle seems like a good measurement in this case, but I'm not sure if correlating PMU events with specific instruction range wouldn't be harder than extracting timestamps we already have in the stream.
Best regards, Wojciech
Thanks, Leo Yan
[1] https://lists.linaro.org/pipermail/coresight/2019-March/002259.html
[2] https://lists.linaro.org/pipermail/coresight/2019-March/002232.html
On Mon, 29 Apr 2019 at 11:24, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Leo,
-----Original Message-----
From: Leo Yan leo.yan@linaro.org
Sent: Monday, April 29, 2019 10:19 AM
To: Wojciech Żmuda wzmuda@n7space.com
Cc: coresight@lists.linaro.org; Michał Kurowski mkurowski@n7space.com; Michal Mosdorf mmosdorf@n7space.com
Subject: Re: [RFC] Trace-based execution time profiling (timestamp in perf-script?)
Hi Wojciech,
On Fri, Apr 26, 2019 at 10:36:37AM +0000, Wojciech Żmuda wrote:
Hello,
I'm trying to design a solution to use CoreSight for measuring application execution time, with granularity of specific ranges of
instructions.
I have some idea how this may be achieved and I'd like to know your
opinion.
Great inspiration comes from this patch set by Leo Yan, especially from
the disassembly script:
https://lists.linaro.org/pipermail/coresight/2018-May/001325.html
The related work was initiated by Tor (TI) and Mathieu :)
This was a very long time ago. The arm-cs-trace-disasm.py may be used for reference purposes but nothing more and it would certainly not work nowadays.
Analyzing this, I learned that perf-script is capable of understanding perf.data AUXTRACE section and parsing some of the trace elements to
branch samples, which illustrate how the IP moved around.
These pieces of information are available for the built-in python interpreter, so we can script it to get assembly from the program image.
If I understand perf-script in its current shape correctly, it ignores all the non-branching events (so everything that's not an ATOM, EXCEPTION or TRACE_ON packet) - specifically, timestamping is lost during the process. I'd like to modify perf-script to generate samples on such timing events,
The CoreSight trace data is saved into perf.data (it is compressed data) and we need to use OpenCSD to decode the trace data and output different kinds of packets.
Based on these packets, perf-script (and perf-report) can generate branch samples; it can also generate instruction samples and last branch stack samples if we specify the flags 'i' and 'l' for the option '--itrace'; but by default perf-script will only generate branch samples if we don't specify any flags for '--itrace'.
Looking at the disassembly python script, I thought branch samples are enough to reconstruct the program flow. Since it is possible to get instruction samples as well (with --itrace=il - skipping 'l' here gives me a segfault), do you see any usage for this sample type? It was not mentioned in the Linux/CoreSight documents I've read, so I'm not quite sure how I can use this feature.
If we use the command 'perf script -F time' it should output samples with the timestamp field. But from my testing, this command fails; I am not sure if this is caused by the issue you mentioned, that timestamping is lost during the process. If it is, how about fixing the issue for 'perf script -F time'?
This is the first time I see the "script -F time" option, but admittedly I never went looking for it. We never output time samples because we had no use for them. Doing so is probably not hard if there is already a packet type for timestamps. On the topic of timestamps, be mindful that the clock used to generate timestamps is completely disjoint from the CPU clock and the system clock.
I confirm that '-F time' does not work. This does not seem odd, since the time field is empty, but I don't understand the error. Even though perf-script generated branch samples, it complains about 'dummy' samples:
# perf script -F time
Samples for 'dummy:u' event do not have TIME attribute set. Cannot print 'time' field.
I browsed the sample generation code (cs-etm.c) and I can't see code producing this type of samples. Anyway, I think this may be a bug in the printing part of perf-script, since I actually managed to populate the time field and access it in python (see below).
so later I can have them in between assembly instructions to calculate
deltas and be able to tell either:
- how much time and/or CPU cycles have been spent between two
arbitrary instructions (ideally), or
- what instructions have been executed between timestamp T and T+1
(this seems to be more in-line with how timestamping in CS works, I think)
Brief analysis of tools/perf/util/cs-etm.c and cs-etm-decoder/cs-etm-decoder.c suggests that timestamp events are not turned into packets, but merely recorded as a packet queue parameter (I'm not sure why this is needed, though). The cycacc event is not processed further at all, besides being later decoded to plaintext by
OpenCSD. I think it may be worth giving them both a dedicated `enum cs_etm_sample_type` value and packet generator functions.
I'd like to leave this part for Mike/Mathieu for comments.
Wojciech, I have a new patchset waiting to be released that makes use of the cycle accurate information to correlate time between sample packets when working in CPU wide scenarios [1]. It also rearranges a lot of the code you have been looking at. Patches prepended with "perf tools:" will be sent out as soon as we have a 5.2-rc1.
[1]. https://git.linaro.org/people/mathieu.poirier/coresight.git (branch 5.1-rc3-cpu-wide-v3).
Mike, Mathieu, I'll be grateful if you join this discussion and share your knowledge. Thanks!
Meanwhile, today's achievement in this field is the discovery of the cs_etm__synth_branch_sample() and cs_etm__synth_instruction_sample() functions. They look like good places to populate the 'time' field.
With the following change it is indeed possible to get the timestamp value in the sample. While it is still not visible in 'perf script -F time' (because of the error mentioned above), it is enough to get this visible in python:
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..34ca6749fc94 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1157,6 +1157,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet_queue.timestamp;
Humm... Have a look at what I did in [2], found in the tree I indicated above. In there you will find an algorithm that uses timestamp packets to do pretty much what you want to do in the above code snippet. Mike and I have talked about it recently and concluded it was the best way to proceed.
[2]. 311207df4a71 perf tools: Add notion of time to decoding code
With the following change, the time field is populated in the param_dict['sample']['time'] field (sample = param_dict['sample']):
diff --git a/arm-cs-trace-disasm.py b/arm-cs-trace-disasm.py
index 1239ab4b6a58..da5cb4d53b4a 100644
--- a/arm-cs-trace-disasm.py
+++ b/arm-cs-trace-disasm.py
@@ -232,4 +232,5 @@ def process_event(param_dict):
 		print "Address range [ %s .. %s ]: isn't in same dso" % (start_addr, stop_addr)
 		return
+	print('Timestamp 0x%lx' % sample['time'])
 	dump_disam(prev_dso, start_addr, stop_addr)
Then, applying the script to perf-script prints timestamps along with disassembly:
root@zynq:~# perf script -s arm-cs-trace-disasm.py -- -d objdump | less
ARM CoreSight Trace Data Assembler Dump
kallsyms cannot be used to dump assembler
Timestamp 0x53e56a31002
0000000000400a80 <__libc_csu_init>:
  400a80:   a9bc7bfd    stp   x29, x30, [sp, #-64]!
  400a84:   910003fd    mov   x29, sp
  400a88:   a901d7f4    stp   x20, x21, [sp, #24]
  400a8c:   90000094    adrp  x20, 410000 <__FRAME_END__+0xf3f0>
  400a90:   90000095    adrp  x21, 410000 <__FRAME_END__+0xf3f0>
  400a94:   9137c294    add   x20, x20, #0xdf0
  400a98:   9137a2b5    add   x21, x21, #0xde8
  400a9c:   a902dff6    stp   x22, x23, [sp, #40]
  400aa0:   cb150294    sub   x20, x20, x21
  400aa4:   f9001ff8    str   x24, [sp, #56]
  400aa8:   2a0003f6    mov   w22, w0
  400aac:   aa0103f7    mov   x23, x1
  400ab0:   9343fe94    asr   x20, x20, #3
  400ab4:   aa0203f8    mov   x24, x2
  400ab8:   97fffece    bl    4005f0 <_init>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a3100e
00000000004008dc <main>:
  4008dc:   a9b37bfd    stp   x29, x30, [sp, #-208]!
  4008e0:   910003fd    mov   x29, sp
  4008e4:   b9001fa0    str   w0, [x29, #28]
  4008e8:   f9000ba1    str   x1, [x29, #16]
  4008ec:   f9400ba0    ldr   x0, [x29, #16]
  4008f0:   91002000    add   x0, x0, #0x8
  4008f4:   f9400001    ldr   x1, [x0]
  4008f8:   f9400ba0    ldr   x0, [x29, #16]
  4008fc:   91004000    add   x0, x0, #0x10
  400900:   f9400000    ldr   x0, [x0]
  400904:   aa0003e2    mov   x2, x0
  400908:   b9401fa0    ldr   w0, [x29, #28]
  40090c:   97ffffde    bl    400884 <opt_validate>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a31924
0000000000400910 <main+0x34>:
  400910:   7100001f    cmp   w0, #0x0
  400914:   540000c0    b.eq  40092c <main+0x50>  // b.none
Timestamp 0x53e56a31924
000000000040092c <main+0x50>:
  40092c:   f9400ba0    ldr   x0, [x29, #16]
  400930:   91004000    add   x0, x0, #0x10
  400934:   f9400000    ldr   x0, [x0]
  400938:   39400000    ldrb  w0, [x0]
  40093c:   7101941f    cmp   w0, #0x65
  400940:   1a9f17e0    cset  w0, eq  // eq = none
  400944:   12001c00    and   w0, w0, #0xff
  400948:   b9002fa0    str   w0, [x29, #44]
  40094c:   f9400ba0    ldr   x0, [x29, #16]
  400950:   91002000    add   x0, x0, #0x8
I need to investigate it further to make sure getting timestamps from this source is a good idea - I'm not convinced the timestamp kept as a packet queue parameter is refreshed frequently enough to keep up with the timestamp packets actually emitted.
Look at [2] above, it should give you what you need.
Then, I think, it should be possible to generate samples (not sure what type though, perhaps not 'branch' this time) for timestamp/cycacc packets, analogously to what has been done for TRACE_ON https://lists.linaro.org/pipermail/coresight/2018-May/001327.html and then expose it in the python interface.
I'd be grateful for any opinion about this idea, especially about the usefulness of such a feature for the general audience, as well as any possible compatibility issues. If you are aware of another approach to achieve timestamp correlation with branch samples, it would also be very
welcome.
Could you review Mathieu's patches 'perf tools: Configure timestamp generation in CPU-wide mode' [1] and 'coresight: etm4x: Configure tracers to emit timestamps' [2]?
Thanks for pointing these out. I did review them, however they didn't help much. They are pretty much just configuration of the timestamping module, which works well for me.
You are correct, they just deal with the mechanic of enabling the generation of timestamp packets. How I use them is in [2] above.
I hope the idea is not completely pointless. I'm still making my way through the perf subsystem, so I might have missed some crucial details.
I don't think the idea is pointless. You simply have a new use case and from my point of view it is enough to receive consideration.
The idea looks good to me. Actually I am very curious about the question below:
What is the benefit we can get from enabling timestamps for CoreSight branch events that cannot be fulfilled by Perf's cpu-clock/task-clock events and the PMU cpu cycle event?
I tried to research cpu-clock and task-clock and it looks like they are based on wall clock, while CS timestamping is CPU-independent. Measurement with CS may help to narrow down instruction stalls, which, I believe, would be hidden otherwise. CPU cycle seems like a good measurement in this case, but I'm not sure if correlating PMU events with specific instruction range wouldn't be harder than extracting timestamps we already have in the stream.
Best regards, Wojciech
Thanks, Leo Yan
[1] https://lists.linaro.org/pipermail/coresight/2019-March/002259.html
[2] https://lists.linaro.org/pipermail/coresight/2019-March/002232.html
CoreSight mailing list
CoreSight@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/coresight
Hi Mathieu,
-----Original Message-----
From: Mathieu Poirier mathieu.poirier@linaro.org
Sent: Monday, April 29, 2019 11:49 PM
To: Wojciech Żmuda wzmuda@n7space.com
Cc: Leo Yan leo.yan@linaro.org; coresight@lists.linaro.org; Michał Kurowski mkurowski@n7space.com; Michal Mosdorf mmosdorf@n7space.com
Subject: Re: [RFC] Trace-based execution time profiling (timestamp in perf-script?)
On Mon, 29 Apr 2019 at 11:24, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Leo,
-----Original Message-----
From: Leo Yan leo.yan@linaro.org
Sent: Monday, April 29, 2019 10:19 AM
To: Wojciech Żmuda wzmuda@n7space.com
Cc: coresight@lists.linaro.org; Michał Kurowski mkurowski@n7space.com; Michal Mosdorf mmosdorf@n7space.com
Subject: Re: [RFC] Trace-based execution time profiling (timestamp in perf-script?)
Hi Wojciech,
On Fri, Apr 26, 2019 at 10:36:37AM +0000, Wojciech Żmuda wrote:
Hello,
I'm trying to design a solution to use CoreSight for measuring application execution time, with granularity of specific ranges of
instructions.
I have some idea how this may be achieved and I'd like to know your
opinion.
Great inspiration comes from this patch set by Leo Yan, especially from
the disassembly script:
https://lists.linaro.org/pipermail/coresight/2018-May/001325.html
The related work was initiated by Tor (TI) and Mathieu :)
This was a very long time ago. The arm-cs-trace-disasm.py may be used for reference purposes but nothing more and it would certainly not work nowadays.
Could you please elaborate some more? I've already noticed that this script is not going to be upstreamed, but I'm still happy I managed to dig it out on the mailing list. It works for me (not perfectly and quite slow, but still). Having a simple way to reconstruct the actual program flow is a pretty neat feature. I know that for Intel PT, perf has xed integration to produce disassembly. Are there any plans to come up with an official way to do it for CoreSight? Or is it just left for the users to come up with their scripts?
Analyzing this, I learned that perf-script is capable of understanding perf.data AUXTRACE section and parsing some of the trace elements to
branch samples, which illustrate how the IP moved around.
These pieces of information are available for the built-in python interpreter, so we can script it to get assembly from the program
image.
If I understand perf-script in its current shape correctly, it ignores all the non-branching events (so everything that's not an ATOM, EXCEPTION or TRACE_ON packet) - specifically, timestamping is lost during the process. I'd like to modify perf-script to generate samples on such timing events,
The CoreSight trace data is saved into perf.data (it is compressed data) and we need to use OpenCSD to decode the trace data and output different kinds of packets.
Based on these packets, perf-script (and perf-report) can generate branch samples; it can also generate instruction samples and last branch stack samples if we specify the flags 'i' and 'l' for the option '--itrace'; but by default perf-script will only generate branch samples if we don't specify any flags for '--itrace'.
Looking at the disassembly python script, I thought branch samples are enough to reconstruct the program flow. Since it is possible to get instruction samples as well (with --itrace=il - skipping 'l' here gives me a segfault), do you see any usage of this sample type? It was not mentioned in Linux/CoreSight documents I've read, so I'm not quite
sure how I can use this feature.
If we use the command 'perf script -F time' it should output samples with the timestamp field. But from my testing, this command fails; I am not sure if this is caused by the issue you mentioned, that timestamping is lost during the process. If it is, how about fixing the issue for 'perf script -F time'?
This is the first time I see the "script -F time" option, but admittedly I never went looking for it. We never output time samples because we had no use for them. Doing so is probably not hard if there is already a packet type for timestamps.
Actually my current idea is to populate the 'time' field in branch samples, not to generate a separate sample for time. I'm not trying to be nit-picky here, just clarifying, so we're on the same page.
Basically, I think that taking these TIMESTAMP packets we have in the decoded output and concatenating them with ATOM/EXCEPTION/some-other-flow-related packets will make it easy to programmatically measure execution time:
So, knowing this:
Idx:1802; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbed
Idx:1813; ID:14; I_ATOM_F1 : Atom format 1.; N
Idx:1814; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbef
Idx:1816; ID:14; I_ATOM_F1 : Atom format 1.; E
Idx:1817; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbf1
Idx:1819; ID:14; I_ATOM_F1 : Atom format 1.; N
Idx:1820; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbf2
Idx:1822; ID:14; I_ATOM_F1 : Atom format 1.; E
Idx:1824; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbf5
Idx:1826; ID:14; I_ATOM_F1 : Atom format 1.; N
Idx:1827; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbf6
and knowing how to turn ATOMs into program flow (which already comes with perf-script), I'd like to be able to say "execution of instruction block from 0xdeadbeef to 0xdeadffff started with timestamp of value 0xaaaa and lasted until value 0xbbbb, which means the execution lasted for approx. 1234 microseconds of real-life time", or something similar.
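To make the measurement idea concrete, here is a rough sketch (not part of perf; `TS_GEN_HZ` is an assumed placeholder for the platform-specific timestamp generator frequency) that extracts the timestamp values from decoder output like the dump above and converts consecutive deltas to microseconds:

```python
import re

# Hypothetical generator frequency; the real value is SoC-specific and
# must be obtained from the platform. Assumed 100 MHz for illustration.
TS_GEN_HZ = 100_000_000

TS_RE = re.compile(r"I_TIMESTAMP.*Updated val = (0x[0-9a-f]+)")

def timestamp_deltas(decoder_lines):
    """Yield (delta_ticks, delta_us) for consecutive I_TIMESTAMP packets."""
    prev = None
    for line in decoder_lines:
        m = TS_RE.search(line)
        if not m:
            continue
        val = int(m.group(1), 16)
        if prev is not None:
            ticks = val - prev
            yield ticks, ticks * 1e6 / TS_GEN_HZ
        prev = val

dump = [
    "Idx:1802; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbed",
    "Idx:1813; ID:14; I_ATOM_F1 : Atom format 1.; N",
    "Idx:1814; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbef",
]
for ticks, us in timestamp_deltas(dump):
    print(ticks, us)  # 2 ticks between the two timestamp packets
```

This only demonstrates the arithmetic; correlating the deltas with the instruction ranges between the timestamps is the part perf would have to provide.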
On the topic of timestamps, be mindful the clock used to generate timestamps is completely disjoint from the CPU clock and the system clock.
That may be the missing piece. Currently my understanding is that the timestamp generator is clocked with F [Hz], which means value in its output register increments F times per second. This is effectively limited by how fast we sample this register and how many of these values actually end in the sink and later in perf.data. Apart from that, I don't see other requirements w.r.t the clock. Could you please confirm my understanding?
I confirm that '-F time' does not work. This does not seem odd, since the time field is empty, but I don't understand the error. Even though perf-script generated branch samples, it complains about 'dummy' samples:

# perf script -F time
Samples for 'dummy:u' event do not have TIME attribute set.
Cannot print 'time' field.
I browsed the sample generation code (cs-etm.c) and I can't see code producing this type of samples. Anyway, I think this may be a bug in the printing part of perf-script, since I actually managed to populate the time field and access it in python (see below).
so later I can have them in between assembly instructions to calculate deltas and be able to tell either:
- how much time and/or CPU cycles have been spent between two arbitrary instructions (ideally), or
- what instructions have been executed between timestamp T and T+1 (this seems to be more in-line with how timestamping in CS works, I think)
Brief analysis of tools/perf/util/cs-etm.c and cs-etm-decoder/cs-etm-decoder.c suggests that timestamp events are not turned into packets, but merely recorded as a packet queue parameter (I'm not sure why this is needed, though). The cycacc event is not processed further at all, besides being later decoded to plaintext by OpenCSD. I think it may be worth giving them both a dedicated `enum cs_etm_sample_type` value and packet generator functions.
I'd like to leave this part for Mike/Mathieu for comments.
Wojciech, I have a new patchset waiting to be released that makes use of the cycle accurate information to correlate time between sample packets when working in CPU wide scenarios [1]. It also rearranges a lot of the code you have been looking at. Patches prepended with "perf tools:" will be sent out as soon as we have a 5.2-rc1.
[1]. https://git.linaro.org/people/mathieu.poirier/coresight.git (branch 5.1-rc3-cpu-wide-v3).
Mike, Mathieu, I'd be grateful if you joined this discussion and shared your knowledge. Thanks!
Meanwhile, my today's achievement in this field is discovery of cs_etm__synth_branch_sample() and cs_etm__synth_instruction_sample() functions. They look like good places to populate the 'time' field.
With the following change it is indeed possible to get the timestamp value in the sample. While it is still not visible in 'perf script -F
time'
(because of the error mentioned above), it is enough to get this visible in python:
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..34ca6749fc94 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1157,6 +1157,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet_queue.timestamp;
Humm... Have a look at what I did in [2], found in the tree I indicated above. In there you will find an algorithm that uses timestamp packet to do pretty much what you want to do in the above code snippet. Mike and I have talked about it recently and concluded it was the best way to proceed.
[2]. 311207df4a71 perf tools: Add notion of time to decoding code
Thank you for this reference. Actually I'm working on v1 of this patch set on my devboard. I took a look at v3 and I don't see major differences around the commit you mentioned. Anyway, I examined it closely and noticed that my previous solution does not make much sense, since the decoder updates the timestamp value cached in the packet queue much quicker than samples are produced, which resulted in samples having the same timestamp value over and over again (from the last decoding pass), until the queue is drained.
I came up with another approach and it seems to work fine. I extended packet with the timestamp field, so each branch-related packet is timestamped with the last observed timestamp value. Then, each sample may have this timestamp as well.
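As a toy model (plain Python with illustrative names, not perf's actual structures), the forward-stamping approach described above amounts to:

```python
def stamp_forward(trace_elements):
    """trace_elements: ('timestamp', value) or ('atom', 'E'/'N') tuples.
    Each branch-related element is stamped with the last observed
    timestamp value, mirroring the packet->timestamp field idea."""
    last_ts = 0
    stamped = []
    for kind, value in trace_elements:
        if kind == "timestamp":
            last_ts = value          # remember the most recent timestamp
        else:
            stamped.append((value, last_ts))  # stamp the flow packet
    return stamped

elems = [("timestamp", 0x10), ("atom", "E"),
         ("timestamp", 0x12), ("atom", "N")]
print(stamp_forward(elems))  # → [('E', 16), ('N', 18)]
```

Note that in this scheme every flow packet carries the timestamp seen *before* it in the stream.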
I'm still not quite sure whether timestamps require some clever adjustments, since branch samples are composed from two consecutive packets and I simply assign one of them to the sample. As you suggested, I tried to understand your [2] patch, but I don't quite get the relation of timestamp, next_timestamp and adjusting them by instr_count.
The change looks like this:
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
index 33e975c8d11b..1c052ae9aaaa 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
@@ -372,6 +372,7 @@ cs_etm_decoder__buffer_packet(struct cs_etm_packet_queue *packet_queue,
 	packet_queue->packet_buffer[et].flags = 0;
 	packet_queue->packet_buffer[et].exception_number = UINT32_MAX;
 	packet_queue->packet_buffer[et].trace_chan_id = trace_chan_id;
+	packet_queue->packet_buffer[et].timestamp = packet_queue->next_timestamp;

 	if (packet_queue->packet_count == CS_ETM_PACKET_MAX_BUFFER - 1)
 		return OCSD_RESP_WAIT;
@@ -424,6 +425,7 @@ cs_etm_decoder__buffer_range(struct cs_etm_queue *etmq,
 	case OCSD_INSTR_BR:
 	case OCSD_INSTR_BR_INDIRECT:
 		packet->last_instr_taken_branch = elem->last_instr_exec;
+		packet->timestamp = packet_queue->next_timestamp;
 		break;
 	case OCSD_INSTR_ISB:
 	case OCSD_INSTR_DSB_DMB:
@@ -483,6 +485,7 @@ cs_etm_decoder__buffer_exception(struct cs_etm_packet_queue *queue,
 	packet = &queue->packet_buffer[queue->tail];
 	packet->exception_number = elem->exception_number;
+	packet->timestamp = queue->next_timestamp;

 	return ret;
 }
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..aaffb9026af6 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1157,6 +1157,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet->timestamp;

 	/*
 	 * perf report cannot handle events without a branch stack
diff --git a/tools/perf/util/cs-etm.h b/tools/perf/util/cs-etm.h
index 1f6ec7babe70..a14b0220eaa7 100644
--- a/tools/perf/util/cs-etm.h
+++ b/tools/perf/util/cs-etm.h
@@ -122,6 +122,7 @@ struct cs_etm_packet {
 	enum cs_etm_isa isa;
 	u64 start_addr;
 	u64 end_addr;
+	u64 timestamp;
 	u32 instr_count;
 	u32 last_instr_type;
 	u32 last_instr_subtype;
Thanks, Wojciech
With the following change, the time field is populated in the param_dict['sample']['time'] field (sample = param_dict['sample']):
diff --git a/arm-cs-trace-disasm.py b/arm-cs-trace-disasm.py
index 1239ab4b6a58..da5cb4d53b4a 100644
--- a/arm-cs-trace-disasm.py
+++ b/arm-cs-trace-disasm.py
@@ -232,4 +232,5 @@ def process_event(param_dict):
 		print "Address range [ %s .. %s ]: isn't in same dso" % (start_addr, stop_addr)
 		return
+	print('Timestamp 0x%lx' % sample['time'])
 	dump_disam(prev_dso, start_addr, stop_addr)
Then, applying the script to perf-script prints timestamps along with
disassembly:
root@zynq:~# perf script -s arm-cs-trace-disasm.py -- -d objdump | less
ARM CoreSight Trace Data Assembler Dump
kallsyms cannot be used to dump assembler
Timestamp 0x53e56a31002
0000000000400a80 <__libc_csu_init>:
  400a80:	a9bc7bfd 	stp	x29, x30, [sp, #-64]!
  400a84:	910003fd 	mov	x29, sp
  400a88:	a901d7f4 	stp	x20, x21, [sp, #24]
  400a8c:	90000094 	adrp	x20, 410000 <__FRAME_END__+0xf3f0>
  400a90:	90000095 	adrp	x21, 410000 <__FRAME_END__+0xf3f0>
  400a94:	9137c294 	add	x20, x20, #0xdf0
  400a98:	9137a2b5 	add	x21, x21, #0xde8
  400a9c:	a902dff6 	stp	x22, x23, [sp, #40]
  400aa0:	cb150294 	sub	x20, x20, x21
  400aa4:	f9001ff8 	str	x24, [sp, #56]
  400aa8:	2a0003f6 	mov	w22, w0
  400aac:	aa0103f7 	mov	x23, x1
  400ab0:	9343fe94 	asr	x20, x20, #3
  400ab4:	aa0203f8 	mov	x24, x2
  400ab8:	97fffece 	bl	4005f0 <_init>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a3100e
00000000004008dc <main>:
  4008dc:	a9b37bfd 	stp	x29, x30, [sp, #-208]!
  4008e0:	910003fd 	mov	x29, sp
  4008e4:	b9001fa0 	str	w0, [x29, #28]
  4008e8:	f9000ba1 	str	x1, [x29, #16]
  4008ec:	f9400ba0 	ldr	x0, [x29, #16]
  4008f0:	91002000 	add	x0, x0, #0x8
  4008f4:	f9400001 	ldr	x1, [x0]
  4008f8:	f9400ba0 	ldr	x0, [x29, #16]
  4008fc:	91004000 	add	x0, x0, #0x10
  400900:	f9400000 	ldr	x0, [x0]
  400904:	aa0003e2 	mov	x2, x0
  400908:	b9401fa0 	ldr	w0, [x29, #28]
  40090c:	97ffffde 	bl	400884 <opt_validate>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a31924
0000000000400910 <main+0x34>:
  400910:	7100001f 	cmp	w0, #0x0
  400914:	540000c0 	b.eq	40092c <main+0x50>  // b.none
Timestamp 0x53e56a31924
000000000040092c <main+0x50>:
  40092c:	f9400ba0 	ldr	x0, [x29, #16]
  400930:	91004000 	add	x0, x0, #0x10
  400934:	f9400000 	ldr	x0, [x0]
  400938:	39400000 	ldrb	w0, [x0]
  40093c:	7101941f 	cmp	w0, #0x65
  400940:	1a9f17e0 	cset	w0, eq  // eq = none
  400944:	12001c00 	and	w0, w0, #0xff
  400948:	b9002fa0 	str	w0, [x29, #44]
  40094c:	f9400ba0 	ldr	x0, [x29, #16]
  400950:	91002000 	add	x0, x0, #0x8
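For post-processing, a handler along these lines could turn consecutive timestamps into per-block deltas. This is only a sketch: it assumes the kernel-side change above populates sample['time'], and the return value exists purely to make the sketch easy to exercise (a real perf-script handler returns nothing):

```python
# Sketch of a perf-script style handler computing deltas between
# consecutive samples, assuming param_dict['sample']['time'] is
# populated as in the patch above.
prev_time = None

def process_event(param_dict):
    global prev_time
    ts = param_dict["sample"].get("time", 0)
    delta = None
    if prev_time is not None and ts:
        delta = ts - prev_time
        print("block took %d timestamp ticks" % delta)
    prev_time = ts
    return delta  # for illustration/testing only
```

Converting ticks to wall time still requires knowing the timestamp generator frequency, which is platform-specific.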
I need to investigate it further to make sure getting timestamps from this source is a good idea - I'm not convinced that the timestamp kept as a packet queue parameter is refreshed frequently enough to keep up with the timestamp packets actually emitted.
Look at [2] above, it should give you what you need.
Then, I think, it should be possible to generate samples (not sure what type though, perhaps not 'branch' this time) for timestamp/cycacc packets, analogically to what has been done for TRACE_ON https://lists.linaro.org/pipermail/coresight/2018-May/001327.html and then expose it in the python interface.
I'd be grateful for any opinion about this idea, especially about the usefulness of such a feature for the general audience, as well as any possible compatibility issues. If you are aware of another approach to achieve timestamp correlation with branch samples, it would also be very welcome.
Could you review Mathieu's patches 'perf tools: Configure timestsamp generation in CPU-wide mode' [1] and 'coresight: etm4x: Configure tracers to emit timestamps' [2]?
Thanks for pointing them out. I did review them, however they didn't help much.
They are pretty much just configuration of the timestamping module, which works well for me.
You are correct, they just deal with the mechanic of enabling the generation of timestamp packets. How I use them is in [2] above.
I hope the idea is not completely pointless. I'm still making my way through the perf subsystem, so I might have missed some crucial
details.
I don't think the idea is pointless. You simply have a new use case and from my point of view it is enough to receive consideration.
The idea looks good to me. Actually I am very curious about the question below:

What's the benefit we can get from enabling timestamps for CoreSight branch events that cannot be fulfilled by perf's cpu-clock/task-clock events and the PMU cpu cycle event?
I tried to research cpu-clock and task-clock and it looks like they are based on wall clock, while CS timestamping is CPU-independent. Measurement with CS may help to narrow down instruction stalls, which, I
believe, would be hidden otherwise.
CPU cycle seems like a good measurement in this case, but I'm not sure if correlating PMU events with specific instruction range wouldn't be harder than extracting timestamps we already have in the stream.
Best regards, Wojciech
Thanks, Leo Yan
[1] https://lists.linaro.org/pipermail/coresight/2019-March/002259.html [2] https://lists.linaro.org/pipermail/coresight/2019-March/002232.html
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
That may be the missing piece. Currently my understanding is that the timestamp generator is clocked with F [Hz], which means value in its output register increments F times per second. This is effectively limited by how fast we sample this register and how many of these values actually end in the sink and later in perf.data. Apart from that, I don't see other requirements w.r.t the clock. Could you please confirm my understanding?
The timestamp generator is global (one per chip) and runs at a fixed frequency. It might be possible to change this but it would be more of a BIOS thing. Definitely not something you'd do in response to a perf option.
What you can do (and what is reasonable to control through perf) is change how often it's sampled into the trace stream.
I came up with another approach and it seems to work fine. I extended packet with the timestamp field, so each branch-related packet is timestamped with the last observed timestamp value. Then, each sample may have this timestamp as well.
It should be done the other way round. When you see a timestamp value you apply it to the last observed:
- atom, i.e. Y/N branch
- event packet (perf doesn't use these)
- exception
So when you see a branch, you don't know what its timestamp is, until you see the next timestamp packet.
Al
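A toy model of the backward scheme described above (illustrative Python, not perf code): flow packets are buffered until the next timestamp packet arrives and are then stamped with it, since a timestamp is emitted after the activity it covers:

```python
def stamp_backward(trace_elements):
    """Apply each timestamp to the flow elements seen *before* it in
    the stream, per the scheme described above."""
    pending = []   # atoms/exceptions still waiting for their timestamp
    stamped = []
    for kind, value in trace_elements:
        if kind == "timestamp":
            # the timestamp covers everything buffered so far
            stamped.extend((p, value) for p in pending)
            pending.clear()
        else:
            pending.append(value)
    return stamped

elems = [("atom", "E"), ("atom", "N"), ("timestamp", 0x10),
         ("atom", "E"), ("timestamp", 0x12)]
print(stamp_backward(elems))  # → [('E', 16), ('N', 16), ('E', 18)]
```

Compared with stamping each packet with the last value already seen, this reflects that a branch's timestamp is unknown until the next timestamp packet arrives.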
On Tue, 30 Apr 2019 at 10:22, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Mathieu,
-----Original Message----- From: Mathieu Poirier mathieu.poirier@linaro.org Sent: Monday, April 29, 2019 11:49 PM To: Wojciech Żmuda wzmuda@n7space.com Cc: Leo Yan leo.yan@linaro.org; coresight@lists.linaro.org; Michał Kurowski mkurowski@n7space.com; Michal Mosdorf mmosdorf@n7space.com Subject: Re: [RFC] Trace-based execution time profiling (timestamp in perf-script?)
On Mon, 29 Apr 2019 at 11:24, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Leo,
-----Original Message----- From: Leo Yan leo.yan@linaro.org Sent: Monday, April 29, 2019 10:19 AM To: Wojciech Żmuda wzmuda@n7space.com Cc: coresight@lists.linaro.org; Michał Kurowski mkurowski@n7space.com; Michal Mosdorf mmosdorf@n7space.com Subject: Re: [RFC] Trace-based execution time profiling (timestamp in perf-script?)
Hi Wojciech,
On Fri, Apr 26, 2019 at 10:36:37AM +0000, Wojciech Żmuda wrote:
Hello,
I'm trying to design a solution to use CoreSight for measuring application execution time, with granularity of specific ranges of
instructions.
I have some idea how this may be achieved and I'd like to know your
opinion.
Great inspiration comes from this patch set by Leo Yan, especially from
the disassembly script:
https://lists.linaro.org/pipermail/coresight/2018-May/001325.html
The related work was initiated by Tor (TI) and Mathieu :)
This was a very long time ago. The arm-cs-trace-disasm.py may be used for reference purposes but nothing more and it would certainly not work nowadays.
Could you please elaborate some more? I've already noticed that this script is not going to be upstreamed, but I'm still happy I managed to dig it out on the mailing list. It works for me (not perfectly and quite slow, but still).
Given the serious amount of code influx in the perf tools subsystem, coupled with the fact that we haven't maintained that script for years, I was expecting it to blow up on you. You are quite lucky that it does anything for you.
Having a simple way to reconstruct the actual program flow is a pretty neat feature. I know that for Intel PT, perf has xed integration to produce disassembly. Are there any plans to come up with an official way to do it for CoreSight? Or is it just left for the users to come up with their scripts?
I don't know of any plans to work on this - Tor came up with the script to highlight the potential and power yielded by the "perf script" command. From there I endeavoured to keep making it work with each new kernel release but it quickly got to consume too much time.
Actually my current idea is to populate the 'time' field in branch samples, not to generate a separate sample for time. I'm not trying to be nit-picky here, just clarifying, so we're on the same page.
Ok
The thing to keep in mind here is that timestamps are emitted *after* traces have been generated - more on that below.
That may be the missing piece. Currently my understanding is that the timestamp generator is clocked with F [Hz], which means value in its output register increments F times per second. This is effectively limited by how fast we sample this register and how many of these values actually end in the sink and later in perf.data. Apart from that, I don't see other requirements w.r.t the clock. Could you please confirm my understanding?
The only thing we have access to is the information in the timestamp packets and, as Al pointed out, how many times they are generated. At this time the configuration is set to generate as many packets as possible.
I confirm that '-F time' does not work. This does not seem odd, since the time field is empty, but I don't understand the error. Despite perf-script generated branch samples, it complains about 'dummy'
samples:
# perf script -F time Samples for 'dummy:u' event do not have TIME attribute set. Cannot print
'time' field.
I browsed the sample generation code (cs-etm.c) and I can't see code producing this type of samples. Anwyay, I think this may be a bug in the printing part of perf-script, since I actually managed to populate
the time field and access it in python (see below).
so later I can have them in between assembly instructions to calculate
deltas and be able to tell either:
- how much time and/or CPU cycles have been spent between two
arbitrary instructions (ideally), or
- what instructions have been executed between timestamp T and T+1
(this seems to be more in-line with how timestamping in CS works, I think)
Brief analysis of tools/perf/util.cs-etm.c and cs-etm-decoder/cs-etm-decoder.c suggests that timestamp events are not turned into packets, but merely recorded as a packet queue parameter (I'm not sure why this is needed, though). The cycacc event is not processed further at all, beside being later decoded to plaintext by
OpenCSD. I think it may be worth to give them both a dedicated `enum cs_etm_sample_type` value and packet generator functions.
I'd like to leave this part for Mike/Mathieu for comments.
Wojciech, I have a new patchset waiting to be released that makes use of the cycle accurate information to correlate time between sample packets when working in CPU wide scenarios [1]. It also rearranges a lot of the code you have been looking at. Patches prepended with "perf tools:" will be sent out as soon as we have a 5.2-rc1.
[1]. https://git.linaro.org/people/mathieu.poirier/coresight.git (branch 5.1-rc3-cpu-wide-v3).
Mike, Mathieu, I'll be grateful if you joined this discussion and shared your knowledge. Thanks!
Meanwhile, today's achievement in this field is the discovery of the cs_etm__synth_branch_sample() and cs_etm__synth_instruction_sample() functions. They look like good places to populate the 'time' field.
With the following change it is indeed possible to get the timestamp value in the sample. While it is still not visible in 'perf script -F time' (because of the error mentioned above), it is enough to get it visible in python:
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..34ca6749fc94 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1157,6 +1157,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet_queue.timestamp;
Humm... Have a look at what I did in [2], found in the tree I indicated above. In there you will find an algorithm that uses timestamp packets to do pretty much what you want to do in the above code snippet. Mike and I have talked about it recently and concluded it was the best way to proceed.
[2]. 311207df4a71 perf tools: Add notion of time to decoding code
Thank you for this reference. Actually I've been working with v1 of this patch set on my devboard. I took a look at v3 and I don't see major differences around the commit you mentioned. Anyway, I examined it closely and noticed that my previous solution does not make much sense, since the decoder updates the timestamp value cached in the packet queue much faster than samples are produced, which resulted in samples having the same timestamp value over and over again (from the last decoding pass) until the queue is drained.
I'm glad you noticed this set. It introduces a lot of modifications to the perf tools and it would have been a shame for you to have to rebase all your code. You are also correct that, regarding the user space solution, not much has changed between V1 and V3. The current solution strives to peg a timestamp to each range packet that is received, so that their execution can be correlated with range packets executed on other processors.
I came up with another approach and it seems to work fine. I extended packet with the timestamp field, so each branch-related packet is timestamped with the last observed timestamp value. Then, each sample may have this timestamp as well.
I'm still not quite sure whether timestamps require some clever adjustments, since branch samples are composed from two consecutive packets, and I simply assign one of them to the sample. As you suggested, I tried to understand your patch [2], but I don't quite get the relation of timestamp, next_timestamp and adjusting them by instr_count.
A typical trace will look as follows:

Sync_packet
Range_packet_A(3 instructions)
Range_packet_B(2 instructions)
Timestamp(x)
Range_packet_C(5 instructions)
Range_packet_D(4 instructions)
Range_packet_E(2 instructions)
Timestamp(y)
...
As I mentioned above, timestamps are generated for instructions that have *already* executed. So timestamp X was generated after the instructions in range packets A and B have executed. The best we can do is estimate the execution time with one clock cycle per instruction. Given the above we can estimate that range packet A started at time X - 5 and range packet B at X - 2.
Once we have received a timestamp we can anchor the next range packets on it. So range packet C started at time X, range packet D started at X + 5 and range packet E at X + 9. As soon as a new timestamp is received, i.e. Y, we can readjust the hard timestamp to avoid incurring too much deviation.
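[Editor's note: the anchoring scheme described above can be sketched in a few lines of Python. This is purely illustrative — the flat trace representation (a list of ('range', instr_count) and ('ts', value) tuples) and the function name are made up, and it uses the same rough one-cycle-per-instruction approximation discussed in this thread; the real implementation lives in the C decoder.]

```python
def anchor_ranges(packets):
    """Estimate a start time for each range packet.

    'packets' is a hypothetical flat trace: ('range', instr_count) or
    ('ts', value) tuples. Ranges seen before the first timestamp are
    back-dated from it; later ranges are anchored forward on the last
    timestamp, assuming one clock cycle per instruction.
    """
    anchored = []   # estimated start time of each range packet, in order
    pending = []    # instr counts of ranges seen before any timestamp
    anchor = None   # (last_ts_value, instructions_since_last_ts)

    for kind, value in packets:
        if kind == "range":
            if anchor is None:
                pending.append(value)
            else:
                ts, offset = anchor
                anchored.append(ts + offset)
                anchor = (ts, offset + value)
        else:  # timestamp packet: applies to already-executed instructions
            if anchor is None:
                # Back-date the pending ranges so the last one ends at 'value'
                start = value - sum(pending)
                for count in pending:
                    anchored.append(start)
                    start += count
                pending = []
            # Re-anchor on the fresh timestamp to limit accumulated deviation
            anchor = (value, 0)
    return anchored

# The example trace: A(3), B(2), Timestamp(x), C(5), D(4), E(2), Timestamp(y)
trace = [("range", 3), ("range", 2), ("ts", 100),
         ("range", 5), ("range", 4), ("range", 2), ("ts", 120)]
```

With x = 100, anchor_ranges(trace) yields [95, 98, 100, 105, 109], matching the X - 5, X - 2, X, X + 5, X + 9 estimates above.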
I hope this helps a little more. Let me know if there is something I wasn't clear on.
Mathieu
The change looks like this:
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
index 33e975c8d11b..1c052ae9aaaa 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
@@ -372,6 +372,7 @@ cs_etm_decoder__buffer_packet(struct cs_etm_packet_queue *packet_queue,
 	packet_queue->packet_buffer[et].flags = 0;
 	packet_queue->packet_buffer[et].exception_number = UINT32_MAX;
 	packet_queue->packet_buffer[et].trace_chan_id = trace_chan_id;
+	packet_queue->packet_buffer[et].timestamp = packet_queue->next_timestamp;

 	if (packet_queue->packet_count == CS_ETM_PACKET_MAX_BUFFER - 1)
 		return OCSD_RESP_WAIT;
@@ -424,6 +425,7 @@ cs_etm_decoder__buffer_range(struct cs_etm_queue *etmq,
 	case OCSD_INSTR_BR:
 	case OCSD_INSTR_BR_INDIRECT:
 		packet->last_instr_taken_branch = elem->last_instr_exec;
+		packet->timestamp = packet_queue->next_timestamp;
 		break;
 	case OCSD_INSTR_ISB:
 	case OCSD_INSTR_DSB_DMB:
@@ -483,6 +485,7 @@ cs_etm_decoder__buffer_exception(struct cs_etm_packet_queue *queue,
 	packet = &queue->packet_buffer[queue->tail];
 	packet->exception_number = elem->exception_number;
+	packet->timestamp = queue->next_timestamp;

 	return ret;
 }
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..aaffb9026af6 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1157,6 +1157,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet->timestamp;

 	/*
 	 * perf report cannot handle events without a branch stack
diff --git a/tools/perf/util/cs-etm.h b/tools/perf/util/cs-etm.h
index 1f6ec7babe70..a14b0220eaa7 100644
--- a/tools/perf/util/cs-etm.h
+++ b/tools/perf/util/cs-etm.h
@@ -122,6 +122,7 @@ struct cs_etm_packet {
 	enum cs_etm_isa isa;
 	u64 start_addr;
 	u64 end_addr;
+	u64 timestamp;
 	u32 instr_count;
 	u32 last_instr_type;
 	u32 last_instr_subtype;
Thanks, Wojciech
With the following change, the time field is populated in the param_dict['sample']['time'] field (sample = param_dict['sample']):
diff --git a/arm-cs-trace-disasm.py b/arm-cs-trace-disasm.py
index 1239ab4b6a58..da5cb4d53b4a 100644
--- a/arm-cs-trace-disasm.py
+++ b/arm-cs-trace-disasm.py
@@ -232,4 +232,5 @@ def process_event(param_dict):
 		print "Address range [ %s .. %s ]: isn't in same dso" % (start_addr, stop_addr)
 		return

+	print('Timestamp 0x%lx' % sample['time'])
 	dump_disam(prev_dso, start_addr, stop_addr)
Then, applying the script to perf-script prints timestamps along with the disassembly:
root@zynq:~# perf script -s arm-cs-trace-disasm.py -- -d objdump | less
ARM CoreSight Trace Data Assembler Dump
kallsyms cannot be used to dump assembler
Timestamp 0x53e56a31002
0000000000400a80 <__libc_csu_init>:
  400a80: a9bc7bfd  stp  x29, x30, [sp, #-64]!
  400a84: 910003fd  mov  x29, sp
  400a88: a901d7f4  stp  x20, x21, [sp, #24]
  400a8c: 90000094  adrp x20, 410000 <__FRAME_END__+0xf3f0>
  400a90: 90000095  adrp x21, 410000 <__FRAME_END__+0xf3f0>
  400a94: 9137c294  add  x20, x20, #0xdf0
  400a98: 9137a2b5  add  x21, x21, #0xde8
  400a9c: a902dff6  stp  x22, x23, [sp, #40]
  400aa0: cb150294  sub  x20, x20, x21
  400aa4: f9001ff8  str  x24, [sp, #56]
  400aa8: 2a0003f6  mov  w22, w0
  400aac: aa0103f7  mov  x23, x1
  400ab0: 9343fe94  asr  x20, x20, #3
  400ab4: aa0203f8  mov  x24, x2
  400ab8: 97fffece  bl   4005f0 <_init>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a3100e
00000000004008dc <main>:
  4008dc: a9b37bfd  stp  x29, x30, [sp, #-208]!
  4008e0: 910003fd  mov  x29, sp
  4008e4: b9001fa0  str  w0, [x29, #28]
  4008e8: f9000ba1  str  x1, [x29, #16]
  4008ec: f9400ba0  ldr  x0, [x29, #16]
  4008f0: 91002000  add  x0, x0, #0x8
  4008f4: f9400001  ldr  x1, [x0]
  4008f8: f9400ba0  ldr  x0, [x29, #16]
  4008fc: 91004000  add  x0, x0, #0x10
  400900: f9400000  ldr  x0, [x0]
  400904: aa0003e2  mov  x2, x0
  400908: b9401fa0  ldr  w0, [x29, #28]
  40090c: 97ffffde  bl   400884 <opt_validate>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a31924
0000000000400910 <main+0x34>:
  400910: 7100001f  cmp  w0, #0x0
  400914: 540000c0  b.eq 40092c <main+0x50>  // b.none
Timestamp 0x53e56a31924
000000000040092c <main+0x50>:
  40092c: f9400ba0  ldr  x0, [x29, #16]
  400930: 91004000  add  x0, x0, #0x10
  400934: f9400000  ldr  x0, [x0]
  400938: 39400000  ldrb w0, [x0]
  40093c: 7101941f  cmp  w0, #0x65
  400940: 1a9f17e0  cset w0, eq  // eq = none
  400944: 12001c00  and  w0, w0, #0xff
  400948: b9002fa0  str  w0, [x29, #44]
  40094c: f9400ba0  ldr  x0, [x29, #16]
  400950: 91002000  add  x0, x0, #0x8
I need to investigate it further to make sure getting timestamps from this source is a good idea - I'm not convinced that the timestamp, kept as a packet queue parameter, is refreshed frequently enough to keep up with the timestamp packets actually emitted.
Look at [2] above, it should give you what you need.
Then, I think, it should be possible to generate samples (not sure what type though, perhaps not 'branch' this time) for timestamp/cycacc packets, analogously to what has been done for TRACE_ON https://lists.linaro.org/pipermail/coresight/2018-May/001327.html and then expose it in the python interface.
I'd be grateful for any opinion about this idea, especially about the usefulness of such a feature for the general audience, as well as any possible compatibility issues. If you are aware of another approach to achieve timestamp correlation with branch samples, it would also be very welcome.
Could you review Mathieu's patches 'perf tools: Configure timestamp generation in CPU-wide mode' [1] and 'coresight: etm4x: Configure tracers to emit timestamps' [2]?
Thanks for pointing them out. I did review them; however, they didn't help much. They are pretty much just the configuration of the timestamping module, which works well for me.
You are correct, they just deal with the mechanic of enabling the generation of timestamp packets. How I use them is in [2] above.
I hope the idea is not completely pointless. I'm still making my way through the perf subsystem, so I might have missed some crucial details.
I don't think the idea is pointless. You simply have a new use case and from my point of view it is enough to receive consideration.
The idea looks good to me. Actually I am very curious about the question below:
What's the concrete benefit we can get from enabling timestamps for CoreSight branch events that cannot be fulfilled by Perf's cpu-clock/task-clock events and the PMU cpu-cycle event?
I tried to research cpu-clock and task-clock and it looks like they are based on the wall clock, while CS timestamping is CPU-independent. Measurement with CS may help to narrow down instruction stalls, which, I believe, would be hidden otherwise.
The CPU cycle count seems like a good measurement in this case, but I'm not sure if correlating PMU events with a specific instruction range wouldn't be harder than extracting the timestamps we already have in the stream.
Best regards, Wojciech
Thanks, Leo Yan
[1] https://lists.linaro.org/pipermail/coresight/2019-March/002259.html [2] https://lists.linaro.org/pipermail/coresight/2019-March/002232.html
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
On Tue, Apr 30, 2019 at 04:43:03PM -0600, Mathieu Poirier wrote:
[...]
Great inspiration comes from this patch set by Leo Yan, especially from
the disassembly script:
https://lists.linaro.org/pipermail/coresight/2018-May/001325.html
The related work was initiated by Tor (TI) and Mathieu :)
This was a very long time ago. The arm-cs-trace-disasm.py may be used for reference purposes but nothing more and it would certainly not work nowadays.
Could you please elaborate some more? I've already noticed that this script is not going to be upstreamed, but I'm still happy I managed to dig it out on the mailing list. It works for me (not perfectly and quite slow, but still).
Given the serious amount of code influx in the perf tools subsystem, coupled with the fact that we haven't maintained that script for years, I was expecting it to blow up on you. You are quite lucky that it does anything for you.
Having a simple way to reconstruct the actual program flow is a pretty neat feature. I know that for Intel PT, perf has xed integration to produce disassembly. Are there any plans to come up with an official way to do it for CoreSight? Or is it just left for the users to come up with their scripts?
I don't know of any plans to work on this - Tor came up with the script to highlight the potential and power yielded by the "perf script" command. From there I endeavoured to keep making it work with each new kernel release but it quickly got to consume too much time.
Just more input for this :)
I also tried to upstream the python script (arm-cs-trace-disasm.py, which is mainly based on Tor and Mathieu's patch) last year, but in the end we put the related work on hold. I intended to use this script for the CoreSight kdump verification. IIRC, the main reason for stopping upstreaming this patch is that we didn't have much confidence in the script working with CPU-wide trace mode.
Since Mathieu's patches for CPU-wide trace have now almost entirely landed in the mainline kernel, maybe now is a good time to restart this work and upstream it to the mainline kernel? Thus we could have feature parity with Intel PT's script [1].
What do you think about this?
Thanks, Leo Yan
[1] $LINUX/tools/perf/scripts/python/intel-pt-events.py
On Tue, 30 Apr 2019 at 18:42, Leo Yan leo.yan@linaro.org wrote:
On Tue, Apr 30, 2019 at 04:43:03PM -0600, Mathieu Poirier wrote:
[...]
> Great inspiration comes from this patch set by Leo Yan, especially > from the disassembly script: > https://lists.linaro.org/pipermail/coresight/2018-May/001325.html
The related work was initiated by Tor (TI) and Mathieu :)
This was a very long time ago. The arm-cs-trace-disasm.py may be used for reference purposes but nothing more and it would certainly not work nowadays.
Could you please elaborate some more? I've already noticed that this script is not going to be upstreamed, but I'm still happy I managed to dig it out on the mailing list. It works for me (not perfectly and quite slow, but still).
Given the serious amount of code influx in the perf tools subsystem, coupled with the fact that we haven't maintained that script for years, I was expecting it to blow up on you. You are quite lucky that it does anything for you.
Having a simple way to reconstruct the actual program flow is a pretty neat feature. I know that for Intel PT, perf has xed integration to produce disassembly. Are there any plans to come up with an official way to do it for CoreSight? Or is it just left for the users to come up with their scripts?
I don't know of any plans to work on this - Tor came up with the script to highlight the potential and power yielded by the "perf script" command. From there I endeavoured to keep making it work with each new kernel release but it quickly got to consume too much time.
Just more input for this :)
I also tried to upstream the python script (arm-cs-trace-disasm.py, which is mainly based on Tor and Mathieu's patch) last year, but in the end we put the related work on hold. I intended to use this script for the CoreSight kdump verification. IIRC, the main reason for stopping upstreaming this patch is that we didn't have much confidence in the script working with CPU-wide trace mode.
Since Mathieu's patches for CPU-wide trace have now almost entirely landed in the mainline kernel, maybe now is a good time to restart this work and upstream it to the mainline kernel? Thus we could have feature parity with Intel PT's script [1].
What do you think about this?
Yes, this is something that could be explored. But before rushing into things I would like to wait until the user space portion of CPU-wide support has been merged. Otherwise it is difficult for people to get all the right pieces in place to try things out. I would also like to give priority to the work on testing that was recently started. I think the motion is interesting and would be disappointed to see it go to waste.
Mathieu
Thanks, Leo Yan
[1] $LINUX/tools/perf/scripts/python/intel-pt-events.py
Hello,
On 30/04/2019, 18:48, "Al Grant" Al.Grant@arm.com wrote:
That may be the missing piece. Currently my understanding is that the timestamp generator is clocked at F [Hz], which means the value in its output register increments F times per second. This is effectively limited by how fast we sample this register and how many of these values actually end up in the sink and later in perf.data. Apart from that, I don't see other requirements w.r.t. the clock. Could you please confirm my understanding?
The timestamp generator is global (one per chip) and runs at a fixed frequency. It might be possible to change this but it would be more of a BIOS thing. Definitely not something you'd do in response to a perf option.
Thanks Al. Sorry, I was not quite clear here. What I wanted to understand was: how exactly does the timestamp generator increment its counter? I guess that it is F times per second, where F is the frequency of TSGEN's clock, but I may be wrong.
On 01/05/2019, 00:43, "Mathieu Poirier" mathieu.poirier@linaro.org wrote:
I came up with another approach and it seems to work fine. I extended packet with the timestamp field, so each branch-related packet is timestamped with the last observed timestamp value. Then, each sample may have this timestamp as well.
I'm still not quite sure whether timestamps require some clever adjustments, since branch samples are composed from two consecutive packets, and I simply assign one of them to the sample. As you suggested, I tried to understand your patch [2], but I don't quite get the relation of timestamp, next_timestamp and adjusting them by instr_count.
A typical trace will look as follows:

Sync_packet
Range_packet_A(3 instructions)
Range_packet_B(2 instructions)
Timestamp(x)
Range_packet_C(5 instructions)
Range_packet_D(4 instructions)
Range_packet_E(2 instructions)
Timestamp(y)
...
As I mentioned above, timestamps are generated for instructions that have *already* executed. So timestamp X was generated after the instructions in range packets A and B have executed. The best we can do is estimate the execution time with one clock cycle per instruction. Given the above we can estimate that range packet A started at time X - 5 and range packet B at X - 2.
Once we have received a timestamp we can anchor the next range packets on it. So range packet C started at time X, range packet D started at X + 5 and range packet E at X + 9. As soon as a new timestamp is received, i.e. Y, we can readjust the hard timestamp to avoid incurring too much deviation.
Thanks for this great write-up. So, concatenating this as well as Al's suggestions, I think what needs to be done is the following:
1. When a timestamp packet is encountered, get the packet queue and iterate backwards, until the previous timestamp packet or a discontinuity occurs.
2. For each encountered range/exception/exception_ret packet:
- if this packet happened immediately before the timestamp packet, mark it with the value of this timestamp, decremented by the number of instructions executed in this range;
- if this is a subsequent non-TS packet, mark it with the timestamp value of the previous non-TS packet, decremented further in the same fashion.
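[Editor's note: the backward pass in steps 1-2 could look roughly like this. The Python is purely illustrative — the packet representation and function name are made up (the real change belongs in the C decoder), and the per-instruction decrement is only the one-cycle-per-instruction approximation this thread discusses (Al questions it further below).]

```python
def backfill_timestamps(pending, ts_value):
    """Walk the not-yet-timestamped packets backwards from a newly seen
    timestamp packet, estimating each packet's start time by decrementing
    the timestamp by that packet's instruction count (rough approximation:
    one clock cycle per instruction).

    'pending' is a hypothetical list of dicts with an 'instr_count' key,
    oldest packet first; each dict gets a 'timestamp' key assigned.
    """
    t = ts_value
    for pkt in reversed(pending):
        t -= pkt["instr_count"]
        pkt["timestamp"] = t
    return pending

# Two ranges (3 and 2 instructions) followed by a timestamp of 100:
# the later range is estimated to start at 98, the earlier one at 95.
```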
BTW, is the timestamp unit a clock cycle? It seems so, since we decrement it by the instruction count with the assumption that one instruction lasts one clock cycle. If so, what would be the difference between cycacc packets and timestamp packets?
Thank you and best regards, Wojciech
Thanks Al. Sorry, I was not quite clear here. What I wanted to understand was: how exactly does the timestamp generator increment its counter? I guess that it is F times per second, where F is the frequency of TSGEN's clock, but I may be wrong.
Generally yes, though this will be a fixed frequency system clock which may be independent of the CPU clock. Register 0x020 in TSGEN holds the frequency, but I believe this is simply a place for software to make a note of the value in case other software wants to know. I.e. it's zero unless something writes it, and writing it has no effect on the frequency.
Whoever first enables the timestamp, if it finds register 0x020 is zero, ought to measure the frequency and make a note of it.
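[Editor's note: once the TSGEN frequency is known — whether read from register 0x020 or measured as Al suggests — converting timestamp deltas to wall-clock time is simple scaling. A sketch, assuming the counter increments once per TSGEN clock cycle; the function name and integer-nanosecond convention are the editor's, not from the thread.]

```python
def ts_delta_ns(ts_start, ts_end, tsgen_hz):
    """Convert a CoreSight timestamp delta into nanoseconds, assuming
    the TSGEN counter increments once per cycle of a tsgen_hz clock."""
    return (ts_end - ts_start) * 1_000_000_000 // tsgen_hz

# With a 100MHz timestamp generator, each count is 10ns,
# so a delta of 50 counts corresponds to 500ns.
```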
Thanks for this great write-up. So, concatenating this as well as Al's suggestions, I think what needs to be done is the following:
1. When a timestamp packet is encountered, get the packet queue and iterate backwards, until the previous timestamp packet or a discontinuity occurs.
2. For each encountered range/exception/exception_ret packet:
- if this packet happened immediately before the timestamp packet, mark it with the value of this timestamp, decremented by the number of instructions executed in this range;
I don't think you need to decrement. The TS packet applies to the most recent branch or exception. Also, decrementing by number of instructions is definitely wrong as there is little relation between instruction counts and timing of any sort.
- if this is a subsequent non-TS packet, mark it with the timestamp value of the previous non-TS packet, decremented further in the same fashion.
BTW, is timestamp unit a clock cycle? It seems so, since we decrement it by instruction count with assumption that one instruction lasts one clock cycle. If so, what would be the difference between cycacc packets and timestamp packets?
A timestamp unit is a system clock cycle of some kind.
Any system bigger than a small microcontroller would typically have several clocks. There will be the system interconnect clock, generally fixed frequency. CPUs may run at a variable ratio to the interconnect clock and it may be possible to vary them independently of each other. In addition to this, there is the system generic timer/counter (which Linux uses for timing), and the CoreSight timestamp generator, which both run at yet another fixed frequency. So you might have
- system interconnect at 1GHz
- one CPU varying between 1.5GHz and 3GHz
- another CPU varying between 500MHz and 2GHz
- system generic timer/counter at 40MHz
- CoreSight timestamp generator at 100MHz
You are likely to see some convergence between the system generic timer, and the CoreSight timestamp generator. It's always been a bit of a bugbear that across our various logs and traces, we have these two different timebases. On the other hand, there's something to be said for having a debug-related timestamp that's protected against the sort of adjustment Linux might make to the system timer.
On top of all this lot, there is very little relation between instruction count and cycle count. At best the core will execute 3 or 4 instructions per cycle, at worst it will take tens or even hundreds of cycles to do one instruction. The actual time at which an instruction does its work may be quite loosely related to where you see it in the trace - you can see this if you use the ETM Event packets to instrument performance events, as you may get an event relating to an instruction when you haven't even had the branch atom that leads up to the instruction. Cycle counts in traces can help us understand how long instructions are taking.
Al
Thank you and best regards, Wojciech
On Wed, 1 May 2019 at 17:52, Wojciech Żmuda wzmuda@n7space.com wrote:
Hello,
On 30/04/2019, 18:48, "Al Grant" Al.Grant@arm.com wrote:
That may be the missing piece. Currently my understanding is that the timestamp generator is clocked at F [Hz], which means the value in its output register increments F times per second. This is effectively limited by how fast we sample this register and how many of these values actually end up in the sink and later in perf.data. Apart from that, I don't see other requirements w.r.t. the clock. Could you please confirm my understanding?
The timestamp generator is global (one per chip) and runs at a fixed frequency. It might be possible to change this but it would be more of a BIOS thing. Definitely not something you'd do in response to a perf option.
Thanks Al. Sorry, I was not quite clear here. What I wanted to understand was: how exactly does the timestamp generator increment its counter? I guess that it is F times per second, where F is the frequency of TSGEN's clock, but I may be wrong.
On 01/05/2019, 00:43, "Mathieu Poirier" mathieu.poirier@linaro.org wrote:
I came up with another approach and it seems to work fine. I extended packet with the timestamp field, so each branch-related packet is timestamped with the last observed timestamp value. Then, each sample may have this timestamp as well.
I'm still not quite sure whether timestamps require some clever adjustments, since branch samples are composed from two consecutive packets, and I simply assign one of them to the sample. As you suggested, I tried to understand your patch [2], but I don't quite get the relation of timestamp, next_timestamp and adjusting them by instr_count.
A typical trace will look as follows:

Sync_packet
Range_packet_A(3 instructions)
Range_packet_B(2 instructions)
Timestamp(x)
Range_packet_C(5 instructions)
Range_packet_D(4 instructions)
Range_packet_E(2 instructions)
Timestamp(y)
...
As I mentioned above, timestamps are generated for instructions that have *already* executed. So timestamp X was generated after the instructions in range packets A and B have executed. The best we can do is estimate the execution time with one clock cycle per instruction. Given the above we can estimate that range packet A started at time X - 5 and range packet B at X - 2.
Once we have received a timestamp we can anchor the next range packets on it. So range packet C started at time X, range packet D started at X + 5 and range packet E at X + 9. As soon as a new timestamp is received, i.e. Y, we can readjust the hard timestamp to avoid incurring too much deviation.
Thanks for this great write-up. So, concatenating this as well as Al's suggestions, I think what needs to be done is the following:
1. When a timestamp packet is encountered, get the packet queue and iterate backwards, until the previous timestamp packet or a discontinuity occurs.
2. For each encountered range/exception/exception_ret packet:
- if this packet happened immediately before the timestamp packet, mark it with the value of this timestamp, decremented by the number of instructions executed in this range;
- if this is a subsequent non-TS packet, mark it with the timestamp value of the previous non-TS packet, decremented further in the same fashion.
This seems very similar to what is currently done to correlate traces from different processors in the CPU-wide implementation. Without a patch to look at I can't really say much more.
BTW, is the timestamp unit a clock cycle? It seems so, since we decrement it by the instruction count with the assumption that one instruction lasts one clock cycle. If so, what would be the difference between cycacc packets and timestamp packets?
Thank you and best regards, Wojciech
Hello,
On 02/05/2019, 16:33, "Mathieu Poirier" mathieu.poirier@linaro.org wrote:
Thanks for this great write-up. So, concatenating this as well as Al's suggestions, I think what needs to be done is the following:
1. When a timestamp packet is encountered, get the packet queue and iterate backwards, until the previous timestamp packet or a discontinuity occurs.
2. For each encountered range/exception/exception_ret packet:
- if this packet happened immediately before the timestamp packet, mark it with the value of this timestamp, decremented by the number of instructions executed in this range;
- if this is a subsequent non-TS packet, mark it with the timestamp value of the previous non-TS packet, decremented further in the same fashion.
This seems very similar to what is currently done to correlate traces from different processors in the CPU-wide implementation. Without a patch to look at I can't really say much more.
Mathieu, thank you very much for your suggestions.
I tried to implement steps quoted above. The result is still not perfect. Could you please take a look at the code below? Perhaps you can notice if I overlooked something obvious.
One problem I cannot solve is that I still see samples that are not timestamped. The majority of samples get the timestamp, but several always end up with the value 0x0 (or another value, if as an experiment I initialize the timestamp field differently). It looks like some packets that are later turned into samples do not hit my modified code path, but I cannot figure out where they come from. I try to timestamp everything, including discontinuities, since they are also converted into samples. I'm testing in per-thread mode, if that may be a clue.
Also, I wonder if my change in cs_etm__synth_branch_sample() is correct. I know that a sample is composed of two consecutive packets. If I understand correctly, tidq->prev_packet in this context refers to the packet previous w.r.t. the slot in the queue, but not the one that came earlier in time. I chose tidq->packet instead of tidq->prev_packet to get the earlier timestamp, which would be the approximate timestamp of the first instruction in the range. I hope that makes sense.
Thank you very much, Wojciech
From a7008abcceaf6489dfac6e91fb8bae5ba6880dca Mon Sep 17 00:00:00 2001
From: Wojciech Zmuda <wzmuda@n7space.com>
Date: Tue, 30 Apr 2019 15:11:19 +0200
Subject: [PATCH] perf tools: associate samples with timestamp

---
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 31 +++++++++++++++++++++++++
 tools/perf/util/cs-etm.c                        |  3 +++
 tools/perf/util/cs-etm.h                        |  3 +++
 3 files changed, 37 insertions(+)

diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
index 33e975c8d11b..cd3b1d9dbb54 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
@@ -269,6 +269,31 @@ cs_etm_decoder__create_etm_packet_printer(struct cs_etm_trace_params *t_params,
 					     trace_config);
 }

+static void
+cs_etm_decoder__timestamp_previous_packets(struct cs_etm_packet_queue *packet_queue,
+					   const ocsd_generic_trace_elem *timestamp_elem)
+{
+	u32 count, packet_id;
+	struct cs_etm_packet *packet, *prev_packet;
+
+	count = packet_queue->timestamp_pending_count;
+	prev_packet = NULL;
+	while (count--) {
+		packet_id = packet_queue->timestamp_pending[count];
+		packet = &packet_queue->packet_buffer[packet_id];
+
+		/* TODO: find better approximation of the number of instructions per cycle. */
+		if (!prev_packet)
+			packet->timestamp = timestamp_elem->timestamp - packet->instr_count;
+		else
+			packet->timestamp = prev_packet->timestamp - packet->instr_count;
+
+		prev_packet = packet;
+	}
+
+	packet_queue->timestamp_pending_count = 0;
+}
+
 static ocsd_datapath_resp_t
 cs_etm_decoder__do_soft_timestamp(struct cs_etm_queue *etmq,
 				  struct cs_etm_packet_queue *packet_queue,
@@ -302,6 +327,8 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
 	if (!packet_queue)
 		return OCSD_RESP_FATAL_SYS_ERR;

+	cs_etm_decoder__timestamp_previous_packets(packet_queue, elem);
+
 	/*
 	 * We've seen a timestamp packet before - simply record the new value.
 	 * Function do_soft_timestamp() will report the value to the front end,
@@ -372,6 +399,10 @@ cs_etm_decoder__buffer_packet(struct cs_etm_packet_queue *packet_queue,
 	packet_queue->packet_buffer[et].flags = 0;
 	packet_queue->packet_buffer[et].exception_number = UINT32_MAX;
 	packet_queue->packet_buffer[et].trace_chan_id = trace_chan_id;
+	packet_queue->packet_buffer[et].timestamp = 0;
+
+	if (sample_type != CS_ETM_EMPTY)
+		packet_queue->timestamp_pending[packet_queue->timestamp_pending_count++] = et;

 	if (packet_queue->packet_count == CS_ETM_PACKET_MAX_BUFFER - 1)
 		return OCSD_RESP_WAIT;
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..1f1ef69ff06c 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -190,6 +190,7 @@ static void cs_etm__clear_packet_queue(struct cs_etm_packet_queue *queue)
 	queue->head = 0;
 	queue->tail = 0;
 	queue->packet_count = 0;
+	queue->timestamp_pending_count = 0;
 	for (i = 0; i < CS_ETM_PACKET_MAX_BUFFER; i++) {
 		queue->packet_buffer[i].isa = CS_ETM_ISA_UNKNOWN;
 		queue->packet_buffer[i].start_addr = CS_ETM_INVAL_ADDR;
@@ -204,6 +205,7 @@ static void cs_etm__clear_packet_queue(struct cs_etm_packet_queue *queue)
 		queue->packet_buffer[i].exception_number = UINT32_MAX;
 		queue->packet_buffer[i].trace_chan_id = UINT8_MAX;
 		queue->packet_buffer[i].cpu = INT_MIN;
+		queue->packet_buffer[i].timestamp = 0;
 	}
 }

@@ -1157,6 +1159,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet->timestamp;

 	/*
 	 * perf report cannot handle events without a branch stack
diff --git a/tools/perf/util/cs-etm.h b/tools/perf/util/cs-etm.h
index 1f6ec7babe70..c67d47654c8d 100644
--- a/tools/perf/util/cs-etm.h
+++ b/tools/perf/util/cs-etm.h
@@ -122,6 +122,7 @@ struct cs_etm_packet {
 	enum cs_etm_isa isa;
 	u64 start_addr;
 	u64 end_addr;
+	u64 timestamp;
 	u32 instr_count;
 	u32 last_instr_type;
 	u32 last_instr_subtype;
@@ -154,6 +155,8 @@ struct cs_etm_packet_queue {
 	u32 instr_count;
 	u64 timestamp;
 	u64 next_timestamp;
+	u32 timestamp_pending_count;
+	u32 timestamp_pending[CS_ETM_PACKET_MAX_BUFFER];
 	struct cs_etm_packet packet_buffer[CS_ETM_PACKET_MAX_BUFFER];
 };
Hey Wojciech,
On Tue, 7 May 2019 at 15:51, Wojciech Żmuda <wzmuda@n7space.com> wrote:
Hello,
On 02/05/2019, 16:33, "Mathieu Poirier" <mathieu.poirier@linaro.org> wrote:
Thanks for this great write-up. So, combining this with Al's suggestions, I think what needs to be done is the following:
1. When a timestamp packet is encountered, get the packet queue and iterate backwards until the previous timestamp packet or a discontinuity occurs.
2. For each range/exception/exception_ret packet encountered:
   - if the packet happened immediately before the timestamp packet, mark it with the value of this timestamp, decremented by the number of instructions executed in that range;
   - if it is a subsequent non-TS packet, mark it with the timestamp value of the previous non-TS packet, decremented further in the same fashion.
This seems very similar to what is currently done to correlate traces from different processors in the CPU-wide implementation. Without a patch to look at I can't really say much more.
Mathieu, thank you very much for your suggestions.
I tried to implement the steps quoted above. The result is still not perfect. Could you please take a look at the code below? Perhaps you can spot whether I overlooked something obvious.
I am currently juggling 5 patchsets, and that is just my coresight work. I think your time will be much better spent trying to figure out what is going on than waiting for me to take a look at it. This is user space, and trace data yields consistent results across runs. Pinpoint a sample that hasn't been timestamped and work your way back from there. That is the best advice I can give you.
Mathieu
One problem I cannot solve is that I still see samples that are not timestamped. The majority of samples get a timestamp, but several always end up with the value 0x0 (or another value, if as an experiment I initialize the timestamp field differently). It looks like some packets that are later turned into samples do not hit my modified code path, but I cannot figure out where they come from. I try to timestamp everything, including discontinuity packets, since those are also converted into samples. I'm testing in per-thread mode, in case that is a clue.
Also, I wonder if my change in cs_etm__synth_branch_sample() is correct. I know that a sample is composed of two consecutive packets. If I understand correctly, tidq->prev_packet in this context refers to the packet that is previous w.r.t. the slot in the queue, but not the one that came earlier in time. I chose tidq->packet instead of tidq->prev_packet to get the earlier timestamp, which would be the approximate timestamp of the first instruction in the range. I hope that makes sense.
Thank you very much, Wojciech
From a7008abcceaf6489dfac6e91fb8bae5ba6880dca Mon Sep 17 00:00:00 2001
From: Wojciech Zmuda <wzmuda@n7space.com>
Date: Tue, 30 Apr 2019 15:11:19 +0200
Subject: [PATCH] perf tools: associate samples with timestamp

---
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 31 +++++++++++++++++++++++++
 tools/perf/util/cs-etm.c                        |  3 +++
 tools/perf/util/cs-etm.h                        |  3 +++
 3 files changed, 37 insertions(+)

diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
index 33e975c8d11b..cd3b1d9dbb54 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
@@ -269,6 +269,31 @@ cs_etm_decoder__create_etm_packet_printer(struct cs_etm_trace_params *t_params,
 			  trace_config);
 }
 
+static void
+cs_etm_decoder__timestamp_previous_packets(struct cs_etm_packet_queue *packet_queue,
+					   const ocsd_generic_trace_elem *timestamp_elem)
+{
+	u32 count, packet_id;
+	struct cs_etm_packet *packet, *prev_packet;
+
+	count = packet_queue->timestamp_pending_count;
+	prev_packet = NULL;
+	while (count--) {
+		packet_id = packet_queue->timestamp_pending[count];
+		packet = &packet_queue->packet_buffer[packet_id];
+
+		/* TODO: find better approximation of the number of instructions per cycle. */
+		if (!prev_packet)
+			packet->timestamp = timestamp_elem->timestamp - packet->instr_count;
+		else
+			packet->timestamp = prev_packet->timestamp - packet->instr_count;
+
+		prev_packet = packet;
+	}
+
+	packet_queue->timestamp_pending_count = 0;
+}
+
 static ocsd_datapath_resp_t
 cs_etm_decoder__do_soft_timestamp(struct cs_etm_queue *etmq,
 				  struct cs_etm_packet_queue *packet_queue,
@@ -302,6 +327,8 @@ cs_etm_decoder__do_hard_timestamp(struct cs_etm_queue *etmq,
 	if (!packet_queue)
 		return OCSD_RESP_FATAL_SYS_ERR;
 
+	cs_etm_decoder__timestamp_previous_packets(packet_queue, elem);
+
 	/*
 	 * We've seen a timestamp packet before - simply record the new value.
 	 * Function do_soft_timestamp() will report the value to the front end,
@@ -372,6 +399,10 @@ cs_etm_decoder__buffer_packet(struct cs_etm_packet_queue *packet_queue,
 	packet_queue->packet_buffer[et].flags = 0;
 	packet_queue->packet_buffer[et].exception_number = UINT32_MAX;
 	packet_queue->packet_buffer[et].trace_chan_id = trace_chan_id;
+	packet_queue->packet_buffer[et].timestamp = 0;
+
+	if (sample_type != CS_ETM_EMPTY)
+		packet_queue->timestamp_pending[packet_queue->timestamp_pending_count++] = et;
 
 	if (packet_queue->packet_count == CS_ETM_PACKET_MAX_BUFFER - 1)
 		return OCSD_RESP_WAIT;
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..1f1ef69ff06c 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -190,6 +190,7 @@ static void cs_etm__clear_packet_queue(struct cs_etm_packet_queue *queue)
 	queue->head = 0;
 	queue->tail = 0;
 	queue->packet_count = 0;
+	queue->timestamp_pending_count = 0;
 	for (i = 0; i < CS_ETM_PACKET_MAX_BUFFER; i++) {
 		queue->packet_buffer[i].isa = CS_ETM_ISA_UNKNOWN;
 		queue->packet_buffer[i].start_addr = CS_ETM_INVAL_ADDR;
@@ -204,6 +205,7 @@ static void cs_etm__clear_packet_queue(struct cs_etm_packet_queue *queue)
 		queue->packet_buffer[i].exception_number = UINT32_MAX;
 		queue->packet_buffer[i].trace_chan_id = UINT8_MAX;
 		queue->packet_buffer[i].cpu = INT_MIN;
+		queue->packet_buffer[i].timestamp = 0;
 	}
 }
@@ -1157,6 +1159,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet->timestamp;
 
 	/*
 	 * perf report cannot handle events without a branch stack
diff --git a/tools/perf/util/cs-etm.h b/tools/perf/util/cs-etm.h
index 1f6ec7babe70..c67d47654c8d 100644
--- a/tools/perf/util/cs-etm.h
+++ b/tools/perf/util/cs-etm.h
@@ -122,6 +122,7 @@ struct cs_etm_packet {
 	enum cs_etm_isa isa;
 	u64 start_addr;
 	u64 end_addr;
+	u64 timestamp;
 	u32 instr_count;
 	u32 last_instr_type;
 	u32 last_instr_subtype;
@@ -154,6 +155,8 @@ struct cs_etm_packet_queue {
 	u32 instr_count;
 	u64 timestamp;
 	u64 next_timestamp;
+	u32 timestamp_pending_count;
+	u32 timestamp_pending[CS_ETM_PACKET_MAX_BUFFER];
 	struct cs_etm_packet packet_buffer[CS_ETM_PACKET_MAX_BUFFER];
 };
Hi Wojciech,
On Mon, Apr 29, 2019 at 05:24:47PM +0000, Wojciech Żmuda wrote:
[...]
Analyzing this, I learned that perf-script is capable of understanding perf.data AUXTRACE section and parsing some of the trace elements to
branch samples, which illustrate how the IP moved around.
These pieces of information are available for the built-in python interpreter, so we can script it to get assembly from the program image.
If I understand perf-script in its current shape correctly, it ignores all the non-branching events (so everything that's not an ATOM, EXCEPTION or TRACE_ON packet) - specifically, timestamping is lost during the process. I'd like to modify perf-script to generate samples on such timing events,
The CoreSight trace data is saved into perf.data (it is compressed data), and we need to use OpenCSD to decode the trace data and output different kinds of packets.
Based on these packets, perf-script (and perf-report) can generate branch samples; it can also generate instruction samples and last-branch-stack samples if we specify the flags 'i' and 'l' for the option '--itrace'. By default, perf-script will only generate branch samples if we don't specify any flags for '--itrace'.
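For reference, the different sample types are requested roughly like this (a sketch; the sink name @tmc_etf0 in the event specification is board-specific and just an example here):

```shell
# Record a CoreSight trace; the @tmc_etf0 sink name differs per board.
perf record -e cs_etm/@tmc_etf0/u --per-thread ./my_app

# Default decode: only branch samples are synthesized.
perf script

# Ask for instruction ('i') and last-branch-stack ('l') samples too.
perf script --itrace=il
```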
Looking at the disassembly python script, I thought branch samples were enough to reconstruct the program flow. Since it is possible to get instruction samples as well (with --itrace=il - skipping 'l' here gives me a segfault), do you see any use for this sample type? It was not mentioned in the Linux/CoreSight documents I've read, so I'm not quite sure how I can use this feature.
For the segmentation fault, please see the patches [1].
For the '--itrace' use case, as far as I know, Mike/Rob worked on it for AutoFDO optimization of programs. For this part, you could refer to Mike's slides [2] on the advanced AutoFDO use case; my slides [3] give a very brief introduction (starting from page 26).
[1] https://archive.armlinux.org.uk/lurker/message/20190428.083227.a35c261b.en.h...
[2] https://connect.linaro.org/resources/yvr18/sessions/yvr18-225/
[3] https://s3.amazonaws.com/connect.linaro.org/yvr18/presentations/yvr18-416.pd...
[...]
If we use the command 'perf script -F time', it should output samples with a timestamp field. But from my testing this command fails; I am not sure whether that is caused by the issue you mentioned, that timestamping is lost during the process. If it is, how about fixing the issue for 'perf script -F time'?
I confirm that '-F time' does not work. This does not seem odd, since the time field is empty, but I don't understand the error. Although perf-script generates branch samples, it complains about 'dummy' samples:
# perf script -F time
Samples for 'dummy:u' event do not have TIME attribute set. Cannot print 'time' field.
I browsed the sample generation code (cs-etm.c) and I can't see code producing this type of sample. Anyway, I think this may be a bug in the printing part of perf-script, since I actually managed to populate the time field and access it in python (see below).
You could first check the time field with other events (e.g. the PMU cycle event). If it works for other events, then it seems to be an issue in CoreSight sample generation rather than in the perf-script common code.
[...]
I need to investigate further to make sure getting timestamps from this source is a good idea - I'm not convinced that the timestamp, as a packet queue parameter, is refreshed frequently enough to keep up with the timestamp packets actually emitted.
Thanks for digging into this. I will test once you work out formal patches.
[...]
What benefit do we get from enabling timestamps for CoreSight branch events that cannot be fulfilled by Perf's cpu-clock/task-clock events and the PMU cpu-cycle event?
I tried to research cpu-clock and task-clock, and it looks like they are based on the wall clock, while CS timestamping is CPU-independent. Measurement with CS may help to narrow down instruction stalls which, I believe, would otherwise be hidden. The CPU-cycle event seems like a good measurement in this case, but I'm not sure correlating PMU events with a specific instruction range wouldn't be harder than extracting the timestamps we already have in the stream.
To be honest, I do not really understand CPU micro-architecture and pipelines (so I cannot give any useful suggestions for this part).
But if you can demonstrate CoreSight + timestamps for CPU micro-architecture performance profiling (e.g. finding instruction/data dependencies in the CPU's pipeline), it seems to me that would be a very cool use case :)
Thanks, Leo Yan