On Tue, 30 Apr 2019 at 10:22, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Mathieu,
-----Original Message-----
From: Mathieu Poirier mathieu.poirier@linaro.org
Sent: Monday, April 29, 2019 11:49 PM
To: Wojciech Żmuda wzmuda@n7space.com
Cc: Leo Yan leo.yan@linaro.org; coresight@lists.linaro.org; Michał Kurowski mkurowski@n7space.com; Michal Mosdorf mmosdorf@n7space.com
Subject: Re: [RFC] Trace-based execution time profiling (timestamp in perf-script?)
On Mon, 29 Apr 2019 at 11:24, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Leo,
-----Original Message-----
From: Leo Yan leo.yan@linaro.org
Sent: Monday, April 29, 2019 10:19 AM
To: Wojciech Żmuda wzmuda@n7space.com
Cc: coresight@lists.linaro.org; Michał Kurowski mkurowski@n7space.com; Michal Mosdorf mmosdorf@n7space.com
Subject: Re: [RFC] Trace-based execution time profiling (timestamp in perf-script?)
Hi Wojciech,
On Fri, Apr 26, 2019 at 10:36:37AM +0000, Wojciech Żmuda wrote:
Hello,
I'm trying to design a solution that uses CoreSight for measuring application execution time, with the granularity of specific instruction ranges.
I have an idea of how this may be achieved and I'd like to know your opinion.
Great inspiration comes from this patch set by Leo Yan, especially from
the disassembly script:
https://lists.linaro.org/pipermail/coresight/2018-May/001325.html
The related work was initiated by Tor (TI) and Mathieu :)
This was a very long time ago. The arm-cs-trace-disasm.py script may be used for reference purposes but nothing more, and it would certainly not work nowadays.
Could you please elaborate some more? I've already noticed that this script is not going to be upstreamed, but I'm still happy I managed to dig it out on the mailing list. It works for me (not perfectly and quite slowly, but still).
Given the serious amount of code influx in the perf tools subsystem, coupled with the fact that we haven't maintained that script for years, I was expecting it to blow up on you. You are quite lucky that it does anything for you.
Having a simple way to reconstruct the actual program flow is a pretty neat feature. I know that for Intel PT, perf has xed integration to produce disassembly. Are there any plans to come up with an official way to do it for CoreSight? Or is it just left for the users to come up with their scripts?
I don't know of any plans to work on this - Tor came up with the script to highlight the potential and power yielded by the "perf script" command. From there I endeavoured to keep making it work with each new kernel release, but it quickly came to consume too much time.
Analyzing this, I learned that perf-script is capable of understanding the perf.data AUXTRACE section and parsing some of the trace elements into branch samples, which illustrate how the IP moved around.
These pieces of information are available to the built-in Python interpreter, so we can script it to get assembly from the program image.
If I understand perf-script in its current shape correctly, it ignores all the non-branching events (so everything that's not an ATOM, EXCEPTION or TRACE_ON packet) - specifically, timestamping is lost during the process. I'd like to modify perf-script to generate samples on such timing events,
The CoreSight trace data is saved into perf.data (it is compressed data) and we need to use OpenCSD to decode the trace data and output the different kinds of packets.
Based on these packets, perf-script (and perf-report) can generate branch samples; it can also generate instruction samples and last branch stack samples if we specify the flags 'i' and 'l' for the '--itrace' option, but by default perf-script will only generate branch samples if we don't specify any flags for '--itrace'.
Looking at the disassembly Python script, I thought branch samples are enough to reconstruct the program flow. Since it is possible to get instruction samples as well (with --itrace=il - skipping 'l' here gives me a segfault), do you see any usage for this sample type? It was not mentioned in the Linux/CoreSight documents I've read, so I'm not quite sure how I can use this feature.
If we use the command 'perf script -F time' it should output samples with the timestamp field. But from my testing, this command fails; I am not sure if this is caused by the reason mentioned above, i.e. that the timestamping is lost during the process. If it is, how about fixing the issue for 'perf script -F time'?
This is the first time I see the "script -F time" option, but admittedly I never went looking for it. We never output time samples because we had no use for them. Doing so is probably not hard if there is already a packet type for timestamps.
Actually my current idea is to populate the 'time' field in branch samples, not to generate a separate sample for time. I'm not trying to be nit-picky here, just clarifying, so we're on the same page.
Ok
Basically, I think that taking these TIMESTAMP packets we have in the decoded output and correlating them with ATOM/EXCEPTION/some-other-flow-related packets will make it easy to programmatically measure execution time:
So, knowing this:
Idx:1802; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbed
Idx:1813; ID:14; I_ATOM_F1 : Atom format 1.; N
Idx:1814; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbef
Idx:1816; ID:14; I_ATOM_F1 : Atom format 1.; E
Idx:1817; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbf1
Idx:1819; ID:14; I_ATOM_F1 : Atom format 1.; N
Idx:1820; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbf2
Idx:1822; ID:14; I_ATOM_F1 : Atom format 1.; E
Idx:1824; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbf5
Idx:1826; ID:14; I_ATOM_F1 : Atom format 1.; N
Idx:1827; ID:14; I_TIMESTAMP : Timestamp.; Updated val = 0x1be8df9dbf6
and knowing how to turn ATOMs into program flow (which already comes with perf-script), I'd like to be able to say "execution of instruction block from 0xdeadbeef to 0xdeadffff started with timestamp of value 0xaaaa and lasted until value 0xbbbb, which means the execution lasted for approx. 1234 microseconds of real-life time", or something similar.
The thing to keep in mind here is that timestamps are emitted *after* traces have been generated - more on that below.
On the topic of timestamps, be mindful the clock used to generate timestamps is completely disjoint from the CPU clock and the system clock.
That may be the missing piece. Currently my understanding is that the timestamp generator is clocked at F [Hz], which means the value in its output register increments F times per second. This is effectively limited by how fast we sample this register and how many of these values actually end up in the sink and later in perf.data. Apart from that, I don't see other requirements w.r.t. the clock. Could you please confirm my understanding?
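Just to make my mental model concrete, here is a tiny sketch of the conversion I have in mind. The generator frequency below is purely a made-up, assumed value; on real hardware it would have to be obtained from the platform rather than from the trace itself.

TS_GEN_HZ = 100_000_000  # assumed 100 MHz timestamp generator (hypothetical value)

def delta_us(ts_start, ts_end, freq_hz=TS_GEN_HZ):
    # Convert the difference between two timestamp counter values to microseconds.
    return (ts_end - ts_start) / freq_hz * 1e6

# With two of the counter values from the decoded trace shown earlier:
print(delta_us(0x1be8df9dbed, 0x1be8df9dbf6))  # 9 ticks -> 0.09 us at the assumed 100 MHz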
The only thing we have access to is the information in the timestamp packets and, as Al pointed out, how many times they are generated. At this time the configuration is set to generate as many packets as possible.
I confirm that '-F time' does not work. This does not seem odd, since the time field is empty, but I don't understand the error. Despite perf-script generating branch samples, it complains about 'dummy' samples:

# perf script -F time
Samples for 'dummy:u' event do not have TIME attribute set. Cannot print 'time' field.
I browsed the sample generation code (cs-etm.c) and I can't see code producing this type of samples. Anyway, I think this may be a bug in the printing part of perf-script, since I actually managed to populate the time field and access it in Python (see below).
so later I can have them in between assembly instructions to calculate deltas and be able to tell either:
- how much time and/or how many CPU cycles have been spent between two arbitrary instructions (ideally; see the sketch below), or
- what instructions have been executed between timestamp T and T+1 (this seems to be more in line with how timestamping in CS works, I think).
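For the first case, the post-processing I have in mind could look roughly like this. The helper and its input format are hypothetical, not existing perf code: I assume the samples have already been flattened into (timestamp, start_addr, end_addr) tuples describing executed instruction ranges, the way the disassembly script derives ranges from consecutive branch samples.

def ticks_between(ranges, first_insn, last_insn):
    # Return the timestamp delta (in generator ticks) between the range
    # containing first_insn and the last range containing last_insn,
    # or None if either instruction was never seen in the trace.
    t_start = t_end = None
    for ts, start, end in ranges:
        if t_start is None and start <= first_insn < end:
            t_start = ts
        if start <= last_insn < end:
            t_end = ts
    if t_start is None or t_end is None:
        return None
    return t_end - t_start

# Made-up example values:
ranges = [(0x1be8df9dbed, 0x400a80, 0x400abc),
          (0x1be8df9dbf1, 0x4008dc, 0x400910)]
print(ticks_between(ranges, 0x400a84, 0x4008f0))  # -> 4 ticks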
A brief analysis of tools/perf/util/cs-etm.c and cs-etm-decoder/cs-etm-decoder.c suggests that timestamp events are not turned into packets, but merely recorded as a packet queue parameter (I'm not sure why this is needed, though). The cycacc event is not processed further at all, besides being later decoded to plaintext by OpenCSD. I think it may be worth giving them both a dedicated `enum cs_etm_sample_type` value and packet generator functions.
I'd like to leave this part for Mike/Mathieu for comments.
Wojciech, I have a new patchset waiting to be released that makes use of the cycle accurate information to correlate time between sample packets when working in CPU-wide scenarios [1]. It also rearranges a lot of the code you have been looking at. Patches prefixed with "perf tools:" will be sent out as soon as we have a 5.2-rc1.
[1]. https://git.linaro.org/people/mathieu.poirier/coresight.git (branch 5.1-rc3-cpu-wide-v3).
Mike, Mathieu, I'll be grateful if you joined this discussion and shared your knowledge. Thanks!
Meanwhile, my achievement today in this field is the discovery of the cs_etm__synth_branch_sample() and cs_etm__synth_instruction_sample() functions. They look like good places to populate the 'time' field.
With the following change it is indeed possible to get the timestamp value in the sample. While it is still not visible in 'perf script -F time' (because of the error mentioned above), it is enough to make it visible in Python:
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..34ca6749fc94 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1157,6 +1157,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet_queue.timestamp;
Hmm... Have a look at what I did in [2], found in the tree I indicated above. In there you will find an algorithm that uses timestamp packets to do pretty much what you want to do in the above code snippet. Mike and I have talked about it recently and concluded it was the best way to proceed.
[2]. 311207df4a71 perf tools: Add notion of time to decoding code
Thank you for this reference. Actually I'm working with v1 of this patch set on my devboard. I took a look at v3 and I don't see major differences around the commit you mentioned. Anyway, I examined it closely and noticed that my previous solution does not make much sense, since the decoder updates the timestamp value cached in the packet queue much more quickly than samples are produced, which resulted in samples having the same timestamp value over and over again (from the last decoding pass) until the queue is drained.
I'm glad you noticed this set. It introduces a lot of modifications to the perf tools and it would have been a shame for you to have to rebase all your code. You are also correct that regarding the user space solution, not much has changed between V1 and V3. The current solution strives to peg a timestamp to each range packet that is received, so that their execution can be correlated with range packets executed on other processors.
I came up with another approach and it seems to work fine. I extended the packet structure with a timestamp field, so each branch-related packet is timestamped with the last observed timestamp value. Then, each sample may have this timestamp as well.
I'm still not quite sure whether the timestamps require some clever adjustments, since branch samples are composed from two consecutive packets and I simply assign one of them to the sample. As you suggested, I tried to understand your patch [2], but I don't quite get the relation between timestamp, next_timestamp and adjusting them by instr_count.
A typical trace will look as follows:
Sync_packet
Range_packet_A(3 instructions)
Range_packet_B(2 instructions)
Timestamp(x)
Range_packet_C(5 instructions)
Range_packet_D(4 instructions)
Range_packet_E(2 instructions)
Timestamp(y)
...
...
As I mentioned above, timestamps are generated for instructions that have *already* executed. So timestamp X was generated after the instructions in range packets A and B have executed. The best we can do is estimate the execution time with one clock cycle per instruction. Given the above we can estimate that range packet A started at time X - 5 and range packet B at X - 2.
Once we have received a timestamp we can anchor the next range packets on it. So range packet C started at time X, range packet D started at X + 5 and range packet E at X + 9. As soon as a new timestamp is received, i.e. Y, we can readjust the hard timestamp to avoid incurring too much deviation.
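In pseudo-Python, the estimation described above could look like the sketch below. This is only an illustration of the idea (one tick per executed instruction, back-estimation before the first timestamp, re-anchoring on every new timestamp), not the actual perf implementation.

def anchor(packets):
    # packets: list of ('range', n_instructions) or ('timestamp', value)
    out = []          # (estimated_start, n_instructions) per range packet
    pending = []      # range packets seen before the first timestamp
    ts = None         # last timestamp seen (the "hard" anchor)
    offset = 0        # instructions executed since that timestamp
    for kind, value in packets:
        if kind == 'range':
            if ts is None:
                pending.append(value)          # no anchor yet, estimate later
            else:
                out.append((ts + offset, value))
                offset += value
        else:  # 'timestamp'
            if ts is None:
                # Back-estimate packets that executed before this timestamp.
                start = value - sum(pending)
                for n in pending:
                    out.append((start, n))
                    start += n
                pending = []
            ts, offset = value, 0              # readjust the hard anchor
    return out

pkts = [('range', 3), ('range', 2), ('timestamp', 100),
        ('range', 5), ('range', 4), ('range', 2), ('timestamp', 120)]
print(anchor(pkts))
# [(95, 3), (98, 2), (100, 5), (105, 4), (109, 2)]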
I hope this helps a little more. Let me know if there is something I wasn't clear on.
Mathieu
The change looks like this:
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
index 33e975c8d11b..1c052ae9aaaa 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
@@ -372,6 +372,7 @@ cs_etm_decoder__buffer_packet(struct cs_etm_packet_queue *packet_queue,
 	packet_queue->packet_buffer[et].flags = 0;
 	packet_queue->packet_buffer[et].exception_number = UINT32_MAX;
 	packet_queue->packet_buffer[et].trace_chan_id = trace_chan_id;
+	packet_queue->packet_buffer[et].timestamp = packet_queue->next_timestamp;

 	if (packet_queue->packet_count == CS_ETM_PACKET_MAX_BUFFER - 1)
 		return OCSD_RESP_WAIT;
@@ -424,6 +425,7 @@ cs_etm_decoder__buffer_range(struct cs_etm_queue *etmq,
 	case OCSD_INSTR_BR:
 	case OCSD_INSTR_BR_INDIRECT:
 		packet->last_instr_taken_branch = elem->last_instr_exec;
+		packet->timestamp = packet_queue->next_timestamp;
 		break;
 	case OCSD_INSTR_ISB:
 	case OCSD_INSTR_DSB_DMB:
@@ -483,6 +485,7 @@ cs_etm_decoder__buffer_exception(struct cs_etm_packet_queue *queue,

 	packet = &queue->packet_buffer[queue->tail];
 	packet->exception_number = elem->exception_number;
+	packet->timestamp = queue->next_timestamp;

 	return ret;
 }
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..aaffb9026af6 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1157,6 +1157,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet->timestamp;

 	/*
 	 * perf report cannot handle events without a branch stack
diff --git a/tools/perf/util/cs-etm.h b/tools/perf/util/cs-etm.h
index 1f6ec7babe70..a14b0220eaa7 100644
--- a/tools/perf/util/cs-etm.h
+++ b/tools/perf/util/cs-etm.h
@@ -122,6 +122,7 @@ struct cs_etm_packet {
 	enum cs_etm_isa isa;
 	u64 start_addr;
 	u64 end_addr;
+	u64 timestamp;
 	u32 instr_count;
 	u32 last_instr_type;
 	u32 last_instr_subtype;
Thanks, Wojciech
With the following change, the time field is populated in the param_dict['sample']['time'] field (sample = param_dict['sample']):
diff --git a/arm-cs-trace-disasm.py b/arm-cs-trace-disasm.py
index 1239ab4b6a58..da5cb4d53b4a 100644
--- a/arm-cs-trace-disasm.py
+++ b/arm-cs-trace-disasm.py
@@ -232,4 +232,5 @@ def process_event(param_dict):
 		print "Address range [ %s .. %s ]: isn't in same dso" % (start_addr, stop_addr)
 		return

+	print('Timestamp 0x%lx' % sample['time'])
 	dump_disam(prev_dso, start_addr, stop_addr)
Then, applying the script to perf-script prints timestamps along with
disassembly:
root@zynq:~# perf script -s arm-cs-trace-disasm.py -- -d objdump | less
ARM CoreSight Trace Data Assembler Dump
kallsyms cannot be used to dump assembler
Timestamp 0x53e56a31002
0000000000400a80 <__libc_csu_init>:
  400a80:  a9bc7bfd  stp  x29, x30, [sp, #-64]!
  400a84:  910003fd  mov  x29, sp
  400a88:  a901d7f4  stp  x20, x21, [sp, #24]
  400a8c:  90000094  adrp x20, 410000 <__FRAME_END__+0xf3f0>
  400a90:  90000095  adrp x21, 410000 <__FRAME_END__+0xf3f0>
  400a94:  9137c294  add  x20, x20, #0xdf0
  400a98:  9137a2b5  add  x21, x21, #0xde8
  400a9c:  a902dff6  stp  x22, x23, [sp, #40]
  400aa0:  cb150294  sub  x20, x20, x21
  400aa4:  f9001ff8  str  x24, [sp, #56]
  400aa8:  2a0003f6  mov  w22, w0
  400aac:  aa0103f7  mov  x23, x1
  400ab0:  9343fe94  asr  x20, x20, #3
  400ab4:  aa0203f8  mov  x24, x2
  400ab8:  97fffece  bl   4005f0 <_init>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a3100e
00000000004008dc <main>:
  4008dc:  a9b37bfd  stp  x29, x30, [sp, #-208]!
  4008e0:  910003fd  mov  x29, sp
  4008e4:  b9001fa0  str  w0, [x29, #28]
  4008e8:  f9000ba1  str  x1, [x29, #16]
  4008ec:  f9400ba0  ldr  x0, [x29, #16]
  4008f0:  91002000  add  x0, x0, #0x8
  4008f4:  f9400001  ldr  x1, [x0]
  4008f8:  f9400ba0  ldr  x0, [x29, #16]
  4008fc:  91004000  add  x0, x0, #0x10
  400900:  f9400000  ldr  x0, [x0]
  400904:  aa0003e2  mov  x2, x0
  400908:  b9401fa0  ldr  w0, [x29, #28]
  40090c:  97ffffde  bl   400884 <opt_validate>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a31924
0000000000400910 <main+0x34>:
  400910:  7100001f  cmp  w0, #0x0
  400914:  540000c0  b.eq 40092c <main+0x50>  // b.none
Timestamp 0x53e56a31924
000000000040092c <main+0x50>:
  40092c:  f9400ba0  ldr  x0, [x29, #16]
  400930:  91004000  add  x0, x0, #0x10
  400934:  f9400000  ldr  x0, [x0]
  400938:  39400000  ldrb w0, [x0]
  40093c:  7101941f  cmp  w0, #0x65
  400940:  1a9f17e0  cset w0, eq  // eq = none
  400944:  12001c00  and  w0, w0, #0xff
  400948:  b9002fa0  str  w0, [x29, #44]
  40094c:  f9400ba0  ldr  x0, [x29, #16]
  400950:  91002000  add  x0, x0, #0x8
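As a possible next step for the script (purely hypothetical, not part of the change above), the timestamp could also be used directly in process_event() to print the delta since the previous sample:

prev_time = [None]  # previous sample's timestamp, kept across process_event() calls

def print_time_delta(sample):
    # Annotate each disassembled block with the number of generator ticks
    # elapsed since the previous block.
    cur = sample['time']
    if prev_time[0] is not None and cur >= prev_time[0]:
        print('Timestamp 0x%x (+%d ticks since previous sample)' % (cur, cur - prev_time[0]))
    else:
        print('Timestamp 0x%x' % cur)
    prev_time[0] = cur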
I need to investigate it further to make sure getting timestamps from this source is a good idea - I'm not convinced that the timestamp stored as a packet queue parameter is refreshed frequently enough to keep up with the timestamp packets actually emitted.
Look at [2] above, it should give you what you need.
Then, I think, it should be possible to generate samples (not sure what type, though; perhaps not 'branch' this time) for timestamp/cycacc packets, analogously to what has been done for TRACE_ON https://lists.linaro.org/pipermail/coresight/2018-May/001327.html and then expose them in the Python interface.
I'd be grateful for any opinion on this idea, especially about the usefulness of such a feature for the general audience, as well as any possible compatibility issues. If you are aware of another approach to achieve timestamp correlation with branch samples, it would also be very welcome.
Could you review Mathieu's patches 'perf tools: Configure timestamp generation in CPU-wide mode' [1] and 'coresight: etm4x: Configure tracers to emit timestamps' [2]?
Thanks for pointing them out. I did review them; however, they didn't help much.
They are pretty much just configuration of the timestamping module, which works well for me.
You are correct, they just deal with the mechanic of enabling the generation of timestamp packets. How I use them is in [2] above.
I hope the idea is not completely pointless. I'm still making my way through the perf subsystem, so I might have missed some crucial
details.
I don't think the idea is pointless. You simply have a new use case and from my point of view it is enough to receive consideration.
The idea looks good to me. Actually I am very curious about the question below:
What benefit can we get from enabling timestamps for CoreSight branch events that cannot be fulfilled by perf's cpu-clock/task-clock events and the PMU cpu-cycle event?
I tried to research cpu-clock and task-clock and it looks like they are based on the wall clock, while CS timestamping is CPU-independent. Measurement with CS may help to narrow down instruction stalls, which, I believe, would be hidden otherwise.
CPU cycles seem like a good measurement in this case, but I'm not sure if correlating PMU events with a specific instruction range wouldn't be harder than extracting the timestamps we already have in the stream.
Best regards, Wojciech
Thanks, Leo Yan
[1] https://lists.linaro.org/pipermail/coresight/2019-March/002259.html [2] https://lists.linaro.org/pipermail/coresight/2019-March/002232.html