On Mon, 29 Apr 2019 at 11:24, Wojciech Żmuda wzmuda@n7space.com wrote:
Hi Leo,
-----Original Message-----
From: Leo Yan leo.yan@linaro.org
Sent: Monday, April 29, 2019 10:19 AM
To: Wojciech Żmuda wzmuda@n7space.com
Cc: coresight@lists.linaro.org; Michał Kurowski mkurowski@n7space.com; Michal Mosdorf mmosdorf@n7space.com
Subject: Re: [RFC] Trace-based execution time profiling (timestamp in perf-script?)
Hi Wojciech,
On Fri, Apr 26, 2019 at 10:36:37AM +0000, Wojciech Żmuda wrote:
I'm trying to design a solution to use CoreSight for measuring application execution time, with granularity of specific ranges of
I have some idea how this may be achieved and I'd like to know your
Great inspiration comes from this patch set by Leo Yan, especially from
the disassembly script:
The related work was initialized by Tor (TI) and Mathieu :)
This was a very long time ago. The arm-cs-trace-disasm.py may be used for reference purposes but nothing more and it would certainly not work nowadays.
Analyzing this, I learned that perf-script is capable of understanding perf.data AUXTRACE section and parsing some of the trace elements to
branch samples, which illustrate how the IP moved around.
These pieces of information are available for the built-in python interpreter, so we can script it to get assembly from the program image.
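As a rough illustration of the scripting hook mentioned here, the sketch below shows a minimal perf-script Python handler that reports where each branch sample came from and where it went. It assumes the usual `process_event(param_dict)` entry point and the `cpu`, `ip` and `addr` keys of `param_dict['sample']`; the helper name `describe_branch` is made up for this example, and the reading of `ip` as the branch source and `addr` as the branch target is an assumption about how CoreSight branch samples are synthesized.

```python
def describe_branch(param_dict):
    """Format one branch sample as 'cpu N: 0xsource -> 0xtarget'.

    Assumption: for CoreSight branch samples, 'ip' holds the address of the
    branching instruction and 'addr' holds the branch target.
    """
    sample = param_dict["sample"]
    return "cpu %d: 0x%x -> 0x%x" % (sample["cpu"], sample["ip"], sample["addr"])


def process_event(param_dict):
    # perf script calls this once per sample when the file is passed with -s.
    print(describe_branch(param_dict))
```

A disassembly script would replace the `print` with a lookup of the address range between the previous branch target and the current branch source in the program image.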
If I understand perf-script in its current shape correctly, it ignores all the non-branching events (so everything that's not an ATOM, EXCEPTION or TRACE_ON packet) - specifically, timestamping is lost during the process. I'd like to modify perf-script to generate samples on such timing events,
The CoreSight trace data is saved into perf.data (it is compressed data) and we need to use OpenCSD to decode the trace data and output different kinds of packets.
Based on these packets, perf-script (and perf-report) can generate branch samples. It can also generate instruction samples and last branch stack samples if we specify the flags 'i' and 'l' for the option '--itrace'; by default, when no flags are given for '--itrace', perf-script generates only branch samples.
Looking at the disassembly python script, I thought branch samples are enough to reconstruct the program flow. Since it is possible to get instruction samples as well (with --itrace=il - skipping 'l' here gives me a segfault), do you see any usage of this sample type? It was not mentioned in the Linux/CoreSight documents I've read, so I'm not quite sure how I can use this feature.
If we use the command 'perf script -F time', it should output samples with the timestamp field. In my testing, however, this command fails, and I am not sure whether that is because, as you mentioned, the timestamping is lost during the process. If it is, how about fixing the issue for 'perf script -F time'?
This is the first time I see the "script -F time" option, but admittedly I never went looking for it. We never output time samples because we had no use for it. Doing so is probably not hard if there is already a packet type for timestamps. On the topic of timestamps, be mindful the clock used to generate timestamps is completely disjoint from the CPU clock and the system clock.
I confirm that '-F time' does not work. This does not seem odd, since the time field is empty, but I don't understand the error. Although perf-script generated branch samples, it complains about 'dummy' samples:
# perf script -F time
Samples for 'dummy:u' event do not have TIME attribute set. Cannot print 'time' field.
I browsed the sample generation code (cs-etm.c) and I can't see code producing this type of sample. Anyway, I think this may be a bug in the printing part of perf-script, since I actually managed to populate the time field and access it in python (see below).
so later I can have them in between assembly instructions to calculate
deltas and be able to tell either:
- how much time and/or CPU cycles have been spent between two
arbitrary instructions (ideally), or
- what instructions have been executed between timestamp T and T+1
(this seems to be more in-line with how timestamping in CS works, I think)
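The second view in the list above (which instructions ran between timestamp T and T+1) can be sketched as a simple grouping step. This is only an illustration under stated assumptions: the `(timestamp, start, stop)` record layout, the function name `group_ranges_by_timestamp` and the sample values are all hypothetical, standing in for whatever a modified perf-script would hand to the Python interpreter.

```python
from collections import OrderedDict


def group_ranges_by_timestamp(records):
    """Group executed address ranges under the timestamp they were seen with.

    records: iterable of (timestamp, start_addr, stop_addr) tuples in trace
    order (hypothetical layout). Returns an OrderedDict mapping each
    timestamp to the list of (start_addr, stop_addr) ranges, in order.
    """
    groups = OrderedDict()
    for timestamp, start, stop in records:
        groups.setdefault(timestamp, []).append((start, stop))
    return groups


# Hypothetical records; several ranges may share one timestamp because
# timestamp packets are emitted less often than branches occur.
records = [
    (0x1000, 0x400a80, 0x400ab8),
    (0x100c, 0x4008dc, 0x40090c),
    (0x1900, 0x400910, 0x400914),
    (0x1900, 0x40092c, 0x400950),
]

groups = group_ranges_by_timestamp(records)
for timestamp, ranges in groups.items():
    print("T=0x%x: %d range(s)" % (timestamp, len(ranges)))
```

Grouping by timestamp rather than attaching a time to each instruction matches the coarser granularity that this second option accepts.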
A brief analysis of tools/perf/util/cs-etm.c and cs-etm-decoder/cs-etm-decoder.c suggests that timestamp events are not turned into packets, but merely recorded as a packet queue parameter (I'm not sure why this is needed, though). The cycacc event is not processed further at all, besides being later decoded to plaintext by
OpenCSD. I think it may be worth giving them both a dedicated `enum cs_etm_sample_type` value and packet generator functions.
I'd like to leave this part for Mike/Mathieu for comments.
Wojciech, I have a new patchset waiting to be released that makes use of the cycle accurate information to correlate time between sample packets when working in CPU wide scenarios [1]. It also rearranges a lot of the code you have been looking at. Patches prepended with "perf tools:" will be sent out as soon as we have a 5.2-rc1.
[1]. https://git.linaro.org/people/mathieu.poirier/coresight.git (branch 5.1-rc3-cpu-wide-v3).
Mike, Mathieu, I'll be grateful if you joined this discussion and shared your knowledge. Thanks!
Meanwhile, today's achievement in this field is the discovery of the cs_etm__synth_branch_sample() and cs_etm__synth_instruction_sample() functions. They look like good places to populate the 'time' field.
With the following change it is indeed possible to get the timestamp value in the sample. While it is still not visible in 'perf script -F time' (because of the error mentioned above), it is enough to get this visible in python:
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 8496190ad553..34ca6749fc94 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1157,6 +1157,7 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	sample.cpu = tidq->packet->cpu;
 	sample.flags = tidq->prev_packet->flags;
 	sample.cpumode = event->sample.header.misc;
+	sample.time = tidq->packet_queue.timestamp;
Humm... Have a look at what I did in [2], found in the tree I indicated above. In there you will find an algorithm that uses timestamp packet to do pretty much what you want to do in the above code snippet. Mike and I have talked about it recently and concluded it was the best way to proceed.
[2]. 311207df4a71 perf tools: Add notion of time to decoding code
With the following change, the time field is populated in the param_dict['sample']['time'] field (sample = param_dict['sample']):
diff --git a/arm-cs-trace-disasm.py b/arm-cs-trace-disasm.py
index 1239ab4b6a58..da5cb4d53b4a 100644
--- a/arm-cs-trace-disasm.py
+++ b/arm-cs-trace-disasm.py
@@ -232,4 +232,5 @@ def process_event(param_dict):
         print "Address range [ %s .. %s ]: isn't in same dso" % (start_addr, stop_addr)
         return
 
+    print('Timestamp 0x%lx' % sample['time'])
     dump_disam(prev_dso, start_addr, stop_addr)
Then, applying the script to perf-script prints timestamps along with disassembly:
root@zynq:~# perf script -s arm-cs-trace-disasm.py -- -d objdump | less
ARM CoreSight Trace Data Assembler Dump
kallsyms cannot be used to dump assembler
Timestamp 0x53e56a31002
0000000000400a80 <__libc_csu_init>:
  400a80: a9bc7bfd stp x29, x30, [sp, #-64]!
  400a84: 910003fd mov x29, sp
  400a88: a901d7f4 stp x20, x21, [sp, #24]
  400a8c: 90000094 adrp x20, 410000 <__FRAME_END__+0xf3f0>
  400a90: 90000095 adrp x21, 410000 <__FRAME_END__+0xf3f0>
  400a94: 9137c294 add x20, x20, #0xdf0
  400a98: 9137a2b5 add x21, x21, #0xde8
  400a9c: a902dff6 stp x22, x23, [sp, #40]
  400aa0: cb150294 sub x20, x20, x21
  400aa4: f9001ff8 str x24, [sp, #56]
  400aa8: 2a0003f6 mov w22, w0
  400aac: aa0103f7 mov x23, x1
  400ab0: 9343fe94 asr x20, x20, #3
  400ab4: aa0203f8 mov x24, x2
  400ab8: 97fffece bl 4005f0 <_init>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a3100e
00000000004008dc <main>:
  4008dc: a9b37bfd stp x29, x30, [sp, #-208]!
  4008e0: 910003fd mov x29, sp
  4008e4: b9001fa0 str w0, [x29, #28]
  4008e8: f9000ba1 str x1, [x29, #16]
  4008ec: f9400ba0 ldr x0, [x29, #16]
  4008f0: 91002000 add x0, x0, #0x8
  4008f4: f9400001 ldr x1, [x0]
  4008f8: f9400ba0 ldr x0, [x29, #16]
  4008fc: 91004000 add x0, x0, #0x10
  400900: f9400000 ldr x0, [x0]
  400904: aa0003e2 mov x2, x0
  400908: b9401fa0 ldr w0, [x29, #28]
  40090c: 97ffffde bl 400884 <opt_validate>
CPU3: CS_ETM_TRACE_ON packet is inserted
Timestamp 0x53e56a31924
0000000000400910 <main+0x34>:
  400910: 7100001f cmp w0, #0x0
  400914: 540000c0 b.eq 40092c <main+0x50> // b.none
Timestamp 0x53e56a31924
000000000040092c <main+0x50>:
  40092c: f9400ba0 ldr x0, [x29, #16]
  400930: 91004000 add x0, x0, #0x10
  400934: f9400000 ldr x0, [x0]
  400938: 39400000 ldrb w0, [x0]
  40093c: 7101941f cmp w0, #0x65
  400940: 1a9f17e0 cset w0, eq // eq = none
  400944: 12001c00 and w0, w0, #0xff
  400948: b9002fa0 str w0, [x29, #44]
  40094c: f9400ba0 ldr x0, [x29, #16]
  400950: 91002000 add x0, x0, #0x8
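To turn a dump like this into the deltas discussed earlier, the 'Timestamp 0x...' lines can be post-processed. The following is a minimal sketch, assuming the line format printed by the patched script above; the function name `timestamp_deltas` is hypothetical, and the deltas are in raw timestamp counter ticks, not CPU cycles or nanoseconds.

```python
import re

# Matches the 'Timestamp 0x...' lines printed by the patched script.
TIMESTAMP_RE = re.compile(r"^Timestamp 0x([0-9a-fA-F]+)")


def timestamp_deltas(lines):
    """Return the differences between consecutive timestamps, in ticks."""
    deltas = []
    previous = None
    for line in lines:
        match = TIMESTAMP_RE.match(line.strip())
        if not match:
            continue  # disassembly, symbol headers, TRACE_ON notices, ...
        current = int(match.group(1), 16)
        if previous is not None:
            deltas.append(current - previous)
        previous = current
    return deltas


# Timestamp and symbol lines copied from the dump above.
dump = """\
Timestamp 0x53e56a31002
0000000000400a80 <__libc_csu_init>:
Timestamp 0x53e56a3100e
00000000004008dc <main>:
Timestamp 0x53e56a31924
0000000000400910 <main+0x34>:
Timestamp 0x53e56a31924
000000000040092c <main+0x50>:
"""

print(timestamp_deltas(dump.splitlines()))  # prints [12, 2326, 0]
```

The zero delta between the last two ranges shows that consecutive branch samples can carry the same timestamp value.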
I need to investigate it further to make sure getting timestamps from this source is a good idea - I'm not convinced that the timestamp, as a packet queue parameter, is refreshed frequently enough to keep up with the timestamp packets actually emitted.
Look at [2] above, it should give you what you need.
Then, I think, it should be possible to generate samples (not sure what type though, perhaps not 'branch' this time) for timestamp/cycacc packets, analogously to what has been done for TRACE_ON https://lists.linaro.org/pipermail/coresight/2018-May/001327.html and then expose them in the python interface.
I'd be grateful for any opinion about this idea, especially about usefulness of such feature for the general audience, as well as any possible compatibility issues. If you are aware of another approach to achieve timestamp correlation with branch samples, it would also be very
Could you review Mathieu's patches 'perf tools: Configure timestsamp generation in CPU-wide mode' [1] and 'coresight: etm4x: Configure tracers to emit timestamps' [2]?
Thanks for pointing them out. I did review them; however, they didn't help much. They are pretty much just configuration of the timestamping module, which works well for me.
You are correct, they just deal with the mechanic of enabling the generation of timestamp packets. How I use them is in [2] above.
I hope the idea is not completely pointless. I'm still making my way through the perf subsystem, so I might have missed some crucial details.
I don't think the idea is pointless. You simply have a new use case and from my point of view it is enough to receive consideration.
The idea looks good to me. Actually, I am very curious about the following question:
In brief, what benefit do we get from enabling timestamps for CoreSight branch events that cannot be obtained from perf's cpu-clock/task-clock events and the PMU CPU cycle event?
I tried to research cpu-clock and task-clock and it looks like they are based on wall clock, while CS timestamping is CPU-independent. Measurement with CS may help to narrow down instruction stalls, which, I believe, would be hidden otherwise. CPU cycle seems like a good measurement in this case, but I'm not sure if correlating PMU events with specific instruction range wouldn't be harder than extracting timestamps we already have in the stream.
Best regards, Wojciech
Thanks, Leo Yan
[1] https://lists.linaro.org/pipermail/coresight/2019-March/002259.html
[2] https://lists.linaro.org/pipermail/coresight/2019-March/002232.html
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight