From: Wojciech Zmuda <wzmuda@n7space.com>
Hello,
After our discussion in [1], I came up with this proposed solution for timestamping perf samples, so that in trace post-processing we can tell how long it takes to execute traced blocks of code. I'll be grateful if you find some time to take a look at it. The patch itself is 37 lines of additions, so it shouldn't consume a lot of your time. I also have some ideas on how to further improve this patch that I'm not quite sure about, so I'd like to ask for your input.
The simplest verification is visibility of the time field in 'perf script' output:

root@zynq:~# perf script --ns -F comm,cpu,event,time | head
Frame deformatter: Found 4 FSYNCS
singlethread-at [001]  4572.697372400:  branches:u:
singlethread-at [001]  4572.697372404:  branches:u:
singlethread-at [001]  4572.697372408:  branches:u:
singlethread-at [001]  4572.697372412:  branches:u:
singlethread-at [001]  4572.697372416:  branches:u:
The step above works only if the trace has been collected in CPU-wide mode. I have an idea on how to improve this, but I'd like to get some more information first; I elaborate on it in the last paragraph.
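For reference, by CPU-wide mode I mean recording with -a; an example invocation (the @tmc_etf0 sink name is specific to my board) looks like:

root@zynq:~# perf record -e cs_etm/@tmc_etf0/u -a -- sleep 2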
Another way to access these values is through the script engine, which I used to sanity-check the timestamp values. Using a Python script, I verified that timestamps in subsequent samples are monotonically increasing, and it seems they are. The only exceptions are discontinuities in the trace. From my understanding, we can't timestamp discontinuities reasonably, since after the decoder synchronizes back following a trace loss, it needs to wait for another timestamp packet.
from __future__ import print_function
import os
import sys

sys.path.append(os.environ['PERF_EXEC_PATH'] + \
    '/scripts/python/Perf-Trace-Util/lib/Perf/Trace')

from perf_trace_context import *

prev_sample = 0
cnt = 0

def process_event(s):
    global prev_sample
    global cnt

    sample = s['sample']

    # Remember the first sample; there is nothing to compare it against yet.
    if not prev_sample:
        prev_sample = sample
        cnt = cnt + 1
        return

    # Timestamps should never decrease between consecutive samples.
    if sample['time'] < prev_sample['time']:
        print('Sample %d has higher timestamp than sample %d' % (cnt - 1, cnt))

    prev_sample = sample
    cnt = cnt + 1
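For completeness, assuming the script above is saved as check_time.py, it can be run over the recorded data with:

root@zynq:~# perf script -s check_time.py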
By subtracting the timestamps of two samples, we can tell approximately how long it took (in timestamp units) to execute the instructions between those two points. Then, based on Al's response in [1], knowing the timestamp generator clock frequency, we can convert that into proper time units. The frequency is known (we have it in the DT) and can be saved to and retrieved from a TSGEN register. This is not done at the moment (the register is empty), but the idea is that, if it were, the frequency might then be saved in the perf.data headers (similarly to how the TRCIDR registers are saved now) and later retrieved programmatically to convert timestamp differences into actual time. I haven't experimented with this yet - I'd like to hear your opinion on it first.
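To illustrate the arithmetic, here is a minimal sketch; the 100 MHz value is made up and stands in for whatever frequency would eventually be retrieved from the perf.data headers:

#include <stdint.h>

/*
 * Hypothetical timestamp generator frequency; the real value would come
 * from the DT / TSGEN register via the perf.data headers.
 */
#define TSGEN_FREQ_HZ 100000000ULL

static uint64_t ts_delta_to_ns(uint64_t ts_start, uint64_t ts_end)
{
	/* One timestamp unit corresponds to 1 / TSGEN_FREQ_HZ seconds. */
	return (ts_end - ts_start) * 1000000000ULL / TSGEN_FREQ_HZ;
}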
I've also tried to access the time field with 'perf script -F time', and while that's possible for traces recorded in CPU-wide mode, it's not possible in per-thread mode, because PERF_SAMPLE_TIME is not set in that mode (see cs_etm__synth_events()). I tried to analyze that code, but I can't understand the relation between this flag and the ETM's timeless_decoding property. It looks to me like timeless decoding is equivalent to per-thread mode, where timestamps were not really utilized, while in CPU-wide mode they were used for inter-queue correlation. Would it be a good direction to have a separate property for TSGEN being on/off and to set PERF_SAMPLE_TIME based on that, rather than on etm->timeless_decoding? I see an obstacle in the cs_etm__is_timeless_decoding() function, which sets etm->timeless_decoding based on whether the events have the SAMPLE_TIME flag - could the criterion here be the tracing mode (CPU-wide/per-thread) instead?
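To make the question more concrete, this is roughly the direction I have in mind; etm->per_thread_mode is a made-up field standing for whatever would record the tracing mode:

static bool cs_etm__is_timeless_decoding(struct cs_etm_auxtrace *etm)
{
	/*
	 * Hypothetical: derive timeless decoding from the tracing mode
	 * (per-thread vs. CPU-wide) instead of from whether the recorded
	 * events have PERF_SAMPLE_TIME set.
	 */
	return etm->per_thread_mode;
}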
Thank you, Wojciech
[1] https://lists.linaro.org/pipermail/coresight/2019-May/002577.html
Wojciech Zmuda (1): perf cs-etm: Set time value for samples
 tools/perf/util/cs-etm.c | 36 ++++++++++++++++++++++++++++++++++++
 tools/perf/util/cs-etm.h |  1 +
 2 files changed, 37 insertions(+)