This set adds support for CoreSight CPU-wide trace sessions. It borrows most of its code from the per-thread implementation with exception that range packets are processed and synthesised according to the time the trace they contain has been executed.
This is done using the timestamp and contextID feature available on ETM4x tracers (ETM3x/PTM aren't addressed yet). Decoding between processors is done in chronological order using a min heap.
Of special interest is the way timestamp packets are used to account for temporal execution of traced instructions. Since a timestamp typically happen after range packets have been recorded, the timestamp from the previous range is used as the start time of the current range. When a timestamp for the previous range doesn't exist (i.e start of trace or discontinuity) the start time is estimated.
Open question: --------------
At this time the implementation supports tracing a single CPU since the only HW we have exhibit an N:1 source/sink topology. The HW itself does support collecting traces from more than one source but using the feature in this way could be very confusing and mislead users.
For example the following:
# perf record -e cs_etm/20070000.etr/ -C 2,3 application1
would end up tracing everyting that is happening on CPU 2 and 3 for as long as appliation1 is executing. Because the HW doesn't give us an interrupt when buffers are full, traces from one CPU could easily clobber traces from the other, giving the impression that nothing was executed on the latter.
So this would work:
# perf record -e cs_etm/20070000.etr/ -C 3 application1
I am open to discussion on the topic should someone think of something.
As with the cleanup set this code has been uploaded here [1].
Thanks, Mathieu
[1].https://git.linaro.org/people/mathieu.poirier/coresight.git perf-opencsd-master-cpu-wide-support
Mathieu Poirier (12): perf tools: Add defines for CONTEXTID configuration perf tools: Configure contextID tracing in CPU-wide mode perf tools: Configure timestsamp generation in CPU-wide mode perf tools: Configure SWITCH_EVENTS in CPU-wide mode perf tools: Add handling of itrace start events perf tools: Add handling of switch-CPU-wide events perf tools: Linking PE contextID with perf thread mechanic perf tools: Allocate decoder tree as needed perf tools: Make cs_etm__dump_event() work with CPU-wide scenarios perf tools: Add notion of time to the decoding code perf tools: Make function cs_etm_decoder__clear_buffer() public perf tools: Add support for CPU-wide trace scenarios
include/linux/coresight-pmu.h | 2 + tools/include/linux/coresight-pmu.h | 2 + tools/perf/arch/arm/util/cs-etm.c | 174 ++++++++++-- tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 140 +++++++++- tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 4 +- tools/perf/util/cs-etm.c | 334 ++++++++++++++++++++++-- tools/perf/util/cs-etm.h | 17 ++ 7 files changed, 623 insertions(+), 50 deletions(-)
Add defines for both kernel and perf-tools allowing the cs_etm PMU to use the CONTEXTID tracing feature offered by the HW.
Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org --- include/linux/coresight-pmu.h | 2 ++ tools/include/linux/coresight-pmu.h | 2 ++ 2 files changed, 4 insertions(+)
diff --git a/include/linux/coresight-pmu.h b/include/linux/coresight-pmu.h index a1a959ba24ff..b0e35eec6499 100644 --- a/include/linux/coresight-pmu.h +++ b/include/linux/coresight-pmu.h @@ -12,11 +12,13 @@
/* ETMv3.5/PTM's ETMCR config bit */ #define ETM_OPT_CYCACC 12 +#define ETM_OPT_CTXTID 14 #define ETM_OPT_TS 28 #define ETM_OPT_RETSTK 29
/* ETMv4 CONFIGR programming bits for the ETM OPTs */ #define ETM4_CFG_BIT_CYCACC 4 +#define ETM4_CFG_BIT_CTXTID 6 #define ETM4_CFG_BIT_TS 11 #define ETM4_CFG_BIT_RETSTK 12
diff --git a/tools/include/linux/coresight-pmu.h b/tools/include/linux/coresight-pmu.h index a1a959ba24ff..b0e35eec6499 100644 --- a/tools/include/linux/coresight-pmu.h +++ b/tools/include/linux/coresight-pmu.h @@ -12,11 +12,13 @@
/* ETMv3.5/PTM's ETMCR config bit */ #define ETM_OPT_CYCACC 12 +#define ETM_OPT_CTXTID 14 #define ETM_OPT_TS 28 #define ETM_OPT_RETSTK 29
/* ETMv4 CONFIGR programming bits for the ETM OPTs */ #define ETM4_CFG_BIT_CYCACC 4 +#define ETM4_CFG_BIT_CTXTID 6 #define ETM4_CFG_BIT_TS 11 #define ETM4_CFG_BIT_RETSTK 12
When operating in CPU-wide mode being notified of contextID changes is required so that the decoding mechanic is aware of the process context switch.
Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org --- tools/perf/arch/arm/util/cs-etm.c | 122 ++++++++++++++++++++++++++++++++------ tools/perf/util/cs-etm.h | 12 ++++ 2 files changed, 115 insertions(+), 19 deletions(-)
diff --git a/tools/perf/arch/arm/util/cs-etm.c b/tools/perf/arch/arm/util/cs-etm.c index 2f595cd73da6..f0f68c175038 100644 --- a/tools/perf/arch/arm/util/cs-etm.c +++ b/tools/perf/arch/arm/util/cs-etm.c @@ -36,8 +36,96 @@ struct cs_etm_recording { size_t snapshot_size; };
+static const char *metadata_etmv3_ro[CS_ETM_PRIV_MAX] = { + [CS_ETM_ETMCCER] = "mgmt/etmccer", + [CS_ETM_ETMIDR] = "mgmt/etmidr", +}; + +static const char *metadata_etmv4_ro[CS_ETMV4_PRIV_MAX] = { + [CS_ETMV4_TRCIDR0] = "trcidr/trcidr0", + [CS_ETMV4_TRCIDR1] = "trcidr/trcidr1", + [CS_ETMV4_TRCIDR2] = "trcidr/trcidr2", + [CS_ETMV4_TRCIDR8] = "trcidr/trcidr8", + [CS_ETMV4_TRCAUTHSTATUS] = "mgmt/trcauthstatus", +}; + static bool cs_etm_is_etmv4(struct auxtrace_record *itr, int cpu);
+static int cs_etm_set_context_id(struct perf_pmu *cs_etm_pmu, + struct perf_evsel *evsel, int cpu) +{ + char path[PATH_MAX]; + int err = -EINVAL; + u32 val; + + /* Get a handle on TRCIRD2 */ + snprintf(path, PATH_MAX, "cpu%d/%s", + cpu, metadata_etmv4_ro[CS_ETMV4_TRCIDR2]); + err = perf_pmu__scan_file(cs_etm_pmu, path, "%x", &val); + + /* There was a problem reading the file, bailing out */ + if (err != 1) { + pr_err("%s: can't read file %s\n", + CORESIGHT_ETM_PMU_NAME, path); + goto out; + } + + /* + * TRCIDR2.CIDSIZE, bit [9-5], indicates whether contextID tracing + * is supported: + * 0b00000 Context ID tracing is not supported. + * 0b00100 Maximum of 32-bit Context ID size. + * All other values are reserved. + */ + val = BMVAL(val, 5, 9); + if (!val || val != 0x4) { + err = -EINVAL; + goto out; + } + + /* All good, let the kernel know */ + evsel->attr.config |= (1 << ETM_OPT_CTXTID); + err = 0; + +out: + + return err; +} + +static int cs_etm_set_option(struct auxtrace_record *itr, + struct perf_evsel *evsel, u32 option) +{ + int i, err = -EINVAL; + struct cpu_map *event_cpus = evsel->evlist->cpus; + struct cpu_map *online_cpus = cpu_map__new(NULL); + struct cs_etm_recording *ptr = container_of(itr, + struct cs_etm_recording, + itr); + struct perf_pmu *cs_etm_pmu = ptr->cs_etm_pmu; + + /* Set option of each CPU we have */ + for (i = 0; i < cpu__max_cpu(); i++) { + if (!cpu_map__has(event_cpus, i) || + !cpu_map__has(online_cpus, i)) + continue; + + switch (option) { + case ETM_OPT_CTXTID: + err = cs_etm_set_context_id(cs_etm_pmu, evsel, i); + if (err) + goto out; + break; + default: + break; + } + } + + err = 0; +out: + cpu_map__put(online_cpus); + return err; +} + static int cs_etm_parse_snapshot_options(struct auxtrace_record *itr, struct record_opts *opts, const char *str) @@ -68,8 +156,9 @@ static int cs_etm_recording_options(struct auxtrace_record *itr, container_of(itr, struct cs_etm_recording, itr); struct perf_pmu *cs_etm_pmu = ptr->cs_etm_pmu; struct perf_evsel *evsel, *cs_etm_evsel = NULL; - const struct cpu_map *cpus = evlist->cpus; + struct cpu_map *cpus = evlist->cpus; bool privileged = (geteuid() == 0 || perf_event_paranoid() < 0); + int err = 0;
ptr->evlist = evlist; ptr->snapshot_mode = opts->auxtrace_snapshot_mode; @@ -200,19 +289,24 @@ static int cs_etm_recording_options(struct auxtrace_record *itr,
/* * In the case of per-cpu mmaps, we need the CPU on the - * AUX event. + * AUX event. We also need the contextID in order to be notified + * when a context switch happened. */ - if (!cpu_map__empty(cpus)) + if (!cpu_map__empty(cpus)) { perf_evsel__set_sample_bit(cs_etm_evsel, CPU);
+ err = cs_etm_set_option(itr, cs_etm_evsel, ETM_OPT_CTXTID); + if (err) + goto out; + } + /* Add dummy event to keep tracking */ if (opts->full_auxtrace) { struct perf_evsel *tracking_evsel; - int err;
err = parse_events(evlist, "dummy:u", NULL); if (err) - return err; + goto out;
tracking_evsel = perf_evlist__last(evlist); perf_evlist__set_tracking_event(evlist, tracking_evsel); @@ -225,7 +319,8 @@ static int cs_etm_recording_options(struct auxtrace_record *itr, perf_evsel__set_sample_bit(tracking_evsel, TIME); }
- return 0; +out: + return err; }
static u64 cs_etm_get_config(struct auxtrace_record *itr) @@ -273,6 +368,8 @@ static u64 cs_etmv4_get_config(struct auxtrace_record *itr) config_opts = cs_etm_get_config(itr); if (config_opts & BIT(ETM_OPT_CYCACC)) config |= BIT(ETM4_CFG_BIT_CYCACC); + if (config_opts & BIT(ETM_OPT_CTXTID)) + config |= BIT(ETM4_CFG_BIT_CTXTID); if (config_opts & BIT(ETM_OPT_TS)) config |= BIT(ETM4_CFG_BIT_TS); if (config_opts & BIT(ETM_OPT_RETSTK)) @@ -322,19 +419,6 @@ cs_etm_info_priv_size(struct auxtrace_record *itr __maybe_unused, (etmv3 * CS_ETMV3_PRIV_SIZE)); }
-static const char *metadata_etmv3_ro[CS_ETM_PRIV_MAX] = { - [CS_ETM_ETMCCER] = "mgmt/etmccer", - [CS_ETM_ETMIDR] = "mgmt/etmidr", -}; - -static const char *metadata_etmv4_ro[CS_ETMV4_PRIV_MAX] = { - [CS_ETMV4_TRCIDR0] = "trcidr/trcidr0", - [CS_ETMV4_TRCIDR1] = "trcidr/trcidr1", - [CS_ETMV4_TRCIDR2] = "trcidr/trcidr2", - [CS_ETMV4_TRCIDR8] = "trcidr/trcidr8", - [CS_ETMV4_TRCAUTHSTATUS] = "mgmt/trcauthstatus", -}; - static bool cs_etm_is_etmv4(struct auxtrace_record *itr, int cpu) { bool ret = false; diff --git a/tools/perf/util/cs-etm.h b/tools/perf/util/cs-etm.h index 37f8d48179ca..11fac3116d64 100644 --- a/tools/perf/util/cs-etm.h +++ b/tools/perf/util/cs-etm.h @@ -59,6 +59,18 @@ struct intlist *traceid_list; #define KiB(x) ((x) * 1024) #define MiB(x) ((x) * 1024 * 1024)
+/* + * Create a contiguous bitmask starting at bit position @l and ending at + * position @h. For example + * GENMASK_ULL(39, 21) gives us the 64bit vector 0x000000ffffe00000. + * + * Carbon copy of implementation found in $KERNEL/include/linux/bitops.h + */ +#define GENMASK(h, l) \ + (((~0UL) - (1UL << (l)) + 1) & (~0UL >> (BITS_PER_LONG - 1 - (h)))) + +#define BMVAL(val, lsb, msb) ((val & GENMASK(msb, lsb)) >> lsb) + #define CS_ETM_HEADER_SIZE (CS_HEADER_VERSION_0_MAX * sizeof(u64))
static const u64 __perf_cs_etmv3_magic = 0x3030303030303030ULL;
When operating in CPU-wide mode tracers need to generate timestamps in order to correlate the code being traced on one CPU with what is executed on other CPUs.
Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org --- tools/perf/arch/arm/util/cs-etm.c | 49 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 49 insertions(+)
diff --git a/tools/perf/arch/arm/util/cs-etm.c b/tools/perf/arch/arm/util/cs-etm.c index f0f68c175038..dcf0fef9462f 100644 --- a/tools/perf/arch/arm/util/cs-etm.c +++ b/tools/perf/arch/arm/util/cs-etm.c @@ -92,6 +92,46 @@ static int cs_etm_set_context_id(struct perf_pmu *cs_etm_pmu, return err; }
+static int cs_etm_set_timestamp(struct perf_pmu *cs_etm_pmu, + struct perf_evsel *evsel, int cpu) +{ + char path[PATH_MAX]; + int err = -EINVAL; + u32 val; + + /* Get a handle on TRCIRD0 */ + snprintf(path, PATH_MAX, "cpu%d/%s", + cpu, metadata_etmv4_ro[CS_ETMV4_TRCIDR0]); + err = perf_pmu__scan_file(cs_etm_pmu, path, "%x", &val); + + /* There was a problem reading the file, bailing out */ + if (err != 1) { + pr_err("%s: can't read file %s\n", + CORESIGHT_ETM_PMU_NAME, path); + goto out; + } + + /* + * TRCIDR0.TSSIZE, bit [28-24], indicates whether global timestamping + * is supported: + * 0b00000 Global timestamping is not implemented + * 0b00110 Implementation supports a maximum timestamp of 48bits. + * 0b01000 Implementation supports a maximum timestamp of 64bits. + */ + val &= GENMASK(28, 24); + if (!val) { + err = -EINVAL; + goto out; + } + + /* All good, let the kernel know */ + evsel->attr.config |= (1 << ETM_OPT_TS); + err = 0; + +out: + return err; +} + static int cs_etm_set_option(struct auxtrace_record *itr, struct perf_evsel *evsel, u32 option) { @@ -115,6 +155,11 @@ static int cs_etm_set_option(struct auxtrace_record *itr, if (err) goto out; break; + case ETM_OPT_TS: + err = cs_etm_set_timestamp(cs_etm_pmu, evsel, i); + if (err) + goto out; + break; default: break; } @@ -298,6 +343,10 @@ static int cs_etm_recording_options(struct auxtrace_record *itr, err = cs_etm_set_option(itr, cs_etm_evsel, ETM_OPT_CTXTID); if (err) goto out; + + err = cs_etm_set_option(itr, cs_etm_evsel, ETM_OPT_TS); + if (err) + goto out; }
/* Add dummy event to keep tracking */
Ask the perf core to generate an event when processes are swapped in/out of context. That way proper action can be taken by the decoding code when faced with such event.
Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org --- tools/perf/arch/arm/util/cs-etm.c | 3 +++ 1 file changed, 3 insertions(+)
diff --git a/tools/perf/arch/arm/util/cs-etm.c b/tools/perf/arch/arm/util/cs-etm.c index dcf0fef9462f..11d4c5b12779 100644 --- a/tools/perf/arch/arm/util/cs-etm.c +++ b/tools/perf/arch/arm/util/cs-etm.c @@ -208,6 +208,9 @@ static int cs_etm_recording_options(struct auxtrace_record *itr, ptr->evlist = evlist; ptr->snapshot_mode = opts->auxtrace_snapshot_mode;
+ if (perf_can_record_switch_events()) + opts->record_switch_events = true; + evlist__for_each_entry(evlist, evsel) { if (evsel->attr.type == cs_etm_pmu->type) { if (cs_etm_evsel) {
Add handling of ITRACE events in order to add the tid/pid of the executing process to the perf tools machine infrastructure. This information is later retrieved when a contextID packet is found in the trace stream.
Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org --- tools/perf/util/cs-etm.c | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 28b05efcefbc..38034964da56 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1180,6 +1180,33 @@ static int cs_etm__process_timeless_queues(struct cs_etm_auxtrace *etm, return 0; }
+static int cs_etm__process_itrace_start(struct cs_etm_auxtrace *etm, + union perf_event *event) +{ + struct thread *th; + + /* + * This needs to be fixed if we want to trace more than + * one process. + */ + if (etm->timeless_decoding) + return 0; + + /* + * Add the tid/pid to the log so that we can get a match when + * we get a contextID from the decoder. + */ + th = machine__findnew_thread(etm->machine, + event->itrace_start.pid, + event->itrace_start.tid); + if (!th) + return -ENOMEM; + + thread__put(th); + + return 0; +} + static int cs_etm__process_event(struct perf_session *session, union perf_event *event, struct perf_sample *sample, @@ -1218,6 +1245,9 @@ static int cs_etm__process_event(struct perf_session *session, event->fork.tid, sample->time);
+ if (event->header.type == PERF_RECORD_ITRACE_START) + return cs_etm__process_itrace_start(etm, event); + return 0; }
Add handling of SWITCH-CPU-WIDE events in order to add the tid/pid of the incoming process to the perf tools machine infrastructure. This information is later retrieved when a contextID packet is found in the trace stream.
Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org --- tools/perf/util/cs-etm.c | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 38034964da56..632237b54b81 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1207,6 +1207,42 @@ static int cs_etm__process_itrace_start(struct cs_etm_auxtrace *etm, return 0; }
+static int cs_etm__process_switch_cpu_wide(struct cs_etm_auxtrace *etm, + union perf_event *event) +{ + struct thread *th; + bool out = event->header.misc & PERF_RECORD_MISC_SWITCH_OUT; + + /* + * Context switch in per-thread mode are irrelevant since perf + * will start/stop tracing as the process is scheduled. + */ + if (etm->timeless_decoding) + return 0; + + /* + * SWITCH_IN events carry the next process to be switched out while + * SWITCH_OUT events carry the process to be switched in. As such + * we don't care about IN events. + */ + if (!out) + return 0; + + /* + * Add the tid/pid to the log so that we can get a match when + * we get a contextID from the decoder. + */ + th = machine__findnew_thread(etm->machine, + event->context_switch.next_prev_pid, + event->context_switch.next_prev_tid); + if (!th) + return -ENOMEM; + + thread__put(th); + + return 0; +} + static int cs_etm__process_event(struct perf_session *session, union perf_event *event, struct perf_sample *sample, @@ -1247,6 +1283,8 @@ static int cs_etm__process_event(struct perf_session *session,
if (event->header.type == PERF_RECORD_ITRACE_START) return cs_etm__process_itrace_start(etm, event); + else if (event->header.type == PERF_RECORD_SWITCH_CPU_WIDE) + return cs_etm__process_switch_cpu_wide(etm, event);
return 0; }
Link contextID packets received from the decoder with the perf tool thread mechanic so that we know the specifics of the currently executing process.
Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org --- tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 34 +++++++++++++++++++++++++ tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 2 -- tools/perf/util/cs-etm.c | 30 +++++++++++++++++++--- tools/perf/util/cs-etm.h | 5 ++++ 4 files changed, 66 insertions(+), 5 deletions(-)
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c index 24aabf01d754..3e69abbfa2c5 100644 --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c @@ -345,6 +345,37 @@ cs_etm_decoder__buffer_trace_on(struct cs_etm_decoder *decoder, CS_ETM_TRACE_ON); }
+static ocsd_datapath_resp_t +cs_etm_decoder__set_tid(struct cs_etm_decoder *decoder, + const ocsd_generic_trace_elem *elem, + const uint8_t trace_chan_id) +{ + int cpu; + pid_t tid; + struct int_node *inode = NULL; + struct cs_etm_queue *etmq = decoder->data; + ocsd_datapath_resp_t ret = OCSD_RESP_CONT; + + /* Ignore PE_CONTEXT packets that don't have a valid contextID */ + if (!elem->context.ctxt_id_valid) + goto out; + + /* Search the RB tree for the cpu associated with this traceID */ + inode = intlist__find(traceid_list, trace_chan_id); + if (!inode) { + ret = OCSD_RESP_FATAL_SYS_ERR; + goto out; + } + + cpu = *((int *)inode->priv); + tid = elem->context.context_id; + + cs_etm__etmq_set_tid_cpu(etmq, tid, cpu); + +out: + return ret; +} + static ocsd_datapath_resp_t cs_etm_decoder__gen_trace_elem_printer( const void *context, const ocsd_trc_index_t indx __maybe_unused, @@ -376,6 +407,9 @@ static ocsd_datapath_resp_t cs_etm_decoder__gen_trace_elem_printer( decoder->packet_buffer[decoder->tail].exc_ret = true; break; case OCSD_GEN_TRC_ELEM_PE_CONTEXT: + resp = cs_etm_decoder__set_tid(decoder, elem, + trace_chan_id); + break; case OCSD_GEN_TRC_ELEM_EO_TRACE: case OCSD_GEN_TRC_ELEM_ADDR_NACC: case OCSD_GEN_TRC_ELEM_TIMESTAMP: diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h index b2b6c8b5ba95..a61a8f75a135 100644 --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h @@ -33,8 +33,6 @@ struct cs_etm_packet { int cpu; };
-struct cs_etm_queue; - typedef u32 (*cs_etm_mem_cb_type)(struct cs_etm_queue *, u64, size_t, u8 *);
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 632237b54b81..91b77353429e 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -658,9 +658,11 @@ static void cs_etm__set_pid_tid_cpu(struct cs_etm_auxtrace *etm, { struct cs_etm_queue *etmq = queue->priv;
- /* CPU-wide tracing isn't supported yet */ - if (queue->tid == -1) - return; + /* queue-tid == -1 is CPU-wide scenario */ + if (queue->tid == -1) { + etmq->tid = machine__get_current_tid(etm->machine, etmq->cpu); + thread__zput(etmq->thread); + }
if ((!etmq->thread) && (etmq->tid != -1)) etmq->thread = machine__find_thread(etm->machine, -1, @@ -673,6 +675,28 @@ static void cs_etm__set_pid_tid_cpu(struct cs_etm_auxtrace *etm, } }
+void cs_etm__etmq_set_tid_cpu(struct cs_etm_queue *etmq, + pid_t tid, int cpu) +{ int err; + struct cs_etm_auxtrace *etm = etmq->etm; + int queue_nr = etmq->queue_nr; + struct auxtrace_queue *queue = &etm->queues.queue_array[queue_nr]; + + err = machine__set_current_tid(etm->machine, cpu, tid, tid); + + /* + * What else can we do here? All valid process should have been + * registed in cs_etm__process_itrace_start() and + * cs_etm__process_cpu_wide(). + */ + if (err) { + pr_err("%s: failed to set tid: %d\n", __func__, tid); + return; + } + + cs_etm__set_pid_tid_cpu(etm, queue); +} + static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq, u64 addr, u64 period) { diff --git a/tools/perf/util/cs-etm.h b/tools/perf/util/cs-etm.h index 11fac3116d64..2aa63ee6aae2 100644 --- a/tools/perf/util/cs-etm.h +++ b/tools/perf/util/cs-etm.h @@ -56,6 +56,8 @@ enum { /* RB tree for quick conversion between traceID and CPUs */ struct intlist *traceid_list;
+struct cs_etm_queue; + #define KiB(x) ((x) * 1024) #define MiB(x) ((x) * 1024 * 1024)
@@ -78,6 +80,9 @@ static const u64 __perf_cs_etmv4_magic = 0x4040404040404040ULL; #define CS_ETMV3_PRIV_SIZE (CS_ETM_PRIV_MAX * sizeof(u64)) #define CS_ETMV4_PRIV_SIZE (CS_ETMV4_PRIV_MAX * sizeof(u64))
+void cs_etm__etmq_set_tid_cpu(struct cs_etm_queue *etmq, + pid_t tid, int cpu); + #ifdef HAVE_CSTRACE_SUPPORT int cs_etm__process_auxtrace_info(union perf_event *event, struct perf_session *session);
If working on a per-thread scenario, each etmq needs to have a decoder capable of handling packets from all CPUs in the session. But when working on CPU-wide scenario each etmq only needs to deal with packets generated by the CPU is has been associated with.
As such make the code aware of CPU specifics and allocate the minimum amount of decoder required for the trace session.
Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org --- tools/perf/util/cs-etm.c | 56 ++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 45 insertions(+), 11 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 91b77353429e..f7c78c39266c 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -119,7 +119,7 @@ static void cs_etm__set_trace_param(struct cs_etm_trace_params *t_params, }
static int cs_etm__init_trace_params(struct cs_etm_trace_params *t_params, - struct cs_etm_auxtrace *etm, + struct cs_etm_auxtrace *etm, int cpu, enum cs_etm_decoder_protocol protocol) { int i, ret = 0; @@ -129,8 +129,32 @@ static int cs_etm__init_trace_params(struct cs_etm_trace_params *t_params, goto out; }
- for (i = 0; i < etm->num_cpu; i++) - cs_etm__set_trace_param(&t_params[i], etm, i, protocol); + /* + * cpu == -1 is per-thread scenario: setup parameters for all the CPUs. + */ + if (cpu == -1) { + for (i = 0; i < etm->num_cpu; i++) + cs_etm__set_trace_param(&t_params[i], etm, i, protocol); + + goto out; + } + + /* + * CPU-wide scenario: setup the parameter for the CPU that was + * requested. + */ + for (i = 0; i < etm->num_cpu; i++) { + if (cpu == (int)etm->metadata[i][CS_ETM_CPU]) + break; + } + + /* We didn't find a match */ + if (i == etm->num_cpu) { + ret = -EINVAL; + goto out; + } + + cs_etm__set_trace_param(&t_params[0], etm, i, protocol);
out: return ret; @@ -178,7 +202,7 @@ static void cs_etm__dump_event(struct cs_etm_auxtrace *etm, if (!t_params) return;
- if (cs_etm__init_trace_params(t_params, etm, CS_ETM_PROTO_ETMV4i)) + if (cs_etm__init_trace_params(t_params, etm, -1, CS_ETM_PROTO_ETMV4i)) return;
/* Set decoder parameters to simply print the trace packets */ @@ -336,8 +360,10 @@ static u32 cs_etm__mem_access(struct cs_etm_queue *etmq, u64 address, return len; }
-static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm) +static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm, + int cpu) { + int num_cpu; struct cs_etm_decoder_params d_params; struct cs_etm_trace_params *t_params; struct cs_etm_queue *etmq; @@ -374,12 +400,20 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm) if (!etmq->event_buf) goto out_free;
+ /* + * If working on a per-thread scenario, each etmq needs to have + * a decoder capable of handling packets from all CPUs in the session. + * If working on CPU-wide scenario each etmq needs to deal with the + * CPU it has been assocated with. + */ + num_cpu = etm->timeless_decoding ? etm->num_cpu : 1; + /* Use metadata to fill in trace parameters for trace decoder */ - t_params = zalloc(sizeof(*t_params) * etm->num_cpu); + t_params = zalloc(sizeof(*t_params) * num_cpu); if (!t_params) goto out_free;
- if (cs_etm__init_trace_params(t_params, etm, CS_ETM_PROTO_ETMV4i)) + if (cs_etm__init_trace_params(t_params, etm, cpu, CS_ETM_PROTO_ETMV4i)) goto out_free;
/* Set decoder parameters to decode trace packets */ @@ -387,7 +421,7 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm) CS_ETM_OPERATION_DECODE)) goto out_free;
- etmq->decoder = cs_etm_decoder__new(etm->num_cpu, &d_params, t_params); + etmq->decoder = cs_etm_decoder__new(num_cpu, &d_params, t_params);
zfree(&t_params);
@@ -422,13 +456,13 @@ static int cs_etm__setup_queue(struct cs_etm_auxtrace *etm, struct auxtrace_queue *queue, unsigned int queue_nr) { - int ret = 0; + int ret = 0, cpu = queue->cpu; struct cs_etm_queue *etmq = queue->priv;
if (list_empty(&queue->head) || etmq) goto out;
- etmq = cs_etm__alloc_queue(etm); + etmq = cs_etm__alloc_queue(etm, cpu);
if (!etmq) { ret = -ENOMEM; @@ -438,7 +472,7 @@ static int cs_etm__setup_queue(struct cs_etm_auxtrace *etm, queue->priv = etmq; etmq->etm = etm; etmq->queue_nr = queue_nr; - etmq->cpu = queue->cpu; + etmq->cpu = cpu; etmq->tid = queue->tid; etmq->pid = -1; etmq->offset = 0;
Make function cs_etm__dump_event() aware of CPU-wide trace scenarios and allocate the minimum amount of decoder required to handle the auxtrace buffers it is dealing with.
Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org --- tools/perf/util/cs-etm.c | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index f7c78c39266c..f75942acfdc3 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -182,10 +182,10 @@ static int cs_etm__init_decoder_params(struct cs_etm_decoder_params *d_params, return ret; }
-static void cs_etm__dump_event(struct cs_etm_auxtrace *etm, +static void cs_etm__dump_event(struct cs_etm_auxtrace *etm, int cpu, struct auxtrace_buffer *buffer) { - int ret; + int ret, num_cpu; const char *color = PERF_COLOR_BLUE; struct cs_etm_decoder_params d_params; struct cs_etm_trace_params *t_params; @@ -197,12 +197,20 @@ static void cs_etm__dump_event(struct cs_etm_auxtrace *etm, ". ... CoreSight ETM Trace data: size %zu bytes\n", buffer->size);
+ /* + * If working on a per-thread scenario, each etmq needs to have + * a decoder capable of handling packets from all CPUs in the session. + * If working on CPU-wide scenario each etmq needs to deal with the + * CPU is has been assocated with. + */ + num_cpu = etm->timeless_decoding ? etm->num_cpu : 1; + /* Use metadata to fill in trace parameters for trace decoder */ - t_params = zalloc(sizeof(*t_params) * etm->num_cpu); + t_params = zalloc(sizeof(*t_params) * num_cpu); if (!t_params) return;
- if (cs_etm__init_trace_params(t_params, etm, -1, CS_ETM_PROTO_ETMV4i)) + if (cs_etm__init_trace_params(t_params, etm, cpu, CS_ETM_PROTO_ETMV4i)) return;
/* Set decoder parameters to simply print the trace packets */ @@ -210,7 +218,7 @@ static void cs_etm__dump_event(struct cs_etm_auxtrace *etm, CS_ETM_OPERATION_PRINT)) return;
- decoder = cs_etm_decoder__new(etm->num_cpu, &d_params, t_params); + decoder = cs_etm_decoder__new(num_cpu, &d_params, t_params);
zfree(&t_params);
@@ -1376,7 +1384,8 @@ static int cs_etm__process_auxtrace_event(struct perf_session *session,
if (dump_trace) if (auxtrace_buffer__get_data(buffer, fd)) { - cs_etm__dump_event(etm, buffer); + cs_etm__dump_event(etm, event->auxtrace.cpu, + buffer); auxtrace_buffer__put_data(buffer); } }
This patch deals with timestamp packets received from the decoding library in order to give the front end packet processing loop a handle on the time instruction conveyed by range packets have been executed at.
Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org --- tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 104 +++++++++++++++++++++++- tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 1 + 2 files changed, 104 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c index 3e69abbfa2c5..016180de40f1 100644 --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c @@ -39,8 +39,11 @@ struct cs_etm_decoder { cs_etm_mem_cb_type mem_access; ocsd_datapath_resp_t prev_return; u32 packet_count; + u32 instr_count; u32 head; u32 tail; + u64 start_timestamp; + u64 end_timestamp; struct cs_etm_packet packet_buffer[MAX_BUFFER]; };
@@ -321,6 +324,7 @@ cs_etm_decoder__buffer_range(struct cs_etm_decoder *decoder,
packet->start_addr = elem->st_addr; packet->end_addr = elem->en_addr; + switch (elem->last_i_type) { case OCSD_INSTR_BR: case OCSD_INSTR_BR_INDIRECT: @@ -334,6 +338,18 @@ cs_etm_decoder__buffer_range(struct cs_etm_decoder *decoder, break; }
+ /* + * At the beginning of a trace session or trace discontinuity, the + * tracers will generate a timestamp packet _after_ a range packet + * has been generated. In order to know when traces started to be + * executed we need to keep track of the amount of instructions that + * were processed. That way when the timestamp packet gets generated + * we can _estimate_ the start time by substracting the number of + * instructions that were executed from the timestamp value. + * Timestamp estimation is done in cs_etm_decoder__buffer_timestamp(). + */ + decoder->instr_count += elem->num_instr_range; + return ret; }
@@ -376,6 +392,54 @@ cs_etm_decoder__set_tid(struct cs_etm_decoder *decoder, return ret; }
+static void +cs_etm_decoder__reset_time_statistics(struct cs_etm_decoder *decoder) +{ + decoder->instr_count = 0; + decoder->start_timestamp = 0; + decoder->end_timestamp = 0; +} + +static ocsd_datapath_resp_t +cs_etm_decoder__update_time_statistics(struct cs_etm_decoder *decoder, + const ocsd_generic_trace_elem *elem) +{ + /* + * At the beginning of traces or upon trace discontinuity the time + * instructions started to be generated is not known since timestamp + * packets are generated _after_ a range packet. As such estimate + * the start timestamp by substracting the amount of instructions + * encountered in the range packet(s) from the current timestamp. + * + * There are also cases where we get a timestamp when no instructions + * have been received, ex. when the PE takes an exception or returns + * from one. In this case @instr_count is zero and the algorithm + * below works since statistics are reset in function + * cs_etm_decoder__gen_trace_elem_printer(). + */ + if (!decoder->start_timestamp) + decoder->start_timestamp = elem->timestamp - + decoder->instr_count; + else + decoder->start_timestamp = decoder->end_timestamp; + + /* + * @instr_count represent the amount of instructions executed + * since the last timestamp. + */ + decoder->instr_count = 0; + + /* + * Since timestamp packets are generated after range packets, the + * timestamp in this packet becomes the time at which instructions + * in the coming range packets began to execute. + */ + decoder->end_timestamp = elem->timestamp; + + /* Halt processing until we are being told to proceed */ + return OCSD_RESP_WAIT; +} + static ocsd_datapath_resp_t cs_etm_decoder__gen_trace_elem_printer( const void *context, const ocsd_trc_index_t indx __maybe_unused, @@ -390,8 +454,12 @@ static ocsd_datapath_resp_t cs_etm_decoder__gen_trace_elem_printer( break; case OCSD_GEN_TRC_ELEM_NO_SYNC: decoder->trace_on = false; + cs_etm_decoder__reset_time_statistics(decoder); break; case OCSD_GEN_TRC_ELEM_TRACE_ON: + /* A discontinuity has been observed, reset statistics. */ + cs_etm_decoder__reset_time_statistics(decoder); + resp = cs_etm_decoder__buffer_trace_on(decoder, trace_chan_id); decoder->trace_on = true; @@ -401,18 +469,32 @@ static ocsd_datapath_resp_t cs_etm_decoder__gen_trace_elem_printer( trace_chan_id); break; case OCSD_GEN_TRC_ELEM_EXCEPTION: + /* A timestamp is generated after a PE takes an exception */ + cs_etm_decoder__reset_time_statistics(decoder); decoder->packet_buffer[decoder->tail].exc = true; break; case OCSD_GEN_TRC_ELEM_EXCEPTION_RET: + /* + * A timestamp is generated when a PE returns + * from an exception. + */ + cs_etm_decoder__reset_time_statistics(decoder); decoder->packet_buffer[decoder->tail].exc_ret = true; break; case OCSD_GEN_TRC_ELEM_PE_CONTEXT: + /* + * The context information has changed, better reset the + * time statistics. + */ + cs_etm_decoder__reset_time_statistics(decoder); resp = cs_etm_decoder__set_tid(decoder, elem, trace_chan_id); break; + case OCSD_GEN_TRC_ELEM_TIMESTAMP: + resp = cs_etm_decoder__update_time_statistics(decoder, elem); + break; case OCSD_GEN_TRC_ELEM_EO_TRACE: case OCSD_GEN_TRC_ELEM_ADDR_NACC: - case OCSD_GEN_TRC_ELEM_TIMESTAMP: case OCSD_GEN_TRC_ELEM_CYCLE_COUNT: case OCSD_GEN_TRC_ELEM_ADDR_UNKNOWN: case OCSD_GEN_TRC_ELEM_EVENT: @@ -582,6 +664,18 @@ int cs_etm_decoder__process_data_block(struct cs_etm_decoder *decoder, decoder->prev_return = cur; *consumed = processed;
+ /* + * If we are at the end of this block and @instr_count isn't 0, + * we have processed range packet(s) but haven't received a timestamp + * for them. As such estimate one so that the front end packet + * processing loop isn't impacted. + */ + if (processed == len && decoder->instr_count) { + decoder->start_timestamp = decoder->end_timestamp; + decoder->end_timestamp += decoder->instr_count; + decoder->instr_count = 0; + } + return ret; }
@@ -594,3 +688,11 @@ void cs_etm_decoder__free(struct cs_etm_decoder *decoder) decoder->dcd_tree = NULL; free(decoder); } + +u64 cs_etm_decoder__get_timestamp(struct cs_etm_decoder *decoder) +{ + if (!decoder) + return 0; + + return decoder->start_timestamp; +} diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h index a61a8f75a135..1f560545f135 100644 --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h @@ -99,5 +99,6 @@ int cs_etm_decoder__get_packet(struct cs_etm_decoder *decoder, struct cs_etm_packet *packet);
int cs_etm_decoder__reset(struct cs_etm_decoder *decoder); +u64 cs_etm_decoder__get_timestamp(struct cs_etm_decoder *decoder);
#endif /* INCLUDE__CS_ETM_DECODER_H__ */
On Thu, 14 Jun 2018 14:26:55 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
@@ -334,6 +338,18 @@ cs_etm_decoder__buffer_range(struct cs_etm_decoder *decoder, break; }
- /*
* At the beginning of a trace session or trace discontinuity, the
* tracers will generate a timestamp packet _after_ a range packet
* has been generated. In order to know when traces started to be
* executed we need to keep track of the amount of instructions that
* were processed. That way when the timestamp packet gets generated
* we can _estimate_ the start time by substracting the number of
* instructions that were executed from the timestamp value.
* Timestamp estimation is done in cs_etm_decoder__buffer_timestamp().
*/
- decoder->instr_count += elem->num_instr_range;
so.. using your branch, this line doesn't build, native, gcc 6.3, with O= set:
util/cs-etm-decoder/cs-etm-decoder.c: In function ‘cs_etm_decoder__buffer_range’: util/cs-etm-decoder/cs-etm-decoder.c:351:30: error: ‘ocsd_generic_trace_elem {aka const struct _ocsd_generic_trace_elem}’ has no member named ‘num_instr_range’ decoder->instr_count += elem->num_instr_range; ^~ mv: cannot stat '/nfsroot/linux/tools/arm64-perf/util/cs-etm-decoder/.cs-etm-decoder.o.tmp': No such file or directory linux/tools/build/Makefile.build:96: recipe for target 'linux/tools/arm64-perf/util/cs-etm-decoder/cs-etm-decoder.o' failed
Did the line adding num_instr_range get somehow lost on its way into the patch?
Kim
On Fri, 15 Jun 2018 at 13:39, Kim Phillips kim.phillips@arm.com wrote:
On Thu, 14 Jun 2018 14:26:55 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
@@ -334,6 +338,18 @@ cs_etm_decoder__buffer_range(struct cs_etm_decoder *decoder, break; }
/*
* At the beginning of a trace session or trace discontinuity, the
* tracers will generate a timestamp packet _after_ a range packet
* has been generated. In order to know when traces started to be
* executed we need to keep track of the amount of instructions that
* were processed. That way when the timestamp packet gets generated
* we can _estimate_ the start time by substracting the number of
* instructions that were executed from the timestamp value.
* Timestamp estimation is done in cs_etm_decoder__buffer_timestamp().
*/
decoder->instr_count += elem->num_instr_range;
so.. using your branch, this line doesn't build, native, gcc 6.3, with O= set:
util/cs-etm-decoder/cs-etm-decoder.c: In function ‘cs_etm_decoder__buffer_range’: util/cs-etm-decoder/cs-etm-decoder.c:351:30: error: ‘ocsd_generic_trace_elem {aka const struct _ocsd_generic_trace_elem}’ has no member named ‘num_instr_range’ decoder->instr_count += elem->num_instr_range; ^~ mv: cannot stat '/nfsroot/linux/tools/arm64-perf/util/cs-etm-decoder/.cs-etm-decoder.o.tmp': No such file or directory linux/tools/build/Makefile.build:96: recipe for target 'linux/tools/arm64-perf/util/cs-etm-decoder/cs-etm-decoder.o' failed
Did the line adding num_instr_range get somehow lost on its way into the patch?
That is not the case nor is it a compiler version problem. I completely forgot to highlight this set need to be compiled with the latest version of the openCSD library where the decoder counts the amount of instruction present in a range.
Thanks for pointing this out.
Kim
On Fri, 15 Jun 2018 15:38:27 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Fri, 15 Jun 2018 at 13:39, Kim Phillips kim.phillips@arm.com wrote:
On Thu, 14 Jun 2018 14:26:55 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
decoder->instr_count += elem->num_instr_range;
so.. using your branch, this line doesn't build, native, gcc 6.3, with O= set:
util/cs-etm-decoder/cs-etm-decoder.c: In function ‘cs_etm_decoder__buffer_range’: util/cs-etm-decoder/cs-etm-decoder.c:351:30: error: ‘ocsd_generic_trace_elem {aka const struct _ocsd_generic_trace_elem}’ has no member named ‘num_instr_range’ decoder->instr_count += elem->num_instr_range; ^~ mv: cannot stat '/nfsroot/linux/tools/arm64-perf/util/cs-etm-decoder/.cs-etm-decoder.o.tmp': No such file or directory linux/tools/build/Makefile.build:96: recipe for target 'linux/tools/arm64-perf/util/cs-etm-decoder/cs-etm-decoder.o' failed
Did the line adding num_instr_range get somehow lost on its way into the patch?
That is not the case nor is it a compiler version problem. I completely forgot to highlight this set need to be compiled with the latest version of the openCSD library where the decoder counts the amount of instruction present in a range.
The code ought to be able to handle, and remain backwards compatible with an older version of the library.
Thanks,
Kim
On Fri, 15 Jun 2018 at 15:54, Kim Phillips kim.phillips@arm.com wrote:
On Fri, 15 Jun 2018 15:38:27 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Fri, 15 Jun 2018 at 13:39, Kim Phillips kim.phillips@arm.com wrote:
On Thu, 14 Jun 2018 14:26:55 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
decoder->instr_count += elem->num_instr_range;
so.. using your branch, this line doesn't build, native, gcc 6.3, with O= set:
util/cs-etm-decoder/cs-etm-decoder.c: In function ‘cs_etm_decoder__buffer_range’: util/cs-etm-decoder/cs-etm-decoder.c:351:30: error: ‘ocsd_generic_trace_elem {aka const struct _ocsd_generic_trace_elem}’ has no member named ‘num_instr_range’ decoder->instr_count += elem->num_instr_range; ^~ mv: cannot stat '/nfsroot/linux/tools/arm64-perf/util/cs-etm-decoder/.cs-etm-decoder.o.tmp': No such file or directory linux/tools/build/Makefile.build:96: recipe for target 'linux/tools/arm64-perf/util/cs-etm-decoder/cs-etm-decoder.o' failed
Did the line adding num_instr_range get somehow lost on its way into the patch?
That is not the case nor is it a compiler version problem. I completely forgot to highlight this set need to be compiled with the latest version of the openCSD library where the decoder counts the amount of instruction present in a range.
The code ought to be able to handle, and remain backwards compatible with an older version of the library.
Once again I do not agree. It is only normal that developers take advantage of new functionality available in a library and it happens all the time. I think Rob was looking at a way to warn users about an openCSD that is too old.
Thanks,
Kim
On Fri, 15 Jun 2018 16:22:33 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Fri, 15 Jun 2018 at 15:54, Kim Phillips kim.phillips@arm.com wrote:
On Fri, 15 Jun 2018 15:38:27 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Fri, 15 Jun 2018 at 13:39, Kim Phillips kim.phillips@arm.com wrote:
On Thu, 14 Jun 2018 14:26:55 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
decoder->instr_count += elem->num_instr_range;
so.. using your branch, this line doesn't build, native, gcc 6.3, with O= set:
util/cs-etm-decoder/cs-etm-decoder.c: In function ‘cs_etm_decoder__buffer_range’: util/cs-etm-decoder/cs-etm-decoder.c:351:30: error: ‘ocsd_generic_trace_elem {aka const struct _ocsd_generic_trace_elem}’ has no member named ‘num_instr_range’ decoder->instr_count += elem->num_instr_range; ^~ mv: cannot stat '/nfsroot/linux/tools/arm64-perf/util/cs-etm-decoder/.cs-etm-decoder.o.tmp': No such file or directory linux/tools/build/Makefile.build:96: recipe for target 'linux/tools/arm64-perf/util/cs-etm-decoder/cs-etm-decoder.o' failed
Did the line adding num_instr_range get somehow lost on its way into the patch?
That is not the case nor is it a compiler version problem. I completely forgot to highlight this set need to be compiled with the latest version of the openCSD library where the decoder counts the amount of instruction present in a range.
The code ought to be able to handle, and remain backwards compatible with an older version of the library.
Once again I do not agree. It is only normal that developers take advantage of new functionality available in a library and it happens all the time.
Not when it breaks the build for the older version of the library. Where do you see that happening with the upstream perf tool?
I think Rob was looking at a way to warn users about an openCSD that is too old.
That's fine, but perf should still build without crippling users with older versions of libraries installed.
Kim
On Mon, 18 Jun 2018 at 15:25, Kim Phillips kim.phillips@arm.com wrote:
On Fri, 15 Jun 2018 16:22:33 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Fri, 15 Jun 2018 at 15:54, Kim Phillips kim.phillips@arm.com wrote:
On Fri, 15 Jun 2018 15:38:27 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Fri, 15 Jun 2018 at 13:39, Kim Phillips kim.phillips@arm.com wrote:
On Thu, 14 Jun 2018 14:26:55 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
decoder->instr_count += elem->num_instr_range;
so.. using your branch, this line doesn't build, native, gcc 6.3, with O= set:
util/cs-etm-decoder/cs-etm-decoder.c: In function ‘cs_etm_decoder__buffer_range’: util/cs-etm-decoder/cs-etm-decoder.c:351:30: error: ‘ocsd_generic_trace_elem {aka const struct _ocsd_generic_trace_elem}’ has no member named ‘num_instr_range’ decoder->instr_count += elem->num_instr_range; ^~ mv: cannot stat '/nfsroot/linux/tools/arm64-perf/util/cs-etm-decoder/.cs-etm-decoder.o.tmp': No such file or directory linux/tools/build/Makefile.build:96: recipe for target 'linux/tools/arm64-perf/util/cs-etm-decoder/cs-etm-decoder.o' failed
Did the line adding num_instr_range get somehow lost on its way into the patch?
That is not the case nor is it a compiler version problem. I completely forgot to highlight this set need to be compiled with the latest version of the openCSD library where the decoder counts the amount of instruction present in a range.
The code ought to be able to handle, and remain backwards compatible with an older version of the library.
Once again I do not agree. It is only normal that developers take advantage of new functionality available in a library and it happens all the time.
Not when it breaks the build for the older version of the library. Where do you see that happening with the upstream perf tool?
Any library embedded in the perf tools (as openCSD is) works exactly the same way. If people are doing development work they are expected to work with the latest revision of the library they are using. For distributions the linux tools package picks up the latest revision.
I think Rob was looking at a way to warn users about an openCSD that is too old.
That's fine, but perf should still build without crippling users with older versions of libraries installed.
And how do we do this? We add #ifdefs everywhere? Every tool I've recompiled needs a specific version of a library... This standard practice.
Kim
On Mon, 18 Jun 2018 15:51:34 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Mon, 18 Jun 2018 at 15:25, Kim Phillips kim.phillips@arm.com wrote:
On Fri, 15 Jun 2018 16:22:33 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Fri, 15 Jun 2018 at 15:54, Kim Phillips kim.phillips@arm.com wrote:
On Fri, 15 Jun 2018 15:38:27 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Fri, 15 Jun 2018 at 13:39, Kim Phillips kim.phillips@arm.com wrote:
On Thu, 14 Jun 2018 14:26:55 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote: > + decoder->instr_count += elem->num_instr_range;
so.. using your branch, this line doesn't build, native, gcc 6.3, with O= set:
util/cs-etm-decoder/cs-etm-decoder.c: In function ‘cs_etm_decoder__buffer_range’: util/cs-etm-decoder/cs-etm-decoder.c:351:30: error: ‘ocsd_generic_trace_elem {aka const struct _ocsd_generic_trace_elem}’ has no member named ‘num_instr_range’ decoder->instr_count += elem->num_instr_range; ^~ mv: cannot stat '/nfsroot/linux/tools/arm64-perf/util/cs-etm-decoder/.cs-etm-decoder.o.tmp': No such file or directory linux/tools/build/Makefile.build:96: recipe for target 'linux/tools/arm64-perf/util/cs-etm-decoder/cs-etm-decoder.o' failed
Did the line adding num_instr_range get somehow lost on its way into the patch?
That is not the case nor is it a compiler version problem. I completely forgot to highlight this set need to be compiled with the latest version of the openCSD library where the decoder counts the amount of instruction present in a range.
The code ought to be able to handle, and remain backwards compatible with an older version of the library.
Once again I do not agree. It is only normal that developers take advantage of new functionality available in a library and it happens all the time.
Not when it breaks the build for the older version of the library. Where do you see that happening with the upstream perf tool?
Any library embedded in the perf tools (as openCSD is) works exactly
The openCSD library is not embedded in the perf tool source tree. libopencsd is a standalone library: it's available by itself, and tools like perf can optionally detect and use it.
the same way. If people are doing development work they are expected to work with the latest revision of the library they are using. For
Not necessarily: Hello world doesn't require the latest C library.
distributions the linux tools package picks up the latest revision.
libopencsd will only be updated upon distribution maintainer and user update action. Meanwhile, perf should continue to build no matter what.
I think Rob was looking at a way to warn users about an openCSD that is too old.
That's fine, but perf should still build without crippling users with older versions of libraries installed.
And how do we do this? We add #ifdefs everywhere? Every tool I've
That would work.
recompiled needs a specific version of a library... This standard practice.
Not in the upstream perf tool.
Kim
On Mon, 18 Jun 2018 at 16:29, Kim Phillips kim.phillips@arm.com wrote:
On Mon, 18 Jun 2018 15:51:34 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Mon, 18 Jun 2018 at 15:25, Kim Phillips kim.phillips@arm.com wrote:
On Fri, 15 Jun 2018 16:22:33 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Fri, 15 Jun 2018 at 15:54, Kim Phillips kim.phillips@arm.com wrote:
On Fri, 15 Jun 2018 15:38:27 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Fri, 15 Jun 2018 at 13:39, Kim Phillips kim.phillips@arm.com wrote: > > On Thu, 14 Jun 2018 14:26:55 -0600 > Mathieu Poirier mathieu.poirier@linaro.org wrote: > > + decoder->instr_count += elem->num_instr_range; > > so.. using your branch, this line doesn't build, native, gcc 6.3, with > O= set: > > util/cs-etm-decoder/cs-etm-decoder.c: In function ‘cs_etm_decoder__buffer_range’: > util/cs-etm-decoder/cs-etm-decoder.c:351:30: error: ‘ocsd_generic_trace_elem {aka const struct _ocsd_generic_trace_elem}’ has no member named ‘num_instr_range’ > decoder->instr_count += elem->num_instr_range; > ^~ > mv: cannot stat '/nfsroot/linux/tools/arm64-perf/util/cs-etm-decoder/.cs-etm-decoder.o.tmp': No such file or directory > linux/tools/build/Makefile.build:96: recipe for target 'linux/tools/arm64-perf/util/cs-etm-decoder/cs-etm-decoder.o' failed > > Did the line adding num_instr_range get somehow lost on its way into > the patch?
That is not the case nor is it a compiler version problem. I completely forgot to highlight this set need to be compiled with the latest version of the openCSD library where the decoder counts the amount of instruction present in a range.
The code ought to be able to handle, and remain backwards compatible with an older version of the library.
Once again I do not agree. It is only normal that developers take advantage of new functionality available in a library and it happens all the time.
Not when it breaks the build for the older version of the library. Where do you see that happening with the upstream perf tool?
Any library embedded in the perf tools (as openCSD is) works exactly
The openCSD library is not embedded in the perf tool source tree. libopencsd is a standalone library: it's available by itself, and tools like perf can optionally detect and use it.
the same way. If people are doing development work they are expected to work with the latest revision of the library they are using. For
Not necessarily: Hello world doesn't require the latest C library.
Try using the latest features of the C library on any old system, things will break.
distributions the linux tools package picks up the latest revision.
libopencsd will only be updated upon distribution maintainer and user update action. Meanwhile, perf should continue to build no matter what.
Distribution perf tools and the openCSD library are always in sync, that's not a problem.
When downloading the latest revision of the perf tools, it is normal that the latest features may not be present in the libraries available on the system - this is true for any library. OpenCSD provides flags to compile with an alternate (i.e newer) revision of the library exactly for those situations. Those are well documented and I've used them many times.
I think Rob was looking at a way to warn users about an openCSD that is too old.
That's fine, but perf should still build without crippling users with older versions of libraries installed.
And how do we do this? We add #ifdefs everywhere? Every tool I've
That would work.
That is impossible to maintain, which is why people simply don't do that.
recompiled needs a specific version of a library... This standard practice.
Not in the upstream perf tool.
Then we agree to disagree.
Kim
On Mon, 18 Jun 2018 16:52:01 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Mon, 18 Jun 2018 at 16:29, Kim Phillips kim.phillips@arm.com wrote:
On Mon, 18 Jun 2018 15:51:34 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Mon, 18 Jun 2018 at 15:25, Kim Phillips kim.phillips@arm.com wrote:
On Fri, 15 Jun 2018 16:22:33 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
On Fri, 15 Jun 2018 at 15:54, Kim Phillips kim.phillips@arm.com wrote:
The code ought to be able to handle, and remain backwards compatible with an older version of the library.
Once again I do not agree. It is only normal that developers take advantage of new functionality available in a library and it happens all the time.
Not when it breaks the build for the older version of the library. Where do you see that happening with the upstream perf tool?
Any library embedded in the perf tools (as openCSD is) works exactly
The openCSD library is not embedded in the perf tool source tree. libopencsd is a standalone library: it's available by itself, and tools like perf can optionally detect and use it.
the same way. If people are doing development work they are expected to work with the latest revision of the library they are using. For
Not necessarily: Hello world doesn't require the latest C library.
Try using the latest features of the C library on any old system, things will break.
which is why we have tools like autoconf to handle working around any missing features.
distributions the linux tools package picks up the latest revision.
libopencsd will only be updated upon distribution maintainer and user update action. Meanwhile, perf should continue to build no matter what.
Distribution perf tools and the openCSD library are always in sync, that's not a problem.
Not on my system. I don't know how you can say that that will be guaranteed going forward, either.
When downloading the latest revision of the perf tools, it is normal that the latest features may not be present in the libraries available on the system - this is true for any library. OpenCSD provides flags to compile with an alternate (i.e newer) revision of the library exactly for those situations. Those are well documented and I've used them many times.
Right but any version of perf should be able to build against any version of libopencsd. If a feature isn't present, it should fall back to its old behaviour.
I think Rob was looking at a way to warn users about an openCSD that is too old.
That's fine, but perf should still build without crippling users with older versions of libraries installed.
And how do we do this? We add #ifdefs everywhere? Every tool I've
That would work.
That is impossible to maintain, which is why people simply don't do that.
There are ifdefs in the perf tool source, granted not a lot, but this is what we have to do if libopencsd is going to add features in an incompatible manner.
recompiled needs a specific version of a library... This standard practice.
Not in the upstream perf tool.
Then we agree to disagree.
I've looked, and haven't been able to find an instance of code in the perf tool source that fails to build given a newer version of a library is not available. Where does it do this?
Can the dependent libopencsd change be possibly reverted, and re-done in a backward-compatible manner?
Thanks,
Kim
Making cs_etm_decoder__clear_buffer() public so that it can be used in the main packet processing loop to clear the decoder's packet queue when needed.
Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org --- tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 2 +- tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 1 + 2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c index 016180de40f1..6f9caa437b8d 100644 --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c @@ -256,7 +256,7 @@ cs_etm_decoder__create_etm_packet_printer(struct cs_etm_trace_params *t_params, trace_config); }
-static void cs_etm_decoder__clear_buffer(struct cs_etm_decoder *decoder) +void cs_etm_decoder__clear_buffer(struct cs_etm_decoder *decoder) { int i;
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h index 1f560545f135..5b0449b8495b 100644 --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h @@ -90,6 +90,7 @@ cs_etm_decoder__new(int num_cpu, struct cs_etm_trace_params t_params[]);
void cs_etm_decoder__free(struct cs_etm_decoder *decoder); +void cs_etm_decoder__clear_buffer(struct cs_etm_decoder *decoder);
int cs_etm_decoder__add_mem_access_cb(struct cs_etm_decoder *decoder, u64 start, u64 end,
Add support for CPU-wide trace scenarios by correlating range packets with timestamp packets. That way range packets received on different CPUs can be processed and synthesized in chronological order (min heap).
Signed-off-by: Mathieu Poirier mathieu.poirier@linaro.org --- tools/perf/util/cs-etm.c | 161 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 153 insertions(+), 8 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index f75942acfdc3..91d50023a590 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -89,8 +89,12 @@ struct cs_etm_queue { };
static int cs_etm__update_queues(struct cs_etm_auxtrace *etm); +static int cs_etm__process_queues(struct cs_etm_auxtrace *etm); static int cs_etm__process_timeless_queues(struct cs_etm_auxtrace *etm, pid_t tid, u64 time_); +static int cs_etm__get_trace(struct cs_etm_queue *etmq); +static int cs_etm__get_data_block(struct cs_etm_queue *etmq); +static int cs_etm__decode_data_block(struct cs_etm_queue *etmq);
static void cs_etm__packet_dump(const char *pkt_string) { @@ -253,15 +257,16 @@ static int cs_etm__flush_events(struct perf_session *session, if (!tool->ordered_events) return -EINVAL;
- if (!etm->timeless_decoding) - return -EINVAL; - ret = cs_etm__update_queues(etm);
if (ret < 0) return ret;
- return cs_etm__process_timeless_queues(etm, -1, MAX_TIMESTAMP - 1); + if (etm->timeless_decoding) + return cs_etm__process_timeless_queues(etm, -1, + MAX_TIMESTAMP - 1); + + return cs_etm__process_queues(etm); }
static void cs_etm__free_queue(void *priv) @@ -465,6 +470,7 @@ static int cs_etm__setup_queue(struct cs_etm_auxtrace *etm, unsigned int queue_nr) { int ret = 0, cpu = queue->cpu; + u64 timestamp; struct cs_etm_queue *etmq = queue->priv;
if (list_empty(&queue->head) || etmq) @@ -486,6 +492,66 @@ static int cs_etm__setup_queue(struct cs_etm_auxtrace *etm, etmq->offset = 0; etmq->period_instructions = 0;
+ if (etm->timeless_decoding) + goto out; + + /* + * We are under a CPU-wide trace scenario. As such we need to know + * when the code that generated the traces started to execute so that + * it can be correlated with execution on other CPUs. So we get a + * handle on the beginning of traces and decode until we find a + * timestamp. Substracted from that timestamp is the amount of + * instruction that were executed in the range packet(s) that have + * been reported, yielding a fairly good estimate of then the trace + * started. + */ + + while (1) { + /* + * Fetch an aux_buffer from this etmq. Bail if no more + * blocks or an error has been encountered. + */ + ret = cs_etm__get_data_block(etmq); + if (ret <= 0) + goto out; + + /* + * Run decoder on the trace block. The decoder will stop when + * encountering a timestamp of end of trace for that block. + */ + ret = cs_etm__decode_data_block(etmq); + if (ret) + goto out; + + /* The decoder code does all the timestamp calculation for us */ + timestamp = cs_etm_decoder__get_timestamp(etmq->decoder); + + /* We found a timestamp, no need to continue. */ + if (timestamp) + break; + + /* + * We didn't find a timestamp so empty the decoder packet queue + * before fetching another data block. Packets that were + * decoded are useless since no timestamp have been associated + * with them. + */ + cs_etm_decoder__clear_buffer(etmq->decoder); + } + + + /* + * Add to the min heap the time at which execution of traces in the + * first range started. Once the same has been done for each etmq, + * redenring and decoding can start in chronological order. + * + * Note that decoded packets are still in the decoder's packet queue + * and will be processed in cs_etm__process_queues(). + */ + ret = auxtrace_heap__add(&etm->heap, queue_nr, timestamp); + if (ret) + goto out; + out: return ret; } @@ -1246,6 +1312,83 @@ static int cs_etm__process_timeless_queues(struct cs_etm_auxtrace *etm, return 0; }
+static int cs_etm__process_queues(struct cs_etm_auxtrace *etm) +{ + int ret = 0; + unsigned int queue_nr; + u64 timestamp; + struct auxtrace_queue *queue; + struct cs_etm_queue *etmq; + + while (1) { + if (!etm->heap.heap_cnt) + goto out; + + /* Take the entry at the top of the min heap */ + queue_nr = etm->heap.heap_array[0].queue_nr; + queue = &etm->queues.queue_array[queue_nr]; + etmq = queue->priv; + + auxtrace_heap__pop(&etm->heap); + + /* + * Packets associated with this timestamp are already in + * the decoder queue, so process them. + */ + ret = cs_etm__process_decoder_queue(etmq); + if (ret < 0) + goto out; + + /* + * Packets for this timestamp have been process, time to + * move on to the next timestamp, fetching a new aux_buffer if + * need be. + */ + ret = cs_etm__get_data_block(etmq); + if (ret < 0) + goto out; + + /* + * No more blocks to process in this queue, simply move on + * with the other queues. + */ + if (!ret) + continue; + + ret = cs_etm__decode_data_block(etmq); + if (ret) + goto out; + + timestamp = cs_etm_decoder__get_timestamp(etmq->decoder); + + if (!timestamp) { + /* + * Looks like there is no more traces to process... + * Empty the queue. + */ + ret = cs_etm__process_decoder_queue(etmq); + + /* + * Generate an instruction sample with the remaining + * branchstack entries. + */ + cs_etm__flush(etmq); + continue; + } + + /* + * Add to the min heap the timestamp for packets that have + * just been decoded. They will be processed and synthesized + * during the next call to cs_etm__process_decoder_queue() for + * this queue. + */ + ret = auxtrace_heap__add(&etm->heap, queue_nr, timestamp); + } + +out: + return ret; +} + static int cs_etm__process_itrace_start(struct cs_etm_auxtrace *etm, union perf_event *event) { @@ -1328,9 +1471,6 @@ static int cs_etm__process_event(struct perf_session *session, return -EINVAL; }
- if (!etm->timeless_decoding) - return -EINVAL; - if (sample->time && (sample->time != (u64) -1)) timestamp = sample->time; else @@ -1342,11 +1482,16 @@ static int cs_etm__process_event(struct perf_session *session, return err; }
- if (event->header.type == PERF_RECORD_EXIT) + if (etm->timeless_decoding && + event->header.type == PERF_RECORD_EXIT) return cs_etm__process_timeless_queues(etm, event->fork.tid, sample->time);
+ if (!etm->timeless_decoding && + event->header.type == PERF_RECORD_AUX) + return cs_etm__process_queues(etm); + if (event->header.type == PERF_RECORD_ITRACE_START) return cs_etm__process_itrace_start(etm, event); else if (event->header.type == PERF_RECORD_SWITCH_CPU_WIDE)
Open question:
At this time the implementation supports tracing a single CPU since the only HW we have exhibit an N:1 source/sink topology. The HW itself does support collecting traces from more than one source but using the feature in this way could be very confusing and mislead users.
For example the following:
# perf record -e cs_etm/20070000.etr/ -C 2,3 application1
would end up tracing everyting that is happening on CPU 2 and 3 for as long as appliation1 is executing. Because the HW doesn't give us an interrupt when buffers are full, traces from one CPU could easily clobber traces from the other, giving the impression that nothing was executed on the latter.
You only really get that impression if you expect it to behave like ordinary perf events traced into a per-CPU ring buffer. It's just a question of remembering that for ETM the trace is usually into a shared buffer and older trace _of an inactive CPU_ might be lost.
Two _active_ CPUs will interleave into the buffer, and being able to get trace of two communicating processes on separate CPUs is very useful. The benefits of being able to trace simultaneously active CPUs outweighs the small risk of someone misinterpreting trace when one CPU happens to be inactive.
Al
So this would work:
# perf record -e cs_etm/20070000.etr/ -C 3 application1
I am open to discussion on the topic should someone think of something.
As with the cleanup set this code has been uploaded here [1].
Thanks, Mathieu
[1].https://git.linaro.org/people/mathieu.poirier/coresight.git perf-opencsd- master-cpu-wide-support
Mathieu Poirier (12): perf tools: Add defines for CONTEXTID configuration perf tools: Configure contextID tracing in CPU-wide mode perf tools: Configure timestsamp generation in CPU-wide mode perf tools: Configure SWITCH_EVENTS in CPU-wide mode perf tools: Add handling of itrace start events perf tools: Add handling of switch-CPU-wide events perf tools: Linking PE contextID with perf thread mechanic perf tools: Allocate decoder tree as needed perf tools: Make cs_etm__dump_event() work with CPU-wide scenarios perf tools: Add notion of time to the decoding code perf tools: Make function cs_etm_decoder__clear_buffer() public perf tools: Add support for CPU-wide trace scenarios
include/linux/coresight-pmu.h | 2 + tools/include/linux/coresight-pmu.h | 2 + tools/perf/arch/arm/util/cs-etm.c | 174 ++++++++++-- tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 140 +++++++++- tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 4 +- tools/perf/util/cs-etm.c | 334 ++++++++++++++++++++++-- tools/perf/util/cs-etm.h | 17 ++ 7 files changed, 623 insertions(+), 50 deletions(-)
-- 2.7.4
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
On Thu, 14 Jun 2018 at 15:42, Al Grant Al.Grant@arm.com wrote:
Open question:
At this time the implementation supports tracing a single CPU since the only HW we have exhibit an N:1 source/sink topology. The HW itself does support collecting traces from more than one source but using the feature in this way could be very confusing and mislead users.
For example the following:
# perf record -e cs_etm/20070000.etr/ -C 2,3 application1
would end up tracing everyting that is happening on CPU 2 and 3 for as long as appliation1 is executing. Because the HW doesn't give us an interrupt when buffers are full, traces from one CPU could easily clobber traces from the other, giving the impression that nothing was executed on the latter.
You only really get that impression if you expect it to behave like ordinary perf events traced into a per-CPU ring buffer. It's just a question of remembering that for ETM the trace is usually into a shared buffer and older trace _of an inactive CPU_ might be lost.
Two _active_ CPUs will interleave into the buffer, and being able to get trace of two communicating processes on separate CPUs is very useful. The benefits of being able to trace simultaneously active CPUs outweighs the small risk of someone misinterpreting trace when one CPU happens to be inactive.
That's the kind of discussion I had with myself when working on the feature. I decided to make things work with the simplest option, i.e a single CPU, and then add support for multiple CPU should we deem it desirable. I've always been of the opinion that if the HW supports it, we should use it so I'm good with extending the feature. But doing so will require a fair amount of adjustment that need to be carefully considered.
For example calling the following
# perf record -e cs_etm/20070000.etr/ -C 2,3 application1
and
# perf record -e cs_etm/20070000.etr/ -C 2 application1 # perf record -e cs_etm/20070000.etr/ -C 3 application1
look exactly the same to the kernel. That is, both example create two new events and there is currently now way to correlate them. In the first example we need to allow both events to use the same sink while preventing it in the latter.
Thanks for the comments, Mathieu
Al
So this would work:
# perf record -e cs_etm/20070000.etr/ -C 3 application1
I am open to discussion on the topic should someone think of something.
As with the cleanup set this code has been uploaded here [1].
Thanks, Mathieu
[1].https://git.linaro.org/people/mathieu.poirier/coresight.git perf-opencsd- master-cpu-wide-support
Mathieu Poirier (12): perf tools: Add defines for CONTEXTID configuration perf tools: Configure contextID tracing in CPU-wide mode perf tools: Configure timestsamp generation in CPU-wide mode perf tools: Configure SWITCH_EVENTS in CPU-wide mode perf tools: Add handling of itrace start events perf tools: Add handling of switch-CPU-wide events perf tools: Linking PE contextID with perf thread mechanic perf tools: Allocate decoder tree as needed perf tools: Make cs_etm__dump_event() work with CPU-wide scenarios perf tools: Add notion of time to the decoding code perf tools: Make function cs_etm_decoder__clear_buffer() public perf tools: Add support for CPU-wide trace scenarios
include/linux/coresight-pmu.h | 2 + tools/include/linux/coresight-pmu.h | 2 + tools/perf/arch/arm/util/cs-etm.c | 174 ++++++++++-- tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 140 +++++++++- tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 4 +- tools/perf/util/cs-etm.c | 334 ++++++++++++++++++++++-- tools/perf/util/cs-etm.h | 17 ++ 7 files changed, 623 insertions(+), 50 deletions(-)
-- 2.7.4
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
For example calling the following
# perf record -e cs_etm/20070000.etr/ -C 2,3 application1
and
# perf record -e cs_etm/20070000.etr/ -C 2 application1 # perf record -e cs_etm/20070000.etr/ -C 3 application1
look exactly the same to the kernel. That is, both example create two new events and there is currently now way to correlate them. In the first example we need to allow both events to use the same sink while preventing it in the latter.
I'm sorry, I thought in the first example, you were using the file descriptor parameter on perf_event_open to correlate the events - i.e. when creating the event for CPU 3 you pass it the event for CPU 2 and this indicates it will be using the same trace buffer.
And when the commands are separate, they really are two separate events, and if the second one is opened while the first is still active and using the same ETR, the second should fail, probably with EBUSY.
Al
Thanks for the comments, Mathieu
Al
So this would work:
# perf record -e cs_etm/20070000.etr/ -C 3 application1
I am open to discussion on the topic should someone think of something.
As with the cleanup set this code has been uploaded here [1].
Thanks, Mathieu
[1].https://git.linaro.org/people/mathieu.poirier/coresight.git perf-opencsd- master-cpu-wide-support
Mathieu Poirier (12): perf tools: Add defines for CONTEXTID configuration perf tools: Configure contextID tracing in CPU-wide mode perf tools: Configure timestsamp generation in CPU-wide mode perf tools: Configure SWITCH_EVENTS in CPU-wide mode perf tools: Add handling of itrace start events perf tools: Add handling of switch-CPU-wide events perf tools: Linking PE contextID with perf thread mechanic perf tools: Allocate decoder tree as needed perf tools: Make cs_etm__dump_event() work with CPU-wide scenarios perf tools: Add notion of time to the decoding code perf tools: Make function cs_etm_decoder__clear_buffer() public perf tools: Add support for CPU-wide trace scenarios
include/linux/coresight-pmu.h | 2 + tools/include/linux/coresight-pmu.h | 2 + tools/perf/arch/arm/util/cs-etm.c | 174 ++++++++++-- tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 140 +++++++++- tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 4 +- tools/perf/util/cs-etm.c | 334 ++++++++++++++++++++++-- tools/perf/util/cs-etm.h | 17 ++ 7 files changed, 623 insertions(+), 50 deletions(-)
-- 2.7.4
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
On Fri, 15 Jun 2018 at 10:07, Al Grant Al.Grant@arm.com wrote:
For example calling the following
# perf record -e cs_etm/20070000.etr/ -C 2,3 application1
and
# perf record -e cs_etm/20070000.etr/ -C 2 application1 # perf record -e cs_etm/20070000.etr/ -C 3 application1
look exactly the same to the kernel. That is, both example create two new events and there is currently now way to correlate them. In the first example we need to allow both events to use the same sink while preventing it in the latter.
I'm sorry, I thought in the first example, you were using the file descriptor parameter on perf_event_open to correlate the events - i.e. when creating the event for CPU 3 you pass it the event for CPU 2 and this indicates it will be using the same trace buffer.
Exactly, I need to do something like that but the perf infrastructure currently doesn't provide it. I will either have to build it of find something in the arguments given to the kernel system call to correlate the sessions.
And when the commands are separate, they really are two separate events, and if the second one is opened while the first is still active and using the same ETR, the second should fail, probably with EBUSY.
Exactly.
Al
Thanks for the comments, Mathieu
Al
So this would work:
# perf record -e cs_etm/20070000.etr/ -C 3 application1
I am open to discussion on the topic should someone think of something.
As with the cleanup set this code has been uploaded here [1].
Thanks, Mathieu
[1].https://git.linaro.org/people/mathieu.poirier/coresight.git perf-opencsd- master-cpu-wide-support
Mathieu Poirier (12): perf tools: Add defines for CONTEXTID configuration perf tools: Configure contextID tracing in CPU-wide mode perf tools: Configure timestsamp generation in CPU-wide mode perf tools: Configure SWITCH_EVENTS in CPU-wide mode perf tools: Add handling of itrace start events perf tools: Add handling of switch-CPU-wide events perf tools: Linking PE contextID with perf thread mechanic perf tools: Allocate decoder tree as needed perf tools: Make cs_etm__dump_event() work with CPU-wide scenarios perf tools: Add notion of time to the decoding code perf tools: Make function cs_etm_decoder__clear_buffer() public perf tools: Add support for CPU-wide trace scenarios
include/linux/coresight-pmu.h | 2 + tools/include/linux/coresight-pmu.h | 2 + tools/perf/arch/arm/util/cs-etm.c | 174 ++++++++++-- tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 140 +++++++++- tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 4 +- tools/perf/util/cs-etm.c | 334 ++++++++++++++++++++++-- tools/perf/util/cs-etm.h | 17 ++ 7 files changed, 623 insertions(+), 50 deletions(-)
-- 2.7.4
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
On 14/06/18 22:42, Al Grant wrote:
Open question:
At this time the implementation supports tracing a single CPU since the only HW we have exhibit an N:1 source/sink topology. The HW itself does support collecting traces from more than one source but using the feature in this way could be very confusing and mislead users.
For example the following:
# perf record -e cs_etm/20070000.etr/ -C 2,3 application1
would end up tracing everyting that is happening on CPU 2 and 3 for as long as appliation1 is executing. Because the HW doesn't give us an interrupt when buffers are full, traces from one CPU could easily clobber traces from the other, giving the impression that nothing was executed on the latter.
You only really get that impression if you expect it to behave like ordinary perf events traced into a per-CPU ring buffer. It's just a question of remembering that for ETM the trace is usually into a shared buffer and older trace _of an inactive CPU_ might be lost.
Two _active_ CPUs will interleave into the buffer, and being able to get trace of two communicating processes on separate CPUs is very useful. The benefits of being able to trace simultaneously active CPUs outweighs the small risk of someone misinterpreting trace when one CPU happens to be inactive.
Al
In the AutoFDO use case where the ETMs are strobed, data loss is much less of a problem as each ETM is outputting much less data so there is less chance of the buffer overflowing. Being able to enable all ETMs would allow system wide sampling which is something partners are asking for.
We might have to use the cross-trigger framework to align the start / stop of the ETMs when doing complete (i.e. not strobed) trace of multiple cores. The simple approach of writing to each ETM's control register could allow the loss of data from the first ETMs to stop as the remaining ETMs continue to output data until they are stopped - at >2Ghz clocks, that's a lot of data!
Regards
Rob
So this would work:
# perf record -e cs_etm/20070000.etr/ -C 3 application1
I am open to discussion on the topic should someone think of something.
As with the cleanup set this code has been uploaded here [1].
Thanks, Mathieu
[1].https://git.linaro.org/people/mathieu.poirier/coresight.git perf-opencsd- master-cpu-wide-support
Mathieu Poirier (12): perf tools: Add defines for CONTEXTID configuration perf tools: Configure contextID tracing in CPU-wide mode perf tools: Configure timestsamp generation in CPU-wide mode perf tools: Configure SWITCH_EVENTS in CPU-wide mode perf tools: Add handling of itrace start events perf tools: Add handling of switch-CPU-wide events perf tools: Linking PE contextID with perf thread mechanic perf tools: Allocate decoder tree as needed perf tools: Make cs_etm__dump_event() work with CPU-wide scenarios perf tools: Add notion of time to the decoding code perf tools: Make function cs_etm_decoder__clear_buffer() public perf tools: Add support for CPU-wide trace scenarios
include/linux/coresight-pmu.h | 2 + tools/include/linux/coresight-pmu.h | 2 + tools/perf/arch/arm/util/cs-etm.c | 174 ++++++++++-- tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 140 +++++++++- tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 4 +- tools/perf/util/cs-etm.c | 334 ++++++++++++++++++++++-- tools/perf/util/cs-etm.h | 17 ++ 7 files changed, 623 insertions(+), 50 deletions(-)
-- 2.7.4
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
On Thu, 14 Jun 2018 14:26:45 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
This set adds support for CoreSight CPU-wide trace sessions. It borrows most of its code from the per-thread implementation with exception that range packets are processed and synthesised according to the time the trace they contain has been executed.
...
For example the following:
# perf record -e cs_etm/20070000.etr/ -C 2,3 application1
Just to help my understanding:
So prior to this series we'd always have to use --per-thread, right?:
perf record -e cs_etm/20070000.etr/ --per-thread <workload>
Because otherwise we'd get the misleading 'failed to mmap with 12 (Cannot allocate memory)' error.
And based on the above, with this series, we now are able to have more than one CPU specified in the user-specified cpu mask, like so?:
perf record -e cs_etm/20070000.etr/ -C 2,3 application1
Good, because that returns the same unhelpful error message today.
But we want to get coresight to the same point where intel-pt is, i.e., we shouldn't have to specify either --per-thread, nor manual CPU masks, like so:
perf record -e cs_etm/20070000.etr/ <workload>
...which also returns the mmap failed error today (all this on Juno).
So does this patchseries fix that, too, and on heterogeneous machines like Juno? How about record -a?
Thanks,
Kim
On Thu, 14 Jun 2018 19:02:45 -0500 Kim Phillips kim.phillips@arm.com wrote:
On Thu, 14 Jun 2018 14:26:45 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
This set adds support for CoreSight CPU-wide trace sessions. It borrows most of its code from the per-thread implementation with exception that range packets are processed and synthesised according to the time the trace they contain has been executed.
...
For example the following:
# perf record -e cs_etm/20070000.etr/ -C 2,3 application1
Just to help my understanding:
So prior to this series we'd always have to use --per-thread, right?:
perf record -e cs_etm/20070000.etr/ --per-thread <workload>
Because otherwise we'd get the misleading 'failed to mmap with 12 (Cannot allocate memory)' error.
And based on the above, with this series, we now are able to have more than one CPU specified in the user-specified cpu mask, like so?:
perf record -e cs_etm/20070000.etr/ -C 2,3 application1
Good, because that returns the same unhelpful error message today.
But we want to get coresight to the same point where intel-pt is, i.e., we shouldn't have to specify either --per-thread, nor manual CPU masks, like so:
perf record -e cs_etm/20070000.etr/ <workload>
...which also returns the mmap failed error today (all this on Juno).
So does this patchseries fix that, too, and on heterogeneous machines like Juno? How about record -a?
FWIW, I tested this series on Juno, and it still fails 'to mmap with 12 (Cannot allocate memory)', with the same style of invocation you provide above, i.e., 'perf record -e cs_etm/20070000.etr/ -C 2,3 application1', in addition to without the -C specification.
Am I testing it wrong?
Kim
On Mon, 18 Jun 2018 at 16:43, Kim Phillips kim.phillips@arm.com wrote:
On Thu, 14 Jun 2018 19:02:45 -0500 Kim Phillips kim.phillips@arm.com wrote:
On Thu, 14 Jun 2018 14:26:45 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
This set adds support for CoreSight CPU-wide trace sessions. It borrows most of its code from the per-thread implementation with exception that range packets are processed and synthesised according to the time the trace they contain has been executed.
...
For example the following:
# perf record -e cs_etm/20070000.etr/ -C 2,3 application1
Just to help my understanding:
So prior to this series we'd always have to use --per-thread, right?:
perf record -e cs_etm/20070000.etr/ --per-thread <workload>
Because otherwise we'd get the misleading 'failed to mmap with 12 (Cannot allocate memory)' error.
And based on the above, with this series, we now are able to have more than one CPU specified in the user-specified cpu mask, like so?:
perf record -e cs_etm/20070000.etr/ -C 2,3 application1
Good, because that returns the same unhelpful error message today.
But we want to get coresight to the same point where intel-pt is, i.e., we shouldn't have to specify either --per-thread, nor manual CPU masks, like so:
perf record -e cs_etm/20070000.etr/ <workload>
...which also returns the mmap failed error today (all this on Juno).
So does this patchseries fix that, too, and on heterogeneous machines like Juno? How about record -a?
FWIW, I tested this series on Juno, and it still fails 'to mmap with 12 (Cannot allocate memory)', with the same style of invocation you provide above, i.e., 'perf record -e cs_etm/20070000.etr/ -C 2,3 application1', in addition to without the -C specification.
Am I testing it wrong?
The "Open question" section of the original cover letter clearly addresses that topic - the current implementation supports a single CPU. I decided to proceed this way because:
1) I didn't know if users would favour traces over CPUs. 2) CPU wide multi-CPU support is much harder and an extension of the CPU wide single-CPU support. 3) With this patchset it is possible to review and test the algorithm that deal with temporal correlation of the code.
Based on the comments from Al and Robert, we will be proceeding with multi-CPU support, something I'm currently working on. The good news is that tackling that will likely implement the foundation for the complex configuration feature, also discussed on the CS mailing list a couple of months back.
Kim