Hi,
These patches add support for using perf inject to generate branch events and branch stacks from CoreSight ETM traces.
They apply on top of the recently submitted perf support for CoreSight trace [1], with the subsequent memory cleanup fix [2].
The first patch reworks Sebastian's original commits from [3] to apply to the refactored version now upstreamed, together with some fixes for branch events and my work on branch stacks posted last November [4], updated with review comments.
The second patch is new; it handles discontinuities in the trace stream, e.g. when the ETM is configured to trace only certain regions, or is only active some of the time.
These probably need to be squashed together before going upstream, but I've left them as separate commits for initial review.
Regards
Rob Walker
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tag perf-core-for-mingo-4.16-20180125
[2]: https://lkml.org/lkml/2018/1/25/432
[3]: https://github.com/Linaro/perf-opencsd/ autoFDO branch
[4]: https://lists.linaro.org/pipermail/coresight/2017-November/000955.html
Robert Walker (2):
  perf tools: inject capability for CoreSight traces
  perf inject: Emit instruction records on ETM trace discontinuity
 Documentation/trace/coresight.txt               |  31 ++
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.c |  68 +++-
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.h |   2 +
 tools/perf/util/cs-etm.c                        | 471 +++++++++++++++++++++---
 4 files changed, 509 insertions(+), 63 deletions(-)
Added user space perf functionality to translate CoreSight traces into instruction events with branch stack.
To invoke the new functionality, use the perf inject tool with --itrace=il. For example, to translate the ETM trace from perf.data into last branch records in a new file:
$ perf inject --itrace=i100000il128 -i perf.data -o perf.data.new --strip
The 'i' parameter to itrace generates periodic instruction events. The period between instruction events can be specified as a number of instructions suffixed by i (default 100000). The parameter to 'l' specifies the number of entries in the branch stack attached to instruction events. The 'b' parameter to itrace generates events on taken branches.
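For example, to synthesize an event for each taken branch without periodic instruction samples (an illustrative invocation of the 'b' option described above):

$ perf inject --itrace=b -i perf.data -o perf.data.new --strip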
This patch also fixes the contents of the branch events used in perf report. Previously, branch events were generated for each contiguous range of instructions executed; they are now generated between the last address of a range ending in an executed branch instruction and the start address of the next range.
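As a worked illustration (hypothetical addresses): if a range 0x1000-0x1010 ends in a taken branch - the last executed instruction being at 0x100c, one 4-byte A64 instruction before the exclusive end address - and the next range starts at 0x2000, the synthesized branch event now has from = 0x100c and to = 0x2000, rather than from = 0x1000 and to = 0x1010.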
Based on patches by Sebastian Pop <s.pop@samsung.com>, with additional fixes and support for specifying the instruction period.
Originally-by: Sebastian Pop <s.pop@samsung.com>
Signed-off-by: Robert Walker <robert.walker@arm.com>
---
 Documentation/trace/coresight.txt               |  31 ++
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.c |  13 +
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.h |   1 +
 tools/perf/util/cs-etm.c                        | 432 +++++++++++++++++++++---
 4 files changed, 429 insertions(+), 48 deletions(-)
diff --git a/Documentation/trace/coresight.txt b/Documentation/trace/coresight.txt
index a33c88c..5dd64e9 100644
--- a/Documentation/trace/coresight.txt
+++ b/Documentation/trace/coresight.txt
@@ -330,3 +330,34 @@ Details on how to use the generic STM API can be found here [2].
 
 [1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
 [2]. Documentation/trace/stm.txt
+
+
+Generating coverage files for Feedback Directed Optimization: AutoFDO
+---------------------------------------------------------------------
+
+perf inject accepts the --itrace option in which case tracing data is
+removed and replaced with the synthesized events. e.g.
+
+	perf inject --itrace -i perf.data -o perf.data.new
+
+Below is an example of using ARM ETM for autoFDO. It requires autofdo
+(https://github.com/google/autofdo) and gcc version 5. The bubble
+sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial).
+
+	$ gcc-5 -O3 sort.c -o sort
+	$ taskset -c 2 ./sort
+	Bubble sorting array of 30000 elements
+	5910 ms
+
+	$ perf record -e cs_etm/@20070000.etr/u --per-thread taskset -c 2 ./sort
+	Bubble sorting array of 30000 elements
+	12543 ms
+	[ perf record: Woken up 35 times to write data ]
+	[ perf record: Captured and wrote 69.640 MB perf.data ]
+
+	$ perf inject -i perf.data -o inj.data --itrace=il64 --strip
+	$ create_gcov --binary=./sort --profile=inj.data --gcov=sort.gcov -gcov_version=1
+	$ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo
+	$ taskset -c 2 ./sort_autofdo
+	Bubble sorting array of 30000 elements
+	5806 ms
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
index 1fb0184..ecf1780 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
@@ -254,6 +254,7 @@ static void cs_etm_decoder__clear_buffer(struct cs_etm_decoder *decoder)
 	for (i = 0; i < MAX_BUFFER; i++) {
 		decoder->packet_buffer[i].start_addr = 0xdeadbeefdeadbeefUL;
 		decoder->packet_buffer[i].end_addr = 0xdeadbeefdeadbeefUL;
+		decoder->packet_buffer[i].last_instr_taken_branch = false;
 		decoder->packet_buffer[i].exc = false;
 		decoder->packet_buffer[i].exc_ret = false;
 		decoder->packet_buffer[i].cpu = INT_MIN;
@@ -281,6 +282,18 @@ static void cs_etm_decoder__clear_buffer(struct cs_etm_decoder *decoder)
 	decoder->packet_buffer[et].sample_type = sample_type;
 	decoder->packet_buffer[et].start_addr = elem->st_addr;
 	decoder->packet_buffer[et].end_addr = elem->en_addr;
+	switch (elem->last_i_type) {
+	case OCSD_INSTR_BR:
+	case OCSD_INSTR_BR_INDIRECT:
+		decoder->packet_buffer[et].last_instr_taken_branch = elem->last_instr_exec;
+		break;
+	case OCSD_INSTR_ISB:
+	case OCSD_INSTR_DSB_DMB:
+	case OCSD_INSTR_OTHER:
+	default:
+		decoder->packet_buffer[et].last_instr_taken_branch = false;
+		break;
+	}
 	decoder->packet_buffer[et].exc = false;
 	decoder->packet_buffer[et].exc_ret = false;
 	decoder->packet_buffer[et].cpu = *((int *)inode->priv);
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
index 3d2e620..a4fdd28 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
@@ -30,6 +30,7 @@ struct cs_etm_packet {
 	enum cs_etm_sample_type sample_type;
 	u64 start_addr;
 	u64 end_addr;
+	u8 last_instr_taken_branch;
 	u8 exc;
 	u8 exc_ret;
 	int cpu;
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index f2c9877..43cf1d4 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -32,6 +32,14 @@
 
 #define MAX_TIMESTAMP (~0ULL)
 
+/*
+ * A64 instructions are always 4 bytes
+ *
+ * Only A64 is supported, so can use this constant for converting between
+ * addresses and instruction counts, calculating offsets etc
+ */
+#define A64_INSTR_SIZE 4
+
 struct cs_etm_auxtrace {
 	struct auxtrace auxtrace;
 	struct auxtrace_queues queues;
@@ -45,11 +53,15 @@ struct cs_etm_auxtrace {
 	u8 snapshot_mode;
 	u8 data_queued;
 	u8 sample_branches;
+	u8 sample_instructions;
 
 	int num_cpu;
 	u32 auxtrace_type;
 	u64 branches_sample_type;
 	u64 branches_id;
+	u64 instructions_sample_type;
+	u64 instructions_sample_period;
+	u64 instructions_id;
 	u64 **metadata;
 	u64 kernel_start;
 	unsigned int pmu_type;
@@ -68,6 +80,12 @@ struct cs_etm_queue {
 	u64 time;
 	u64 timestamp;
 	u64 offset;
+	u64 period_instructions;
+	struct branch_stack *last_branch;
+	struct branch_stack *last_branch_rb;
+	size_t last_branch_pos;
+	struct cs_etm_packet *prev_packet;
+	struct cs_etm_packet *packet;
 };
 
 static int cs_etm__update_queues(struct cs_etm_auxtrace *etm);
@@ -180,6 +198,10 @@ static void cs_etm__free_queue(void *priv)
 	thread__zput(etmq->thread);
 	cs_etm_decoder__free(etmq->decoder);
 	zfree(&etmq->event_buf);
+	zfree(&etmq->last_branch);
+	zfree(&etmq->last_branch_rb);
+	zfree(&etmq->prev_packet);
+	zfree(&etmq->packet);
 	free(etmq);
 }
 
@@ -276,11 +298,35 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm,
 	struct cs_etm_decoder_params d_params;
 	struct cs_etm_trace_params *t_params;
 	struct cs_etm_queue *etmq;
+	size_t szp = sizeof(struct cs_etm_packet);
 
 	etmq = zalloc(sizeof(*etmq));
 	if (!etmq)
 		return NULL;
 
+	etmq->packet = zalloc(szp);
+	if (!etmq->packet)
+		goto out_free;
+
+	if (etm->synth_opts.last_branch || etm->sample_branches) {
+		etmq->prev_packet = zalloc(szp);
+		if (!etmq->prev_packet)
+			goto out_free;
+	}
+
+	if (etm->synth_opts.last_branch) {
+		size_t sz = sizeof(struct branch_stack);
+
+		sz += etm->synth_opts.last_branch_sz *
+		      sizeof(struct branch_entry);
+		etmq->last_branch = zalloc(sz);
+		if (!etmq->last_branch)
+			goto out_free;
+		etmq->last_branch_rb = zalloc(sz);
+		if (!etmq->last_branch_rb)
+			goto out_free;
+	}
+
 	etmq->event_buf = malloc(PERF_SAMPLE_MAX_SIZE);
 	if (!etmq->event_buf)
 		goto out_free;
@@ -335,6 +381,7 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm,
 		goto out_free_decoder;
 
 	etmq->offset = 0;
+	etmq->period_instructions = 0;
 
 	return etmq;
 
@@ -342,6 +389,10 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm,
 	cs_etm_decoder__free(etmq->decoder);
out_free:
 	zfree(&etmq->event_buf);
+	zfree(&etmq->last_branch);
+	zfree(&etmq->last_branch_rb);
+	zfree(&etmq->prev_packet);
+	zfree(&etmq->packet);
 	free(etmq);
 
 	return NULL;
@@ -395,6 +446,129 @@ static int cs_etm__update_queues(struct cs_etm_auxtrace *etm)
 	return 0;
 }
 
+static inline void cs_etm__copy_last_branch_rb(struct cs_etm_queue *etmq)
+{
+	struct branch_stack *bs_src = etmq->last_branch_rb;
+	struct branch_stack *bs_dst = etmq->last_branch;
+	size_t nr = 0;
+
+	/*
+	 * Set the number of records before early exit: ->nr is used to
+	 * determine how many branches to copy from ->entries.
+	 */
+	bs_dst->nr = bs_src->nr;
+
+	/*
+	 * Early exit when there is nothing to copy.
+	 */
+	if (!bs_src->nr)
+		return;
+
+	/*
+	 * As bs_src->entries is a circular buffer, we need to copy from it in
+	 * two steps. First, copy the branches from the most recently inserted
+	 * branch ->last_branch_pos until the end of bs_src->entries buffer.
+	 */
+	nr = etmq->etm->synth_opts.last_branch_sz - etmq->last_branch_pos;
+	memcpy(&bs_dst->entries[0],
+	       &bs_src->entries[etmq->last_branch_pos],
+	       sizeof(struct branch_entry) * nr);
+
+	/*
+	 * If we wrapped around at least once, the branches from the beginning
+	 * of the bs_src->entries buffer and until the ->last_branch_pos element
+	 * are older valid branches: copy them over. The total number of
+	 * branches copied over will be equal to the number of branches asked by
+	 * the user in last_branch_sz.
+	 */
+	if (bs_src->nr >= etmq->etm->synth_opts.last_branch_sz) {
+		memcpy(&bs_dst->entries[nr],
+		       &bs_src->entries[0],
+		       sizeof(struct branch_entry) * etmq->last_branch_pos);
+	}
+}
+
+static inline void cs_etm__reset_last_branch_rb(struct cs_etm_queue *etmq)
+{
+	etmq->last_branch_pos = 0;
+	etmq->last_branch_rb->nr = 0;
+}
+
+static inline u64 cs_etm__last_executed_instr(struct cs_etm_packet *packet)
+{
+	/*
+	 * The packet records the execution range with an exclusive end address
+	 *
+	 * A64 instructions are constant size, so the last executed
+	 * instruction is A64_INSTR_SIZE before the end address
+	 * Will need to do instruction level decode for T32 instructions as
+	 * they can be variable size (not yet supported).
+	 */
+	return packet->end_addr - A64_INSTR_SIZE;
+}
+
+static inline u64 cs_etm__instr_count(const struct cs_etm_packet *packet)
+{
+	/*
+	 * Only A64 instructions are currently supported, so can get
+	 * instruction count by dividing.
+	 * Will need to do instruction level decode for T32 instructions as
+	 * they can be variable size (not yet supported).
+	 */
+	return (packet->end_addr - packet->start_addr) / A64_INSTR_SIZE;
+}
+
+static inline u64 cs_etm__instr_addr(const struct cs_etm_packet *packet,
+				     u64 offset)
+{
+	/*
+	 * Only A64 instructions are currently supported, so can get
+	 * instruction address by multiplying.
+	 * Will need to do instruction level decode for T32 instructions as
+	 * they can be variable size (not yet supported).
+	 */
+	return packet->start_addr + offset * A64_INSTR_SIZE;
+}
+
+static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq)
+{
+	struct branch_stack *bs = etmq->last_branch_rb;
+	struct branch_entry *be;
+
+	/*
+	 * The branches are recorded in a circular buffer in reverse
+	 * chronological order: we start recording from the last element of the
+	 * buffer down. After writing the first element of the stack, move the
+	 * insert position back to the end of the buffer.
+	 */
+	if (!etmq->last_branch_pos)
+		etmq->last_branch_pos = etmq->etm->synth_opts.last_branch_sz;
+
+	etmq->last_branch_pos -= 1;
+
+	be = &bs->entries[etmq->last_branch_pos];
+	be->from = cs_etm__last_executed_instr(etmq->prev_packet);
+	be->to = etmq->packet->start_addr;
+	/* No support for mispredict */
+	be->flags.mispred = 0;
+	be->flags.predicted = 1;
+
+	/*
+	 * Increment bs->nr until reaching the number of last branches asked by
+	 * the user on the command line.
+	 */
+	if (bs->nr < etmq->etm->synth_opts.last_branch_sz)
+		bs->nr += 1;
+}
+
+static int cs_etm__inject_event(union perf_event *event,
+				struct perf_sample *sample, u64 type)
+{
+	event->header.size = perf_event__sample_event_size(sample, type, 0);
+	return perf_event__synthesize_sample(event, type, 0, sample);
+}
+
 static int cs_etm__get_trace(struct cs_etm_buffer *buff,
			      struct cs_etm_queue *etmq)
 {
@@ -459,35 +633,105 @@ static void cs_etm__set_pid_tid_cpu(struct cs_etm_auxtrace *etm,
 	}
 }
 
+static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
+					    u64 addr, u64 period)
+{
+	int ret = 0;
+	struct cs_etm_auxtrace *etm = etmq->etm;
+	union perf_event *event = etmq->event_buf;
+	struct perf_sample sample = {.ip = 0,};
+
+	event->sample.header.type = PERF_RECORD_SAMPLE;
+	event->sample.header.misc = PERF_RECORD_MISC_USER;
+	event->sample.header.size = sizeof(struct perf_event_header);
+
+	sample.ip = addr;
+	sample.pid = etmq->pid;
+	sample.tid = etmq->tid;
+	sample.id = etmq->etm->instructions_id;
+	sample.stream_id = etmq->etm->instructions_id;
+	sample.period = period;
+	sample.cpu = etmq->packet->cpu;
+	sample.flags = 0;
+	sample.insn_len = 1;
+	sample.cpumode = event->header.misc;
+
+	if (etm->synth_opts.last_branch) {
+		cs_etm__copy_last_branch_rb(etmq);
+		sample.branch_stack = etmq->last_branch;
+	}
+
+	if (etm->synth_opts.inject) {
+		ret = cs_etm__inject_event(event, &sample,
+					   etm->instructions_sample_type);
+		if (ret)
+			return ret;
+	}
+
+	ret = perf_session__deliver_synth_event(etm->session, event, &sample);
+
+	if (ret)
+		pr_err(
+			"CS ETM Trace: failed to deliver instruction event, error %d\n",
+			ret);
+
+	if (etm->synth_opts.last_branch)
+		cs_etm__reset_last_branch_rb(etmq);
+
+	return ret;
+}
+
 /*
  * The cs etm packet encodes an instruction range between a branch target
  * and the next taken branch. Generate sample accordingly.
  */
-static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
-				       struct cs_etm_packet *packet)
+static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq)
 {
 	int ret = 0;
 	struct cs_etm_auxtrace *etm = etmq->etm;
 	struct perf_sample sample = {.ip = 0,};
 	union perf_event *event = etmq->event_buf;
-	u64 start_addr = packet->start_addr;
-	u64 end_addr = packet->end_addr;
+	struct dummy_branch_stack {
+		u64 nr;
+		struct branch_entry entries;
+	} dummy_bs;
 
 	event->sample.header.type = PERF_RECORD_SAMPLE;
 	event->sample.header.misc = PERF_RECORD_MISC_USER;
 	event->sample.header.size = sizeof(struct perf_event_header);
 
-	sample.ip = start_addr;
+	sample.ip = cs_etm__last_executed_instr(etmq->prev_packet);
 	sample.pid = etmq->pid;
 	sample.tid = etmq->tid;
-	sample.addr = end_addr;
+	sample.addr = etmq->packet->start_addr;
 	sample.id = etmq->etm->branches_id;
 	sample.stream_id = etmq->etm->branches_id;
 	sample.period = 1;
-	sample.cpu = packet->cpu;
+	sample.cpu = etmq->packet->cpu;
 	sample.flags = 0;
 	sample.cpumode = PERF_RECORD_MISC_USER;
 
+	/*
+	 * perf report cannot handle events without a branch stack
+	 */
+	if (etm->synth_opts.last_branch) {
+		dummy_bs = (struct dummy_branch_stack){
+			.nr = 1,
+			.entries = {
+				.from = sample.ip,
+				.to = sample.addr,
+			},
+		};
+		sample.branch_stack = (struct branch_stack *)&dummy_bs;
+	}
+
+	if (etm->synth_opts.inject) {
+		ret = cs_etm__inject_event(event, &sample,
+					   etm->branches_sample_type);
+		if (ret)
+			return ret;
+	}
+
 	ret = perf_session__deliver_synth_event(etm->session, event, &sample);
 
 	if (ret)
@@ -584,6 +828,24 @@ static int cs_etm__synth_events(struct cs_etm_auxtrace *etm,
 		etm->sample_branches = true;
 		etm->branches_sample_type = attr.sample_type;
 		etm->branches_id = id;
+		id += 1;
+		attr.sample_type &= ~(u64)PERF_SAMPLE_ADDR;
+	}
+
+	if (etm->synth_opts.last_branch)
+		attr.sample_type |= PERF_SAMPLE_BRANCH_STACK;
+
+	if (etm->synth_opts.instructions) {
+		attr.config = PERF_COUNT_HW_INSTRUCTIONS;
+		attr.sample_period = etm->synth_opts.period;
+		etm->instructions_sample_period = attr.sample_period;
+		err = cs_etm__synth_event(session, &attr, id);
+		if (err)
+			return err;
+		etm->sample_instructions = true;
+		etm->instructions_sample_type = attr.sample_type;
+		etm->instructions_id = id;
+		id += 1;
 	}
 
 	return 0;
@@ -591,20 +853,66 @@ static int cs_etm__synth_events(struct cs_etm_auxtrace *etm,
 
 static int cs_etm__sample(struct cs_etm_queue *etmq)
 {
+	struct cs_etm_auxtrace *etm = etmq->etm;
+	struct cs_etm_packet *tmp;
 	int ret;
-	struct cs_etm_packet packet;
+	u64 instrs_executed;
 
-	while (1) {
-		ret = cs_etm_decoder__get_packet(etmq->decoder, &packet);
-		if (ret <= 0)
+	instrs_executed = cs_etm__instr_count(etmq->packet);
+	etmq->period_instructions += instrs_executed;
+
+	/*
+	 * Record a branch when the last instruction in
+	 * PREV_PACKET is a branch.
+	 */
+	if (etm->synth_opts.last_branch &&
+	    etmq->prev_packet->last_instr_taken_branch)
+		cs_etm__update_last_branch_rb(etmq);
+
+	if (etm->sample_instructions &&
+	    etmq->period_instructions >= etm->instructions_sample_period) {
+		/*
+		 * Emit instruction sample periodically
+		 * TODO: allow period to be defined in cycles and clock time
+		 */
+
+		/* Get number of instructions executed after the sample point */
+		u64 instrs_over = etmq->period_instructions -
+			etm->instructions_sample_period;
+
+		/*
+		 * Calculate the address of the sampled instruction (-1 as
+		 * sample is reported as though instruction has just been
+		 * executed, but PC has not advanced to next instruction)
+		 */
+		u64 offset = (instrs_executed - instrs_over - 1);
+		u64 addr = cs_etm__instr_addr(etmq->packet, offset);
+
+		ret = cs_etm__synth_instruction_sample(
+			etmq, addr, etm->instructions_sample_period);
+		if (ret)
 			return ret;
+
+		/* Carry remaining instructions into next sample period */
+		etmq->period_instructions = instrs_over;
+	}
+
+	if (etm->sample_branches &&
+	    etmq->packet->last_instr_taken_branch) {
+		ret = cs_etm__synth_branch_sample(etmq);
+		if (ret)
+			return ret;
+	}
 
+	if (etm->sample_branches ||
+	    etm->synth_opts.last_branch) {
 		/*
-		 * If the packet contains an instruction range, generate an
-		 * instruction sequence event.
+		 * Swap PACKET with PREV_PACKET: PACKET becomes PREV_PACKET for
+		 * the next incoming packet.
 		 */
-		if (packet.sample_type & CS_ETM_RANGE)
-			cs_etm__synth_branch_sample(etmq, &packet);
+		tmp = etmq->packet;
+		etmq->packet = etmq->prev_packet;
+		etmq->prev_packet = tmp;
 	}
 
 	return 0;
@@ -621,45 +929,73 @@ static int cs_etm__run_decoder(struct cs_etm_queue *etmq)
 		etm->kernel_start = machine__kernel_start(etm->machine);
 
 	/* Go through each buffer in the queue and decode them one by one */
-more:
-	buffer_used = 0;
-	memset(&buffer, 0, sizeof(buffer));
-	err = cs_etm__get_trace(&buffer, etmq);
-	if (err <= 0)
-		return err;
-	/*
-	 * We cannot assume consecutive blocks in the data file are contiguous,
-	 * reset the decoder to force re-sync.
-	 */
-	err = cs_etm_decoder__reset(etmq->decoder);
-	if (err != 0)
-		return err;
-
-	/* Run trace decoder until buffer consumed or end of trace */
-	do {
-		processed = 0;
-
-		err = cs_etm_decoder__process_data_block(
-			etmq->decoder,
-			etmq->offset,
-			&buffer.buf[buffer_used],
-			buffer.len - buffer_used,
-			&processed);
-
-		if (err)
+	while (1) {
+		buffer_used = 0;
+		memset(&buffer, 0, sizeof(buffer));
+		err = cs_etm__get_trace(&buffer, etmq);
+		if (err <= 0)
+			return err;
+		/*
+		 * We cannot assume consecutive blocks in the data file are
+		 * contiguous, reset the decoder to force re-sync.
+		 */
+		err = cs_etm_decoder__reset(etmq->decoder);
+		if (err != 0)
 			return err;
 
-	etmq->offset += processed;
-	buffer_used += processed;
+		/* Run trace decoder until buffer consumed or end of trace */
+		do {
+			processed = 0;
+
+			err = cs_etm_decoder__process_data_block(
+				etmq->decoder,
+				etmq->offset,
+				&buffer.buf[buffer_used],
+				buffer.len - buffer_used,
+				&processed);
+
+			if (err)
+				return err;
+
+			etmq->offset += processed;
+			buffer_used += processed;
+
+			while (1) {
+				err = cs_etm_decoder__get_packet(etmq->decoder,
+								 etmq->packet);
+				if (err <= 0)
+					/*
+					 * Stop processing this chunk on
+					 * end of data or error
+					 */
+					break;
+
+				/*
+				 * If the packet contains an instruction
+				 * range, generate instruction sequence
+				 * events.
+				 */
+				if (etmq->packet->sample_type & CS_ETM_RANGE)
+					err = cs_etm__sample(etmq);
+			}
+		} while (buffer.len > buffer_used);
 
 		/*
-		 * Nothing to do with an error condition, let's hope the next
-		 * chunk will be better.
+		 * Generate a last branch event for the branches left in
+		 * the circular buffer at the end of the trace.
 		 */
-		err = cs_etm__sample(etmq);
-	} while (buffer.len > buffer_used);
+		if (etm->sample_instructions &&
+		    etmq->etm->synth_opts.last_branch) {
+			struct branch_stack *bs = etmq->last_branch_rb;
+			struct branch_entry *be = &bs->entries[etmq->last_branch_pos];
+
+			err = cs_etm__synth_instruction_sample(etmq, be->to,
+						etmq->period_instructions);
+			if (err)
+				return err;
+		}
 
-goto more;
+	}
 
 	return err;
 }
On Tue, Jan 30, 2018 at 02:42:24PM +0000, Robert Walker wrote:
[...]

> +Generating coverage files for Feedback Directed Optimization: AutoFDO
> +---------------------------------------------------------------------
Please document that autoFDO is only available on A64.
[...]

> -	sample.ip = start_addr;
> +	sample.ip = cs_etm__last_executed_instr(etmq->prev_packet);
Humm...
> 	sample.pid = etmq->pid;
> 	sample.tid = etmq->tid;
> -	sample.addr = end_addr;
> +	sample.addr = etmq->packet->start_addr;
I think this is the right way to do it - much better yes.
[...]

> +	if (etm->sample_branches &&
> +	    etmq->packet->last_instr_taken_branch) {
Shouldn't this check be done on the etmq->prev_packet instead?
[...]

> +	while (1) {
Why another while loop? To me it's just introducing an additional level of imbrication. Is there anything the goto statement wasn't allowing you to achieve?
[...]
CoreSight mailing list
CoreSight@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/coresight
On 01/02/18 23:57, Mathieu Poirier wrote:
> On Tue, Jan 30, 2018 at 02:42:24PM +0000, Robert Walker wrote:
>
> [...]
>
> > +Generating coverage files for Feedback Directed Optimization: AutoFDO
> > +---------------------------------------------------------------------
>
> Please document that autoFDO is only available on A64.
Will do. It's not just autoFDO - it's any use of perf to convert trace to instruction samples / branch samples, mainly because of the assumption that all instructions are 4 bytes. I'm not sure whether the AutoFDO tools and compilers do or do not support A32/T32 - I haven't tried it.
> [...]
>
> > -	sample.ip = start_addr;
> > +	sample.ip = cs_etm__last_executed_instr(etmq->prev_packet);
>
> Humm...
I introduced cs_etm__last_executed_instr(), cs_etm__instr_count() and cs_etm__instr_addr() to abstract away the detail that the end address in the trace packet is not executed - the last executed instruction is the one before it. Putting this into separate functions means we don't have the calculations and explanations scattered throughout the code. As we're only supporting A64 for now, we can simply subtract 4 to get this address, but if we ever support Thumb then we will need instruction decode to work out the actual address - and then only these functions will need updating.
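To make the arithmetic concrete, here is a small standalone illustration (plain C with made-up addresses, separate from the perf code; the helper names in the comments refer to the functions in the patch):

	#include <stdio.h>
	#include <stdint.h>

	#define A64_INSTR_SIZE 4	/* fixed A64 instruction size assumed by the patch */

	int main(void)
	{
		/* Hypothetical packet: range [0x1000, 0x1010), end address exclusive */
		uint64_t start_addr = 0x1000, end_addr = 0x1010;

		/* cs_etm__last_executed_instr(): the last executed instruction
		 * starts one instruction before the exclusive end address */
		uint64_t last_instr = end_addr - A64_INSTR_SIZE;		/* 0x100c */

		/* cs_etm__instr_count(): constant size makes this a division */
		uint64_t count = (end_addr - start_addr) / A64_INSTR_SIZE;	/* 4 */

		/* cs_etm__instr_addr(): offset to address is a multiplication */
		uint64_t third = start_addr + 2 * A64_INSTR_SIZE;		/* 0x1008 */

		printf("last=%#llx count=%llu third=%#llx\n",
		       (unsigned long long)last_instr,
		       (unsigned long long)count,
		       (unsigned long long)third);
		return 0;
	}

With T32 in the mix, none of these would be simple arithmetic any more, which is the point of keeping them behind small helpers.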
 	sample.pid = etmq->pid;
 	sample.tid = etmq->tid;
-	sample.addr = end_addr;
+	sample.addr = etmq->packet->start_addr;
I think this is the right way to do it - much better, yes.
 	sample.id = etmq->etm->branches_id;
 	sample.stream_id = etmq->etm->branches_id;
 	sample.period = 1;
-	sample.cpu = packet->cpu;
+	sample.cpu = etmq->packet->cpu;
 	sample.flags = 0;
+	sample.cpumode = PERF_RECORD_MISC_USER;
+
+	/*
+	 * perf report cannot handle events without a branch stack
+	 */
+	if (etm->synth_opts.last_branch) {
+		dummy_bs = (struct dummy_branch_stack){
+			.nr = 1,
+			.entries = {
+				.from = sample.ip,
+				.to = sample.addr,
+			},
+		};
+		sample.branch_stack = (struct branch_stack *)&dummy_bs;
+	}
+
+	if (etm->synth_opts.inject) {
+		ret = cs_etm__inject_event(event, &sample,
+					   etm->branches_sample_type);
+		if (ret)
+			return ret;
+	}
+
 	ret = perf_session__deliver_synth_event(etm->session, event, &sample);
 	if (ret)
@@ -584,6 +828,24 @@ static int cs_etm__synth_events(struct cs_etm_auxtrace *etm,
 		etm->sample_branches = true;
 		etm->branches_sample_type = attr.sample_type;
 		etm->branches_id = id;
+		id += 1;
+		attr.sample_type &= ~(u64)PERF_SAMPLE_ADDR;
+	}
+
+	if (etm->synth_opts.last_branch)
+		attr.sample_type |= PERF_SAMPLE_BRANCH_STACK;
+
+	if (etm->synth_opts.instructions) {
+		attr.config = PERF_COUNT_HW_INSTRUCTIONS;
+		attr.sample_period = etm->synth_opts.period;
+		etm->instructions_sample_period = attr.sample_period;
+		err = cs_etm__synth_event(session, &attr, id);
+		if (err)
+			return err;
+		etm->sample_instructions = true;
+		etm->instructions_sample_type = attr.sample_type;
+		etm->instructions_id = id;
+		id += 1;
+	}
 
 	return 0;
@@ -591,20 +853,66 @@ static int cs_etm__synth_events(struct cs_etm_auxtrace *etm,
 static int cs_etm__sample(struct cs_etm_queue *etmq)
 {
+	struct cs_etm_auxtrace *etm = etmq->etm;
+	struct cs_etm_packet *tmp;
 	int ret;
-	struct cs_etm_packet packet;
+	u64 instrs_executed;
 
-	while (1) {
-		ret = cs_etm_decoder__get_packet(etmq->decoder, &packet);
-		if (ret <= 0)
-			return ret;
+	instrs_executed = cs_etm__instr_count(etmq->packet);
+	etmq->period_instructions += instrs_executed;
+
+	/*
+	 * Record a branch when the last instruction in
+	 * PREV_PACKET is a branch.
+	 */
+	if (etm->synth_opts.last_branch &&
+	    etmq->prev_packet->last_instr_taken_branch)
+		cs_etm__update_last_branch_rb(etmq);
+
+	if (etm->sample_instructions &&
+	    etmq->period_instructions >= etm->instructions_sample_period) {
+		/*
+		 * Emit instruction sample periodically
+		 * TODO: allow period to be defined in cycles and clock time
+		 */
+
+		/* Get number of instructions executed after the sample point */
+		u64 instrs_over = etmq->period_instructions -
+			etm->instructions_sample_period;
+
+		/*
+		 * Calculate the address of the sampled instruction (-1 as
+		 * sample is reported as though instruction has just been
+		 * executed, but PC has not advanced to next instruction)
+		 */
+		u64 offset = (instrs_executed - instrs_over - 1);
+		u64 addr = cs_etm__instr_addr(etmq->packet, offset);
+
+		ret = cs_etm__synth_instruction_sample(
+			etmq, addr, etm->instructions_sample_period);
+		if (ret)
+			return ret;
+
+		/* Carry remaining instructions into next sample period */
+		etmq->period_instructions = instrs_over;
+	}
+
+	if (etm->sample_branches &&
+	    etmq->packet->last_instr_taken_branch) {
Shouldn't this check be done on the etmq->prev_packet instead?
Yes. And it does get fixed in patch 2. I'll fix this here in the next version.
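For clarity, the agreed fix amounts to testing the flag on the previous packet, since the synthesized branch runs from PREV_PACKET's last instruction to PACKET's start address - a sketch of what patch 2 ends up with:

	if (etm->sample_branches &&
	    etmq->prev_packet->last_instr_taken_branch) {
		ret = cs_etm__synth_branch_sample(etmq);
		if (ret)
			return ret;
	}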
+		ret = cs_etm__synth_branch_sample(etmq);
+		if (ret)
+			return ret;
+	}
+
+	if (etm->sample_branches ||
+	    etm->synth_opts.last_branch) {
 		/*
-		 * If the packet contains an instruction range, generate an
-		 * instruction sequence event.
+		 * Swap PACKET with PREV_PACKET: PACKET becomes PREV_PACKET for
+		 * the next incoming packet.
 		 */
-		if (packet.sample_type & CS_ETM_RANGE)
-			cs_etm__synth_branch_sample(etmq, &packet);
+		tmp = etmq->packet;
+		etmq->packet = etmq->prev_packet;
+		etmq->prev_packet = tmp;
 	}
 
 	return 0;
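To make the sampling arithmetic above concrete, a worked example with made-up numbers - assume a period of 100000 and a 250-instruction packet that pushes period_instructions to 100010:

	u64 instrs_executed = 250;
	u64 instrs_over = 100010 - 100000;		/* 10 past the period */
	u64 offset = instrs_executed - instrs_over - 1;	/* 239 */
	/*
	 * The sample lands on the 240th instruction of the packet, i.e.
	 * packet->start_addr + 239 * A64_INSTR_SIZE, and the 10 excess
	 * instructions carry over into the next period.
	 */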
@@ -621,45 +929,73 @@ static int cs_etm__run_decoder(struct cs_etm_queue *etmq)
 		etm->kernel_start = machine__kernel_start(etm->machine);
 
 	/* Go through each buffer in the queue and decode them one by one */
-more:
-	buffer_used = 0;
-	memset(&buffer, 0, sizeof(buffer));
-	err = cs_etm__get_trace(&buffer, etmq);
-	if (err <= 0)
-		return err;
-	/*
-	 * We cannot assume consecutive blocks in the data file are contiguous,
-	 * reset the decoder to force re-sync.
-	 */
-	err = cs_etm_decoder__reset(etmq->decoder);
-	if (err != 0)
-		return err;
-
-	/* Run trace decoder until buffer consumed or end of trace */
-	do {
-		processed = 0;
-		err = cs_etm_decoder__process_data_block(
-						etmq->decoder,
-						etmq->offset,
-						&buffer.buf[buffer_used],
-						buffer.len - buffer_used,
-						&processed);
-		if (err)
+	while (1) {
Why another while loop? To me it's just introducing an additional level of nesting. Is there anything the goto statement wasn't allowing you to achieve?

Other reviewers didn't like the goto (and I agree with them).
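For anyone skimming, a schematic comparison of the two control-flow shapes (everything else elided; fetch() and decode() are hypothetical stand-ins for cs_etm__get_trace() and the decode loop):

	static int fetch(void);		/* hypothetical stand-ins */
	static int decode(void);

	static int run_with_goto(void)
	{
	more:
		if (fetch() <= 0)
			return 0;
		while (decode() > 0)
			;
		goto more;
	}

	static int run_with_loop(void)
	{
		while (1) {	/* one extra nesting level, no label */
			if (fetch() <= 0)
				return 0;
			while (decode() > 0)
				;
		}
	}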
+		buffer_used = 0;
+		memset(&buffer, 0, sizeof(buffer));
+		err = cs_etm__get_trace(&buffer, etmq);
+		if (err <= 0)
+			return err;
+		/*
+		 * We cannot assume consecutive blocks in the data file are
+		 * contiguous, reset the decoder to force re-sync.
+		 */
+		err = cs_etm_decoder__reset(etmq->decoder);
+		if (err != 0)
+			return err;
 
-		etmq->offset += processed;
-		buffer_used += processed;
+		/* Run trace decoder until buffer consumed or end of trace */
+		do {
+			processed = 0;
+			err = cs_etm_decoder__process_data_block(
+				etmq->decoder,
+				etmq->offset,
+				&buffer.buf[buffer_used],
+				buffer.len - buffer_used,
+				&processed);
+			if (err)
+				return err;
+
+			etmq->offset += processed;
+			buffer_used += processed;
+
+			while (1) {
+				err = cs_etm_decoder__get_packet(etmq->decoder,
+								 etmq->packet);
+				if (err <= 0)
+					/*
+					 * Stop processing this chunk on
+					 * end of data or error
+					 */
+					break;
+
+				/*
+				 * If the packet contains an instruction
+				 * range, generate instruction sequence
+				 * events.
+				 */
+				if (etmq->packet->sample_type & CS_ETM_RANGE)
+					err = cs_etm__sample(etmq);
+			}
+		} while (buffer.len > buffer_used);
 
-		/*
-		 * Nothing to do with an error condition, let's hope the next
-		 * chunk will be better.
-		 */
-		err = cs_etm__sample(etmq);
-	} while (buffer.len > buffer_used);
+		/*
+		 * Generate a last branch event for the branches left in
+		 * the circular buffer at the end of the trace.
+		 */
+		if (etm->sample_instructions &&
+		    etmq->etm->synth_opts.last_branch) {
+			struct branch_stack *bs = etmq->last_branch_rb;
+			struct branch_entry *be =
+				&bs->entries[etmq->last_branch_pos];
+
+			err = cs_etm__synth_instruction_sample(etmq, be->to,
+						etmq->period_instructions);
+			if (err)
+				return err;
+		}
 
-goto more;
+	}
 
 	return err;
 }
--
1.9.1
On Tue, Jan 30, 2018 at 02:42:24PM +0000, Robert Walker wrote:
Originally-by: Sebastian Pop s.pop@samsung.com
Signed-off-by: Robert Walker robert.walker@arm.com
... and this patch doesn't clear checkpatch.pl
On 02/02/18 00:00, Mathieu Poirier wrote:
On Tue, Jan 30, 2018 at 02:42:24PM +0000, Robert Walker wrote:
... and this patch doesn't clear checkpatch.pl
checkpatch.pl gives me warnings for the Originally-by tag, but Kim advised me to use that to preserve Sebastian's credit and there is precedent for it in previous commits.
The other warnings are for lines over 80 characters, but these are fixed in the 2nd patch, so will go away if we squash them together. Will fix up if they're to be kept separate.
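For anyone reproducing the warnings, the usual check is the kernel's scripts/checkpatch.pl run over the patch file (the filename below is only a placeholder):

    $ ./scripts/checkpatch.pl 0001-perf-inject-coresight.patch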
Regards
Rob
On 2 February 2018 at 08:08, Robert Walker robert.walker@arm.com wrote:
On 02/02/18 00:00, Mathieu Poirier wrote:
On Tue, Jan 30, 2018 at 02:42:24PM +0000, Robert Walker wrote:
... and this patch doesn't clear checkpatch.pl
checkpatch.pl gives me warnings for the Originally-by tag, but Kim advised me to use that to preserve Sebastian's credit and there is precedent for it in previous commits.
I probably should have been clearer - I'm good with the "Originally-by" part.
The other warnings are for lines over 80 characters, but these are fixed in the 2nd patch, so will go away if we squash them together. Will fix up if they're to be kept separate.
Correct, it's the 80 character part I was referring to. I didn't have the time to review 2/2, something I will do this morning.
Regards
Rob
Documentation/trace/coresight.txt | 31 ++ tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 13 + tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 1 + tools/perf/util/cs-etm.c | 432 +++++++++++++++++++++--- 4 files changed, 429 insertions(+), 48 deletions(-)
diff --git a/Documentation/trace/coresight.txt b/Documentation/trace/coresight.txt index a33c88c..5dd64e9 100644 --- a/Documentation/trace/coresight.txt +++ b/Documentation/trace/coresight.txt @@ -330,3 +330,34 @@ Details on how to use the generic STM API can be found here [2]. [1]. Documentation/ABI/testing/sysfs-bus-coresight-devices-stm [2]. Documentation/trace/stm.txt
+Generating coverage files for Feedback Directed Optimization: AutoFDO +---------------------------------------------------------------------
+perf inject accepts the --itrace option in which case tracing data is +removed and replaced with the synthesized events. e.g.
perf inject --itrace -i perf.data -o perf.data.new
+Below is an example of using ARM ETM for autoFDO. It requires autofdo +(https://github.com/google/autofdo) and gcc version 5. The bubble +sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial).
$ gcc-5 -O3 sort.c -o sort
$ taskset -c 2 ./sort
Bubble sorting array of 30000 elements
5910 ms
$ perf record -e cs_etm/@20070000.etr/u --per-thread taskset -c 2
./sort
Bubble sorting array of 30000 elements
12543 ms
[ perf record: Woken up 35 times to write data ]
[ perf record: Captured and wrote 69.640 MB perf.data ]
$ perf inject -i perf.data -o inj.data --itrace=il64 --strip
$ create_gcov --binary=./sort --profile=inj.data --gcov=sort.gcov
-gcov_version=1
$ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo
$ taskset -c 2 ./sort_autofdo
Bubble sorting array of 30000 elements
5806 ms
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c index 1fb0184..ecf1780 100644 --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c @@ -254,6 +254,7 @@ static void cs_etm_decoder__clear_buffer(struct cs_etm_decoder *decoder) for (i = 0; i < MAX_BUFFER; i++) { decoder->packet_buffer[i].start_addr = 0xdeadbeefdeadbeefUL; decoder->packet_buffer[i].end_addr = 0xdeadbeefdeadbeefUL;
decoder->packet_buffer[i].last_instr_taken_branch =
false; decoder->packet_buffer[i].exc = false; decoder->packet_buffer[i].exc_ret = false; decoder->packet_buffer[i].cpu = INT_MIN; @@ -281,6 +282,18 @@ static void cs_etm_decoder__clear_buffer(struct cs_etm_decoder *decoder) decoder->packet_buffer[et].sample_type = sample_type; decoder->packet_buffer[et].start_addr = elem->st_addr; decoder->packet_buffer[et].end_addr = elem->en_addr;
switch (elem->last_i_type) {
case OCSD_INSTR_BR:
case OCSD_INSTR_BR_INDIRECT:
decoder->packet_buffer[et].last_instr_taken_branch =
elem->last_instr_exec;
break;
case OCSD_INSTR_ISB:
case OCSD_INSTR_DSB_DMB:
case OCSD_INSTR_OTHER:
default:
decoder->packet_buffer[et].last_instr_taken_branch =
false;
break;
} decoder->packet_buffer[et].exc = false; decoder->packet_buffer[et].exc_ret = false; decoder->packet_buffer[et].cpu = *((int *)inode->priv);
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h index 3d2e620..a4fdd28 100644 --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h @@ -30,6 +30,7 @@ struct cs_etm_packet { enum cs_etm_sample_type sample_type; u64 start_addr; u64 end_addr;
u8 last_instr_taken_branch; u8 exc; u8 exc_ret; int cpu;
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index f2c9877..43cf1d4 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -32,6 +32,14 @@ #define MAX_TIMESTAMP (~0ULL) +/*
- A64 instructions are always 4 bytes
- Only A64 is supported, so can use this constant for converting
between
- addresses and instruction counts, calculting offsets etc
- */
+#define A64_INSTR_SIZE 4
- struct cs_etm_auxtrace { struct auxtrace auxtrace; struct auxtrace_queues queues;
@@ -45,11 +53,15 @@ struct cs_etm_auxtrace { u8 snapshot_mode; u8 data_queued; u8 sample_branches;
u8 sample_instructions; int num_cpu; u32 auxtrace_type; u64 branches_sample_type; u64 branches_id;
u64 instructions_sample_type;
u64 instructions_sample_period;
u64 instructions_id; u64 **metadata; u64 kernel_start; unsigned int pmu_type;
@@ -68,6 +80,12 @@ struct cs_etm_queue { u64 time; u64 timestamp; u64 offset;
u64 period_instructions;
struct branch_stack *last_branch;
struct branch_stack *last_branch_rb;
size_t last_branch_pos;
struct cs_etm_packet *prev_packet;
}; static int cs_etm__update_queues(struct cs_etm_auxtrace *etm);struct cs_etm_packet *packet;
@@ -180,6 +198,10 @@ static void cs_etm__free_queue(void *priv) thread__zput(etmq->thread); cs_etm_decoder__free(etmq->decoder); zfree(&etmq->event_buf);
zfree(&etmq->last_branch);
zfree(&etmq->last_branch_rb);
zfree(&etmq->prev_packet);
} @@ -276,11 +298,35 @@ static struct cs_etm_queuezfree(&etmq->packet); free(etmq);
*cs_etm__alloc_queue(struct cs_etm_auxtrace *etm, struct cs_etm_decoder_params d_params; struct cs_etm_trace_params *t_params; struct cs_etm_queue *etmq;
size_t szp = sizeof(struct cs_etm_packet); etmq = zalloc(sizeof(*etmq)); if (!etmq) return NULL;
etmq->packet = zalloc(szp);
if (!etmq->packet)
goto out_free;
if (etm->synth_opts.last_branch || etm->sample_branches) {
etmq->prev_packet = zalloc(szp);
if (!etmq->prev_packet)
goto out_free;
}
if (etm->synth_opts.last_branch) {
size_t sz = sizeof(struct branch_stack);
sz += etm->synth_opts.last_branch_sz *
sizeof(struct branch_entry);
etmq->last_branch = zalloc(sz);
if (!etmq->last_branch)
goto out_free;
etmq->last_branch_rb = zalloc(sz);
if (!etmq->last_branch_rb)
goto out_free;
}
etmq->event_buf = malloc(PERF_SAMPLE_MAX_SIZE); if (!etmq->event_buf) goto out_free;
@@ -335,6 +381,7 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm, goto out_free_decoder; etmq->offset = 0;
@@ -342,6 +389,10 @@ static struct cs_etm_queueetmq->period_instructions = 0; return etmq;
*cs_etm__alloc_queue(struct cs_etm_auxtrace *etm, cs_etm_decoder__free(etmq->decoder); out_free: zfree(&etmq->event_buf);
zfree(&etmq->last_branch);
zfree(&etmq->last_branch_rb);
zfree(&etmq->prev_packet);
zfree(&etmq->packet); free(etmq); return NULL;
@@ -395,6 +446,129 @@ static int cs_etm__update_queues(struct cs_etm_auxtrace *etm)
 	return 0;
 }

+static inline void cs_etm__copy_last_branch_rb(struct cs_etm_queue *etmq)
+{
+	struct branch_stack *bs_src = etmq->last_branch_rb;
+	struct branch_stack *bs_dst = etmq->last_branch;
+	size_t nr = 0;
+
+	/*
+	 * Set the number of records before early exit: ->nr is used to
+	 * determine how many branches to copy from ->entries.
+	 */
+	bs_dst->nr = bs_src->nr;
+
+	/*
+	 * Early exit when there is nothing to copy.
+	 */
+	if (!bs_src->nr)
+		return;
+
+	/*
+	 * As bs_src->entries is a circular buffer, we need to copy from it in
+	 * two steps. First, copy the branches from the most recently inserted
+	 * branch ->last_branch_pos until the end of bs_src->entries buffer.
+	 */
+	nr = etmq->etm->synth_opts.last_branch_sz - etmq->last_branch_pos;
+	memcpy(&bs_dst->entries[0],
+	       &bs_src->entries[etmq->last_branch_pos],
+	       sizeof(struct branch_entry) * nr);
+
+	/*
+	 * If we wrapped around at least once, the branches from the beginning
+	 * of the bs_src->entries buffer and until the ->last_branch_pos element
+	 * are older valid branches: copy them over. The total number of
+	 * branches copied over will be equal to the number of branches asked by
+	 * the user in last_branch_sz.
+	 */
+	if (bs_src->nr >= etmq->etm->synth_opts.last_branch_sz) {
+		memcpy(&bs_dst->entries[nr],
+		       &bs_src->entries[0],
+		       sizeof(struct branch_entry) * etmq->last_branch_pos);
+	}
+}
+
+static inline void cs_etm__reset_last_branch_rb(struct cs_etm_queue *etmq)
+{
+	etmq->last_branch_pos = 0;
+	etmq->last_branch_rb->nr = 0;
+}
+
+static inline u64 cs_etm__last_executed_instr(struct cs_etm_packet *packet)
+{
+	/*
+	 * The packet records the execution range with an exclusive end
+	 * address.
+	 *
+	 * A64 instructions are constant size, so the last executed
+	 * instruction is A64_INSTR_SIZE before the end address.
+	 * Will need to do instruction level decode for T32 instructions as
+	 * they can be variable size (not yet supported).
+	 */
+	return packet->end_addr - A64_INSTR_SIZE;
+}
+
+static inline u64 cs_etm__instr_count(const struct cs_etm_packet *packet)
+{
+	/*
+	 * Only A64 instructions are currently supported, so can get
+	 * instruction count by dividing.
+	 * Will need to do instruction level decode for T32 instructions as
+	 * they can be variable size (not yet supported).
+	 */
+	return (packet->end_addr - packet->start_addr) / A64_INSTR_SIZE;
+}
+
+static inline u64 cs_etm__instr_addr(const struct cs_etm_packet *packet,
+				     u64 offset)
+{
+	/*
+	 * Only A64 instructions are currently supported, so can get
+	 * instruction address by multiplying.
+	 * Will need to do instruction level decode for T32 instructions as
+	 * they can be variable size (not yet supported).
+	 */
+	return packet->start_addr + offset * A64_INSTR_SIZE;
+}
+
+static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq)
+{
+	struct branch_stack *bs = etmq->last_branch_rb;
+	struct branch_entry *be;
+
+	/*
+	 * The branches are recorded in a circular buffer in reverse
+	 * chronological order: we start recording from the last element of the
+	 * buffer down. After writing the first element of the stack, move the
+	 * insert position back to the end of the buffer.
+	 */
+	if (!etmq->last_branch_pos)
+		etmq->last_branch_pos = etmq->etm->synth_opts.last_branch_sz;
+
+	etmq->last_branch_pos -= 1;
+
+	be = &bs->entries[etmq->last_branch_pos];
+	be->from = cs_etm__last_executed_instr(etmq->prev_packet);
+	be->to = etmq->packet->start_addr;
+	/* No support for mispredict */
+	be->flags.mispred = 0;
+	be->flags.predicted = 1;
+
+	/*
+	 * Increment bs->nr until reaching the number of last branches asked by
+	 * the user on the command line.
+	 */
+	if (bs->nr < etmq->etm->synth_opts.last_branch_sz)
+		bs->nr += 1;
+}
+
+static int cs_etm__inject_event(union perf_event *event,
+				struct perf_sample *sample, u64 type)
+{
+	event->header.size = perf_event__sample_event_size(sample, type, 0);
+	return perf_event__synthesize_sample(event, type, 0, sample);
+}
+
 static int cs_etm__get_trace(struct cs_etm_buffer *buff,
 			     struct cs_etm_queue *etmq)
 {
@@ -459,35 +633,105 @@ static void cs_etm__set_pid_tid_cpu(struct cs_etm_auxtrace *etm,
 	}
 }

+static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
+					    u64 addr, u64 period)
+{
+	int ret = 0;
+	struct cs_etm_auxtrace *etm = etmq->etm;
+	union perf_event *event = etmq->event_buf;
+	struct perf_sample sample = {.ip = 0,};
+
+	event->sample.header.type = PERF_RECORD_SAMPLE;
+	event->sample.header.misc = PERF_RECORD_MISC_USER;
+	event->sample.header.size = sizeof(struct perf_event_header);
+
+	sample.ip = addr;
+	sample.pid = etmq->pid;
+	sample.tid = etmq->tid;
+	sample.id = etmq->etm->instructions_id;
+	sample.stream_id = etmq->etm->instructions_id;
+	sample.period = period;
+	sample.cpu = etmq->packet->cpu;
+	sample.flags = 0;
+	sample.insn_len = 1;
+	sample.cpumode = event->header.misc;
+
+	if (etm->synth_opts.last_branch) {
+		cs_etm__copy_last_branch_rb(etmq);
+		sample.branch_stack = etmq->last_branch;
+	}
+
+	if (etm->synth_opts.inject) {
+		ret = cs_etm__inject_event(event, &sample,
+					   etm->instructions_sample_type);
+		if (ret)
+			return ret;
+	}
+
+	ret = perf_session__deliver_synth_event(etm->session, event, &sample);
+	if (ret)
+		pr_err("CS ETM Trace: failed to deliver instruction event, error %d\n",
+		       ret);
+
+	if (etm->synth_opts.last_branch)
+		cs_etm__reset_last_branch_rb(etmq);
+
+	return ret;
+}

 /*
  * The cs etm packet encodes an instruction range between a branch target
  * and the next taken branch. Generate sample accordingly.
  */
-static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
-				       struct cs_etm_packet *packet)
+static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq)
 {
 	int ret = 0;
 	struct cs_etm_auxtrace *etm = etmq->etm;
 	struct perf_sample sample = {.ip = 0,};
 	union perf_event *event = etmq->event_buf;
-	u64 start_addr = packet->start_addr;
-	u64 end_addr = packet->end_addr;
+	struct dummy_branch_stack {
+		u64			nr;
+		struct branch_entry	entries;
+	} dummy_bs;

 	event->sample.header.type = PERF_RECORD_SAMPLE;
 	event->sample.header.misc = PERF_RECORD_MISC_USER;
 	event->sample.header.size = sizeof(struct perf_event_header);

-	sample.ip = start_addr;
+	sample.ip = cs_etm__last_executed_instr(etmq->prev_packet);
 	sample.pid = etmq->pid;
 	sample.tid = etmq->tid;
-	sample.addr = end_addr;
+	sample.addr = etmq->packet->start_addr;
 	sample.id = etmq->etm->branches_id;
 	sample.stream_id = etmq->etm->branches_id;
 	sample.period = 1;
-	sample.cpu = packet->cpu;
+	sample.cpu = etmq->packet->cpu;
 	sample.flags = 0;
 	sample.cpumode = PERF_RECORD_MISC_USER;

+	/*
+	 * perf report cannot handle events without a branch stack
+	 */
+	if (etm->synth_opts.last_branch) {
+		dummy_bs = (struct dummy_branch_stack){
+			.nr = 1,
+			.entries = {
+				.from = sample.ip,
+				.to = sample.addr,
+			},
+		};
+		sample.branch_stack = (struct branch_stack *)&dummy_bs;
+	}
+
+	if (etm->synth_opts.inject) {
+		ret = cs_etm__inject_event(event, &sample,
+					   etm->branches_sample_type);
+		if (ret)
+			return ret;
+	}
+
 	ret = perf_session__deliver_synth_event(etm->session, event, &sample);

 	if (ret)
@@ -584,6 +828,24 @@ static int cs_etm__synth_events(struct cs_etm_auxtrace *etm,
 		etm->sample_branches = true;
 		etm->branches_sample_type = attr.sample_type;
 		etm->branches_id = id;
+		id += 1;
+		attr.sample_type &= ~(u64)PERF_SAMPLE_ADDR;
+	}
+
+	if (etm->synth_opts.last_branch)
+		attr.sample_type |= PERF_SAMPLE_BRANCH_STACK;
+
+	if (etm->synth_opts.instructions) {
+		attr.config = PERF_COUNT_HW_INSTRUCTIONS;
+		attr.sample_period = etm->synth_opts.period;
+		etm->instructions_sample_period = attr.sample_period;
+		err = cs_etm__synth_event(session, &attr, id);
+		if (err)
+			return err;
+		etm->sample_instructions = true;
+		etm->instructions_sample_type = attr.sample_type;
+		etm->instructions_id = id;
+		id += 1;
 	}

 	return 0;
@@ -591,20 +853,66 @@ static int cs_etm__synth_events(struct cs_etm_auxtrace *etm,
 static int cs_etm__sample(struct cs_etm_queue *etmq)
 {
+	struct cs_etm_auxtrace *etm = etmq->etm;
+	struct cs_etm_packet *tmp;
 	int ret;
-	struct cs_etm_packet packet;
+	u64 instrs_executed;

-	while (1) {
-		ret = cs_etm_decoder__get_packet(etmq->decoder, &packet);
-		if (ret <= 0)
+	instrs_executed = cs_etm__instr_count(etmq->packet);
+	etmq->period_instructions += instrs_executed;
+
+	/*
+	 * Record a branch when the last instruction in
+	 * PREV_PACKET is a branch.
+	 */
+	if (etm->synth_opts.last_branch &&
+	    etmq->prev_packet->last_instr_taken_branch)
+		cs_etm__update_last_branch_rb(etmq);
+
+	if (etm->sample_instructions &&
+	    etmq->period_instructions >= etm->instructions_sample_period) {
+		/*
+		 * Emit instruction sample periodically
+		 * TODO: allow period to be defined in cycles and clock time
+		 */
+
+		/* Get number of instructions executed after the sample point */
+		u64 instrs_over = etmq->period_instructions -
+			etm->instructions_sample_period;
+
+		/*
+		 * Calculate the address of the sampled instruction (-1 as
+		 * sample is reported as though instruction has just been
+		 * executed, but PC has not advanced to next instruction)
+		 */
+		u64 offset = (instrs_executed - instrs_over - 1);
+		u64 addr = cs_etm__instr_addr(etmq->packet, offset);
+
+		ret = cs_etm__synth_instruction_sample(
+			etmq, addr, etm->instructions_sample_period);
+		if (ret)
 			return ret;
+
+		/* Carry remaining instructions into next sample period */
+		etmq->period_instructions = instrs_over;
+	}
+
+	if (etm->sample_branches &&
+	    etmq->packet->last_instr_taken_branch) {
+		ret = cs_etm__synth_branch_sample(etmq);
+		if (ret)
+			return ret;
+	}

-		/*
-		 * If the packet contains an instruction range, generate an
-		 * instruction sequence event.
-		 */
-		if (packet.sample_type & CS_ETM_RANGE)
-			cs_etm__synth_branch_sample(etmq, &packet);
+	if (etm->sample_branches || etm->synth_opts.last_branch) {
+		/*
+		 * Swap PACKET with PREV_PACKET: PACKET becomes PREV_PACKET for
+		 * the next incoming packet.
+		 */
+		tmp = etmq->packet;
+		etmq->packet = etmq->prev_packet;
+		etmq->prev_packet = tmp;
 	}

 	return 0;
@@ -621,45 +929,73 @@ static int cs_etm__run_decoder(struct cs_etm_queue *etmq)
 		etm->kernel_start = machine__kernel_start(etm->machine);

 	/* Go through each buffer in the queue and decode them one by one */
-more:
-	buffer_used = 0;
-	memset(&buffer, 0, sizeof(buffer));
-	err = cs_etm__get_trace(&buffer, etmq);
-	if (err <= 0)
-		return err;
-
-	/*
-	 * We cannot assume consecutive blocks in the data file are contiguous,
-	 * reset the decoder to force re-sync.
-	 */
-	err = cs_etm_decoder__reset(etmq->decoder);
-	if (err != 0)
-		return err;
-
-	/* Run trace decoder until buffer consumed or end of trace */
-	do {
-		processed = 0;
-
-		err = cs_etm_decoder__process_data_block(
-			etmq->decoder,
-			etmq->offset,
-			&buffer.buf[buffer_used],
-			buffer.len - buffer_used,
-			&processed);
-
-		if (err)
-			return err;
-
-		etmq->offset += processed;
-		buffer_used += processed;
-
-		/*
-		 * Nothing to do with an error condition, let's hope the next
-		 * chunk will be better.
-		 */
-		err = cs_etm__sample(etmq);
-	} while (buffer.len > buffer_used);
-
-goto more;
+	while (1) {
+		buffer_used = 0;
+		memset(&buffer, 0, sizeof(buffer));
+		err = cs_etm__get_trace(&buffer, etmq);
+		if (err <= 0)
+			return err;
+
+		/*
+		 * We cannot assume consecutive blocks in the data file are
+		 * contiguous, reset the decoder to force re-sync.
+		 */
+		err = cs_etm_decoder__reset(etmq->decoder);
+		if (err != 0)
+			return err;
+
+		/* Run trace decoder until buffer consumed or end of trace */
+		do {
+			processed = 0;
+			err = cs_etm_decoder__process_data_block(
+				etmq->decoder,
+				etmq->offset,
+				&buffer.buf[buffer_used],
+				buffer.len - buffer_used,
+				&processed);
+			if (err)
+				return err;
+
+			etmq->offset += processed;
+			buffer_used += processed;
+
+			while (1) {
+				err = cs_etm_decoder__get_packet(etmq->decoder,
+								 etmq->packet);
+				if (err <= 0)
+					/*
+					 * Stop processing this chunk on
+					 * end of data or error
+					 */
+					break;
+
+				/*
+				 * If the packet contains an instruction
+				 * range, generate instruction sequence
+				 * events.
+				 */
+				if (etmq->packet->sample_type & CS_ETM_RANGE)
+					err = cs_etm__sample(etmq);
+			}
+		} while (buffer.len > buffer_used);
+
+		/*
+		 * Generate a last branch event for the branches left in
+		 * the circular buffer at the end of the trace.
+		 */
+		if (etm->sample_instructions &&
+		    etmq->etm->synth_opts.last_branch) {
+			struct branch_stack *bs = etmq->last_branch_rb;
+			struct branch_entry *be =
+				&bs->entries[etmq->last_branch_pos];
+
+			err = cs_etm__synth_instruction_sample(
+				etmq, be->to, etmq->period_instructions);
+			if (err)
+				return err;
+		}
+	}

 	return err;
--
1.9.1
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
There may be discontinuities in the ETM trace stream due to overflows or ETM configuration for selective trace. This patch emits an instruction sample with the pending branch stack when a TRACE ON packet occurs indicating a discontinuity in the trace data.
A new packet type CS_ETM_TRACE_ON is added, which is emitted by the low level decoder when a TRACE ON occurs. The higher level decoder flushes the branch stack when this packet is emitted.
Signed-off-by: Robert Walker robert.walker@arm.com
---
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 71 ++++++++++++++------
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.h |  1 +
 tools/perf/util/cs-etm.c                        | 89 +++++++++++++++--------
 3 files changed, 113 insertions(+), 48 deletions(-)
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
index ecf1780..d199f58 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
@@ -78,6 +78,8 @@ int cs_etm_decoder__reset(struct cs_etm_decoder *decoder)
 {
 	ocsd_datapath_resp_t dp_ret;

+	decoder->prev_return = OCSD_RESP_CONT;
+
 	dp_ret = ocsd_dt_process_data(decoder->dcd_tree, OCSD_OP_RESET,
 				      0, 0, NULL, NULL);
 	if (OCSD_DATA_RESP_IS_FATAL(dp_ret))
@@ -263,7 +265,6 @@ static void cs_etm_decoder__clear_buffer(struct cs_etm_decoder *decoder)
 static ocsd_datapath_resp_t
 cs_etm_decoder__buffer_packet(struct cs_etm_decoder *decoder,
-			      const ocsd_generic_trace_elem *elem,
 			      const u8 trace_chan_id,
 			      enum cs_etm_sample_type sample_type)
 {
@@ -279,35 +280,62 @@ static void cs_etm_decoder__clear_buffer(struct cs_etm_decoder *decoder)
 		return OCSD_RESP_FATAL_SYS_ERR;

 	et = decoder->tail;
-	decoder->packet_buffer[et].sample_type = sample_type;
-	decoder->packet_buffer[et].start_addr = elem->st_addr;
-	decoder->packet_buffer[et].end_addr = elem->en_addr;
+	et = (et + 1) & (MAX_BUFFER - 1);
+	decoder->tail = et;
+	decoder->packet_count++;
+
+	decoder->packet_buffer[et].sample_type = sample_type;
+	decoder->packet_buffer[et].exc = false;
+	decoder->packet_buffer[et].exc_ret = false;
+	decoder->packet_buffer[et].cpu = *((int *)inode->priv);
+	decoder->packet_buffer[et].start_addr = 0xdeadbeefdeadbeefUL;
+	decoder->packet_buffer[et].end_addr = 0xdeadbeefdeadbeefUL;
+
+	if (decoder->packet_count == MAX_BUFFER - 1)
+		return OCSD_RESP_WAIT;
+
+	return OCSD_RESP_CONT;
+}
+
+static ocsd_datapath_resp_t
+cs_etm_decoder__buffer_range(struct cs_etm_decoder *decoder,
+			     const ocsd_generic_trace_elem *elem,
+			     const uint8_t trace_chan_id)
+{
+	int ret = 0;
+	struct cs_etm_packet *packet;
+
+	ret = cs_etm_decoder__buffer_packet(decoder, trace_chan_id,
+					    CS_ETM_RANGE);
+	if (ret != OCSD_RESP_CONT && ret != OCSD_RESP_WAIT)
+		return ret;
+
+	packet = &decoder->packet_buffer[decoder->tail];
+
+	packet->start_addr = elem->st_addr;
+	packet->end_addr = elem->en_addr;
 	switch (elem->last_i_type) {
 	case OCSD_INSTR_BR:
 	case OCSD_INSTR_BR_INDIRECT:
-		decoder->packet_buffer[et].last_instr_taken_branch = elem->last_instr_exec;
+		packet->last_instr_taken_branch = elem->last_instr_exec;
 		break;
 	case OCSD_INSTR_ISB:
 	case OCSD_INSTR_DSB_DMB:
 	case OCSD_INSTR_OTHER:
 	default:
-		decoder->packet_buffer[et].last_instr_taken_branch = false;
+		packet->last_instr_taken_branch = false;
 		break;
 	}
-	decoder->packet_buffer[et].exc = false;
-	decoder->packet_buffer[et].exc_ret = false;
-	decoder->packet_buffer[et].cpu = *((int *)inode->priv);
-
-	/* Wrap around if need be */
-	et = (et + 1) & (MAX_BUFFER - 1);
-
-	decoder->tail = et;
-	decoder->packet_count++;
-
-	if (decoder->packet_count == MAX_BUFFER - 1)
-		return OCSD_RESP_WAIT;
+	return ret;
+}

-	return OCSD_RESP_CONT;
+static ocsd_datapath_resp_t
+cs_etm_decoder__buffer_trace_on(struct cs_etm_decoder *decoder,
+				const uint8_t trace_chan_id)
+{
+	return cs_etm_decoder__buffer_packet(decoder, trace_chan_id,
+					     CS_ETM_TRACE_ON);
 }

 static ocsd_datapath_resp_t cs_etm_decoder__gen_trace_elem_printer(
@@ -326,12 +354,13 @@ static ocsd_datapath_resp_t cs_etm_decoder__gen_trace_elem_printer(
 		decoder->trace_on = false;
 		break;
 	case OCSD_GEN_TRC_ELEM_TRACE_ON:
+		resp = cs_etm_decoder__buffer_trace_on(decoder,
+						       trace_chan_id);
 		decoder->trace_on = true;
 		break;
 	case OCSD_GEN_TRC_ELEM_INSTR_RANGE:
-		resp = cs_etm_decoder__buffer_packet(decoder, elem,
-						     trace_chan_id,
-						     CS_ETM_RANGE);
+		resp = cs_etm_decoder__buffer_range(decoder, elem,
+						    trace_chan_id);
 		break;
 	case OCSD_GEN_TRC_ELEM_EXCEPTION:
 		decoder->packet_buffer[decoder->tail].exc = true;
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
index a4fdd28..743f5f4 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
@@ -24,6 +24,7 @@ struct cs_etm_buffer {
 enum cs_etm_sample_type {
 	CS_ETM_RANGE = 1 << 0,
+	CS_ETM_TRACE_ON = 1 << 1,
 };

 struct cs_etm_packet {
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 43cf1d4..a8d07bd 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -866,6 +866,8 @@ static int cs_etm__sample(struct cs_etm_queue *etmq)
 	 * PREV_PACKET is a branch.
 	 */
 	if (etm->synth_opts.last_branch &&
+	    etmq->prev_packet &&
+	    etmq->prev_packet->sample_type == CS_ETM_RANGE &&
 	    etmq->prev_packet->last_instr_taken_branch)
 		cs_etm__update_last_branch_rb(etmq);

@@ -898,14 +900,14 @@ static int cs_etm__sample(struct cs_etm_queue *etmq)
 	}

 	if (etm->sample_branches &&
-	    etmq->packet->last_instr_taken_branch) {
+	    etmq->prev_packet->sample_type == CS_ETM_RANGE &&
+	    etmq->prev_packet->last_instr_taken_branch) {
 		ret = cs_etm__synth_branch_sample(etmq);
 		if (ret)
 			return ret;
 	}

-	if (etm->sample_branches ||
-	    etm->synth_opts.last_branch) {
+	if (etm->sample_branches || etm->synth_opts.last_branch) {
 		/*
 		 * Swap PACKET with PREV_PACKET: PACKET becomes PREV_PACKET for
 		 * the next incoming packet.
@@ -918,6 +920,40 @@ static int cs_etm__sample(struct cs_etm_queue *etmq)
 	return 0;
 }

+static int cs_etm__flush(struct cs_etm_queue *etmq)
+{
+	int err = 0;
+	struct cs_etm_packet *tmp;
+
+	if (etmq->etm->synth_opts.last_branch &&
+	    etmq->prev_packet &&
+	    etmq->prev_packet->sample_type == CS_ETM_RANGE) {
+		/*
+		 * Generate a last branch event for the branches left in the
+		 * circular buffer at the end of the trace.
+		 *
+		 * Use the address of the end of the last reported execution
+		 * range
+		 */
+		u64 addr = cs_etm__last_executed_instr(etmq->prev_packet);
+
+		err = cs_etm__synth_instruction_sample(
+			etmq, addr,
+			etmq->period_instructions);
+		etmq->period_instructions = 0;
+
+		/*
+		 * Swap PACKET with PREV_PACKET: PACKET becomes PREV_PACKET for
+		 * the next incoming packet.
+		 */
+		tmp = etmq->packet;
+		etmq->packet = etmq->prev_packet;
+		etmq->prev_packet = tmp;
+	}
+
+	return err;
+}
+
 static int cs_etm__run_decoder(struct cs_etm_queue *etmq)
 {
 	struct cs_etm_auxtrace *etm = etmq->etm;
@@ -946,20 +982,19 @@ static int cs_etm__run_decoder(struct cs_etm_queue *etmq)
 		/* Run trace decoder until buffer consumed or end of trace */
 		do {
 			processed = 0;
-
 			err = cs_etm_decoder__process_data_block(
 				etmq->decoder,
 				etmq->offset,
 				&buffer.buf[buffer_used],
 				buffer.len - buffer_used,
 				&processed);
-
 			if (err)
 				return err;

 			etmq->offset += processed;
 			buffer_used += processed;

+			/* Process each packet in this chunk */
 			while (1) {
 				err = cs_etm_decoder__get_packet(etmq->decoder,
 								 etmq->packet);
@@ -970,31 +1005,31 @@ static int cs_etm__run_decoder(struct cs_etm_queue *etmq)
 					 */
 					break;

-				/*
-				 * If the packet contains an instruction
-				 * range, generate instruction sequence
-				 * events.
-				 */
-				if (etmq->packet->sample_type & CS_ETM_RANGE)
-					err = cs_etm__sample(etmq);
+				switch (etmq->packet->sample_type) {
+				case CS_ETM_RANGE:
+					/*
+					 * If the packet contains an
+					 * instruction range, generate
+					 * instruction sequence events.
+					 */
+					cs_etm__sample(etmq);
+					break;
+				case CS_ETM_TRACE_ON:
+					/*
+					 * Discontinuity in trace, flush
+					 * previous branch stack
+					 */
+					cs_etm__flush(etmq);
+					break;
+				default:
+					break;
+				}
 			}
 		} while (buffer.len > buffer_used);

-		/*
-		 * Generate a last branch event for the branches left in
-		 * the circular buffer at the end of the trace.
-		 */
-		if (etm->sample_instructions &&
-		    etmq->etm->synth_opts.last_branch) {
-			struct branch_stack *bs = etmq->last_branch_rb;
-			struct branch_entry *be =
-				&bs->entries[etmq->last_branch_pos];
-
-			err = cs_etm__synth_instruction_sample(
-				etmq, be->to, etmq->period_instructions);
-			if (err)
-				return err;
-		}
+		if (err == 0)
+			/* Flush any remaining branch stack entries */
+			err = cs_etm__flush(etmq);
 	}

 	return err;
On Tue, Jan 30, 2018 at 02:42:25PM +0000, Robert Walker wrote:
There may be discontinuities in the ETM trace stream due to overflows or ETM configuration for selective trace. This patch emits an instruction sample with the pending branch stack when a TRACE ON packet occurs indicating a discontinuity in the trace data.
A new packet type CS_ETM_TRACE_ON is added, which is emitted by the low level decoder when a TRACE ON occurs. The higher level decoder flushes the branch stack when this packet is emitted.
Signed-off-by: Robert Walker robert.walker@arm.com
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 71 ++++++++++++++------
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.h |  1 +
 tools/perf/util/cs-etm.c                        | 89 +++++++++++++++--------
 3 files changed, 113 insertions(+), 48 deletions(-)
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
index ecf1780..d199f58 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
@@ -78,6 +78,8 @@ int cs_etm_decoder__reset(struct cs_etm_decoder *decoder)
 {
 	ocsd_datapath_resp_t dp_ret;

+	decoder->prev_return = OCSD_RESP_CONT;
+
 	dp_ret = ocsd_dt_process_data(decoder->dcd_tree, OCSD_OP_RESET,
 				      0, 0, NULL, NULL);
 	if (OCSD_DATA_RESP_IS_FATAL(dp_ret))
@@ -263,7 +265,6 @@ static void cs_etm_decoder__clear_buffer(struct cs_etm_decoder *decoder)
 static ocsd_datapath_resp_t
 cs_etm_decoder__buffer_packet(struct cs_etm_decoder *decoder,
-			      const ocsd_generic_trace_elem *elem,
 			      const u8 trace_chan_id,
 			      enum cs_etm_sample_type sample_type)
 {
@@ -279,35 +280,62 @@ static void cs_etm_decoder__clear_buffer(struct cs_etm_decoder *decoder)
 		return OCSD_RESP_FATAL_SYS_ERR;

 	et = decoder->tail;
-	decoder->packet_buffer[et].sample_type = sample_type;
-	decoder->packet_buffer[et].start_addr = elem->st_addr;
-	decoder->packet_buffer[et].end_addr = elem->en_addr;
+	et = (et + 1) & (MAX_BUFFER - 1);
+	decoder->tail = et;
+	decoder->packet_count++;
+
+	decoder->packet_buffer[et].sample_type = sample_type;
+	decoder->packet_buffer[et].exc = false;
+	decoder->packet_buffer[et].exc_ret = false;
+	decoder->packet_buffer[et].cpu = *((int *)inode->priv);
+	decoder->packet_buffer[et].start_addr = 0xdeadbeefdeadbeefUL;
+	decoder->packet_buffer[et].end_addr = 0xdeadbeefdeadbeefUL;
Please drop the tabulations.
+	if (decoder->packet_count == MAX_BUFFER - 1)
+		return OCSD_RESP_WAIT;
+
+	return OCSD_RESP_CONT;
+}
+
+static ocsd_datapath_resp_t
+cs_etm_decoder__buffer_range(struct cs_etm_decoder *decoder,
+			     const ocsd_generic_trace_elem *elem,
+			     const uint8_t trace_chan_id)
+{
+	int ret = 0;
+	struct cs_etm_packet *packet;
+
+	ret = cs_etm_decoder__buffer_packet(decoder, trace_chan_id,
+					    CS_ETM_RANGE);
+	if (ret != OCSD_RESP_CONT && ret != OCSD_RESP_WAIT)
+		return ret;
+
+	packet = &decoder->packet_buffer[decoder->tail];
+
+	packet->start_addr = elem->st_addr;
+	packet->end_addr = elem->en_addr;
Same here.
 	switch (elem->last_i_type) {
 	case OCSD_INSTR_BR:
 	case OCSD_INSTR_BR_INDIRECT:
-		decoder->packet_buffer[et].last_instr_taken_branch = elem->last_instr_exec;
+		packet->last_instr_taken_branch = elem->last_instr_exec;
 		break;
 	case OCSD_INSTR_ISB:
 	case OCSD_INSTR_DSB_DMB:
 	case OCSD_INSTR_OTHER:
 	default:
-		decoder->packet_buffer[et].last_instr_taken_branch = false;
+		packet->last_instr_taken_branch = false;
 		break;
 	}
-	decoder->packet_buffer[et].exc = false;
-	decoder->packet_buffer[et].exc_ret = false;
-	decoder->packet_buffer[et].cpu = *((int *)inode->priv);
-
-	/* Wrap around if need be */
-	et = (et + 1) & (MAX_BUFFER - 1);
-
-	decoder->tail = et;
-	decoder->packet_count++;
-
-	if (decoder->packet_count == MAX_BUFFER - 1)
-		return OCSD_RESP_WAIT;
+	return ret;
+}

-	return OCSD_RESP_CONT;
+static ocsd_datapath_resp_t
+cs_etm_decoder__buffer_trace_on(struct cs_etm_decoder *decoder,
+				const uint8_t trace_chan_id)
+{
+	return cs_etm_decoder__buffer_packet(decoder, trace_chan_id,
+					     CS_ETM_TRACE_ON);
 }

 static ocsd_datapath_resp_t cs_etm_decoder__gen_trace_elem_printer(
@@ -326,12 +354,13 @@ static ocsd_datapath_resp_t cs_etm_decoder__gen_trace_elem_printer(
 		decoder->trace_on = false;
 		break;
 	case OCSD_GEN_TRC_ELEM_TRACE_ON:
+		resp = cs_etm_decoder__buffer_trace_on(decoder,
+						       trace_chan_id);
 		decoder->trace_on = true;
 		break;
 	case OCSD_GEN_TRC_ELEM_INSTR_RANGE:
-		resp = cs_etm_decoder__buffer_packet(decoder, elem,
-						     trace_chan_id,
-						     CS_ETM_RANGE);
+		resp = cs_etm_decoder__buffer_range(decoder, elem,
+						    trace_chan_id);
I think the two patches should be kept as they are since one is clearly building on top of the other. In your next revision, keep function cs_etm_decoder__buffer_range() so that in the second patch you don't undo the work you've done in the first patch.
 		break;
 	case OCSD_GEN_TRC_ELEM_EXCEPTION:
 		decoder->packet_buffer[decoder->tail].exc = true;
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
index a4fdd28..743f5f4 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
@@ -24,6 +24,7 @@ struct cs_etm_buffer {
 enum cs_etm_sample_type {
 	CS_ETM_RANGE = 1 << 0,
+	CS_ETM_TRACE_ON = 1 << 1,
 };

 struct cs_etm_packet {
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 43cf1d4..a8d07bd 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -866,6 +866,8 @@ static int cs_etm__sample(struct cs_etm_queue *etmq)
 	 * PREV_PACKET is a branch.
 	 */
 	if (etm->synth_opts.last_branch &&
+	    etmq->prev_packet &&
+	    etmq->prev_packet->sample_type == CS_ETM_RANGE &&
 	    etmq->prev_packet->last_instr_taken_branch)
 		cs_etm__update_last_branch_rb(etmq);

@@ -898,14 +900,14 @@ static int cs_etm__sample(struct cs_etm_queue *etmq)
 	}

 	if (etm->sample_branches &&
-	    etmq->packet->last_instr_taken_branch) {
+	    etmq->prev_packet->sample_type == CS_ETM_RANGE &&
+	    etmq->prev_packet->last_instr_taken_branch) {
Much better.
 		ret = cs_etm__synth_branch_sample(etmq);
 		if (ret)
 			return ret;
 	}

-	if (etm->sample_branches ||
-	    etm->synth_opts.last_branch) {
+	if (etm->sample_branches || etm->synth_opts.last_branch) {
 		/*
 		 * Swap PACKET with PREV_PACKET: PACKET becomes PREV_PACKET for
 		 * the next incoming packet.
@@ -918,6 +920,40 @@ static int cs_etm__sample(struct cs_etm_queue *etmq)
 	return 0;
 }

+static int cs_etm__flush(struct cs_etm_queue *etmq)
+{
+	int err = 0;
+	struct cs_etm_packet *tmp;
+
+	if (etmq->etm->synth_opts.last_branch &&
+	    etmq->prev_packet &&
+	    etmq->prev_packet->sample_type == CS_ETM_RANGE) {
+		/*
+		 * Generate a last branch event for the branches left in the
+		 * circular buffer at the end of the trace.
+		 *
+		 * Use the address of the end of the last reported execution
+		 * range
+		 */
+		u64 addr = cs_etm__last_executed_instr(etmq->prev_packet);
+
+		err = cs_etm__synth_instruction_sample(
+			etmq, addr,
+			etmq->period_instructions);
+		etmq->period_instructions = 0;
+
+		/*
+		 * Swap PACKET with PREV_PACKET: PACKET becomes PREV_PACKET for
+		 * the next incoming packet.
+		 */
+		tmp = etmq->packet;
+		etmq->packet = etmq->prev_packet;
+		etmq->prev_packet = tmp;
I've been thinking about etmq->packet yesterday. From what I see there is no need to have it in the cs_etm_queue structure... Perhaps I missed something.
Finally I tested things on my side and things work as advertised.
I'm done reviewing your patches - thanks for the submission. Unless you see a reason not to I think the next revision should be sent to the public mailing list once -rc1 comes out. That will give time to other people on the CS list to review and test your work.
Mathieu
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
On 02/02/18 22:10, Mathieu Poirier wrote:
On Tue, Jan 30, 2018 at 02:42:25PM +0000, Robert Walker wrote:
[...]

I think the two patches should be kept as they are since one is clearly building on top of the other. In your next revision, keep function cs_etm_decoder__buffer_range() so that in the second patch you don't undo the work you've done in the first patch.
I'm not sure what you mean here - the first patch doesn't really change anything in this file other than add the last_instr_exec field. This patch makes cs_etm_decoder__buffer_packet() the common code for creating a packet and adds functions that call this for execution range and trace on packets. Do you mean move this refactoring into the first patch?
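To make the split concrete, what this patch ends up with is a common cs_etm_decoder__buffer_packet() that advances the ring and sets safe defaults, plus thin per-type wrappers. A stand-alone model of that shape, with stand-in types only (not the real decoder, which works on ocsd_generic_trace_elem and gets the CPU from the inode-keyed lookup):

	#include <stdbool.h>
	#include <stdint.h>

	#define MAX_BUFFER 4	/* power of two, as in the decoder */

	enum sample_type { RANGE, TRACE_ON };

	struct packet {
		enum sample_type type;
		uint64_t start_addr, end_addr;
		bool last_instr_taken_branch;
	};

	struct decoder {
		struct packet buf[MAX_BUFFER];
		int tail;
		int count;
	};

	/* Common part: advance the ring and reset every field to a default */
	static struct packet *buffer_packet(struct decoder *d, enum sample_type type)
	{
		d->tail = (d->tail + 1) & (MAX_BUFFER - 1);
		d->count++;
		d->buf[d->tail] = (struct packet){ .type = type };
		return &d->buf[d->tail];
	}

	/* Range packets then fill in the element-specific fields... */
	static void buffer_range(struct decoder *d, uint64_t st, uint64_t en,
				 bool taken)
	{
		struct packet *p = buffer_packet(d, RANGE);

		p->start_addr = st;
		p->end_addr = en;
		p->last_instr_taken_branch = taken;
	}

	/* ...while a TRACE_ON packet needs nothing beyond the defaults */
	static void buffer_trace_on(struct decoder *d)
	{
		buffer_packet(d, TRACE_ON);
	}

If that common/wrapper split moves into the first patch, the second patch is left adding only cs_etm_decoder__buffer_trace_on() and the CS_ETM_TRACE_ON handling.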
[...]
I've been thinking about etmq->packet yesterday. From what I see there is no need to have it in the cs_etm_queue structure... Perhaps I missed something.
You could make it a local in cs_etm__run_decoder(), but would then have to pass it as a parameter to the other functions. And you would need a memcpy to save it in prev_packet instead of a pointer swap.
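In sketch form (stand-alone, with a stand-in type - the real struct cs_etm_packet is declared in cs-etm-decoder.h):

	#include <string.h>

	struct pkt {
		unsigned long long start_addr, end_addr;
		unsigned char last_instr_taken_branch;
	};

	/* Heap-allocated pair: saving the old packet is a pointer swap, O(1) */
	static void save_prev_by_swap(struct pkt **packet, struct pkt **prev_packet)
	{
		struct pkt *tmp = *packet;

		*packet = *prev_packet;	/* recycled as the next decode target */
		*prev_packet = tmp;	/* old contents kept for the next sample */
	}

	/* Stack-local packet: saving it means copying the whole struct */
	static void save_prev_by_copy(const struct pkt *packet, struct pkt *prev_packet)
	{
		memcpy(prev_packet, packet, sizeof(*prev_packet));
	}

The copy would run once per decoded packet, so keeping both packets on the heap and swapping pointers looked like the cheaper option.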
Finally I tested things on my side and things work as advertised.
I'm done reviewing your patches - thanks for the submission. Unless you see a reason not to I think the next revision should be sent to the public mailing list once -rc1 comes out. That will give time to other people on the CS list to review and test your work.
Mathieu
Thanks for the review. Do you mean linux-arm-kernel as the public mailing list?
Rob
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
On 5 February 2018 at 04:15, Robert Walker robert.walker@arm.com wrote:
On 02/02/18 22:10, Mathieu Poirier wrote:
On Tue, Jan 30, 2018 at 02:42:25PM +0000, Robert Walker wrote:
[...]

I think the two patches should be kept as they are since one is clearly building on top of the other. In your next revision, keep function cs_etm_decoder__buffer_range() so that in the second patch you don't undo the work you've done in the first patch.
I'm not sure what you mean here - the first patch doesn't really change anything in this file other than add the last_instr_exec field. This patch makes cs_etm_decoder__buffer_packet() the common code for creating a packet and adds functions that call this for execution range and trace on packets. Do you mean move this refactoring into the first patch?
Yes
[...]
I've been thinking about etmq->packet yesterday. From what I see there is no need to have it in the cs_etm_queue structure... Perhaps I missed something.
You could make it a local in cs_etm__run_decoder(), but would then have to pass it as a parameter to the other functions. And you would need a memcpy to save it in prev_packet instead of a pointer swap.
I don't care much about passing the local reference as a parameter to other functions (that's how it was before) but I agree the memcpy() is expensive. Ok, leave it as it is.
Finally I tested things on my side and things work as advertised.
I'm done reviewing your patches - thanks for the submission. Unless you see a reason not to I think the next revision should be sent to the public mailing list once -rc1 comes out. That will give time to other people on the CS list to review and test your work.
Mathieu
Thanks for the review. Do you mean linux-arm-kernel as the public mailing list?
Both linux-arm-kernel@lists.infradead.org and linux-kernel@vger.kernel.org.
Rob
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
On 30 January 2018 at 07:42, Robert Walker robert.walker@arm.com wrote:
Hi,
These patches add support for using perf inject to generate branch events and branch stacks from CoreSight ETM traces.
They apply to the recently submitted perf support for CoreSight trace [1] with the subsequent memory cleanup fix [2]
The first patch is Sebastian's original commits from [3] reworked to apply to the refactored version now upstreamed, with some fixes for branch events and my work on branch stacks posted last November [4], updated with review comments.
Very good. Coincidentally I spent a couple of hours yesterday doing exactly that, i.e. rebasing Sebastian's work on top of the new perf tools patchset, with a very low level of confidence in the result.
The second patch is a new patch that handles discontinuities in the trace stream, e.g. when the ETM is configured to only trace certain regions or is only active some of the time.
These probably need to be squashed together before going upstream, but I've left them as separate commits for initial review.
Ok, let me look at those - I'll get back to you later this week.
Mathieu
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight