These patches fix some issues with the branch stacks generated from CoreSight ETM trace.
The main issues addressed are:

- The branch stack should only contain taken branches.
- The instruction samples are generated using the period specified by the --itrace option to perf inject. Currently, the period can only be specified as an instruction count; further work is required to specify the period as a cycle count or time interval.
- The branch stack should be ordered with the newest branch first.
- Some minor fixes to the address calculations.
With these fixes, the branch stacks are more similar to the last branch records produced by 'perf record -b' and Intel-PT on x86. There are similar improvements in the autofdo profiles generated from these traces.
The patches apply to the autoFDO branch of https://github.com/Linaro/perf-opencsd.git (d3fa0f7)
Regards
Robert Walker
Robert Walker (2):
  Revert "perf inject: record branches in chronological order"
  perf: Fix branch stack records from CoreSight ETM decode
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.c |   4 +-
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.h |   2 +-
 tools/perf/util/cs-etm.c                        | 134 +++++++++++++-----------
 3 files changed, 73 insertions(+), 67 deletions(-)
This reverts commit d3fa0f70b7e80dbd9317d25b2dc028a093e0f5b7.
A previous commit changed the order branches were recorded in so that they are recorded in chronological order. This results in the branch stack generated by perf inject containing the oldest branch first and the newest branch last.
However, this does not match the behaviour of LBR (perf record -b) or the results of perf inject with Intel-PT, which show the newest branch at the top of the stack and the oldest at the bottom.
For example, with the following source:
#include <stdio.h>

int f3(int a)
{
	return a ^ 0x55;
}

int f2(int a)
{
	int b;

	b = f3(a);
	return b;
}

int f1(int a)
{
	int b;

	b = f2(a);
	return b;
}

int main(int argc, char *argv[])
{
	int a = 0xDEADBEEF;
	int i;

	for (i = 0; i < 1000000; ++i)
		a = f1(a);
	printf("%x\n", a);
	return 0;
}
LBR produces the following branch stacks (annotated with symbol addresses):
12714186730559108 [main+12714186730158b22] 0x10a0 [0x1b0]: PERF_RECORD_SAMPLE(IP, 0x4002): 25772/25772: 0x4005c4 [f2+18] period: 63741 addr: 0
... branch stack: nr:16
.....  0: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles P 0
.....  1: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles P 0
.....  2: 00000000004005d9 [f1+10] -> 00000000004005ac [f2] 0 cycles P 0
.....  3: 000000000040060a [main+24] -> 00000000004005c9 [f1] 0 cycles P 0
.....  4: 000000000040061d [main+37] -> 0000000000400605 [main+1f] 0 cycles P 0
.....  5: 00000000004005e5 [f1+1c] -> 000000000040060f [main+29] 0 cycles P 0
.....  6: 00000000004005c8 [f2+1c] -> 00000000004005de [f1+15] 0 cycles P 0
.....  7: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles P 0
.....  8: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles P 0
.....  9: 00000000004005d9 [f1+10] -> 00000000004005ac [f2] 0 cycles P 0
..... 10: 000000000040060a [main+24] -> 00000000004005c9 [f1] 0 cycles P 0
..... 11: 000000000040061d [main+37] -> 0000000000400605 [main+1f] 0 cycles P 0
..... 12: 00000000004005e5 [f1+1c] -> 000000000040060f [main+29] 0 cycles P 0
..... 13: 00000000004005c8 [f2+1c] -> 00000000004005de [f1+15] 0 cycles P 0
..... 14: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles P 0
..... 15: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles P 0
Where the stack shows the call from main() -> f1() -> f2() -> f3(), then return to f2().
By restoring the previous behaviour, the branch stacks generated from ETM trace have the same ordering.
Signed-off-by: Robert Walker <robert.walker@arm.com>
---
 tools/perf/util/cs-etm.c | 47 ++++++++++++++++++++++++++---------------------
 1 file changed, 26 insertions(+), 21 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index b1095a5..24783b7 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -580,28 +580,32 @@ static inline void cs_etm__copy_last_branch_rb(struct cs_etm_queue *etmq)
 		return;
 
 	/*
-	 * If we wrapped around at least once, the branches from last_branch_pos
-	 * element to last_branch_sz are older valid branches: copy them over.
+	 * As bs_src->entries is a circular buffer, we need to copy from it in
+	 * two steps. First, copy the branches from the most recently inserted
+	 * branch ->last_branch_pos until the end of bs_src->entries buffer.
 	 */
-	if (bs_src->nr >= etmq->etm->synth_opts.last_branch_sz) {
-		nr = etmq->etm->synth_opts.last_branch_sz -
-		     etmq->last_branch_pos - 1;
-		memcpy(&bs_dst->entries[0],
-		       &bs_src->entries[etmq->last_branch_pos + 1],
-		       sizeof(struct branch_entry) * nr);
-	}
+	nr = etmq->etm->synth_opts.last_branch_sz - etmq->last_branch_pos;
+	memcpy(&bs_dst->entries[0],
+	       &bs_src->entries[etmq->last_branch_pos],
+	       sizeof(struct branch_entry) * nr);
 
 	/*
-	 * Copy the branches from the most recently inserted branches from 0
-	 * to last_branch_pos included.
+	 * If we wrapped around at least once, the branches from the beginning
+	 * of the bs_src->entries buffer and until the ->last_branch_pos element
+	 * are older valid branches: copy them over. The total number of
+	 * branches copied over will be equal to the number of branches asked by
+	 * the user in last_branch_sz.
 	 */
-	memcpy(&bs_dst->entries[nr], &bs_src->entries[0],
-	       sizeof(struct branch_entry) * (etmq->last_branch_pos + 1));
+	if (bs_src->nr >= etmq->etm->synth_opts.last_branch_sz) {
+		memcpy(&bs_dst->entries[nr],
+		       &bs_src->entries[0],
+		       sizeof(struct branch_entry) * etmq->last_branch_pos);
+	}
 }
 
 static inline void cs_etm__reset_last_branch_rb(struct cs_etm_queue *etmq)
 {
-	etmq->last_branch_pos = etmq->etm->synth_opts.last_branch_sz - 1;
+	etmq->last_branch_pos = 0;
 	etmq->last_branch_rb->nr = 0;
 }
 
@@ -611,14 +615,15 @@ static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq)
 	struct branch_entry *be;
 
 	/*
-	 * Record branches in a circular buffer in chronological order. After
-	 * writing the last element of the stack, move the insert position back
-	 * to the beginning of the buffer.
+	 * The branches are recorded in a circular buffer in reverse
+	 * chronological order: we start recording from the last element of the
+	 * buffer down. After writing the first element of the stack, move the
+	 * insert position back to the end of the buffer.
 	 */
-	if (etmq->last_branch_pos == etmq->etm->synth_opts.last_branch_sz - 1)
-		etmq->last_branch_pos = 0;
-	else
-		etmq->last_branch_pos += 1;
+	if (!etmq->last_branch_pos)
+		etmq->last_branch_pos = etmq->etm->synth_opts.last_branch_sz;
+
+	etmq->last_branch_pos -= 1;
 
 	be = &bs->entries[etmq->last_branch_pos];
On 13 November 2017 at 08:11, Robert Walker robert.walker@arm.com wrote:
This reverts commit d3fa0f70b7e80dbd9317d25b2dc028a093e0f5b7.
A previous commit changed the order branches were recorded in so that they are recorded in chronological order. This results in the branch stack generated by perf inject containing the oldest branch first and the newest branch last.
However, this does not match the behaviour of LBR (perf record -b) or the results of perf inject with Intel-PT, which show the newest branch at the top of the stack and the oldest at the bottom.
For example, with the following source:
#include <stdio.h>

int f3(int a)
{
	return a ^ 0x55;
}

int f2(int a)
{
	int b;

	b = f3(a);
	return b;
}

int f1(int a)
{
	int b;

	b = f2(a);
	return b;
}

int main(int argc, char *argv[])
{
	int a = 0xDEADBEEF;
	int i;

	for (i = 0; i < 1000000; ++i)
		a = f1(a);
	printf("%x\n", a);
	return 0;
}
LBR produces the following branch stacks (annotated with symbol addresses):
12714186730559108 [main+12714186730158b22] 0x10a0 [0x1b0]: PERF_RECORD_SAMPLE(IP, 0x4002): 25772/25772: 0x4005c4 [f2+18] period: 63741 addr: 0
... branch stack: nr:16
.....  0: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles P 0
.....  1: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles P 0
.....  2: 00000000004005d9 [f1+10] -> 00000000004005ac [f2] 0 cycles P 0
.....  3: 000000000040060a [main+24] -> 00000000004005c9 [f1] 0 cycles P 0
.....  4: 000000000040061d [main+37] -> 0000000000400605 [main+1f] 0 cycles P 0
.....  5: 00000000004005e5 [f1+1c] -> 000000000040060f [main+29] 0 cycles P 0
.....  6: 00000000004005c8 [f2+1c] -> 00000000004005de [f1+15] 0 cycles P 0
.....  7: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles P 0
.....  8: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles P 0
.....  9: 00000000004005d9 [f1+10] -> 00000000004005ac [f2] 0 cycles P 0
..... 10: 000000000040060a [main+24] -> 00000000004005c9 [f1] 0 cycles P 0
..... 11: 000000000040061d [main+37] -> 0000000000400605 [main+1f] 0 cycles P 0
..... 12: 00000000004005e5 [f1+1c] -> 000000000040060f [main+29] 0 cycles P 0
..... 13: 00000000004005c8 [f2+1c] -> 00000000004005de [f1+15] 0 cycles P 0
..... 14: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles P 0
..... 15: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles P 0
Where the stack shows the call from main() -> f1() -> f2() -> f3(), then return to f2().
By restoring the previous behaviour, the branch stacks generated from ETM trace have the same ordering.
I'm in agreement with Robert's assessment. The first pastebin [1] is a snapshot from x86, the second [2] is a snapshot on ARM64 _without_ Robert's revert and the third [3] is another ARM64 snapshot with the revert. In this scenario it is clear that results with the revert do match what we get on Intel. Sebastian, can you share details on the scenario you were working with when implementing your patch?
Thanks, Mathieu
[1]. https://pastebin.com/kkPhg2sx [2]. https://pastebin.com/EZrYpFjJ [3]. https://pastebin.com/gea5JXBS
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
This patch fixes the branch instruction samples produced from CoreSight ETM trace:

- The branch stack should only include taken branches and exclude non-taken branches
- Use the number of instructions in each range reported by the decoder when determining the period between samples
- Correct the address reported for each sample
Signed-off-by: Robert Walker <robert.walker@arm.com>
---
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.c |  4 +-
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.h |  2 +-
 tools/perf/util/cs-etm.c                        | 87 +++++++++++++------------
 3 files changed, 47 insertions(+), 46 deletions(-)
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
index 8abf10d..bcbda7d 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c
@@ -166,13 +166,13 @@ static int cs_etm_decoder__buffer_packet(struct cs_etm_decoder *decoder,
 	switch (elem->last_i_type) {
 	case OCSD_INSTR_BR:
 	case OCSD_INSTR_BR_INDIRECT:
-		decoder->packet_buffer[et].last_instruction_is_branch = true;
+		decoder->packet_buffer[et].last_instruction_is_taken_branch = elem->last_instr_exec;
 		break;
 	case OCSD_INSTR_OTHER:
 	case OCSD_INSTR_ISB:
 	case OCSD_INSTR_DSB_DMB:
 	default:
-		decoder->packet_buffer[et].last_instruction_is_branch = false;
+		decoder->packet_buffer[et].last_instruction_is_taken_branch = false;
 		break;
 	}
 	decoder->packet_buffer[et].exc = false;
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
index 4fd5a9b..c889078 100644
--- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
+++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
@@ -49,7 +49,7 @@ struct cs_etm_packet {
 	enum cs_etm_sample_type sample_type;
 	uint64_t start_addr;
 	uint64_t end_addr;
-	bool last_instruction_is_branch;
+	bool last_instruction_is_taken_branch;
 	bool exc;
 	bool exc_ret;
 	int cpu;
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 24783b7..e9e00a8 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -89,6 +89,7 @@ struct cs_etm_queue {
 	u64 offset;
 	bool eot;
 	bool kernel_mapped;
+	u64 period_insns;
 	/*
 	 * Stack of branches in reverse chronological order that will be copied
 	 * to a last branch event sample.
@@ -379,6 +380,7 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm,
 	struct cs_etm_decoder_params d_params;
 	struct cs_etm_trace_params *t_params;
 	struct cs_etm_queue *etmq;
+	size_t szp = sizeof(struct cs_etm_packet);
 	size_t i;
 
 	etmq = zalloc(sizeof(struct cs_etm_queue));
@@ -396,9 +398,11 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm,
 		etmq->chain = NULL;
 	}
 
+	etmq->packet = zalloc(szp);
+	if (!etmq->packet)
+		goto out_free;
 	if (etm->synth_opts.last_branch) {
 		size_t sz = sizeof(struct branch_stack);
-		size_t szp = sizeof(struct cs_etm_packet);
 
 		sz += etm->synth_opts.last_branch_sz *
 		      sizeof(struct branch_entry);
@@ -411,9 +415,6 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm,
 		etmq->prev_packet = zalloc(szp);
 		if (!etmq->prev_packet)
 			goto out_free;
-		etmq->packet = zalloc(szp);
-		if (!etmq->packet)
-			goto out_free;
 	}
 
 	etmq->event_buf = malloc(PERF_SAMPLE_MAX_SIZE);
@@ -457,6 +458,7 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm,
 
 	etmq->offset = 0;
 	etmq->eot = false;
+	etmq->period_insns = 0;
 
 	return etmq;
 
@@ -741,29 +743,25 @@ static struct cs_etm_cache_entry *cs_etm__cache_lookup(struct dso *dso,
 }
 #endif
 
-static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq)
+static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq, u64 addr, u64 period, u32 cpu)
 {
 	int ret = 0;
 	struct cs_etm_auxtrace *etm = etmq->etm;
 	union perf_event *event = etmq->event_buf;
 	struct perf_sample sample = {.ip = 0,};
-	struct cs_etm_packet *packet = etmq->packet;
-	uint64_t start_addr = packet->start_addr;
-	uint64_t end_addr = packet->end_addr;
 
 	event->sample.header.type = PERF_RECORD_SAMPLE;
 	event->sample.header.misc = PERF_RECORD_MISC_USER;
 	event->sample.header.size = sizeof(struct perf_event_header);
 
-	sample.ip = start_addr;
+	sample.ip = addr;
 	sample.pid = etmq->pid;
 	sample.tid = etmq->tid;
-	sample.addr = end_addr;
 	sample.id = etmq->etm->instructions_id;
 	sample.stream_id = etmq->etm->instructions_id;
-	sample.period = (end_addr - start_addr) >> 2;
-	sample.cpu = packet->cpu;
+	sample.period = period;
+	sample.cpu = cpu;
 	sample.flags = 0; // etmq->flags;
 	sample.insn_len = 1; // etmq->insn_len;
 	sample.cpumode = event->header.misc;
@@ -964,6 +962,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, int *cpu)
 	struct cs_etm_auxtrace *etm = etmq->etm;
 	struct cs_etm_packet *tmp;
 	int err = cs_etm_decoder__get_packet(etmq->decoder, etmq->packet);
+	u64 insns;
 
 	/*
 	 * If there is no sample, it returns err = -1, no real error.
	 */
 	if (err)
 		return err;
 
-	if (etm->synth_opts.last_branch) {
-		*cpu = etmq->packet->cpu;
+	/* A64 instructions are always 4 bytes */
+	insns = (etmq->packet->end_addr - etmq->packet->start_addr) >> 2;
+	etmq->period_insns += insns;
 
-		/*
-		 * FIXME: as the trace sampling does not work for now, (for
-		 * example perf inject --itrace=i100us will not generate events
-		 * every 100 micro-seconds), generate a last branch event after
-		 * having decoded last_branch_sz branch samples. This condition
-		 * should be rewritten as "if reached sampling period".
-		 */
-		if (etmq->last_branch_rb->nr ==
-		    etm->synth_opts.last_branch_sz) {
-			err = cs_etm__synth_instruction_sample(etmq);
-			if (err)
-				return err;
-		}
-		/*
-		 * Record a branch when the last instruction in PREV_PACKET is a
-		 * branch.
-		 */
-		if (etmq->prev_packet->last_instruction_is_branch)
-			cs_etm__update_last_branch_rb(etmq);
+	*cpu = etmq->packet->cpu;
+
+	/*
+	 * Record a branch when the last instruction in PREV_PACKET is a
+	 * branch.
+	 */
+	if (etm->synth_opts.last_branch &&
+	    etmq->prev_packet->last_instruction_is_taken_branch)
+		cs_etm__update_last_branch_rb(etmq);
+
+	if (etmq->period_insns >= etm->instructions_sample_period) {
+		/* Emit instruction sample periodically */
+		/* TODO: allow period to be defined in cycles and clock time */
+
+		u64 insns_over = etmq->period_insns - etm->instructions_sample_period;
+		u64 addr = etmq->packet->end_addr - 4 - (insns_over<<2);
+
+		err = cs_etm__synth_instruction_sample(etmq,
+						       addr,
+						       etm->instructions_sample_period,
+						       etmq->packet->cpu);
+		if (err)
+			return err;
+
+		etmq->period_insns = insns_over;
+	}
+
+	if (etm->synth_opts.last_branch) {
 		/*
 		 * Swap PACKET with PREV_PACKET: PACKET becomes PREV_PACKET for
 		 * the next incoming packet.
 		 */
 		tmp = etmq->packet;
 		etmq->packet = etmq->prev_packet;
 		etmq->prev_packet = tmp;
-	} else if (etmq->packet->sample_type & CS_ETM_RANGE) {
-		err = cs_etm__synth_instruction_sample(etmq);
-		if (err)
-			return err;
 	}
-	return 0;
+
+	return 0;
 }
 
 static int cs_etm__run_decoder(struct cs_etm_queue *etmq, u64 *timestamp)
@@ -1053,11 +1058,7 @@ static int cs_etm__run_decoder(struct cs_etm_queue *etmq, u64 *timestamp)
 			struct branch_stack *bs = etmq->last_branch_rb;
 			struct branch_entry *be = &bs->entries[etmq->last_branch_pos];
 
-			etmq->packet->cpu = cpu;
-			etmq->packet->start_addr = be->to;
-			etmq->packet->end_addr = be->to + 4;
-
-			err = cs_etm__synth_instruction_sample(etmq);
+			err = cs_etm__synth_instruction_sample(etmq, be->to, 1, cpu);
 			if (err)
 				return err;
 		}
On Mon, Nov 13, 2017 at 03:11:42PM +0000, Robert Walker wrote:
This patch fixes the branch instruction samples produced from CoreSight ETM trace:
- The branch stack should only include taken branches and exclude non-taken branches
- Use the number of instructions in each range reported by the decoder when determining the period between samples
- Correct the address reported for each sample
Signed-off-by: Robert Walker <robert.walker@arm.com>
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.c |  4 +-
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.h |  2 +-
 tools/perf/util/cs-etm.c                        | 87 +++++++++++++------------
 3 files changed, 47 insertions(+), 46 deletions(-)
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h
@@ -49,7 +49,7 @@ struct cs_etm_packet {
-	bool last_instruction_is_branch;
+	bool last_instruction_is_taken_branch;
Wouldn't "last_instruction_is_branch_taken" be better?
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
@@ -89,6 +89,7 @@ struct cs_etm_queue {
 	u64 offset;
 	bool eot;
 	bool kernel_mapped;
+	u64 period_insns;
Please add a comment to describe what "insns" means.
+	/* A64 instructions are always 4 bytes */
What about other processors?
+	insns = (etmq->packet->end_addr - etmq->packet->start_addr) >> 2;
Please add comments when using hard coded numbers.
+	etmq->period_insns += insns;
+		u64 insns_over = etmq->period_insns - etm->instructions_sample_period;
+		u64 addr = etmq->packet->end_addr - 4 - (insns_over<<2);
1) Please add comments when using hard coded numbers.
2) s/insns_over<<2/insns_over << 2
Since cs_etm__sample() is used for trace decoding should we also apply this patch to the master branch?
I wanted to test your implementation, but (8bbf7ee5d6cb perf inject: correct recording of branch address and destination) breaks normal CS trace decoding. Would you mind having a look?
Be mindful that I rebased the autoFDO tree to v4.14, so SHA's will have changed.
Thanks, Mathieu
Hi Mathieu,
Thanks for the feedback, I'll update the patch and send a new version out.
On 17/11/17 19:01, Mathieu Poirier wrote:
Since cs_etm__sample() is used for trace decoding should we also apply this patch to the master branch?
I think a lot of the changes in this patch are closely coupled to Sebastian's original patches, so it may be hard to separate them out and apply them to master.
I wanted to test your implemenation but (8bbf7ee5d6cb perf inject: correct recording of branch address and destination) breaks normal CS trace decoding. Would you mind having a look?
Do you mean the trace decode from "perf script"? Do you have any more info on how it's broken and what you would expect? (I'm not familiar with perf script, so don't know what output it should produce).
Be mindful that I rebased the autoFDO tree to v4.14, so SHA's will have changed.
I'll use that as the baseline for any future patches.
Regards
Rob
On 20 November 2017 at 07:48, Robert Walker robert.walker@arm.com wrote:
Hi Mathieu,
Thanks for the feedback, I'll update the patch and send a new version out.
On 17/11/17 19:01, Mathieu Poirier wrote:
Since cs_etm__sample() is used for trace decoding should we also apply this patch to the master branch?
I think a lot of the changes in this patch are closely coupled to Sebastian's original patches, so it may be hard to separate them out and apply them to master.
I wanted to test your implemenation but (8bbf7ee5d6cb perf inject: correct recording of branch address and destination) breaks normal CS trace decoding. Would you mind having a look?
Do you mean the trace decode from "perf script"? Do you have any more info on how it's broken and what you would expect? (I'm not familiar with perf script, so don't know what output it should produce).
I noticed that just doing a "perf report --stdio" on the autoFDO branch hangs with the commit I pointed out.
Be mindful that I rebased the autoFDO tree to v4.14, so SHA's will have changed.
I'll use that as the baseline for any future patches.
Yes please. We don't have a choice; until all patches are upstream we have to rebase to the newest kernel revision.
Regards
Rob
On 20 November 2017 at 15:21, Mathieu Poirier mathieu.poirier@linaro.org wrote:
On 20 November 2017 at 07:48, Robert Walker robert.walker@arm.com wrote:
Hi Mathieu,
Thanks for the feedback, I'll update the patch and send a new version out.
On 17/11/17 19:01, Mathieu Poirier wrote:
Since cs_etm__sample() is used for trace decoding should we also apply this patch to the master branch?
I think a lot of the changes in this patch are closely coupled to Sebastian's original patches, so it may be hard to separate them out and apply them to master.
I wanted to test your implemenation but (8bbf7ee5d6cb perf inject: correct recording of branch address and destination) breaks normal CS trace decoding. Would you mind having a look?
Do you mean the trace decode from "perf script"? Do you have any more info on how it's broken and what you would expect? (I'm not familiar with perf script, so don't know what output it should produce).
I noticed that just doing a "perf report --stdio" on the autoFDO branch hangs with the commit I pointed out.
This hangs without Rob's patches too. --stdio --dump works, --stdio alone hangs. Are we sure we haven't seen this before?
Mike
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
On 21 November 2017 at 10:41, Mike Leach mike.leach@linaro.org wrote:
This hangs without Rob's patches too.
Correct - since Rob is already roaming in that code I was hoping he could have a look.
--stdio --dump works, --stdio only hangs.
I'm not surprised as --dump avoids a lot of code.
Not sure we haven't seen this before?
Mike
-- Mike Leach Principal Engineer, ARM Ltd. Blackburn Design Centre. UK
-----Original Message-----
From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
Sent: 21 November 2017 17:46
To: Mike Leach mike.leach@linaro.org
Cc: Robert Walker Robert.Walker@arm.com; CoreSight@lists.linaro.org
Subject: Re: [PATCH 2/2] perf: Fix branch stack records from CoreSight ETM decode
I've tried it on a few trace captures from the HiKey 960 - it does complete eventually, but takes 10-20 minutes for a 50Mb input file. Does perf report ever complete for you if you leave it for a longer time?
If I inspect it with gdb, it seems to be spending a lot of time in cs_etm__run_decoder() making calls to cs_etm_decoder__process_data_block() - these usually only add a single packet to the output queue for cs_etm__sample(), but it makes *slow* progress through the trace data. Digging down a bit further, cs_etm_decoder__process_data_block() is most often calling the decoder with OCSD_OP_FLUSH because the previous call returned a WAIT response. I wonder if there's an efficiency problem here? With dense trace (i.e. all ATOM packets), it ends up calling into the trace decoder and cs_etm__sample() for almost every bit in the trace data. Can we make it build up a larger queue of packets from the decoder to pass to cs_etm__sample()?
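The ping-pong described above can be sketched with a toy model - the names and behaviour here are invented for illustration, not the real OpenCSD API - showing why a one-packet output queue costs one library crossing per decoded packet:

```python
# Hypothetical mock of the decode loop described above - NOT the real
# OpenCSD API.  Each "byte" of dense trace expands into several packets;
# the output queue holds a single packet, so after every packet the
# decoder returns WAIT and must be FLUSHed before the next byte goes in.

CONT, WAIT = "cont", "wait"

class MockDecoder:
    def __init__(self):
        self.pending = 0    # packets still to emit for the current byte
        self.lib_calls = 0  # crossings into the decode "library"

    def process_byte(self, atoms):
        """OCSD_OP_DATA analogue: emit one packet, WAIT if more remain."""
        self.lib_calls += 1
        self.pending = atoms - 1
        return WAIT if self.pending else CONT

    def flush(self):
        """OCSD_OP_FLUSH analogue: drain one more pending packet."""
        self.lib_calls += 1
        self.pending -= 1
        return WAIT if self.pending else CONT

def run(trace_bytes):
    """Feed each trace byte (>= 1 atoms), flushing until CONT."""
    dec = MockDecoder()
    packets = 0
    for atoms in trace_bytes:
        resp = dec.process_byte(atoms)
        packets += 1
        while resp == WAIT:        # the costly per-packet ping-pong
            resp = dec.flush()
            packets += 1
    return packets, dec.lib_calls

# Three dense bytes of 4 atoms each: 12 packets cost 12 library calls.
print(run([4, 4, 4]))  # (12, 12)
```

A larger output queue would let one OCSD_OP_DATA call deliver many packets before perf has to respond, cutting the number of FLUSH round trips.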
Regards
Rob
On 21 November 2017 at 13:05, Robert Walker robert.walker@arm.com wrote:
Thanks for looking into this. It probably does complete but it's just a matter of giving it time. My development environment is a little far from that right now so I'll test again when you send a second revision.
There is probably room for improvement as this code has stayed largely untouched since inception where our goal was to "just make it work". One way or another we'll have to look at it again when I get to implement support for cpu-wide scenarios. I'm currently half-way into how Intel has done it but had to set that aside to concentrate on upstreaming support for per-thread scenarios. I'm expecting a big sit-in with Mike at the Hong Kong Connect to iron out what and how we'll make cpu-wide tracing work.
Hi Rob, Mathieu,
On 21 November 2017 at 20:32, Mathieu Poirier mathieu.poirier@linaro.org wrote:
On 21 November 2017 at 13:05, Robert Walker robert.walker@arm.com wrote:
-----Original Message----- From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org] Sent: 21 November 2017 17:46 To: Mike Leach mike.leach@linaro.org Cc: Robert Walker Robert.Walker@arm.com; CoreSight@lists.linaro.org Subject: Re: [PATCH 2/2] perf: Fix branch stack records from CoreSight ETM decode
On 21 November 2017 at 10:41, Mike Leach mike.leach@linaro.org wrote:
On 20 November 2017 at 15:21, Mathieu Poirier mathieu.poirier@linaro.org wrote:
I noticed that just doing a "perf report --stdio" on the autoFDO branch hangs with the commit I pointed out.
This hangs without Rob's patches too.
Correct - since Rob is already roaming in that code I was hoping he could have a look.
--stdio --dump works, --stdio only hangs.
I've tried it on a few trace captures from the HiKey 960 - it does complete eventually, but takes 10-20 minutes for a 50Mb input file. Does perf report ever complete for you if you leave it for a longer time?
Thanks for looking into this. It probably does complete but it's just a matter of giving it time. My development environment is a little far from that right now so I'll test again when you send a second revision.
If I inspect it with gdb, it seems to be spending a lot of time in cs_etm__run_decoder() making calls to cs_etm_decoder__process_data_block() - these usually only add a single packet to the output queue for cs_etm__sample(), but it is make *slow* progress through the trace data. Digging down a bit further, cs_etm_decoder__process_data_block() is most often calling the decoder with OCSD_OP_FLUSH because the previous call returned a WAIT response. I wonder if there's an efficiency problem here? With dense trace (i.e. all ATOM packets), it ends up calling into the trace decoder and cs_etm__sample() for almost every bit in the trace data. Can we make it build up a larger queue of packets from the decoder to pass to cs_etm__sample()?
There is probably room for improvement as this code has stayed largely untouched since inception where our goal was to "just make it work". One way or another we'll have to look at it again when I get to implement support for cpu-wide scenarios. I'm currently half-way into how Intel has done it but had to set that aside to concentrate on upstreaming support for per-thread scenarios. I'm expecting a big sit-in with Mike at the Hong Kong Connect to iron out what and how we'll make cpu-wide tracing work.
I'd agree with the above - we haven't really looked at optimising the code to date.
The decode library client registers a callback with the library to accept decoded trace packets, and can return CONT / WAIT accordingly.
I would expect that perf should be processing each packet immediately (which is likely what does happen with --dump) and return CONT, which avoids the necessity to run a FLUSH operation. I seem to recall Tor mentioning that perf may be buffering a couple of packets - there was a reason that perf needed a couple of consecutive packets to generate the necessary perf structures - then returning WAIT, which must then be followed by a FLUSH call before more trace data is fed into the library. (We need the flush as a multi-atom byte can generate many output packets, so we must not feed in the next byte of trace data until the current one is done.) It may be better to buffer up multiple packets in a future version.
I am working on some extension code for the library that will implement an in-library packet buffer, allowing clients to have a polling/request type API (this is to make it easier to interface with scripting languages such as python for library standalone testing - registering callbacks from C into python seems tricky at best). It may be we can adjust the perf code to use the same API.
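A polling/request-style client interface of the kind described above might look roughly like this - all names here are invented for illustration and are not the actual library extension:

```python
from collections import deque

# Hedged sketch of a polling-style decode interface: instead of the
# client registering a callback that must answer CONT/WAIT, the library
# buffers decoded packets internally and the client pulls them on demand.

class BufferedDecoder:
    def __init__(self, decode_fn, maxlen=64):
        self._decode = decode_fn          # raw bytes -> iterable of packets
        self._buf = deque(maxlen=maxlen)  # in-library packet buffer

    def feed(self, data):
        """Push trace data; decoded packets accumulate in the buffer."""
        for pkt in self._decode(data):
            self._buf.append(pkt)

    def poll(self):
        """Pull the next decoded packet, or None if the buffer is empty."""
        return self._buf.popleft() if self._buf else None

# A toy stand-in decoder: every input byte becomes one "packet".
dec = BufferedDecoder(lambda data: [("ATOM", b) for b in data])
dec.feed(b"\x01\x02")
print(dec.poll())  # ('ATOM', 1)
print(dec.poll())  # ('ATOM', 2)
print(dec.poll())  # None
```

This shape is also much easier to drive from scripting languages, since the client calls into the library rather than the library calling back into the client.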
Regards
Mike
-----Original Message-----
From: Mike Leach [mailto:mike.leach@linaro.org]
Sent: 22 November 2017 10:59
To: Mathieu Poirier mathieu.poirier@linaro.org
Cc: Robert Walker Robert.Walker@arm.com; CoreSight@lists.linaro.org
Subject: Re: [PATCH 2/2] perf: Fix branch stack records from CoreSight ETM decode
I've been running some tests on the various branches of https://github.com/Linaro/perf-opencsd.git. On a 50Mb capture file, perf report takes:
- master: 19m15s
- autoFDO: > 1hr (I killed it after 1hr)
- autoFDO + my patches: 15m33s (about the same time as perf inject takes)
I'm running perf on an x86 box, so these times would be slower on something like a Juno. I think the autoFDO branch is so much slower because it generates instruction samples every 64 branches (every time the branch stack fills up), which then go through whatever processing perf report does to produce the statistics. With my patches, the samples are only generated and processed every 10000 instructions (or whatever interval is specified).
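Back-of-the-envelope arithmetic supports that explanation. With some invented round numbers (these are illustrative assumptions, not measurements from the traces above), sampling every 64 branches produces far more samples for perf report to process than a 10000-instruction period:

```python
# Illustrative numbers only - not measured from the capture files above.
instructions = 1_000_000_000    # assume 1e9 traced instructions
branches = instructions // 5    # assume roughly one branch per 5 instructions

samples_old = branches // 64          # sample whenever the 64-entry stack fills
samples_new = instructions // 10_000  # sample per the --itrace period

print(samples_old, samples_new, samples_old / samples_new)
# 3125000 100000 31.25
```

Under these assumptions the old scheme emits about 31x as many samples, which is consistent with the large difference in perf report run times.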
Regards
Rob
On 22 November 2017 at 11:23, Robert Walker robert.walker@arm.com wrote:
OK - I assume these tests were for 'perf report --stdio'
Given that, I don't see why the addition of the code for AutoFDO - which I believe is there to make 'perf inject' work - should alter in the slightest an apparently unrelated command.
Mike
Hi Robert and thanks for the code.
On 13 November 2017 at 08:11, Robert Walker robert.walker@arm.com wrote:
I'm a little confused. Here you mention that reverting d3fa0f70b7e8 makes the records look more similar to Intel-PT, but the changelog in d3fa0f70b7e8 claims the same thing. We obviously have two diverging points of view and I'd like to have a better understanding of the situation. Is there any way I can test this on my side?
Thanks, Mathieu
-----Original Message-----
From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
Sent: 14 November 2017 17:56
To: Robert Walker Robert.Walker@arm.com
Cc: CoreSight@lists.linaro.org
Subject: Re: [PATCH 0/2] perf inject branch stack fixes
I used the attached test program to test this - main() calls f1() which then calls f2(), which calls f3().
On an x86 PC, I recorded last branch records with:
perf record -b ./call_chain
And Intel-PT with:
perf record -e intel_pt//u ./call_chain
perf inject -i perf.data -o inj.data --itrace=i10000il --strip
In each case, perf report -D is used to view the instruction samples with branch stacks.
The attached script, addr_mapper.py makes it easier to see what's going on in the branch stack by annotating addresses with their offsets from symbols.
objdump -d ./call_chain > call_chain.dump
perf report -D -i perf.data | ./addr_mapper.py call_chain.dump | less
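The core of an address-annotating filter like this can be sketched as follows - the real addr_mapper.py was attached to the mail and is not shown here, so this is only a guess at its approach, using symbol addresses taken from the dump below:

```python
import bisect
import re

# Hedged sketch of what a script like addr_mapper.py might do: collect
# symbol start addresses from an objdump -d disassembly, then map each
# address to "symbol+offset" via the nearest symbol at or below it.

def load_symbols(objdump_lines):
    """Collect sorted (address, name) pairs from objdump symbol headers."""
    sym_re = re.compile(r"^([0-9a-f]+) <([^>]+)>:")
    return sorted((int(m.group(1), 16), m.group(2))
                  for line in objdump_lines
                  if (m := sym_re.match(line)))

def annotate(addr, syms):
    """Render addr as 'symbol+offset' using the nearest lower symbol."""
    i = bisect.bisect_right(syms, (addr, chr(0x10FFFF))) - 1
    if i < 0:
        return "%x" % addr          # before the first symbol: leave raw
    base, name = syms[i]
    off = addr - base
    return name if off == 0 else "%s+%x" % (name, off)

# Symbol addresses as they appear in the branch stack dump below.
dump = [
    "000000000040059d <f3>:",
    "00000000004005ac <f2>:",
    "00000000004005c9 <f1>:",
    "00000000004005e6 <main>:",
]
syms = load_symbols(dump)
print(annotate(0x4005ab, syms))  # f3+e
print(annotate(0x40059d, syms))  # f3
```

The full script would additionally scan each line of the piped perf report output and substitute every hex address it finds.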
This results in a branch stack like this:
13 12877531596650324 [main+1287753159624fd3e] 0x4528 [0x640]: PERF_RECORD_SAMPLE(IP, 0x2): 17548/17548: 0x400608 [main+22] period: 10000 addr: 0
... branch stack: nr:64
..... 0: 000000000040061d [main+37] -> 0000000000400605 [main+1f] 0 cycles 0
..... 1: 00000000004005e5 [f1+1c] -> 000000000040060f [main+29] 0 cycles 0
..... 2: 00000000004005c8 [f2+1c] -> 00000000004005de [f1+15] 0 cycles 0
..... 3: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles 0
..... 4: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles 0
..... 5: 00000000004005d9 [f1+10] -> 00000000004005ac [f2] 0 cycles 0
..... 6: 000000000040060a [main+24] -> 00000000004005c9 [f1] 0 cycles 0
..... 7: 000000000040061d [main+37] -> 0000000000400605 [main+1f] 0 cycles 0
..... 8: 00000000004005e5 [f1+1c] -> 000000000040060f [main+29] 0 cycles 0
..... 9: 00000000004005c8 [f2+1c] -> 00000000004005de [f1+15] 0 cycles 0
..... 10: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles 0
..... 11: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles 0
..... 12: 00000000004005d9 [f1+10] -> 00000000004005ac [f2] 0 cycles 0
..... 13: 000000000040060a [main+24] -> 00000000004005c9 [f1] 0 cycles 0
Entry 13 is the call from main() to f1(), entry 12 is the call from f1() to f2(), entry 11 is the call from f2() to f3(). Then entries 10, 9 & 8 are the returns from f3(), f2() and f1() back up to main().
Without the reversion of d3fa0f7, the Arm trace produced the reverse stack, so that the call from main() to f1() appeared at the top, f1() to f2() as the 2nd entry, f2() to f3() as the 3rd and so on. With d3fa0f7 reverted, the Arm stacks match the order of the intel stacks.
Hope this helps.
Regards
Rob
Hello Robert,
On 15 November 2017 at 04:55, Robert Walker robert.walker@arm.com wrote:
On an x86 PC, I recorded last branch records with:
perf record -b ./call_chain
That worked on x86 but blew up on Juno:
root@linaro-nano:~/rwalker# perf record -b ./call_chain
Error: PMU Hardware doesn't support sampling/overflow-interrupts.
Have you also seen this?
And Intel-PT with:
perf record -e intel_pt//u ./call_chain
perf inject -i perf.data -o inj.data --itrace=i10000il --strip
A couple of questions here:
1) The command "perf record -e intel_pt//u ./call_chain" will trigger a cpu-wide trace session, something that is not (yet) supported on CS. Did you add the --per-thread switch when doing this on Juno?
2) The command "perf inject -i perf.data -o inj.data --itrace=i10000il --strip" doesn't affect the content of "perf.data". Since "inj.data" isn't referenced below, what is the purpose of that command?
objdump -d ./call_chain > call_chain.dump
perf report -D -i perf.data | ./addr_mapper.py call_chain.dump | less
Doing this on x86 and feeding it the "perf.data" file resulting from "perf record -b ./call_chain", I was able to get something [1] that somewhat resembles the branch stack output above. On my snapshot, function names don't come out as neatly and addresses are in the higher part of the memory.
[1]. https://pastebin.com/pEQT4aP0
Hi Mathieu,
On 16/11/17 18:37, Mathieu Poirier wrote:
Hello Robert,
On 15 November 2017 at 04:55, Robert Walker robert.walker@arm.com wrote:
-----Original Message-----
From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
Sent: 14 November 2017 17:56
To: Robert Walker Robert.Walker@arm.com
Cc: CoreSight@lists.linaro.org
Subject: Re: [PATCH 0/2] perf inject branch stack fixes
Hi Robert and thanks for the code.
On 13 November 2017 at 08:11, Robert Walker robert.walker@arm.com wrote:
These patches fix some issues with the branch stacks generated from CoreSight ETM trace.
The main issues addressed are:
- The branch stack should only contain taken branches.
- The instruction samples are generated using the period specified by the --itrace option to perf inject. Currently, the period can only be specified as an instruction count - further work is required to specify the period as a cycle count or time interval.
- The ordering of the branch stack should have newest branch first.
- Some minor fixes to the address calculations.
With these fixes, the branch stacks are more similar to the last branch records produced by 'perf record -b' and Intel-PT on x86. There are similar improvements in the autofdo profiles generated from these traces.
I'm a little confused. Here you mention that reverting d3fa0f70b7e8 makes the records look more similar to Intel-PT, but the changelog of d3fa0f70b7e8 claims the same thing. We obviously have two diverging points of view and I'd like to have a better understanding of the situation. Is there any way I can test this on my side?
Thanks, Mathieu
I used the attached test program to test this - main() calls f1() which then calls f2(), which calls f3().
On an x86 PC, I recorded last branch records with:

perf record -b ./call_chain
That worked on x86 but blew up on Juno:
root@linaro-nano:~/rwalker# perf record -b ./call_chain
Error: PMU Hardware doesn't support sampling/overflow-interrupts.
Have you also seen this?
Arm cores don't have support for last branch records - this is why we need to record the execution trace and reconstruct the branch records with perf inject.
And Intel-PT with:

perf record -e intel_pt//u ./call_chain
perf inject -i perf.data -o inj.data --itrace=i10000il --strip
A couple of questions here:
- The command "perf record -e intel_pt//u ./call_chain" will trigger
a cpu-wide trace session, something that is not (yet) supported on CS. Did you add the --per-thread switch when doing this on Juno?
Yes, I did add the --per-thread switch when recording on Arm (I'm using HiKey960)
- The command "perf inject -i perf.data -o inj.data --itrace=i10000il
--strip" doesn't affect the content of "perf.data". Wince "inj.data" isn't referenced below, what is the purpose of that command?
The perf inject command decodes the execution trace and constructs instruction samples containing branch records at the requested interval (10000 instructions in the example above) - these are written to inj.data. When viewing branch records created by perf inject, inj.data should be passed to perf report instead of perf.data.
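To make that period mechanism concrete, here is a hypothetical model (illustrative names, not perf's internals) of the accounting: as decoded instruction ranges arrive from the trace, one sample is emitted each time another 'period' instructions have executed.

```python
# Hypothetical model of the --itrace period accounting (illustrative
# names, not perf internals): as decoded instruction ranges arrive,
# emit one instruction sample each time another 'period' instructions
# have executed.
def sample_points(range_lengths, period):
    """Return the cumulative instruction counts at which samples are emitted."""
    points = []
    executed = 0
    next_sample = period
    for n in range_lengths:
        executed += n
        # A single large range may cross several sample boundaries.
        while executed >= next_sample:
            points.append(next_sample)
            next_sample += period
    return points
```

For example, with --itrace=i10000il and decoded ranges of 4000, 7000 and 12000 instructions, sample_points([4000, 7000, 12000], 10000) emits samples at 10000 and 20000 instructions.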
In each case, perf report -D is used to view the instruction samples with branch stacks.
The attached script, addr_mapper.py, makes it easier to see what's going on in the branch stack by annotating addresses with their offsets from symbols.
objdump -d ./call_chain > call_chain.dump
perf report -D -i perf.data | ./addr_mapper.py call_chain.dump | less
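Since addr_mapper.py is an attachment and not included here, the following is a hypothetical sketch of how such a filter could work (the real script may differ): build a symbol table from the objdump -d listing, then annotate hex addresses on stdin with their [symbol+offset].

```python
#!/usr/bin/env python
# Hypothetical sketch of an addr_mapper.py-style filter (the real
# attached script may differ): build a symbol table from an objdump -d
# listing, then annotate hex addresses on stdin with [symbol+offset].
import bisect
import re
import sys

SYM_RE = re.compile(r'^([0-9a-f]+) <([^>]+)>:$')           # e.g. 00000000004005c9 <f1>:
ADDR_RE = re.compile(r'0x([0-9a-f]+)|\b([0-9a-f]{16})\b')  # 0x400608 or 16-digit form

def load_symbols(dump_path):
    """Collect (address, name) pairs from objdump -d output, sorted by address."""
    syms = []
    with open(dump_path) as f:
        for line in f:
            m = SYM_RE.match(line.strip())
            if m:
                syms.append((int(m.group(1), 16), m.group(2)))
    syms.sort()
    return syms

def annotate(addr, syms):
    """Return ' [sym+off]' for the symbol covering addr, or '' if none."""
    i = bisect.bisect_right([a for a, _ in syms], addr) - 1
    if i < 0:
        return ''
    base, name = syms[i]
    off = addr - base
    return ' [%s+%x]' % (name, off) if off else ' [%s]' % name

if __name__ == '__main__' and len(sys.argv) > 1:
    syms = load_symbols(sys.argv[1])
    for line in sys.stdin:
        line = ADDR_RE.sub(
            lambda m: m.group(0) + annotate(int(m.group(1) or m.group(2), 16), syms),
            line)
        sys.stdout.write(line)
```

The key design choice is the sorted symbol table plus bisect lookup, which maps any address inside a function back to the nearest preceding symbol.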
Doing this on x86 and feeding it the "perf.data" file resulting from "perf record -b ./call_chain", I was able to get something [1] that somewhat resembles the output below. On my snapshot, function names don't come out as neatly and addresses are in the higher part of memory.

[1] https://pastebin.com/pEQT4aP0
The high addresses will be the shared libraries linked to the program - the first few samples are when the C runtime is preparing to start the program (i.e. before main() is reached). I found I had to skip the first few samples (around 8-10) before the main program is reached and I then get the branch stacks as found in the previous mail.
Regards
Rob
This results in a branch stack like this:
13 12877531596650324 [main+1287753159624fd3e] 0x4528 [0x640]: PERF_RECORD_SAMPLE(IP, 0x2): 17548/17548: 0x400608 [main+22] period: 10000 addr: 0
... branch stack: nr:64
..... 0: 000000000040061d [main+37] -> 0000000000400605 [main+1f] 0 cycles 0
..... 1: 00000000004005e5 [f1+1c] -> 000000000040060f [main+29] 0 cycles 0
..... 2: 00000000004005c8 [f2+1c] -> 00000000004005de [f1+15] 0 cycles 0
..... 3: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles 0
..... 4: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles 0
..... 5: 00000000004005d9 [f1+10] -> 00000000004005ac [f2] 0 cycles 0
..... 6: 000000000040060a [main+24] -> 00000000004005c9 [f1] 0 cycles 0
..... 7: 000000000040061d [main+37] -> 0000000000400605 [main+1f] 0 cycles 0
..... 8: 00000000004005e5 [f1+1c] -> 000000000040060f [main+29] 0 cycles 0
..... 9: 00000000004005c8 [f2+1c] -> 00000000004005de [f1+15] 0 cycles 0
..... 10: 00000000004005ab [f3+e] -> 00000000004005c1 [f2+15] 0 cycles 0
..... 11: 00000000004005bc [f2+10] -> 000000000040059d [f3] 0 cycles 0
..... 12: 00000000004005d9 [f1+10] -> 00000000004005ac [f2] 0 cycles 0
..... 13: 000000000040060a [main+24] -> 00000000004005c9 [f1] 0 cycles 0
Entry 13 is the call from main() to f1(), entry 12 is the call from f1() to f2(), entry 11 is the call from f2() to f3(). Then entries 10, 9 & 8 are the returns from f3() to f2(), from f2() to f1() and from f1() to main().
Without the reversion of d3fa0f7, the Arm trace produced the reverse stack, so that the call from main() to f1() appeared at the top, f1() to f2() as the 2nd entry, f2() to f3() as the 3rd and so on. With d3fa0f7 reverted, the Arm stacks match the order of the Intel stacks.
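To illustrate the convention being matched here (newest branch first, as LBR and Intel-PT report it), a minimal sketch with hypothetical names, not perf code:

```python
# Illustrative only: perf's branch stack (like LBR / Intel-PT) puts the
# newest branch at entry 0 and the oldest last, so branches collected in
# chronological order must be reversed before being emitted.
def to_branch_stack(branches_oldest_first):
    return branches_oldest_first[::-1]

# The call/return sequence from the example program, oldest first:
chronological = ["main->f1", "f1->f2", "f2->f3",
                 "f3->f2", "f2->f1", "f1->main"]
stack = to_branch_stack(chronological)
```

After the reversal, stack[0] holds the most recent branch (the return from f1() to main()) and the last entry holds the oldest (the call from main() to f1()), matching the ordering shown in the dump above.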
Hope this helps.
Regards
Rob
The patches apply to the autoFDO branch of https://github.com/Linaro/perf-opencsd.git (d3fa0f7)
Regards
Robert Walker
Robert Walker (2): Revert "perf inject: record branches in chronological order" perf: Fix branch stack records from CoreSight ETM decode
tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 4 +- tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 2 +- tools/perf/util/cs-etm.c | 134 +++++++++++++----------- 3 files changed, 73 insertions(+), 67 deletions(-)
-- 1.9.1
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight