These two patches generated a fair amount of discussion on the mailing list when they were sent and as such I didn't apply them.
But from what I understand people use them when doing autoFDO testing. They may not be perfect but at least provide a base for discussion and further improvement.
My intention is to add them to the perf-opencsd-master branch for the 4.14 cycle. If anyone is categorically against it please get back to me with convincing arguments.
Best regards, Mathieu
Sebastian Pop (2): perf inject: correct recording of branch address and destination perf inject: record branches in chronological order
tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 12 +++ tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 1 + tools/perf/util/cs-etm.c | 136 ++++++++++++++---------- 3 files changed, 95 insertions(+), 54 deletions(-)
From: Sebastian Pop s.pop@samsung.com
ARM ETM traces contain packets of executed instructions. The addresses recorded in the trace correspond to the beginning and one instruction past the end of the block of executed instructions. In order to record the address and destination of executed branches, keep track of two packets: the jump occurs from the last executed instruction of the previous packet to the address of the first instruction in the current packet.
Signed-off-by: Sebastian Pop s.pop@samsung.com Signed-off-by: Brian Rzycki b.rzycki@samsung.com --- tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 12 ++++ tools/perf/util/cs-etm-decoder/cs-etm-decoder.h | 1 + tools/perf/util/cs-etm.c | 91 +++++++++++++++++-------- 3 files changed, 75 insertions(+), 29 deletions(-)
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c index 678e77a7c1b0..8abf10db0f15 100644 --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c @@ -163,6 +163,18 @@ static int cs_etm_decoder__buffer_packet(struct cs_etm_decoder *decoder, decoder->packet_buffer[et].sample_type = sample_type; decoder->packet_buffer[et].start_addr = elem->st_addr; decoder->packet_buffer[et].end_addr = elem->en_addr; + switch (elem->last_i_type) { + case OCSD_INSTR_BR: + case OCSD_INSTR_BR_INDIRECT: + decoder->packet_buffer[et].last_instruction_is_branch = true; + break; + case OCSD_INSTR_OTHER: + case OCSD_INSTR_ISB: + case OCSD_INSTR_DSB_DMB: + default: + decoder->packet_buffer[et].last_instruction_is_branch = false; + break; + } decoder->packet_buffer[et].exc = false; decoder->packet_buffer[et].exc_ret = false; decoder->packet_buffer[et].cpu = *((int*)inode->priv); diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h index d15a288a3ced..4fd5a9b3aab3 100644 --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.h @@ -49,6 +49,7 @@ struct cs_etm_packet { enum cs_etm_sample_type sample_type; uint64_t start_addr; uint64_t end_addr; + bool last_instruction_is_branch; bool exc; bool exc_ret; int cpu; diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 84db08e745c4..565f1ea45703 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -104,6 +104,10 @@ struct cs_etm_queue { * inserted. */ size_t last_branch_pos; + /* + * A pointer to the last and current decoded packets. + */ + struct cs_etm_packet *prev_packet, *packet; };
static int cs_etm__get_trace(struct cs_etm_buffer *buff, struct cs_etm_queue *etmq); @@ -226,6 +230,8 @@ static void cs_etm__free_queue(void *priv) zfree(&etmq->event_buf); zfree(&etmq->last_branch); zfree(&etmq->last_branch_rb); + zfree(&etmq->prev_packet); + zfree(&etmq->packet); zfree(&etmq->chain); free(etmq); } @@ -392,6 +398,7 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm,
if (etm->synth_opts.last_branch) { size_t sz = sizeof(struct branch_stack); + size_t szp = sizeof(struct cs_etm_packet);
sz += etm->synth_opts.last_branch_sz * sizeof(struct branch_entry); @@ -401,6 +408,12 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm, etmq->last_branch_rb = zalloc(sz); if (!etmq->last_branch_rb) goto out_free; + etmq->prev_packet = zalloc(szp); + if (!etmq->prev_packet) + goto out_free; + etmq->packet = zalloc(szp); + if (!etmq->packet) + goto out_free; }
etmq->event_buf = malloc(PERF_SAMPLE_MAX_SIZE); @@ -451,6 +464,8 @@ static struct cs_etm_queue *cs_etm__alloc_queue(struct cs_etm_auxtrace *etm, zfree(&etmq->event_buf); zfree(&etmq->last_branch); zfree(&etmq->last_branch_rb); + zfree(&etmq->prev_packet); + zfree(&etmq->packet); zfree(&etmq->chain); free(etmq); return NULL; @@ -594,8 +609,7 @@ static inline void cs_etm__reset_last_branch_rb(struct cs_etm_queue *etmq) etmq->last_branch_rb->nr = 0; }
-static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq, - struct cs_etm_packet *packet) +static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq) { struct branch_stack *bs = etmq->last_branch_rb; struct branch_entry *be; @@ -611,10 +625,16 @@ static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq,
etmq->last_branch_pos -= 1;
- be = &bs->entries[etmq->last_branch_pos]; - be->from = packet->start_addr; - be->to = packet->end_addr; - /* No support for mispredict */ + be = &bs->entries[etmq->last_branch_pos]; + + /* + * The FROM address needs to be the instruction before the end of the + * packet at END_ADDR: substract from END_ADDR the size of the last + * instruction: 4 bytes. + */ + be->from = etmq->prev_packet->end_addr - 4; + be->to = etmq->packet->start_addr; + /* No support for mispredict. */ be->flags.mispred = 0; be->flags.predicted = 1;
@@ -721,13 +741,13 @@ static struct cs_etm_cache_entry *cs_etm__cache_lookup(struct dso *dso, } #endif
-static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq, - struct cs_etm_packet *packet) +static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq) { int ret = 0; struct cs_etm_auxtrace *etm = etmq->etm; union perf_event *event = etmq->event_buf; struct perf_sample sample = {.ip = 0,}; + struct cs_etm_packet *packet = etmq->packet; uint64_t start_addr = packet->start_addr; uint64_t end_addr = packet->end_addr;
@@ -941,17 +961,18 @@ static int cs_etm__synth_events(struct cs_etm_auxtrace *etm,
static int cs_etm__sample(struct cs_etm_queue *etmq, int *cpu) { - //const struct cs_etm_state *state = etmq->state; - struct cs_etm_packet packet; - //struct cs_etm_auxtrace *etm = etmq->etm; - int err; + struct cs_etm_auxtrace *etm = etmq->etm; + struct cs_etm_packet *tmp; + int err = cs_etm_decoder__get_packet(etmq->decoder, etmq->packet);
- err = cs_etm_decoder__get_packet(etmq->decoder,&packet); - // if there is no sample, it returns err = -1, no real error + /* + * If there is no sample, it returns err = -1, no real error. + */ if (err) return err; - if (etmq->etm->synth_opts.last_branch) { - *cpu = packet.cpu; + + if (etm->synth_opts.last_branch) { + *cpu = etmq->packet->cpu;
/* * FIXME: as the trace sampling does not work for now, (for @@ -961,17 +982,30 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, int *cpu) * should be rewritten as "if reached sampling period". */ if (etmq->last_branch_rb->nr == - etmq->etm->synth_opts.last_branch_sz) { - err = cs_etm__synth_instruction_sample(etmq, &packet); + etm->synth_opts.last_branch_sz) { + err = cs_etm__synth_instruction_sample(etmq); if (err) return err; } - cs_etm__update_last_branch_rb(etmq, &packet); - } else if (packet.sample_type & CS_ETM_RANGE) { - err = cs_etm__synth_instruction_sample(etmq,&packet); - if (err) - return err; - } + /* + * Record a branch when the last instruction in PREV_PACKET is a + * branch. + */ + if (etmq->prev_packet->last_instruction_is_branch) + cs_etm__update_last_branch_rb(etmq); + + /* + * Swap PACKET with PREV_PACKET: PACKET becomes PREV_PACKET for + * the next incoming packet. + */ + tmp = etmq->packet; + etmq->packet = etmq->prev_packet; + etmq->prev_packet = tmp; + } else if (etmq->packet->sample_type & CS_ETM_RANGE) { + err = cs_etm__synth_instruction_sample(etmq); + if (err) + return err; + } return 0; }
@@ -1016,15 +1050,14 @@ static int cs_etm__run_decoder(struct cs_etm_queue *etmq, u64 *timestamp) * buffer at the end of the trace. */ if (etmq->etm->synth_opts.last_branch) { - struct cs_etm_packet packet; struct branch_stack *bs = etmq->last_branch_rb; struct branch_entry *be = &bs->entries[etmq->last_branch_pos];
- packet.cpu = cpu; - packet.start_addr = be->from; - packet.end_addr = be->to; + etmq->packet->cpu = cpu; + etmq->packet->start_addr = be->to; + etmq->packet->end_addr = be->to + 4;
- err = cs_etm__synth_instruction_sample(etmq, &packet); + err = cs_etm__synth_instruction_sample(etmq); if (err) return err; }
From: Sebastian Pop s.pop@samsung.com
The original patch adding support for translating ARM ETM traces into last branch records was copied from Intel-PT. We kept the code that inserts the branches decoded from the trace into a circular buffer in reverse chronological order, and that produces branch stacks in which the branch addresses are in reverse order for ARM ETM.
This patch records the branches in the same chronological order as they appear in the ETM trace. The order of the branches in a branch stack produced by "perf inject" now matches the branch order in of a stack recorded by "perf record -b" on x86.
Signed-off-by: Sebastian Pop s.pop@samsung.com Signed-off-by: Brian Rzycki b.rzycki@samsung.com --- tools/perf/util/cs-etm.c | 47 +++++++++++++++++++++-------------------------- 1 file changed, 21 insertions(+), 26 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 565f1ea45703..c2fbcab83be3 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -580,32 +580,28 @@ static inline void cs_etm__copy_last_branch_rb(struct cs_etm_queue *etmq) return;
/* - * As bs_src->entries is a circular buffer, we need to copy from it in - * two steps. First, copy the branches from the most recently inserted - * branch ->last_branch_pos until the end of bs_src->entries buffer. + * If we wrapped around at least once, the branches from last_branch_pos + * element to last_branch_sz are older valid branches: copy them over. */ - nr = etmq->etm->synth_opts.last_branch_sz - etmq->last_branch_pos; - memcpy(&bs_dst->entries[0], - &bs_src->entries[etmq->last_branch_pos], - sizeof(struct branch_entry) * nr); + if (bs_src->nr >= etmq->etm->synth_opts.last_branch_sz) { + nr = etmq->etm->synth_opts.last_branch_sz + - etmq->last_branch_pos - 1; + memcpy(&bs_dst->entries[0], + &bs_src->entries[etmq->last_branch_pos + 1], + sizeof(struct branch_entry) * nr); + }
/* - * If we wrapped around at least once, the branches from the beginning - * of the bs_src->entries buffer and until the ->last_branch_pos element - * are older valid branches: copy them over. The total number of - * branches copied over will be equal to the number of branches asked by - * the user in last_branch_sz. + * Copy the branches from the most recently inserted branches from 0 + * to last_branch_pos included. */ - if (bs_src->nr >= etmq->etm->synth_opts.last_branch_sz) { - memcpy(&bs_dst->entries[nr], - &bs_src->entries[0], - sizeof(struct branch_entry) * etmq->last_branch_pos); - } + memcpy(&bs_dst->entries[nr], &bs_src->entries[0], + sizeof(struct branch_entry) * (etmq->last_branch_pos + 1)); }
static inline void cs_etm__reset_last_branch_rb(struct cs_etm_queue *etmq) { - etmq->last_branch_pos = 0; + etmq->last_branch_pos = etmq->etm->synth_opts.last_branch_sz - 1; etmq->last_branch_rb->nr = 0; }
@@ -615,15 +611,14 @@ static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq) struct branch_entry *be;
/* - * The branches are recorded in a circular buffer in reverse - * chronological order: we start recording from the last element of the - * buffer down. After writing the first element of the stack, move the - * insert position back to the end of the buffer. + * Record branches in a circular buffer in chronological order. After + * writing the last element of the stack, move the insert position back + * to the beginning of the buffer. */ - if (!etmq->last_branch_pos) - etmq->last_branch_pos = etmq->etm->synth_opts.last_branch_sz; - - etmq->last_branch_pos -= 1; + if (etmq->last_branch_pos == etmq->etm->synth_opts.last_branch_sz - 1) + etmq->last_branch_pos = 0; + else + etmq->last_branch_pos += 1;
be = &bs->entries[etmq->last_branch_pos];
On Mon, 18 Sep 2017 11:41:29 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
These two patches generated a fair amount of discussion on the mailing list when they were sent and as such I didn't apply them.
But from what I understand people use them when doing autoFDO testing. They may not be perfect but at least provide a base for discussion and further improvement.
My intention is to add them to the perf-opencsd-master branch for the 4.14 cycle. If anyone is categorically against it please get back to me with convincing arguments.
In my tests, the original inject support commit 13c23058bfef "perf tools: new inject capabilitity for CoreSight traces" did not work as advertised: no samples were generated during inject, and continuing to follow the AutoFDO procedure produced a better performing executable only because the -fauto-profile compiler flag enables some optimizations whether the gcov input file is populated or not.
Adding the two patches here on top of that made the autoFDO process produce *poorer* performing executables, because the output gcov data did not correctly represent the true code coverage.
I thus don't think it's a good idea to apply them to the perf-opencsd-master branch, as that gives one the illusion that inject/AutoFDO support works and that they should see an improvement in AutoFDO executable performance as per the intructions in HOWTO.OpenCSD.md appear to demonstrate (when in fact my tests suggest they won't).
Thanks,
Kim
Hi,
On 19 September 2017 at 05:50, Kim Phillips kim.phillips@arm.com wrote:
On Mon, 18 Sep 2017 11:41:29 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
These two patches generated a fair amount of discussion on the mailing list when they were sent and as such I didn't apply them.
But from what I understand people use them when doing autoFDO testing. They may not be perfect but at least provide a base for discussion and further improvement.
My intention is to add them to the perf-opencsd-master branch for the 4.14 cycle. If anyone is categorically against it please get back to me with convincing arguments.
In my tests, the original inject support commit 13c23058bfef "perf tools: new inject capabilitity for CoreSight traces" did not work as advertised: no samples were generated during inject, and continuing to follow the AutoFDO procedure produced a better performing executable only because the -fauto-profile compiler flag enables some optimizations whether the gcov input file is populated or not.
Adding the two patches here on top of that made the autoFDO process produce *poorer* performing executables, because the output gcov data did not correctly represent the true code coverage.
I thus don't think it's a good idea to apply them to the perf-opencsd-master branch, as that gives one the illusion that inject/AutoFDO support works and that they should see an improvement in AutoFDO executable performance as per the intructions in HOWTO.OpenCSD.md appear to demonstrate (when in fact my tests suggest they won't).
Unfortunately having the original patch in the tree already gives the impression that this stuff works.
These patches correct some initial misconceptions regarding the nature of the trace packets so as a basis for further investigation as to why we are not getting good results from the autoFDO flow, I think the three patches are a reasonable starting point.
I do think however, that it would be worthwhile modifying HOWTO.md, and / or add a README-autoFDO.md that makes it clear that these patches are available on an "experimental" basis and will not be upstreamed until all are convinced that they have been improved to the point that we get consistent results in improving executables using the autoFDO flow.
Regards
Mike
Thanks,
Kim
On 19 September 2017 at 03:57, Mike Leach mike.leach@linaro.org wrote:
Hi,
On 19 September 2017 at 05:50, Kim Phillips kim.phillips@arm.com wrote:
On Mon, 18 Sep 2017 11:41:29 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
These two patches generated a fair amount of discussion on the mailing list when they were sent and as such I didn't apply them.
But from what I understand people use them when doing autoFDO testing. They may not be perfect but at least provide a base for discussion and further improvement.
My intention is to add them to the perf-opencsd-master branch for the 4.14 cycle. If anyone is categorically against it please get back to me with convincing arguments.
In my tests, the original inject support commit 13c23058bfef "perf tools: new inject capabilitity for CoreSight traces" did not work as advertised: no samples were generated during inject, and continuing to follow the AutoFDO procedure produced a better performing executable only because the -fauto-profile compiler flag enables some optimizations whether the gcov input file is populated or not.
Adding the two patches here on top of that made the autoFDO process produce *poorer* performing executables, because the output gcov data did not correctly represent the true code coverage.
I thus don't think it's a good idea to apply them to the perf-opencsd-master branch, as that gives one the illusion that inject/AutoFDO support works and that they should see an improvement in AutoFDO executable performance as per the intructions in HOWTO.OpenCSD.md appear to demonstrate (when in fact my tests suggest they won't).
Unfortunately having the original patch in the tree already gives the impression that this stuff works.
These patches correct some initial misconceptions regarding the nature of the trace packets so as a basis for further investigation as to why we are not getting good results from the autoFDO flow, I think the three patches are a reasonable starting point.
I do think however, that it would be worthwhile modifying HOWTO.md, and / or add a README-autoFDO.md that makes it clear that these patches are available on an "experimental" basis and will not be upstreamed until all are convinced that they have been improved to the point that we get consistent results in improving executables using the autoFDO flow.
I like that idea - to me making the experimental status of autoFDO in the HOWTO.md is sufficient. I will apply the patches when I received another revision of this patchset that includes an amendment to the HOWTO.md. Sebastian?
Regards
Mike
Thanks,
Kim
-- Mike Leach Principal Engineer, ARM Ltd. Blackburn Design Centre. UK
On Tue, 19 Sep 2017 10:57:05 +0100 Mike Leach mike.leach@linaro.org wrote:
Hi,
On 19 September 2017 at 05:50, Kim Phillips kim.phillips@arm.com wrote:
On Mon, 18 Sep 2017 11:41:29 -0600 Mathieu Poirier mathieu.poirier@linaro.org wrote:
These two patches generated a fair amount of discussion on the mailing list when they were sent and as such I didn't apply them.
But from what I understand people use them when doing autoFDO testing. They may not be perfect but at least provide a base for discussion and further improvement.
My intention is to add them to the perf-opencsd-master branch for the 4.14 cycle. If anyone is categorically against it please get back to me with convincing arguments.
In my tests, the original inject support commit 13c23058bfef "perf tools: new inject capabilitity for CoreSight traces" did not work as advertised: no samples were generated during inject, and continuing to follow the AutoFDO procedure produced a better performing executable only because the -fauto-profile compiler flag enables some optimizations whether the gcov input file is populated or not.
Adding the two patches here on top of that made the autoFDO process produce *poorer* performing executables, because the output gcov data did not correctly represent the true code coverage.
I thus don't think it's a good idea to apply them to the perf-opencsd-master branch, as that gives one the illusion that inject/AutoFDO support works and that they should see an improvement in AutoFDO executable performance as per the intructions in HOWTO.OpenCSD.md appear to demonstrate (when in fact my tests suggest they won't).
Unfortunately having the original patch in the tree already gives the impression that this stuff works.
Indeed. Technically, that can be undone, however.
These patches correct some initial misconceptions regarding the nature of the trace packets so as a basis for further investigation as to why we are not getting good results from the autoFDO flow, I think the three patches are a reasonable starting point.
A starting point for what?
Incomplete feature development doesn't belong on master branches.
I do think however, that it would be worthwhile modifying HOWTO.md, and / or add a README-autoFDO.md that makes it clear that these patches are available on an "experimental" basis and will not be upstreamed until all are convinced that they have been improved to the point that we get consistent results in improving executables using the autoFDO flow.
Better even to migrate the original patch and other two to their own experimental branch, or tree, even.
We've discussed off-list the creation of a separate tree to maintain the decoder library project, and a separate tree to maintain the perf to-be-upstreamed material, this is a chance to have these changes be done in conjunction with that effort.
Kim