This patch series adds support for thread stack and callchain.
Patch 01 fixes an unsigned variable comparison with zero; patch 02 refactors the instruction size calculation; these two patches are preparation for patch 03.
Patch 03 adds thread stack support; after applying it, the option '-F,+callindent' can be used with the perf script tool. Patch 04 adds a branch filter so that the perf tool displays branch samples only for function calls and returns when the call indentation or call chain related options are enabled.
Patch 05 synthesizes the call chain for instruction samples.
Patch 06 allows instruction samples to be handled synchronously with the thread stack, which fixes an error in callchain generation.
This patch set has been tested on a 96boards HiKey620 after being applied on the perf/core branch with the latest commit f7bf75a78095 ("perf annotate: Don't return -1 for error when doing BPF disassembly").
Test for option '-F,+callindent':
Before:
# perf script -F,+callindent
  main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so)
  main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main)
  main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main)
  main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main)
  main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so)
  [...]
After:
# perf script -F,+callindent
  main 2808 1 branches: coresight_test1@plt aaaaba8d37d8 main+0x14 (/root/coresight_test/main)
  main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s
  main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: do_lookup_x ffff8636a49c _dl_lookup_symbol_x+0x104 (/lib/aarch64-linux-gnu/ld-2.28.
  main 2808 1 branches: check_match ffff86369bf0 do_lookup_x+0x238 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: strcmp ffff86369888 check_match+0x70 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main)
  main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s
  main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: do_lookup_x ffff8636a49c _dl_lookup_symbol_x+0x104 (/lib/aarch64-linux-gnu/ld-2.28.
  main 2808 1 branches: _dl_name_match_p ffff86369af0 do_lookup_x+0x138 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: strcmp ffff8636f7f0 _dl_name_match_p+0x18 (/lib/aarch64-linux-gnu/ld-2.28.so)
  [...]
Test for option '--itrace=g':
Before:
# perf script --itrace=g16l64i100
  main 1579 100 instructions: ffff0000102137f0 group_sched_in+0xb0 ([kernel.kallsyms])
  main 1579 100 instructions: ffff000010213b78 flexible_sched_in+0xf0 ([kernel.kallsyms])
  main 1579 100 instructions: ffff0000102135ac event_sched_in.isra.57+0x74 ([kernel.kallsyms])
  main 1579 100 instructions: ffff000010219344 perf_swevent_add+0x6c ([kernel.kallsyms])
  main 1579 100 instructions: ffff000010214854 perf_event_update_userpage+0x4c ([kernel.kallsyms])
  [...]
After:
# perf script --itrace=g16l64i100
  main 1579 100 instructions:
             ffff000010213b78 flexible_sched_in+0xf0 ([kernel.kallsyms])
             ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
  main 1579 100 instructions:
             ffff0000102135ac event_sched_in.isra.57+0x74 ([kernel.kallsyms])
             ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms])
             ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms])
             ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
  main 1579 100 instructions:
             ffff000010219344 perf_swevent_add+0x6c ([kernel.kallsyms])
             ffff0000102135f4 event_sched_in.isra.57+0xbc ([kernel.kallsyms])
             ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms])
             ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms])
             ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
  [...]
Changes from v2:
* Added patch 01 to fix the unsigned variable comparison to zero (Suzuki).
* Refined commit logs.
Changes from v1:
* Added comments for task thread handling (Mathieu).
* Split patch 02 into two patches, one for thread stack support and another for callchain support (Mathieu).
* Added a new patch to support the branch filter.
Leo Yan (6):
  perf cs-etm: Fix unsigned variable comparison to zero
  perf cs-etm: Refactor instruction size handling
  perf cs-etm: Support thread stack
  perf cs-etm: Support branch filter
  perf cs-etm: Support callchain for instruction sample
  perf cs-etm: Synchronize instruction sample with the thread stack
 tools/perf/util/cs-etm.c | 145 ++++++++++++++++++++++++++++++++-------
 1 file changed, 120 insertions(+), 25 deletions(-)
If the u64 variable 'offset' holds a negative integer, comparing it with 'greater than zero' is always going to be true because it is unsigned. Fix this by using the s64 type for the variable 'offset'.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 tools/perf/util/cs-etm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 4ba0f871f086..4bc2d9709d4f 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -940,7 +940,7 @@ u64 cs_etm__last_executed_instr(const struct cs_etm_packet *packet)
 static inline u64 cs_etm__instr_addr(struct cs_etm_queue *etmq,
				     u64 trace_chan_id,
				     const struct cs_etm_packet *packet,
-				     u64 offset)
+				     s64 offset)
 {
	if (packet->isa == CS_ETM_ISA_T32) {
		u64 addr = packet->start_addr;
@@ -1372,7 +1372,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
		 * sample is reported as though instruction has just been
		 * executed, but PC has not advanced to next instruction)
		 */
-		u64 offset = (instrs_executed - instrs_over - 1);
+		s64 offset = (instrs_executed - instrs_over - 1);
		u64 addr = cs_etm__instr_addr(etmq, trace_chan_id,
					      tidq->packet, offset);
On Sat, Oct 05, 2019 at 05:16:09PM +0800, Leo Yan wrote:
If the u64 variable 'offset' holds a negative integer, comparing it with 'greater than zero' is always going to be true because it is unsigned. Fix this by using the s64 type for the variable 'offset'.
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 4ba0f871f086..4bc2d9709d4f 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -940,7 +940,7 @@ u64 cs_etm__last_executed_instr(const struct cs_etm_packet *packet) static inline u64 cs_etm__instr_addr(struct cs_etm_queue *etmq, u64 trace_chan_id, const struct cs_etm_packet *packet,
u64 offset)
s64 offset)
In Suzuki's reply there were two choices: 1) change the while (offset > 0) loop to while (offset), or 2) change the type of @offset to s64. Here we know offset can't be negative because of the check tidq->period_instructions >= etm->instructions_sample_period
in function cs_etm__sample(). As such I think option #1 is the right way to deal with this rather than changing the type of the variable.
Mathieu
{ if (packet->isa == CS_ETM_ISA_T32) { u64 addr = packet->start_addr; @@ -1372,7 +1372,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, * sample is reported as though instruction has just been * executed, but PC has not advanced to next instruction) */
u64 offset = (instrs_executed - instrs_over - 1);
s64 offset = (instrs_executed - instrs_over - 1);
u64 addr = cs_etm__instr_addr(etmq, trace_chan_id,
			      tidq->packet, offset);
2.17.1
Hi Mathieu,
On Fri, Oct 11, 2019 at 02:16:06PM -0600, Mathieu Poirier wrote:
On Sat, Oct 05, 2019 at 05:16:09PM +0800, Leo Yan wrote:
If the u64 variable 'offset' holds a negative integer, comparing it with 'greater than zero' is always going to be true because it is unsigned. Fix this by using the s64 type for the variable 'offset'.
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 4ba0f871f086..4bc2d9709d4f 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -940,7 +940,7 @@ u64 cs_etm__last_executed_instr(const struct cs_etm_packet *packet) static inline u64 cs_etm__instr_addr(struct cs_etm_queue *etmq, u64 trace_chan_id, const struct cs_etm_packet *packet,
u64 offset)
s64 offset)
In Suzuki's reply there were two choices: 1) change the while (offset > 0) loop to while (offset), or 2) change the type of @offset to s64. Here we know offset can't be negative because of the check tidq->period_instructions >= etm->instructions_sample_period
in function cs_etm__sample(). As such I think option #1 is the right way to deal with this rather than changing the type of the variable.
I took some time to use formulas to prove that 'offset' can indeed be a negative value :)
Just paste the updated commit log at here for review:
Pi: period_instructions
Ie: instrs_executed
Io: instrs_over
Ip: instructions_sample_period

Pi' = Pi + Ie  -> new period_instructions equals the old period_instructions + instrs_executed
Io  = Pi' - Ip -> period_instructions - instructions_sample_period

offset = Ie - Io - 1
       = Ie - (Pi' - Ip) - 1
       = Ie - (Pi + Ie - Ip) - 1
       = Ip - Pi - 1
In theory, if Ip (instructions_sample_period) is small enough and Pi (period_instructions) is bigger than Ip, then 'offset' becomes negative.
So let's see below command:
perf inject --itrace=i1il128 -i perf.data -o perf.data.new
With this command, 'offset' very easily becomes a negative value when handling packets; this is because with the inject option 'i1', instructions_sample_period equals 1; so:

offset = 1 - Pi - 1 = -Pi

Any Pi bigger than zero makes 'offset' negative.
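To make the arithmetic concrete, here is a minimal standalone C sketch (not perf code; Ip/Pi/Ie are just the shorthand used above) showing that the negative result, once stored in a u64, wraps to a huge positive number, so a 'greater than zero' check never catches it:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Example values: Ip = 1 (inject option 'i1'), Pi = 5 carried over, Ie = 10 executed. */
	int64_t Ip = 1, Pi = 5, Ie = 10;

	int64_t Io = Pi + Ie - Ip;              /* instrs_over                */
	int64_t offset = Ie - Io - 1;           /* = Ip - Pi - 1 = -5         */
	uint64_t u_offset = (uint64_t)offset;   /* what a u64 variable stores */

	printf("signed offset   = %lld\n", (long long)offset);
	printf("unsigned offset = %llu\n", (unsigned long long)u_offset);
	printf("(offset > 0)    = %d\n", u_offset > 0);  /* prints 1 */
	return 0;
}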
Thanks, Leo Yan
{ if (packet->isa == CS_ETM_ISA_T32) { u64 addr = packet->start_addr; @@ -1372,7 +1372,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, * sample is reported as though instruction has just been * executed, but PC has not advanced to next instruction) */
u64 offset = (instrs_executed - instrs_over - 1);
s64 offset = (instrs_executed - instrs_over - 1);
u64 addr = cs_etm__instr_addr(etmq, trace_chan_id,
			      tidq->packet, offset);
2.17.1
Hi Leo,
Two points here - both related.
On Tue, 22 Oct 2019 at 06:10, Leo Yan leo.yan@linaro.org wrote:
Hi Mathieu,
On Fri, Oct 11, 2019 at 02:16:06PM -0600, Mathieu Poirier wrote:
On Sat, Oct 05, 2019 at 05:16:09PM +0800, Leo Yan wrote:
If the u64 variable 'offset' holds a negative integer, comparing it with 'greater than zero' is always going to be true because it is unsigned. Fix this by using the s64 type for the variable 'offset'.
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 4ba0f871f086..4bc2d9709d4f 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -940,7 +940,7 @@ u64 cs_etm__last_executed_instr(const struct cs_etm_packet *packet) static inline u64 cs_etm__instr_addr(struct cs_etm_queue *etmq, u64 trace_chan_id, const struct cs_etm_packet *packet,
u64 offset)
s64 offset)
Issue 1:
OK - it appears that cs_etm__instr_addr() is supposed to return the address, within the current trace sample, of the instruction related to offset.

For T32: if offset < 0, packet->start_addr is returned - not good, but at least within the current trace range.

For A32/A64: if offset < 0, an address _before_ packet->start_addr is returned - clearly wrong and possibly a completely invalid address that was never actually traced.
In Suzuki's reply there were two choices: 1) change the while (offset > 0) loop to while (offset), or 2) change the type of @offset to s64. Here we know offset can't be negative because of the check tidq->period_instructions >= etm->instructions_sample_period
in function cs_etm__sample(). As such I think option #1 is the right way to deal with this rather than changing the type of the variable.
I took sometime to use formulas to prove that 'offset' is possible to be a negative value :)
Just paste the updated commit log at here for review:
Pi: period_instructions
Ie: instrs_executed
Io: instrs_over
Ip: instructions_sample_period

Pi' = Pi + Ie  -> new period_instructions equals the old period_instructions + instrs_executed
Io  = Pi' - Ip -> period_instructions - instructions_sample_period

offset = Ie - Io - 1
       = Ie - (Pi' - Ip) - 1
       = Ie - (Pi + Ie - Ip) - 1
       = Ip - Pi - 1
In theory, if Ip (instructions_sample_period) is small enough and Pi (period_instructions) is bigger than Ip, then it will lead to the negative value for 'offset'.
So let's see below command:
perf inject --itrace=i1il128 -i perf.data -o perf.data.new
With this command, 'offset' is very easily to be a negative value when handling packets; this is because if use the inject option 'i1', then instructions_sample_period equals to 1; so:
offset = 1 - Pi - 1 = -Pi
Any Pi bigger than zero leads 'offset' to a negative value.
Thanks, Leo Yan
Issue 2:
Assuming I have understood the logic of this code correctly, there is an issue where sample_period < period_instructions as you say - but I believe the problem is in the logic of the sampling function itself.
Suppose we have a sample_period of 4.
Now on an initial pass through the function, period_instructions must be 0 (i.e. none left over from the previous pass). Suppose also that the number of instructions executed in this sample is 10 - thus updating period_instructions. Therefore:

instr_over = 10 - 4 -> 6
offset = 10 - 6 - 1 -> 3

We therefore call cs_etm__instr_addr() to find the address at an offset of 3 instructions from the start of the trace sample and synthesize the sample. After this we set period_instructions to the instr_over value of 6.
Next pass, assume 10 instructions in the trace sample again:

period_instructions = 6 + 10 -> 16
instr_over = 16 - 4 -> 12
offset = 10 - 12 - 1 -> -3, the negative value your formulae predict.
This implies that the sample we want is actually in the previous trace packet - which I believe is in fact the case - as explained below.
My reading of the code is that cs_etm__sample() is called once per trace range packet extracted from the decoder - and a trace range packet represents N instructions_executed. Further I am assuming that instructions_sample_period represents the desired periodicity of the instruction samples - i.e. 1 sample every instructions_sample_period number of instructions.
Thus my conclusion here is that where M = instructions_executed + period_instructions, the function should generate quotient ( M / instructions_sample_period ) samples and set period_instructions to M mod instructions_sample_period on exit, ensuring period_instructions is never larger than the sample_period.
i.e. loop to generate multiple samples until instr_over and therefore the output value of period_instructions is less than the value of instructions_sample_period - for the example above, with 10 instructions and a periodicity of 4, we generate 2 samples with a remainder of 2 instructions carried forwards.
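To illustrate that suggestion, here is a rough, self-contained C sketch (hypothetical names, not the actual cs-etm.c symbols and not a proposed patch) of a per-packet sampling loop that emits one sample per period and carries the remainder forward, so the carried count stays below the period and the offset never goes negative:

#include <stdint.h>
#include <stdio.h>

/*
 * One call per trace range packet of 'instrs_executed' instructions.
 * '*carry' plays the role of period_instructions between packets and is
 * always kept smaller than 'period' on exit.
 */
static void sample_range_packet(uint64_t *carry, uint64_t instrs_executed,
				uint64_t period)
{
	uint64_t total = *carry + instrs_executed;
	/* Instruction offset (within this packet) of the first sample. */
	uint64_t offset = period - *carry - 1;

	while (total >= period) {
		printf("synthesize sample at packet offset %llu\n",
		       (unsigned long long)offset);
		offset += period;	/* next sample is a full period later */
		total -= period;
	}
	*carry = total;			/* remainder carried to the next packet */
}

int main(void)
{
	uint64_t carry = 0;

	/* The example above: two packets of 10 instructions, period of 4. */
	sample_range_packet(&carry, 10, 4);	/* samples at offsets 3 and 7 */
	sample_range_packet(&carry, 10, 4);	/* carry=2: samples at 1, 5 and 9 */
	return 0;
}

Running it for two packets of 10 instructions with a period of 4 yields samples at cumulative instruction counts 4, 8, 12, 16 and 20, matching the quotient/remainder behaviour described above.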
In short leave offset as unsigned and fix the logic of the cs_etm__sample() function.
Regards
Mike
{ if (packet->isa == CS_ETM_ISA_T32) { u64 addr = packet->start_addr; @@ -1372,7 +1372,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, * sample is reported as though instruction has just been * executed, but PC has not advanced to next instruction) */
u64 offset = (instrs_executed - instrs_over - 1);
s64 offset = (instrs_executed - instrs_over - 1);
u64 addr = cs_etm__instr_addr(etmq, trace_chan_id,
			      tidq->packet, offset);
-- 2.17.1
-- Mike Leach Principal Engineer, ARM Ltd. Manchester Design Centre. UK
Hi Mike,
On Wed, Oct 23, 2019 at 12:36:39AM +0100, Mike Leach wrote:
Hi Leo,
Two points here - both related.
On Tue, 22 Oct 2019 at 06:10, Leo Yan leo.yan@linaro.org wrote:
Hi Mathieu,
On Fri, Oct 11, 2019 at 02:16:06PM -0600, Mathieu Poirier wrote:
On Sat, Oct 05, 2019 at 05:16:09PM +0800, Leo Yan wrote:
If the u64 variable 'offset' holds a negative integer, comparing it with 'greater than zero' is always going to be true because it is unsigned. Fix this by using the s64 type for the variable 'offset'.
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 4ba0f871f086..4bc2d9709d4f 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -940,7 +940,7 @@ u64 cs_etm__last_executed_instr(const struct cs_etm_packet *packet) static inline u64 cs_etm__instr_addr(struct cs_etm_queue *etmq, u64 trace_chan_id, const struct cs_etm_packet *packet,
u64 offset)
s64 offset)
Issue 1:
OK - it appears that cs_etm__instr_addr() is supposed to be returning the address within the current trace sample of the instruction related to offset. For T32 - then if offset < 0, packet->start_addr is returned - not good but at least within the current trace range For A32/A64 - if offset < 0 then an address _before_ packet->start_addr is returned - clearly wrong and possibly a completely invalid address that was never actually traced.
Exactly, if offset < 0 it might output the incorrect trace.
In Suzuki's reply there were two choices: 1) change the while (offset > 0) loop to while (offset), or 2) change the type of @offset to s64. Here we know offset can't be negative because of the check tidq->period_instructions >= etm->instructions_sample_period
in function cs_etm__sample(). As such I think option #1 is the right way to deal with this rather than changing the type of the variable.
I took sometime to use formulas to prove that 'offset' is possible to be a negative value :)
Just paste the updated commit log at here for review:
Pi: period_instructions
Ie: instrs_executed
Io: instrs_over
Ip: instructions_sample_period

Pi' = Pi + Ie  -> new period_instructions equals the old period_instructions + instrs_executed
Io  = Pi' - Ip -> period_instructions - instructions_sample_period

offset = Ie - Io - 1
       = Ie - (Pi' - Ip) - 1
       = Ie - (Pi + Ie - Ip) - 1
       = Ip - Pi - 1
In theory, if Ip (instructions_sample_period) is small enough and Pi (period_instructions) is bigger than Ip, then it will lead to the negative value for 'offset'.
So let's see below command:
perf inject --itrace=i1il128 -i perf.data -o perf.data.new
With this command, 'offset' is very easily to be a negative value when handling packets; this is because if use the inject option 'i1', then instructions_sample_period equals to 1; so:
offset = 1 - Pi - 1 = -Pi
Any Pi bigger than zero leads 'offset' to a negative value.
Thanks, Leo Yan
Issue 2:
Assuming I have understood the logic of this code correctly, there is an issue where sample_period < period_instructions as you say - but I believe the problem is in the logic of the sampling function itself.
Suppose we have a sample_period of 4.
Now on an initial pass through the function, period_instructions must be 0. (i.e. none left over from the previous pass.) Suppose also that the number of instructions executed in this sample is 10 - thus updating period_instructions. Therefore: instr_over = 10 - 4 -> 6 offset = 10 - 6 - 1 -> 3. We therefore call cs_etm_instr_addr to find the address an offset of 3 instructions from the start of the trace sample and synthesize the sample. After this we set period_instructions to the instr_over value of 6.
Next pass, assume 10 instructions in the trace sample again. period_instructions = 6 + 10 -> 16 instr_over = 16 - 4 -> 12 offset = 10 - 12 - 1 -> -3 - the negative value your formulae predict.
This implies that the sample we want is actually in the previous trace packet - which I believe is in fact the case - as explained below.
My reading of the code is that cs_etm__sample() is called once per trace range packet extracted from the decoder - and a trace range packet represents N instructions_executed. Further I am assuming that instructions_sample_period represents the desired periodicity of the instruction samples - i.e. 1 sample every instructions_sample_period number of instructions.
Good point. Yeah, this is the root cause.
Thus my conclusion here is that where M = instructions_executed + period_instructions, the function should generate quotient ( M / instructions_sample_period ) samples and set period_instructions to M mod instructions_sample_period on exit, ensuring period_instructions is never larger than the sample_period.
Totally agree with this; we should generate synthetic samples without dropping trace data.
i.e. loop to generate multiple samples until instr_over and therefore the output value of period_instructions is less than the value of instructions_sample_period - for the example above, with 10 instructions and a periodicity of 4, we generate 2 samples with a remainder of 2 instructions carried forwards.
In short leave offset as unsigned and fix the logic of the cs_etm__sample() function.
Will follow up on this suggestion.

I really appreciate your time reviewing this and suggesting a much more reasonable solution!
Thanks, Leo Yan
{ if (packet->isa == CS_ETM_ISA_T32) { u64 addr = packet->start_addr; @@ -1372,7 +1372,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, * sample is reported as though instruction has just been * executed, but PC has not advanced to next instruction) */
u64 offset = (instrs_executed - instrs_over - 1);
s64 offset = (instrs_executed - instrs_over - 1);
u64 addr = cs_etm__instr_addr(etmq, trace_chan_id,
			      tidq->packet, offset);
-- 2.17.1
-- Mike Leach Principal Engineer, ARM Ltd. Manchester Design Centre. UK
In cs-etm.c several functions need to know the instruction size for a given address, e.g. cs_etm__instr_addr() and cs_etm__copy_insn() both calculate the instruction size separately. Furthermore, new features added later might also need to calculate the instruction size.

For this reason, this patch refactors the code to introduce a new function cs_etm__instr_size(); it is a central place to calculate the instruction size based on the ISA type and the instruction address.

For a neat implementation, cs_etm__instr_addr() always executes the loop without checking the ISA type; this way cs_etm__instr_size() and cs_etm__instr_addr() do not duplicate any code and both functions can be changed independently later without breaking anything. As a side effect, cs_etm__instr_addr() does a few more iterations for A32/A64 instructions, which is acceptable given that the perf tool runs in user space.
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 tools/perf/util/cs-etm.c | 48 +++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 20 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 4bc2d9709d4f..58ceba7b91d5 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -918,6 +918,26 @@ static inline int cs_etm__t32_instr_size(struct cs_etm_queue *etmq,
	return ((instrBytes[1] & 0xF8) >= 0xE8) ? 4 : 2;
 }

+static inline int cs_etm__instr_size(struct cs_etm_queue *etmq,
+				     u8 trace_chan_id,
+				     enum cs_etm_isa isa,
+				     u64 addr)
+{
+	int insn_len;
+
+	/*
+	 * T32 instruction size might be 32-bit or 16-bit, decide by calling
+	 * cs_etm__t32_instr_size().
+	 */
+	if (isa == CS_ETM_ISA_T32)
+		insn_len = cs_etm__t32_instr_size(etmq, trace_chan_id, addr);
+	/* Otherwise, A64 and A32 instruction size are always 32-bit. */
+	else
+		insn_len = 4;
+
+	return insn_len;
+}
+
 static inline u64 cs_etm__first_executed_instr(struct cs_etm_packet *packet)
 {
	/* Returns 0 for the CS_ETM_DISCONTINUITY packet */
@@ -942,19 +962,15 @@ static inline u64 cs_etm__instr_addr(struct cs_etm_queue *etmq,
				     const struct cs_etm_packet *packet,
				     s64 offset)
 {
-	if (packet->isa == CS_ETM_ISA_T32) {
-		u64 addr = packet->start_addr;
+	u64 addr = packet->start_addr;

-		while (offset > 0) {
-			addr += cs_etm__t32_instr_size(etmq,
-						       trace_chan_id, addr);
-			offset--;
-		}
-		return addr;
+	while (offset > 0) {
+		addr += cs_etm__instr_size(etmq, trace_chan_id,
+					   packet->isa, addr);
+		offset--;
	}

-	/* Assume a 4 byte instruction size (A32/A64) */
-	return packet->start_addr + offset * 4;
+	return addr;
 }

 static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq,
@@ -1094,16 +1110,8 @@ static void cs_etm__copy_insn(struct cs_etm_queue *etmq,
		return;
	}

-	/*
-	 * T32 instruction size might be 32-bit or 16-bit, decide by calling
-	 * cs_etm__t32_instr_size().
-	 */
-	if (packet->isa == CS_ETM_ISA_T32)
-		sample->insn_len = cs_etm__t32_instr_size(etmq, trace_chan_id,
-							  sample->ip);
-	/* Otherwise, A64 and A32 instruction size are always 32-bit. */
-	else
-		sample->insn_len = 4;
+	sample->insn_len = cs_etm__instr_size(etmq, trace_chan_id,
+					      packet->isa, sample->ip);

	cs_etm__mem_access(etmq, trace_chan_id, sample->ip,
			   sample->insn_len, (void *)sample->insn);
Since Arm CoreSight doesn't support the thread stack, the decoding cannot display symbols with indented spaces to reflect the stack depth.

This patch adds thread stack support for Arm CoreSight, which allows 'perf script' to display properly with the option '-F,+callindent'.
Before:
# perf script -F,+callindent
  main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so)
  main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main)
  main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main)
  main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main)
  main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so)
  [...]
After:
# perf script -F,+callindent
  main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so)
  main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main)
  main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main)
  main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main)
  main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s
  main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so)
  [...]
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 tools/perf/util/cs-etm.c | 44 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 58ceba7b91d5..780abbfd1833 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1117,6 +1117,45 @@ static void cs_etm__copy_insn(struct cs_etm_queue *etmq,
			   sample->insn_len, (void *)sample->insn);
 }

+static void cs_etm__add_stack_event(struct cs_etm_queue *etmq,
+				    struct cs_etm_traceid_queue *tidq)
+{
+	struct cs_etm_auxtrace *etm = etmq->etm;
+	u8 trace_chan_id = tidq->trace_chan_id;
+	int insn_len;
+	u64 from_ip, to_ip;
+
+	if (etm->synth_opts.thread_stack) {
+		from_ip = cs_etm__last_executed_instr(tidq->prev_packet);
+		to_ip = cs_etm__first_executed_instr(tidq->packet);
+
+		insn_len = cs_etm__instr_size(etmq, trace_chan_id,
+					      tidq->prev_packet->isa, from_ip);
+
+		/*
+		 * Create thread stacks by keeping track of calls and returns;
+		 * any call pushes thread stack, return pops the stack, and
+		 * flush stack when the trace is discontinuous.
+		 */
+		thread_stack__event(tidq->thread, tidq->prev_packet->cpu,
+				    tidq->prev_packet->flags,
+				    from_ip, to_ip, insn_len,
+				    etmq->buffer->buffer_nr);
+	} else {
+		/*
+		 * The thread stack can be output via thread_stack__process();
+		 * thus the detailed information about paired calls and returns
+		 * will be facilitated by Python script for the db-export.
+		 *
+		 * Need to set trace buffer number and flush thread stack if the
+		 * trace buffer number has been alternate.
+		 */
+		thread_stack__set_trace_nr(tidq->thread,
+					   tidq->prev_packet->cpu,
+					   etmq->buffer->buffer_nr);
+	}
+}
+
 static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
					    struct cs_etm_traceid_queue *tidq,
					    u64 addr, u64 period)
@@ -1393,6 +1432,9 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
		tidq->period_instructions = instrs_over;
	}

+	if (tidq->prev_packet->last_instr_taken_branch)
+		cs_etm__add_stack_event(etmq, tidq);
+
	if (etm->sample_branches) {
		bool generate_sample = false;

@@ -2593,6 +2635,8 @@ int cs_etm__process_auxtrace_info(union perf_event *event,
		itrace_synth_opts__set_default(&etm->synth_opts,
				session->itrace_synth_opts->default_no_sample);
		etm->synth_opts.callchain = false;
+		etm->synth_opts.thread_stack =
+			session->itrace_synth_opts->thread_stack;
	}

	err = cs_etm__synth_events(etm, session);
On Sat, Oct 05, 2019 at 05:16:11PM +0800, Leo Yan wrote:
Since Arm CoreSight doesn't support the thread stack, the decoding cannot display symbols with indented spaces to reflect the stack depth.
This patch adds thread stack support for Arm CoreSight, which allows 'perf script' to display properly with the option '-F,+callindent'.
Before:
# perf script -F,+callindent main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.so) main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
After:
# perf script -F,+callindent main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 44 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 58ceba7b91d5..780abbfd1833 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1117,6 +1117,45 @@ static void cs_etm__copy_insn(struct cs_etm_queue *etmq, sample->insn_len, (void *)sample->insn); } +static void cs_etm__add_stack_event(struct cs_etm_queue *etmq,
struct cs_etm_traceid_queue *tidq)
+{
- struct cs_etm_auxtrace *etm = etmq->etm;
- u8 trace_chan_id = tidq->trace_chan_id;
- int insn_len;
- u64 from_ip, to_ip;
- if (etm->synth_opts.thread_stack) {
from_ip = cs_etm__last_executed_instr(tidq->prev_packet);
to_ip = cs_etm__first_executed_instr(tidq->packet);
insn_len = cs_etm__instr_size(etmq, trace_chan_id,
tidq->prev_packet->isa, from_ip);
/*
* Create thread stacks by keeping track of calls and returns;
* any call pushes thread stack, return pops the stack, and
* flush stack when the trace is discontinuous.
*/
thread_stack__event(tidq->thread, tidq->prev_packet->cpu,
tidq->prev_packet->flags,
from_ip, to_ip, insn_len,
etmq->buffer->buffer_nr);
Details are a little fuzzy in my head but I'm pretty sure we want trace_chan_id here.
- } else {
/*
* The thread stack can be output via thread_stack__process();
* thus the detailed information about paired calls and returns
* will be facilitated by Python script for the db-export.
*
* Need to set trace buffer number and flush thread stack if the
* trace buffer number has been alternate.
*/
thread_stack__set_trace_nr(tidq->thread,
tidq->prev_packet->cpu,
etmq->buffer->buffer_nr);
Same here.
- }
+}
static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq, struct cs_etm_traceid_queue *tidq, u64 addr, u64 period) @@ -1393,6 +1432,9 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, tidq->period_instructions = instrs_over; }
- if (tidq->prev_packet->last_instr_taken_branch)
cs_etm__add_stack_event(etmq, tidq);
- if (etm->sample_branches) { bool generate_sample = false;
@@ -2593,6 +2635,8 @@ int cs_etm__process_auxtrace_info(union perf_event *event, itrace_synth_opts__set_default(&etm->synth_opts, session->itrace_synth_opts->default_no_sample); etm->synth_opts.callchain = false;
etm->synth_opts.thread_stack =
		session->itrace_synth_opts->thread_stack;
	}
err = cs_etm__synth_events(etm, session); -- 2.17.1
Hi Mathieu,
On Fri, Oct 11, 2019 at 11:53:53AM -0600, Mathieu Poirier wrote:
On Sat, Oct 05, 2019 at 05:16:11PM +0800, Leo Yan wrote:
Since Arm CoreSight doesn't support the thread stack, the decoding cannot display symbols with indented spaces to reflect the stack depth.
This patch adds thread stack support for Arm CoreSight, which allows 'perf script' to display properly with the option '-F,+callindent'.
Before:
# perf script -F,+callindent main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.so) main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
After:
# perf script -F,+callindent main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 44 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 58ceba7b91d5..780abbfd1833 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1117,6 +1117,45 @@ static void cs_etm__copy_insn(struct cs_etm_queue *etmq, sample->insn_len, (void *)sample->insn); } +static void cs_etm__add_stack_event(struct cs_etm_queue *etmq,
struct cs_etm_traceid_queue *tidq)
+{
- struct cs_etm_auxtrace *etm = etmq->etm;
- u8 trace_chan_id = tidq->trace_chan_id;
- int insn_len;
- u64 from_ip, to_ip;
- if (etm->synth_opts.thread_stack) {
from_ip = cs_etm__last_executed_instr(tidq->prev_packet);
to_ip = cs_etm__first_executed_instr(tidq->packet);
insn_len = cs_etm__instr_size(etmq, trace_chan_id,
tidq->prev_packet->isa, from_ip);
/*
* Create thread stacks by keeping track of calls and returns;
* any call pushes thread stack, return pops the stack, and
* flush stack when the trace is discontinuous.
*/
thread_stack__event(tidq->thread, tidq->prev_packet->cpu,
tidq->prev_packet->flags,
from_ip, to_ip, insn_len,
etmq->buffer->buffer_nr);
Details are a little fuzzy in my head but I'm pretty sure we want trace_chan_id here.
Thanks a lot for reviewing!
This is a good point. After a quick look at the code, it seems TO_CS_QUEUE_NR() is the right thing to use here for the buffer number.

Will look more closely at the code and apply the changes in the next version.
Thanks, Leo Yan
- } else {
/*
* The thread stack can be output via thread_stack__process();
* thus the detailed information about paired calls and returns
* will be facilitated by Python script for the db-export.
*
* Need to set trace buffer number and flush thread stack if the
* trace buffer number has been alternate.
*/
thread_stack__set_trace_nr(tidq->thread,
tidq->prev_packet->cpu,
etmq->buffer->buffer_nr);
Same here.
- }
+}
static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq, struct cs_etm_traceid_queue *tidq, u64 addr, u64 period) @@ -1393,6 +1432,9 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, tidq->period_instructions = instrs_over; }
- if (tidq->prev_packet->last_instr_taken_branch)
cs_etm__add_stack_event(etmq, tidq);
- if (etm->sample_branches) { bool generate_sample = false;
@@ -2593,6 +2635,8 @@ int cs_etm__process_auxtrace_info(union perf_event *event, itrace_synth_opts__set_default(&etm->synth_opts, session->itrace_synth_opts->default_no_sample); etm->synth_opts.callchain = false;
etm->synth_opts.thread_stack =
		session->itrace_synth_opts->thread_stack;
	}
err = cs_etm__synth_events(etm, session); -- 2.17.1
Hi Mathieu,
On Fri, Oct 11, 2019 at 11:53:53AM -0600, Mathieu Poirier wrote:
On Sat, Oct 05, 2019 at 05:16:11PM +0800, Leo Yan wrote:
Since Arm CoreSight doesn't support the thread stack, the decoding cannot display symbols with indented spaces to reflect the stack depth.
This patch adds thread stack support for Arm CoreSight, which allows 'perf script' to display properly with the option '-F,+callindent'.
Before:
# perf script -F,+callindent main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.so) main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
After:
# perf script -F,+callindent main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 44 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 58ceba7b91d5..780abbfd1833 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1117,6 +1117,45 @@ static void cs_etm__copy_insn(struct cs_etm_queue *etmq, sample->insn_len, (void *)sample->insn); } +static void cs_etm__add_stack_event(struct cs_etm_queue *etmq,
struct cs_etm_traceid_queue *tidq)
+{
- struct cs_etm_auxtrace *etm = etmq->etm;
- u8 trace_chan_id = tidq->trace_chan_id;
- int insn_len;
- u64 from_ip, to_ip;
- if (etm->synth_opts.thread_stack) {
from_ip = cs_etm__last_executed_instr(tidq->prev_packet);
to_ip = cs_etm__first_executed_instr(tidq->packet);
insn_len = cs_etm__instr_size(etmq, trace_chan_id,
tidq->prev_packet->isa, from_ip);
/*
* Create thread stacks by keeping track of calls and returns;
* any call pushes thread stack, return pops the stack, and
* flush stack when the trace is discontinuous.
*/
thread_stack__event(tidq->thread, tidq->prev_packet->cpu,
tidq->prev_packet->flags,
from_ip, to_ip, insn_len,
etmq->buffer->buffer_nr);
Details are a little fuzzy in my head but I'm pretty sure we want trace_chan_id here.
I spent some time looking into this question, and I think we don't need to add the extra trace_chan_id info.

The main reason is CPU-wide tracing: if a task is migrated from CPU_a to CPU_b and we append 'trace_chan_id' to the buffer number, it tells the thread stack that the buffer has changed (i.e. the trace is considered discontinuous), so the thread stack is flushed. Actually, this is not what we want; if a task is migrated from one CPU to another, we still need to keep its thread stack as long as the trace data comes from the same buffer_nr.

To be honest, I struggled to understand the purpose of 'buffer->buffer_nr'; from the code, I think 'buffer->buffer_nr' is mainly used to track split buffers (e.g. the buffers are split into different queues so the trace data comes from different trace chunks?). At the moment I observe 'buffer->buffer_nr' is always zero since the buffer is not used in split mode. If we later support a 1:1 mapping between tracers and sinks, then we need to set 'buffer->buffer_nr' to reflect the correct buffer mapping, but we don't need to use trace_chan_id as extra info here.

Please let me know what you think about this. If you agree, I will send out patch v4 soon, addressing the other comments.
Thanks, Leo Yan
- } else {
/*
* The thread stack can be output via thread_stack__process();
* thus the detailed information about paired calls and returns
* will be facilitated by Python script for the db-export.
*
* Need to set trace buffer number and flush thread stack if the
* trace buffer number has been alternate.
*/
thread_stack__set_trace_nr(tidq->thread,
tidq->prev_packet->cpu,
etmq->buffer->buffer_nr);
Same here.
- }
+}
static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq, struct cs_etm_traceid_queue *tidq, u64 addr, u64 period) @@ -1393,6 +1432,9 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, tidq->period_instructions = instrs_over; }
- if (tidq->prev_packet->last_instr_taken_branch)
cs_etm__add_stack_event(etmq, tidq);
- if (etm->sample_branches) { bool generate_sample = false;
@@ -2593,6 +2635,8 @@ int cs_etm__process_auxtrace_info(union perf_event *event, itrace_synth_opts__set_default(&etm->synth_opts, session->itrace_synth_opts->default_no_sample); etm->synth_opts.callchain = false;
etm->synth_opts.thread_stack =
		session->itrace_synth_opts->thread_stack;
	}
err = cs_etm__synth_events(etm, session); -- 2.17.1
On Mon, 21 Oct 2019 at 23:03, Leo Yan leo.yan@linaro.org wrote:
Hi Mathieu,
On Fri, Oct 11, 2019 at 11:53:53AM -0600, Mathieu Poirier wrote:
On Sat, Oct 05, 2019 at 05:16:11PM +0800, Leo Yan wrote:
Since Arm CoreSight doesn't support the thread stack, the decoding cannot display symbols with indented spaces to reflect the stack depth.
This patch adds thread stack support for Arm CoreSight, which allows 'perf script' to display properly with the option '-F,+callindent'.
Before:
# perf script -F,+callindent main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.so) main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
After:
# perf script -F,+callindent main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 44 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 58ceba7b91d5..780abbfd1833 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1117,6 +1117,45 @@ static void cs_etm__copy_insn(struct cs_etm_queue *etmq, sample->insn_len, (void *)sample->insn); }
+static void cs_etm__add_stack_event(struct cs_etm_queue *etmq,
struct cs_etm_traceid_queue *tidq)
+{
- struct cs_etm_auxtrace *etm = etmq->etm;
- u8 trace_chan_id = tidq->trace_chan_id;
- int insn_len;
- u64 from_ip, to_ip;
- if (etm->synth_opts.thread_stack) {
from_ip = cs_etm__last_executed_instr(tidq->prev_packet);
to_ip = cs_etm__first_executed_instr(tidq->packet);
insn_len = cs_etm__instr_size(etmq, trace_chan_id,
tidq->prev_packet->isa, from_ip);
/*
* Create thread stacks by keeping track of calls and returns;
* any call pushes thread stack, return pops the stack, and
* flush stack when the trace is discontinuous.
*/
thread_stack__event(tidq->thread, tidq->prev_packet->cpu,
tidq->prev_packet->flags,
from_ip, to_ip, insn_len,
etmq->buffer->buffer_nr);
Details are a little fuzzy in my head but I'm pretty sure we want trace_chan_id here.
I spent some time to look into this question, and I think we don't need to add extra info for trace_chan_id.
The main reason is for CPU wide tracing, if one task is migrated from CPU_a to CPU_b, if we append 'trace_chan_id' for the buffer number, then it will tell the thread_stack that the buffer has been changed (or it will be considered the trace is discontinuous), then thread stack will be flushed. Actually, this is not what we want; if a task is migrated from one CPU to another, we still need to keep its thread stack if the trace data comes from the same buffer_nr.
After reviewing the code I conclude that using etmq->buffer->buffer_nr is the correct way to proceed.
That being said, you have sent this new set [1], which is a rework of some of the code you have in the current set. As such the only way forward is for you to wait until [1] has been applied, and then rebase the remaining work in this set on top of it.
Let me know if you have questions.
Thanks, Mathieu
[1]. https://patchwork.kernel.org/cover/11130213/
To be honest, I struggled to understand what's the purpose for 'buffer->buffer_nr', from the code, I think 'buffer->buffer_nr' is mainly used to trace the splitted buffers (e.g. the buffers are splitted into different queues so the trace data coming from different trace chunk?). Now I observe 'buffer->buffer_nr' is always zero since the buffer is not used with splitted mode. If later we support 1:1 map between tracers and sinks, then we need to set 'buffer->buffer_nr' so can reflect the correct buffer mapping, but we don't need to use trace_chan_id as extra info at here.
Please let me know what you think about this? If you agree with this, I will send out patch v4 soon with addressing other comments.
Thanks, Leo Yan
- } else {
/*
* The thread stack can be output via thread_stack__process();
* thus the detailed information about paired calls and returns
* will be facilitated by Python script for the db-export.
*
* Need to set trace buffer number and flush thread stack if the
* trace buffer number has been alternate.
*/
thread_stack__set_trace_nr(tidq->thread,
tidq->prev_packet->cpu,
etmq->buffer->buffer_nr);
Same here.
- }
+}
static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq, struct cs_etm_traceid_queue *tidq, u64 addr, u64 period) @@ -1393,6 +1432,9 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, tidq->period_instructions = instrs_over; }
- if (tidq->prev_packet->last_instr_taken_branch)
cs_etm__add_stack_event(etmq, tidq);
- if (etm->sample_branches) { bool generate_sample = false;
@@ -2593,6 +2635,8 @@ int cs_etm__process_auxtrace_info(union perf_event *event, itrace_synth_opts__set_default(&etm->synth_opts, session->itrace_synth_opts->default_no_sample); etm->synth_opts.callchain = false;
etm->synth_opts.thread_stack =
session->itrace_synth_opts->thread_stack;
}
err = cs_etm__synth_events(etm, session);
-- 2.17.1
Hi Mathieu,
On Mon, Oct 28, 2019 at 04:43:57PM -0600, Mathieu Poirier wrote:
On Mon, 21 Oct 2019 at 23:03, Leo Yan leo.yan@linaro.org wrote:
Hi Mathieu,
On Fri, Oct 11, 2019 at 11:53:53AM -0600, Mathieu Poirier wrote:
On Sat, Oct 05, 2019 at 05:16:11PM +0800, Leo Yan wrote:
Since Arm CoreSight doesn't support the thread stack, the decoding cannot display symbols with indented spaces to reflect the stack depth.
This patch adds thread stack support for Arm CoreSight, which allows 'perf script' to display properly with the option '-F,+callindent'.
Before:
# perf script -F,+callindent main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.so) main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
After:
# perf script -F,+callindent main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 44 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 58ceba7b91d5..780abbfd1833 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1117,6 +1117,45 @@ static void cs_etm__copy_insn(struct cs_etm_queue *etmq, sample->insn_len, (void *)sample->insn); }
+static void cs_etm__add_stack_event(struct cs_etm_queue *etmq,
struct cs_etm_traceid_queue *tidq)
+{
- struct cs_etm_auxtrace *etm = etmq->etm;
- u8 trace_chan_id = tidq->trace_chan_id;
- int insn_len;
- u64 from_ip, to_ip;
- if (etm->synth_opts.thread_stack) {
from_ip = cs_etm__last_executed_instr(tidq->prev_packet);
to_ip = cs_etm__first_executed_instr(tidq->packet);
insn_len = cs_etm__instr_size(etmq, trace_chan_id,
tidq->prev_packet->isa, from_ip);
/*
* Create thread stacks by keeping track of calls and returns;
* any call pushes thread stack, return pops the stack, and
* flush stack when the trace is discontinuous.
*/
thread_stack__event(tidq->thread, tidq->prev_packet->cpu,
tidq->prev_packet->flags,
from_ip, to_ip, insn_len,
etmq->buffer->buffer_nr);
Details are a little fuzzy in my head but I'm pretty sure we want trace_chan_id here.
I spent some time to look into this question, and I think we don't need to add extra info for trace_chan_id.
The main reason is for CPU wide tracing, if one task is migrated from CPU_a to CPU_b, if we append 'trace_chan_id' for the buffer number, then it will tell the thread_stack that the buffer has been changed (or it will be considered the trace is discontinuous), then thread stack will be flushed. Actually, this is not what we want; if a task is migrated from one CPU to another, we still need to keep its thread stack if the trace data comes from the same buffer_nr.
After reviewing the code I conclude that using etmq->buffer->buffer_nr is the correct way to proceed.
Thanks for reviewing and confirming.
That being said you have sent this new set [1], which is a rework of some of the code you have in the current set. As such the only way forward is for you to wait until [1] I has been applied and rebase the remaining work in this set on top of it.
Right.
It seems the shared link is incorrect :) Let's first focus on the patch set 'perf cs-etm: Fix synthesizing instruction samples' [2]; after it is merged I will send a new patch set for cs-etm callchain support as soon as possible.
Thanks, Leo Yan
[2] https://patchwork.kernel.org/cover/11209991/
Let me know if you have questions.
Thanks, Mathieu
[1]. https://patchwork.kernel.org/cover/11130213/
To be honest, I struggled to understand what's the purpose for 'buffer->buffer_nr', from the code, I think 'buffer->buffer_nr' is mainly used to trace the splitted buffers (e.g. the buffers are splitted into different queues so the trace data coming from different trace chunk?). Now I observe 'buffer->buffer_nr' is always zero since the buffer is not used with splitted mode. If later we support 1:1 map between tracers and sinks, then we need to set 'buffer->buffer_nr' so can reflect the correct buffer mapping, but we don't need to use trace_chan_id as extra info at here.
Please let me know what you think about this? If you agree with this, I will send out patch v4 soon with addressing other comments.
Thanks, Leo Yan
- } else {
/*
* The thread stack can be output via thread_stack__process();
* thus the detailed information about paired calls and returns
* will be facilitated by Python script for the db-export.
*
* Need to set trace buffer number and flush thread stack if the
* trace buffer number has been alternate.
*/
thread_stack__set_trace_nr(tidq->thread,
tidq->prev_packet->cpu,
etmq->buffer->buffer_nr);
Same here.
- }
+}
static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq, struct cs_etm_traceid_queue *tidq, u64 addr, u64 period) @@ -1393,6 +1432,9 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, tidq->period_instructions = instrs_over; }
- if (tidq->prev_packet->last_instr_taken_branch)
cs_etm__add_stack_event(etmq, tidq);
- if (etm->sample_branches) { bool generate_sample = false;
@@ -2593,6 +2635,8 @@ int cs_etm__process_auxtrace_info(union perf_event *event, itrace_synth_opts__set_default(&etm->synth_opts, session->itrace_synth_opts->default_no_sample); etm->synth_opts.callchain = false;
etm->synth_opts.thread_stack =
session->itrace_synth_opts->thread_stack;
}
err = cs_etm__synth_events(etm, session);
-- 2.17.1
If the user specifies the option '-F,+callindent' or call chain related options, it means they only care about function calls and returns; in these cases, it's pointless to generate samples for branches within a function. But unlike other hardware trace handling (e.g. Intel's PT or BTS), Arm CoreSight doesn't filter branch types for these options and generates samples for all branches; this causes perf to output many spurious blanks if the branch is not a function call or return.

To output only pairs of calls and returns, this patch introduces a branch filter which is set according to the synthesis options. With it, perf outputs samples only for calls and returns and avoids the other unnecessary blanks.
Before:
# perf script -F,+callindent
  main 2808 1 branches: coresight_test1@plt aaaaba8d37d8 main+0x14 (/root/coresight_test/main)
  main 2808 1 branches: coresight_test1@plt aaaaba8d367c coresight_test1@plt+0xc (/root/coresight_test/main)
  main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main)
  main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s
  main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: ffff8636a3f4 _dl_lookup_symbol_x+0x5c (/lib/aarch64-linux-gnu/ld-2.28.s
  main 2808 1 branches: ffff8636a3f4 _dl_lookup_symbol_x+0x5c (/lib/aarch64-linux-gnu/ld-2.28.s
  main 2808 1 branches: ffff8636a3f4 _dl_lookup_symbol_x+0x5c (/lib/aarch64-linux-gnu/ld-2.28.s
  main 2808 1 branches: ffff8636a3f4 _dl_lookup_symbol_x+0x5c (/lib/aarch64-linux-gnu/ld-2.28.s
  main 2808 1 branches: ffff8636a3f4 _dl_lookup_symbol_x+0x5c (/lib/aarch64-linux-gnu/ld-2.28.s
  [...]
After:
# perf script -F,+callindent
  main 2808 1 branches: coresight_test1@plt aaaaba8d37d8 main+0x14 (/root/coresight_test/main)
  main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s
  main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: do_lookup_x ffff8636a49c _dl_lookup_symbol_x+0x104 (/lib/aarch64-linux-gnu/ld-2.28.
  main 2808 1 branches: check_match ffff86369bf0 do_lookup_x+0x238 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: strcmp ffff86369888 check_match+0x70 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main)
  main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s
  main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: do_lookup_x ffff8636a49c _dl_lookup_symbol_x+0x104 (/lib/aarch64-linux-gnu/ld-2.28.
  main 2808 1 branches: _dl_name_match_p ffff86369af0 do_lookup_x+0x138 (/lib/aarch64-linux-gnu/ld-2.28.so)
  main 2808 1 branches: strcmp ffff8636f7f0 _dl_name_match_p+0x18 (/lib/aarch64-linux-gnu/ld-2.28.so)
  [...]
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 tools/perf/util/cs-etm.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 780abbfd1833..4b42f9c9bd34 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -56,6 +56,7 @@ struct cs_etm_auxtrace {
 	int num_cpu;
 	u32 auxtrace_type;
+	u32 branches_filter;
 	u64 branches_sample_type;
 	u64 branches_id;
 	u64 instructions_sample_type;
@@ -1223,6 +1224,10 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq,
 	} dummy_bs;
 	u64 ip;
 
+	if (etm->branches_filter &&
+	    !(etm->branches_filter & tidq->prev_packet->flags))
+		return 0;
+
 	ip = cs_etm__last_executed_instr(tidq->prev_packet);
 
 	event->sample.header.type = PERF_RECORD_SAMPLE;
@@ -2639,6 +2644,13 @@ int cs_etm__process_auxtrace_info(union perf_event *event,
 			session->itrace_synth_opts->thread_stack;
 	}
 
+	if (etm->synth_opts.calls)
+		etm->branches_filter |= PERF_IP_FLAG_CALL | PERF_IP_FLAG_ASYNC |
+					PERF_IP_FLAG_TRACE_END;
+	if (etm->synth_opts.returns)
+		etm->branches_filter |= PERF_IP_FLAG_RETURN |
+					PERF_IP_FLAG_TRACE_BEGIN;
+
 	err = cs_etm__synth_events(etm, session);
 	if (err)
 		goto err_delete_thread;
Now that CoreSight supports the thread stack, we can use it to synthesize the call chain for instruction samples; the call chain can be requested with the option '--itrace=g'.
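For reference, per perf's --itrace option syntax, '--itrace=g16l64i100' asks for synthesized callchains up to 16 entries deep ('g16'), last-branch stacks of 64 entries ('l64') and one instruction sample every 100 instructions ('i100'). The per-queue callchain buffer is sized for the requested depth plus one extra slot for the callchain context marker; a minimal sketch of that sizing, using a stand-in struct instead of perf's struct ip_callchain:

	#include <stdint.h>
	#include <stdlib.h>

	struct ip_callchain_sketch {
		uint64_t nr;	/* number of entries actually filled in */
		uint64_t ips[];	/* context marker followed by return addresses */
	};

	static struct ip_callchain_sketch *alloc_chain(unsigned int callchain_sz)
	{
		/* One extra slot for the callchain context (kernel/user marker). */
		size_t sz = sizeof(struct ip_callchain_sketch) +
			    (callchain_sz + 1) * sizeof(uint64_t);

		return calloc(1, sz);	/* zalloc() equivalent */
	}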
Before:
# perf script --itrace=g16l64i100
  main 1579 100 instructions: ffff0000102137f0 group_sched_in+0xb0 ([kernel.kallsyms])
  main 1579 100 instructions: ffff000010213b78 flexible_sched_in+0xf0 ([kernel.kallsyms])
  main 1579 100 instructions: ffff0000102135ac event_sched_in.isra.57+0x74 ([kernel.kallsyms])
  main 1579 100 instructions: ffff000010219344 perf_swevent_add+0x6c ([kernel.kallsyms])
  main 1579 100 instructions: ffff000010214854 perf_event_update_userpage+0x4c ([kernel.kallsyms])
  [...]
After:
# perf script --itrace=g16l64i100
  main 1579 100 instructions:
	ffff000010213b78 flexible_sched_in+0xf0 ([kernel.kallsyms])
	ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])

  main 1579 100 instructions:
	ffff0000102135ac event_sched_in.isra.57+0x74 ([kernel.kallsyms])
	ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms])
	ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms])
	ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])

  main 1579 100 instructions:
	ffff000010219344 perf_swevent_add+0x6c ([kernel.kallsyms])
	ffff0000102135f4 event_sched_in.isra.57+0xbc ([kernel.kallsyms])
	ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms])
	ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms])
	ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
  [...]
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 tools/perf/util/cs-etm.c | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 4b42f9c9bd34..56e501cd2f5f 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -17,6 +17,7 @@
 #include <stdlib.h>
 
 #include "auxtrace.h"
+#include "callchain.h"
 #include "color.h"
 #include "cs-etm.h"
 #include "cs-etm-decoder/cs-etm-decoder.h"
@@ -74,6 +75,7 @@ struct cs_etm_traceid_queue {
 	size_t last_branch_pos;
 	union perf_event *event_buf;
 	struct thread *thread;
+	struct ip_callchain *chain;
 	struct branch_stack *last_branch;
 	struct branch_stack *last_branch_rb;
 	struct cs_etm_packet *prev_packet;
@@ -251,6 +253,16 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
 	if (!tidq->prev_packet)
 		goto out_free;
 
+	if (etm->synth_opts.callchain) {
+		size_t sz = sizeof(struct ip_callchain);
+
+		/* Add 1 to callchain_sz for callchain context */
+		sz += (etm->synth_opts.callchain_sz + 1) * sizeof(u64);
+		tidq->chain = zalloc(sz);
+		if (!tidq->chain)
+			goto out_free;
+	}
+
 	if (etm->synth_opts.last_branch) {
 		size_t sz = sizeof(struct branch_stack);
 
@@ -275,6 +287,7 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq,
 	zfree(&tidq->last_branch);
 	zfree(&tidq->prev_packet);
 	zfree(&tidq->packet);
+	zfree(&tidq->chain);
out:
 	return rc;
 }
@@ -546,6 +559,7 @@ static void cs_etm__free_traceid_queues(struct cs_etm_queue *etmq)
 		zfree(&tidq->last_branch_rb);
 		zfree(&tidq->prev_packet);
 		zfree(&tidq->packet);
+		zfree(&tidq->chain);
 		zfree(&tidq);
 
 		/*
@@ -1126,7 +1140,7 @@ static void cs_etm__add_stack_event(struct cs_etm_queue *etmq,
 	int insn_len;
 	u64 from_ip, to_ip;
 
-	if (etm->synth_opts.thread_stack) {
+	if (etm->synth_opts.callchain || etm->synth_opts.thread_stack) {
 		from_ip = cs_etm__last_executed_instr(tidq->prev_packet);
 		to_ip = cs_etm__first_executed_instr(tidq->packet);
 
@@ -1182,6 +1196,14 @@ static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
 
 	cs_etm__copy_insn(etmq, tidq->trace_chan_id, tidq->packet, &sample);
 
+	if (etm->synth_opts.callchain) {
+		thread_stack__sample(tidq->thread, tidq->packet->cpu,
+				     tidq->chain,
+				     etm->synth_opts.callchain_sz + 1,
+				     sample.ip, etm->kernel_start);
+		sample.callchain = tidq->chain;
+	}
+
 	if (etm->synth_opts.last_branch) {
 		cs_etm__copy_last_branch_rb(etmq, tidq);
 		sample.branch_stack = tidq->last_branch;
@@ -1369,6 +1391,8 @@ static int cs_etm__synth_events(struct cs_etm_auxtrace *etm,
 		attr.sample_type &= ~(u64)PERF_SAMPLE_ADDR;
 	}
 
+	if (etm->synth_opts.callchain)
+		attr.sample_type |= PERF_SAMPLE_CALLCHAIN;
 	if (etm->synth_opts.last_branch)
 		attr.sample_type |= PERF_SAMPLE_BRANCH_STACK;
 
@@ -2639,7 +2663,6 @@ int cs_etm__process_auxtrace_info(union perf_event *event,
 	} else {
 		itrace_synth_opts__set_default(&etm->synth_opts,
 				session->itrace_synth_opts->default_no_sample);
-		etm->synth_opts.callchain = false;
 		etm->synth_opts.thread_stack =
 				session->itrace_synth_opts->thread_stack;
 	}
@@ -2651,6 +2674,14 @@ int cs_etm__process_auxtrace_info(union perf_event *event,
 		etm->branches_filter |= PERF_IP_FLAG_RETURN |
 					PERF_IP_FLAG_TRACE_BEGIN;
 
+	if (etm->synth_opts.callchain && !symbol_conf.use_callchain) {
+		symbol_conf.use_callchain = true;
+		if (callchain_register_param(&callchain_param) < 0) {
+			symbol_conf.use_callchain = false;
+			etm->synth_opts.callchain = false;
+		}
+	}
+
 	err = cs_etm__synth_events(etm, session);
 	if (err)
 		goto err_delete_thread;
On Sat, Oct 05, 2019 at 05:16:13PM +0800, Leo Yan wrote:
Now CoreSight has supported the thread stack; based on the thread stack we can synthesize call chain for the instruction sample; the call chain can be injected by option '--itrace=g'.
Before:
# perf script --itrace=g16l64i100 main 1579 100 instructions: ffff0000102137f0 group_sched_in+0xb0 ([kernel.kallsyms]) main 1579 100 instructions: ffff000010213b78 flexible_sched_in+0xf0 ([kernel.kallsyms]) main 1579 100 instructions: ffff0000102135ac event_sched_in.isra.57+0x74 ([kernel.kallsyms]) main 1579 100 instructions: ffff000010219344 perf_swevent_add+0x6c ([kernel.kallsyms]) main 1579 100 instructions: ffff000010214854 perf_event_update_userpage+0x4c ([kernel.kallsyms]) [...]
After:
# perf script --itrace=g16l64i100
main 1579 100 instructions: ffff000010213b78 flexible_sched_in+0xf0 ([kernel.kallsyms]) ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
main 1579 100 instructions: ffff0000102135ac event_sched_in.isra.57+0x74 ([kernel.kallsyms]) ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms]) ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms]) ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
main 1579 100 instructions: ffff000010219344 perf_swevent_add+0x6c ([kernel.kallsyms]) ffff0000102135f4 event_sched_in.isra.57+0xbc ([kernel.kallsyms]) ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms]) ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms]) ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms]) [...]
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 35 +++++++++++++++++++++++++++++++++-- 1 file changed, 33 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 4b42f9c9bd34..56e501cd2f5f 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -17,6 +17,7 @@ #include <stdlib.h> #include "auxtrace.h" +#include "callchain.h" #include "color.h" #include "cs-etm.h" #include "cs-etm-decoder/cs-etm-decoder.h" @@ -74,6 +75,7 @@ struct cs_etm_traceid_queue { size_t last_branch_pos; union perf_event *event_buf; struct thread *thread;
- struct ip_callchain *chain; struct branch_stack *last_branch; struct branch_stack *last_branch_rb; struct cs_etm_packet *prev_packet;
@@ -251,6 +253,16 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq, if (!tidq->prev_packet) goto out_free;
- if (etm->synth_opts.callchain) {
size_t sz = sizeof(struct ip_callchain);
/* Add 1 to callchain_sz for callchain context */
sz += (etm->synth_opts.callchain_sz + 1) * sizeof(u64);
tidq->chain = zalloc(sz);
if (!tidq->chain)
goto out_free;
- }
- if (etm->synth_opts.last_branch) { size_t sz = sizeof(struct branch_stack);
@@ -275,6 +287,7 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq, zfree(&tidq->last_branch); zfree(&tidq->prev_packet); zfree(&tidq->packet);
- zfree(&tidq->chain);
Theoretically this should go two lines up, i.e. just below zfree(&tidq->prev_packet). If you agree with the comment I made on 3/6 then it is worth making the above change; otherwise it can stay as it is.
out: return rc; } @@ -546,6 +559,7 @@ static void cs_etm__free_traceid_queues(struct cs_etm_queue *etmq) zfree(&tidq->last_branch_rb); zfree(&tidq->prev_packet); zfree(&tidq->packet);
zfree(&tidq->chain);
Same comment as above. The rest looks good to me.
Mathieu
zfree(&tidq);
/* @@ -1126,7 +1140,7 @@ static void cs_etm__add_stack_event(struct cs_etm_queue *etmq, int insn_len; u64 from_ip, to_ip;
- if (etm->synth_opts.thread_stack) {
- if (etm->synth_opts.callchain || etm->synth_opts.thread_stack) { from_ip = cs_etm__last_executed_instr(tidq->prev_packet); to_ip = cs_etm__first_executed_instr(tidq->packet);
@@ -1182,6 +1196,14 @@ static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq, cs_etm__copy_insn(etmq, tidq->trace_chan_id, tidq->packet, &sample);
- if (etm->synth_opts.callchain) {
thread_stack__sample(tidq->thread, tidq->packet->cpu,
tidq->chain,
etm->synth_opts.callchain_sz + 1,
sample.ip, etm->kernel_start);
sample.callchain = tidq->chain;
- }
- if (etm->synth_opts.last_branch) { cs_etm__copy_last_branch_rb(etmq, tidq); sample.branch_stack = tidq->last_branch;
@@ -1369,6 +1391,8 @@ static int cs_etm__synth_events(struct cs_etm_auxtrace *etm, attr.sample_type &= ~(u64)PERF_SAMPLE_ADDR; }
+	if (etm->synth_opts.callchain)
+		attr.sample_type |= PERF_SAMPLE_CALLCHAIN;
 	if (etm->synth_opts.last_branch)
 		attr.sample_type |= PERF_SAMPLE_BRANCH_STACK;
@@ -2639,7 +2663,6 @@ int cs_etm__process_auxtrace_info(union perf_event *event,
 	} else {
 		itrace_synth_opts__set_default(&etm->synth_opts,
 				session->itrace_synth_opts->default_no_sample);
-		etm->synth_opts.callchain = false;
 		etm->synth_opts.thread_stack =
 				session->itrace_synth_opts->thread_stack;
 	}
@@ -2651,6 +2674,14 @@ int cs_etm__process_auxtrace_info(union perf_event *event, etm->branches_filter |= PERF_IP_FLAG_RETURN | PERF_IP_FLAG_TRACE_BEGIN;
- if (etm->synth_opts.callchain && !symbol_conf.use_callchain) {
symbol_conf.use_callchain = true;
if (callchain_register_param(&callchain_param) < 0) {
symbol_conf.use_callchain = false;
etm->synth_opts.callchain = false;
}
- }
- err = cs_etm__synth_events(etm, session); if (err) goto err_delete_thread;
-- 2.17.1
The synthesis flow uses 'tidq->packet' for instruction samples, while 'tidq->prev_packet' is used to generate the thread stack and the branch samples; as a result the instruction samples are one packet ahead of the thread stack and branch samples ('tidq->prev_packet' vs 'tidq->packet').
This leads to an erroneous callchain for instruction samples, as shown in the example below:
  main 1579 100 instructions:
	ffff000010214854 perf_event_update_userpage+0x4c ([kernel.kallsyms])
	ffff000010214850 perf_event_update_userpage+0x48 ([kernel.kallsyms])
	ffff000010219360 perf_swevent_add+0x88 ([kernel.kallsyms])
	ffff0000102135f4 event_sched_in.isra.57+0xbc ([kernel.kallsyms])
	ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms])
	ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms])
	ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
In the callchain log, for any two consecutive lines the upper line shows a child function and the line below it shows that function's caller, and so on. So the first two lines are:

  perf_event_update_userpage+0x4c => the sampled instruction
  perf_event_update_userpage+0x48 => the call site in the parent function
Here the child and the parent resolve to the same function, perf_event_update_userpage(), but this is not a recursive function, so a sequence of perf_event_update_userpage() calling itself should never happen. The callchain error is caused by the instruction sample using a packet one step ahead of the thread stack: the thread stack has not yet processed the new packet, so it misses popping the stack when that packet is just a return.
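To make the ordering constraint concrete, here is a heavily simplified, self-contained toy model (an assumption-laden sketch, not the real cs-etm decoder; the two stub functions are hypothetical):

	#include <stdio.h>

	struct pkt { const char *name; };

	/* Stub: in perf this pushes/pops the thread stack for the prev -> cur transition. */
	static void update_thread_stack(const struct pkt *prev, const struct pkt *cur)
	{
		printf("stack updated for transition %s -> %s\n", prev->name, cur->name);
	}

	/* Stub: in perf this emits PERF_RECORD_SAMPLE events for one packet. */
	static void synth_instruction_samples(const struct pkt *p)
	{
		printf("instruction samples attributed to %s\n", p->name);
	}

	int main(void)
	{
		struct pkt prev = { "packet A" }, cur = { "packet B" };

		/* The stack state produced here belongs to packet A ... */
		update_thread_stack(&prev, &cur);

		/*
		 * ... so instruction samples must also be generated from packet A
		 * (the fix); generating them from packet B would be one packet
		 * ahead of the stack state that was just applied.
		 */
		synth_instruction_samples(&prev);
		return 0;
	}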
To fix this issue, simply switch to 'tidq->prev_packet' when generating the instruction samples; this lets the thread stack push and pop in sync with the instruction samples. With this change, the callchain is displayed correctly as below:
  main 1579 100 instructions:
	ffff000010214854 perf_event_update_userpage+0x4c ([kernel.kallsyms])
	ffff000010219360 perf_swevent_add+0x88 ([kernel.kallsyms])
	ffff0000102135f4 event_sched_in.isra.57+0xbc ([kernel.kallsyms])
	ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms])
	ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms])
	ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
Signed-off-by: Leo Yan leo.yan@linaro.org
---
 tools/perf/util/cs-etm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c
index 56e501cd2f5f..fa969dcb45d2 100644
--- a/tools/perf/util/cs-etm.c
+++ b/tools/perf/util/cs-etm.c
@@ -1419,7 +1419,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
 	struct cs_etm_packet *tmp;
 	int ret;
 	u8 trace_chan_id = tidq->trace_chan_id;
-	u64 instrs_executed = tidq->packet->instr_count;
+	u64 instrs_executed = tidq->prev_packet->instr_count;
 
 	tidq->period_instructions += instrs_executed;
 
@@ -1450,7 +1450,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
 		 */
 		s64 offset = (instrs_executed - instrs_over - 1);
 		u64 addr = cs_etm__instr_addr(etmq, trace_chan_id,
-					      tidq->packet, offset);
+					      tidq->prev_packet, offset);
 
 		ret = cs_etm__synth_instruction_sample(
 			etmq, tidq, addr, etm->instructions_sample_period);
On Sat, Oct 05, 2019 at 05:16:14PM +0800, Leo Yan wrote:
The synthesized flow use 'tidq->packet' for instruction samples; on the other hand, 'tidp->prev_packet' is used to generate the thread stack and the branch samples, this results in the instruction samples using one packet ahead than thread stack and branch samples ('tidp->prev_packet' vs 'tidq->packet').
This leads to an instruction's callchain error as shows in below example:
main 1579 100 instructions: ffff000010214854 perf_event_update_userpage+0x4c ([kernel.kallsyms]) ffff000010214850 perf_event_update_userpage+0x48 ([kernel.kallsyms]) ffff000010219360 perf_swevent_add+0x88 ([kernel.kallsyms]) ffff0000102135f4 event_sched_in.isra.57+0xbc ([kernel.kallsyms]) ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms]) ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms]) ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
In the callchain log, for the two continuous lines the up line contains one child function info and the followed line contains the caller function info, and so forth. So the first two lines are:
perf_event_update_userpage+0x4c => the sampled instruction perf_event_update_userpage+0x48 => the parent function's calling
The child function and parent function both are the same function perf_event_update_userpage(), but this isn't a recursive function, thus the sequence for perf_event_update_userpage() calling itself shouldn't never happen. This callchain error is caused by the instruction sample using an ahead packet than the thread stack, the thread stack is deferred to process the new packet and misses to pop stack if it is just a return packet.
To fix this issue, we can simply change to use 'tidq->prev_packet' to generate the instruction samples, this allows the thread stack to push and pop synchronously with instruction sample. Finally, the callchain can be displayed correctly as below:
main 1579 100 instructions: ffff000010214854 perf_event_update_userpage+0x4c ([kernel.kallsyms]) ffff000010219360 perf_swevent_add+0x88 ([kernel.kallsyms]) ffff0000102135f4 event_sched_in.isra.57+0xbc ([kernel.kallsyms]) ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms]) ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms]) ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 56e501cd2f5f..fa969dcb45d2 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1419,7 +1419,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, struct cs_etm_packet *tmp; int ret; u8 trace_chan_id = tidq->trace_chan_id;
- u64 instrs_executed = tidq->packet->instr_count;
- u64 instrs_executed = tidq->prev_packet->instr_count;
tidq->period_instructions += instrs_executed; @@ -1450,7 +1450,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, */ s64 offset = (instrs_executed - instrs_over - 1); u64 addr = cs_etm__instr_addr(etmq, trace_chan_id,
tidq->packet, offset);
tidq->prev_packet, offset);
I have tested this set in --per-thread mode and things are working as advertised. Did you check how things look in CPU-wide scenarios?
Thanks, Mathieu
ret = cs_etm__synth_instruction_sample( etmq, tidq, addr, etm->instructions_sample_period); -- 2.17.1
On Fri, Oct 11, 2019 at 02:17:50PM -0600, Mathieu Poirier wrote:
On Sat, Oct 05, 2019 at 05:16:14PM +0800, Leo Yan wrote:
The synthesized flow use 'tidq->packet' for instruction samples; on the other hand, 'tidp->prev_packet' is used to generate the thread stack and the branch samples, this results in the instruction samples using one packet ahead than thread stack and branch samples ('tidp->prev_packet' vs 'tidq->packet').
This leads to an instruction's callchain error as shows in below example:
main 1579 100 instructions: ffff000010214854 perf_event_update_userpage+0x4c ([kernel.kallsyms]) ffff000010214850 perf_event_update_userpage+0x48 ([kernel.kallsyms]) ffff000010219360 perf_swevent_add+0x88 ([kernel.kallsyms]) ffff0000102135f4 event_sched_in.isra.57+0xbc ([kernel.kallsyms]) ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms]) ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms]) ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
In the callchain log, for the two continuous lines the up line contains one child function info and the followed line contains the caller function info, and so forth. So the first two lines are:
perf_event_update_userpage+0x4c => the sampled instruction perf_event_update_userpage+0x48 => the parent function's calling
The child function and parent function both are the same function perf_event_update_userpage(), but this isn't a recursive function, thus the sequence for perf_event_update_userpage() calling itself shouldn't never happen. This callchain error is caused by the instruction sample using an ahead packet than the thread stack, the thread stack is deferred to process the new packet and misses to pop stack if it is just a return packet.
To fix this issue, we can simply change to use 'tidq->prev_packet' to generate the instruction samples, this allows the thread stack to push and pop synchronously with instruction sample. Finally, the callchain can be displayed correctly as below:
main 1579 100 instructions: ffff000010214854 perf_event_update_userpage+0x4c ([kernel.kallsyms]) ffff000010219360 perf_swevent_add+0x88 ([kernel.kallsyms]) ffff0000102135f4 event_sched_in.isra.57+0xbc ([kernel.kallsyms]) ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms]) ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms]) ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 56e501cd2f5f..fa969dcb45d2 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1419,7 +1419,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, struct cs_etm_packet *tmp; int ret; u8 trace_chan_id = tidq->trace_chan_id;
- u64 instrs_executed = tidq->packet->instr_count;
- u64 instrs_executed = tidq->prev_packet->instr_count;
tidq->period_instructions += instrs_executed; @@ -1450,7 +1450,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, */ s64 offset = (instrs_executed - instrs_over - 1); u64 addr = cs_etm__instr_addr(etmq, trace_chan_id,
tidq->packet, offset);
tidq->prev_packet, offset);
I have tested this set in --per-thread mode and things are working as advertised. Did you see how things look like in CPU-wide scenarios?
I am not sure I checked the callchain for CPU-wide scenarios; I didn't pay attention to the case where the trace data switches between CPUs. Especially in connection with your comment on patch 03, a wrong buffer ID might produce a wrong callchain in CPU-wide scenarios.
Will do more testing for CPU-wide scenarios in the next spin.
Thanks for the suggestions. Leo Yan
Thanks, Mathieu
ret = cs_etm__synth_instruction_sample( etmq, tidq, addr, etm->instructions_sample_period); -- 2.17.1
Hi Mathieu,
On Fri, Oct 11, 2019 at 02:17:50PM -0600, Mathieu Poirier wrote:
On Sat, Oct 05, 2019 at 05:16:14PM +0800, Leo Yan wrote:
The synthesized flow use 'tidq->packet' for instruction samples; on the other hand, 'tidp->prev_packet' is used to generate the thread stack and the branch samples, this results in the instruction samples using one packet ahead than thread stack and branch samples ('tidp->prev_packet' vs 'tidq->packet').
This leads to an instruction's callchain error as shows in below example:
main 1579 100 instructions: ffff000010214854 perf_event_update_userpage+0x4c ([kernel.kallsyms]) ffff000010214850 perf_event_update_userpage+0x48 ([kernel.kallsyms]) ffff000010219360 perf_swevent_add+0x88 ([kernel.kallsyms]) ffff0000102135f4 event_sched_in.isra.57+0xbc ([kernel.kallsyms]) ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms]) ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms]) ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
In the callchain log, for the two continuous lines the up line contains one child function info and the followed line contains the caller function info, and so forth. So the first two lines are:
perf_event_update_userpage+0x4c => the sampled instruction perf_event_update_userpage+0x48 => the parent function's calling
The child function and parent function both are the same function perf_event_update_userpage(), but this isn't a recursive function, thus the sequence for perf_event_update_userpage() calling itself shouldn't never happen. This callchain error is caused by the instruction sample using an ahead packet than the thread stack, the thread stack is deferred to process the new packet and misses to pop stack if it is just a return packet.
To fix this issue, we can simply change to use 'tidq->prev_packet' to generate the instruction samples, this allows the thread stack to push and pop synchronously with instruction sample. Finally, the callchain can be displayed correctly as below:
main 1579 100 instructions: ffff000010214854 perf_event_update_userpage+0x4c ([kernel.kallsyms]) ffff000010219360 perf_swevent_add+0x88 ([kernel.kallsyms]) ffff0000102135f4 event_sched_in.isra.57+0xbc ([kernel.kallsyms]) ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms]) ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms]) ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
Signed-off-by: Leo Yan leo.yan@linaro.org
tools/perf/util/cs-etm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 56e501cd2f5f..fa969dcb45d2 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1419,7 +1419,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, struct cs_etm_packet *tmp; int ret; u8 trace_chan_id = tidq->trace_chan_id;
- u64 instrs_executed = tidq->packet->instr_count;
- u64 instrs_executed = tidq->prev_packet->instr_count;
tidq->period_instructions += instrs_executed; @@ -1450,7 +1450,7 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, */ s64 offset = (instrs_executed - instrs_over - 1); u64 addr = cs_etm__instr_addr(etmq, trace_chan_id,
tidq->packet, offset);
tidq->prev_packet, offset);
I have tested this set in --per-thread mode and things are working as advertised. Did you see how things look like in CPU-wide scenarios?
After some testing, I can confirm this patch set works well for CPU-wide trace; the reason is that in arch/arm/util/cs-etm.c the function cs_etm_recording_options() enables the option 'ETM_OPT_CTXTID' for CPU-wide trace:
	/*
	 * In the case of per-cpu mmaps, we need the CPU on the
	 * AUX event. We also need the contextID in order to be notified
	 * when a context switch happened.
	 */
	if (!perf_cpu_map__empty(cpus)) {
		perf_evsel__set_sample_bit(cs_etm_evsel, CPU);

		err = cs_etm_set_option(itr, cs_etm_evsel,
					ETM_OPT_CTXTID | ETM_OPT_TS);
		if (err)
			goto out;
	}
As a result, we don't need to specify an extra option to enable the context ID configuration, so the two commands below behave the same:

  # perf record -e cs_etm/@tmc_etr0/ -a -- sh test.sh
  # perf record -e cs_etm/@tmc_etr0,config=0x4000/ -a -- sh test.sh
                                           `-> bit 14: for ETM_OPT_CTXTID

Since the decoder sets the tid when it receives the packet 'OCSD_GEN_TRC_ELEM_PE_CONTEXT', it can provide the correct tid/pid info for threads. This allows the thread stack to be generated per thread, based on thread->tid, and avoids mixing info across different threads.
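As a quick sanity check of the config value quoted above (a trivial sketch; it only assumes that ETM_OPT_CTXTID is bit 14, as noted next to the command):

	#include <stdio.h>

	int main(void)
	{
		unsigned long ctxtid_mask = 1UL << 14;	/* bit 14 -> ETM_OPT_CTXTID */

		/* Prints 0x4000, the config= value used in the second perf record command. */
		printf("ETM_OPT_CTXTID mask: 0x%lx\n", ctxtid_mask);
		return 0;
	}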
Thanks, Leo Yan
ret = cs_etm__synth_instruction_sample( etmq, tidq, addr, etm->instructions_sample_period); -- 2.17.1