This patch series adds support for thread stack and callchain; this patch set depends on the instruction sample fix patch set [1].
This patch set get more complex, so before divide into small groups, I'd like to use this patch set version to include all relevant patches, hope this can give whole context for related code change.
Briefly, this patch can be divided into three parts, which also can be reviewed separately for every part:
Patches 01, 02 are used to fix samples for one corner case is for accessing the branch's target address and trigger an exception. Essentially, an extra branch sample is added to reflect this mediate branch between the previous branch and exception entry.
Patches 03, 04, 05, 06 are coming from patch v4, which are used to support thread stack and callchain.
Patches 07, 08, 09 are used to fixup for exception entry and exit. This is mainly used to fix two cases, one part is to fixup the thread stack and callchain for the case when access branch target address and trigger exception; another part is to fixup the thread stack for instruction emulation (and other single step cases).
This patch set has been tested on Juno-r2 after applied on perf/core branch with latest commit 85fc95d75970 ("perf maps: Add missing unlock to maps__insert() error case"), and this patch set is also applied on top of instruction sample fix patch set [1].
Test for option '-F,+callindent':
# perf script -F,+callindent main 3258 1 branches: main ffffad684d20 __libc_start_main+0xe0 (/usr/lib/aarch64-linux-gnu/libc-2.28.so) main 3258 1 branches: lib_loop_test@plt aaaae2c4d78c main+0x18 (/root/coresight_test/main) main 3258 1 branches: _dl_fixup ffffad811b4c _dl_runtime_resolve+0x40 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) main 3258 1 branches: _dl_lookup_symbol_x ffffad80c078 _dl_fixup+0xb8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) main 3258 1 branches: do_lookup_x ffffad80849c _dl_lookup_symbol_x+0x104 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) main 3258 1 branches: check_match ffffad807bf0 do_lookup_x+0x238 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) main 3258 1 branches: strcmp ffffad807888 check_match+0x70 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) main 3258 1 branches: lib_loop_test@plt aaaae2c4d78c main+0x18 (/root/coresight_test/main) main 3258 1 branches: lib_loop_test@plt aaaae2c4d78c main+0x18 (/root/coresight_test/main) main 3258 1 branches: lib_loop_test@plt aaaae2c4d78c main+0x18 (/root/coresight_test/main) main 3258 1 branches: lib_loop_test@plt aaaae2c4d78c main+0x18 (/root/coresight_test/main)
[...]
Test for option '--itrace=g':
# perf script --itrace=g16l64i100
main 3258 100 instructions: ffffad816a80 memcpy+0x70 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad809468 _dl_new_object+0xa8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad801840 dl_main+0x778 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad81384c _dl_sysdep_start+0x36c (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800884 _dl_start_final+0xac (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800b00 _dl_start+0x200 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800048 _start+0x8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
main 3258 100 instructions: ffffad80952c _dl_new_object+0x16c (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad801840 dl_main+0x778 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad81384c _dl_sysdep_start+0x36c (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800884 _dl_start_final+0xac (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800b00 _dl_start+0x200 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800048 _start+0x8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
main 3258 100 instructions: ffffad8018dc dl_main+0x814 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad81384c _dl_sysdep_start+0x36c (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800884 _dl_start_final+0xac (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800b00 _dl_start+0x200 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800048 _start+0x8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
main 3258 100 instructions: ffff8000100878d0 el0_sync_handler+0x168 ([kernel.kallsyms]) ffff800010082d00 el0_sync+0x140 ([kernel.kallsyms]) ffffad801910 dl_main+0x848 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad81384c _dl_sysdep_start+0x36c (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800884 _dl_start_final+0xac (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800b00 _dl_start+0x200 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800048 _start+0x8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
[...]
Changes from v4: * Addressed Mike's suggestion for performance improvement for function cs_etm__instr_addr() for quick calculation for non T32; * Removed the patch 'perf cs-etm: Synchronize instruction sample with the thread stack' (Mike); * Fixed the issue for exception is taken for branch target address accessing, for the branch sample and stack thread handling, the related patches are 01, 02, 07; * Fixed the stack thread handling for instruction emulation and single step with patches 08, 09.
Changes from v3: * Split out separate patch set for instruction samples fixing. * Rebased on latest perf/core branch.
Changes from v2: * Added patch 01 to fix the unsigned variable comparison to zero (Suzuki). * Refined commit logs.
Changes from v1: * Added comments for task thread handling (Mathieu). * Split patch 02 into two patches, one is for support thread stack and another is for callchain support (Mathieu). * Added a new patch to support branch filter.
[1] https://lkml.org/lkml/2020/2/18/1406
Leo Yan (9): perf cs-etm: Defer to assign exception sample flag perf cs-etm: Reflect branch prior to exception perf cs-etm: Refactor instruction size handling perf cs-etm: Support thread stack perf cs-etm: Support branch filter perf cs-etm: Support callchain for instruction sample perf cs-etm: Fixup exception entry for thread stack perf thread: Add helper to get top return address perf cs-etm: Fixup exception exit for thread stack
.../perf/util/cs-etm-decoder/cs-etm-decoder.c | 1 + tools/perf/util/cs-etm.c | 290 ++++++++++++++++-- tools/perf/util/thread-stack.c | 10 + tools/perf/util/thread-stack.h | 1 + 4 files changed, 268 insertions(+), 34 deletions(-)
Currently, neither the exception entry packet nor the exception return packet isn't used to generate samples; so the exception packet is only used as an affiliate packet, and the exception sample flag is assigned to its previous range packet, this is finished in the function cs_etm__set_sample_flags().
This patch moves the exception sample flag assignment from cs_etm__set_sample_flags() to cs_etm__exception(), essentially it defers to assign exception sample flag to the previous range packet, thus this gives us a chance to keep the previous range packet's original sample flag.
So this patch is only a preparation for later patches and doesn't include any change for the functionality; based on it, we can add extra processing between the exception packet and its previous range packet.
To reduce the indenting, this patch bails out directly at the entry of cs_etm__exception() if detects the previous packet is not a range packet.
Signed-off-by: Leo Yan leo.yan@linaro.org --- tools/perf/util/cs-etm.c | 28 +++++++++++++++++----------- 1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index bba969d48076..48932a7a933f 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1479,6 +1479,13 @@ static int cs_etm__sample(struct cs_etm_queue *etmq,
static int cs_etm__exception(struct cs_etm_traceid_queue *tidq) { + /* + * Usually the exception packet follows a range packet, if it's not the + * case, directly bail out. + */ + if (tidq->prev_packet->sample_type != CS_ETM_RANGE) + return 0; + /* * When the exception packet is inserted, whether the last instruction * in previous range packet is taken branch or not, we need to force @@ -1490,8 +1497,16 @@ static int cs_etm__exception(struct cs_etm_traceid_queue *tidq) * swap PACKET with PREV_PACKET. This keeps PREV_PACKET to be useful * for generating instruction and branch samples. */ - if (tidq->prev_packet->sample_type == CS_ETM_RANGE) - tidq->prev_packet->last_instr_taken_branch = true; + tidq->prev_packet->last_instr_taken_branch = true; + + /* + * Since the exception packet is not used standalone for generating + * samples and it's affiliation to the previous instruction range + * packet; so set previous range packet flags to tell perf it is an + * exception taken branch. + */ + if (tidq->packet->sample_type == CS_ETM_EXCEPTION) + tidq->prev_packet->flags = tidq->packet->flags;
return 0; } @@ -1916,15 +1931,6 @@ static int cs_etm__set_sample_flags(struct cs_etm_queue *etmq, PERF_IP_FLAG_CALL | PERF_IP_FLAG_INTERRUPT;
- /* - * When the exception packet is inserted, since exception - * packet is not used standalone for generating samples - * and it's affiliation to the previous instruction range - * packet; so set previous range packet flags to tell perf - * it is an exception taken branch. - */ - if (prev_packet->sample_type == CS_ETM_RANGE) - prev_packet->flags = packet->flags; break; case CS_ETM_EXCEPTION_RET: /*
When a branch instruction is to be executed, if the branch target address is not mapped into the virtual address space, this branch instruction will trigger an exception with data abort. For this case, CoreSight decoding flow cannot reflect the complete branch flow prior to exception, and leads the user space addresses inconsistency before and after the exception handling.
Let's see the detailed explanation for the issue with an example:
Packet 0: range packet start_addr=0xffffad8018a4 end_addr=0xffffad8018ec Packet 1: exception packet start_addr=0xffffad8018a4 end_addr=0xffffad801910 Packet 2: range packet start_addr=0xffff800010081c00 end_addr=0xffff800010081c18
There have three packets are coming; from packet 0 to packet 1, CPU tries to branch from 0xffffad8018ec-4 to 0xffffad801910, accessing the address 0xffffad801910 causes the data abort, so this branch is not taken and an exception is triggered and jump to 0xffff800010081c00 in packet 2.
When handle this sequence, it misses a range packet for the branch between 0xffffad8018ec-4 and 0xffffad801910, so Perf tool cannot generate a branch sample for it and this might introduce confusion for the addresses before and after exception handling, since we can see the exception return address is 0xffffad801910, which is not a sequential value for the address 0xffffad8018ec-4 before exception was taken.
0xffffad8018ec-4 -> 0xffff800010081c00: exception is taken ... ... exception return back -> 0xffffad801910
To fix this issue, firstly we need to decide which conditions can be used to distinguish that a branch triggers an exception. So below conditions are used to make decision:
- Check if the exception is a trap by comparing the specific sample flag for the exception packet; - The exception packet's end address is not same with its previous range packet's end address, which implies a branch triggering the exception and the branch target address is contained in the exception packet's end address.
This patch changes the exception packet to a 'fake' range packet; this allows to generate an extra branch sample for the branch instruction prior to the exception (between 0xffffad8018ec-4 and 0xffffad801910). So finally can get below samples:
0xffffad8018ec-4 -> 0xffffad801910: branch 0xffffad801910 -> 0xffff800010081c00: exception is taken ... ... exception return back -> 0xffffad801910
Note, this 'fake' range packet will add an extra recording for last branch array and change the thread stack pushing and popping (if later supported). But since 'fake' range packet's instruction length is set to zero, it doesn't introduce any change for instruction samples.
Before:
# perf script -F,+flags
main 3258 1 branches: int ffffad8018e8 dl_main+0x820 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) => ffff800010081c00 vectors+0x400 ([kernel.kallsyms]) main 3258 1 branches: jmp ffff800010081c20 vectors+0x420 ([kernel.kallsyms]) => ffff800010082bc0 el0_sync+0x0 ([kernel.kallsyms]) main 3258 1 branches: jcc ffff800010082c8c el0_sync+0xcc ([kernel.kallsyms]) => ffff800010082ca0 el0_sync+0xe0 ([kernel.kallsyms]) main 3258 1 branches: jmp ffff800010082ca0 el0_sync+0xe0 ([kernel.kallsyms]) => ffff800010082ccc el0_sync+0x10c ([kernel.kallsyms]) [...] main 3258 1 branches: jcc ffff800010083574 finish_ret_to_user+0x34 ([kernel.kallsyms]) => ffff800010083580 finish_ret_to_user+0x40 ([kernel.kallsyms]) main 3258 1 branches: jmp ffff800010083580 finish_ret_to_user+0x40 ([kernel.kallsyms]) => ffff800010083598 finish_ret_to_user+0x58 ([kernel.kallsyms]) main 3258 1 branches: jmp ffff800010083598 finish_ret_to_user+0x58 ([kernel.kallsyms]) => ffff8000100835c4 finish_ret_to_user+0x84 ([kernel.kallsyms]) main 3258 1 branches: iret ffff800010083610 finish_ret_to_user+0xd0 ([kernel.kallsyms]) => ffffad801910 dl_main+0x848 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
After:
# perf script -F,+flags
main 3258 1 branches: jmp ffffad8018e8 dl_main+0x820 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) => ffffad801910 dl_main+0x848 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) main 3258 1 branches: int ffffad801910 dl_main+0x848 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) => ffff800010081c00 vectors+0x400 ([kernel.kallsyms]) main 3258 1 branches: jmp ffff800010081c20 vectors+0x420 ([kernel.kallsyms]) => ffff800010082bc0 el0_sync+0x0 ([kernel.kallsyms]) main 3258 1 branches: jcc ffff800010082c8c el0_sync+0xcc ([kernel.kallsyms]) => ffff800010082ca0 el0_sync+0xe0 ([kernel.kallsyms]) main 3258 1 branches: jmp ffff800010082ca0 el0_sync+0xe0 ([kernel.kallsyms]) => ffff800010082ccc el0_sync+0x10c ([kernel.kallsyms]) [...] main 3258 1 branches: jcc ffff800010083574 finish_ret_to_user+0x34 ([kernel.kallsyms]) => ffff800010083580 finish_ret_to_user+0x40 ([kernel.kallsyms]) main 3258 1 branches: jmp ffff800010083580 finish_ret_to_user+0x40 ([kernel.kallsyms]) => ffff800010083598 finish_ret_to_user+0x58 ([kernel.kallsyms]) main 3258 1 branches: jmp ffff800010083598 finish_ret_to_user+0x58 ([kernel.kallsyms]) => ffff8000100835c4 finish_ret_to_user+0x84 ([kernel.kallsyms]) main 3258 1 branches: iret ffff800010083610 finish_ret_to_user+0xd0 ([kernel.kallsyms]) => ffffad801910 dl_main+0x848 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
Suggested-by: Mike Leach mike.leach@linaro.org Signed-off-by: Leo Yan leo.yan@linaro.org --- .../perf/util/cs-etm-decoder/cs-etm-decoder.c | 1 + tools/perf/util/cs-etm.c | 66 ++++++++++++++++++- 2 files changed, 65 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c index cd92a99eb89d..f1f66d883391 100644 --- a/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c +++ b/tools/perf/util/cs-etm-decoder/cs-etm-decoder.c @@ -482,6 +482,7 @@ cs_etm_decoder__buffer_exception(struct cs_etm_packet_queue *queue,
packet = &queue->packet_buffer[queue->tail]; packet->exception_number = elem->exception_number; + packet->end_addr = elem->en_addr;
return ret; } diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 48932a7a933f..7cf30b5e0e20 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1477,8 +1477,11 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, return 0; }
-static int cs_etm__exception(struct cs_etm_traceid_queue *tidq) +static int cs_etm__exception(struct cs_etm_queue *etmq, + struct cs_etm_traceid_queue *tidq) { + u32 flags; + /* * Usually the exception packet follows a range packet, if it's not the * case, directly bail out. @@ -1486,6 +1489,65 @@ static int cs_etm__exception(struct cs_etm_traceid_queue *tidq) if (tidq->prev_packet->sample_type != CS_ETM_RANGE) return 0;
+ /* + * If the exception is a trap and its end_addr is not same with its + * previous range packet's end_addr, this implies the exception is + * triggered by a branch and the exception packet's end_addr is the + * branch target address from the previous range packet. + * + * Below is an example with three packets: + * Packet 0: range packet + * start_addr=0xffffad8018a4 end_addr=0xffffad8018ec + * Packet 1: exception packet + * start_addr=0xffffad8018a4 end_addr=0xffffad801910 + * Packet 2: range packet + * start_addr=0xffff800010081c00 end_addr=0xffff800010081c18 + * + * CPU tries to branch from 0xffffad8018ec-4 (packet 0) to + * 0xffffad801910 (packet 1), accessing the address 0xffffad801910 + * causes data abort, so the branch is not taken and an exception is + * triggered and jump to 0xffff800010081c00 (packet 2). + * + * For this case, it misses a range packet for the branch between + * 0xffffad8018ec-4 and 0xffffad801910, so perf tool cannot generate + * branch sample and introduces confusion for exception return parsing: + * + * 0xffffad8018ec-4 -> 0xffff800010081c00: exception is taken + * ... exception return back ... -> 0xffffad801910 + * + * To fix this issue, the exception packet is changed to a 'fake' + * range packet. This can allow to generate a branch sample between + * 0xffffad8018ec-4 and 0xffffad801910. Finally get below samples: + * + * 0xffffad8018ec-4 -> 0xffffad801910: branch + * 0xffffad801910 -> 0xffff800010081c00: exception is taken + * ... exception return back ... -> 0xffffad801910 + */ + + /* Use flags to check if the exception is trap */ + flags = PERF_IP_FLAG_BRANCH | PERF_IP_FLAG_CALL | + PERF_IP_FLAG_INTERRUPT; + + if (tidq->packet->sample_type == CS_ETM_EXCEPTION && + tidq->packet->flags == flags && + tidq->packet->end_addr != tidq->prev_packet->end_addr) { + /* + * Change the exception packet to a range packet, so can reflect + * branch from prev_packet::end_addr-4 to packet::start_addr; + * + * This branch is not taken yet, so set its instruction count + * to zero. Set 'last_instr_taken_branch' to true, so allow + * it to generate samples with its seqential range packet. + */ + tidq->packet->sample_type = CS_ETM_RANGE; + tidq->packet->start_addr = tidq->packet->end_addr; + tidq->packet->instr_count = 0; + tidq->packet->last_instr_taken_branch = true; + + /* Generate sample with the previous range packet */ + return cs_etm__sample(etmq, tidq); + } + /* * When the exception packet is inserted, whether the last instruction * in previous range packet is taken branch or not, we need to force @@ -2045,7 +2107,7 @@ static int cs_etm__process_traceid_queue(struct cs_etm_queue *etmq, * make sure the previous instruction * range packet to be handled properly. */ - cs_etm__exception(tidq); + cs_etm__exception(etmq, tidq); break; case CS_ETM_DISCONTINUITY: /*
cs-etm.c has several functions which need to know instruction size based on address, e.g. cs_etm__instr_addr() and cs_etm__copy_insn() two functions both calculate the instruction size separately with its duplicated code. Furthermore, adding new features later which might require to calculate instruction size as well.
For this reason, this patch refactors the code to introduce a new function cs_etm__instr_size(), this function is central place to calculate the instruction size based on ISA type and instruction address.
Given the trace data can be MB and most likely that will be A64/A32 on a lot of the current and future platforms, cs_etm__instr_addr() keeps a single ISA type check for non T32, for this case it executes an optimized calculation (addr + offset * 4).
Signed-off-by: Leo Yan leo.yan@linaro.org --- tools/perf/util/cs-etm.c | 52 ++++++++++++++++++++++++---------------- 1 file changed, 32 insertions(+), 20 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 7cf30b5e0e20..f3ba2cfb634f 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -935,6 +935,26 @@ static inline int cs_etm__t32_instr_size(struct cs_etm_queue *etmq, return ((instrBytes[1] & 0xF8) >= 0xE8) ? 4 : 2; }
+static inline int cs_etm__instr_size(struct cs_etm_queue *etmq, + u8 trace_chan_id, + enum cs_etm_isa isa, + u64 addr) +{ + int insn_len; + + /* + * T32 instruction size might be 32-bit or 16-bit, decide by calling + * cs_etm__t32_instr_size(). + */ + if (isa == CS_ETM_ISA_T32) + insn_len = cs_etm__t32_instr_size(etmq, trace_chan_id, addr); + /* Otherwise, A64 and A32 instruction size are always 32-bit. */ + else + insn_len = 4; + + return insn_len; +} + static inline u64 cs_etm__first_executed_instr(struct cs_etm_packet *packet) { /* Returns 0 for the CS_ETM_DISCONTINUITY packet */ @@ -959,19 +979,19 @@ static inline u64 cs_etm__instr_addr(struct cs_etm_queue *etmq, const struct cs_etm_packet *packet, u64 offset) { - if (packet->isa == CS_ETM_ISA_T32) { - u64 addr = packet->start_addr; + u64 addr = packet->start_addr;
- while (offset) { - addr += cs_etm__t32_instr_size(etmq, - trace_chan_id, addr); - offset--; - } - return addr; + /* Optimize calculation for non T32 */ + if (packet->isa != CS_ETM_ISA_T32) + return addr + offset * 4; + + while (offset) { + addr += cs_etm__instr_size(etmq, trace_chan_id, + packet->isa, addr); + offset--; }
- /* Assume a 4 byte instruction size (A32/A64) */ - return packet->start_addr + offset * 4; + return addr; }
static void cs_etm__update_last_branch_rb(struct cs_etm_queue *etmq, @@ -1111,16 +1131,8 @@ static void cs_etm__copy_insn(struct cs_etm_queue *etmq, return; }
- /* - * T32 instruction size might be 32-bit or 16-bit, decide by calling - * cs_etm__t32_instr_size(). - */ - if (packet->isa == CS_ETM_ISA_T32) - sample->insn_len = cs_etm__t32_instr_size(etmq, trace_chan_id, - sample->ip); - /* Otherwise, A64 and A32 instruction size are always 32-bit. */ - else - sample->insn_len = 4; + sample->insn_len = cs_etm__instr_size(etmq, trace_chan_id, + packet->isa, sample->ip);
cs_etm__mem_access(etmq, trace_chan_id, sample->ip, sample->insn_len, (void *)sample->insn);
Since Arm CoreSight doesn't support thread stack, the decoding cannot display symbols with indented spaces to reflect the stack depth.
This patch adds support thread stack for Arm CoreSight, this allows 'perf script' to display properly for option '-F,+callindent'.
Before:
# perf script -F,+callindent main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.so) main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
After:
# perf script -F,+callindent main 2808 1 branches: coresight_test1 ffff8634f5c8 coresight_test1+0x3c (/root/coresight_test/libcstest.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: printf@plt aaaaba8d36bc printf@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
Signed-off-by: Leo Yan leo.yan@linaro.org --- tools/perf/util/cs-etm.c | 44 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 44 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index f3ba2cfb634f..08ca919aa2b1 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1138,6 +1138,45 @@ static void cs_etm__copy_insn(struct cs_etm_queue *etmq, sample->insn_len, (void *)sample->insn); }
+static void cs_etm__add_stack_event(struct cs_etm_queue *etmq, + struct cs_etm_traceid_queue *tidq) +{ + struct cs_etm_auxtrace *etm = etmq->etm; + u8 trace_chan_id = tidq->trace_chan_id; + int insn_len; + u64 from_ip, to_ip; + + if (etm->synth_opts.thread_stack) { + from_ip = cs_etm__last_executed_instr(tidq->prev_packet); + to_ip = cs_etm__first_executed_instr(tidq->packet); + + insn_len = cs_etm__instr_size(etmq, trace_chan_id, + tidq->prev_packet->isa, from_ip); + + /* + * Create thread stacks by keeping track of calls and returns; + * any call pushes thread stack, return pops the stack, and + * flush stack when the trace is discontinuous. + */ + thread_stack__event(tidq->thread, tidq->prev_packet->cpu, + tidq->prev_packet->flags, + from_ip, to_ip, insn_len, + etmq->buffer->buffer_nr + 1); + } else { + /* + * The thread stack can be output via thread_stack__process(); + * thus the detailed information about paired calls and returns + * will be facilitated by Python script for the db-export. + * + * Need to set trace buffer number and flush thread stack if the + * trace buffer number has been alternate. + */ + thread_stack__set_trace_nr(tidq->thread, + tidq->prev_packet->cpu, + etmq->buffer->buffer_nr + 1); + } +} + static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq, struct cs_etm_traceid_queue *tidq, u64 addr, u64 period) @@ -1382,6 +1421,9 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, tidq->prev_packet->last_instr_taken_branch) cs_etm__update_last_branch_rb(etmq, tidq);
+ if (tidq->prev_packet->last_instr_taken_branch) + cs_etm__add_stack_event(etmq, tidq); + if (etm->sample_instructions && tidq->period_instructions >= etm->instructions_sample_period) { /* @@ -2730,6 +2772,8 @@ int cs_etm__process_auxtrace_info(union perf_event *event, itrace_synth_opts__set_default(&etm->synth_opts, session->itrace_synth_opts->default_no_sample); etm->synth_opts.callchain = false; + etm->synth_opts.thread_stack = + session->itrace_synth_opts->thread_stack; }
err = cs_etm__synth_events(etm, session);
If user specifies option '-F,+callindent' or call chain related options, it means users only care about function calls and returns; for these cases, it's pointless to generate samples for the branches within function. But unlike other hardware trace handling (e.g. Intel's pt or bts), Arm CoreSight doesn't filter branch types for these options and generate samples for all branches, this causes Perf to output many spurious blanks if the branch is not a function call or return.
To only output pairs of calls and returns, this patch introduces branch filter and the filter is set according to synthetic options. Finally, Perf can output only for calls and returns and avoid to output other unnecessary blanks.
Before:
# perf script -F,+callindent main 2808 1 branches: coresight_test1@plt aaaaba8d37d8 main+0x14 (/root/coresight_test/main) main 2808 1 branches: coresight_test1@plt aaaaba8d367c coresight_test1@plt+0xc (/root/coresight_test/main) main 2808 1 branches: _init aaaaba8d3650 _init+0x30 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) main 2808 1 branches: ffff8636a3f4 _dl_lookup_symbol_x+0x5c (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: ffff8636a3f4 _dl_lookup_symbol_x+0x5c (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: ffff8636a3f4 _dl_lookup_symbol_x+0x5c (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: ffff8636a3f4 _dl_lookup_symbol_x+0x5c (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: ffff8636a3f4 _dl_lookup_symbol_x+0x5c (/lib/aarch64-linux-gnu/ld-2.28.s [...]
After:
# perf script -F,+callindent main 2808 1 branches: coresight_test1@plt aaaaba8d37d8 main+0x14 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) main 2808 1 branches: do_lookup_x ffff8636a49c _dl_lookup_symbol_x+0x104 (/lib/aarch64-linux-gnu/ld-2.28. main 2808 1 branches: check_match ffff86369bf0 do_lookup_x+0x238 (/lib/aarch64-linux-gnu/ld-2.28.so) main 2808 1 branches: strcmp ffff86369888 check_match+0x70 (/lib/aarch64-linux-gnu/ld-2.28.so) main 2808 1 branches: printf@plt aaaaba8d37ec main+0x28 (/root/coresight_test/main) main 2808 1 branches: _dl_fixup ffff86373b4c _dl_runtime_resolve+0x40 (/lib/aarch64-linux-gnu/ld-2.28.s main 2808 1 branches: _dl_lookup_symbol_x ffff8636e078 _dl_fixup+0xb8 (/lib/aarch64-linux-gnu/ld-2.28.so) main 2808 1 branches: do_lookup_x ffff8636a49c _dl_lookup_symbol_x+0x104 (/lib/aarch64-linux-gnu/ld-2.28. main 2808 1 branches: _dl_name_match_p ffff86369af0 do_lookup_x+0x138 (/lib/aarch64-linux-gnu/ld-2.28.so) main 2808 1 branches: strcmp ffff8636f7f0 _dl_name_match_p+0x18 (/lib/aarch64-linux-gnu/ld-2.28.so) [...]
Signed-off-by: Leo Yan leo.yan@linaro.org --- tools/perf/util/cs-etm.c | 12 ++++++++++++ 1 file changed, 12 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 08ca919aa2b1..1b08b650b090 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -56,6 +56,7 @@ struct cs_etm_auxtrace {
int num_cpu; u32 auxtrace_type; + u32 branches_filter; u64 branches_sample_type; u64 branches_id; u64 instructions_sample_type; @@ -1239,6 +1240,10 @@ static int cs_etm__synth_branch_sample(struct cs_etm_queue *etmq, } dummy_bs; u64 ip;
+ if (etm->branches_filter && + !(etm->branches_filter & tidq->prev_packet->flags)) + return 0; + ip = cs_etm__last_executed_instr(tidq->prev_packet);
event->sample.header.type = PERF_RECORD_SAMPLE; @@ -2776,6 +2781,13 @@ int cs_etm__process_auxtrace_info(union perf_event *event, session->itrace_synth_opts->thread_stack; }
+ if (etm->synth_opts.calls) + etm->branches_filter |= PERF_IP_FLAG_CALL | PERF_IP_FLAG_ASYNC | + PERF_IP_FLAG_TRACE_END; + if (etm->synth_opts.returns) + etm->branches_filter |= PERF_IP_FLAG_RETURN | + PERF_IP_FLAG_TRACE_BEGIN; + err = cs_etm__synth_events(etm, session); if (err) goto err_delete_thread;
Now CoreSight has supported the thread stack; based on the thread stack we can synthesize call chain for the instruction sample; the call chain can be injected by option '--itrace=g'.
Note the stack event must be processed prior to synthesizing instruction sample; this can ensure the thread stack to push and pop synchronously with instruction sample and the thread stack can be generated correctly for instruction samples. Add a comment for related info.
Before:
# perf script --itrace=g16l64i100 main 1579 100 instructions: ffff0000102137f0 group_sched_in+0xb0 ([kernel.kallsyms]) main 1579 100 instructions: ffff000010213b78 flexible_sched_in+0xf0 ([kernel.kallsyms]) main 1579 100 instructions: ffff0000102135ac event_sched_in.isra.57+0x74 ([kernel.kallsyms]) main 1579 100 instructions: ffff000010219344 perf_swevent_add+0x6c ([kernel.kallsyms]) main 1579 100 instructions: ffff000010214854 perf_event_update_userpage+0x4c ([kernel.kallsyms]) [...]
After:
# perf script --itrace=g16l64i100
main 1579 100 instructions: ffff000010213b78 flexible_sched_in+0xf0 ([kernel.kallsyms]) ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
main 1579 100 instructions: ffff0000102135ac event_sched_in.isra.57+0x74 ([kernel.kallsyms]) ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms]) ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms]) ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms])
main 1579 100 instructions: ffff000010219344 perf_swevent_add+0x6c ([kernel.kallsyms]) ffff0000102135f4 event_sched_in.isra.57+0xbc ([kernel.kallsyms]) ffff0000102137a0 group_sched_in+0x60 ([kernel.kallsyms]) ffff000010213b84 flexible_sched_in+0xfc ([kernel.kallsyms]) ffff00001020c0b4 visit_groups_merge+0x12c ([kernel.kallsyms]) [...]
Signed-off-by: Leo Yan leo.yan@linaro.org --- tools/perf/util/cs-etm.c | 40 ++++++++++++++++++++++++++++++++++++++-- 1 file changed, 38 insertions(+), 2 deletions(-)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 1b08b650b090..d9c22c145307 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -17,6 +17,7 @@ #include <stdlib.h>
#include "auxtrace.h" +#include "callchain.h" #include "color.h" #include "cs-etm.h" #include "cs-etm-decoder/cs-etm-decoder.h" @@ -74,6 +75,7 @@ struct cs_etm_traceid_queue { size_t last_branch_pos; union perf_event *event_buf; struct thread *thread; + struct ip_callchain *chain; struct branch_stack *last_branch; struct branch_stack *last_branch_rb; struct cs_etm_packet *prev_packet; @@ -251,6 +253,16 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq, if (!tidq->prev_packet) goto out_free;
+ if (etm->synth_opts.callchain) { + size_t sz = sizeof(struct ip_callchain); + + /* Add 1 to callchain_sz for callchain context */ + sz += (etm->synth_opts.callchain_sz + 1) * sizeof(u64); + tidq->chain = zalloc(sz); + if (!tidq->chain) + goto out_free; + } + if (etm->synth_opts.last_branch) { size_t sz = sizeof(struct branch_stack);
@@ -273,6 +285,7 @@ static int cs_etm__init_traceid_queue(struct cs_etm_queue *etmq, out_free: zfree(&tidq->last_branch_rb); zfree(&tidq->last_branch); + zfree(&tidq->chain); zfree(&tidq->prev_packet); zfree(&tidq->packet); out: @@ -561,6 +574,7 @@ static void cs_etm__free_traceid_queues(struct cs_etm_queue *etmq) zfree(&tidq->event_buf); zfree(&tidq->last_branch); zfree(&tidq->last_branch_rb); + zfree(&tidq->chain); zfree(&tidq->prev_packet); zfree(&tidq->packet); zfree(&tidq); @@ -1147,7 +1161,7 @@ static void cs_etm__add_stack_event(struct cs_etm_queue *etmq, int insn_len; u64 from_ip, to_ip;
- if (etm->synth_opts.thread_stack) { + if (etm->synth_opts.callchain || etm->synth_opts.thread_stack) { from_ip = cs_etm__last_executed_instr(tidq->prev_packet); to_ip = cs_etm__first_executed_instr(tidq->packet);
@@ -1203,6 +1217,14 @@ static int cs_etm__synth_instruction_sample(struct cs_etm_queue *etmq,
cs_etm__copy_insn(etmq, tidq->trace_chan_id, tidq->packet, &sample);
+ if (etm->synth_opts.callchain) { + thread_stack__sample(tidq->thread, tidq->packet->cpu, + tidq->chain, + etm->synth_opts.callchain_sz + 1, + sample.ip, etm->kernel_start); + sample.callchain = tidq->chain; + } + if (etm->synth_opts.last_branch) sample.branch_stack = tidq->last_branch;
@@ -1385,6 +1407,8 @@ static int cs_etm__synth_events(struct cs_etm_auxtrace *etm, attr.sample_type &= ~(u64)PERF_SAMPLE_ADDR; }
+ if (etm->synth_opts.callchain) + attr.sample_type |= PERF_SAMPLE_CALLCHAIN; if (etm->synth_opts.last_branch) attr.sample_type |= PERF_SAMPLE_BRANCH_STACK;
@@ -1426,6 +1450,11 @@ static int cs_etm__sample(struct cs_etm_queue *etmq, tidq->prev_packet->last_instr_taken_branch) cs_etm__update_last_branch_rb(etmq, tidq);
+ /* + * The stack event must be processed prior to synthesizing + * instruction sample; this can ensure the instruction samples + * to generate correct thread stack. + */ if (tidq->prev_packet->last_instr_taken_branch) cs_etm__add_stack_event(etmq, tidq);
@@ -2776,7 +2805,6 @@ int cs_etm__process_auxtrace_info(union perf_event *event, } else { itrace_synth_opts__set_default(&etm->synth_opts, session->itrace_synth_opts->default_no_sample); - etm->synth_opts.callchain = false; etm->synth_opts.thread_stack = session->itrace_synth_opts->thread_stack; } @@ -2788,6 +2816,14 @@ int cs_etm__process_auxtrace_info(union perf_event *event, etm->branches_filter |= PERF_IP_FLAG_RETURN | PERF_IP_FLAG_TRACE_BEGIN;
+ if (etm->synth_opts.callchain && !symbol_conf.use_callchain) { + symbol_conf.use_callchain = true; + if (callchain_register_param(&callchain_param) < 0) { + symbol_conf.use_callchain = false; + etm->synth_opts.callchain = false; + } + } + err = cs_etm__synth_events(etm, session); if (err) goto err_delete_thread;
In theory when an exception is taken, the thread stack is pushed with an expected return address (ret_addr): from_ip + insn_len; and later when the exception returns back, it compares the return address (from the new packet's to_ip) with the ret_addr in the of thread stack, if have the same values then the thread stack will be popped.
When a branch instruction's target address triggers an exception, the thread stack's ret_addr is the branch target address plus instruction length for exception entry; but this branch instruction is not taken, the exception return address is the branch target address, thus the thread stack's ret_addr cannot match with the exception return address, so the thread stack cannot pop properly.
This patch fixes up the ret_addr at the exception entry, when it detects the exception is triggered by a branch target address, it sets 'insn_len' to zero. This allows the thread stack can pop properly when return from exception.
Before:
# perf script --itrace=g16l64i100
main 3258 100 instructions: ffff800010082c1c el0_sync+0x5c ([kernel.kallsyms]) ffffad816a14 memcpy+0x4 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800820 _dl_start_final+0x48 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800b00 _dl_start+0x200 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800048 _start+0x8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800044 _start+0x4 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
The issues in the output: memcpy+0x4 => The function call memcpy() causes exception; it's return address should be memcpy+0x0. _start+0x4 => The thread stack is not popped correctly, this is a stale data which is left in the previous exception flow.
After:
# perf script --itrace=g16l64i100
main 3258 100 instructions: ffff800010082c1c el0_sync+0x5c ([kernel.kallsyms]) ffffad816a10 memcpy+0x0 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800820 _dl_start_final+0x48 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800b00 _dl_start+0x200 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800048 _start+0x8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
Signed-off-by: Leo Yan leo.yan@linaro.org --- tools/perf/util/cs-etm.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index d9c22c145307..4800daf0dc3d 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1160,6 +1160,7 @@ static void cs_etm__add_stack_event(struct cs_etm_queue *etmq, u8 trace_chan_id = tidq->trace_chan_id; int insn_len; u64 from_ip, to_ip; + u32 flags;
if (etm->synth_opts.callchain || etm->synth_opts.thread_stack) { from_ip = cs_etm__last_executed_instr(tidq->prev_packet); @@ -1168,6 +1169,27 @@ static void cs_etm__add_stack_event(struct cs_etm_queue *etmq, insn_len = cs_etm__instr_size(etmq, trace_chan_id, tidq->prev_packet->isa, from_ip);
+ /* + * Fixup the exception entry. + * + * If the packet's start_addr is same with its end_addr, this + * packet was altered from a exception packet to a range packet; + * the detailed info is described in cs_etm__exception(), which + * is used to handle the case for a branch instruction is not + * taken but the branch triggers an exception. + * + * In this case, fixup 'insn_len' to zero so that allow the + * thread stack's return address can match with the exception + * return address, finally can pop up thread stack properly when + * return from exception. + */ + flags = PERF_IP_FLAG_BRANCH | PERF_IP_FLAG_CALL | + PERF_IP_FLAG_INTERRUPT; + if (tidq->prev_packet->flags == flags && + tidq->prev_packet->start_addr == + tidq->prev_packet->end_addr) + insn_len = 0; + /* * Create thread stacks by keeping track of calls and returns; * any call pushes thread stack, return pops the stack, and
For the instruction emulation or single step in kernel, when return to the user space, the return address is not possible to be the same with the ret_addr in thread stack.
This patch adds a helper to read out the top return address from thread stack, this can be used for specific calibration in up case.
Signed-off-by: Leo Yan leo.yan@linaro.org --- tools/perf/util/thread-stack.c | 10 ++++++++++ tools/perf/util/thread-stack.h | 1 + 2 files changed, 11 insertions(+)
diff --git a/tools/perf/util/thread-stack.c b/tools/perf/util/thread-stack.c index 0885967d5bc3..60cd6fdca8de 100644 --- a/tools/perf/util/thread-stack.c +++ b/tools/perf/util/thread-stack.c @@ -497,6 +497,16 @@ void thread_stack__sample(struct thread *thread, int cpu, chain->nr = i; }
+u64 thread_stack__get_top_ret_addr(struct thread *thread, int cpu) +{ + struct thread_stack *ts = thread__stack(thread, cpu); + + if (!ts || !ts->cnt) + return UINT64_MAX; + + return ts->stack[ts->cnt--].ret_addr; +} + struct call_return_processor * call_return_processor__new(int (*process)(struct call_return *cr, u64 *parent_db_id, void *data), void *data) diff --git a/tools/perf/util/thread-stack.h b/tools/perf/util/thread-stack.h index e1ec5a58f1b2..b9d07a3be6c2 100644 --- a/tools/perf/util/thread-stack.h +++ b/tools/perf/util/thread-stack.h @@ -88,6 +88,7 @@ void thread_stack__sample(struct thread *thread, int cpu, struct ip_callchain *c int thread_stack__flush(struct thread *thread); void thread_stack__free(struct thread *thread); size_t thread_stack__depth(struct thread *thread, int cpu); +u64 thread_stack__get_top_ret_addr(struct thread *thread, int cpu);
struct call_return_processor * call_return_processor__new(int (*process)(struct call_return *cr, u64 *parent_db_id, void *data),
For instruction emulation or other cases (like ptrace), the program will be trapped into kernel and the kernel executes the instruction with single step, so the exception return address will be one instruction ahead than the recorded address 'ret_addr' in the thread stack. Finally, it's impossible to pop up the thread stack due to the mismatch between the return address and 'ret_addr'.
To fix this issue, this patch reads out the 'ret_addr' from the top of thread stack, and if detects the exception return address is one instruction ahead than 'ret_addr', it implies the kernel has executed single step. For this case, calibrate 'to_ip' to 'ret_addr' so can allow the thread stack to pop up.
Before:
main 3258 100 instructions: ffff800010095f48 do_emulate_mrs+0x48 ([kernel.kallsyms]) ffff800010096060 emulate_mrs+0x48 ([kernel.kallsyms]) ffff8000100904ec do_undefinstr+0x1f4 ([kernel.kallsyms]) ffff80001008788c el0_sync_handler+0x124 ([kernel.kallsyms]) ffff800010082d00 el0_sync+0x140 ([kernel.kallsyms]) ffffad8137d8 _dl_sysdep_start+0x2f8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800884 _dl_start_final+0xac (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800b00 _dl_start+0x200 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800048 _start+0x8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
main 3258 100 instructions: ffff8000100835fc finish_ret_to_user+0xbc ([kernel.kallsyms]) ffffad8137d8 _dl_sysdep_start+0x2f8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800884 _dl_start_final+0xac (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800b00 _dl_start+0x200 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800048 _start+0x8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
main 3258 100 instructions: ffffad801138 dl_main+0x70 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad81384c _dl_sysdep_start+0x36c (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad8137d8 _dl_sysdep_start+0x2f8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800884 _dl_start_final+0xac (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800b00 _dl_start+0x200 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800048 _start+0x8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
Note: after return back from instruction emulation with emulate_mrs(), _dl_sysdep_start+0x2f8 cannot be popped up.
After:
main 3258 100 instructions: ffff800010095f48 do_emulate_mrs+0x48 ([kernel.kallsyms]) ffff800010096060 emulate_mrs+0x48 ([kernel.kallsyms]) ffff8000100904ec do_undefinstr+0x1f4 ([kernel.kallsyms]) ffff80001008788c el0_sync_handler+0x124 ([kernel.kallsyms]) ffff800010082d00 el0_sync+0x140 ([kernel.kallsyms]) ffffad8137d8 _dl_sysdep_start+0x2f8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800884 _dl_start_final+0xac (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800b00 _dl_start+0x200 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800048 _start+0x8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
main 3258 100 instructions: ffff8000100835fc finish_ret_to_user+0xbc ([kernel.kallsyms]) ffffad8137d8 _dl_sysdep_start+0x2f8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800884 _dl_start_final+0xac (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800b00 _dl_start+0x200 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800048 _start+0x8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
main 3258 100 instructions: ffffad801138 dl_main+0x70 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad81384c _dl_sysdep_start+0x36c (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800884 _dl_start_final+0xac (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800b00 _dl_start+0x200 (/usr/lib/aarch64-linux-gnu/ld-2.28.so) ffffad800048 _start+0x8 (/usr/lib/aarch64-linux-gnu/ld-2.28.so)
Signed-off-by: Leo Yan leo.yan@linaro.org --- tools/perf/util/cs-etm.c | 28 ++++++++++++++++++++++++++++ 1 file changed, 28 insertions(+)
diff --git a/tools/perf/util/cs-etm.c b/tools/perf/util/cs-etm.c index 4800daf0dc3d..7ff55704de5c 100644 --- a/tools/perf/util/cs-etm.c +++ b/tools/perf/util/cs-etm.c @@ -1190,6 +1190,34 @@ static void cs_etm__add_stack_event(struct cs_etm_queue *etmq, tidq->prev_packet->end_addr) insn_len = 0;
+ /* + * Fixup the exception exit. + * + * For instruction emulation or single step cases, when return + * from exception, since an extra instruction has been executed + * in kernel, the exception return address 'top_ip' is an + * instruction ahead than the expected address 'ret_addr' in + * thread stack. + * + * When detects this case, calibrate 'to_ip' to 'ret_addr' so + * can pop up thread stack. + */ + flags = PERF_IP_FLAG_BRANCH | PERF_IP_FLAG_RETURN | + PERF_IP_FLAG_INTERRUPT; + if (tidq->prev_packet->flags == flags) { + u64 ret_addr; + int ret_insn_len; + + ret_addr = thread_stack__get_top_ret_addr(tidq->thread, + tidq->prev_packet->cpu); + ret_insn_len = cs_etm__instr_size(etmq, + trace_chan_id, + tidq->packet->isa, + ret_addr); + if (to_ip == ret_addr + ret_insn_len) + to_ip = ret_addr; + } + /* * Create thread stacks by keeping track of calls and returns; * any call pushes thread stack, return pops the stack, and
Quoting Leo Yan (2020-02-19 21:26:52)
This patch series adds support for thread stack and callchain; this patch set depends on the instruction sample fix patch set [1].
This patch set get more complex, so before divide into small groups, I'd like to use this patch set version to include all relevant patches, hope this can give whole context for related code change.
Was this split up into small groups and sent again? I didn't see anything when searching lkml.
Hi Stephen,
On Thu, Nov 05, 2020 at 02:50:56PM -0800, Stephen Boyd wrote:
Quoting Leo Yan (2020-02-19 21:26:52)
This patch series adds support for thread stack and callchain; this patch set depends on the instruction sample fix patch set [1].
This patch set get more complex, so before divide into small groups, I'd like to use this patch set version to include all relevant patches, hope this can give whole context for related code change.
Was this split up into small groups and sent again? I didn't see anything when searching lkml.
No, this patch series is the last one for upstreaming; since I worked on other stuffs, so didn't continue to upstream to the mainline kernel.
IIRC, there have a concern for a pontential breakage for perf cs-etm testing, so falls to backlog. Let me check with Mathieu/Mike offline for how to proceed for this patch set. Thanks for bringing up.
Leo