On 16/06/2026 3:51 pm, Leo Yan wrote:
This series adds thread-stack and synthesized callchain support for Arm CoreSight. It derives from an older series [1] but has been heavily rewritten.
CS ETM previously kept last-branch state in a per-trace-queue buffer. That effectively makes the state per CPU, while the call/return history belongs to a thread. This series moves branch tracking to the common thread-stack code.
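To illustrate why per-queue (effectively per-CPU) state does not fit call/return history, here is a toy sketch. It is not perf code: the event tuples, the thread IDs and the group_by() helper are all made up for illustration. Thread 100 migrates from CPU 0 to CPU 1, so grouping by CPU interleaves two threads on CPU 0 and splits thread 100's history across two CPUs, while grouping by thread keeps it together.

```python
from collections import defaultdict

# Hypothetical branch events: (cpu, tid, description).
# Thread 100 runs on CPU 0, then migrates to CPU 1.
events = [
    (0, 100, "call main->foo"),
    (0, 200, "call main->bar"),
    (1, 100, "call foo->print"),
]


def group_by(events, key_index):
    """Collect branch descriptions keyed by CPU (index 0) or thread (index 1)."""
    history = defaultdict(list)
    for ev in events:
        history[ev[key_index]].append(ev[2])
    return dict(history)


per_cpu = group_by(events, 0)     # mixes threads 100 and 200 on CPU 0
per_thread = group_by(events, 1)  # thread 100's history stays contiguous
```

With per-thread keying, thread 100 sees "call main->foo" followed by "call foo->print", which is the history a callchain needs.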
The series records CoreSight branches with thread_stack__event(), uses thread_stack__br_sample() for last branch entries, and flushes thread stacks after decoder resets.
A decoder reset between AUX trace buffers is treated as a global trace discontinuity, so all thread stacks are flushed. This avoids carrying stale call/return history across the discontinuity.
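The flush policy can be pictured with a small toy model. This is only a sketch under assumptions: ToyThreadStack, on_call() and on_decoder_reset() are hypothetical names and do not mirror the real thread-stack API. The point is that a reset drops the history of every thread, not only the thread that was running when the reset was seen.

```python
class ToyThreadStack:
    """Hypothetical stand-in for a per-thread call stack."""

    def __init__(self):
        self.frames = []

    def call(self, ret_addr):
        self.frames.append(ret_addr)

    def flush(self):
        self.frames.clear()


stacks = {}  # tid -> ToyThreadStack


def on_call(tid, ret_addr):
    """Record a call for one thread, creating its stack on first use."""
    stacks.setdefault(tid, ToyThreadStack()).call(ret_addr)


def on_decoder_reset():
    """A reset is a global discontinuity: flush every thread's stack."""
    for stack in stacks.values():
        stack.flush()


on_call(100, 0x7C8)
on_call(200, 0x7B0)
depth_before = {tid: len(s.frames) for tid, s in stacks.items()}
on_decoder_reset()
depth_after = {tid: len(s.frames) for tid, s in stacks.items()}
```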
One limitation remains for instructions emulated by the kernel. In that case the exception return address may not match the return address stored in the thread stack, because execution after the exception return can resume one instruction ahead. The stack can still recover when a later return matches an upper caller. Emulated instructions are not a common target for performance callchain analysis, and supporting them would require extending the common thread-stack path to accept both the real target address and an adjusted address for stack matching, so this series leaves that extra complexity out.
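A toy model of the mismatch and the later recovery, assuming a simplified matching rule (search the stack from the top for the return target and pop down to it, otherwise leave the stack alone). The addresses and the ret() helper are hypothetical and the real thread-stack code has more cases than this.

```python
def ret(stack, target):
    """Pop down to the frame whose return address equals target.

    Returns True on a match; on no match the stack is left untouched.
    """
    for i in range(len(stack) - 1, -1, -1):
        if stack[i] == target:
            del stack[i:]
            return True
    return False


# Hypothetical return addresses pushed by main -> foo -> print -> do_svc.
stack = [0x7C8, 0x7B0, 0x798]

# Exception taken on an instruction at 0x780; that address is the expected return.
stack.append(0x780)

# The kernel emulates the instruction and returns one instruction ahead.
matched = ret(stack, 0x784)
after_exception_return = list(stack)  # stale 0x780 entry is still present

# do_svc later returns to print; matching the upper caller drops the stale entry.
recovered = ret(stack, 0x798)
```

After the second return the stack holds only the two outer return addresses, so the stale exception entry no longer affects later callchains.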
The series has been tested on an Orion6 board:

  perf test 136 -vvv
  136: CoreSight synthesized callchain:
  --- start ---
  test child forked, pid 3539
  ---- end(0) ----
  136: CoreSight synthesized callchain : Ok
perf script --itrace=g16i10il64
  callchain_test 17468 [005] 1031003.229943: 10 instructions:
      aaaac32507c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
      ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
      ffff90bd233c call_init+0x9c (inlined)
      ffff90bd233c __libc_start_main_impl+0x9c (inlined)
      aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

  callchain_test 17468 [005] 1031003.229943: 10 instructions:
      aaaac3250774 do_svc+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
      aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
      aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
      aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
      ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
      ffff90bd233c call_init+0x9c (inlined)
      ffff90bd233c __libc_start_main_impl+0x9c (inlined)
      aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)

  callchain_test 17468 [005] 1031003.229944: 10 instructions:
      ffff800080010c20 vectors+0x420 ([kernel.kallsyms])
      aaaac3250784 do_svc+0x1c (/home/kernel/leoy/test_cs_callchain/callchain_test)
      aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
      aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
      aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
      ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
      ffff90bd233c call_init+0x9c (inlined)
      ffff90bd233c __libc_start_main_impl+0x9c (inlined)
      aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
Note that the test fails on the Juno board because of many discontinuity packets (mainly NO_SYNC elements). This is likely caused by FIFO overflow on the path.
It passes 20/20 times on my r0 Juno board. But I wonder if reducing the timestamp interval fixes it by reducing the rate of trace generation?
I'm just about to send patches that will change it to "timestamp=14" for --per-thread mode (only context packet timestamps) and "timestamp=7" for per-cpu mode (64 cycle interval).
[1] https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@linar...
Signed-off-by: Leo Yan <leo.yan@arm.com>
Changes in v9:
- Added patch 01 to fix a thread leak during trace queue init (sashiko).
- Added check in instruction and branch samples in cs_etm__add_stack_event() (sashiko).
- Released frontend_thread properly in cs_etm__context() (sashiko).
- Refined cs_etm__flush_all_stack() to use switch (sashiko).
- Gathered James' review tags.
- Rebased on the latest perf-tools-next.
- Link to v8: https://lore.kernel.org/r/20260611-b4-arm_cs_callchain_support_v1-v8-0-73794...
Changes in v8:
- Updated test_arm_coresight_disasm.sh to pass "--itrace=b" and updated examples in arm-cs-trace-disasm.py (James).
- Removed static annotation in callchain workload and renamed functions with prefix "callchain_" to reduce naming conflict (James).
- For callchain test pre-condition check, removed the aarch64 check and added the root permission check (James).
- Resolved the shellcheck errors (James).
- Link to v7: https://lore.kernel.org/r/20260611-b4-arm_cs_callchain_support_v1-v7-0-1ba77...
Changes in v7:
- Rebased on the latest perf-tools-next.
- Used struct_size() for allocating the callchain struct (James).
- Added a helper cs_etm__packet_has_taken_branch() (James).
- Minor improvements for the callchain test (used record-ctl FIFO and reworked the validation callstack push / pop).
- Link to v6: https://lore.kernel.org/r/20260526-b4-arm_cs_callchain_support_v1-v6-0-f9f49...
Changes in v6:
- Heavily rewrote the patches after restarting the work following a 6-year gap.
- Changed to use the common thread-stack for branch stack and callchain management.
- Added a callchain test.
- Link to v5: https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@linar...
Changes in v5:
- Addressed Mike's suggestion to improve the performance of cs_etm__instr_addr() with a quick calculation for non-T32;
- Removed the patch 'perf cs-etm: Synchronize instruction sample with the thread stack' (Mike);
- Fixed the issue where an exception is taken when accessing the branch target address, for branch sample and thread stack handling; the related patches are 01, 02, 07;
- Fixed the thread stack handling for instruction emulation and single step with patches 08, 09.
- Link to v4: https://lore.kernel.org/linux-arm-kernel/20200203020716.31832-1-leo.yan@lina...
Leo Yan (9):
      perf cs-etm: Fix thread leaks on trace queue init failure
      perf cs-etm: Filter synthesized branch samples
      perf cs-etm: Decode ETE exception packets
      perf cs-etm: Refactor instruction size handling
      perf cs-etm: Use thread-stack for last branch entries
      perf cs-etm: Flush thread stacks after decoder reset
      perf cs-etm: Support call indentation
      perf cs-etm: Synthesize callchains for instruction samples
      perf test: Add Arm CoreSight callchain test

 tools/perf/Documentation/perf-test.txt             |   6 +-
 tools/perf/scripts/python/arm-cs-trace-disasm.py   |   9 +-
 tools/perf/tests/builtin-test.c                    |   1 +
 tools/perf/tests/shell/coresight/callchain.sh      | 172 ++++++++++
 .../shell/coresight/test_arm_coresight_disasm.sh   |   4 +-
 tools/perf/tests/tests.h                           |   1 +
 tools/perf/tests/workloads/Build                   |   2 +
 tools/perf/tests/workloads/callchain.c             |  33 ++
 tools/perf/util/cs-etm.c                           | 367 +++++++++++++--------
 9 files changed, 446 insertions(+), 149 deletions(-)
---
base-commit: 44543bb53ebfecb5ce4e890053a666affab9e482
change-id: 20260521-b4-arm_cs_callchain_support_v1-2c2a70719bcc
Best regards,