On Thu, May 21, 2026 at 08:16:29PM +0800, Yingchao Deng wrote:
[...]
> Qualcomm implements an extended variant of the ARM CoreSight CTI with a
> different register layout and vendor-specific behavior. While the
> programming model remains largely compatible, the register offsets differ
> from the standard ARM CTI and require explicit handling.
I cannot apply this patch successfuly. Please rebase on the latest
coresight-next branch.
> @@ -726,6 +734,22 @@ static int cti_probe(struct amba_device *adev, const struct amba_id *id)
>
> raw_spin_lock_init(&drvdata->spinlock);
>
> + devarch = readl_relaxed(drvdata->base + CORESIGHT_DEVARCH);
> + if (CTI_DEVARCH_ARCHITECT(devarch) == ARCHITECT_QCOM) {
> + drvdata->is_qcom_cti = true;
> + /*
> + * QCOM CTI does not implement Claimtag functionality as
> + * per CoreSight specification, but its CLAIMSET register
> + * is incorrectly initialized to 0xF. This can mislead
> + * tools or drivers into thinking the component is claimed.
> + *
> + * Reset CLAIMSET to 0 to reflect that no claims are active.
> + */
> + CS_UNLOCK(drvdata->base);
> + writel_relaxed(0, drvdata->base + CORESIGHT_CLAIMSET);
> + CS_LOCK(drvdata->base);
Sorry I missed this piece before.
Can you move this quirk into firmware? I don't think the CTI driver
should clear the external claim bit as this totally break the protocol
defined in PSCI. A clean way would clear the bits in firmware and then
CTI driver can use the CLAIM registers.
Or, another option is to create several helpers to bypass claim
operations for Qcom CTI:
static void cti_clear_self_claim_tag(cti_drvdata *drvdata,
struct csdev_access *csa)
{
if (drvdata->is_qcom_cti)
return;
coresight_clear_self_claim_tag(csa);
}
static int cti_claim_device(cti_drvdata *drvdata)
{
if (drvdata->is_qcom_cti)
return 0;
return coresight_claim_device(drvdata->csdev);
}
static void cti_unclaim_device_unlocked(cti_drvdata *drvdata)
{
if (drvdata->is_qcom_cti)
return;
return coresight_disclaim_device_unlocked(drvdata->csdev);
}
Otherwise, this patch is fine for me.
Thanks,
Leo
On Tue, May 19, 2026 at 04:48:08PM +0100, Yeoreum Yun wrote:
> In the perf enable path, there are missing cases where
> cscfg_csdev_disable_active_config() is not called:
>
> - Branch broadcast is selected but not supported by the hardware
> - etm4_enable_hw() fails
>
> This can lead to a leak of config_desc->active_cnt.
> Fix this by properly calling cscfg_csdev_disable_active_config()
> in these error paths.
>
> Fixes: 810ac401db1f ("coresight: etm4x: Add complex configuration handlers to etmv4")
> Signed-off-by: Yeoreum Yun <yeoreum.yun(a)arm.com>
> ---
> .../hwtracing/coresight/coresight-etm4x-core.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c
> index f0a9f2ef4b27..0889937811cb 100644
> --- a/drivers/hwtracing/coresight/coresight-etm4x-core.c
> +++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c
> @@ -899,6 +899,8 @@ static int etm4_parse_event_config(struct coresight_device *csdev,
> * Missing BB support could cause silent decode errors
> * so fail to open if it's not supported.
> */
> + if (cfg_hash)
> + cscfg_csdev_disable_active_config(csdev);
> ret = -EINVAL;
> goto out;
> } else {
> @@ -915,6 +917,7 @@ static int etm4_enable_perf(struct coresight_device *csdev,
> struct coresight_path *path)
> {
> struct etmv4_drvdata *drvdata = dev_get_drvdata(csdev->dev.parent);
> + struct perf_event_attr *attr = &event->attr;
> int ret;
>
> if (WARN_ON_ONCE(drvdata->cpu != smp_processor_id()))
> @@ -926,7 +929,7 @@ static int etm4_enable_perf(struct coresight_device *csdev,
> /* Configure the tracer based on the session's specifics */
> ret = etm4_parse_event_config(csdev, event);
> if (ret)
> - goto out;
> + goto err;
>
> drvdata->trcid = path->trace_id;
>
> @@ -935,16 +938,19 @@ static int etm4_enable_perf(struct coresight_device *csdev,
>
> /* And enable it */
> ret = etm4_enable_hw(drvdata);
> -
> -out:
> - /* Failed to start tracer; roll back to DISABLED mode */
> if (ret) {
> - coresight_set_mode(csdev, CS_MODE_DISABLED);
> - return ret;
> + if (ATTR_CFG_GET_FLD(attr, configid))
> + cscfg_csdev_disable_active_config(csdev);
> + goto err;
> }
>
> csdev->path = path;
> return 0;
> +
> +err:
> + /* Failed to start tracer; roll back to DISABLED mode */
> + coresight_set_mode(csdev, CS_MODE_DISABLED);
> + return ret;
> }
Seems to me, it is not readable if spread error handling into both child
and parent functions. I would like we can stick to use the same function
for resource allocation and rollback for failures.
How about below change (I did not verify it yet):
diff --git a/drivers/hwtracing/coresight/coresight-etm4x-core.c b/drivers/hwtracing/coresight/coresight-etm4x-core.c
index 0889937811cb..8347efb2069b 100644
--- a/drivers/hwtracing/coresight/coresight-etm4x-core.c
+++ b/drivers/hwtracing/coresight/coresight-etm4x-core.c
@@ -794,8 +794,7 @@ static int etm4_parse_event_config(struct coresight_device *csdev,
.ATTR_CFG_FLD_timestamp_CFG = U64_MAX,
};
struct perf_event_attr *attr = &event->attr;
- unsigned long cfg_hash;
- int preset, cc_threshold;
+ int cc_threshold;
u8 ts_level;
/* Clear configuration from previous run */
@@ -882,16 +881,6 @@ static int etm4_parse_event_config(struct coresight_device *csdev,
/* bit[12], Return stack enable bit */
config->cfg |= TRCCONFIGR_RS;
- /*
- * Set any selected configuration and preset. A zero configid means no
- * configuration active, preset = 0 means no preset selected.
- */
- cfg_hash = ATTR_CFG_GET_FLD(attr, configid);
- if (cfg_hash) {
- preset = ATTR_CFG_GET_FLD(attr, preset);
- ret = cscfg_csdev_enable_active_config(csdev, cfg_hash, preset);
- }
-
/* branch broadcast - enable if selected and supported */
if (ATTR_CFG_GET_FLD(attr, branch_broadcast)) {
if (!caps->trcbb) {
@@ -899,8 +888,6 @@ static int etm4_parse_event_config(struct coresight_device *csdev,
* Missing BB support could cause silent decode errors
* so fail to open if it's not supported.
*/
- if (cfg_hash)
- cscfg_csdev_disable_active_config(csdev);
ret = -EINVAL;
goto out;
} else {
@@ -918,7 +905,8 @@ static int etm4_enable_perf(struct coresight_device *csdev,
{
struct etmv4_drvdata *drvdata = dev_get_drvdata(csdev->dev.parent);
struct perf_event_attr *attr = &event->attr;
- int ret;
+ unsigned long cfg_hash;
+ int preset, ret;
if (WARN_ON_ONCE(drvdata->cpu != smp_processor_id()))
return -EINVAL;
@@ -931,6 +919,18 @@ static int etm4_enable_perf(struct coresight_device *csdev,
if (ret)
goto err;
+ /*
+ * Set any selected configuration and preset. A zero configid means no
+ * configuration active, preset = 0 means no preset selected.
+ */
+ cfg_hash = ATTR_CFG_GET_FLD(attr, configid);
+ if (cfg_hash) {
+ preset = ATTR_CFG_GET_FLD(attr, preset);
+ ret = cscfg_csdev_enable_active_config(csdev, cfg_hash, preset);
+ if (ret)
+ goto err;
+ }
+
drvdata->trcid = path->trace_id;
/* Populate pause state */
@@ -938,15 +938,15 @@ static int etm4_enable_perf(struct coresight_device *csdev,
/* And enable it */
ret = etm4_enable_hw(drvdata);
- if (ret) {
- if (ATTR_CFG_GET_FLD(attr, configid))
- cscfg_csdev_disable_active_config(csdev);
- goto err;
- }
+ if (ret)
+ goto err_enable_hw;
csdev->path = path;
return 0;
+err_enable_hw:
+ if (cfg_hash)
+ cscfg_csdev_disable_active_config(csdev);
err:
/* Failed to start tracer; roll back to DISABLED mode */
coresight_set_mode(csdev, CS_MODE_DISABLED);
On Thu, May 28, 2026 at 03:26:57PM +0100, Yeoreum Yun wrote:
[...]
> > > @@ -47,7 +47,7 @@ static int etm4_cfg_map_reg_offset(struct etmv4_drvdata *drvdata,
> > > struct cscfg_regval_csdev *reg_csdev, u32 offset)
> > > {
> > > int err = -EINVAL, idx;
> > > - struct etmv4_config *drvcfg = &drvdata->config;
> > > + struct etmv4_config *drvcfg = &drvdata->active_config;
> >
> > Wouldn't it make more sense to keep using drvdata->config for cfgfs?
> > This would avoid cfgfs updating the active configuration while a session
> > is enabled.
> >
> > My understanding is after refactoring cfgfs, we will never expose
> > active_config or config anymore so this should not an issue. Before
> > that, maybe we just keep to use config so can avoid mess.
>
> I don't think so. since the cfgfs's config is updated right before
> enable and in this patchset, It's applied after "take the mode".
>
> If we do it into drvdta->config, then contention would be increased
> with the access of sysfs and that nullifies almost this patch purpose
> and because of we use "one config" this makes a corruption with perf
> and sysfs.
Here have two different contention: the contention between sysfs and
cfgfs, and the contention between cfgfs and a runtime config (active
config). My suggestion is to avoid the later contention, which is the
main idea in this series that avoid the runtime config is modified in
the middle of a session.
That said, I have no strong opinion.
> > > static void etm4_enable_sysfs_smp_call(void *info)
> > > {
> > > struct etm4_enable_arg *arg = info;
> > > + struct etmv4_drvdata *drvdata;
> > > struct coresight_device *csdev;
> > > + unsigned long cfg_hash;
> > > + int preset;
> > >
> > > if (WARN_ON(!arg))
> > > return;
> > >
> > > - csdev = arg->drvdata->csdev;
> > > + drvdata = arg->drvdata;
> > > + csdev = drvdata->csdev;
> > > if (!coresight_take_mode(csdev, CS_MODE_SYSFS)) {
> > > /* Someone is already using the tracer */
> > > arg->rc = -EBUSY;
> > > return;
> > > }
> > >
> > > - arg->rc = etm4_enable_hw(arg->drvdata);
> > > + /* enable any config activated by configfs */
> > > + cscfg_config_sysfs_get_active_cfg(&cfg_hash, &preset);
> > >
> > > - /* The tracer didn't start */
> > > + drvdata->active_config = arg->config;
> >
> > Can move this down to just before drvdata->trcid assignment so cscfg
> > operations can be put together?
>
> No. Copy the arg->config must be before cscfg_csdev_enable_active_config().
> If that's after, it overrides the "preset" and this is different from
> the former behavior.
Please copy config first, then do cscfg stuffs.
> > > + arg->rc = etm4_enable_hw(drvdata);
> > > if (arg->rc) {
> > > - coresight_set_mode(csdev, CS_MODE_DISABLED);
> > > - return;
> > > + cscfg_csdev_disable_active_config(csdev);
> > > + goto err;
> >
> > Add a new goto tag (like err_enable_hw) for the disable_active_config
> > rollback?
>
> This is only place where cscfg_csdev_disable_active_config().
> Why do we need to goto for cleanup where there is no duplication?
It is about to use a central place for handling errors.
> > > @@ -1449,7 +1458,7 @@ static void etm4_init_arch_data(void *info)
> > >
> > > /* EXLEVEL_S, bits[19:16] Secure state instruction tracing */
> > > caps->s_ex_level = FIELD_GET(TRCIDR3_EXLEVEL_S_MASK, etmidr3);
> > > - drvdata->config.s_ex_level = caps->s_ex_level;
> > > + config->s_ex_level = caps->s_ex_level;
> >
> > config->s_ex_level is redundant and not really a configuration item.
> > We should use caps->s_ex_level instead of config->s_ex_level wherever
> > it is used.
>
> Agree, but this includes it should change the omse function
> declations to pass "caps" argument:
> - etm4_set_comparator_filter()
> - etm4_set_start_stop_filter()
>
> If that's okay, I'll respin with it.
This is fine for me.
Thanks,
Leo
On Tue, May 19, 2026 at 04:48:04PM +0100, Yeoreum Yun wrote:
[...]
> @@ -571,11 +571,11 @@ static int etm4_enable_hw(struct etmv4_drvdata *drvdata)
> etm4x_relaxed_write32(csa, config->res_ctrl[i], TRCRSCTLRn(i));
>
> for (i = 0; i < caps->nr_ss_cmp; i++) {
> - /* always clear status bit on restart if using single-shot */
> + /* always clear status and pending bits on restart if using single-shot */
> if (config->ss_ctrl[i] || config->ss_pe_cmp[i])
> - config->ss_status[i] &= ~TRCSSCSRn_STATUS;
> + drvdata->ss_status[i] &= ~(TRCSSCSRn_STATUS | TRCSSCSRn_PENDING);
> etm4x_relaxed_write32(csa, config->ss_ctrl[i], TRCSSCCRn(i));
> - etm4x_relaxed_write32(csa, config->ss_status[i], TRCSSCSRn(i));
> + etm4x_relaxed_write32(csa, drvdata->ss_status[i], TRCSSCSRn(i));
> if (etm4x_sspcicrn_present(drvdata, i))
> etm4x_relaxed_write32(csa, config->ss_pe_cmp[i], TRCSSPCICRn(i));
> }
> @@ -772,6 +772,7 @@ static int etm4_parse_event_config(struct coresight_device *csdev,
> /* Clear configuration from previous run */
> memset(config, 0, sizeof(struct etmv4_config));
>
> +
Unexpected new line?
> @@ -1497,8 +1498,9 @@ static void etm4_init_arch_data(void *info)
> */
> caps->nr_ss_cmp = FIELD_GET(TRCIDR4_NUMSSCC_MASK, etmidr4);
> for (i = 0; i < caps->nr_ss_cmp; i++) {
> - drvdata->config.ss_status[i] =
> - etm4x_relaxed_read32(csa, TRCSSCSRn(i));
> + drvdata->ss_status[i] = etm4x_relaxed_read32(csa, TRCSSCSRn(i));
> + drvdata->ss_status[i] &= (TRCSSCSRn_PC | TRCSSCSRn_DV |
> + TRCSSCSRn_DA | TRCSSCSRn_INST);
Since etm4_enable_hw() clears the TRCSSCSRn_STATUS and
TRCSSCSRn_PENDING bits every time, here is no need to clear the
status bits during probe.
In the future, we may want to preserve the status within a session and
clear it only when starting a new session. Clearing the status bits here
still cannot handle stale status across multiple sessions, so we can
defer this improvement for later.
Thanks,
Leo
On Tue, May 19, 2026 at 04:48:02PM +0100, Yeoreum Yun wrote:
> According to IHI006H Embedded Trace Macrocell Architecture
> Specification [0], n could be 0-2 for TCRSEQEVR<n> when
> TCRIDR5.NUMSEQSTATE is 0b100.
>
> Therefore, introduce ETM_MAX_SEQ_TRANSITIONS macro and apply this
> in TCRSEQEVR<n> relevant fields.
>
> Link: https://developer.arm.com/documentation/ihi0064/latest/ [0]
> Suggestedby: Leo Yan <leo.yan(a)arm.com>
s/Suggestedby/Suggested-by
> Signed-off-by: Yeoreum Yun <yeoreum.yun(a)arm.com>
The change looks good to me:
Reviewed-by: Leo Yan <leo.yan(a)arm.com>
This series adds thread-stack and synthesized callchain support for Arm
CoreSight, which comes from older series [1] but heavily rewritten.
CS ETM previously kept last-branch state in a per-trace-queue buffer.
That effectively makes the state per CPU, while the call/return history
belongs to a thread. This series moves branch tracking to the common
thread-stack code.
The series records CoreSight branches with thread_stack__event(), uses
thread_stack__br_sample() for last branch entries, flushes thread stacks
after decoder resets.
A decoder reset between AUX trace buffers is treated as a global trace
discontinuity, so all thread stacks are flushed, so avoids carrying
stale call/return history across a trace discontinuity.
One limitation remains for instructions emulated by the kernel. In that
case the exception return address may not match the return address
stored in the thread stack, because after exception return can be one
instruction ahead. The stack can still recover when a later return
matches an upper caller. Given emulated instructions are not the common
target for performance callchain analysis. Supporting this would require
extending the common thread-stack path to accept both the real target
address and an adjusted address for stack matching, so this series
leaves that extra complexity out.
The series has been tested on Orion6 board:
perf test 150 -vvv
150: Check Arm CoreSight synthesized callchain:
--- start ---
test child forked, pid 13528
Test callchain push: PASS
Test callchain pop: PASS
---- end(0) ----
150: Check Arm CoreSight synthesized callchain : Ok
perf script --itrace=g16i10il64
callchain_test 17468 [005] 1031003.229943: 10 instructions:
aaaac32507c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
ffff90bd233c call_init+0x9c (inlined)
ffff90bd233c __libc_start_main_impl+0x9c (inlined)
aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 17468 [005] 1031003.229943: 10 instructions:
aaaac3250774 do_svc+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
ffff90bd233c call_init+0x9c (inlined)
ffff90bd233c __libc_start_main_impl+0x9c (inlined)
aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 17468 [005] 1031003.229944: 10 instructions:
ffff800080010c20 vectors+0x420 ([kernel.kallsyms])
aaaac3250784 do_svc+0x1c (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
ffff90bd233c call_init+0x9c (inlined)
ffff90bd233c __libc_start_main_impl+0x9c (inlined)
aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
Note, the test fails on Juno board which is caused by many discontinuity
packets (mainly caused by NO_SYNC elem). This is likely caused by the
FIFO overflow on the path.
[1] https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@lina…
Signed-off-by: Leo Yan <leo.yan(a)arm.com>
---
Leo Yan (8):
perf cs-etm: Decode ETE exception packets
perf cs-etm: Refactor instruction size handling
perf cs-etm: Use thread-stack for last branch entries
perf cs-etm: Flush thread stacks after decoder reset
perf cs-etm: Support call indentation
perf cs-etm: Filter synthesized branch samples
perf cs-etm: Synthesize callchains for instruction samples
perf test: Add Arm CoreSight callchain test
.../tests/shell/test_arm_coresight_callchain.sh | 235 ++++++++++++++++
tools/perf/util/cs-etm.c | 309 ++++++++++++---------
2 files changed, 408 insertions(+), 136 deletions(-)
---
base-commit: bd2a5be1fe731bc7548205dd148db75f1d588da2
change-id: 20260521-b4-arm_cs_callchain_support_v1-2c2a70719bcc
Best regards,
--
Leo Yan <leo.yan(a)arm.com>
On 15/05/2026 3:11 am, Amir Ayupov wrote:
> In a system-wide `perf record -e cs_etm/.../u` capture on aarch64,
> synthesized samples emitted by `perf script --itrace=il64` are
> sometimes attributed to the WRONG sample.pid/tid (and to the wrong
> EL/cpumode) for the chunk of branches that straddle a context-switch
> boundary on a CPU. A branch actually retired by process A is emitted
> with sample.pid set to the thread that next ran on the same CPU.
>
> Mechanism:
> 1. ETM emits CONTEXTIDR/EL packets in-stream when the kernel updates
> CONTEXTIDR_EL1 on context switch / EL change. OpenCSD turns these
> into OCSD_GEN_TRC_ELEM_PE_CONTEXT elements interleaved with
> OCSD_GEN_TRC_ELEM_INSTR_RANGE elements for retired branch ranges.
> 2. cs_etm_decoder__buffer_range() queues each INSTR_RANGE into
> packet_queue->packet_buffer[]; packets carry start/end addrs,
> instr_count, last-instruction info, etc., but NO owner identity.
> 3. PE_CONTEXT goes through cs_etm_decoder__set_tid() ->
> cs_etm__set_thread(), which immediately mutates tidq->thread and
> tidq->el. Queued packets are not drained first; reset_timestamp()
> is called so the next TIMESTAMP triggers OCSD_RESP_WAIT and a
> drain.
> 4. By drain time in cs_etm__process_traceid_queue() ->
> cs_etm__sample(), sample.pid/tid is read from the now-mutated
> tidq->thread and sample.cpumode from the now-mutated tidq->el.
> Pre-context INSTR_RANGEs get the post-context owner.
>
> The same race affects branch samples via tidq->prev_packet_thread /
> tidq->prev_packet_el, captured at packet-swap time from
> tidq->thread / tidq->el (which may already have flipped).
>
> This is independent of PERF_RECORD_SWITCH_CPU_WIDE, which is
> deliberately not used to assign sample identity in this path. The
> bug applies to any cs_etm capture with in-stream CONTEXTIDR
> (PIDFMT_CTXTID or PIDFMT_CTXTID2).
>
> Effect on downstream tools: branches that should belong to the
> previous thread on the CPU get attributed to the next thread. When
> the two threads share a binary, leaked branches' VAs land in the
> wrong thread's mappings; samples whose IPs land in r-x mappings
> silently pollute that binary's profile, while samples landing in
> R-only/RW mappings show up as out-of-range / non-text samples.
> Either way, AutoFDO/BOLT profiles built from `perf script --itrace`
> output of system-wide cs_etm captures contain misattributed samples.
>
> Concrete example from `perf script --itrace=il64` of the same
> captured branch (same timestamp, same IP, same from/to addrs) before
> and after this fix:
>
> before: launcher_multia 2638146/2638146 705897.219172: \
> fffcda6b124c 0xfffcda641958/0xfffcda6b123c
> after: ws-tcf-sr-io13 2736581/2741587 705897.219172: \
> fffcda6b124c 0xfffcda641958/0xfffcda6b123c
>
> The branch was retired by ws-tcf-sr-io13 (tid 2741587) but, before
> the fix, was attributed to launcher_multia (the next thread to run on
> that CPU after the context switch). After the fix, it is correctly
> attributed to ws-tcf-sr-io13.
>
> Why not "drain on PE_CONTEXT then switch" (deferred-set_thread):
> tidq->thread has two consumers \u2014 sample emission needs the OUTGOING
> identity for queued packets, but cs_etm__mem_access() needs the
> CURRENT thread's maps to fetch instruction bytes for OpenCSD. The
> two needs are temporally inverted; a single tidq->thread cannot
> serve both. Keeping tidq->thread current and stamping owner identity
> per packet is the only design that decouples them cleanly.
>
> Fix: capture the owning pid/tid/EL on each buffered packet at
> cs_etm_decoder__buffer_packet() time (before any subsequent
> PE_CONTEXT can mutate tidq->thread / tidq->el), and read them at
> sample emission time.
>
> - struct cs_etm_packet gains pid_t pid, pid_t tid, int el (storing
> an ocsd_ex_level value; typed as int so the struct does not
> depend on OpenCSD headers, which are only included inside
> HAVE_CSTRACE_SUPPORT).
> - cs_etm__etmq_get_pid_tid_el() (formerly cs_etm__etmq_get_pid_tid)
> returns all three.
> - cs_etm__synth_instruction_sample() reads sample.pid / sample.tid
> from tidq->packet->{pid,tid} and derives sample.cpumode from
> tidq->packet->el.
> - cs_etm__synth_branch_sample() reads sample.pid / sample.tid /
> cpumode from tidq->prev_packet->{pid,tid,el}.
> - The separate prev_packet_thread / prev_packet_el bookkeeping in
> cs_etm__packet_swap() / cs_etm__init_traceid_queue() /
> cs_etm__free_traceid_queues() is removed; the per-packet stamp
> on prev_packet now carries that information.
>
> Cost: 12 bytes added to struct cs_etm_packet (~12-16 KB per
> packet_queue with CS_ETM_PACKET_MAX_BUFFER=1024), 16 bytes saved per
> cs_etm_traceid_queue (one struct thread * + one ocsd_ex_level).
>
> A residual gap: cs_etm__copy_insn() reads sample.insn bytes via
> cs_etm__mem_access(), which still uses tidq->thread (the current
> thread), so the inline insn bytes for an outgoing-thread sample may
> be looked up against the wrong address space. Fixing this requires
> threading the packet's owner pid through cs_etm__mem_access and is
> left for a follow-up. sample.ip / sample.pid attribution \u2014 what
> AutoFDO/BOLT consume \u2014 is correct.
>
Hi Amir,
Can you test the patch here to see if it fixes your issue [1]?
We thought it didn't make sense to store the thread on every packet when
there is only one active thread for the decoder and one for sample
generation. We also fixed the other issue mentioned above about
cs_etm__copy_insn() not working.
Thanks
James
[1]:
https://lore.kernel.org/linux-perf-users/20260526-james-cs-context-tracking…
Fix thread tracking when decoding Coresight trace and add a new test for
it.
Unfortunately the test has to be added as a separate binary like the
other existing Coresight workloads which needs a bit of makefile
boilerplate. Ideally it would be a builtin Perf test workload, but Perf
has a lot of dependencies and generates a lot of trace when starting up
which makes tracing tests very slow because all that has to be decoded.
Hopefully I can find a generic way to enable Perf events at the
beginning of the workload and then I can migrate everything from
tools/perf/tests/shell/coresight to Perf test workloads, but that will
have to be done as a separate change.
Signed-off-by: James Clark <james.clark(a)linaro.org>
---
James Clark (1):
perf test cs-etm: Test thread attribution
Leo Yan (1):
perf cs-etm: Queue context packets for frontend
tools/perf/tests/shell/coresight/Makefile | 1 +
.../shell/coresight/context_switch_loop/.gitignore | 1 +
.../shell/coresight/context_switch_loop/Makefile | 29 ++++
.../context_switch_loop/context_switch_loop.c | 83 ++++++++++
.../tests/shell/coresight/context_switch_thread.sh | 48 ++++++
tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 18 +-
tools/perf/util/cs-etm.c | 181 +++++++++++++--------
tools/perf/util/cs-etm.h | 5 +-
8 files changed, 295 insertions(+), 71 deletions(-)
---
base-commit: 09d355618f7ccc27ffc7fc668b2e232872962079
change-id: 20260515-james-cs-context-tracking-fix-754998bae7ed
Best regards,
--
James Clark <james.clark(a)linaro.org>
On 20/05/2026 11:13 am, Jie Gan wrote:
>
>
> On 5/20/2026 5:27 PM, Mike Leach wrote:
>>
>>
>>> -----Original Message-----
>>> From: James Clark <james.clark(a)linaro.org>
>>> Sent: Wednesday, May 20, 2026 9:38 AM
>>> To: Jie Gan <jie.gan(a)oss.qualcomm.com>
>>> Cc: coresight(a)lists.linaro.org; linux-arm-kernel(a)lists.infradead.org;
>>> linux-
>>> kernel(a)vger.kernel.org; Suzuki Poulose <Suzuki.Poulose(a)arm.com>; Mike
>>> Leach <Mike.Leach(a)arm.com>; Leo Yan <Leo.Yan(a)arm.com>; Alexander
>>> Shishkin <alexander.shishkin(a)linux.intel.com>; Mathieu Poirier
>>> <mathieu.poirier(a)linaro.org>; Tingwei Zhang
>>> <tingwei.zhang(a)oss.qualcomm.com>; Greg Kroah-Hartman
>>> <gregkh(a)linuxfoundation.org>
>>> Subject: Re: [PATCH] coresight: fix resource leaks on path build failure
>>>
>>>
>>>
>>> On 20/05/2026 2:55 am, Jie Gan wrote:
>>>>
>>>>
>>>> On 5/19/2026 9:57 PM, James Clark wrote:
>>>>>
>>>>>
>>>>> On 13/05/2026 2:32 am, Jie Gan wrote:
>>>>>> Two related leaks when _coresight_build_path() encounters an error
>>>>>> after
>>>>>> coresight_grab_device() has already incremented the pm_runtime,
>>> module,
>>>>>> and device references for a node:
>>>>>>
>>>>>> 1. In _coresight_build_path(), if kzalloc_obj() for the path node
>>>>>> fails
>>>>>> after coresight_grab_device() succeeds,
>>>>>> coresight_drop_device() was
>>>>>> never called, permanently leaking all three references.
>>>>>>
>>>>>> 2. In coresight_build_path(), on failure the partial path was
>>>>>> freed with
>>>>>> kfree(path) instead of coresight_release_path(path). kfree()
>>>>>> only
>>>>>> frees the coresight_path struct itself; it does not iterate
>>>>>> path_list
>>>>>> to call coresight_drop_device() and kfree() for each
>>>>>> coresight_node
>>>>>> already added by deeper recursive calls, leaking both the
>>>>>> pm_runtime,
>>>>>> module, and device references and the node memory for every
>>>>>> element
>>>>>> on the partial path.
>>>>>>
>>>>>> Fix both by adding coresight_drop_device() in the OOM unwind of
>>>>>> _coresight_build_path(), and replacing kfree(path) with
>>>>>> coresight_release_path(path) in coresight_build_path().
>>>>>>
>>>>>> Fixes: 32b0707a4182 ("coresight: Add try_get_module() in
>>>>>> coresight_grab_device()")
>>>>>> Fixes: b3e94405941e ("coresight: associating path with session rather
>>>>>> than tracer")
>>>>>> Signed-off-by: Jie Gan <jie.gan(a)oss.qualcomm.com>
>>>>>> ---
>>>>>> drivers/hwtracing/coresight/coresight-core.c | 6 ++++--
>>>>>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/hwtracing/coresight/coresight-core.c b/drivers/
>>>>>> hwtracing/coresight/coresight-core.c
>>>>>> index 46f247f73cf6..c1354ea8e11d 100644
>>>>>> --- a/drivers/hwtracing/coresight/coresight-core.c
>>>>>> +++ b/drivers/hwtracing/coresight/coresight-core.c
>>>>>> @@ -825,8 +825,10 @@ static int _coresight_build_path(struct
>>>>>> coresight_device *csdev,
>>>>>> return ret;
>>>>>> node = kzalloc_obj(struct coresight_node);
>>>>>> - if (!node)
>>>>>> + if (!node) {
>>>>>> + coresight_drop_device(csdev);
>>>>>> return -ENOMEM;
>>>>>> + }
>>>>>> node->csdev = csdev;
>>>>>> list_add(&node->link, &path->path_list);
>>>>>> @@ -851,7 +853,7 @@ struct coresight_path
>>>>>> *coresight_build_path(struct coresight_device *source,
>>>>>> rc = _coresight_build_path(source, source, sink, path);
>>>>>> if (rc) {
>>>>>> - kfree(path);
>>>>>> + coresight_release_path(path);
>>>>>> return ERR_PTR(rc);
>>>>>> }
>>>>>>
>>>>>> ---
>>>>>> base-commit: e98d21c170b01ddef366f023bbfcf6b31509fa83
>>>>>> change-id: 20260513-fix-memory-leak-issue-034b4a45265e
>>>>>>
>>>>>> Best regards,
>>>>>
>>>>> Looks good to me, but sashiko is complaining: https://sashiko.dev/#/
>>>>> patchset/20260513-fix-memory-leak-issue-
>>>>> v1-1-49822d7bc7d4%40oss.qualcomm.com
>>>>>
>>>>> I'm trying to understand why it's saying that, but I think the
>>>>> scenario is that if there are multiple correct paths to a sink, when
>>>>> one path partially fails and a second path succeeds you could get a
>>>>> path_list with some garbage entries in it.
>>>>
>>>> I think the coresight_release_path is added to address this situation.
>>>> We suffered the path partially failure, and we need release all nodes
>>>> already added to the path.
>>>>
>>>
>>> It wouldn't call coresight_release_path() in this case though. If one
>>> path ends up building to success but a parallel path partially failed
>>> then _coresight_build_path() still returns success. During the search it
>>> would have still added the nodes from the partially failed path to the
>>> path_list. This is only an issue if there are multiple correct paths.
>>>
>
> The point here is there are multiple routes from the same source device
> to the same sink device, am right?
>
> I have no experience on this scenario. So with the scenario, the
> build_path may succeeded in one route and failed in another route, but
> finally, the _coresight_build_path still returns success, is that correct?
>
>>>>>
>>>>> That's kind of a different and existing issue to the one you've fixed,
>>>>> and assumes that multiple paths to one sink are possible, which I'm
>>>>> not sure is supported?
>>>>
>>>> Each path is unique. We only deal with the issue path for balancing the
>>>> reference count.
>>>>
>>>> Thanks,
>>>> Jie
>>>>
>>>
>>> I'm not exactly sure what you mean by unique, but the same source and
>>> sink could potentially be connected through two different sets of links.
>>>
>>
>> Multiple paths between a source and sink are not permitted under the
>> CoreSight spec.
>>
>
> As Mike mentioned, my understanding is that a source device is only
> allowed to be added to one valid path—this is what I mean by “unique.”
>
> Thanks,
> Jie
>
That's ok then we can ignore this for this patch. But it would be good
to enforce that in _coresight_build_path() with some kind of assert. Or
at least add a comment to appease the AI reviewers.
>> If such a system was to be built - then a fix would need to be in the
>> declaration of connections - e.g. miss one path out in the device tree
>> for example. Not up to the Coresight drivers to handle out of
>> specification hardware.
>>
>> Mike
>>
>>
>>>>>
>>>>> It might be as easy as breaking the loop early for any return value
>>>>> other than -ENODEV, but I'll leave it to you to decide whether to do
>>>>> that here or not.
>>>>>
>>>>> Reviewed-by: James Clark <james.clark(a)linaro.org>
>>>>>
>>>>
>>
>