On Thu, May 21, 2026 at 08:16:30PM +0800, Yingchao Deng wrote:
[...]
> @@ -515,18 +543,36 @@ static struct attribute *coresight_cti_regs_attrs[] = {
> &dev_attr_appclear.attr,
> &dev_attr_apppulse.attr,
> coresight_cti_reg(triginstatus, CTITRIGINSTATUS),
> + coresight_cti_reg_index(triginstatus1, CTITRIGINSTATUS, 1),
> + coresight_cti_reg_index(triginstatus2, CTITRIGINSTATUS, 2),
> + coresight_cti_reg_index(triginstatus3, CTITRIGINSTATUS, 3),
For this patch:
Reviewed-by: Leo Yan <leo.yan(a)arm.com>
An AI review tool points out that
Documentation/ABI/testing/sysfs-bus-coresight-devices-cti needs updating; you
might need to add a description in a new patch:
What: /sys/bus/coresight/devices/<cti-name>/regs/trigoutstatus[1-3]
Date: May 2026
KernelVersion: 7.2
Contact: coresight(a)lists.linaro.org
Description: (Read) read current status of QCOM extended output trigger signals.
Please also add documentation for the other new sysfs knobs.
Thanks,
Leo
Fix thread tracking when decoding Coresight trace and add a new test for
it.
Unfortunately the test has to be added as a separate binary like the
other existing Coresight workloads which needs a bit of makefile
boilerplate. Ideally it would be a builtin Perf test workload, but Perf
has a lot of dependencies and generates a lot of trace when starting up
which makes tracing tests very slow because all that has to be decoded.
Hopefully I can find a generic way to enable Perf events at the
beginning of the workload and then I can migrate everything from
tools/perf/tests/shell/coresight to Perf test workloads, but that will
have to be done as a separate change.
Signed-off-by: James Clark <james.clark(a)linaro.org>
---
James Clark (1):
perf test cs-etm: Test thread attribution
Leo Yan (1):
perf cs-etm: Queue context packets for frontend
tools/perf/tests/shell/coresight/Makefile | 1 +
.../shell/coresight/context_switch_loop/.gitignore | 1 +
.../shell/coresight/context_switch_loop/Makefile | 29 ++++
.../context_switch_loop/context_switch_loop.c | 83 ++++++++++
.../tests/shell/coresight/context_switch_thread.sh | 48 ++++++
tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 18 +-
tools/perf/util/cs-etm.c | 181 +++++++++++++--------
tools/perf/util/cs-etm.h | 5 +-
8 files changed, 295 insertions(+), 71 deletions(-)
---
base-commit: 09d355618f7ccc27ffc7fc668b2e232872962079
change-id: 20260515-james-cs-context-tracking-fix-754998bae7ed
Best regards,
--
James Clark <james.clark(a)linaro.org>
This series adds thread-stack and synthesized callchain support for Arm
CoreSight. It derives from an older series [1] but is heavily rewritten.
CS ETM previously kept last-branch state in a per-trace-queue buffer.
That effectively makes the state per CPU, while the call/return history
belongs to a thread. This series moves branch tracking to the common
thread-stack code.
The series records CoreSight branches with thread_stack__event(), uses
thread_stack__br_sample() for last branch entries, and flushes thread
stacks after decoder resets.
A decoder reset between AUX trace buffers is treated as a global trace
discontinuity, so all thread stacks are flushed. This avoids carrying
stale call/return history across the discontinuity.
One limitation remains for instructions emulated by the kernel. In that
case the exception return address may not match the return address
stored in the thread stack, because the exception return can land one
instruction ahead. The stack can still recover when a later return
matches an upper caller. Emulated instructions are not a common target
for performance callchain analysis, and supporting them would require
extending the common thread-stack path to accept both the real target
address and an adjusted address for stack matching, so this series
leaves that extra complexity out.
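The recovery behaviour described above can be illustrated with a small stand-alone sketch. This is not perf's thread-stack code: struct call_stack, stack_push() and stack_pop_to() are hypothetical names, and the stack is reduced to a plain array of return addresses. The point is only that a return whose address matches no entry leaves the stack alone, while a later return that matches an upper caller pops the stale entry along with it.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_MAX 64

/* Hypothetical, simplified call stack: one return address per entry. */
struct call_stack {
	uint64_t ret_addr[STACK_MAX];
	size_t cnt;
};

/* Record a call: remember the address it is expected to return to. */
static int stack_push(struct call_stack *cs, uint64_t ret_addr)
{
	if (cs->cnt >= STACK_MAX)
		return -1;
	cs->ret_addr[cs->cnt++] = ret_addr;
	return 0;
}

/*
 * Handle a return to 'addr'. Search from the top of the stack for an
 * entry whose return address matches, then pop that entry and every
 * entry above it. If nothing matches, the stack is left untouched.
 * Returns the number of entries popped.
 */
static size_t stack_pop_to(struct call_stack *cs, uint64_t addr)
{
	size_t i = cs->cnt;

	while (i > 0) {
		i--;
		if (cs->ret_addr[i] == addr) {
			size_t popped = cs->cnt - i;

			cs->cnt = i;
			return popped;
		}
	}
	return 0;
}
```

With entries for 0x1000 and 0x2000 pushed in that order, a return to 0x2004 (one instruction past the stored address) pops nothing, and a subsequent return to 0x1000 pops both entries.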
The series has been tested on an Orion6 board:
perf test 150 -vvv
150: Check Arm CoreSight synthesized callchain:
--- start ---
test child forked, pid 13528
Test callchain push: PASS
Test callchain pop: PASS
---- end(0) ----
150: Check Arm CoreSight synthesized callchain : Ok
perf script --itrace=g16i10il64
callchain_test 17468 [005] 1031003.229943: 10 instructions:
aaaac32507c4 main+0x8 (/home/kernel/leoy/test_cs_callchain/callchain_test)
ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
ffff90bd233c call_init+0x9c (inlined)
ffff90bd233c __libc_start_main_impl+0x9c (inlined)
aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 17468 [005] 1031003.229943: 10 instructions:
aaaac3250774 do_svc+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
ffff90bd233c call_init+0x9c (inlined)
ffff90bd233c __libc_start_main_impl+0x9c (inlined)
aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
callchain_test 17468 [005] 1031003.229944: 10 instructions:
ffff800080010c20 vectors+0x420 ([kernel.kallsyms])
aaaac3250784 do_svc+0x1c (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac3250798 print+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac32507b0 foo+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
aaaac32507c8 main+0xc (/home/kernel/leoy/test_cs_callchain/callchain_test)
ffff90bd225c __libc_start_call_main+0x7c (/usr/lib/aarch64-linux-gnu/libc.so.6)
ffff90bd233c call_init+0x9c (inlined)
ffff90bd233c __libc_start_main_impl+0x9c (inlined)
aaaac3250670 _start+0x30 (/home/kernel/leoy/test_cs_callchain/callchain_test)
Note that the test fails on the Juno board because of many discontinuity
packets (mainly NO_SYNC elements). These are likely caused by FIFO
overflow on the path.
[1] https://lore.kernel.org/linux-arm-kernel/20200220052701.7754-1-leo.yan@lina…
Signed-off-by: Leo Yan <leo.yan(a)arm.com>
---
Leo Yan (8):
perf cs-etm: Decode ETE exception packets
perf cs-etm: Refactor instruction size handling
perf cs-etm: Use thread-stack for last branch entries
perf cs-etm: Flush thread stacks after decoder reset
perf cs-etm: Support call indentation
perf cs-etm: Filter synthesized branch samples
perf cs-etm: Synthesize callchains for instruction samples
perf test: Add Arm CoreSight callchain test
.../tests/shell/test_arm_coresight_callchain.sh | 235 ++++++++++++++++
tools/perf/util/cs-etm.c | 309 ++++++++++++---------
2 files changed, 408 insertions(+), 136 deletions(-)
---
base-commit: bd2a5be1fe731bc7548205dd148db75f1d588da2
change-id: 20260521-b4-arm_cs_callchain_support_v1-2c2a70719bcc
Best regards,
--
Leo Yan <leo.yan(a)arm.com>
On 29/05/2026 11:26, Jie Gan wrote:
>
>
> On 2/27/2026 6:10 PM, Suzuki K Poulose wrote:
>> Hello,
>>
>>
>> On 04/02/2026 02:22, Jie Gan wrote:
>>> The DT‑binding patch adds platform‑specific compatibles for the
>>> CTCU device, and the following Qualcomm platforms are included:
>>> Kaanapali
>>> Pakala(sm8750)
>>> Hamoa(x1e80100)
>>> Glymur
>>
>> Given this is predominantly DTS changes, and there is very low chances
>> of a conflict with the binding yaml change, I would recommend this to go
>> via soc or the qcom platform tree.
>>
>> For the series:
>>
>> Acked-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
>
> Hi Suzuki,
>
> May I ask is there a chance this patch series could go through the
> CoreSight tree?
Like I said, it is mostly Qcom platform changes, so I would leave it to
the appropriate channel.
Suzuki
>
> Thanks a lot.
> Jie
>
>>
>>
>>>
>>> Since the base Coresight DT patches for the Kaanapali and Glymur
>>> platforms have not yet been applied, I created DT patches only
>>> for the Pakala and Hamoa platforms. I will submit the Kaanapali
>>> and Glymur patches once their corresponding base Coresight DT patches
>>> are merged.
>>>
>>> The Hamoa‑related patches were posted in a separate email, and I
>>> have included them in the current patch series.
>>>
>>> Link to the previous Hamoa patch series:
>>> https://lore.kernel.org/all/20251106-enable-etr-and-ctcu-for-hamoa-
>>> v2-0-cdb3a18753aa(a)oss.qualcomm.com/
>>>
>>> Signed-off-by: Jie Gan <jie.gan(a)oss.qualcomm.com>
>>> ---
>>> Changes in v3:
>>> - change back to the numeric compatible from hamoa to x1e80100.
>>> - Link to v2: https://lore.kernel.org/r/20260203-enable-ctcu-and-etr-
>>> v2-0-aacc7bd7eccb(a)oss.qualcomm.com
>>>
>>> Changes in v2:
>>> - change back to the numeric compatible from pakala to sm8750.
>>> - Link to v1: https://lore.kernel.org/r/20260203-enable-ctcu-and-etr-
>>> v1-0-a5371a2ec2b8(a)oss.qualcomm.com
>>>
>>> ---
>>> Jie Gan (3):
>>> dt-binding: document QCOM platforms for CTCU device
>>> arm64: dts: qcom: hamoa: enable ETR and CTCU devices
>>> arm64: dts: qcom: sm8750: enable ETR and CTCU devices
>>>
>>> .../bindings/arm/qcom,coresight-ctcu.yaml | 4 +
>>> arch/arm64/boot/dts/qcom/hamoa.dtsi | 160 ++++++++++
>>> + +++++++-
>>> arch/arm64/boot/dts/qcom/sm8750.dtsi | 177 ++++++++++
>>> + ++++++++++
>>> 3 files changed, 340 insertions(+), 1 deletion(-)
>>> ---
>>> base-commit: 193579fe01389bc21aff0051d13f24e8ea95b47d
>>> change-id: 20260203-enable-ctcu-and-etr-31f9e9d1088d
>>>
>>> Best regards,
>>
>
On Mon, 11 May 2026 12:19:18 +0800, Jie Gan wrote:
> coresight_add_out_conn() increments nr_outconns before calling
> devm_krealloc_array() and again before devm_kmalloc(). If either
> allocation fails, the counter is already bumped while the corresponding
> array entry is NULL or uninitialized garbage.
>
> coresight_add_in_conn() has the same problem with nr_inconns and
> devm_krealloc_array().
>
> [...]
Applied, thanks!
[1/1] coresight: platform: defer connection counter increment until alloc succeeds
https://git.kernel.org/coresight/c/1563ae33dc4f
Best regards,
--
Suzuki K Poulose <suzuki.poulose(a)arm.com>
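The pattern named in the patch title (bump the counter only after every allocation has succeeded) can be sketched in userspace C. This is an illustration under assumptions, not the applied kernel change: struct conn, struct conn_table and conn_table_add() are hypothetical, and realloc()/malloc() stand in for devm_krealloc_array()/devm_kmalloc().

```c
#include <stdlib.h>

/* Hypothetical connection record, standing in for a CoreSight connection. */
struct conn {
	int dest_port;
};

/* Hypothetical table: 'nr_conns' must only count fully initialised slots. */
struct conn_table {
	struct conn **conns;
	size_t nr_conns;
};

/*
 * Append a connection. The counter is incremented last, so a failed
 * allocation never leaves 'nr_conns' covering a NULL or garbage slot.
 * Returns 0 on success, -1 on allocation failure.
 */
static int conn_table_add(struct conn_table *t, int dest_port)
{
	struct conn **grown;
	struct conn *c;

	/* Grow the pointer array by one slot without touching the counter. */
	grown = realloc(t->conns, (t->nr_conns + 1) * sizeof(*grown));
	if (!grown)
		return -1;
	t->conns = grown;

	c = malloc(sizeof(*c));
	if (!c)
		return -1;	/* the spare slot is simply not counted */

	c->dest_port = dest_port;
	t->conns[t->nr_conns] = c;
	t->nr_conns++;		/* publish the slot only once it is valid */
	return 0;
}
```

Any code that iterates over indices 0..nr_conns-1 then sees only valid entries, whichever allocation failed.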
On Fri, 29 May 2026 00:52:01 +0800, Runyu Xiao wrote:
> The etb10 miscdevice uses drvdata->reading as a shared exclusivity gate
> for userspace buffer access. etb_open() claims that gate with
> local_cmpxchg(), and etb_release() clears it with local_set().
>
> That gate is shared per-device state rather than CPU-local state. A
> running system can reach it whenever /dev/<etb> is opened, closed, and
> reopened by different tasks while the device remains registered, so the
> same drvdata->reading variable may be claimed on one CPU and later
> cleared on another.
>
> [...]
Applied, thanks!
[1/1] coresight: etb10: restore atomic_t for shared reading state
https://git.kernel.org/coresight/c/fa09f08ede3d
Best regards,
--
Suzuki K Poulose <suzuki.poulose(a)arm.com>
On 28/05/2026 5:52 pm, Runyu Xiao wrote:
> The etb10 miscdevice uses drvdata->reading as a shared exclusivity gate
> for userspace buffer access. etb_open() claims that gate with
> local_cmpxchg(), and etb_release() clears it with local_set().
>
> That gate is shared per-device state rather than CPU-local state. A
> running system can reach it whenever /dev/<etb> is opened, closed, and
> reopened by different tasks while the device remains registered, so the
> same drvdata->reading variable may be claimed on one CPU and later
> cleared on another.
>
> This code used to use atomic_t for the same gate, but commit
> 27b10da8fff2 ("coresight: etb10: moving to local atomic operations")
> changed it to local_t even though the access pattern remained cross-task
> and cross-CPU. Restore atomic_t together with atomic_cmpxchg() and
> atomic_set() so the exclusivity gate again uses a primitive intended
> for shared state.
>
> The issue was found on Linux v6.18.21 by our static analysis tool while
> scanning surviving local_t-on-shared-state sites, and then manually
> reviewed against the live etb10 file-op path.
>
> It was runtime-validated with a reproducible QEMU no-device KCSAN PoC
> that kept the same report-local contract:
>
> 1. use one shared struct etb_drvdata carrier and its
> drvdata->reading gate;
> 2. call etb_open() and etb_release() sequentially on that gate to
> confirm the original claim/clear path;
> 3. bind the open side to CPU0 and the release side to CPU1 for the
> same gate to show cross-CPU ownership;
> 4. run bound workers that repeatedly race etb_open() and
> etb_release() on the same gate until KCSAN reports a target hit.
>
> The harness recorded:
>
> L1 passed open=1 release=1
> reading_after_open=1 reading_after_release=0
> L2 passed open_cpu=0 release_cpu=1
> cross_cpu_release=1 reading_after=0 open_ret=0
>
> Representative KCSAN excerpt from the no-device validation run:
>
> BUG: KCSAN: data-race in etb_open.constprop.0.isra.0 [vuln_msv]
>
> write to 0xffffffffc0003810 of 4 bytes by task 216 on cpu 1:
> etb_open.constprop.0.isra.0+0x38/0x80 [vuln_msv]
> l3_worker_thread_fn+0x4f/0xf0 [vuln_msv]
> kthread+0x17e/0x1c0
> ret_from_fork+0x22/0x30
>
> read to 0xffffffffc0003810 of 4 bytes by task 215 on cpu 0:
> etb_open.constprop.0.isra.0+0x18/0x80 [vuln_msv]
> l3_worker_thread_fn+0x4f/0xf0 [vuln_msv]
> kthread+0x17e/0x1c0
> ret_from_fork+0x22/0x30
>
> value changed: 0x00000000 -> 0x00000001
>
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 0 PID: 215 Comm: etb10_l3_a Tainted: G O 6.1.66 #2
>
> This no-device harness is not a real ETB10 hardware end-to-end run, but
> it preserves the same shared drvdata->reading gate and the same
> etb_open()/etb_release() claim/clear contract. No real ETB10 hardware
> was available for runtime testing.
>
> Build-tested with:
> make olddefconfig
> make -j"$(nproc)" drivers/hwtracing/coresight/coresight-etb10.o
>
> Fixes: 27b10da8fff2 ("coresight: etb10: moving to local atomic operations")
> Cc: stable(a)vger.kernel.org
> Signed-off-by: Runyu Xiao <runyu.xiao(a)seu.edu.cn>
> ---
> drivers/hwtracing/coresight/coresight-etb10.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/hwtracing/coresight/coresight-etb10.c b/drivers/hwtracing/coresight/coresight-etb10.c
> index 35db1b6093d1..98269ea6f7ae 100644
> --- a/drivers/hwtracing/coresight/coresight-etb10.c
> +++ b/drivers/hwtracing/coresight/coresight-etb10.c
> @@ -85,7 +85,7 @@ struct etb_drvdata {
> struct coresight_device *csdev;
> struct miscdevice miscdev;
> raw_spinlock_t spinlock;
> - local_t reading;
> + atomic_t reading;
> pid_t pid;
> u8 *buf;
> u32 buffer_depth;
> @@ -603,7 +603,7 @@ static int etb_open(struct inode *inode, struct file *file)
> struct etb_drvdata *drvdata = container_of(file->private_data,
> struct etb_drvdata, miscdev);
>
> - if (local_cmpxchg(&drvdata->reading, 0, 1))
> + if (atomic_cmpxchg(&drvdata->reading, 0, 1))
> return -EBUSY;
>
> dev_dbg(&drvdata->csdev->dev, "%s: successfully opened\n", __func__);
> @@ -641,7 +641,7 @@ static int etb_release(struct inode *inode, struct file *file)
> {
> struct etb_drvdata *drvdata = container_of(file->private_data,
> struct etb_drvdata, miscdev);
> - local_set(&drvdata->reading, 0);
> + atomic_set(&drvdata->reading, 0);
>
> dev_dbg(&drvdata->csdev->dev, "%s: released\n", __func__);
> return 0;
Reviewed-by: James Clark <james.clark(a)linaro.org>
Semi-related to this change, etb_read() doesn't take any lock when
reading drvdata->buffer_depth or drvdata->buf. It locks in etb_dump(),
but then unlocks before actually calling copy_to_user().
Seems like concurrent calls to etb_read() might end up with corrupt
data, although I'm not sure if that would ever happen in practice
because it only allows one open file handle.
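For reference, the claim/clear contract discussed in this thread can be sketched in userspace with C11 atomics. This is an analogue, not the driver code: struct buf_dev, buf_dev_open() and buf_dev_release() are hypothetical names, and atomic_compare_exchange_strong()/atomic_store() stand in for the kernel's atomic_cmpxchg()/atomic_set().

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical device state: 'reading' is 1 while a reader holds the gate. */
struct buf_dev {
	atomic_int reading;
};

/*
 * Claim the gate. Only the caller that moves 'reading' from 0 to 1
 * succeeds; every other caller gets -EBUSY, whichever CPU it runs on.
 */
static int buf_dev_open(struct buf_dev *dev)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong(&dev->reading, &expected, 1))
		return -EBUSY;
	return 0;
}

/* Clear the gate so that the next open can claim it. */
static void buf_dev_release(struct buf_dev *dev)
{
	atomic_store(&dev->reading, 0);
}
```

A second buf_dev_open() before buf_dev_release() returns -EBUSY, and an open after the release succeeds again. The gate only serialises open and release; it does not by itself protect the buffer contents during a read.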
On Thu, May 21, 2026 at 08:16:27PM +0800, Yingchao Deng wrote:
[...]
> @@ -231,6 +254,8 @@ struct cti_trig_con *cti_allocate_trig_con(struct device *dev, int in_sigs,
> {
> struct cti_trig_con *tc = NULL;
> struct cti_trig_grp *in = NULL, *out = NULL;
> + struct cti_drvdata *drvdata = dev_get_drvdata(dev);
> + int n_trigs = drvdata->config.nr_trig_max;
I don't mind allocating the bitmasks with nr_trig_max, but an AI review
suggests that when in_sigs / out_sigs are bigger than nr_trig_max, it
might access memory out of bounds (see cti_plat_read_trig_group()).
It would be good to add a check:
	if (in_sigs > n_trigs || out_sigs > n_trigs) {
		dev_err(dev, "trigger signal is out of range: in=%d out=%d nr_max=%d\n",
			in_sigs, out_sigs, n_trigs);
		return NULL;
	}
With this:
Reviewed-by: Leo Yan <leo.yan(a)arm.com>
BTW, I have given my review tag on v8, please remember to update
patches with review / ack tags.
On Thu, May 21, 2026 at 08:16:29PM +0800, Yingchao Deng wrote:
[...]
> Qualcomm implements an extended variant of the ARM CoreSight CTI with a
> different register layout and vendor-specific behavior. While the
> programming model remains largely compatible, the register offsets differ
> from the standard ARM CTI and require explicit handling.
I cannot apply this patch successfully. Please rebase on the latest
coresight-next branch.
> @@ -726,6 +734,22 @@ static int cti_probe(struct amba_device *adev, const struct amba_id *id)
>
> raw_spin_lock_init(&drvdata->spinlock);
>
> + devarch = readl_relaxed(drvdata->base + CORESIGHT_DEVARCH);
> + if (CTI_DEVARCH_ARCHITECT(devarch) == ARCHITECT_QCOM) {
> + drvdata->is_qcom_cti = true;
> + /*
> + * QCOM CTI does not implement Claimtag functionality as
> + * per CoreSight specification, but its CLAIMSET register
> + * is incorrectly initialized to 0xF. This can mislead
> + * tools or drivers into thinking the component is claimed.
> + *
> + * Reset CLAIMSET to 0 to reflect that no claims are active.
> + */
> + CS_UNLOCK(drvdata->base);
> + writel_relaxed(0, drvdata->base + CORESIGHT_CLAIMSET);
> + CS_LOCK(drvdata->base);
Sorry I missed this piece before.
Can you move this quirk into firmware? I don't think the CTI driver
should clear the external claim bit, as this totally breaks the protocol
defined in PSCI. A clean way would be to clear the bits in firmware so
that the CTI driver can then use the CLAIM registers.
Or, another option is to create several helpers to bypass claim
operations for Qcom CTI:
static void cti_clear_self_claim_tag(struct cti_drvdata *drvdata,
				     struct csdev_access *csa)
{
	if (drvdata->is_qcom_cti)
		return;

	coresight_clear_self_claim_tag(csa);
}

static int cti_claim_device(struct cti_drvdata *drvdata)
{
	if (drvdata->is_qcom_cti)
		return 0;

	return coresight_claim_device(drvdata->csdev);
}

static void cti_unclaim_device_unlocked(struct cti_drvdata *drvdata)
{
	if (drvdata->is_qcom_cti)
		return;

	coresight_disclaim_device_unlocked(drvdata->csdev);
}
Otherwise, this patch looks fine to me.
Thanks,
Leo