Hi Yabin,
Sorry for late reply as I was on vacation for the past two weeks.
On Tue, Apr 15, 2025 at 11:46:48AM -0700, Yabin Cui wrote:
> When tracing ETM data on multiple CPUs concurrently via the
> perf interface, the CATU device is shared across different CPU
> paths. This can lead to race conditions when multiple CPUs attempt
> to enable or disable the CATU device simultaneously.
>
> To address these race conditions, this patch introduces the
> following changes:
>
> 1. The enable and disable operations for the CATU device are not
> reentrant. Therefore, a spinlock is added to ensure that only
> one CPU can enable or disable a given CATU device at any point
> in time.
>
> 2. A reference counter is used to manage the enable/disable state
> of the CATU device. The device is enabled when the first CPU
> requires it and is only disabled when the last CPU finishes
> using it. This ensures the device remains active as long as at
> least one CPU needs it.
>
> Signed-off-by: Yabin Cui <yabinc(a)google.com>
LGTM:
Reviewed-by: Leo Yan <leo.yan(a)arm.com>
As Jie reminded, please add James' review tag in next spin.
Thanks,
Leo
> ---
> drivers/hwtracing/coresight/coresight-catu.c | 25 +++++++++++++-------
> drivers/hwtracing/coresight/coresight-catu.h | 1 +
> 2 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/hwtracing/coresight/coresight-catu.c b/drivers/hwtracing/coresight/coresight-catu.c
> index fa170c966bc3..3909b562b077 100644
> --- a/drivers/hwtracing/coresight/coresight-catu.c
> +++ b/drivers/hwtracing/coresight/coresight-catu.c
> @@ -458,12 +458,17 @@ static int catu_enable_hw(struct catu_drvdata *drvdata, enum cs_mode cs_mode,
> static int catu_enable(struct coresight_device *csdev, enum cs_mode mode,
> void *data)
> {
> - int rc;
> + int rc = 0;
> struct catu_drvdata *catu_drvdata = csdev_to_catu_drvdata(csdev);
>
> - CS_UNLOCK(catu_drvdata->base);
> - rc = catu_enable_hw(catu_drvdata, mode, data);
> - CS_LOCK(catu_drvdata->base);
> + guard(raw_spinlock_irqsave)(&catu_drvdata->spinlock);
> + if (csdev->refcnt == 0) {
> + CS_UNLOCK(catu_drvdata->base);
> + rc = catu_enable_hw(catu_drvdata, mode, data);
> + CS_LOCK(catu_drvdata->base);
> + }
> + if (!rc)
> + csdev->refcnt++;
> return rc;
> }
>
> @@ -486,12 +491,15 @@ static int catu_disable_hw(struct catu_drvdata *drvdata)
>
> static int catu_disable(struct coresight_device *csdev, void *__unused)
> {
> - int rc;
> + int rc = 0;
> struct catu_drvdata *catu_drvdata = csdev_to_catu_drvdata(csdev);
>
> - CS_UNLOCK(catu_drvdata->base);
> - rc = catu_disable_hw(catu_drvdata);
> - CS_LOCK(catu_drvdata->base);
> + guard(raw_spinlock_irqsave)(&catu_drvdata->spinlock);
> + if (--csdev->refcnt == 0) {
> + CS_UNLOCK(catu_drvdata->base);
> + rc = catu_disable_hw(catu_drvdata);
> + CS_LOCK(catu_drvdata->base);
> + }
> return rc;
> }
>
> @@ -550,6 +558,7 @@ static int __catu_probe(struct device *dev, struct resource *res)
> dev->platform_data = pdata;
>
> drvdata->base = base;
> + raw_spin_lock_init(&drvdata->spinlock);
> catu_desc.access = CSDEV_ACCESS_IOMEM(base);
> catu_desc.pdata = pdata;
> catu_desc.dev = dev;
> diff --git a/drivers/hwtracing/coresight/coresight-catu.h b/drivers/hwtracing/coresight/coresight-catu.h
> index 141feac1c14b..755776cd19c5 100644
> --- a/drivers/hwtracing/coresight/coresight-catu.h
> +++ b/drivers/hwtracing/coresight/coresight-catu.h
> @@ -65,6 +65,7 @@ struct catu_drvdata {
> void __iomem *base;
> struct coresight_device *csdev;
> int irq;
> + raw_spinlock_t spinlock;
> };
>
> #define CATU_REG32(name, offset) \
> --
> 2.49.0.604.gff1f9ca942-goog
>
On Mon, Apr 21, 2025 at 02:58:18PM -0700, Yabin Cui wrote:
> The cs_etm PMU, regardless of the underlying trace sink (ETF, ETR or
> TRBE), doesn't require contiguous pages for its AUX buffer.
Though contiguous pages are not mandatory for TRBE, I would set the
PERF_PMU_CAP_AUX_NO_SG flag for it. This can potentially benefit
performance.
For non per CPU sinks, it is fine to allocate non-contiguous pages.
Thanks,
Leo
> This patch adds the PERF_PMU_CAP_AUX_NON_CONTIGUOUS_PAGES capability
> to the cs_etm PMU. This allows the kernel to allocate non-contiguous
> pages for the AUX buffer, reducing memory fragmentation when using
> cs_etm.
>
> Signed-off-by: Yabin Cui <yabinc(a)google.com>
> ---
> drivers/hwtracing/coresight/coresight-etm-perf.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/hwtracing/coresight/coresight-etm-perf.c b/drivers/hwtracing/coresight/coresight-etm-perf.c
> index f4cccd68e625..c98646eca7f8 100644
> --- a/drivers/hwtracing/coresight/coresight-etm-perf.c
> +++ b/drivers/hwtracing/coresight/coresight-etm-perf.c
> @@ -899,7 +899,8 @@ int __init etm_perf_init(void)
> int ret;
>
> etm_pmu.capabilities = (PERF_PMU_CAP_EXCLUSIVE |
> - PERF_PMU_CAP_ITRACE);
> + PERF_PMU_CAP_ITRACE |
> + PERF_PMU_CAP_AUX_NON_CONTIGUOUS_PAGES);
>
> etm_pmu.attr_groups = etm_pmu_attr_groups;
> etm_pmu.task_ctx_nr = perf_sw_context;
> --
> 2.49.0.805.g082f7c87e0-goog
>
>
On 21/04/2025 10:58 pm, Yabin Cui wrote:
> For PMUs like ARM ETM/ETE, contiguous AUX buffers are unnecessary
> and increase memory fragmentation.
>
> This patch introduces PERF_PMU_CAP_AUX_NON_CONTIGUOUS_PAGES, allowing
> PMUs to request non-contiguous pages for their AUX buffers.
>
> Signed-off-by: Yabin Cui <yabinc(a)google.com>
> ---
> include/linux/perf_event.h | 1 +
> kernel/events/ring_buffer.c | 6 ++++++
> 2 files changed, 7 insertions(+)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 0069ba6866a4..26ca35d6a9f2 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -301,6 +301,7 @@ struct perf_event_pmu_context;
> #define PERF_PMU_CAP_AUX_OUTPUT 0x0080
> #define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100
> #define PERF_PMU_CAP_AUX_PAUSE 0x0200
> +#define PERF_PMU_CAP_AUX_NON_CONTIGUOUS_PAGES 0x0400
>
> /**
> * pmu::scope
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 5130b119d0ae..87f42f4e8edc 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -710,6 +710,12 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
> max_order = ilog2(nr_pages);
> watermark = 0;
> }
> + /*
> + * When the PMU doesn't prefer contiguous AUX buffer pages, favor
> + * low-order allocations to reduce memory fragmentation.
> + */
> + if (event->pmu->capabilities & PERF_PMU_CAP_AUX_NON_CONTIGUOUS_PAGES)
> + max_order = 0;
>
> /*
> * kcalloc_node() is unable to allocate buffer if the size is larger
Hi Yabin,
I was wondering if this is just the opposite of PERF_PMU_CAP_AUX_NO_SG,
and that order 0 should be used by default for all devices to solve the
issue you describe. Because we already have PERF_PMU_CAP_AUX_NO_SG for
devices that need contiguous pages. Then I found commit 5768402fd9c6
("perf/ring_buffer: Use high order allocations for AUX buffers
optimistically") that explains that the current allocation strategy is
an optimization.
Your change seems to decide that for certain devices we want to optimize
for fragmentation rather than performance. If these are rarely used
features specifically when looking at performance should we not continue
to optimize for performance? Or at least make it user configurable?
Thanks
James
This series fixes and improves clock usage in the Arm CoreSight drivers.
Based on the DT binding documents, the trace clock (atclk) is defined in
some CoreSight modules, but the corresponding drivers to support it are
absent. In most cases, the issue is hidden because atclk is shared by
multiple CoreSight modules and is enabled anyway by other drivers.
The first three patches address this issue.
The programming clock (pclk) management in CoreSight drivers does not
use the devm_XXX() variant APIs, so the drivers needs to manually
disable and release clocks for errors and for normal module exit.
However, the drivers miss to disable clocks during module exit.
Another issue is pclk might be enabled twice in init phase - once by
AMBA bus driver, and again by CoreSight drivers.
Patches 04 and 05 fix pclk issues.
The atclk may also not be disabled in CoreSight drivers during module
exit. This is fixed in patch 06.
Patches 07 to 09 refactor the clock related code. Patch 07 makes the
clock enabling sequence consistent. Patch 08 removes redundant
condition checks and adds error handling in runtime PM. Patch 09
consolidats the clock initialization into a central place.
This series is verified on Arm64 Hikey960 platform.
Leo Yan (9):
coresight: tmc: Support atclk
coresight: catu: Support atclk
coresight: etm4x: Support atclk
coresight: Disable programming clock properly
coresight: Avoid enable programming clock duplicately
coresight: Disable trace bus clock properly
coresight: Make clock sequence consistent
coresight: Refactor runtime PM
coresight: Consolidate clock enabling
drivers/hwtracing/coresight/coresight-catu.c | 53 ++++++++++++++++-----------------
drivers/hwtracing/coresight/coresight-catu.h | 1 +
drivers/hwtracing/coresight/coresight-cpu-debug.c | 41 +++++++++-----------------
drivers/hwtracing/coresight/coresight-ctcu-core.c | 24 +++++----------
drivers/hwtracing/coresight/coresight-etb10.c | 18 ++++--------
drivers/hwtracing/coresight/coresight-etm3x-core.c | 17 ++++-------
drivers/hwtracing/coresight/coresight-etm4x-core.c | 32 ++++++++++----------
drivers/hwtracing/coresight/coresight-etm4x.h | 4 ++-
drivers/hwtracing/coresight/coresight-funnel.c | 66 +++++++++++++++---------------------------
drivers/hwtracing/coresight/coresight-replicator.c | 63 ++++++++++++++--------------------------
drivers/hwtracing/coresight/coresight-stm.c | 34 +++++++++-------------
drivers/hwtracing/coresight/coresight-tmc-core.c | 48 +++++++++++++++---------------
drivers/hwtracing/coresight/coresight-tmc.h | 2 ++
drivers/hwtracing/coresight/coresight-tpiu.c | 36 ++++++++++-------------
include/linux/coresight.h | 47 ++++++++++++++++++------------
15 files changed, 206 insertions(+), 280 deletions(-)
--
2.34.1
On 08/04/2025 8:59 pm, Yabin Cui wrote:
> When tracing ETM data on multiple CPUs concurrently via the
> perf interface, the CATU device is shared across different CPU
> paths. This can lead to race conditions when multiple CPUs attempt
> to enable or disable the CATU device simultaneously.
>
> To address these race conditions, this patch introduces the
> following changes:
>
> 1. The enable and disable operations for the CATU device are not
> reentrant. Therefore, a spinlock is added to ensure that only
> one CPU can enable or disable a given CATU device at any point
> in time.
>
> 2. A reference counter is used to manage the enable/disable state
> of the CATU device. The device is enabled when the first CPU
> requires it and is only disabled when the last CPU finishes
> using it. This ensures the device remains active as long as at
> least one CPU needs it.
>
> Signed-off-by: Yabin Cui <yabinc(a)google.com>
> ---
> drivers/hwtracing/coresight/coresight-catu.c | 25 +++++++++++++-------
> drivers/hwtracing/coresight/coresight-catu.h | 1 +
> 2 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/hwtracing/coresight/coresight-catu.c b/drivers/hwtracing/coresight/coresight-catu.c
> index fa170c966bc3..30b78b2f8adb 100644
> --- a/drivers/hwtracing/coresight/coresight-catu.c
> +++ b/drivers/hwtracing/coresight/coresight-catu.c
> @@ -458,12 +458,17 @@ static int catu_enable_hw(struct catu_drvdata *drvdata, enum cs_mode cs_mode,
> static int catu_enable(struct coresight_device *csdev, enum cs_mode mode,
> void *data)
> {
> - int rc;
> + int rc = 0;
> struct catu_drvdata *catu_drvdata = csdev_to_catu_drvdata(csdev);
> + guard(raw_spinlock_irqsave)(&catu_drvdata->spinlock);
>
Very minor nit only because you need to resend anyway, but there should
be a newline between the variable definitions and the code. Not sure why
checkpatch doesn't warn here.
> - CS_UNLOCK(catu_drvdata->base);
> - rc = catu_enable_hw(catu_drvdata, mode, data);
> - CS_LOCK(catu_drvdata->base);
> + if (csdev->refcnt == 0) {
> + CS_UNLOCK(catu_drvdata->base);
> + rc = catu_enable_hw(catu_drvdata, mode, data);
> + CS_LOCK(catu_drvdata->base);
> + }
> + if (!rc)
> + csdev->refcnt++;
> return rc;
> }
>
> @@ -486,12 +491,15 @@ static int catu_disable_hw(struct catu_drvdata *drvdata)
>
> static int catu_disable(struct coresight_device *csdev, void *__unused)
> {
> - int rc;
> + int rc = 0;
> struct catu_drvdata *catu_drvdata = csdev_to_catu_drvdata(csdev);
> + guard(raw_spinlock_irqsave)(&catu_drvdata->spinlock);
>
> - CS_UNLOCK(catu_drvdata->base);
> - rc = catu_disable_hw(catu_drvdata);
> - CS_LOCK(catu_drvdata->base);
> + if (--csdev->refcnt == 0) {
Hopefully this never underflows if disable is called again after a
failed enable. We could add a WARN_ON() but I think this is a general
case and not specific to these patches so is probably better to do later
as separate change.
Reviewed-by: James Clark <james.clark(a)linaro.org>
On 14/04/2025 11:06, Andy Shevchenko wrote:
> On Mon, Mar 31, 2025 at 02:40:51PM +0100, James Clark wrote:
>> On 31/03/2025 8:14 am, Andy Shevchenko wrote:
>
>> Reviewed-by: James Clark <james.clark(a)linaro.org>
>
> Thank you! Can it be applied now?
>
I will queue this via coresight tree, for v6.16
Cheers
Suzuki
Hi,
I can see that this patchset has fixed some of the issues raised from
v1 - using the existing file handle for read, and stopping the ETR to
read the RWP.
However the fundamental problem that it is still attempting to read
memory without stopping the ETR has not been addressed.
As mentioned in mine and Suzuki's comments for v1, this means:-
1) you cannot guarantee that the buffer has not wrapped when reading
data back, which will corrupt the trace decode process.
2) The DMA buffers are not being synchronized, so the PE could be
reading stale data rather than the new data written by the ETR.
Regards
Mike
On Thu, 10 Apr 2025 at 02:33, Jie Gan <jie.gan(a)oss.qualcomm.com> wrote:
>
> The byte-cntr function provided by the CTCU device is used to transfer data
> from the ETR buffer to the userspace. An interrupt is tiggered if the data
> size exceeds the threshold set in the BYTECNTRVAL register. The interrupt
> handler counts the number of triggered interruptions and the read function
> will read the data from the ETR buffer if the IRQ count is greater than 0.
> The read work will be conducted ASAP after the byte-cntr is started.
> Each successful read process will decrement the IRQ count by 1.
>
> The byte cntr function will start when the device node is opened for reading,
> and the IRQ count will reset when the byte cntr function has stopped. When
> the file node is opened, the w_offset of the ETR buffer will be read and
> stored in byte_cntr_data, serving as the original r_offset (indicating
> where reading starts) for the byte counter function.
>
> The work queue for the read operation will wake up once when ETR is stopped,
> ensuring that the remaining data in the ETR buffer has been flushed based on
> the w_offset read at the time of stopping.
>
> The byte-cntr read work has integrated with the file node tmc_etr, e.g.
> /dev/tmc_etr0
> /dev/tmc_etr1
>
> There are two scenarios for the ETR file nodes with byte-cntr function:
> 1. BYTECNTRVAL register has configured -> byte-cntr read
> 2. BYTECNTRVAL register is disabled -> original behavior, flush the etr_buf
>
> We still can flush the etr buffer once after the byte-cntr function has
> triggered.
> 1. Enable byte-cntr
> 2. Byte-cntr read
> 3. Disable byte-cntr
> 4. Flush etr buffer
>
> Since the ETR operates in circular buffer mode, we cannot fully guarantee
> that no overwrites occur when the byte-cntr read function reads the data.
> The read function will read the data ASAP when the interrupt is
> triggered and we should not configure a threshold greater than the
> buffer size of the ETR buffer.
>
> The following shell commands write threshold to BYTECNTRVAL registers.
>
> Only enable byte-cntr for ETR0:
> echo 0x10000 > /sys/devices/platform/soc(a)0/4001000.ctcu/ctcu0/byte_cntr_val
>
> Enable byte-cntr for both ETR0 and ETR1(support both hex and decimal values):
> echo 0x10000 4096 > /sys/devices/platform/soc(a)0/4001000.ctcu/ctcu0/byte_cntr_val
>
> Setting the BYTECNTRVAL registers to 0 disables the byte-cntr function.
> Disable byte-cntr for ETR0:
> echo 0 > /sys/devices/platform/soc(a)0/4001000.ctcu/ctcu0/byte_cntr_val
>
> Disable byte-cntr for both ETR0 and ETR1:
> echo 0 0 > /sys/devices/platform/soc(a)0/4001000.ctcu/ctcu0/byte_cntr_val
>
> There is a minimum threshold to prevent generating too many interrupts.
> The minimum threshold is 4096 bytes. The write process will fail if user try
> to set the BYTECNTRVAL registers to a value less than 4096 bytes(except
> for 0).
>
> Way to enable and start byte-cntr for ETR0:
> echo 0x10000 > /sys/devices/platform/soc(a)0/4001000.ctcu/ctcu0/byte_cntr_val
> echo 1 > /sys/bus/coresight/devices/tmc_etr0/enable_sink
> echo 1 > /sys/bus/coresight/devices/etm0/enable_source
> cat /dev/tmc_etr0
>
> Testing case has conducted for the byte-cntr read work:
> 1. Setting the buffer_size of the ETR as large as possile, here is for ETR0
> echo 0x1000000 > /sys/bus/coresight/devices/tmc_etr0/buffer_size
> 2. Setting the threshold for the ETR0 to 0x10000
> echo 0x10000 > /sys/bus/coresight/devices/ctcu0/byte_cntr_val
> 3. Enable ETR0
> echo 1 > /sys/bus/coresight/devices/tmc_etr0/enable_sink
> 4. Enable ETM0 as source and enable byte-cntr to read data
> echo 1 > /sys/bus/coresight/devices/etm0/enable_source;
> cat /dev/tmc_etr0 > /tmp/file_byte_cntr.bin &
> 5. Disable ETM0
> echo 0 > /sys/bus/coresight/devices/etm0/enable_source
> 6. Disable byte-cntr and flush the etr buffer
> echo 0 > /sys/bus/coresight/devices/ctcu0/byte_cntr_val;
> cat /dev/tmc_etr0 > /tmp/file_etr0.bin
> ls -l /tmp
>
> -rw-r--r-- 1 root root 12628960 Apr 28 17:44 file_byte_cntr.bin
> -rw-r--r-- 1 root root 12669296 Apr 28 17:45 file_etr0.bin
>
> 7. Deal with the file_etr0.bin with following command:
> dd if=/tmp/file_etr0.bin of=/tmp/file_etr0_aligned.bin bs=1
> count=12628960 skip=40336
> ls -l /tmp
>
> -rw-r--r-- 1 root root 12628960 Apr 28 17:44 file_byte_cntr.bin
> -rw-r--r-- 1 root root 12669296 Apr 28 17:45 file_etr0.bin
> -rw-r--r-- 1 root root 12628960 Apr 28 17:49 file_etr0_aligned.bin
>
> 8. Compared file_byte_cntr.bin with file_etr0_aligned.bin and identified
> they are competely same.
> diff file_byte_cntr.bin file_etr0_aligned.bin
>
> =======================
> Changes in V2:
> 1. Removed the independent file node /dev/byte_cntr.
> 2. Integrated the byte-cntr's file operations with current ETR file
> node.
> 3. Optimized the driver code of the CTCU that associated with byte-cntr.
> 4. Add kernel document for the export API tmc_etr_get_rwp_offset.
> 5. Optimized the way to read the rwp_offset according to Mike's
> suggestion.
> 6. Removed the dependency of the dts patch.
> Link to V1 - https://lore.kernel.org/all/20250310090407.2069489-1-quic_jiegan@quicinc.co…
>
> Jie Gan (5):
> coresight: tmc: Introduce new APIs to get the RWP offset of ETR buffer
> dt-bindings: arm: Add an interrupt property for Coresight CTCU
> coresight: ctcu: Enable byte-cntr for TMC ETR devices
> coresight: tmc: add functions for byte-cntr operation
> arm64: dts: qcom: sa8775p: Add interrupts to CTCU device
>
> .../bindings/arm/qcom,coresight-ctcu.yaml | 17 ++
> arch/arm64/boot/dts/qcom/sa8775p.dtsi | 5 +
> drivers/hwtracing/coresight/Makefile | 2 +-
> .../coresight/coresight-ctcu-byte-cntr.c | 119 ++++++++++++
> .../hwtracing/coresight/coresight-ctcu-core.c | 88 ++++++++-
> drivers/hwtracing/coresight/coresight-ctcu.h | 49 ++++-
> .../hwtracing/coresight/coresight-tmc-core.c | 29 ++-
> .../hwtracing/coresight/coresight-tmc-etr.c | 175 ++++++++++++++++++
> drivers/hwtracing/coresight/coresight-tmc.h | 10 +-
> 9 files changed, 483 insertions(+), 11 deletions(-)
> create mode 100644 drivers/hwtracing/coresight/coresight-ctcu-byte-cntr.c
>
> --
> 2.34.1
>
--
Mike Leach
Principal Engineer, ARM Ltd.
Manchester Design Centre. UK