The ETF may fail to re-enable after reading, in which case driver->reading
is not set to false. Later attempts to enable or disable the ETF then fail.
This change sets driver->reading to false even if re-enabling fails.
Fixes: 669c4614236a ("coresight: tmc: Don't enable TMC when it's not ready.")
Co-developed-by: Yuanfang Zhang <quic_yuanfang(a)quicinc.com>
Signed-off-by: Yuanfang Zhang <quic_yuanfang(a)quicinc.com>
Signed-off-by: Mao Jinlong <quic_jinlmao(a)quicinc.com>
---
drivers/hwtracing/coresight/coresight-tmc-etf.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etf.c b/drivers/hwtracing/coresight/coresight-tmc-etf.c
index d858740001c2..c9e2d95ae295 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etf.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etf.c
@@ -747,7 +747,6 @@ int tmc_read_unprepare_etb(struct tmc_drvdata *drvdata)
char *buf = NULL;
enum tmc_mode mode;
unsigned long flags;
- int rc = 0;
/* config types are set a boot time and never change */
if (WARN_ON_ONCE(drvdata->config_type != TMC_CONFIG_TYPE_ETB &&
@@ -773,11 +772,7 @@ int tmc_read_unprepare_etb(struct tmc_drvdata *drvdata)
* can't be NULL.
*/
memset(drvdata->buf, 0, drvdata->size);
- rc = __tmc_etb_enable_hw(drvdata);
- if (rc) {
- raw_spin_unlock_irqrestore(&drvdata->spinlock, flags);
- return rc;
- }
+ __tmc_etb_enable_hw(drvdata);
} else {
/*
* The ETB/ETF is not tracing and the buffer was just read.
--
2.25.1
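The control flow described in the commit message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct model_drvdata`, `model_enable_hw()`, `model_unprepare_old()` and `model_unprepare_new()` are hypothetical stand-ins, not the driver's code, and locking and buffer handling are left out. It shows why returning early on an enable failure leaves the reading flag set, and how ignoring the enable result lets the flag be cleared.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the driver data. */
struct model_drvdata {
	bool reading;   /* set while a reader owns the buffer */
	bool tracing;   /* models an active trace session */
	bool hw_ready;  /* models whether the hardware can be re-enabled */
};

/* Models the hardware enable step: fails when the hardware is not ready. */
int model_enable_hw(const struct model_drvdata *d)
{
	assert(d != NULL);
	return d->hw_ready ? 0 : -EBUSY;
}

/* Old flow: an enable failure returns early, so reading stays true. */
int model_unprepare_old(struct model_drvdata *d)
{
	assert(d != NULL);
	if (d->tracing) {
		int rc = model_enable_hw(d);

		if (rc)
			return rc;
	}
	d->reading = false;
	return 0;
}

/* New flow: the enable result is ignored, so reading is always cleared. */
int model_unprepare_new(struct model_drvdata *d)
{
	assert(d != NULL);
	if (d->tracing)
		(void)model_enable_hw(d);
	d->reading = false;
	return 0;
}
```

In this model, a device with `tracing` set and `hw_ready` clear keeps `reading == true` after the old flow, while the new flow clears it.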
This series fixes and improves clock usage in the Arm CoreSight drivers.
Based on the DT binding documents, the trace clock (atclk) is defined in
some CoreSight modules, but support is absent. In most cases, the issue
is hidden because the atclk clock is shared by multiple CoreSight
modules and the clock is enabled anyway by other drivers. The first
three patches address this issue.
The programming clock (pclk) management in CoreSight drivers does not
use the devm_XXX() variant APIs, so the drivers need to manually disable
and release clocks on error paths and on normal module exit. However, the
drivers fail to disable clocks during module exit. The atclk may also
not be disabled in CoreSight drivers during module exit. By using devm
APIs, patches 04 and 05 fix these clock disabling issues.
Another issue is that pclk might be enabled twice in the init phase: once
by the AMBA bus driver, and again by the CoreSight drivers. This is fixed
in patch 06.
Patches 07 to 09 refactor the clock related code. Patch 07 consolidates
the clock initialization into a central place. Patch 08 makes the
clock enabling sequence consistent. Patch 09 removes redundant
condition checks and adds error handling in runtime PM.
This series was verified on the Arm64 Hikey960 platform.
Changes from v1:
- Moved the coresight_get_enable_clocks() function into CoreSight core
layer (James).
- Added comments for clock naming "apb_pclk" and "apb" (James).
- Re-ordered patches for easier understanding (Anshuman).
- Minor improvement for commit log in patch 01 (Anshuman).
Leo Yan (9):
coresight: tmc: Support atclk
coresight: catu: Support atclk
coresight: etm4x: Support atclk
coresight: Disable programming clock properly
coresight: Disable trace bus clock properly
coresight: Avoid enable programming clock duplicately
coresight: Consolidate clock enabling
coresight: Make clock sequence consistent
coresight: Refactor runtime PM
drivers/hwtracing/coresight/coresight-catu.c | 53 ++++++++++++++++-----------------
drivers/hwtracing/coresight/coresight-catu.h | 1 +
drivers/hwtracing/coresight/coresight-core.c | 45 ++++++++++++++++++++++++++++
drivers/hwtracing/coresight/coresight-cpu-debug.c | 41 +++++++++-----------------
drivers/hwtracing/coresight/coresight-ctcu-core.c | 24 +++++----------
drivers/hwtracing/coresight/coresight-etb10.c | 18 ++++--------
drivers/hwtracing/coresight/coresight-etm3x-core.c | 17 ++++-------
drivers/hwtracing/coresight/coresight-etm4x-core.c | 32 ++++++++++----------
drivers/hwtracing/coresight/coresight-etm4x.h | 4 ++-
drivers/hwtracing/coresight/coresight-funnel.c | 66 +++++++++++++++---------------------------
drivers/hwtracing/coresight/coresight-replicator.c | 63 ++++++++++++++--------------------------
drivers/hwtracing/coresight/coresight-stm.c | 34 +++++++++-------------
drivers/hwtracing/coresight/coresight-tmc-core.c | 48 +++++++++++++++---------------
drivers/hwtracing/coresight/coresight-tmc.h | 2 ++
drivers/hwtracing/coresight/coresight-tpiu.c | 36 ++++++++++-------------
include/linux/coresight.h | 30 ++-----------------
16 files changed, 225 insertions(+), 289 deletions(-)
--
2.34.1
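The benefit of device-managed clock handling described in the cover letter above can be sketched with a small userspace model. This is an illustration under stated assumptions, not the series' code: `struct model_clk`, `struct model_dev`, `model_devm_clk_enable()` and `model_dev_release()` are hypothetical names that only mimic the idea of registering a cleanup action at enable time, so that teardown disables every clock without each exit path having to remember it.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_ACTIONS 8

/* Hypothetical clock: only tracks how many enables are outstanding. */
struct model_clk {
	int enable_count;
};

/* Hypothetical device that records clocks to disable on teardown. */
struct model_dev {
	size_t nr_managed;
	struct model_clk *managed[MODEL_MAX_ACTIONS];
};

/* Enable a clock and register it for automatic disable on teardown. */
int model_devm_clk_enable(struct model_dev *dev, struct model_clk *clk)
{
	assert(dev != NULL && clk != NULL);
	if (dev->nr_managed == MODEL_MAX_ACTIONS)
		return -1;
	clk->enable_count++;
	dev->managed[dev->nr_managed++] = clk;
	return 0;
}

/* Teardown: disable managed clocks in reverse order of registration. */
void model_dev_release(struct model_dev *dev)
{
	assert(dev != NULL);
	while (dev->nr_managed > 0) {
		struct model_clk *clk = dev->managed[--dev->nr_managed];

		clk->enable_count--;
	}
}
```

With this shape, a probe routine in the model would call `model_devm_clk_enable()` for pclk and atclk and could return on any error without explicit disable calls, because `model_dev_release()` balances every enable that was registered.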
The Trace Network On Chip (TNOC) is an integration hierarchy, a
hardware component that integrates the functionalities of TPDA and
funnels. It collects trace from subsystems and transfers it to the
CoreSight sink.
Signed-off-by: Yuanfang Zhang <quic_yuanfang(a)quicinc.com>
---
Changes in v3:
- Remove unnecessary sysfs nodes.
- update commit messages.
- Use 'writel' instead of 'write_relaxed' when writing to the register for the last time.
- Add trace_id ops.
- Link to v2: https://lore.kernel.org/r/20250226-trace-noc-driver-v2-0-8afc6584afc5@quici…
Changes in v2:
- Modified the format of DT binding file.
- Fix compile warnings.
- Link to v1: https://lore.kernel.org/r/46643089-b88d-49dc-be05-7bf0bb21f847@quicinc.com
---
Yuanfang Zhang (2):
dt-bindings: arm: Add device Trace Network On Chip definition
coresight: add coresight Trace Network On Chip driver
.../bindings/arm/qcom,coresight-tnoc.yaml | 111 ++++++++++++
drivers/hwtracing/coresight/Kconfig | 13 ++
drivers/hwtracing/coresight/Makefile | 1 +
drivers/hwtracing/coresight/coresight-tnoc.c | 186 +++++++++++++++++++++
drivers/hwtracing/coresight/coresight-tnoc.h | 34 ++++
5 files changed, 345 insertions(+)
---
base-commit: a2cc6ff5ec8f91bc463fd3b0c26b61166a07eb11
change-id: 20250403-trace-noc-f8286b30408e
Best regards,
--
Yuanfang Zhang <quic_yuanfang(a)quicinc.com>
The Trace Network On Chip (TNOC) is an integration hierarchy, a
hardware component that integrates the functionalities of TPDA and
funnels. It collects trace from subsystems and transfers it to the
CoreSight sink.
Signed-off-by: Yuanfang Zhang <quic_yuanfang(a)quicinc.com>
---
Changes in v4:
- Fix dt_binding warning.
- update mask of trace_noc amba_id.
- Modify driver comments.
- rename TRACE_NOC_SYN_VAL to TRACE_NOC_SYNC_INTERVAL.
- Link to v3: https://lore.kernel.org/r/20250411-trace-noc-v3-0-1f19ddf7699b@quicinc.com
Changes in v3:
- Remove unnecessary sysfs nodes.
- update commit messages.
- Use 'writel' instead of 'write_relaxed' when writing to the register for the last time.
- Add trace_id ops.
- Link to v2: https://lore.kernel.org/r/20250226-trace-noc-driver-v2-0-8afc6584afc5@quici…
Changes in v2:
- Modified the format of DT binding file.
- Fix compile warnings.
- Link to v1: https://lore.kernel.org/r/46643089-b88d-49dc-be05-7bf0bb21f847@quicinc.com
---
Yuanfang Zhang (2):
dt-bindings: arm: Add device Trace Network On Chip definition
coresight: add coresight Trace Network On Chip driver
.../bindings/arm/qcom,coresight-tnoc.yaml | 111 ++++++++++++
drivers/hwtracing/coresight/Kconfig | 13 ++
drivers/hwtracing/coresight/Makefile | 1 +
drivers/hwtracing/coresight/coresight-tnoc.c | 191 +++++++++++++++++++++
drivers/hwtracing/coresight/coresight-tnoc.h | 34 ++++
5 files changed, 350 insertions(+)
---
base-commit: a2cc6ff5ec8f91bc463fd3b0c26b61166a07eb11
change-id: 20250403-trace-noc-f8286b30408e
Best regards,
--
Yuanfang Zhang <quic_yuanfang(a)quicinc.com>
On 5/2/25 23:00, Yabin Cui wrote:
> On Fri, May 2, 2025 at 3:51 AM Anshuman Khandual
> <anshuman.khandual(a)arm.com> wrote:
>>
>> On 5/2/25 01:05, Yabin Cui wrote:
>>> perf always allocates contiguous AUX pages based on aux_watermark.
>>> However, this contiguous allocation doesn't benefit all PMUs. For
>>> instance, ARM SPE and TRBE operate with virtual pages, and Coresight
>>> ETR allocates a separate buffer. For these PMUs, allocating contiguous
>>> AUX pages unnecessarily exacerbates memory fragmentation. This
>>> fragmentation can prevent their use on long-running devices.
>>>
>>> This patch modifies the perf driver to be memory-friendly by default,
>>> by allocating non-contiguous AUX pages. For PMUs requiring contiguous
>>> pages (Intel BTS and some Intel PT), the existing
>>> PERF_PMU_CAP_AUX_NO_SG capability can be used. For PMUs that don't
>>> require but can benefit from contiguous pages (some Intel PT), a new
>>> capability, PERF_PMU_CAP_AUX_PREFER_LARGE, is added to maintain their
>>> existing behavior.
>>>
>>> Signed-off-by: Yabin Cui <yabinc(a)google.com>
>>> ---
>>> Changes since v2:
>>> Let NO_SG imply PREFER_LARGE. So PMUs don't need to set both flags.
>>> Then the only place needing PREFER_LARGE is intel/pt.c.
>>>
>>> Changes since v1:
>>> In v1, default is preferring contiguous pages, and add a flag to
>>> allocate non-contiguous pages. In v2, default is allocating
>>> non-contiguous pages, and add a flag to prefer contiguous pages.
>>>
>>> v1 patchset:
>>> perf,coresight: Reduce fragmentation with non-contiguous AUX pages for
>>> cs_etm
>>>
>>> arch/x86/events/intel/pt.c | 2 ++
>>> include/linux/perf_event.h | 1 +
>>> kernel/events/ring_buffer.c | 20 +++++++++++++-------
>>> 3 files changed, 16 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
>>> index fa37565f6418..25ead919fc48 100644
>>> --- a/arch/x86/events/intel/pt.c
>>> +++ b/arch/x86/events/intel/pt.c
>>> @@ -1863,6 +1863,8 @@ static __init int pt_init(void)
>>>
>>> if (!intel_pt_validate_hw_cap(PT_CAP_topa_multiple_entries))
>>> pt_pmu.pmu.capabilities = PERF_PMU_CAP_AUX_NO_SG;
>>> + else
>>> + pt_pmu.pmu.capabilities = PERF_PMU_CAP_AUX_PREFER_LARGE;
>>>
>>> pt_pmu.pmu.capabilities |= PERF_PMU_CAP_EXCLUSIVE |
>>> PERF_PMU_CAP_ITRACE |
>>
>> Why does this PMU have the PERF_PMU_CAP_AUX_PREFER_LARGE fallback option
>> but not the other PMU in arch/x86/events/intel/bts.c, even though both
>> had PERF_PMU_CAP_AUX_NO_SG?
>
> Because Intel BTS always uses NO_SG, while in some cases Intel PT
> doesn't use NO_SG.
Makes sense.
>>
>>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>>> index 0069ba6866a4..56d77348c511 100644
>>> --- a/include/linux/perf_event.h
>>> +++ b/include/linux/perf_event.h
>>> @@ -301,6 +301,7 @@ struct perf_event_pmu_context;
>>> #define PERF_PMU_CAP_AUX_OUTPUT 0x0080
>>> #define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100
>>> #define PERF_PMU_CAP_AUX_PAUSE 0x0200
>>> +#define PERF_PMU_CAP_AUX_PREFER_LARGE 0x0400
>>>
>>> /**
>>> * pmu::scope
>>> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
>>> index 5130b119d0ae..4d2f1c95673e 100644
>>> --- a/kernel/events/ring_buffer.c
>>> +++ b/kernel/events/ring_buffer.c
>>> @@ -679,7 +679,7 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
>>> {
>>> bool overwrite = !(flags & RING_BUFFER_WRITABLE);
>>> int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
>>> - int ret = -ENOMEM, max_order;
>>> + int ret = -ENOMEM, max_order = 0;
>>
>> 0 order is now the default allocation granularity. This might benefit
>> from a comment above explaining that max_order could change only with
>> PERF_PMU_CAP_AUX_NO_SG or PERF_PMU_CAP_AUX_PREFER_LARGE PMU flags etc.
>>
> Will add the comment in the next respin.
>>>
>>> if (!has_aux(event))
>>> return -EOPNOTSUPP;
>>> @@ -689,8 +689,8 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
>>>
>>> if (!overwrite) {
>>> /*
>>> - * Watermark defaults to half the buffer, and so does the
>>> - * max_order, to aid PMU drivers in double buffering.
>>> + * Watermark defaults to half the buffer, to aid PMU drivers
>>> + * in double buffering.
>>> */
>>> if (!watermark)
>>> watermark = min_t(unsigned long,
>>> @@ -698,16 +698,22 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
>>> (unsigned long)nr_pages << (PAGE_SHIFT - 1));
>>>
>>> /*
>>> - * Use aux_watermark as the basis for chunking to
>>> + * For PMUs that need or prefer large contiguous buffers,
>>> + * use aux_watermark as the basis for chunking to
>>> * help PMU drivers honor the watermark.
>>> */
>>> - max_order = get_order(watermark);
>>> + if (event->pmu->capabilities & (PERF_PMU_CAP_AUX_NO_SG |
>>> + PERF_PMU_CAP_AUX_PREFER_LARGE))
>>> + max_order = get_order(watermark);
>>> } else {
>>> /*
>>> - * We need to start with the max_order that fits in nr_pages,
>>> + * For PMUs that need or prefer large contiguous buffers,
>>> + * we need to start with the max_order that fits in nr_pages,
>>> * not the other way around, hence ilog2() and not get_order.
>>> */
>>> - max_order = ilog2(nr_pages);
>>> + if (event->pmu->capabilities & (PERF_PMU_CAP_AUX_NO_SG |
>>> + PERF_PMU_CAP_AUX_PREFER_LARGE))
>>> + max_order = ilog2(nr_pages);
>>> watermark = 0;
>>> }
>>>
>>
>> Although not really sure, could event->pmu->capabilities check against the ORed
>> PMU flags PERF_PMU_CAP_AUX_NO_SG and PERF_PMU_CAP_AUX_PREFER_LARGE be contained
>> in a helper pmu_prefers_cont_alloc(struct *pmu ...) or something similar ?
>
> Sure, but I feel it's not very worthwhile. Maybe add a local variable
> use_contiguous_pages? It can also work as another comment near
> max_order.
Probably that will be better.
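The local-variable idea discussed above can be sketched as a userspace model of the max_order selection. This is an illustration under stated assumptions, not the patch's final form: the `model_*` names and `MODEL_*` constants are hypothetical, 4 KiB pages are assumed, and the model leaves out any clamping of the default watermark. It shows a single `use_contiguous_pages` flag deciding whether max_order stays at 0 or is derived from the watermark or from nr_pages.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SHIFT            12      /* assumes 4 KiB pages */
#define MODEL_CAP_AUX_NO_SG         0x0004  /* illustrative flag value */
#define MODEL_CAP_AUX_PREFER_LARGE  0x0400  /* illustrative flag value */

/* Floor of log2(n) for n > 0. */
int model_ilog2(unsigned long n)
{
	int order = -1;

	assert(n > 0);
	while (n) {
		n >>= 1;
		order++;
	}
	return order;
}

/* Smallest order such that (1 << order) pages cover size bytes. */
int model_get_order(unsigned long size)
{
	unsigned long pages;
	int order = 0;

	assert(size > 0);
	pages = (size + (1UL << MODEL_PAGE_SHIFT) - 1) >> MODEL_PAGE_SHIFT;
	while ((1UL << order) < pages)
		order++;
	return order;
}

/* Pick the largest allocation order for the AUX buffer in this model. */
int model_aux_max_order(unsigned int caps, bool overwrite,
			unsigned long nr_pages, unsigned long watermark)
{
	/* Only PMUs that need or prefer contiguous pages get order > 0. */
	bool use_contiguous_pages =
		(caps & (MODEL_CAP_AUX_NO_SG | MODEL_CAP_AUX_PREFER_LARGE)) != 0;

	if (!use_contiguous_pages)
		return 0;
	if (overwrite)
		return model_ilog2(nr_pages);
	/* Watermark defaults to half the buffer, expressed in bytes. */
	if (!watermark)
		watermark = nr_pages << (MODEL_PAGE_SHIFT - 1);
	return model_get_order(watermark);
}
```

Naming the condition once also documents why 0 is the default order, which is the point raised about adding a comment near max_order.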
On 5/2/25 01:05, Yabin Cui wrote:
> perf always allocates contiguous AUX pages based on aux_watermark.
> However, this contiguous allocation doesn't benefit all PMUs. For
> instance, ARM SPE and TRBE operate with virtual pages, and Coresight
> ETR allocates a separate buffer. For these PMUs, allocating contiguous
> AUX pages unnecessarily exacerbates memory fragmentation. This
> fragmentation can prevent their use on long-running devices.
>
> This patch modifies the perf driver to be memory-friendly by default,
> by allocating non-contiguous AUX pages. For PMUs requiring contiguous
> pages (Intel BTS and some Intel PT), the existing
> PERF_PMU_CAP_AUX_NO_SG capability can be used. For PMUs that don't
> require but can benefit from contiguous pages (some Intel PT), a new
> capability, PERF_PMU_CAP_AUX_PREFER_LARGE, is added to maintain their
> existing behavior.
>
> Signed-off-by: Yabin Cui <yabinc(a)google.com>
> ---
> Changes since v2:
> Let NO_SG imply PREFER_LARGE. So PMUs don't need to set both flags.
> Then the only place needing PREFER_LARGE is intel/pt.c.
>
> Changes since v1:
> In v1, default is preferring contiguous pages, and add a flag to
> allocate non-contiguous pages. In v2, default is allocating
> non-contiguous pages, and add a flag to prefer contiguous pages.
>
> v1 patchset:
> perf,coresight: Reduce fragmentation with non-contiguous AUX pages for
> cs_etm
>
> arch/x86/events/intel/pt.c | 2 ++
> include/linux/perf_event.h | 1 +
> kernel/events/ring_buffer.c | 20 +++++++++++++-------
> 3 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/events/intel/pt.c b/arch/x86/events/intel/pt.c
> index fa37565f6418..25ead919fc48 100644
> --- a/arch/x86/events/intel/pt.c
> +++ b/arch/x86/events/intel/pt.c
> @@ -1863,6 +1863,8 @@ static __init int pt_init(void)
>
> if (!intel_pt_validate_hw_cap(PT_CAP_topa_multiple_entries))
> pt_pmu.pmu.capabilities = PERF_PMU_CAP_AUX_NO_SG;
> + else
> + pt_pmu.pmu.capabilities = PERF_PMU_CAP_AUX_PREFER_LARGE;
>
> pt_pmu.pmu.capabilities |= PERF_PMU_CAP_EXCLUSIVE |
> PERF_PMU_CAP_ITRACE |
Why does this PMU have the PERF_PMU_CAP_AUX_PREFER_LARGE fallback option
but not the other PMU in arch/x86/events/intel/bts.c, even though both
had PERF_PMU_CAP_AUX_NO_SG?
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 0069ba6866a4..56d77348c511 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -301,6 +301,7 @@ struct perf_event_pmu_context;
> #define PERF_PMU_CAP_AUX_OUTPUT 0x0080
> #define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100
> #define PERF_PMU_CAP_AUX_PAUSE 0x0200
> +#define PERF_PMU_CAP_AUX_PREFER_LARGE 0x0400
>
> /**
> * pmu::scope
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 5130b119d0ae..4d2f1c95673e 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -679,7 +679,7 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
> {
> bool overwrite = !(flags & RING_BUFFER_WRITABLE);
> int node = (event->cpu == -1) ? -1 : cpu_to_node(event->cpu);
> - int ret = -ENOMEM, max_order;
> + int ret = -ENOMEM, max_order = 0;
0 order is now the default allocation granularity. This might benefit
from a comment above explaining that max_order could change only with
PERF_PMU_CAP_AUX_NO_SG or PERF_PMU_CAP_AUX_PREFER_LARGE PMU flags etc.
>
> if (!has_aux(event))
> return -EOPNOTSUPP;
> @@ -689,8 +689,8 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
>
> if (!overwrite) {
> /*
> - * Watermark defaults to half the buffer, and so does the
> - * max_order, to aid PMU drivers in double buffering.
> + * Watermark defaults to half the buffer, to aid PMU drivers
> + * in double buffering.
> */
> if (!watermark)
> watermark = min_t(unsigned long,
> @@ -698,16 +698,22 @@ int rb_alloc_aux(struct perf_buffer *rb, struct perf_event *event,
> (unsigned long)nr_pages << (PAGE_SHIFT - 1));
>
> /*
> - * Use aux_watermark as the basis for chunking to
> + * For PMUs that need or prefer large contiguous buffers,
> + * use aux_watermark as the basis for chunking to
> * help PMU drivers honor the watermark.
> */
> - max_order = get_order(watermark);
> + if (event->pmu->capabilities & (PERF_PMU_CAP_AUX_NO_SG |
> + PERF_PMU_CAP_AUX_PREFER_LARGE))
> + max_order = get_order(watermark);
> } else {
> /*
> - * We need to start with the max_order that fits in nr_pages,
> + * For PMUs that need or prefer large contiguous buffers,
> + * we need to start with the max_order that fits in nr_pages,
> * not the other way around, hence ilog2() and not get_order.
> */
> - max_order = ilog2(nr_pages);
> + if (event->pmu->capabilities & (PERF_PMU_CAP_AUX_NO_SG |
> + PERF_PMU_CAP_AUX_PREFER_LARGE))
> + max_order = ilog2(nr_pages);
> watermark = 0;
> }
>
Although not really sure, could event->pmu->capabilities check against the ORed
PMU flags PERF_PMU_CAP_AUX_NO_SG and PERF_PMU_CAP_AUX_PREFER_LARGE be contained
in a helper pmu_prefers_cont_alloc(struct *pmu ...) or something similar ?