Hi Tao
Are there no sinks at all on this platform ? I had this question on the
previous series. How is CoreSight useful on this platform otherwise ?
On 13/09/2021 07:40, Tao Zhang wrote:
> Add the basic CoreSight components found on the Qualcomm SM8250 SoC.
> The basic CoreSight components include the ETF, ETMs, STM and the
> related funnels.
>
> Signed-off-by: Tao Zhang <quic_taozha(a)quicinc.com>
> ---
> arch/arm64/boot/dts/qcom/qrb5165-rb5.dts | 442 ++++++++++++++++++++++-
> 1 file changed, 438 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/qcom/qrb5165-rb5.dts b/arch/arm64/boot/dts/qcom/qrb5165-rb5.dts
> index 8ac96f8e79d4..9c8f87d80afc 100644
> --- a/arch/arm64/boot/dts/qcom/qrb5165-rb5.dts
> +++ b/arch/arm64/boot/dts/qcom/qrb5165-rb5.dts
> @@ -222,11 +222,445 @@
> regulator-max-microvolt = <1800000>;
> regulator-always-on;
> };
> -};
>
> -&adsp {
> - status = "okay";
> - firmware-name = "qcom/sm8250/adsp.mbn";
Unrelated change ? Please keep it separate from the CoreSight changes.
Suzuki
Hi Tao
On 13/09/2021 07:40, Tao Zhang wrote:
> This series adds CoreSight support for the SM8250 SoC on the RB5 board.
> It is composed of two elements.
> a) Add ETM PID for Kryo-5XX.
> b) Add CoreSight support to DTS for RB5.
>
> This series applies to coresight/next
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>
Please could you mention what has changed since the previous version
in the cover letter ?
Kind regards
Suzuki
> Tao Zhang (2):
> coresight: etm4x: Add ETM PID for Kryo-5XX
> arm64: dts: qcom: sm8250: Add Coresight support
>
> arch/arm64/boot/dts/qcom/qrb5165-rb5.dts | 442 +++++++++++++++++-
> .../coresight/coresight-etm4x-core.c | 1 +
> 2 files changed, 439 insertions(+), 4 deletions(-)
>
Debugfs is nice and so are module parameters, but
* debugfs doesn't take effect early (e.g., if drivers are locking up
before user space gets anywhere)
* module parameters either add a lot to the kernel command line, or
else take effect late as well (if you build =m and configure in
/etc/modprobe.d/)
So in the same spirit as these
CONFIG_PANIC_ON_OOPS (also available via cmdline or modparam)
CONFIG_INTEL_IOMMU_DEFAULT_ON (also available via cmdline)
add a new Kconfig option.
Module parameters and debugfs can still override.
Signed-off-by: Brian Norris <briannorris(a)chromium.org>
---
drivers/hwtracing/coresight/Kconfig | 13 +++++++++++++
drivers/hwtracing/coresight/coresight-cpu-debug.c | 2 +-
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/drivers/hwtracing/coresight/Kconfig b/drivers/hwtracing/coresight/Kconfig
index f026e5c0e777..8b638eb3cb7d 100644
--- a/drivers/hwtracing/coresight/Kconfig
+++ b/drivers/hwtracing/coresight/Kconfig
@@ -150,6 +150,19 @@ config CORESIGHT_CPU_DEBUG
To compile this driver as a module, choose M here: the
module will be called coresight-cpu-debug.
+config CORESIGHT_CPU_DEBUG_DEFAULT_ON
+ bool "Enable CoreSight CPU Debug by default"
+ depends on CORESIGHT_CPU_DEBUG
+ help
+ Say Y here to enable the CoreSight Debug panic-debug by default. This
+ can also be enabled via debugfs, but this ensures the debug feature
+ is enabled as early as possible.
+
+ Has the same effect as setting coresight_cpu_debug.enable=1 on the
+ kernel command line.
+
+ Say N if unsure.
+
config CORESIGHT_CTI
tristate "CoreSight Cross Trigger Interface (CTI) driver"
depends on ARM || ARM64
diff --git a/drivers/hwtracing/coresight/coresight-cpu-debug.c b/drivers/hwtracing/coresight/coresight-cpu-debug.c
index 00de46565bc4..8845ec4b4402 100644
--- a/drivers/hwtracing/coresight/coresight-cpu-debug.c
+++ b/drivers/hwtracing/coresight/coresight-cpu-debug.c
@@ -105,7 +105,7 @@ static DEFINE_PER_CPU(struct debug_drvdata *, debug_drvdata);
static int debug_count;
static struct dentry *debug_debugfs_dir;
-static bool debug_enable;
+static bool debug_enable = IS_ENABLED(CONFIG_CORESIGHT_CPU_DEBUG_DEFAULT_ON);
module_param_named(enable, debug_enable, bool, 0600);
MODULE_PARM_DESC(enable, "Control to enable coresight CPU debug functionality");
--
2.33.0.153.gba50c8fa24-goog
The AUX bounce buffer is allocated with the API dma_alloc_coherent()
in the low-level architecture code; e.g. on Arm64, the memory is
mapped with the attribute "Normal non-cacheable", as can be concluded
from the definition of pgprot_dmacoherent() in
arch/arm64/include/asm/pgtable.h.
Later, when the AUX bounce buffer is accessed, the non-cacheable
mapping makes the accesses inefficient, since every load instruction
must go all the way out to DRAM.
This patch changes the allocation to dma_alloc_noncoherent(), so the
driver can access the memory via a cacheable mapping; load
instructions can then fetch data from cache lines rather than always
reading from DRAM, which boosts the driver's memory performance. With
the cacheable mapping, the driver uses dma_sync_single_for_cpu() to
invalidate the cache lines prior to reading the bounce buffer, so it
avoids reading stale trace data.
Measuring the duration of tmc_update_etr_buffer() with the ftrace
function_graph tracer shows a significant performance improvement for
copying 4MiB of data from the bounce buffer:
# echo tmc_etr_get_data_flat_buf > set_graph_notrace // avoid noise
# echo tmc_update_etr_buffer > set_graph_function
# echo function_graph > current_tracer
before:
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) | tmc_update_etr_buffer() {
...
2) # 8148.320 us | }
after:
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) | tmc_update_etr_buffer() {
...
2) # 2525.420 us | }
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
---
Changes from v3:
Refined change to use dma_alloc_noncoherent()/dma_free_noncoherent()
(Robin Murphy);
Retested functionality and performance on Juno-r2 board.
Changes from v2:
Sync the entire buffer in one go when the tracing is wrap around
(Suzuki);
Add Suzuki's review tag.
.../hwtracing/coresight/coresight-tmc-etr.c | 26 ++++++++++++++++---
1 file changed, 22 insertions(+), 4 deletions(-)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
index acdb59e0e661..a049b525a274 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
@@ -609,8 +609,9 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
if (!flat_buf)
return -ENOMEM;
- flat_buf->vaddr = dma_alloc_coherent(real_dev, etr_buf->size,
- &flat_buf->daddr, GFP_KERNEL);
+ flat_buf->vaddr = dma_alloc_noncoherent(real_dev, etr_buf->size,
+ &flat_buf->daddr,
+ DMA_FROM_DEVICE, GFP_KERNEL);
if (!flat_buf->vaddr) {
kfree(flat_buf);
return -ENOMEM;
@@ -631,14 +632,18 @@ static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
if (flat_buf && flat_buf->daddr) {
struct device *real_dev = flat_buf->dev->parent;
- dma_free_coherent(real_dev, flat_buf->size,
- flat_buf->vaddr, flat_buf->daddr);
+ dma_free_noncoherent(real_dev, etr_buf->size,
+ flat_buf->vaddr, flat_buf->daddr,
+ DMA_FROM_DEVICE);
}
kfree(flat_buf);
}
static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
{
+ struct etr_flat_buf *flat_buf = etr_buf->private;
+ struct device *real_dev = flat_buf->dev->parent;
+
/*
* Adjust the buffer to point to the beginning of the trace data
* and update the available trace data.
@@ -648,6 +653,19 @@ static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
etr_buf->len = etr_buf->size;
else
etr_buf->len = rwp - rrp;
+
+ /*
+ * The driver always starts tracing at the beginning of the buffer,
+ * the only reason why we would get a wrap around is when the buffer
+ * is full. Sync the entire buffer in one go for this case.
+ */
+ if (etr_buf->offset + etr_buf->len > etr_buf->size)
+ dma_sync_single_for_cpu(real_dev, flat_buf->daddr,
+ etr_buf->size, DMA_FROM_DEVICE);
+ else
+ dma_sync_single_for_cpu(real_dev,
+ flat_buf->daddr + etr_buf->offset,
+ etr_buf->len, DMA_FROM_DEVICE);
}
static ssize_t tmc_etr_get_data_flat_buf(struct etr_buf *etr_buf,
--
2.25.1
This patch series fixes the snapshot mode in the CoreSight driver.
The first patch simplifies the head pointer handling for the AUX
buffer; the second patch updates the driver comments to reflect the
latest status of the perf tool.
Changes from v3:
- Made the comments generic in CoreSight drivers (Suzuki).
Changes from v2:
- Minor improvements to the commit messages for patches 01 and 02.
Leo Yan (2):
coresight: tmc-etr: Use perf_output_handle::head for AUX ring buffer
coresight: Update comments for removing cs_etm_find_snapshot()
drivers/hwtracing/coresight/coresight-etb10.c | 5 ++---
drivers/hwtracing/coresight/coresight-tmc-etf.c | 5 ++---
drivers/hwtracing/coresight/coresight-tmc-etr.c | 15 +++++----------
3 files changed, 9 insertions(+), 16 deletions(-)
--
2.25.1
Hi Peter,
On Tue, Sep 14, 2021 at 11:51:36AM +0200, Peter Zijlstra wrote:
> On Sun, Aug 29, 2021 at 06:56:57PM +0800, Leo Yan wrote:
> > Hi Peter, or any x86 maintainer,
> >
> > On Mon, Aug 09, 2021 at 07:14:02PM +0800, Leo Yan wrote:
> > > Since BTS is coherent, simply add a compiler barrier to separate the BTS
> > > update and aux_head store.
> >
> > Could you review this patch and check if BTS needs the compiler
> > barrier in this case? Thanks.
>
> Yes, a compiler barrier is sufficient.
>
> You want me to pick it up?
Maybe other maintainers are more suitable than me to answer this :)
Yeah, it would be great if you could pick it up.
Thanks,
Leo
This patch series refines the memory barriers for the AUX ring buffer.
Patches 01 ~ 04 address the barrier usage in the kernel. The first
patch adds a clear comment on how to use the barriers between the
data store and the aux_head store; this asks the driver to make sure
the data is visible. Patches 02 ~ 04 refine the drivers' barriers
after the data store.
Patch 05 uses WRITE_ONCE() for updating aux_tail.
Patches 06 ~ 09 drop the legacy __sync functions, clean up duplicate
code, and remove the build and feature test now that
SYNC_COMPARE_AND_SWAP is no longer used.
For easier review and clearer patch organization, compared with the
previous patch series, the patches supporting compat mode for AUX
trace have been left out and will be sent out as a separate patch
set.
This patch set has been tested on the Arm64 Juno platform.
Changes from v4:
- Refined comment for CoreSight ETR/ETF drivers (Suzuki/Peter);
- Changed to use compiler barrier for BTS (mentioned by Peter, but have
not received response from Intel developer);
- Refined the coding style for patch 07 (Adrian).
Changes from v3:
- Removed the inappropriate paragraph in the commit log for patch "perf
auxtrace: Drop legacy __sync functions" (Adrian);
- Added new patch to remove feature-sync-compare-and-swap test (Adrian);
- The patch for "perf auxtrace: Use WRITE_ONCE() for updating aux_tail"
is a standalone and simple change, so it was moved ahead in the patch
set for better ordering;
- Minor improvement for commit logs in the last two patches.
Changes from v2:
- Removed auxtrace_mmap__read_snapshot_head(), which has the duplicated
code with auxtrace_mmap__read_head();
- Cleaned up the build for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT (Adrian);
- Added global variable "kernel_is_64_bit" (Adrian);
- Added compat variants compat_auxtrace_mmap__{read_head|write_tail}
(Adrian).
Leo Yan (9):
perf/ring_buffer: Add comment for barriers on AUX ring buffer
coresight: tmc-etr: Add barrier after updating AUX ring buffer
coresight: tmc-etf: Add comment for store ordering
perf/x86: Add compiler barrier after updating BTS
perf auxtrace: Use WRITE_ONCE() for updating aux_tail
perf auxtrace: Drop legacy __sync functions
perf auxtrace: Remove auxtrace_mmap__read_snapshot_head()
perf: Cleanup for HAVE_SYNC_COMPARE_AND_SWAP_SUPPORT
tools: Remove feature-sync-compare-and-swap feature detection
arch/x86/events/intel/bts.c | 6 ++++
.../hwtracing/coresight/coresight-tmc-etf.c | 5 +++
.../hwtracing/coresight/coresight-tmc-etr.c | 8 +++++
kernel/events/ring_buffer.c | 9 ++++++
tools/build/Makefile.feature | 1 -
tools/build/feature/Makefile | 4 ---
tools/build/feature/test-all.c | 4 ---
.../feature/test-sync-compare-and-swap.c | 15 ---------
tools/perf/Makefile.config | 4 ---
tools/perf/util/auxtrace.c | 18 +++--------
tools/perf/util/auxtrace.h | 31 +------------------
11 files changed, 34 insertions(+), 71 deletions(-)
delete mode 100644 tools/build/feature/test-sync-compare-and-swap.c
--
2.25.1
On 09/09/2021 05:21, Anshuman Khandual wrote:
>
>
> On 9/7/21 3:28 PM, Suzuki K Poulose wrote:
>> On 03/08/2021 11:25, Anshuman Khandual wrote:
>>>
>>>
>>> On 7/28/21 7:22 PM, Suzuki K Poulose wrote:
>>>> ARM Neoverse-N2 (#2139208) and Cortex-A710 (#2119858) suffer
>>>> from an erratum which, when triggered, might cause the TRBE to
>>>> overwrite the trace data already collected in FILL mode, in the
>>>> event of a WRAP. i.e., the TRBE doesn't stop writing the data,
>>>> but instead wraps to the base and could write up to 3 cache
>>>> lines' worth of trace. Thus, this could corrupt the trace at the
>>>> "BASE" pointer.
>>>>
>>>> The workaround is to program the write pointer 256bytes from
>>>> the base, such that if the erratum is triggered, it doesn't
>>>> overwrite the trace data that was captured. This skipped region
>>>> could be padded with ignore packets at the end of the session,
>>>> so that the decoder sees a continuous buffer with some padding
>>>> at the beginning. The trace data written at the base is
>>>> considered lost as the limit could have been in the middle of
>>>> the perf ring buffer, and jumping to the "base" is not
>>>> acceptable.
>>>
>>> If the write pointer is guaranteed to have always started beyond
>>> [base + 256] and then wraps around. Why cannot the initial trace
>>> at [base .. base + 256] not to be considered for perf ?
>>
>> Remember that we could be dealing with a region within the larger
>> ring buffer.
>>
>> Ring buffer
>>
>>            | write (head + 256)
>>            v
>> \---------xxxxx----------------------------------/
>>           ^                                 ^    ^
>>           head                          Limit    |
>>                                      Ring buffer end
>>
>> In this case, you are talking about the area marked as "xxx", which
>> ideally should be appearing just after Limit, in the normal course
>> of the trace without the erratum.
>
> Right.
>
>>
>> Thus leaving the data at [base.. base+256] should be ideally moved
>> to the end (after limit ptr). And :
>>
>> 1) We don't always have space after the "limit" in the ring
>> buffer. 2) We don't know if the erratum was triggered at all, under
>> the micro architectural conditions.
>>
>> So, it is ideal to live with this ignoring the trace for now.
>
> But could not we just do something like this instead.
>
> - While re-configuring and before starting the trace, pad the entire
> area i.e [base .. base + 256] with ignore packets.
>
Ok, we do that now.
> - After an wrap event, push [base .. base + 256] into the perf ring
> buffer via perf_aux_output_begin()/end() just after normal trace data
> capture (perf ring buffer indices would have got adjusted to accept
> some new buffer data).
How ? The TRBE *is* writing to the perf ring buffer. When the erratum
is triggered, the trace data written at [base ... base + 256] was
supposed to be written to [Limit .. Limit + 256] by the TRBE. So
if at all we want to push those 256 bytes into the ring buffer, we
have to *move* the 256 bytes from "base" to the "Limit" (and pad the
buffer at [base...base + 256]). Now, the Limit was chosen based on
the available space in the ring buffer (or the wakeup), and thus we
may not have space beyond the "Limit" (i.e., outside head..head+size).
>
> Afterwards, if the erratum is triggered, real trace is captured. But
> otherwise, there is always ignore packets for the decoder to parse.
The erratum is only triggered on a WRAP; what you describe above is
already combined with the step above. Userspace dealing with 256
bytes of Ignore packets for each record is still acceptable, given
the benefit of being able to use the TRBE on such affected systems.
>>
>>>
>>>> We set the flags already to indicate that some amount of trace
>>>> was lost during the FILL event IRQ. So this is fine.
>>>
>>> Right, from perf data flag point of view it is not a problem. But
>>> I am still wondering why cannot the trace at the base can not be
>>> consumed by perf.
>>>
>>
>> Please see above.
>>
>>>>
>>>> One important change with the work around is, we program the
>>>> TRBBASER_EL1 to current page where we are allowed to write.
>>>> Otherwise, it could overwrite a region that may be consumed by
>>>> the perf. Towards this, we always make sure that the
>>>
>>> While enabling TRBE after required reconfiguration, this will
>>> cause both TRBBASER_EL1 and TRBPTR_EL1 to change each time around
>>> (basically TRBBASER_EL1 follows handle->head) ?
>>
>> Yes. This is to ensure that the TRBE doesn't corrupt the "trace
>> data" at the "start" of the actual ring buffer (not the area that
>> we are allowed to write). In a nutshell, TRBE is only allowed to
>> write within the area [head ... head + size].
>
> Okay.
>
>>
>>>
>>>> "handle->head" and thus the trbe_write is PAGE_SIZE aligned, so
>>>> that we can set the BASE to the PAGE base and move the TRBPTR
>>>> to the 256bytes offset.
>>>
>>> Now that base needs to follow handle->head, both needs to be
>>> PAGE_SIZE aligned (including the write pointer) because base
>>> pointer needs to be PAGE_SIZE aligned ?
>>
>> Base must always be page aligned for TRBE. Now, with the erratum,
>> the Base must be within [ head ... head + size ]. Thus we cannot
>> put TRBPTR (write ptr) outside [TRBBASER ... TRBLIMITR], the write
>> ptr follows the base and it is shifted by 256byte to allow the
>> scratch space for overwrite, when the erratum triggers.
>
> Okay.
>
>>
>>>
>>>>
>>>> Cc: Mike Leach <mike.leach(a)linaro.org> Cc: Mathieu Poirier
>>>> <mathieu.poirier(a)linaro.org> Cc: Anshuman Khandual
>>>> <anshuman.khandual(a)arm.com> Cc: Leo Yan <leo.yan(a)linaro.org>
>>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose(a)arm.com> ---
>>>> drivers/hwtracing/coresight/coresight-trbe.c | 111
>>>> +++++++++++++++++-- 1 file changed, 102 insertions(+), 9
>>>> deletions(-)
>>>>
>>>> diff --git a/drivers/hwtracing/coresight/coresight-trbe.c
>>>> b/drivers/hwtracing/coresight/coresight-trbe.c index
>>>> 9ea28813182b..cd997ed5d918 100644 ---
>>>> a/drivers/hwtracing/coresight/coresight-trbe.c +++
>>>> b/drivers/hwtracing/coresight/coresight-trbe.c @@ -17,6 +17,7
>>>> @@ #include <asm/barrier.h> #include <asm/cputype.h> +#include
>>>> <asm/cpufeature.h> #include "coresight-self-hosted-trace.h"
>>>> #include "coresight-trbe.h" @@ -84,9 +85,17 @@ struct trbe_buf
>>>> { * per TRBE instance, we keep track of the list of errata
>>>> that * affects the given instance of the TRBE. */ -#define
>>>> TRBE_ERRATA_MAX 0 +#define
>>>> TRBE_WORKAROUND_OVERWRITE_FILL_MODE 0 +#define
>>>> TRBE_ERRATA_MAX 1
>>>
>>> Although I am not sure how to achieve it, TRBE_ERRATA_MAX should
>>> be derived from errata cpucap start and end markers which
>>> identify all TRBE specific erratas. The hard coding above could
>>> be avoided.
>>
>> This lets us hand pick the TRBE errata and avoid having to do
>> something you say is hard to achieve. What do you think is
>> problematic with this approach ?
>
> No problem as such.
>
> Now that we know the arm64 errata are not linear and might also
> have non-TRBE errata in-between their range, there might not be
> any ARM64_TRBE_ERRATA_START and ARM64_TRBE_ERRATA_END markers, which
> I was hoping for. The original thought was that those potential
> markers could have helped derive TRBE_ERRATA_MAX automatically.
>
>>
>>>
>>>> + +/* + * Safe limit for the number of bytes that may be
>>>> overwritten + * when the erratum is triggered.
>>>
>>> Specify which errata ?
>>
>> There are two distinct CPU errata. I will try to make it a bit
>> more clear.
>>
>>>
>>>> + */ +#define TRBE_WORKAROUND_OVERWRITE_FILL_MODE_SKIP 256
>>>
>>> Needs to have _BYTES. TRBE_ERRATA_OVERWRITE_FILL_MODE_SKIP_BYTES
>>> ?
>>>
>>>> static unsigned long trbe_errata_cpucaps[TRBE_ERRATA_MAX] = { +
>>>> [TRBE_WORKAROUND_OVERWRITE_FILL_MODE] =
>>>> ARM64_WORKAROUND_TRBE_OVERWRITE_FILL_MODE,
>>>
>>> s/TRBE_WORKAROUND_OVERWRITE_FILL_MODE/TRBE_ERRATA_OVERWRITE_FILL_MODE
>>> instead ?
>>>
>>> ARM64_WORKAROUND_TRBE_XXX -----> TRBE_ERRATA_XXX
>>
>> I think WORKAROUND still makes it clear, and is inline with what we
>> follow in generic arm64 and I would prefer to keep that.
>
> Sure.
>
>>
>>>
>>>> }; /* @@ -531,10 +540,13 @@ static enum trbe_fault_action
>>>> trbe_get_fault_act(u64 trbsr) if ((ec == TRBE_EC_STAGE1_ABORT)
>>>> || (ec == TRBE_EC_STAGE2_ABORT)) return TRBE_FAULT_ACT_FATAL; -
>>>> if (is_trbe_wrap(trbsr) && (ec == TRBE_EC_OTHERS) && (bsc ==
>>>> TRBE_BSC_FILLED)) { - if (get_trbe_write_pointer() ==
>>>> get_trbe_base_pointer()) - return
>>>> TRBE_FAULT_ACT_WRAP; - } + /* + * It is not necessary
>>>> to verify the TRBPTR == TRBBASER to detect + * a FILL
>>>> event. Moreover, CPU errata could make this check invalid. +
>>>> */
>>>
>>> Why should the check be dropped for CPU instances which dont have
>>> the errata ?
>>
>> Good point. On the other hand, if the TRBPTR != TRBBASER, that
>> indicates a TRBE erratum. So, I don't see too much value in it. I
>> could bring that check back, bypassed by the erratum check.
>
> If (TRBPTR != TRBBASER) is a definitive indicator for the erratum's
> presence this could also be used for capturing [base .. base + 256]
> in the perf ring buffer as mentioned before.
>
>>
>>>
>>>> + if (is_trbe_wrap(trbsr) && (ec == TRBE_EC_OTHERS) && (bsc
>>>> == TRBE_BSC_FILLED)) + return TRBE_FAULT_ACT_WRAP; +
>>>> return TRBE_FAULT_ACT_SPURIOUS; } @@ -544,6 +556,7 @@ static
>>>> unsigned long trbe_get_trace_size(struct perf_output_handle
>>>> *handle, { u64 write; u64 start_off, end_off; + u64 size;
>>>> /* * If the TRBE has wrapped around the write pointer has @@
>>>> -559,7 +572,18 @@ static unsigned long
>>>> trbe_get_trace_size(struct perf_output_handle *handle, if
>>>> (WARN_ON_ONCE(end_off < start_off)) return 0; - return
>>>> (end_off - start_off); + + size = end_off - start_off; +
>>>> /* + * If the TRBE is affected by the following erratum, we
>>>> must fill + * the space we skipped with IGNORE packets. And
>>>> we are always + * guaranteed to have at least a PAGE_SIZE
>>>> space in the buffer. + */ + if
>>>> (trbe_has_erratum(TRBE_WORKAROUND_OVERWRITE_FILL_MODE,
>>>> buf->cpudata) && + !WARN_ON(size <
>>>> TRBE_WORKAROUND_OVERWRITE_FILL_MODE_SKIP))
>>>
>>> Why warn here ? The skid can some times be less than 256 bytes.
>>> OR because now write pointer alignment is PAGE_SIZE (from later
>>> code), size too needs to be always a PAGE_SIZE multiple ?
>>
>> Because we expect at least a page's worth of space before we run
>> (since BASE, WRITE and LIMIT are page aligned). If that guarantee
>> is broken, we have a serious problem with the size calculation
>> logic.
>
> Okay.
>
>>
>>>
>>>> + __trbe_pad_buf(buf, start_off,
>>>> TRBE_WORKAROUND_OVERWRITE_FILL_MODE_SKIP); + + return size;
>>>> } static void *arm_trbe_alloc_buffer(struct coresight_device
>>>> *csdev, @@ -704,20 +728,73 @@ static unsigned long
>>>> arm_trbe_update_buffer(struct coresight_device *csdev, return
>>>> size; } + +static int
>>>> trbe_apply_work_around_before_enable(struct trbe_buf *buf) +{ +
>>>> /* + * TRBE_WORKAROUND_OVERWRITE_FILL_MODE causes the TRBE
>>>> to overwrite a few cache + * line size from the
>>>> "TRBBASER_EL1" in the event of a "FILL". + * Thus, we could
>>>> loose some amount of the trace at the base. + * + * To
>>>> work around this: + * - Software must leave 256bytes of
>>>> space from the base, so that + * the trace collected now
>>>> is not overwritten. + * - Fill the first 256bytes with
>>>> IGNORE packets for the decoder + * to ignore at the end
>>>> of the session, so that the decoder ignores + * this
>>>> gap. + * + * This also means that, the TRBE driver must
>>>> set the TRBBASER_EL1 + * such that, when the erratum is
>>>> triggered, it doesn't overwrite + * the "area" outside the
>>>> area marked by (handle->head, +size). + * So, we make sure
>>>> that the handle->head is always PAGE aligned, + * by
>>>> tweaking the required alignment for the TRBE (trbe_align). +
>>>> * And when we enable the TRBE, + * + * - move the
>>>> TRBPTR_EL1 to 256bytes past the starting point. + * So
>>>> that any trace collected in this run is not overwritten. +
>>>> * + * - set the TRBBASER_EL1 to the original trbe_write.
>>>> This will + * ensure that, if the TRBE hits the
>>>> erratum, it would only + * write within the region
>>>> allowed for the TRBE. + * + * At the trace collection
>>>> time, we always pad the skipped bytes + * with IGNORE
>>>> packets to make sure the decoder doesn't see any + *
>>>> overwritten packets. + */ + if
>>>> (trbe_has_erratum(TRBE_WORKAROUND_OVERWRITE_FILL_MODE,
>>>> buf->cpudata)) { + if
>>>> (WARN_ON(!IS_ALIGNED(buf->trbe_write, PAGE_SIZE))) +
>>>> return -EINVAL; + buf->trbe_hw_base = buf->trbe_write; +
>>>> buf->trbe_write += TRBE_WORKAROUND_OVERWRITE_FILL_MODE_SKIP;
>>>
>>> Not sure if I really understood this correctly, but would not
>>> the above keep on reducing the usable buffer space for TRBE each
>>> time around ? BTW buffer adjustment complexity here requires a
>>> proper
>>
>> Yes, unfortunately, we have to make sure every time the TRBE is
>> turned ON, we leave this space. We always recalculate the offsets
>> anyway, before turning it ON and this additional step makes sure
>> that the WRITE ptr is moved 256bytes (and also the BASE is moved to
>> the "head").
>>
>> I don't think this adjustment changes much. It is only shifting
>> the write pointer within the region [head ... limit]. The
>> calculation of the head and the limit still remains the same.
>
> Okay, got it.
>
>>
>>> ASCII diagram based illustration like before.
>>>
>>>> + } + + return 0; +} + static int
>>>> __arm_trbe_enable(struct trbe_buf *buf, struct
>>>> perf_output_handle *handle) { + int ret = 0; +
>>>> buf->trbe_limit = compute_trbe_buffer_limit(handle);
>>>> buf->trbe_write = buf->trbe_base + PERF_IDX2OFF(handle->head,
>>>> buf); if (buf->trbe_limit == buf->trbe_base) { -
>>>> trbe_stop_and_truncate_event(handle); - return -ENOSPC;
>>>> + ret = -ENOSPC; + goto err; } /* Set the base of
>>>> the TRBE to the buffer base */ buf->trbe_hw_base =
>>>> buf->trbe_base; + + ret =
>>>> trbe_apply_work_around_before_enable(buf); + if (ret) +
>>>> goto err; + *this_cpu_ptr(buf->cpudata->drvdata->handle) =
>>>> handle; trbe_enable_hw(buf); return 0; +err: +
>>>> trbe_stop_and_truncate_event(handle); + return ret; } static
>>>> int arm_trbe_enable(struct coresight_device *csdev, u32 mode,
>>>> void *data) @@ -1003,7 +1080,23 @@ static void
>>>> arm_trbe_probe_cpu(void *info) pr_err("Unsupported alignment on
>>>> cpu %d\n", cpu); goto cpu_clear; } - cpudata->trbe_align =
>>>> cpudata->trbe_hw_align; + + /* + * If the TRBE is
>>>> affected by erratum TRBE_WORKAROUND_OVERWRITE_FILL_MODE,
>>>
>>> Better to address it with the original errata name i.e
>>> ARM64_WORKAROUND_TRBE_OVERWRITE_FILL_MODE instead. But not a big
>>> issue though.
>>
>> We deal with the TRBE_xx name everywhere, even for checking the
>> erratum, so using the other one is going to make the code harder
>> to read.
>
> Okay.
>
>>
>>>
>>>> + * we must always program the TBRPTR_EL1, 256bytes from a
>>>> page + * boundary, with TRBBASER_EL1 set to the page, to
>>>> prevent + * TRBE over-writing 256bytes at TRBBASER_EL1 on
>>>> FILL event. + * + * Thus make sure we always align our
>>>> write pointer to a PAGE_SIZE, + * which also guarantees
>>>> that we have at least a PAGE_SIZE space in + * the buffer
>>>> (TRBLIMITR is PAGE aligned) and thus we can skip + * the
>>>> required bytes at the base. + */
>>>
>>> Should not TRBPTR_EL1 be aligned to 256 bytes instead, as
>>> TRBBASER_EL1 is always at PAGE_SIZE boundary. Hence TRBPTR_EL1
>>> could never be in between base and [base + 256 bytes]. Why
>>> alignment needs to go all the way upto PAGE_SIZE instead ? OR am
>>> I missing something here ?
>>
>> Please see the diagram above. When the TRBE wraps, it jumps to the
>> TRBBASER, and if that is outside the region permitted for TRBE, i.e
>> [head... head + size], we risk corrupting data potentially consumed
>> by the userspace. So, we must make sure that the TRBBASER is within
>> the [head...head+size], which is why we update the hw_base
>> accordingly when we detect the erratum.
>
> Okay, got it.
>
>>
>>>
>>>> + if (trbe_has_erratum(TRBE_WORKAROUND_OVERWRITE_FILL_MODE,
>>>> cpudata)) + cpudata->trbe_align = PAGE_SIZE; + else +
>>>> cpudata->trbe_align = cpudata->trbe_hw_align; +
>>>> cpudata->trbe_flag = get_trbe_flag_update(trbidr); cpudata->cpu
>>>> = cpu; cpudata->drvdata = drvdata;
>>>>
>>>
>>> I will visit this patch again. Not sure if I really understand
>>> all this changes correctly enough.
>>
>> Please let me know if you have further questions.
>
> Sure, will do.
>
>>
>> Thanks for taking a look
>>
>> Suzuki
>>>
>>
On 03/08/2021 04:51, Anshuman Khandual wrote:
>
>
> On 8/2/21 3:05 PM, Marc Zyngier wrote:
>> On 2021-08-02 10:12, Anshuman Khandual wrote:
>>> On 7/29/21 4:11 PM, Suzuki K Poulose wrote:
>>>> On 29/07/2021 10:55, Marc Zyngier wrote:
>>>>> On Wed, 28 Jul 2021 14:52:17 +0100,
>>>>> Suzuki K Poulose <suzuki.poulose(a)arm.com>
>>
>> [...]
>>
>>>>>> + __tsb_csync(); \
>>>>>> + __tsb_csync(); \
>>>>>> + } else { \
>>>>>> + __tsb_csync(); \
>>>>>> + } \
>>>>>
>>>>> nit: You could keep one unconditional __tsb_csync().
>>>>
>>>> I thought about that, I was worried if the CPU expects them back to back
>>>> without any other instructions in between them. Thinking about it a bit
>>>> more, it doesn't look like that is the case. I will confirm this and
>>>> change it accordingly.
>>> But its a very subtle change which might be difficult to debug and blame
>>> later on, if indeed both the instructions need to be back to back. Seems
>>> like just better to leave this unchanged.
>>
>> Is that an actual requirement? Sounds like you want to find out
>> from the errata document.
>
> Sure, will get back on this.
>
>>
>> And if they actually need to be back to back, what ensures that
>> this is always called with interrupt disabled?
>>
>> You would also need to have them in the same asm block to avoid
>> the compiler reordering stuff.
>
> Agreed, both the above constructs will be required to make sure that
> the instructions will be executed consecutively (if required).
>
I checked this with our architects and it doesn't have to be in a
tight loop. The TSBs are meant to be used with the PE in Trace
prohibited regions (which they are for TRBE and the KVM nVHE stub,
with TRFCR_ELx cleared). As long as that is honored, we are fine. I
will update the patch.
Kind regards
Suzuki