When build perf tool with passing option 'CORESIGHT=1' explicitly, if
the feature test fails for library libopencsd, the build doesn't
complain the feature failure and continue to build the tool with
disabling the CoreSight feature insteadly.
This patch changes the building behaviour, when build perf tool with the
option 'CORESIGHT=1' and detect the failure for testing feature
libopencsd, the build process will be aborted and it shows the complaint
info.
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
---
tools/perf/Makefile.config | 2 ++
1 file changed, 2 insertions(+)
diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config
index 4a0d9a6defc7..5df79538486b 100644
--- a/tools/perf/Makefile.config
+++ b/tools/perf/Makefile.config
@@ -489,6 +489,8 @@ ifdef CORESIGHT
CFLAGS += -DCS_RAW_PACKED
endif
endif
+ else
+ dummy := $(error Error: No libopencsd library found or the version is not up-to-date. Please install recent libopencsd to build with COREISGHT=1)
endif
endif
--
2.25.1
Hi Tao,
On Thu, 19 Aug 2021 at 03:29, Tao Zhang <quic_taozha(a)quicinc.com> wrote:
>
> This series adds Coresight support for SM8250 Soc on RB5 board.
> It is composed of two elements.
> a) Add ETM PID for Kryo-5XX.
> b) Add coresight support to DTS for RB5.
>
> This series applies to coresight/next
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>
> Tao Zhang (2):
> coresight: etm4x: Add ETM PID for Kryo-5XX
> arm64: dts: qcom: sm8250: Add Coresight support
>
> arch/arm64/boot/dts/qcom/qrb5165-rb5.dts | 442 +++++++++++++++++-
> .../coresight/coresight-etm4x-core.c | 1 +
> 2 files changed, 439 insertions(+), 4 deletions(-)
>
I have added your work to my patchset queue. On the other hand I have
a lot of patches to review these days and as such won't be able to
look at it for 4 to 5 weeks.
Thanks,
Mathieu
> --
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>
Hi Leo,
We are interested in adding support in Coresight driver to collect trace data at
the time of kernel panic. Would like to know the current status of this
patch series you posted some time back [1]. If no one is working on this,
we would like to revive and resubmit this.
Appreciate your guidance on this.
Also please discard the earlier duplicate emails which i have send to
you by mistake.
Thanks.
1. https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1652258.html.
This patch set is to address the comments in the patch set v1 [1] and
updates the patches.
Changes from v1:
- Added Adrian's review tag for patch 01;
- Refined comment in patch 01 (James);
- Added James' review and test tag for patch 02.
[1] https://lore.kernel.org/patchwork/cover/1473936/
Leo Yan (2):
perf auxtrace: Add compat_auxtrace_mmap__{read_head|write_tail}
perf auxtrace arm: Support
compat_auxtrace_mmap__{read_head|write_tail}
tools/perf/arch/arm/util/auxtrace.c | 32 +++++++++++
tools/perf/util/auxtrace.c | 88 +++++++++++++++++++++++++++--
tools/perf/util/auxtrace.h | 22 +++++++-
3 files changed, 135 insertions(+), 7 deletions(-)
--
2.25.1
Hi Tao,
On Thu, Aug 19, 2021 at 05:29:37PM +0800, Tao Zhang wrote:
> The input parameter of the function pm_runtime_put should be the
> same in the function cti_enable_hw and cti_disable_hw. The correct
> parameter to use here should be dev->parent.
>
> Signed-off-by: Tao Zhang <quic_taozha(a)quicinc.com>
> ---
> drivers/hwtracing/coresight/coresight-cti-core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/hwtracing/coresight/coresight-cti-core.c b/drivers/hwtracing/coresight/coresight-cti-core.c
> index e2a3620..8988b2e 100644
> --- a/drivers/hwtracing/coresight/coresight-cti-core.c
> +++ b/drivers/hwtracing/coresight/coresight-cti-core.c
> @@ -175,7 +175,7 @@ static int cti_disable_hw(struct cti_drvdata *drvdata)
> coresight_disclaim_device_unlocked(csdev);
> CS_LOCK(drvdata->base);
> spin_unlock(&drvdata->spinlock);
> - pm_runtime_put(dev);
> + pm_runtime_put(dev->parent);
coresight_register() allocates data structure 'coresight_device' and
assigns probed device to the field 'coresight_device::dev::parent';
thus afterwards we need to pass 'coresight_device::dev::parent' for
pm_runtime_xxx() functions. It's not directive for understanding, so
log the info at here.
For the patch:
Reviewed-by: Leo Yan <leo.yan(a)linaro.org>
We could wait for Mike to review as well.
Thanks,
Leo
> return 0;
>
> /* not disabled this call */
> --
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>
The AUX bounce buffer is allocated with API dma_alloc_coherent(), in the
low level's architecture code, e.g. for Arm64, it maps the memory with
the attribution "Normal non-cacheable"; this can be concluded from the
definition for pgprot_dmacoherent() in arch/arm64/include/asm/pgtable.h.
Later when access the AUX bounce buffer, since the memory mapping is
non-cacheable, it's low efficiency due to every load instruction must
reach out DRAM.
This patch changes to allocate pages with alloc_pages_node(), thus the
driver can access the memory with cacheable mapping in the kernel linear
virtual address; therefore, because load instructions can fetch data
from cache lines rather than always read data from DRAM, the driver can
boost memory coping performance. After using the cacheable mapping, the
driver uses dma_sync_single_for_cpu() to invalidate cacheline prior to
read bounce buffer so can avoid read stale trace data.
By measurement the duration for function tmc_update_etr_buffer() with
ftrace function_graph tracer, it shows the performance significant
improvement for copying 4MiB data from bounce buffer:
# echo tmc_etr_get_data_flat_buf > set_graph_notrace // avoid noise
# echo tmc_update_etr_buffer > set_graph_function
# echo function_graph > current_tracer
before:
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) | tmc_update_etr_buffer() {
...
2) # 8148.320 us | }
after:
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) | tmc_update_etr_buffer() {
...
2) # 2463.980 us | }
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
---
Changes from v2:
Sync the entire buffer in one go when the tracing is wrap around
(Suzuki);
Add Suzuki's review tage.
Changes from v1:
Set "flat_buf->daddr" to 0 when fails to map DMA region; and dropped the
unexpected if condition change in tmc_etr_free_flat_buf().
.../hwtracing/coresight/coresight-tmc-etr.c | 47 ++++++++++++++++---
1 file changed, 40 insertions(+), 7 deletions(-)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
index 13fd1fc730ed..ac37e9376d2b 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
@@ -21,6 +21,7 @@
struct etr_flat_buf {
struct device *dev;
+ struct page *pages;
dma_addr_t daddr;
void *vaddr;
size_t size;
@@ -600,6 +601,7 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
{
struct etr_flat_buf *flat_buf;
struct device *real_dev = drvdata->csdev->dev.parent;
+ ssize_t aligned_size;
/* We cannot reuse existing pages for flat buf */
if (pages)
@@ -609,11 +611,18 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
if (!flat_buf)
return -ENOMEM;
- flat_buf->vaddr = dma_alloc_coherent(real_dev, etr_buf->size,
- &flat_buf->daddr, GFP_KERNEL);
- if (!flat_buf->vaddr) {
- kfree(flat_buf);
- return -ENOMEM;
+ aligned_size = PAGE_ALIGN(etr_buf->size);
+ flat_buf->pages = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
+ get_order(aligned_size));
+ if (!flat_buf->pages)
+ goto fail_alloc_pages;
+
+ flat_buf->vaddr = page_address(flat_buf->pages);
+ flat_buf->daddr = dma_map_page(real_dev, flat_buf->pages, 0,
+ aligned_size, DMA_FROM_DEVICE);
+ if (dma_mapping_error(real_dev, flat_buf->daddr)) {
+ flat_buf->daddr = 0;
+ goto fail_dma_map_page;
}
flat_buf->size = etr_buf->size;
@@ -622,6 +631,12 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
etr_buf->mode = ETR_MODE_FLAT;
etr_buf->private = flat_buf;
return 0;
+
+fail_dma_map_page:
+ __free_pages(flat_buf->pages, get_order(aligned_size));
+fail_alloc_pages:
+ kfree(flat_buf);
+ return -ENOMEM;
}
static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
@@ -630,15 +645,20 @@ static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
if (flat_buf && flat_buf->daddr) {
struct device *real_dev = flat_buf->dev->parent;
+ ssize_t aligned_size = PAGE_ALIGN(etr_buf->size);
- dma_free_coherent(real_dev, flat_buf->size,
- flat_buf->vaddr, flat_buf->daddr);
+ dma_unmap_page(real_dev, flat_buf->daddr, aligned_size,
+ DMA_FROM_DEVICE);
+ __free_pages(flat_buf->pages, get_order(aligned_size));
}
kfree(flat_buf);
}
static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
{
+ struct etr_flat_buf *flat_buf = etr_buf->private;
+ struct device *real_dev = flat_buf->dev->parent;
+
/*
* Adjust the buffer to point to the beginning of the trace data
* and update the available trace data.
@@ -648,6 +668,19 @@ static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
etr_buf->len = etr_buf->size;
else
etr_buf->len = rwp - rrp;
+
+ /*
+ * The driver always starts tracing at the beginning of the buffer,
+ * the only reason why we would get a wrap around is when the buffer
+ * is full. Sync the entire buffer in one go for this case.
+ */
+ if (etr_buf->offset + etr_buf->len > etr_buf->size)
+ dma_sync_single_for_cpu(real_dev, flat_buf->daddr,
+ etr_buf->size, DMA_FROM_DEVICE);
+ else
+ dma_sync_single_for_cpu(real_dev,
+ flat_buf->daddr + etr_buf->offset,
+ etr_buf->len, DMA_FROM_DEVICE);
}
static ssize_t tmc_etr_get_data_flat_buf(struct etr_buf *etr_buf,
--
2.25.1
Hi Al and Leo,
> Hi Al,
>
> On Tue, Aug 24, 2021 at 02:55:07PM +0000, Al Grant wrote:
>
> [...]
>
> > > > > > > > 16 (0xF) should work for all silicon, as AXI allows burst
> sizes up to 16.
> > > > > > > > So unless we've missed something, this is an implementation
> > > > > > > > non-compliance and we should not be penalising compliant
> > > > > > > > implementations by reducing the default burst size - the
> > > > > > > > question is how we can enable the upstream kernel to
> > > > > > > > workaround the issue on this chip only, and to me that sounds
> > > > > > > > like it needs something that can be triggered by a setting in
> DT/ACPI.
>
> [...]
>
> > > we can add an optional property like below:
> > >
> > > * Optional property for TMC:
> > >
> > > * arm,burst-size: burst size initiated by the TMC on the AXI
> master
> > > interface. The burst size can be in the range [0..15], the
> setting
> > > supports one data transfer per burst to maximum of 16 data
> > > transfers per burst. If don't set this property, the driver
> > > will set to 15 (16 data transfers per burst) as default value.
> > >
> > > I don't think it's a right way to use CoreSight ROM to read out part
> number and
> > > producer number, and based on these numbers to set burst size. The main
> > > reason is this will add SoC specific code into the driver, and DT
> binding is just
> > > used to decouple the platform specific with the driver code, so with the
> DT
> > > binding, the driver works as common as possible.
> > >
> > > Using errata workaround also is not the right thing to do. Based on the
> TMC
> > > spec (ARM DDI 0461B), the burst size on AXI bus could be different on
> different
> > > SoCs
> >
> > Hi Leo, where are you reading that this is a property of the AXI bus?
>
> No, I cannot find any description in TMC spec (ARM DDI 0461B) to say
> AXICTL.WrBurstLen is a property of the AXI bus.
>
> After read the register AXICTL (DDI 0461B 3.3.15), I mixed the
> concepts, AXICTL.WrBurstLen is an attribution for TMC to initiate the
> max number of data transfers, it's not a property for AXI bus itself.
>
> Sorry I introduced confusion.
>
> > There is a constraint that the burst size must not be greater than the
> > TMC buffer size (see DDI 0461B 3.3.1), but buffer size is indicated by
> > the TMC's MEMWIDTH register and the driver can determine that.
>
> DEVID.MEMWIDTH: The width of the memory interface databus
> for ETB/ETF, DEVID.MEMWIDT = 2 * (ATB datawidth)
> for ETR, DEVID.MEMWIDT = ATB datawidth
>
> AXICTL.WrBurstLen: The maximum number of data transfer that can occur
> within each burst.
>
> RSZ.RSZ: The size of the local RAM buffer (in 32-bit words).
>
> And there have requirement for burst size and TMC buffer size:
>
> 2 ^ (DEVID.MEMWIDTH - 2) * AXICTL.WrBurstLen <= RSZ.RSZ
>
> > The situation I thought we were dealing with, is where the TMC
> > presents a burst that the AXI spec says is valid but the downstream
> > component in this SoC can't accept a burst of that size. Our understanding
> > is that this would be a non-compliance in the downstream component -
> > i.e. failure to handle a valid AXI transaction - and it would be
> > appropriate to treat this as a SoC-specific issue triggering a SoC-
> specific
> > workaround. (The alternative would be to represent the AXI topology
> > in DT/ACPI and then drive the workaround from knowing what the
> > downstream component is.)
>
> Okay, I see, the downstream component (e.g. memory controller) has the
> hardware constraint, it imposes the limitation back on TMC burst size.
>
> The downstream can be a memory controller, or any other component. I
> understand your suggestion that use DT device node to bind TMC and
> downstream device, and based on the property in the downstream device
> we can get to know the constriant for burst size.
>
> But here it's hard to unify property for downstream components; and I
> checked a bit for the latest Device tree spec, if we create the binding
> between TMC device and a memory node ('memory' is a general node in DT
> binding), the 'memory' node doesn't provide property for burst size.
>
> So seems to me, a feasible (and simple) way is still to add a property
> for TMC, since the TMC is connected to its downstream component through
> AXI bus, the developer for a platform has the knowledge for the
> constraint for the max burst size.
>
> * Optional property for TMC:
>
> * arm,max-burst-size: the max burst size initiated by the TMC on the
> AXI master interface. The burst size can be in the range [0..15],
> the setting supports one data transfer per burst to maximum of 16 data
> transfers per burst.
>
> When a platform has specific constriant for the maximum burst size
> (e.g.
> caused by its downstream component), it can set this property for the
> maximum burst size; if this property isn't set, the driver will set to
> 15 (16 data transfers per burst) as default value.
>
> Hope I am not arbitrary for this, so I am curious what's opinions from
> others. any thoughts?
Thanks for providing all the information and ways to fix this issue.
Based on the suggestions, would it be okay if we introduce the optional
DT property as "mrvl,max-burst-size" since it's a SoC specific setting ?
Thanks and Regards,
Tanmay
>
> Thanks,
> Leo
>
> [1] https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__github.com_devicetree-2Dorg_devicetree-
> 2Dspecification_releases_download_v0.3_devicetree-2Dspecification-
> 2Dv0.3.pdf&d=DwIBAg&c=nKjWec2b6R0mOyPaz7xtfQ&r=nUIuNGvBPgdGfPtnCZf81muScqsqX
> vLuOk9ojMTpiTc&m=j5ICgutTAewROjPhs6f9hmuagqSziAFuElP5DLmrP_4&s=YbM6THDChtLLK
> xUMtZMiM1BpU3Lyw0OfKW_E3nYiCtQ&e=