Hi Tao,
On Thu, Aug 19, 2021 at 05:29:37PM +0800, Tao Zhang wrote:
> The input parameter of the function pm_runtime_put() should be the
> same in the functions cti_enable_hw() and cti_disable_hw(). The
> correct parameter to use here is dev->parent.
>
> Signed-off-by: Tao Zhang <quic_taozha(a)quicinc.com>
> ---
> drivers/hwtracing/coresight/coresight-cti-core.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/hwtracing/coresight/coresight-cti-core.c b/drivers/hwtracing/coresight/coresight-cti-core.c
> index e2a3620..8988b2e 100644
> --- a/drivers/hwtracing/coresight/coresight-cti-core.c
> +++ b/drivers/hwtracing/coresight/coresight-cti-core.c
> @@ -175,7 +175,7 @@ static int cti_disable_hw(struct cti_drvdata *drvdata)
> coresight_disclaim_device_unlocked(csdev);
> CS_LOCK(drvdata->base);
> spin_unlock(&drvdata->spinlock);
> - pm_runtime_put(dev);
> + pm_runtime_put(dev->parent);
coresight_register() allocates the data structure 'coresight_device'
and assigns the probed device to the field
'coresight_device::dev::parent'; thus afterwards we need to pass
'coresight_device::dev::parent' to the pm_runtime_xxx() functions.
This is not intuitive, so I am noting it here for the record.
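To make the pairing concrete, here is a minimal sketch of the pattern
(simplified from coresight-cti-core.c; cti_do_enable() is a
hypothetical helper standing in for the real programming sequence):

    static int cti_enable_hw(struct cti_drvdata *drvdata)
    {
            struct device *dev = &drvdata->csdev->dev;
            int rc;

            /*
             * dev is the coresight_device's embedded device; its
             * parent is the device that was actually probed and owns
             * the runtime-PM state, so every pm_runtime_*() call must
             * target dev->parent.
             */
            pm_runtime_get_sync(dev->parent);

            rc = cti_do_enable(drvdata);
            if (rc)
                    pm_runtime_put(dev->parent);   /* balance on failure */

            return rc;
    }

With this patch, cti_disable_hw() balances the get above with
pm_runtime_put(dev->parent) on the same struct device.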
For the patch:
Reviewed-by: Leo Yan <leo.yan(a)linaro.org>
We could wait for Mike to review as well.
Thanks,
Leo
> return 0;
>
> /* not disabled this call */
> --
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>
The AUX bounce buffer is allocated with the API dma_alloc_coherent();
the low-level architecture code, e.g. for Arm64, maps the memory with
the attribute "Normal non-cacheable", which can be concluded from the
definition of pgprot_dmacoherent() in arch/arm64/include/asm/pgtable.h.
Since the memory mapping is non-cacheable, later accesses to the AUX
bounce buffer are inefficient: every load instruction must fetch its
data all the way from DRAM.
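For reference, the arm64 definition (quoted from
arch/arm64/include/asm/pgtable.h as of the v5.x kernels; MT_NORMAL_NC
is the "Normal non-cacheable" memory type):

    #define pgprot_dmacoherent(prot) \
            __pgprot_modify(prot, PTE_ATTRINDX_MASK, \
                            PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)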
This patch changes the allocation to alloc_pages_node(), so the driver
can access the memory through a cacheable mapping in the kernel linear
virtual address range; because load instructions can then fetch data
from cache lines rather than always reading from DRAM, the driver's
memory copying performance improves. With the cacheable mapping, the
driver uses dma_sync_single_for_cpu() to invalidate the relevant cache
lines before reading the bounce buffer, so that it does not read stale
trace data.
Measuring the duration of the function tmc_update_etr_buffer() with the
ftrace function_graph tracer shows a significant performance
improvement when copying 4MiB of data from the bounce buffer:
# echo tmc_etr_get_data_flat_buf > set_graph_notrace // avoid noise
# echo tmc_update_etr_buffer > set_graph_function
# echo function_graph > current_tracer
before:
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) | tmc_update_etr_buffer() {
...
2) # 8148.320 us | }
after:
# CPU DURATION FUNCTION CALLS
# | | | | | | |
2) | tmc_update_etr_buffer() {
...
2) # 2463.980 us | }
Signed-off-by: Leo Yan <leo.yan(a)linaro.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose(a)arm.com>
---
Changes from v2:
Sync the entire buffer in one go when the tracing wraps around
(Suzuki);
Added Suzuki's review tag.
Changes from v1:
Set "flat_buf->daddr" to 0 when fails to map DMA region; and dropped the
unexpected if condition change in tmc_etr_free_flat_buf().
.../hwtracing/coresight/coresight-tmc-etr.c | 47 ++++++++++++++++---
1 file changed, 40 insertions(+), 7 deletions(-)
diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
index 13fd1fc730ed..ac37e9376d2b 100644
--- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
+++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
@@ -21,6 +21,7 @@
struct etr_flat_buf {
struct device *dev;
+ struct page *pages;
dma_addr_t daddr;
void *vaddr;
size_t size;
@@ -600,6 +601,7 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
{
struct etr_flat_buf *flat_buf;
struct device *real_dev = drvdata->csdev->dev.parent;
+ ssize_t aligned_size;
/* We cannot reuse existing pages for flat buf */
if (pages)
@@ -609,11 +611,18 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
if (!flat_buf)
return -ENOMEM;
- flat_buf->vaddr = dma_alloc_coherent(real_dev, etr_buf->size,
- &flat_buf->daddr, GFP_KERNEL);
- if (!flat_buf->vaddr) {
- kfree(flat_buf);
- return -ENOMEM;
+ aligned_size = PAGE_ALIGN(etr_buf->size);
+ flat_buf->pages = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
+ get_order(aligned_size));
+ if (!flat_buf->pages)
+ goto fail_alloc_pages;
+
+ flat_buf->vaddr = page_address(flat_buf->pages);
+ flat_buf->daddr = dma_map_page(real_dev, flat_buf->pages, 0,
+ aligned_size, DMA_FROM_DEVICE);
+ if (dma_mapping_error(real_dev, flat_buf->daddr)) {
+ flat_buf->daddr = 0;
+ goto fail_dma_map_page;
}
flat_buf->size = etr_buf->size;
@@ -622,6 +631,12 @@ static int tmc_etr_alloc_flat_buf(struct tmc_drvdata *drvdata,
etr_buf->mode = ETR_MODE_FLAT;
etr_buf->private = flat_buf;
return 0;
+
+fail_dma_map_page:
+ __free_pages(flat_buf->pages, get_order(aligned_size));
+fail_alloc_pages:
+ kfree(flat_buf);
+ return -ENOMEM;
}
static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
@@ -630,15 +645,20 @@ static void tmc_etr_free_flat_buf(struct etr_buf *etr_buf)
if (flat_buf && flat_buf->daddr) {
struct device *real_dev = flat_buf->dev->parent;
+ ssize_t aligned_size = PAGE_ALIGN(etr_buf->size);
- dma_free_coherent(real_dev, flat_buf->size,
- flat_buf->vaddr, flat_buf->daddr);
+ dma_unmap_page(real_dev, flat_buf->daddr, aligned_size,
+ DMA_FROM_DEVICE);
+ __free_pages(flat_buf->pages, get_order(aligned_size));
}
kfree(flat_buf);
}
static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
{
+ struct etr_flat_buf *flat_buf = etr_buf->private;
+ struct device *real_dev = flat_buf->dev->parent;
+
/*
* Adjust the buffer to point to the beginning of the trace data
* and update the available trace data.
@@ -648,6 +668,19 @@ static void tmc_etr_sync_flat_buf(struct etr_buf *etr_buf, u64 rrp, u64 rwp)
etr_buf->len = etr_buf->size;
else
etr_buf->len = rwp - rrp;
+
+ /*
+ * The driver always starts tracing at the beginning of the buffer,
+ * the only reason why we would get a wrap around is when the buffer
+ * is full. Sync the entire buffer in one go for this case.
+ */
+ if (etr_buf->offset + etr_buf->len > etr_buf->size)
+ dma_sync_single_for_cpu(real_dev, flat_buf->daddr,
+ etr_buf->size, DMA_FROM_DEVICE);
+ else
+ dma_sync_single_for_cpu(real_dev,
+ flat_buf->daddr + etr_buf->offset,
+ etr_buf->len, DMA_FROM_DEVICE);
}
static ssize_t tmc_etr_get_data_flat_buf(struct etr_buf *etr_buf,
--
2.25.1
Hi Al and Leo,
> Hi Al,
>
> On Tue, Aug 24, 2021 at 02:55:07PM +0000, Al Grant wrote:
>
> [...]
>
> > > > > > > > 16 (0xF) should work for all silicon, as AXI allows burst
> > > > > > > > sizes up to 16.
> > > > > > > > So unless we've missed something, this is an implementation
> > > > > > > > non-compliance and we should not be penalising compliant
> > > > > > > > implementations by reducing the default burst size - the
> > > > > > > > question is how we can enable the upstream kernel to work
> > > > > > > > around the issue on this chip only, and to me that sounds
> > > > > > > > like it needs something that can be triggered by a setting
> > > > > > > > in DT/ACPI.
>
> [...]
>
> > > we can add an optional property like below:
> > >
> > > * Optional property for TMC:
> > >
> > > * arm,burst-size: burst size initiated by the TMC on the AXI master
> > >   interface. The burst size can be in the range [0..15]; the setting
> > >   supports from one data transfer per burst up to a maximum of 16
> > >   data transfers per burst. If this property is not set, the driver
> > >   will use 15 (16 data transfers per burst) as the default value.
> > >
> > > I don't think it's right to use the CoreSight ROM to read out the
> > > part number and producer number, and to set the burst size based
> > > on these numbers. The main reason is that this would add
> > > SoC-specific code into the driver; the DT binding exists precisely
> > > to decouple platform specifics from the driver code, so with a DT
> > > binding the driver stays as generic as possible.
> > >
> > > Using an errata workaround is also not the right thing to do.
> > > Based on the TMC spec (ARM DDI 0461B), the burst size on the AXI
> > > bus can differ between SoCs.
> >
> > Hi Leo, where are you reading that this is a property of the AXI bus?
>
> No, I cannot find any description in the TMC spec (ARM DDI 0461B)
> saying that AXICTL.WrBurstLen is a property of the AXI bus.
>
> After reading the register description for AXICTL (DDI 0461B 3.3.15),
> I see that I mixed up the concepts: AXICTL.WrBurstLen is an attribute
> of the TMC specifying the maximum number of data transfers it
> initiates per burst; it is not a property of the AXI bus itself.
>
> Sorry I introduced confusion.
>
> > There is a constraint that the burst size must not be greater than the
> > TMC buffer size (see DDI 0461B 3.3.1), but buffer size is indicated by
> > the TMC's MEMWIDTH register and the driver can determine that.
>
> DEVID.MEMWIDTH: The width of the memory interface databus;
> for ETB/ETF, DEVID.MEMWIDTH = 2 * (ATB data width);
> for ETR, DEVID.MEMWIDTH = ATB data width.
>
> AXICTL.WrBurstLen: The maximum number of data transfers that can
> occur within each burst.
>
> RSZ.RSZ: The size of the local RAM buffer (in 32-bit words).
>
> And there is a requirement relating the burst size and the TMC
> buffer size:
>
>   2 ^ (DEVID.MEMWIDTH - 2) * AXICTL.WrBurstLen <= RSZ.RSZ
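>
> For example (illustrative numbers only): with DEVID.MEMWIDTH = 4
> (128-bit memory interface, i.e. 2 ^ (4 - 2) = 4 words per beat) and
> a burst length of 16 transfers (AXICTL.WrBurstLen programmed as 0xF),
> one burst moves 4 * 16 = 64 words (256 bytes), so RSZ.RSZ must be at
> least 64.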
>
> > The situation I thought we were dealing with is where the TMC
> > presents a burst that the AXI spec says is valid, but the downstream
> > component in this SoC can't accept a burst of that size. Our
> > understanding is that this would be a non-compliance in the
> > downstream component - i.e. a failure to handle a valid AXI
> > transaction - and it would be appropriate to treat this as a
> > SoC-specific issue triggering a SoC-specific workaround. (The
> > alternative would be to represent the AXI topology in DT/ACPI and
> > then drive the workaround from knowing what the downstream
> > component is.)
>
> Okay, I see: the downstream component (e.g. a memory controller) has
> a hardware constraint, and it imposes that limitation back on the TMC
> burst size.
>
> The downstream component can be a memory controller or any other
> component. I understand your suggestion to use a DT device node to
> bind the TMC to its downstream device, so that from a property of the
> downstream device we can learn the constraint on the burst size.
>
> But here it's hard to define a unified property for downstream
> components. I checked the latest Device Tree spec [1]: if we create a
> binding between the TMC device and a memory node ('memory' is a
> generic node in the DT binding), the 'memory' node doesn't provide a
> property for burst size.
>
> So it seems to me a feasible (and simple) way is still to add a
> property to the TMC: since the TMC is connected to its downstream
> component through the AXI bus, the developer for a platform has the
> knowledge of the constraint on the max burst size.
>
> * Optional property for TMC:
>
> * arm,max-burst-size: the max burst size initiated by the TMC on the
>   AXI master interface. The burst size can be in the range [0..15];
>   the setting supports from one data transfer per burst up to a
>   maximum of 16 data transfers per burst.
>
> When a platform has a specific constraint on the maximum burst size
> (e.g. one caused by its downstream component), it can set this
> property to that maximum; if this property isn't set, the driver will
> use 15 (16 data transfers per burst) as the default value.
>
> I hope I am not being arbitrary here, and I am curious about others'
> opinions - any thoughts?
Thanks for providing all the information and the possible ways to fix
this issue.
Based on the suggestions, would it be okay if we introduce the optional
DT property as "mrvl,max-burst-size", since it's a SoC-specific setting?
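For illustration, a minimal sketch of how the driver could consume such
a property (the property name and the helper are only from this
discussion, not from mainline; tmc_get_max_burst_size() is a
hypothetical name):

    #define TMC_AXICTL_WRBURSTLEN_MAX    0xf    /* 16 transfers/burst */

    static u32 tmc_get_max_burst_size(struct device *dev)
    {
            u32 burst_size;

            /* Fall back to the architectural maximum if unset */
            if (device_property_read_u32(dev, "mrvl,max-burst-size",
                                         &burst_size))
                    return TMC_AXICTL_WRBURSTLEN_MAX;

            /* Clamp out-of-range DT values */
            if (burst_size > TMC_AXICTL_WRBURSTLEN_MAX)
                    burst_size = TMC_AXICTL_WRBURSTLEN_MAX;

            return burst_size;
    }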
Thanks and Regards,
Tanmay
>
> Thanks,
> Leo
>
> [1] https://github.com/devicetree-org/devicetree-specification/releases/download/v0.3/devicetree-specification-v0.3.pdf
Hi Daniel,
On Tue, Jul 13, 2021 at 02:15:30PM +0200, Daniel Kiss wrote:
> Keep track of the perf handler that is registred by the first tracer.
> This will be used by the update call from polling.
>
> Signed-off-by: Daniel Kiss <daniel.kiss(a)arm.com>
> Signed-off-by: Branislav Rankov <Branislav.Rankov(a)arm.com>
> ---
> drivers/hwtracing/coresight/coresight-tmc-etr.c | 6 ++++--
> drivers/hwtracing/coresight/coresight-tmc.h | 2 ++
> 2 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/hwtracing/coresight/coresight-tmc-etr.c b/drivers/hwtracing/coresight/coresight-tmc-etr.c
> index 589bb2d56e802..55c9b5fd9f832 100644
> --- a/drivers/hwtracing/coresight/coresight-tmc-etr.c
> +++ b/drivers/hwtracing/coresight/coresight-tmc-etr.c
> @@ -1503,8 +1503,8 @@ tmc_update_etr_buffer(struct coresight_device *csdev,
>
> spin_lock_irqsave(&drvdata->spinlock, flags);
>
> - /* Don't do anything if another tracer is using this sink */
> - if (atomic_read(csdev->refcnt) != 1) {
> + /* Serve only the tracer with the leading perf handler */
> + if (drvdata->perf_handle != handle) {
In CPU-wide trace scenarios, the first CPU to enable a sink is not
guaranteed to be the same as the last CPU to use it. As far as I
understand, the above assumes that the first and last CPUs to use a
sink are the same.
> spin_unlock_irqrestore(&drvdata->spinlock, flags);
> goto out;
> }
> @@ -1619,6 +1619,7 @@ static int tmc_enable_etr_sink_perf(struct coresight_device *csdev, void *data)
> drvdata->pid = pid;
> drvdata->mode = CS_MODE_PERF;
> drvdata->perf_buf = etr_perf->etr_buf;
> + drvdata->perf_handle = handle;
> atomic_inc(csdev->refcnt);
> }
>
> @@ -1666,6 +1667,7 @@ static int tmc_disable_etr_sink(struct coresight_device *csdev)
> drvdata->mode = CS_MODE_DISABLED;
> /* Reset perf specific data */
> drvdata->perf_buf = NULL;
> + drvdata->perf_handle = NULL;
>
> spin_unlock_irqrestore(&drvdata->spinlock, flags);
>
> diff --git a/drivers/hwtracing/coresight/coresight-tmc.h b/drivers/hwtracing/coresight/coresight-tmc.h
> index b91ec7dde7bc9..81583ffb973dc 100644
> --- a/drivers/hwtracing/coresight/coresight-tmc.h
> +++ b/drivers/hwtracing/coresight/coresight-tmc.h
> @@ -184,6 +184,7 @@ struct etr_buf {
> * @idr_mutex: Access serialisation for idr.
> * @sysfs_buf: SYSFS buffer for ETR.
> * @perf_buf: PERF buffer for ETR.
> + * @perf_handle: PERF handle for ETR.
> */
> struct tmc_drvdata {
> void __iomem *base;
> @@ -207,6 +208,7 @@ struct tmc_drvdata {
> struct mutex idr_mutex;
> struct etr_buf *sysfs_buf;
> struct etr_buf *perf_buf;
> + struct perf_output_handle *perf_handle;
> };
>
> struct etr_buf_operations {
> --
> 2.25.1
>
Hi Al and Mike,
> >
> > 16 (0xF) should work for all silicon, as AXI allows burst sizes up to 16.
> > So unless we've missed something, this is an implementation non-compliance
> > and we should not be penalising compliant implementations by reducing the
> > default burst size - the question is how we can enable the upstream kernel
> > to work around the issue on this chip only, and to me that sounds like it
> > needs something that can be triggered by a setting in DT/ACPI.
> >
>
> Another possibility would be to introduce an errata workaround in
> Kconfig for your silicon.
> There are a number of these already in Kconfig for PE issues, e.g.
> CONFIG_ARM64_ERRATUM_826319, and we have introduced
> CONFIG_ETM4X_IMPDEF_FEATURE for silicon-specific variants in the
> ETMv4 driver.
>
> The latter config compiles in implementation-defined workarounds
> which operate on the basis of matching the AMBA ID of the silicon.
> This means they will operate only if configured in Kconfig, and only
> on silicon where the workaround is needed.
>
Thanks a lot for the suggestions.
We are thinking of using the "part number" and "designer" fields rather
than a device tree property. Since we use standard Arm core IP and
CoreSight SoC-600 IP, we cannot differentiate our silicon from others
using the ETR AMBA ID, PIDR and CPU MIDR registers.
We are proposing to expose the CoreSight ROM region to the driver and
to determine the part number and designer by reading the following
fields (masks applied before shifting):
part_number = (PIDR1.PART_1 << 8) | PIDR0.PART_0;
designer = ((PIDR4.DES_2 & 0xf) << 7) |
           ((PIDR2.DES_1 & 0x7) << 4) |
           (PIDR1.DES_0 & 0xf);
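For illustration, here is a sketch of that composition from raw ROM
table reads (register offsets per the CoreSight architecture spec;
'base' is an assumed ioremap of the ROM region, and the helper names
are hypothetical):

    #define PIDR0    0xfe0
    #define PIDR1    0xfe4
    #define PIDR2    0xfe8
    #define PIDR4    0xfd0

    static u16 rom_part_number(void __iomem *base)
    {
            u32 pidr0 = readl(base + PIDR0);
            u32 pidr1 = readl(base + PIDR1);

            /* PART_1 is PIDR1[3:0], PART_0 is PIDR0[7:0] */
            return ((pidr1 & 0xf) << 8) | (pidr0 & 0xff);
    }

    static u16 rom_designer(void __iomem *base)
    {
            u32 pidr1 = readl(base + PIDR1);
            u32 pidr2 = readl(base + PIDR2);
            u32 pidr4 = readl(base + PIDR4);

            /* JEP106: DES_2 in bits [10:7], DES_1 in [6:4], DES_0 in [3:0] */
            return ((pidr4 & 0xf) << 7) |          /* PIDR4.DES_2 */
                   ((pidr2 & 0x7) << 4) |          /* PIDR2.DES_1 */
                   ((pidr1 >> 4) & 0xf);           /* PIDR1.DES_0 */
    }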
Using a combination of the part number and designer read from the ROM
region would help in identifying the Marvell implementation.
This option would be generic and could be helpful for other silicon
with similar issues, and it can be applied across CoreSight components
like ETF/ETR.
What are your thoughts on this approach?
With Regards,
Tanmay
> Regards
>
> Mike
>
>
> > Al
Adds a generic API to allow packet processors to count, per channel,
the number of bytes processed and the number of bytes skipped while
unsynced, plus any packet header or format errors.
The ETMv4 / ETE packet processor is updated to use this API.
The API adds the ocsd_decode_stats_t structure to contain the
statistics (ocsd_if_types.h).
The C-API (opencsd_c_api.h) adds the functions:
ocsd_dt_get_decode_stats() - get a pointer to the stats block.
ocsd_dt_reset_decode_stats() - resets the counts to zero. This function
operates independently of the main decoder reset.
This allows tools such as perf, which may reset the decoder multiple
times per AUXTRACE_BUFFER, to count stats for the entire buffer rather
than for each capture block.
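A usage sketch of the new C-API (prototypes and field names as I read
them from the patches below; check opencsd_c_api.h and ocsd_if_types.h
in the release for the exact definitions):

    #include <stdio.h>
    #include <inttypes.h>
    #include <opencsd/c_api/opencsd_c_api.h>

    static void dump_and_reset_stats(dcd_tree_handle_t tree, uint8_t cs_id)
    {
            ocsd_decode_stats_t *stats = NULL;

            /* Returns a pointer into the decoder's stats block */
            if (ocsd_dt_get_decode_stats(tree, cs_id, &stats) == OCSD_OK &&
                stats != NULL)
                    printf("ID 0x%02x: %" PRIu64 " bytes processed\n",
                           cs_id, stats->channel_total);

            /* Zero the counters without resetting the decoder itself */
            ocsd_dt_reset_decode_stats(tree, cs_id);
    }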
Mike Leach (4):
opencsd: Add decode statistics API to packet processor.
opencsd: ETMv4: ETE: Add packet processing stats to decoders.
opencsd: tests: Update test programs to use the packet decoder
statistics API
opencsd: Update readme and version info for v1.2.0
README.md | 5 ++-
decoder/include/common/ocsd_dcd_tree.h | 26 ++++++++++-
decoder/include/common/trc_pkt_proc_base.h | 44 ++++++++++++++++++-
decoder/include/opencsd/c_api/opencsd_c_api.h | 30 ++++++++++++-
decoder/include/opencsd/ocsd_if_types.h | 20 +++++++++
decoder/include/opencsd/ocsd_if_version.h | 6 +--
decoder/source/c_api/ocsd_c_api.cpp | 20 ++++++++-
decoder/source/etmv4/trc_pkt_proc_etmv4i.cpp | 10 ++++-
decoder/source/ocsd_dcd_tree.cpp | 39 ++++++++++++++++
decoder/tests/source/c_api_pkt_print_test.c | 37 +++++++++++++++-
decoder/tests/source/trc_pkt_lister.cpp | 37 +++++++++++++++-
11 files changed, 260 insertions(+), 14 deletions(-)
--
2.17.1
For better organisation and easier review, this patch series has been
extracted from the patch set "perf: Refine barriers for AUX ring
buffer". This series needs to be applied on top of the patch series [1].
To support compat mode in the perf tool, patch 01 adds a new item in
"perf_env" to track whether the kernel is running in 64-bit mode. This
patch is a preparation for the later changes.
Patch 02 introduces compat variants of the functions for accessing the
AUX trace's head and tail. These two functions are defined with the
weak attribute, so an architecture can override them when the generic
code cannot provide atomic access to the 64-bit values while perf runs
in compat mode.
Patch 03 implements compat_auxtrace_mmap__{read_head|write_tail} for
the Arm platform. On an Arm platform in compat mode, the kernel runs
in 64-bit mode while the user space tool runs in 32-bit mode; the tool
uses the instructions "ldrd" and "strd" for 64-bit value atomicity.
This patch set has been tested on an Arm64 Juno platform, with the perf
tool built using the compiler arm-linux-gnueabihf-gcc.
[1] https://lore.kernel.org/patchwork/cover/1473916/
Leo Yan (3):
perf env: Track kernel 64-bit mode in environment
perf auxtrace: Add compat_auxtrace_mmap__{read_head|write_tail}
perf auxtrace arm: Support
compat_auxtrace_mmap__{read_head|write_tail}
tools/perf/arch/arm/util/auxtrace.c | 32 +++++++++++
tools/perf/util/auxtrace.c | 88 +++++++++++++++++++++++++++--
tools/perf/util/auxtrace.h | 22 +++++++-
tools/perf/util/env.c | 24 +++++++-
tools/perf/util/env.h | 3 +
5 files changed, 161 insertions(+), 8 deletions(-)
--
2.25.1