Re: Questions about ETM overflow

20 Sep 2019

      Thanks for the suggestions!
...
I think that using ETR's SG mode or CATU mode, rather than flat mode,
might ease the overhead on the CPU and memory bandwidth.
I am using ETR's flat mode; I am not sure whether the Qualcomm sdm845
supports SG or CATU. If they are good for performance, I can try them.
...
Another thing you could try is to use the 'perf -e cycles' command to
locate the hotspot in the CoreSight tracing flow.
Yes, I will profile the kernel code. One interesting thing is that
tmc_flush_and_stop() always shows timeout warnings
on the hikey board, but not on the Pixel 3. I don't know whether it
performs some time-consuming hardware operation.
...
I suppose that depends on the mode of operation, i.e. per-thread or
CPU-wide.  It is hard for me to comment without more information about
how you came up with those metrics.
I am using an aux buffer per CPU. Different threads on one CPU share the
same buffer using ioctl(PERF_EVENT_IOC_SET_OUTPUT).
I measure the metrics as follows:
$ simpleperf record -e cs-etm simpleperf stat -e cpu-cycles etm_test_one_thread
simpleperf is an alternative to Linux perf that we use on Android; you can
replace it with Linux perf with a small change to the options.
cpu-cycles can be replaced by cpu-cycles:k or cpu-cycles:u to measure
kernel or user space only.
etm_test_one_thread can be replaced by etm_test_multi_threads.
I included the test programs in the attachment, in case anyone wants to try
them in their own environment. They need to be compiled with -O0.
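For anyone reproducing the measurement, one way to turn a traced run and an
untraced run into an overhead percentage is sketched below in Python. This is
only an illustration: the function name, the choice of denominator (all
cpu-cycles of the untraced run), and the counts in the usage line are
assumptions, not measured values.

```python
def overhead_percent(cycles_traced: int, cycles_untraced: int, total_untraced: int) -> float:
    """Extra cycles spent when tracing, as a percentage of the untraced run.

    cycles_traced and cycles_untraced are counts for one mode (for example
    cpu-cycles:k) with and without ETM enabled. total_untraced is the plain
    cpu-cycles count of the run without ETM. Using that count as the
    denominator is an assumption about how the overhead is defined.
    """
    if total_untraced <= 0:
        raise ValueError("total_untraced must be positive")
    return 100.0 * (cycles_traced - cycles_untraced) / total_untraced


# Hypothetical counts, for illustration only.
print(overhead_percent(cycles_traced=150_000, cycles_untraced=50_000, total_untraced=2_000_000))
```

Running the same calculation once with cpu-cycles:k counts and once with
cpu-cycles:u counts separates the kernel and user space contributions.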
On Fri, Sep 20, 2019 at 9:13 AM Mathieu Poirier mathieu.poirier@linaro.org
wrote:
...
On Thu, 19 Sep 2019 at 17:45, Yabin Cui yabinc@google.com wrote:
...
Thanks for both of your suggestions! I will try return stack and stall
control later.
...
Another problem I found recently is about performance degradation when
using ETM:
...
When running a program with ETM enabled on it through the perf interface, ETM
records data and sends it to ETR, and ETR moves the data to memory. The kernel
then copies the data from ETR memory to a per-cpu buffer.
So I think the performance cost is: ETR uses the memory bus to move data,
and the kernel uses both the CPU and the memory bus to copy data.
...
In my experiments, the kernel moving data can take a lot of cpu-cycles,
which slows down the running program.
You are correct - double buffering mode is definitely not an optimal
way to work but at this time the HW doesn't give us another choice.
...
When testing a single-thread busy-loop program, the overhead in the kernel
is about 11% of cpu-cycles (compared to all cpu-cycles spent running the program).
...
When testing a two-thread busy-loop program, the overhead in the kernel
is about 74% of cpu-cycles.
...
In both cases, the overhead in user space is very small, about 0.2%
cpu-cycles.
...
I am surprised at the performance difference between single thread and
multi thread programs. I haven't spent much time on this.
I suppose that depends on the mode of operation, i.e. per-thread or
CPU-wide.  It is hard for me to comment without more information about
how you came up with those metrics.
...
I will check the accuracy and reason for the difference.
Any suggestions or comments are welcome.
On Thu, Sep 19, 2019 at 5:04 AM Mike Leach mike.leach@linaro.org
wrote:
...
...
Hi Yabin,
A couple of additional suggestions:

1) If available, use the return stack - this can remove a number of
   address packets from the trace stream.
2) Check that both SYSSTALL and STALLCTL are 1 in IDR3 on your device,
   otherwise using stall will not work.
3) Ensure that the CoreSight system is correctly described in the
   device tree with respect to possible replicators and TPIU devices on the
   system. The TPIU will ordinarily be disabled, but the bus to the TPIU may
   go through a downsizer, which can produce backpressure beyond any
   replicator. The CoreSight infrastructure should disable the TPIU branch
   on a correctly described system.
Regards
Mike
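
The check in item 2 can be sketched in Python as below. The bit positions
(NOOVERFLOW at bit 31, SYSSTALL at bit 27, STALLCTL at bit 26) are taken from
the ETMv4 TRCIDR3 layout as it is commonly documented and should be verified
against the architecture manual for your device. The function name and the
sample register value are hypothetical. On kernels that expose the ETM ID
registers through sysfs, the raw value can probably be read from the trcidr3
file of the ETM device.

```python
# Assumed ETMv4 TRCIDR3 bit positions; verify against the architecture manual.
TRCIDR3_NOOVERFLOW = 1 << 31
TRCIDR3_SYSSTALL = 1 << 27
TRCIDR3_STALLCTL = 1 << 26


def stall_capabilities(trcidr3: int) -> dict:
    """Decode the stall-related capability bits from a raw TRCIDR3 value."""
    sysstall = bool(trcidr3 & TRCIDR3_SYSSTALL)
    stallctl = bool(trcidr3 & TRCIDR3_STALLCTL)
    return {
        "sysstall": sysstall,
        "stallctl": stallctl,
        "nooverflow": bool(trcidr3 & TRCIDR3_NOOVERFLOW),
        # Stall is only expected to work when both bits are set.
        "stall_usable": sysstall and stallctl,
    }


# Hypothetical register value with SYSSTALL and STALLCTL set, NOOVERFLOW clear.
print(stall_capabilities(0x0C000000))
```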
On Wed, 18 Sep 2019 at 16:35, Mathieu Poirier
mathieu.poirier@linaro.org wrote:
...
Hi Yabin,
On Tue, 17 Sep 2019 at 15:53, Yabin Cui yabinc@google.com wrote:
...
Hi guys,
I am trying to reduce ETM data loss. There seem to be two types
of data loss:
1) caused by instruction trace buffer overflow, which generates an
   Overflow packet after recovering.
2) caused by ETR buffer overflow, which generates a
   PERF_RECORD_AUX with PERF_AUX_FLAG_TRUNCATED.
In practice, the second one is unlikely to happen when I set the ETR
buffer size to 4M, and can be improved further by setting a larger buffer size.
The first one happens much more frequently: about 21K
times when generating 5.3M of ETM data.
So I want to know if there is any way to reduce that.
I found in the ETM architecture manual that the overflow seems to be
controlled by TRCSTALLCTLR, a stall control register.
But TRCIDR3.NOOVERFLOW is not supported on my experiment device,
which seems to use the Qualcomm sdm845.
It seems that TRCSTALLCTLR.ISTALL can be set by writing to the mode file in
/sys, but I didn't find any way to set
TRCSTALLCTLR.LEVEL in the Linux kernel. So I wonder if it is ever used.
Do you guys have any suggestions?
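
To put the overflow count in perspective, the figures quoted above (about 21K
Overflow packets in 5.3M of ETM data) work out to roughly one overflow every
few hundred bytes of trace. The Python sketch below does the arithmetic. The
message does not say whether "M" means 10^6 bytes or MiB, so both readings
are shown, and the function name is only illustrative.

```python
def bytes_per_overflow(trace_bytes: float, overflow_count: int) -> float:
    """Average number of trace bytes generated between two Overflow packets."""
    if overflow_count <= 0:
        raise ValueError("overflow_count must be positive")
    return trace_bytes / overflow_count


overflows = 21_000  # "about 21K times"

# Reading "5.3M" as 5.3 * 10^6 bytes: about 252 bytes per overflow.
print(round(bytes_per_overflow(5.3 * 10**6, overflows)))

# Reading "5.3M" as 5.3 MiB: about 265 bytes per overflow.
print(round(bytes_per_overflow(5.3 * 1024 * 1024, overflows)))
```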
In order to get a timely response to your questions I advise you to CC the
coresight mailing list (which I have included).  There are a lot of
knowledgeable people there who can also help, especially with
architecture-specific configuration.
TRCSTALLCTLR.LEVEL is currently not accessible to users because we
simply never needed to use the feature.  If using the ISTALL and LEVEL
parameters helps with your use case, send a patch that exposes the
entire TRCSTALLCTLR register (rather than just the LEVEL field) and
I'll be happy to fold it in.
Thanks,
Mathieu
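
As a rough illustration of what a full TRCSTALLCTLR value would contain, the
Python sketch below packs the fields discussed in this thread. The layout
(LEVEL in bits [3:0], ISTALL at bit 8, NOOVERFLOW at bit 13) is an assumption
based on the ETMv4 register description and should be checked in the
architecture manual; some implementations may not implement every LEVEL bit.
The helper name is hypothetical, and this is not the kernel's interface.

```python
# Assumed ETMv4 TRCSTALLCTLR layout; verify against the architecture manual.
TRCSTALLCTLR_LEVEL_MASK = 0xF       # LEVEL, bits [3:0]
TRCSTALLCTLR_ISTALL = 1 << 8        # stall the core on instruction trace pressure
TRCSTALLCTLR_NOOVERFLOW = 1 << 13   # only meaningful if TRCIDR3.NOOVERFLOW is set


def make_trcstallctlr(level: int, istall: bool = True, nooverflow: bool = False) -> int:
    """Pack LEVEL, ISTALL and NOOVERFLOW into one register value."""
    if not 0 <= level <= TRCSTALLCTLR_LEVEL_MASK:
        raise ValueError("level must fit in 4 bits")
    value = level
    if istall:
        value |= TRCSTALLCTLR_ISTALL
    if nooverflow:
        value |= TRCSTALLCTLR_NOOVERFLOW
    return value


# Highest threshold level with instruction stalling enabled.
print(hex(make_trcstallctlr(level=0xF)))  # prints 0x10f
```

Exposing the whole register, as suggested above, would let a user write such
a value in one step instead of setting each field through a separate file.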
...
Best,
Yabin

CoreSight mailing list
CoreSight@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/coresight
--
Mike Leach
Principal Engineer, ARM Ltd.
Manchester Design Centre. UK
