Thanks for the suggestions!

> I think using ETR's SG mode or CATU mode might ease the CPU and memory
> bandwidth overhead compared to flat mode.

I am using ETR's flat mode; I am not sure whether the qcom sdm845 supports
SG or CATU. If they help performance, I can try them.
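As a quick check (assuming the standard CoreSight sysfs layout; I haven't
verified the device naming on sdm845), I can look for a CATU device like
this:

$ ls /sys/bus/coresight/devices/   # look for a catu<N> entry

If no catu entry shows up, CATU is probably not described in the device
tree for this board.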

> Another thing you could try is to use the 'perf -e cycles' command to
> locate the hotspot in the CoreSight tracing flow.

Yes, I will profile the kernel code. One interesting thing is that
tmc_flush_and_stop() always shows timeout warnings on the hikey board, but
not on Pixel 3. I don't know whether it performs some time-consuming
hardware operation.
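To start with, something like the following (a rough sketch; the exact
options may need adjusting on the target) should locate kernel hotspots
while a trace session is running:

$ perf record -e cycles -a -g -- sleep 5
$ perf report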

> I suppose that depends on the mode of operation, i.e. per-thread or
> CPU-wide. It is hard for me to comment without more information about
> how you came up with those metrics.

I am using one aux buffer per CPU. Different threads on one CPU share the
same buffer using ioctl(PERF_EVENT_IOC_SET_OUTPUT).
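For anyone curious, the sharing is roughly what the sketch below does (a
simplified, hypothetical example, not simpleperf's actual code; the cs_etm
PMU type has to be read from sysfs at runtime, and error handling is
omitted):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Sketch: attach two threads on the same CPU to one aux buffer. */
void share_aux_buffer(int cs_etm_pmu_type, pid_t tid1, pid_t tid2, int cpu)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    /* Read from /sys/bus/event_source/devices/cs_etm/type. */
    attr.type = cs_etm_pmu_type;

    int fd1 = perf_event_open(&attr, tid1, cpu, -1, 0);
    int fd2 = perf_event_open(&attr, tid2, cpu, -1, 0);

    /* mmap fd1's data ring (1 metadata page + 2^n data pages),
     * then its aux area. */
    long page_size = sysconf(_SC_PAGESIZE);
    void *base = mmap(NULL, page_size * 17, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd1, 0);
    struct perf_event_mmap_page *meta = base;
    meta->aux_offset = page_size * 17;
    meta->aux_size = page_size * 1024;  /* 4M aux buffer with 4K pages */
    mmap(NULL, meta->aux_size, PROT_READ | PROT_WRITE, MAP_SHARED,
         fd1, meta->aux_offset);

    /* Redirect the second event into the first event's buffers, so
     * only one aux buffer per CPU is needed. */
    ioctl(fd2, PERF_EVENT_IOC_SET_OUTPUT, fd1);
}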
I measure the metrics as follows:
$ simpleperf record -e cs-etm simpleperf stat -e cpu-cycles etm_test_one_thread

simpleperf is an alternative to Linux perf that we use on Android; you can
replace it with Linux perf with only tiny option changes (a rough
equivalent is sketched below). cpu-cycles can be replaced by cpu-cycles:k
or cpu-cycles:u to measure only kernel or only user space.
etm_test_one_thread can be replaced by etm_test_multi_threads.
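For reference, a roughly equivalent Linux perf invocation (untested; the
cs_etm event syntax may differ across kernel versions) would be:

$ perf record -e cs_etm// -- perf stat -e cpu-cycles ./etm_test_one_thread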

I included the test programs in the attachment, in case anyone wants to
try them in their environment. They need to be compiled with -O0.
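The single-thread test is essentially just a busy loop like the simplified
sketch below (an illustration, not the exact attached code):

/* Simplified sketch of etm_test_one_thread; compile with -O0 so the
 * loop is not optimized away. */
#include <stdio.h>

int main(void)
{
    volatile unsigned long long counter = 0;
    unsigned long long i;

    for (i = 0; i < 1000000000ULL; i++)
        counter++;  /* busy work that generates ETM trace data */
    printf("%llu\n", counter);
    return 0;
}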

On Fri, Sep 20, 2019 at 9:13 AM Mathieu Poirier <mathieu.poirier@linaro.org> wrote:
On Thu, 19 Sep 2019 at 17:45, Yabin Cui <yabinc@google.com> wrote:
>
> Thanks for both of your suggestions! I will try the return stack and stall control later.
>
> Another problem I found recently is about performance degradation when using ETM:
>
> When running a program with ETM enabled via the perf interface, the ETM
> records data and sends it to the ETR, the ETR moves the data to memory,
> and the kernel copies the data from the ETR memory to a per-cpu buffer.
> So I think the performance cost is: the ETR uses the memory bus to move
> data, and the kernel uses both the cpu and the memory bus to copy data.
> In my experiments, the kernel moving data can take a lot of cpu-cycles,
> which slows down the running program.

You are correct - double buffering mode is definitely not an optimal
way to work but at this time the HW doesn't give us another choice.

> When testing a single-thread busy-loop program, the overhead in the
> kernel is about 11% of cpu-cycles (compared to all cpu-cycles spent
> running the program).
> When testing a two-thread busy-loop program, the overhead in the kernel
> is about 74% of cpu-cycles.
> In both cases, the overhead in user space is very small, about 0.2% of
> cpu-cycles.
>
> I am surprised by the performance difference between the single-thread
> and multi-thread programs. I haven't spent much time on this.

I suppose that depends on the mode of operation, i.e. per-thread or
CPU-wide.  It is hard for me to comment without more information about
how you came up with those metrics.

> I will check the accuracy and the reason for the difference.
> Any suggestions or comments are welcome.
>
>
> On Thu, Sep 19, 2019 at 5:04 AM Mike Leach <mike.leach@linaro.org> wrote:
>>
>> Hi Yabin,
>>
>> A couple of additional suggestions:-
>> 1) If available, use the return stack - this can remove a number of
>> address packets from the trace stream.
>> 2) Check that both SYSSTALL and STALLCTL are 1 in TRCIDR3 on your device
>> - otherwise using stall will not work.
>> 3) Ensure that the coresight system is correctly described in the
>> device tree in respect of possible replicators and TPIU devices on the
>> system. The TPIU will ordinarily be disabled, but the bus to the TPIU
>> may go through a downsizer - which can produce backpressure beyond any
>> replicator. The coresight infrastructure should disable the TPIU branch
>> on a correctly described system.
>>
>> Regards
>>
>> Mike
>>
>> On Wed, 18 Sep 2019 at 16:35, Mathieu Poirier
>> <mathieu.poirier@linaro.org> wrote:
>> >
>> > Hi Yabin,
>> >
>> > On Tue, 17 Sep 2019 at 15:53, Yabin Cui <yabinc@google.com> wrote:
>> > >
>> > > Hi guys,
>> > >
>> > > I am trying to reduce ETM data loss. There seem to be two types of data loss:
>> > > 1. caused by instruction trace buffer overflow, which generates an Overflow packet after recovering.
>> > > 2. caused by ETR buffer overflow, which generates a PERF_RECORD_AUX with PERF_AUX_FLAG_TRUNCATED.
>> > >
>> > > In practice, the second one is unlikely to happen when I set the ETR buffer size to 4M, and can be mitigated by setting a higher buffer size.
>> > > The first one happens much more frequently; it can happen about 21K times while generating 5.3M of etm data.
>> > > So I want to know if there is any way to reduce that.
>> > >
>> > > I found in the ETM architecture manual that the overflow seems to be controlled by TRCSTALLCTLR, a stall control register.
>> > > But TRCIDR3.NOOVERFLOW is not supported on my test device, which seems to use a qualcomm sdm845.
>> > > TRCSTALLCTLR.ISTALL can seemingly be set by writing to the mode file in /sys. But I didn't find any way to set
>> > > TRCSTALLCTLR.LEVEL in the Linux kernel. So I wonder if it is ever used.
>> > >
>> > > Do you guys have any suggestions?
>> >
>> > In order to get a timely response to your questions, I advise CCing
>> > the coresight mailing list (which I have included).  There are a lot
>> > of knowledgeable people there who can also help, especially with
>> > architecture-specific configuration.
>> >
>> > TRCSTALLCTLR.LEVEL is currently not accessible to users because we
>> > simply never needed to use the feature.  If using the ISTALL and LEVEL
>> > parameters helps with your use case, send a patch that exposes the
>> > entire TRCSTALLCTLR register (rather than just the LEVEL field) and
>> > I'll be happy to fold it in.
>> >
>> > Thanks,
>> > Mathieu
>> >
>> > >
>> > > Best,
>> > > Yabin
>> > >
>> > _______________________________________________
>> > CoreSight mailing list
>> > CoreSight@lists.linaro.org
>> > https://lists.linaro.org/mailman/listinfo/coresight
>>
>>
>>
>> --
>> Mike Leach
>> Principal Engineer, ARM Ltd.
>> Manchester Design Centre. UK