Hi Yabin,
On Thu, Sep 19, 2019 at 04:45:25PM -0700, Yabin Cui wrote:
Thanks for both of your suggestions! I will try return stack and stall control later.
Another problem I found recently is about performance degradation when using ETM:
When running a program with ETM enabled via the perf interface, the ETM records trace data and sends it to the ETR, and the ETR moves the data to memory. The kernel then copies the data from the ETR memory to a per-cpu buffer. So I think the performance cost is: the ETR uses the memory bus to move data, and the kernel uses both CPU time and the memory bus to copy data. In my experiments, the kernel copying the data can take a lot of cpu-cycles, which slows down the running program. When testing a single-thread busy-loop program, the overhead in the kernel is about 11% of cpu-cycles (compared to all cpu-cycles spent running the program). When testing a two-thread busy-loop program, the overhead in the kernel is about 74% of cpu-cycles. In both cases, the overhead in user space is very small, about 0.2% of cpu-cycles.
I am surprised by the performance difference between the single-thread and multi-thread programs. I haven't spent much time on this yet; I will check the accuracy of the numbers and the reason for the difference. Any suggestions or comments are welcome.
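For reference, a busy-loop test of the kind described above can be as simple as the sketch below (just my guess at the shape of the test, not the actual program; the file name, build flags, and thread-count argument are all illustrative):

  /* busy.c: spin N threads in a tight loop, then exit.
   * Build: gcc -O2 -pthread -o busy busy.c
   * Run under the tracing session, e.g.:
   *   perf record -e cs_etm/@tmc_etr0/ --per-thread ./busy 1
   *   perf record -e cs_etm/@tmc_etr0/ --per-thread ./busy 2
   */
  #include <pthread.h>
  #include <stdlib.h>

  static void *busy_loop(void *arg)
  {
          /* volatile keeps the compiler from optimising the loop away */
          volatile unsigned long i;

          for (i = 0; i < 2000000000UL; i++)
                  ;
          return NULL;
  }

  int main(int argc, char *argv[])
  {
          int i, nr_threads = (argc > 1) ? atoi(argv[1]) : 1;
          pthread_t threads[nr_threads];

          for (i = 0; i < nr_threads; i++)
                  pthread_create(&threads[i], NULL, busy_loop, NULL);
          for (i = 0; i < nr_threads; i++)
                  pthread_join(threads[i], NULL);
          return 0;
  }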
- I think if you use the ETR's SG (scatter-gather) mode or CATU mode, it might ease the overhead on the CPU and memory bandwidth compared with the flat mode. To be honest, I am not familiar with these modes, so I will leave this to others.
P.S. IIRC, I was told before that CATU is not commonly supported in hardware, but SG mode can be supported (??).
- Another thing you could try is to use a 'perf record -e cycles' command to locate the hotspot in the CoreSight tracing flow; the command can be:
# perf record -e cycles//k -- perf record -e cs_etm/@tmc_etr0/ --per-thread ./test
Then you could use this way to find the hotspot. I read a bit of the code; tmc_etr_sync_perf_buffer() seems like a potential function to introduce overhead. Also, tmc_flush_and_stop() runs with the spinlock held and IRQs disabled and calls coresight_timeout() for polling, so this flow might also cause performance degradation. If you see the hotspot located in spin_unlock_irqrestore(), it means the 'cycles' PMU event could only trigger its interrupt after local interrupts were re-enabled, even though the busy polling actually happened inside the interrupt-disabled region.
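To illustrate that last point, here is a rough sketch of the pattern (not the real driver code; the lock name and the function are stand-ins):

  #include <linux/spinlock.h>

  static DEFINE_SPINLOCK(sketch_lock);

  static void flush_and_stop_sketch(void)
  {
          unsigned long flags;

          spin_lock_irqsave(&sketch_lock, flags);
          /* IRQs are off now: if the 'cycles' counter overflows in here,
           * its PMU interrupt stays pending.
           */

          /* ... long busy-wait, like coresight_timeout() polling a status
           * register until the flush completes ...
           */

          spin_unlock_irqrestore(&sketch_lock, flags);
          /* IRQs are back on, the pending PMU interrupt fires, and the
           * sample for the cycles accrued during the busy-wait gets
           * attributed here instead of to the polling loop.
           */
  }

So if 'perf report' on the outer session shows kernel samples piling up on spin_unlock_irqrestore(), the time is most likely really spent in the flush/polling path just above it.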
Thanks,
Leo Yan