Hi Yabin,
On Thu, Sep 19, 2019 at 04:45:25PM -0700, Yabin Cui wrote:
Thanks for both of your suggestions! I will try return stack and stall control later.
Another problem I found recently is about performance degradation when using ETM:
When running a program with ETM enabled via the perf interface, the ETM records trace data and sends it to the ETR, and the ETR moves the data to memory. The kernel then copies the data from the ETR memory to a per-cpu buffer. So I think the performance cost is: the ETR uses the memory bus to move data, and the kernel uses both CPU time and the memory bus to copy data. In my experiments, the kernel copying the data can take a lot of cpu-cycles, which slows down the running program. When testing a single-thread busy-loop program, the overhead in the kernel is about 11% of cpu-cycles (compared to all cpu-cycles spent running the program). When testing a two-thread busy-loop program, the overhead in the kernel is about 74% of cpu-cycles. In both cases, the overhead in user space is very small, about 0.2% of cpu-cycles.
I am surprised by the performance difference between the single-thread and multi-thread programs. I haven't spent much time on this yet; I will check the accuracy of the numbers and the reason for the difference. Any suggestions or comments are welcome.
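For reference, a busy-loop test of the kind described above can be as simple as the sketch below (just my guess at the shape of the test, not the actual program; the file name, build flags, and thread-count argument are all illustrative):

  /* busy.c: spin N threads in a tight loop, then exit.
   * Build: gcc -O2 -pthread -o busy busy.c
   * Run under the tracing session, e.g.:
   *   perf record -e cs_etm/@tmc_etr0/ --per-thread ./busy 1
   *   perf record -e cs_etm/@tmc_etr0/ --per-thread ./busy 2
   */
  #include <pthread.h>
  #include <stdlib.h>

  static void *busy_loop(void *arg)
  {
          /* volatile keeps the compiler from optimising the loop away */
          volatile unsigned long i;

          for (i = 0; i < 2000000000UL; i++)
                  ;
          return NULL;
  }

  int main(int argc, char *argv[])
  {
          int i, nr_threads = (argc > 1) ? atoi(argv[1]) : 1;
          pthread_t threads[nr_threads];

          for (i = 0; i < nr_threads; i++)
                  pthread_create(&threads[i], NULL, busy_loop, NULL);
          for (i = 0; i < nr_threads; i++)
                  pthread_join(threads[i], NULL);
          return 0;
  }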
- I think if you use the ETR's SG (scatter-gather) mode or CATU mode, it might ease the overhead on the CPU and memory bandwidth compared with the flat mode. To be honest, I am not familiar with these modes, so I will leave this to others.
P.S. IIRC, I was told before that CATU is not commonly supported in hardware, but SG mode can be supported (??).
- Another thing you could try is to use a 'perf record -e cycles' command to locate the hotspot in the CoreSight tracing flow; the command can be:
# perf record -e cycles//k -- perf record -e cs_etm/@tmc_etr0/ --per-thread ./test
Then you could use this way to find the hotspot. I read a bit of the code; tmc_etr_sync_perf_buffer() seems like a potential function to introduce overhead. Also, tmc_flush_and_stop() runs with the spinlock held and IRQs disabled and calls coresight_timeout() for polling, so this flow might also cause performance degradation. If you see the hotspot located in spin_unlock_irqrestore(), it means the 'cycles' PMU event could only trigger its interrupt after local interrupts were re-enabled, even though the busy polling actually happened inside the interrupt-disabled region.
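To illustrate that last point, here is a rough sketch of the pattern (not the real driver code; the lock name and the function are stand-ins):

  #include <linux/spinlock.h>

  static DEFINE_SPINLOCK(sketch_lock);

  static void flush_and_stop_sketch(void)
  {
          unsigned long flags;

          spin_lock_irqsave(&sketch_lock, flags);
          /* IRQs are off now: if the 'cycles' counter overflows in here,
           * its PMU interrupt stays pending.
           */

          /* ... long busy-wait, like coresight_timeout() polling a status
           * register until the flush completes ...
           */

          spin_unlock_irqrestore(&sketch_lock, flags);
          /* IRQs are back on, the pending PMU interrupt fires, and the
           * sample for the cycles accrued during the busy-wait gets
           * attributed here instead of to the polling loop.
           */
  }

So if 'perf report' on the outer session shows kernel samples piling up on spin_unlock_irqrestore(), the time is most likely really spent in the flush/polling path just above it.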
Thanks,
Leo Yan