Hi Yabin,
On Thu, Sep 19, 2019 at 04:45:25PM -0700, Yabin Cui wrote:
> Thanks for both of your suggestions! I will try return stack and stall control later.
>
> Another problem I found recently is performance degradation when using ETM:
>
> When running a program with ETM enabled on it through the perf interface, the ETM records data and sends it to the ETR, and the ETR moves the data to memory. The kernel then copies the data from the ETR memory to a per-CPU buffer. So I think the performance cost is: the ETR uses the memory bus to move data, and the kernel uses both CPU and memory bus to copy data. In my experiments, the kernel moving data can take a lot of cpu-cycles, which slows down the running program. When testing a single-thread busy loop program, the overhead in the kernel is about 11% of cpu-cycles (compared to all cpu-cycles spent running the program). When testing a two-thread busy loop program, the overhead in the kernel is about 74% of cpu-cycles. In both cases, the overhead in user space is very small, about 0.2% of cpu-cycles.
>
> I am surprised at the performance difference between the single-thread and multi-thread programs. I haven't spent much time on this yet; I will check the accuracy of these numbers and the reason for the difference. Any suggestions or comments are welcome.
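Just to make sure I understand the data path you describe, conceptually the kernel-side cost boils down to something like the sketch below. This is purely illustrative (none of these names come from the real tmc-etr driver; they are all made up); the point is that draining the ETR trace buffer into the per-CPU buffer is a plain memcpy() executed by the kernel, so it costs CPU cycles and memory bandwidth on top of the DMA traffic generated by the ETR itself:

/*
 * Purely illustrative sketch of the copy step described above; all names
 * are hypothetical and this is NOT the real tmc-etr driver code.
 */
#include <stddef.h>
#include <string.h>

struct fake_etr_buf {
	void	*base;		/* DMA buffer the ETR writes trace data into */
	size_t	len;		/* bytes of trace collected so far */
};

/* percpu_buf stands in for the per-CPU buffer the kernel copies into. */
static size_t drain_etr_to_percpu(const struct fake_etr_buf *etr,
				  void *percpu_buf, size_t percpu_size)
{
	size_t n = etr->len < percpu_size ? etr->len : percpu_size;

	memcpy(percpu_buf, etr->base, n);	/* the CPU-side cost in question */
	return n;
}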
On my side, I did some performance testing for Arm CoreSight with your shared single-thread and multi-thread test cases. Below are the testing results:
# perf record -e cpu-clock -o perf_cpu_cycle_one_thread.data \
      -- perf record -e cs_etm/@tmc_etr0/ --per-thread ./one_thread
# perf report -i perf_cpu_cycle_one_thread.data -k vmlinux --stdio
#
#
# Total Lost Samples: 0
#
# Samples: 7K of event 'cpu-clock'
# Event count (approx.): 1973750000
#
# Overhead  Command     Shared Object        Symbol
# ........  ..........  ...................  .............................................
#
    40.39%  one_thread  one_thread           [.] LoopA3
    11.69%  perf        [kernel.kallsyms]    [k] arch_local_irq_restore
     8.13%  one_thread  one_thread           [.] LoopA4
     2.81%  perf        [kernel.kallsyms]    [k] kallsyms_expand_symbol.constprop.6
     2.80%  perf        [kernel.kallsyms]    [k] __arch_copy_from_user
     2.37%  perf        libc-2.28.so         [.] __GI_____strtoull_l_internal
     2.10%  perf        [kernel.kallsyms]    [k] format_decode
     2.08%  perf        [kernel.kallsyms]    [k] number
     1.79%  perf        [kernel.kallsyms]    [k] vsnprintf
     1.24%  perf        [kernel.kallsyms]    [k] string_nocheck
     0.92%  perf        [kernel.kallsyms]    [k] __set_page_dirty
     0.87%  perf        libc-2.28.so         [.] _IO_getdelim
     0.79%  perf        [kernel.kallsyms]    [k] update_iter
     0.75%  perf        [kernel.kallsyms]    [k] __add_to_page_cache_locked
[...]
# perf record -e cpu-clock -o perf_cpu_cycle_multi_thread.data \
      -- perf record -e cs_etm/@tmc_etr0/ --per-thread ./multi_thread
# perf report -i perf_cpu_cycle_multi_thread.data -k vmlinux --stdio
#
#
# Total Lost Samples: 0
#
# Samples: 11K of event 'cpu-clock'
# Event count (approx.): 2805750000
#
# Overhead  Command       Shared Object        Symbol
# ........  ............  ...................  ...........................................
#
    31.93%  multi_thread  multi_thread         [.] LoopB3
    31.39%  multi_thread  multi_thread         [.] LoopA3
     6.20%  perf          [kernel.kallsyms]    [k] arch_local_irq_restore
     3.09%  multi_thread  multi_thread         [.] LoopA4
     2.91%  multi_thread  multi_thread         [.] LoopB4
     2.00%  perf          [kernel.kallsyms]    [k] kallsyms_expand_symbol.constprop.6
     1.78%  perf          libc-2.28.so         [.] __GI_____strtoull_l_internal
     1.73%  perf          [kernel.kallsyms]    [k] format_decode
     1.13%  perf          [kernel.kallsyms]    [k] vsnprintf
     1.10%  perf          [kernel.kallsyms]    [k] number
     1.06%  perf          [kernel.kallsyms]    [k] __arch_copy_from_user
     0.97%  perf          [kernel.kallsyms]    [k] string_nocheck
     0.62%  perf          [kernel.kallsyms]    [k] update_iter
     0.53%  perf          libc-2.28.so         [.] _IO_getdelim
[...]
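For reference, below is a rough sketch of what I assume the one_thread / multi_thread test cases look like, based only on the symbol names in the reports above (LoopA3/LoopA4/LoopB3/LoopB4); I don't have your sources, so please treat it as a guess rather than a reproduction. If your real loops do something significantly different, please share them so we can compare like for like.

/*
 * Guess at the busy-loop test cases; not your actual sources.
 * Build as one_thread, or with -DMULTI_THREAD -pthread as multi_thread.
 */
#include <pthread.h>

#define ITERATIONS	(1UL << 31)

static volatile unsigned long counter_a;

static void LoopA3(void)		/* busy loop on the main thread */
{
	unsigned long i;

	for (i = 0; i < ITERATIONS; i++)
		counter_a++;
}

#ifdef MULTI_THREAD
static volatile unsigned long counter_b;

static void LoopB3(void)		/* busy loop on the second thread */
{
	unsigned long i;

	for (i = 0; i < ITERATIONS; i++)
		counter_b++;
}

static void *thread_b(void *arg)
{
	LoopB3();
	return arg;
}
#endif

int main(void)
{
#ifdef MULTI_THREAD
	pthread_t tid;

	pthread_create(&tid, NULL, thread_b, NULL);
#endif
	LoopA3();
#ifdef MULTI_THREAD
	pthread_join(tid, NULL);
#endif
	return 0;
}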
So the testing results above seem reasonable to me: the user-space loops consume most of the CPU bandwidth (40%+), while the kernel-space code occupies much less of the CPU. One thing to note: I used cpu-clock as the event for the performance profiling (not cpu-cycles).
Please confirm whether there are any differences between our two setups. Personally, I find your testing command questionable, especially 'simpleperf record -e cs-etm simpleperf stat -e cpu-cycles etm_test_one_thread': it seems to me that you are using CoreSight to trace the command 'simpleperf stat -e cpu-cycles etm_test_one_thread' and then using the CoreSight trace data in perf.data for the performance analysis. As we know, a lot of trace data can be lost, so isn't the CoreSight trace data unreliable for performance analysis?
Thanks,
Leo Yan