Hi Yabin,
On Thu, Sep 19, 2019 at 04:45:25PM -0700, Yabin Cui wrote:
> Thanks for both of your suggestions! I will try return stack and stall control later.
>
> Another problem I found recently is performance degradation when using ETM:
>
> When running a program with ETM enabled on it through the perf interface, the ETM records data and sends it to the ETR, and the ETR moves the data to memory. The kernel then copies the data from the ETR memory to a per-CPU buffer. So I think the performance cost is: the ETR uses the memory bus to move data, and the kernel uses both CPU and memory bus to copy data. In my experiments, the kernel moving data can take a lot of cpu-cycles, which slows down the running program. When testing a single-thread busy loop program, the overhead in the kernel is about 11% of cpu-cycles (compared to all cpu-cycles spent running the program). When testing a two-thread busy loop program, the overhead in the kernel is about 74% of cpu-cycles. In both cases, the overhead in user space is very small, about 0.2% of cpu-cycles.
>
> I am surprised at the performance difference between the single-thread and multi-thread programs. I haven't spent much time on this yet; I will check the accuracy of these numbers and the reason for the difference. Any suggestions or comments are welcome.
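Just to make sure I understand the data path you describe, conceptually the kernel-side cost boils down to something like the sketch below. This is purely illustrative (none of these names come from the real tmc-etr driver; they are all made up); the point is that draining the ETR trace buffer into the per-CPU buffer is a plain memcpy() executed by the kernel, so it costs CPU cycles and memory bandwidth on top of the DMA traffic generated by the ETR itself:

/*
 * Purely illustrative sketch of the copy step described above; all names
 * are hypothetical and this is NOT the real tmc-etr driver code.
 */
#include <stddef.h>
#include <string.h>

struct fake_etr_buf {
	void	*base;		/* DMA buffer the ETR writes trace data into */
	size_t	len;		/* bytes of trace collected so far */
};

/* percpu_buf stands in for the per-CPU buffer the kernel copies into. */
static size_t drain_etr_to_percpu(const struct fake_etr_buf *etr,
				  void *percpu_buf, size_t percpu_size)
{
	size_t n = etr->len < percpu_size ? etr->len : percpu_size;

	memcpy(percpu_buf, etr->base, n);	/* the CPU-side cost in question */
	return n;
}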
On my side, I did some performance testing for Arm CoreSight with your shared single-thread and multi-thread test cases. Below are the testing results:
# perf record -e cpu-clock -o perf_cpu_cycle_one_thread.data \
      -- perf record -e cs_etm/@tmc_etr0/ --per-thread ./one_thread
# perf report -i perf_cpu_cycle_one_thread.data -k vmlinux --stdio
#
#
# Total Lost Samples: 0
#
# Samples: 7K of event 'cpu-clock'
# Event count (approx.): 1973750000
#
# Overhead  Command     Shared Object        Symbol
# ........  ..........  ...................  .............................................
#
    40.39%  one_thread  one_thread           [.] LoopA3
    11.69%  perf        [kernel.kallsyms]    [k] arch_local_irq_restore
     8.13%  one_thread  one_thread           [.] LoopA4
     2.81%  perf        [kernel.kallsyms]    [k] kallsyms_expand_symbol.constprop.6
     2.80%  perf        [kernel.kallsyms]    [k] __arch_copy_from_user
     2.37%  perf        libc-2.28.so         [.] __GI_____strtoull_l_internal
     2.10%  perf        [kernel.kallsyms]    [k] format_decode
     2.08%  perf        [kernel.kallsyms]    [k] number
     1.79%  perf        [kernel.kallsyms]    [k] vsnprintf
     1.24%  perf        [kernel.kallsyms]    [k] string_nocheck
     0.92%  perf        [kernel.kallsyms]    [k] __set_page_dirty
     0.87%  perf        libc-2.28.so         [.] _IO_getdelim
     0.79%  perf        [kernel.kallsyms]    [k] update_iter
     0.75%  perf        [kernel.kallsyms]    [k] __add_to_page_cache_locked
[...]
# perf record -e cpu-clock -o perf_cpu_cycle_multi_thread.data \
      -- perf record -e cs_etm/@tmc_etr0/ --per-thread ./multi_thread
# perf report -i perf_cpu_cycle_multi_thread.data -k vmlinux --stdio
#
#
# Total Lost Samples: 0
#
# Samples: 11K of event 'cpu-clock'
# Event count (approx.): 2805750000
#
# Overhead  Command       Shared Object        Symbol
# ........  ............  ...................  ...........................................
#
    31.93%  multi_thread  multi_thread         [.] LoopB3
    31.39%  multi_thread  multi_thread         [.] LoopA3
     6.20%  perf          [kernel.kallsyms]    [k] arch_local_irq_restore
     3.09%  multi_thread  multi_thread         [.] LoopA4
     2.91%  multi_thread  multi_thread         [.] LoopB4
     2.00%  perf          [kernel.kallsyms]    [k] kallsyms_expand_symbol.constprop.6
     1.78%  perf          libc-2.28.so         [.] __GI_____strtoull_l_internal
     1.73%  perf          [kernel.kallsyms]    [k] format_decode
     1.13%  perf          [kernel.kallsyms]    [k] vsnprintf
     1.10%  perf          [kernel.kallsyms]    [k] number
     1.06%  perf          [kernel.kallsyms]    [k] __arch_copy_from_user
     0.97%  perf          [kernel.kallsyms]    [k] string_nocheck
     0.62%  perf          [kernel.kallsyms]    [k] update_iter
     0.53%  perf          libc-2.28.so         [.] _IO_getdelim
[...]
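For reference, below is a rough sketch of what I assume the one_thread / multi_thread test cases look like, based only on the symbol names in the reports above (LoopA3/LoopA4/LoopB3/LoopB4); I don't have your sources, so please treat it as a guess rather than a reproduction. If your real loops do something significantly different, please share them so we can compare like for like.

/*
 * Guess at the busy-loop test cases; not your actual sources.
 * Build as one_thread, or with -DMULTI_THREAD -pthread as multi_thread.
 */
#include <pthread.h>

#define ITERATIONS	(1UL << 31)

static volatile unsigned long counter_a;

static void LoopA3(void)		/* busy loop on the main thread */
{
	unsigned long i;

	for (i = 0; i < ITERATIONS; i++)
		counter_a++;
}

#ifdef MULTI_THREAD
static volatile unsigned long counter_b;

static void LoopB3(void)		/* busy loop on the second thread */
{
	unsigned long i;

	for (i = 0; i < ITERATIONS; i++)
		counter_b++;
}

static void *thread_b(void *arg)
{
	LoopB3();
	return arg;
}
#endif

int main(void)
{
#ifdef MULTI_THREAD
	pthread_t tid;

	pthread_create(&tid, NULL, thread_b, NULL);
#endif
	LoopA3();
#ifdef MULTI_THREAD
	pthread_join(tid, NULL);
#endif
	return 0;
}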
So the testing results above seem reasonable to me: the user-space loops consume most of the CPU bandwidth (40%+), while the kernel-space code occupies much less of the CPU. One thing to note: I used cpu-clock as the event for the performance profiling (not cpu-cycles).
Please confirm whether there are any differences between our two setups. Personally, I find your testing command questionable, especially 'simpleperf record -e cs-etm simpleperf stat -e cpu-cycles etm_test_one_thread': it seems to me that you are using CoreSight to trace the command 'simpleperf stat -e cpu-cycles etm_test_one_thread' and then using the CoreSight trace data in perf.data for the performance analysis. As we know, a lot of trace data can be lost, so isn't the CoreSight trace data unreliable for performance analysis?
Thanks,
Leo Yan