Hi Yabin,
On Tue, 17 Sep 2019 at 15:53, Yabin Cui yabinc@google.com wrote:
Hi guys,
I am trying to reduce ETM data loss. There seem to be two types of data loss:
- loss caused by instruction trace buffer overflow, which generates an Overflow packet after the trace unit recovers.
- loss caused by ETR buffer overflow, which generates a PERF_RECORD_AUX record with PERF_AUX_FLAG_TRUNCATED set.
In practice, the second type is unlikely to happen once I set the ETR buffer size to 4M, and it can be reduced further by using a larger buffer. The first type happens much more frequently: it can occur about 21K times while generating 5.3M of ETM data. So I want to know if there is any way to reduce it.
I found in the ETM architecture manual that overflow seems to be controlled by TRCSTALLCTLR, a stall control register. But TRCIDR3.NOOVERFLOW is not supported on my test device, which seems to use a Qualcomm SDM845. TRCSTALLCTLR.ISTALL can apparently be set by writing to the mode file in /sys, but I didn't find any way to set TRCSTALLCTLR.LEVEL in the Linux kernel, so I wonder if it is ever used.
Do you guys have any suggestions?
To get a timely response to your questions I advise CCing the coresight mailing list (which I have included). There are a lot of knowledgeable people there who can also help, especially with architecture-specific configuration.
TRCSTALLCTLR.LEVEL is currently not accessible to users because we simply never needed the feature. If the ISTALL and LEVEL parameters help with your use case, send a patch that exposes the entire TRCSTALLCTLR register (rather than just the LEVEL field) and I'll be happy to fold it in.
Thanks, Mathieu
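[For context, a minimal sketch of what "writing to the mode file in /sys" looks like, assuming bash, an etm4x device named etm0 (on many boards it is <address>.etm instead), and a placeholder bit value - the ETM_MODE_* bit positions live in the kernel's coresight-etm4x.h and are not a stable ABI:

    # Assumed device name; list /sys/bus/coresight/devices/ to find yours.
    ETM=/sys/bus/coresight/devices/etm0
    # Placeholder: substitute the value of ETM_MODE_ISTALL_EN from your
    # kernel's coresight-etm4x.h before using this.
    ISTALL_EN=$((1 << 21))
    mode=$(cat "$ETM/mode")
    # Set the instruction-stall enable bit on top of the current mode.
    printf '0x%x\n' $(( mode | ISTALL_EN )) > "$ETM/mode"
]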
Hi Yabin,
A couple of additional suggestions:
1) If available, use the return stack - this can remove a number of address packets from the trace stream.
2) Check that both SYSSTALL and STALLCTL are 1 in TRCIDR3 on your device - otherwise using stall will not work (see the sketch after this message).
3) Ensure that the coresight system is correctly described in the device tree with respect to any replicators and TPIU devices on the system. The TPIU will ordinarily be disabled, but the bus to the TPIU may go through a downsizer, which can produce backpressure beyond any replicator. The coresight infrastructure should disable the TPIU branch on a correctly described system.
Regards
Mike
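[A minimal sketch of checks 1) and 2) above, assuming bash, a kernel that exposes the trcidr register group in sysfs and a 'retstack' format bit for the cs_etm PMU, and a sink named tmc_etr0:

    # 2) TRCIDR3: bit 26 = STALLCTL, bit 27 = SYSSTALL (bit 31 = NOOVERFLOW).
    for etm in /sys/bus/coresight/devices/*etm*; do
        idr3=$(cat "$etm/trcidr/trcidr3")
        echo "$etm: STALLCTL=$(( (idr3 >> 26) & 1 )) SYSSTALL=$(( (idr3 >> 27) & 1 ))"
    done

    # 1) Return stack: check whether the perf format bit exists, then use it.
    ls /sys/bus/event_source/devices/cs_etm/format/    # look for 'retstack'
    perf record -e cs_etm/retstack,@tmc_etr0/ --per-thread ./test
]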
Thanks for both of your suggestions! I will try the return stack and stall control later.
Another problem I found recently is performance degradation when using ETM:
When running a program with ETM enabled via the perf interface, the ETM records data and sends it to the ETR, the ETR moves the data to memory, and the kernel copies the data from the ETR memory to a per-cpu buffer. So I think the performance cost is: the ETR uses the memory bus to move data, and the kernel uses both the cpu and the memory bus to copy data. In my experiments, the kernel moving data can take a lot of cpu-cycles, which slows down the running program. When testing a single-thread busy-loop program, the kernel overhead is about 11% of cpu-cycles (compared to all cpu-cycles spent running the program). When testing a two-thread busy-loop program, the kernel overhead is about 74% of cpu-cycles. In both cases, the user-space overhead is very small, about 0.2% of cpu-cycles.
I am surprised by the performance difference between the single-thread and multi-thread programs. I haven't spent much time on this yet; I will check the accuracy of the numbers and the reason for the difference. Any suggestions or comments are welcome.
Hi Yabin,
- I think using the ETR's SG mode or CATU mode instead of flat mode might ease the CPU and memory-bandwidth overhead. To be honest, I am not familiar with these modes, so I will leave this to others.
P.S. IIRC, I was told before that CATU is not commonly supported in hardware, but SG mode can be supported (??).
- Another thing you could try is to use a 'perf record -e cycles' command to locate the hotspot in the CoreSight tracing flow; the command can be:
# perf record -e cycles//k -- perf record -e cs_etm/@tmc_etr0/ --per-thread ./test
Then you can use this to find the hotspot. I read the code a bit; tmc_etr_sync_perf_buffer() looks like a potential source of overhead. Also, tmc_flush_and_stop() runs with a spinlock held and irqs disabled, and it calls coresight_timeout() for polling; this flow might also cause performance degradation. If the hotspot shows up in spin_unlock_irqrestore(), that means the 'cycles' PMU event could only raise its interrupt after local interrupts were re-enabled, while the busy wait actually happened in the interrupt-disabled region.
Thanks, Leo Yan
On Thu, 19 Sep 2019 at 17:45, Yabin Cui yabinc@google.com wrote:
Thanks for both of your suggestions! I will try the return stack and stall control later.
Another problem I found recently is performance degradation when using ETM:
When running a program with ETM enabled via the perf interface, the ETM records data and sends it to the ETR, the ETR moves the data to memory, and the kernel copies the data from the ETR memory to a per-cpu buffer. So I think the performance cost is: the ETR uses the memory bus to move data, and the kernel uses both the cpu and the memory bus to copy data. In my experiments, the kernel moving data can take a lot of cpu-cycles, which slows down the running program.
You are correct - double buffering is definitely not an optimal way to work, but at this time the HW doesn't give us another choice.
When testing a single-thread busy-loop program, the kernel overhead is about 11% of cpu-cycles (compared to all cpu-cycles spent running the program). When testing a two-thread busy-loop program, the kernel overhead is about 74% of cpu-cycles. In both cases, the user-space overhead is very small, about 0.2% of cpu-cycles.
I am surprised by the performance difference between the single-thread and multi-thread programs. I haven't spent much time on this.
I suppose that depends on the mode of operation, i.e. per-thread or CPU-wide. It is hard for me to comment without more information about how you came up with those metrics.
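[For reference, the two modes of operation Mathieu mentions look roughly like this - a hedged sketch in which the tmc_etr0 sink name and CPU list are assumptions:

    # Per-thread: trace only the threads of the launched program.
    perf record -e cs_etm/@tmc_etr0/ --per-thread ./my_prog
    # CPU-wide: trace everything on the listed CPUs while the workload runs.
    perf record -e cs_etm/@tmc_etr0/ -C 0-3 -- ./my_prog
]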
Thanks for the suggestions!
I think using the ETR's SG mode or CATU mode instead of flat mode might ease the CPU and memory-bandwidth overhead.
I am using the ETR's flat mode, and I'm not sure if the qcom sdm845 supports SG or CATU. If they are good for performance, I can try them.
Another thing you could try is to use a 'perf record -e cycles' command to locate the hotspot in the CoreSight tracing flow
Yes, I will profile the kernel code. One interesting thing is that tmc_flush_and_stop() always shows timeout warnings on the hikey board, but not on the pixel 3. I don't know whether it performs some time-consuming hardware operation.
I suppose that depends on the mode of operation, i.e. per-thread or CPU-wide. It is hard for me to comment without more information about how you came up with those metrics.
I am using one aux buffer per cpu. Different threads on one cpu share the same buffer via ioctl(PERF_EVENT_IOC_SET_OUTPUT). I measure the metrics like below:
$ simpleperf record -e cs-etm simpleperf stat -e cpu-cycles etm_test_one_thread
simpleperf is an alternative to linux perf that we use on Android; you can replace it with linux perf with tiny option changes. cpu-cycles can be replaced by cpu-cycles:k or cpu-cycles:u to measure kernel or user space only, and etm_test_one_thread can be replaced by etm_test_multi_threads.
I included the test programs in the attachment, in case anyone wants to try them in their own environment. They need to be compiled with -O0.
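[For anyone without simpleperf, a rough linux perf equivalent of the commands above might be the following - an untested sketch in which the tmc_etr0 sink name and the test binary names are assumptions:

    # Trace the inner 'perf stat' run with ETM while it counts cycles.
    perf record -e cs_etm/@tmc_etr0/ --per-thread \
        -- perf stat -e cpu-cycles ./etm_test_one_thread
    # Kernel-only or user-only cycle counts:
    perf stat -e cpu-cycles:k ./etm_test_one_thread
    perf stat -e cpu-cycles:u ./etm_test_one_thread
]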
Hi Yabin,
On Fri, Sep 20, 2019 at 12:06:26PM -0700, Yabin Cui wrote:
I am using the ETR's flat mode, and I'm not sure if the qcom sdm845 supports SG or CATU. If they are good for performance, I can try them.
Let me first try your tests and share back the profiling results from my Hikey board.
Yes, I will profile the kernel code. One interesting thing is that tmc_flush_and_stop() always shows timeout warnings on the hikey board, but not on the pixel 3. I don't know whether it performs some time-consuming hardware operation.
Could you confirm which Hikey board you are using? Hikey620 or Hikey960?
One thing to note: on the Hikey620 it is usually necessary to disable CPUIdle, otherwise the CoreSight driver might report warnings and dump stack traces. If possible, could you try adding "nohlt" to the kernel command line and check whether the timeout warnings go away?
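[Besides "nohlt", cpuidle states can also be disabled at runtime through sysfs - a hedged sketch; the number of idle states per CPU varies by platform:

    # Keep all CPUs out of idle states while tracing.
    for d in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do
        echo 1 > "$d"
    done
]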
I included the test programs in the attachment, in case anyone wants to try them in their own environment. They need to be compiled with -O0.
Thanks for sharing the test programs. I will try them on my Hikey620 board.
Thanks, Leo Yan
Could you confirm which Hikey board you are using? Hikey620 or Hikey960?
I saw the timeout warning on the hikey620; I didn't test on the hikey960. I will try with nohlt when I have time.
On 20/09/2019 20:06, Yabin Cui wrote:
I am using the ETR's flat mode, and I'm not sure if the qcom sdm845 supports SG or CATU. If they are good for performance, I can try them.
As far as I understand, the sdm845 supports SG, and this is part of the standard upstream device tree (see the "arm,scatter-gather" property on the tmc_etr node).
Suzuki
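[One quick way to check on a running system whether the device tree actually carries that property - node paths are SoC-specific:

    # Look for the scatter-gather property anywhere in the live device tree.
    find /proc/device-tree -name 'arm,scatter-gather'
]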
Hi Yabin,
On Thu, Sep 19, 2019 at 04:45:25PM -0700, Yabin Cui wrote:
When testing a single-thread busy-loop program, the kernel overhead is about 11% of cpu-cycles (compared to all cpu-cycles spent running the program). When testing a two-thread busy-loop program, the kernel overhead is about 74% of cpu-cycles. In both cases, the user-space overhead is very small, about 0.2% of cpu-cycles.
I am surprised by the performance difference between the single-thread and multi-thread programs. I haven't spent much time on this yet; I will check the accuracy of the numbers and the reason for the difference. Any suggestions or comments are welcome.
On my side, I did some performance testing of Arm CoreSight with the single-thread and multi-thread test cases you shared. Below are the results:
# perf record -e cpu-clock -o perf_cpu_cycle_one_thread.data \
      -- perf record -e cs_etm/@tmc_etr0/ --per-thread ./one_thread
# perf report -i perf_cpu_cycle_one_thread.data -k vmlinux --stdio

# Total Lost Samples: 0
#
# Samples: 7K of event 'cpu-clock'
# Event count (approx.): 1973750000
#
# Overhead  Command     Shared Object      Symbol
# ........  ..........  .................  ........................................
#
    40.39%  one_thread  one_thread         [.] LoopA3
    11.69%  perf        [kernel.kallsyms]  [k] arch_local_irq_restore
     8.13%  one_thread  one_thread         [.] LoopA4
     2.81%  perf        [kernel.kallsyms]  [k] kallsyms_expand_symbol.constprop.6
     2.80%  perf        [kernel.kallsyms]  [k] __arch_copy_from_user
     2.37%  perf        libc-2.28.so       [.] __GI_____strtoull_l_internal
     2.10%  perf        [kernel.kallsyms]  [k] format_decode
     2.08%  perf        [kernel.kallsyms]  [k] number
     1.79%  perf        [kernel.kallsyms]  [k] vsnprintf
     1.24%  perf        [kernel.kallsyms]  [k] string_nocheck
     0.92%  perf        [kernel.kallsyms]  [k] __set_page_dirty
     0.87%  perf        libc-2.28.so       [.] _IO_getdelim
     0.79%  perf        [kernel.kallsyms]  [k] update_iter
     0.75%  perf        [kernel.kallsyms]  [k] __add_to_page_cache_locked
[...]

# perf record -e cpu-clock -o perf_cpu_cycle_multi_thread.data \
      -- perf record -e cs_etm/@tmc_etr0/ --per-thread ./multi_thread
# perf report -i perf_cpu_cycle_multi_thread.data -k vmlinux --stdio

# Total Lost Samples: 0
#
# Samples: 11K of event 'cpu-clock'
# Event count (approx.): 2805750000
#
# Overhead  Command       Shared Object      Symbol
# ........  ............  .................  ......................................
#
    31.93%  multi_thread  multi_thread       [.] LoopB3
    31.39%  multi_thread  multi_thread       [.] LoopA3
     6.20%  perf          [kernel.kallsyms]  [k] arch_local_irq_restore
     3.09%  multi_thread  multi_thread       [.] LoopA4
     2.91%  multi_thread  multi_thread       [.] LoopB4
     2.00%  perf          [kernel.kallsyms]  [k] kallsyms_expand_symbol.constprop.6
     1.78%  perf          libc-2.28.so       [.] __GI_____strtoull_l_internal
     1.73%  perf          [kernel.kallsyms]  [k] format_decode
     1.13%  perf          [kernel.kallsyms]  [k] vsnprintf
     1.10%  perf          [kernel.kallsyms]  [k] number
     1.06%  perf          [kernel.kallsyms]  [k] __arch_copy_from_user
     0.97%  perf          [kernel.kallsyms]  [k] string_nocheck
     0.62%  perf          [kernel.kallsyms]  [k] update_iter
     0.53%  perf          libc-2.28.so       [.] _IO_getdelim
[...]
The results look reasonable to me: the user-space loops consume most of the CPU bandwidth (40%+), while the kernel occupies much less CPU. One thing to note: I used the CPU clock as the profiling event (not CPU cycles).
Please confirm whether there are any differences between our two setups. Your testing command seems questionable to me, especially 'simpleperf record -e cs-etm simpleperf stat -e cpu-cycles etm_test_one_thread': it looks to me like you are using CoreSight to trace the command 'simpleperf stat -e cpu-cycles etm_test_one_thread' and then using the CoreSight trace data in perf.data for the performance analysis. As we know, a lot of trace data is lost, so isn't the CoreSight trace data unreliable for performance analysis?
Thanks, Leo Yan