Re: [PATCH 0/4] coresight: Add ETR-PERF polling.

14 May 2021

Hi,
On Fri, 14 May 2021 at 10:02, Denis Nikitin denik@google.com wrote:
...
On Wed, May 5, 2021 at 8:29 AM Mathieu Poirier
mathieu.poirier@linaro.org wrote:
...
On Tue, May 04, 2021 at 11:46:20PM -0700, Denis Nikitin wrote:
...
On Tue, Apr 27, 2021 at 9:04 AM Leo Yan leo.yan@linaro.org wrote:
...
On Tue, Apr 27, 2021 at 09:47:46AM -0600, Mathieu Poirier wrote:
[...]
...
...

ETR polling ensures that more trace is collected across the entire
trace session - seeking to reduce inconsistent capture volumes.
I am not convinced disabling a sink to collect traces while an
event is active is the right way to go.  To me it will add (more) complexity to
the coresight subsystem for very little gains, if any.
If I remember correctly Leo brought forward the exact same idea about a year ago
and after discussion, we all agreed the benefit would not be important enough to
offset the drawbacks.
As usual I am open to discussion and my opinion is not set in stone.  But as I
mentioned I worry the feature will increase complexity in the driver and
produce dubious results.  And we also have to factor in usability which, as
Al pointed out, will be a problem.
Just want to point out one thing about ETR polling.  From one perspective,
the ETR polling mode is actually very similar to perf's snapshot
mode.  E.g. we can send the USR2 signal to the perf tool at a specific
interval to capture CoreSight trace data, thus it can also record the
trace data continuously.
One benefit I can see from ETR polling mode is that it might introduce less
overhead than perf snapshot mode.  The kernel's mechanism (workqueue
or kernel thread) will be much more efficient than perf's signal handling
and SMP call with IPIs.

So it would be good to first understand whether perf snapshot mode can
meet the requirement or not.
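The periodic-signal approach described above can be sketched in a few lines of Python. This is only an illustration, assuming perf was already started in AUX snapshot mode (`perf record --snapshot`), where SIGUSR2 triggers a snapshot. The helper name `trigger_snapshots` and its parameters are hypothetical and not part of perf or of this patchset.

```python
import os
import signal
import time


def trigger_snapshots(perf_pid, interval_s, count, kill=os.kill, sleep=time.sleep):
    """Send SIGUSR2 to a running perf process `count` times.

    Hypothetical helper: `kill` and `sleep` are injectable so the loop
    can be exercised without a real perf process.
    Returns the number of signals actually delivered.
    """
    sent = 0
    for _ in range(count):
        sleep(interval_s)
        try:
            kill(perf_pid, signal.SIGUSR2)
        except ProcessLookupError:
            break  # perf has exited; stop signalling
        sent += 1
    return sent
```

Every iteration costs a signal delivery to the perf tool on top of the kernel-side work, which is the overhead the paragraph above compares against a kernel workqueue or thread.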
We evaluated the patch on Chrome OS and I can confirm that the quality
of AutoFDO profiles greatly improved with the ETR polling.
Tested with per-thread and system-wide mode.
Without ETR polling the size of the collected ETM data was very
inconsistent on the same workload and could vary by a factor of two.
This, in turn, affects the quality of the AutoFDO profiles generated from ETM.
With ETR polling the data size became pretty stable.
Performance evaluation shows a similar consistency in performance gain
of AutoFDO optimization.
This, I think, supports the idea that data collection right now is sensitive
to the process scheduling and can be improved with ETR polling.
For the system-wide mode particularly we didn't see any other alternatives
to collect data periodically on a long-running workload.
We haven't tested snapshot mode though. The idea sounds interesting.
But a small runtime overhead is crucial for the sampling profiler in the field
and if there is a noticeable difference we would lean towards the
ETR polling.
Please see if Leo's approach[1], or any kind of extension to the current
snapshot feature, would be a viable solution.  Reusing or extending code that is
already there is always a better option.
Thanks,
Mathieu
[1]. https://lists.linaro.org/pipermail/coresight/2021-April/006254.html
Hi Mathieu and Leo,
I did some evaluation of the snapshot mode.
Performance overhead is indeed higher than with the ETR polling patch.
Here are some numbers for comparison (measured on browser
Speedometer2 benchmark):
Runtime overhead of ETM tracing with ETR poll period 100ms is less than
0.5%. Snapshot mode gives 2.1%.
With 10ms period I see 4.6% with ETR polling and 22% in snapshot mode.
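For a quick side-by-side reading of these figures, the ratio of snapshot overhead to ETR polling overhead can be computed as below. The numbers are the ones quoted above; the 100ms ETR polling figure is stated only as "less than 0.5%", so the ratio for that period is a lower bound. The dictionary layout and function name are illustrative only.

```python
# Runtime overhead in percent, as quoted above (Speedometer2).
# The 100 ms ETR polling value is an upper bound ("less than 0.5%").
OVERHEAD_PERCENT = {
    100: {"etr_polling": 0.5, "snapshot": 2.1},
    10: {"etr_polling": 4.6, "snapshot": 22.0},
}


def snapshot_to_polling_ratio(period_ms):
    """Return how many times larger the snapshot overhead is."""
    row = OVERHEAD_PERCENT[period_ms]
    return row["snapshot"] / row["etr_polling"]


for period_ms in (100, 10):
    print(period_ms, "ms:", round(snapshot_to_polling_ratio(period_ms), 1))
```

At both periods the snapshot overhead comes out at more than four times the ETR polling overhead.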
We could probably utilize the ETM strobing feature and reduce frequency
of data collection but I see a problem when I'm using both.
Within a minute of profiling the ETM generates a reasonable profile size
(with strobing autofdo,preset=9 with period 0x1000 it is up to 20MB).
But then the size grows disproportionately.
With a 4 minute run I got a 6.3GB profile.
I don't see such a problem with the ETR polling patch.
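To put the growth in perspective, the observed size can be compared with a simple linear extrapolation of the one-minute figure. This is a rough check only: it assumes decimal units (1GB = 1000MB) and treats the "up to 20MB" figure as the per-minute rate, so the result is a lower bound on the excess. The function name is hypothetical.

```python
MB_PER_GB = 1000  # assumption: decimal units


def growth_vs_linear(size_per_minute_mb, minutes, observed_mb):
    """Ratio of the observed profile size to a linear extrapolation."""
    expected_mb = size_per_minute_mb * minutes
    return observed_mb / expected_mb


# 20 MB after one minute, 6.3 GB after four minutes.
ratio = growth_vs_linear(20, 4, 6.3 * MB_PER_GB)
print(round(ratio))
```

A linear extrapolation gives about 80MB for four minutes, so the 6.3GB profile is larger by a factor of roughly 79.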
Leo, could you please take a look at this problem?
Thanks,
Denis
I do think this patchset presents a valuable feature to add to the
CoreSight support.
This closes a gap on systems where we do not have an interrupt to
trigger on a full sink, and is of evident value to users, giving them
predictable trace sessions, with an even spread of trace across the
session, as has been requested a number of times.
I note the concern regarding stopping trace while the event is active
- but is this not precisely what we do with ETE/TRBE when it hits an
interrupt? The hardware is stopped and the IRQ updates the perf aux
buffer, and allows the perf core to decide what to do next. In this
patchset we are simply replacing the IRQ with a timer.
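A toy model may help show why a regular timer evens out capture volume. It is not the driver logic: it assumes a sink that wraps and overwrites old data, a constant trace rate, and that each drain (whether driven by scheduling or by a timer) copies at most one sink buffer's worth into the aux buffer. All names and numbers are made up for illustration.

```python
def collected_bytes(drain_ticks, total_ticks, rate, capacity):
    """Bytes retained from a wrapping sink in a toy model.

    drain_ticks: ticks at which the sink is stopped and drained.
    A final drain at total_ticks (session end) is always added.
    Between two drains the sink keeps at most `capacity` bytes.
    """
    points = {t for t in drain_ticks if 0 < t < total_ticks}
    points.add(total_ticks)
    collected = 0
    last = 0
    for tick in sorted(points):
        produced = (tick - last) * rate
        collected += min(capacity, produced)  # older data was overwritten
        last = tick
    return collected


# Irregular, scheduling-driven drains versus a fixed 10-tick timer.
irregular = collected_bytes([5, 90], total_ticks=100, rate=10, capacity=200)
periodic = collected_bytes(range(10, 101, 10), total_ticks=100, rate=10, capacity=200)
print(irregular, periodic)
```

In this model the irregular schedule loses most of the trace produced in its long gap between drains, while the timer keeps every interval within the sink capacity. The real behaviour depends on buffer sizes, trace rates and the cost of each drain.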
Though ETR does not have a direct interrupt - it does have a full
flag, which can be - in a system dependent way - wired through the CTI
and into a PE interrupt. This would give the equivalent operation of
TRBE. Now I do not suggest pursuing this course - as this depends on
each system's architecture - the timer is a better fit for a
generic solution.
The main issue I see with this solution - which is shared with
ETE/TRBE, is that for accurate decode we must know the boundaries of
each captured block in the AUXTRACE buffer in order to reset the
decoder. As has been discussed elsewhere, this work is in hand.
While there are clearly technical implementation details to be fixed
in this set, which others have commented on in the subsequent patches,
I believe in principle it is sound.
Regards
Mike
-- 
Mike Leach
Principal Engineer, ARM Ltd.
Manchester Design Centre. UK
