On 22 November 2017 at 11:23, Robert Walker robert.walker@arm.com wrote:
-----Original Message-----
From: Mike Leach [mailto:mike.leach@linaro.org]
Sent: 22 November 2017 10:59
To: Mathieu Poirier mathieu.poirier@linaro.org
Cc: Robert Walker Robert.Walker@arm.com; CoreSight@lists.linaro.org
Subject: Re: [PATCH 2/2] perf: Fix branch stack records from CoreSight ETM decode
Hi Rob, Mathieu,
On 21 November 2017 at 20:32, Mathieu Poirier mathieu.poirier@linaro.org wrote:
On 21 November 2017 at 13:05, Robert Walker robert.walker@arm.com wrote:
-----Original Message-----
From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
Sent: 21 November 2017 17:46
To: Mike Leach mike.leach@linaro.org
Cc: Robert Walker Robert.Walker@arm.com; CoreSight@lists.linaro.org
Subject: Re: [PATCH 2/2] perf: Fix branch stack records from CoreSight ETM decode
On 21 November 2017 at 10:41, Mike Leach mike.leach@linaro.org wrote:
On 20 November 2017 at 15:21, Mathieu Poirier mathieu.poirier@linaro.org wrote:

I noticed that just doing a "perf report --stdio" on the autoFDO branch hangs with the commit I pointed out.
This hangs without Rob's patches too.
Correct - since Rob is already roaming in that code I was hoping he could have a look.
--stdio --dump works, --stdio only hangs.
I've tried it on a few trace captures from the HiKey 960 - it does complete eventually, but takes 10-20 minutes for a 50 MB input file. Does perf report ever complete for you if you leave it for a longer time?
Thanks for looking into this. It probably does complete but it's just a matter of giving it time. My development environment is a little far from that right now so I'll test again when you send a second revision.
If I inspect it with gdb, it seems to be spending a lot of time in cs_etm__run_decoder() making calls to cs_etm_decoder__process_data_block() - these usually only add a single packet to the output queue for cs_etm__sample(), but it is making *slow* progress through the trace data. Digging down a bit further, cs_etm_decoder__process_data_block() is most often calling the decoder with OCSD_OP_FLUSH because the previous call returned a WAIT response. I wonder if there's an efficiency problem here? With dense trace (i.e. all ATOM packets), it ends up calling into the trace decoder and cs_etm__sample() for almost every bit in the trace data. Can we make it build up a larger queue of packets from the decoder to pass to cs_etm__sample()?
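The drive-loop cost described above can be sketched with a toy Python model (the real code is C in perf's cs-etm-decoder; the names ToyDecoder, run_decoder, OP_DATA/OP_FLUSH are invented stand-ins, not the actual API). If the packet sink returns WAIT after every packet, the driver must issue a FLUSH call per data byte, roughly doubling the decoder calls compared with a sink that returns CONT:

```python
# Toy model of the perf <-> decode-library drive loop. All names are
# invented for illustration; the real loop lives in C and uses the
# OpenCSD CONT/WAIT/FLUSH response codes.

CONT, WAIT = "CONT", "WAIT"
OP_DATA, OP_FLUSH = "DATA", "FLUSH"

class ToyDecoder:
    def __init__(self, sink):
        self.sink = sink
        self.calls = 0            # total driver->decoder calls

    def process(self, op, byte=None):
        self.calls += 1
        if op == OP_DATA:
            return self.sink(byte)   # deliver one decoded packet
        return CONT                  # FLUSH: pending work now done

def run_decoder(trace_bytes, sink):
    dec = ToyDecoder(sink)
    for b in trace_bytes:
        resp = dec.process(OP_DATA, b)
        while resp == WAIT:              # sink asked us to pause:
            resp = dec.process(OP_FLUSH)  # must flush before more data
    return dec.calls

packets = []

def always_wait(p):      # a single-packet queue: pause after each packet
    packets.append(p)
    return WAIT

def always_cont(p):      # a sink that consumes packets immediately
    packets.append(p)
    return CONT

trace = bytes(range(64))
calls_wait = run_decoder(trace, always_wait)   # 2 calls per byte
packets.clear()
calls_cont = run_decoder(trace, always_cont)   # 1 call per byte
```

Under these (invented) assumptions the WAIT-per-packet sink costs twice as many decoder calls for the same trace.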
There is probably room for improvement as this code has stayed largely untouched since inception where our goal was to "just make it work". One way or another we'll have to look at it again when I get to implement support for cpu-wide scenarios. I'm currently half-way into how Intel has done it but had to set that aside to concentrate on upstreaming support for per-thread scenarios. I'm expecting a big sit-in with Mike at the Hong Kong Connect to iron out what and how we'll make cpu-wide tracing work.
I'd agree with the above - we haven't really looked at optimising the code to date.
The decode library client registers a callback with the library to accept decoded trace packets, and can return CONT / WAIT accordingly.
I would expect that perf should process each packet immediately (which is likely what does happen with --dump) and return CONT, which avoids the need to run a FLUSH operation. I seem to recall Tor mentioning that perf may be buffering a couple of packets - there was a reason perf needed a couple of consecutive packets to generate the necessary perf structures - and then returning WAIT, which must be followed by a FLUSH call before more trace data is fed into the library. (We need the flush because a multi-atom byte can generate many output packets, so we must not feed in the next byte of trace data until the current one is done.) It may be better to buffer up multiple packets in a future version.
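The buffering idea at the end of that paragraph can be illustrated with another toy model (again Python, with invented names - the real protocol is the C library's CONT/WAIT/FLUSH responses): a client that accepts N packets before returning WAIT forces proportionally fewer WAIT-then-FLUSH round trips than one that pauses after every packet.

```python
# Toy illustration of client-side packet buffering. BufferingSink and
# its methods are invented for the sketch, not part of any real API.

CONT, WAIT = "CONT", "WAIT"

class BufferingSink:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = []
        self.flushes = 0          # WAIT->FLUSH round trips forced

    def on_packet(self, pkt):
        self.buf.append(pkt)
        # Pause the decoder only once the buffer is full.
        return WAIT if len(self.buf) >= self.capacity else CONT

    def drain(self):              # what the FLUSH path would trigger
        self.flushes += 1
        self.buf.clear()

def feed(sink, n_packets):
    for p in range(n_packets):
        if sink.on_packet(p) == WAIT:
            sink.drain()

one_at_a_time = BufferingSink(1)   # today's behaviour: WAIT per packet
feed(one_at_a_time, 1024)
batched = BufferingSink(8)         # hypothetical: WAIT per 8 packets
feed(batched, 1024)
```

With 1024 packets, the single-packet sink forces 1024 flush round trips against 128 for the batched one - an 8x reduction in library re-entries under these toy assumptions.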
I am working on some extension code for the library that will implement an in-library packet buffer, allowing clients to have a polling/request type API (this is to make it easier to interface with scripting languages such as Python for standalone library testing - registering callbacks from C into Python seems tricky at best). It may be we can adjust the perf code to use the same API.
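One possible shape for such a polling/request API, sketched in Python purely for illustration (PollingDecoder, push_data and poll are invented names; the real extension would live in the C library): the library buffers decoded packets internally and the client pulls them at its own pace instead of registering a callback.

```python
# Hypothetical polling-style decode API. Everything here is invented
# for illustration; it is not the library's actual interface.
from collections import deque

class PollingDecoder:
    def __init__(self, decode_fn):
        self._decode = decode_fn   # raw bytes -> iterable of packets
        self._queue = deque()      # in-library packet buffer

    def push_data(self, data):
        """Feed raw trace bytes; decoded packets accumulate internally."""
        self._queue.extend(self._decode(data))

    def poll(self):
        """Return the next decoded packet, or None if nothing is pending."""
        return self._queue.popleft() if self._queue else None

# Usage with a dummy "decoder" that yields one packet per input byte:
dec = PollingDecoder(lambda data: [f"pkt{b}" for b in data])
dec.push_data(b"\x01\x02")
first = dec.poll()
```

A pull model like this maps naturally onto scripting-language bindings, since the binding only has to expose two plain functions rather than marshal a callback across the language boundary.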
Regards
Mike
I've been running some tests on the various branches of https://github.com/Linaro/perf-opencsd.git. On a 50 MB capture file, perf report takes:
- master: 19m15s
- autoFDO: > 1hr (I killed it after 1hr)
- autoFDO+my patches: 15m33s (which is about the same time as perf inject takes)
I'm running perf on an x86 box, so these times would be slower on something like a Juno. I think the autoFDO branch is so much slower because it's generating instruction samples every 64 branches (every time the branch stack fills up) which then go through whatever processing perf report does to produce the statistics. With my patches, the samples are only generated and processed every 10000 instructions (or whatever interval is specified).
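A back-of-the-envelope check supports that explanation. The figures below are invented for illustration (not measured from the 50 MB capture): assuming the trace decodes to 100M instructions with one branch per 5 instructions, sampling on every full 64-entry branch stack produces roughly 30x more samples for perf report to process than sampling every 10000 instructions.

```python
# Hypothetical numbers, chosen only to show the ratio between the two
# sampling schemes; they are not measurements from the actual trace.

instructions = 100_000_000           # assumed instructions in the trace
branches = instructions // 5         # assume 1 branch per 5 instructions

branch_stack_depth = 64              # autoFDO branch: sample per full stack
instr_period = 10_000                # patched code: sample per 10000 instrs

samples_autofdo = branches // branch_stack_depth
samples_patched = instructions // instr_period
ratio = samples_autofdo / samples_patched   # ~31x more samples to process
```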
OK - I assume these tests were for 'perf report --stdio'. Given that, I don't see why the addition of the AutoFDO code - which I believe is there to make 'perf inject' work - should alter an apparently unrelated command in the slightest.
Mike
Regards
Rob