Re: [PATCH 2/2] perf: Fix branch stack records from CoreSight ETM decode

21 Nov 2017


On 21 November 2017 at 13:05, Robert Walker robert.walker@arm.com wrote:
...
...
-----Original Message-----
From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
Sent: 21 November 2017 17:46
To: Mike Leach mike.leach@linaro.org
Cc: Robert Walker Robert.Walker@arm.com; CoreSight@lists.linaro.org
Subject: Re: [PATCH 2/2] perf: Fix branch stack records from CoreSight ETM
decode
On 21 November 2017 at 10:41, Mike Leach mike.leach@linaro.org wrote:
...
On 20 November 2017 at 15:21, Mathieu Poirier
mathieu.poirier@linaro.org wrote:
...
I noticed that just doing a "perf report --stdio" on the autoFDO
branch hangs with the commit I pointed out.
This hangs without Rob's patches too.
Correct - since Rob is already roaming in that code I was hoping he
could have a look.
...
--stdio --dump works, --stdio only hangs.
I've tried it on a few trace captures from the HiKey 960 - it does complete eventually, but takes 10-20 minutes for a 50 MB input file.  Does perf report ever complete for you if you leave it for a longer time?
Thanks for looking into this.  It probably does complete but it's just
a matter of giving it time.  My development environment is a little
far from that right now so I'll test again when you send a second
revision.
...
If I inspect it with gdb, it seems to be spending a lot of time in cs_etm__run_decoder() making calls to cs_etm_decoder__process_data_block() - these usually only add a single packet to the output queue for cs_etm__sample(), but it is making *slow* progress through the trace data.  Digging down a bit further, cs_etm_decoder__process_data_block() is most often calling the decoder with OCSD_OP_FLUSH because the previous call returned a WAIT response.  I wonder if there's an efficiency problem here?  With dense trace (i.e. all ATOM packets), it ends up calling into the trace decoder and cs_etm__sample() for almost every bit in the trace data. Can we make it build up a larger queue of packets from the decoder to pass to cs_etm__sample()?
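To illustrate the batching question, here is a minimal sketch in C. It is not perf or OpenCSD code: every name in it (sink_resp, packet_queue, queue_push, queue_drain, run_decode, QUEUE_CAP) is hypothetical, and it assumes a simplified model where the packet sink decides when to ask the decoder to wait. It only counts how often the consumer, standing in for the sampling step, has to run for a given queue limit.

```c
#include <stddef.h>

/* Hypothetical model only: none of these names exist in perf or OpenCSD. */
enum sink_resp { SINK_CONT, SINK_WAIT };

#define QUEUE_CAP 64

struct packet_queue {
	int    packets[QUEUE_CAP];
	size_t count;
	size_t limit;	/* packets to accumulate before asking the decoder to wait */
};

/* Packet sink: stores one packet, returns WAIT once `limit` packets are queued. */
static enum sink_resp queue_push(struct packet_queue *q, int pkt)
{
	q->packets[q->count++] = pkt;
	return q->count >= q->limit ? SINK_WAIT : SINK_CONT;
}

/* Consumer standing in for the sampling step: empties the whole queue at once. */
static size_t queue_drain(struct packet_queue *q)
{
	size_t n = q->count;

	q->count = 0;
	return n;
}

/* Feed `total` packets through the queue; return how many times the consumer ran. */
static size_t run_decode(size_t total, size_t limit)
{
	struct packet_queue q = { .count = 0, .limit = limit };
	size_t drains = 0;

	/* Clamp the limit so the fixed-size array can never overflow. */
	if (q.limit == 0 || q.limit > QUEUE_CAP)
		q.limit = QUEUE_CAP;

	for (size_t i = 0; i < total; i++) {
		if (queue_push(&q, (int)i) == SINK_WAIT) {
			queue_drain(&q);
			drains++;
		}
	}

	/* Drain whatever is left once the trace data runs out. */
	if (q.count > 0) {
		queue_drain(&q);
		drains++;
	}
	return drains;
}
```

In this model a limit of 1 makes the consumer run once per packet, which is roughly the behaviour described above, while a larger limit makes it run once per batch. Whether the real decoder callback can defer its WAIT response in this way is an assumption that would need checking against the library.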
There is probably room for improvement as this code has stayed largely
untouched since inception where our goal was to "just make it work".
One way or another we'll have to look at it again when I get to
implement support for cpu-wide scenarios.  I'm currently half-way into
how Intel has done it but had to set that aside to concentrate on
upstreaming support for per-thread scenarios.  I'm expecting a big
sit-in with Mike at the Hong Kong Connect to iron out what and how
we'll make cpu-wide tracing work.
...
Regards
Rob
...
I'm not surprised as --dump avoids a lot of code.
...
Not sure we haven't seen this before?
Mike
