Hi Wojciech,
On Mon, Apr 29, 2019 at 05:24:47PM +0000, Wojciech Żmuda wrote:
[...]
Analyzing this, I learned that perf-script is capable of understanding perf.data AUXTRACE section and parsing some of the trace elements to
branch samples, which illustrate how the IP moved around.
These pieces of information are available for the built-in python interpreter, so we can script it to get assembly from the program image.
If I understand perf-script in its current shape correctly, it ignores all the non-branching events (so everything that's not an ATOM, EXCEPTION or TRACE_ON packet) - specifically, timestamping is lost during the process. I'd like to modify perf-script to generate samples on such timing events,
The CoreSight trace data is saved into perf.data (it is compressed data), and we need to use OpenCSD to decode the trace data and output the different kinds of packets.
Based on these packets, perf-script (and perf-report) can generate branch samples. It can also generate instruction samples and last branch stack samples if we specify the flags 'i' and 'l' for the option '--itrace'; by default, when no flags are specified for '--itrace', perf-script only generates branch samples.
Looking at the disassembly python script, I thought branch samples are enough to reconstruct the program flow. Since it is possible to get instruction samples as well (with --itrace=il - skipping 'l' here gives me a segfault), do you see any usage of this sample type? It was not mentioned in Linux/CoreSight documents I've read, so I'm not quite sure how I can use this feature.
For the segmentation fault, please see the patches [1].
For the '--itrace' use case, as far as I know Mike/Rob worked on it for AutoFDO optimization of programs. For this part, you could refer to Mike's slides [2] on the advanced AutoFDO use case; my slides [3] give a very brief introduction (starting from page 26).
[1] https://archive.armlinux.org.uk/lurker/message/20190428.083227.a35c261b.en.h...
[2] https://connect.linaro.org/resources/yvr18/sessions/yvr18-225/
[3] https://s3.amazonaws.com/connect.linaro.org/yvr18/presentations/yvr18-416.pd...
[...]
If we use the command 'perf script -F time', it should output samples with the timestamp field. From my testing, though, this command fails, and I am not sure whether this is caused by the reason you mentioned, that timestamping is lost during the process. If it is, how about fixing the issue for 'perf script -F time'?
I confirm that '-F time' does not work. This does not seem odd, since the time field is empty, but I don't understand the error. Although perf-script generated branch samples, it complains about 'dummy' samples:
# perf script -F time
Samples for 'dummy:u' event do not have TIME attribute set. Cannot print 'time' field.
I browsed the sample generation code (cs-etm.c) and I can't see code producing this type of sample. Anyway, I think this may be a bug in the printing part of perf-script, since I actually managed to populate the time field and access it in python (see below).
You could first check the time field with other events (e.g. the PMU cycle event). If it works well for other events, it seems to me the issue is in the CoreSight sample generation rather than in the perf-script common code.
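As a rough way to do that check, a perf-script python handler along the following lines could count, per event name, how many samples arrive with a non-zero time. This is only a sketch: it assumes the handler receives the usual param_dict with an 'ev_name' entry and a 'sample' dictionary that holds 'time'; the 'counts' table and the output format are made up for illustration.

```python
from collections import defaultdict

# Hypothetical bookkeeping: event name -> [samples seen, samples with a non-zero time]
counts = defaultdict(lambda: [0, 0])

def process_event(param_dict):
    # perf-script calls this once per sample when the script is loaded.
    sample = param_dict.get("sample", {})
    entry = counts[param_dict.get("ev_name", "?")]
    entry[0] += 1
    if sample.get("time", 0):
        entry[1] += 1

def trace_end():
    # Called by perf-script after the last sample; print one summary line per event.
    for name, (total, timed) in sorted(counts.items()):
        print("%-24s samples=%d with_time=%d" % (name, total, timed))
```

Saved under a name of your choice (say check_time.py), it could be run with 'perf script -s check_time.py' against a perf.data recorded with CoreSight and, separately, against one recorded with a PMU cycle event, to compare the two.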
[...]
I need to investigate it further to make sure getting timestamps from this source is a good idea - I'm not convinced if timestamp as a packet queue parameter is refreshed frequently enough to keep up with timestamp packets actually emitted.
Thanks for digging into this. I will test after you work out formal patches.
[...]
In brief, what benefit do we get from enabling timestamps for CoreSight branch events that cannot be fulfilled by perf's cpu-clock/task-clock events and the PMU CPU cycle event?
I tried to research cpu-clock and task-clock and it looks like they are based on wall clock, while CS timestamping is CPU-independent. Measurement with CS may help to narrow down instruction stalls, which, I believe, would be hidden otherwise. CPU cycle seems like a good measurement in this case, but I'm not sure if correlating PMU events with specific instruction range wouldn't be harder than extracting timestamps we already have in the stream.
To be honest, I do not really understand CPU microarchitecture and pipelines, so I cannot give any useful suggestion for this part.
But if you can demonstrate CoreSight + timestamps for CPU microarchitecture performance profiling (e.g. finding instruction/data dependencies in the CPU's pipeline), it seems to me this would be a very cool use case :)
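To make the idea a bit more concrete, a perf-script python handler could look at the gap between consecutive timestamped samples on each CPU and report the largest ones. The sketch below is hypothetical: it assumes the samples already carry a non-zero 'time' (which the current CoreSight sample generation does not provide), it treats the value as raw timestamp units, and the names 'last', 'gaps' and 'TOP_N' are invented for illustration.

```python
import heapq

TOP_N = 10   # hypothetical: how many of the largest gaps to report
last = {}    # cpu -> (time, ip) of the previous timestamped sample
gaps = []    # (delta, cpu, previous ip, current ip)

def process_event(param_dict):
    sample = param_dict["sample"]
    cpu = sample.get("cpu", -1)
    ip = sample.get("ip", 0)
    time = sample.get("time", 0)
    if not time:
        return  # sample without a timestamp, nothing to measure
    prev = last.get(cpu)
    if prev is not None and time >= prev[0]:
        gaps.append((time - prev[0], cpu, prev[1], ip))
    last[cpu] = (time, ip)

def trace_end():
    # Largest gaps first; a large gap between two nearby addresses may hint at a stall.
    for delta, cpu, start, end in heapq.nlargest(TOP_N, gaps):
        print("cpu%d 0x%x -> 0x%x: %d" % (cpu, start, end, delta))
```

Whether such gaps are fine-grained enough to be useful depends on how often timestamp packets are actually emitted in the trace.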
Thanks,
Leo Yan