On 14 April 2017 at 16:35, Etemadi, Mohammad <mohammad.etemadi(a)intel.com> wrote:
> Hello Mathieu
>
> We have changed the device tree to reflect all CoreSight components on our platform.
> Still, I do not see any devices listed in the following directory:
>
> /sys/bus/coresight/devices
>
> Am I missing something in the Linux 4.9 configuration? I have enabled CONFIG_CORESIGHT in .config.
When the system boots, AMBA probes for devices connected (and
discoverable) on the bus. The first thing to do is make sure the CS
devices show up at enumeration time. I suggest instrumenting function
amba_device_try_add() [1] to see if the CS devices are discovered. If
not, then it is probably a matter of enabling the debug power domain.
While debugging I also suggest booting with the "nohlt" option or
disabling CPUidle completely. That way tracers (usually located in the
CPU/cluster power domain) are guaranteed to be enabled.
If CS devices show up then you will need to instrument the _probe()
function of each CS driver to see what makes them unhappy.
Mathieu
[1]. http://lxr.free-electrons.com/source/drivers/amba/bus.c#L344
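As a quick first check before instrumenting anything, here is a minimal sketch assuming the standard sysfs layout (the device names themselves are platform specific):

# were the CoreSight components enumerated as AMBA devices?
ls /sys/bus/amba/devices/
# did the CoreSight drivers bind to them?
ls /sys/bus/coresight/devices/
# confirm "nohlt" made it onto the kernel command line
cat /proc/cmdline
# probe failures from the CS drivers usually show up in the kernel log
dmesg | grep -i coresight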
>
> Regards, Reza
>
>
> -----Original Message-----
> From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
> Sent: Friday, March 31, 2017 3:37 PM
> To: Etemadi, Mohammad <mohammad.etemadi(a)intel.com>
> Subject: Re: Perf-opencsd-4.9
>
> On 31 March 2017 at 14:32, Etemadi, Mohammad <mohammad.etemadi(a)intel.com> wrote:
>> Thanks Mathieu
>>
>> I am using the Yocto toolchain, so I have to add --sysroot to both the opencsd and tools/perf makefiles.
>> If I see more problems, I may have to switch building them on the target.
>
> Ok, let me know.
>
> I will be travelling for the next two weeks. As such I may be slow to respond, if at all. While I am away you can always send your questions to "coresight(a)lists.linaro.org". There are a lot of very knowledgeable people there who can help you.
>
>>
>> Regards, Reza
>>
>> -----Original Message-----
>> From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
>> Sent: Friday, March 31, 2017 3:27 PM
>> To: Etemadi, Mohammad <mohammad.etemadi(a)intel.com>
>> Subject: Re: Perf-opencsd-4.9
>>
>> For my own sanity I tried doing the same thing and everything works as advertised. The only difference is that I'm not cross compiling and I have gcc 5.4.0. We know how to fix the gcc 6.2 problem and the cross compilation isn't part of the equation.
>>
>> For the perf tools I am working with branch perf-opencsd-4.9 (b50067a52cf3). On the openCSD side I am using the master branch (054c07caa2eb).
>>
>> Mathieu
>>
>> On 31 March 2017 at 13:52, Etemadi, Mohammad <mohammad.etemadi(a)intel.com> wrote:
>>> Thanks Mathieu
>>>
>>> I am using master branch
>>>
>>> git clone https://github.com/Linaro/OpenCSD.git my-opencsd
>>> git checkout -b master origin/master
>>>
>>> Is this a correct branch for opencsd?
>>>
>>> I am cross compiling both opencsd and tools/perf.
>>>
>>> Regards, Reza
>>>
>>> -----Original Message-----
>>> From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
>>> Sent: Friday, March 31, 2017 2:34 PM
>>> To: Etemadi, Mohammad <mohammad.etemadi(a)intel.com>
>>> Subject: Re: Perf-opencsd-4.9
>>>
>>> On 31 March 2017 at 12:58, Etemadi, Mohammad <mohammad.etemadi(a)intel.com> wrote:
>>>> Hello Mathieu
>>>>
>>>>
>>>> We have taken all the patches. When building tools/perf we get the following compilation errors.
>>>> Do you have some ideas if we are missing a patch?
>>>
>>> The problem comes from mismatches between the kernel and the openCSD version. We try to keep them working for older versions but that can't be guaranteed all the time.
>>>
>>>>
>>>> CC util/parse-events.o
>>>> CC util/parse-events-flex.o
>>>> CC util/pmu.o
>>>> util/cs-etm.c:1282:27: error: 'cs_etm_global_header_fmts' defined
>>>> but not used [-Werror=unused-const-variable=]
>>>>  static const char * const cs_etm_global_header_fmts[] = {
>>>> ^~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>> Right, you are building branch 4.9 with a gcc version that is higher than 6.2. We have a patch for that on the 4.11-rc1 branch [1].
>>>
>>> [1].
>>> https://github.com/Linaro/OpenCSD/commit/2379238cd554a3445b87ceaa4ef0015c5d25c4b3
>>>
>>>> CC util/pmu-flex.o
>>>> util/cs-etm-decoder/cs-etm-decoder.c: In function 'cs_etm_decoder__gen_trace_elem_printer':
>>>> util/cs-etm-decoder/cs-etm-decoder.c:204:7: error: 'OCSD_GEN_TRC_ELEM_SWTRACE' undeclared (first use in this function)
>>>> case OCSD_GEN_TRC_ELEM_SWTRACE:
>>>> ^~~~~~~~~~~~~~~~~~~~~~~~~
>>>> util/cs-etm-decoder/cs-etm-decoder.c:204:7: note: each undeclared
>>>> identifier is reported only once for each function it appears in
>>>> util/cs-etm-decoder/cs-etm-decoder.c:205:7: error: 'OCSD_GEN_TRC_ELEM_CUSTOM' undeclared (first use in this function)
>>>> case OCSD_GEN_TRC_ELEM_CUSTOM:
>>>> ^~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>> Those two are related to the openCSD version. What version of the library are you working with? In any case, if you stick with the latest revision (0.5.4) or the tip of the tree you should be fine. The commit that added these two is here [2].
>>>
>>> [2].
>>> https://github.com/Linaro/OpenCSD/commit/885cb935cf04acc3d0afd2bb242bd9f7328e7104
>>>
>>>> mv: cannot stat 'util/cs-etm-decoder/.cs-etm-decoder.o.tmp': No such
>>>> file or directory
>>>>
>>>> Regards, Reza
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
>>>> Sent: Thursday, March 30, 2017 10:15 AM
>>>> To: Etemadi, Mohammad <mohammad.etemadi(a)intel.com>
>>>> Subject: Re: Perf-opencsd-4.9
>>>>
>>>> On 30 March 2017 at 09:12, Etemadi, Mohammad <mohammad.etemadi(a)intel.com> wrote:
>>>>> Thanks.
>>>>>
>>>>> Do you have links to these patches?
>>>>
>>>> It's all there:
>>>> https://github.com/Linaro/OpenCSD/commits/perf-opencsd-4.9
>>>>
>>>>>
>>>>> Regards, Reza
>>>>>
>>>>> -----Original Message-----
>>>>> From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
>>>>> Sent: Thursday, March 30, 2017 10:06 AM
>>>>> To: Etemadi, Mohammad <mohammad.etemadi(a)intel.com>
>>>>> Subject: Re: Perf-opencsd-4.9
>>>>>
>>>>> On 30 March 2017 at 08:51, Etemadi, Mohammad <mohammad.etemadi(a)intel.com> wrote:
>>>>>> Thanks Mathieu
>>>>>>
>>>>>> This is an in-house board and we had to do some BSP changes to boot Linux 4.9 on our board.
>>>>>
>>>>> Ok.
>>>>>
>>>>>> I just want to make sure that I merge perf-openCSD-4.9 correctly with our source tree.
>>>>>> So, are you saying that I only need to replace tool/perf directory?
>>>>>
>>>>> Simply apply the patches you find on perf-opencsd-4.9; that will make life easier for you.
>>>>>
>>>>>>
>>>>>> Regards, Reza
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mathieu Poirier [mailto:mathieu.poirier@linaro.org]
>>>>>> Sent: Thursday, March 30, 2017 9:40 AM
>>>>>> To: Etemadi, Mohammad <mohammad.etemadi(a)intel.com>
>>>>>> Subject: Re: Perf-opencsd-4.9
>>>>>>
>>>>>> On 29 March 2017 at 19:47, Etemadi, Mohammad <mohammad.etemadi(a)intel.com> wrote:
>>>>>>> Hello Matt
>>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We have Linux 4.9 running on our ARMv8 platform. I would like to try
>>>>>>> instruction tracing using perf-opencsd-4.9.
>>>>>>
>>>>>> Ok, that should be an interesting project.
>>>>>>
>>>>>>>
>>>>>>> I noticed that there is a Linaro Linux source tree for perf-opencsd-4.9.
>>>>>>> Looks like this tree consists of base
>>>>>>>
>>>>>>> Linux kernel 4.9 plus some additions to support CoreSight
>>>>>>> and instruction trace using Perf tool.
>>>>>>
>>>>>> All the kernel side of the solution is already upstream. What is on GitHub is the part of the user space perf tools that hasn't been upstreamed yet (we are working on it), plus the openCSD decoding library.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> For adding trace functionality, Is it possible to get a patch
>>>>>>> that I can apply to my base Linux 4.9?
>>>>>>
>>>>>> I'm not sure what you mean by "a patch I can add" - can you rephrase or give me more details? All the patches on top of mainline (on GitHub) should apply cleanly to your tree.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Also, which files must be changed to reflect a specific SoC CoreSight topology?
>>>>>>
>>>>>> CoreSight is different on every platform, which is why all the platform specific stuff has been pushed to the device tree. Is this a commercial board or something in-house?
>>>>>>
>>>>>> In the former case, support for vexpress[1], juno (R0/1/2)[2] and the dragonboard[3] is already upstream. There is also something for HiKey that I could share with you if need be. If this is an in-house project you will need to make up your own device tree based on the topology you are working with.
>>>>>>
>>>>>> I'm always interested by what people are doing with CoreSight. Get back to me if you have more questions.
>>>>>>
>>>>>> Regards,
>>>>>> Mathieu
>>>>>>
>>>>>> [1]. http://lxr.free-electrons.com/source/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
>>>>>> [2]. http://lxr.free-electrons.com/source/arch/arm64/boot/dts/arm/juno-base.dtsi
>>>>>> [3]. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/boot/dts/qcom/msm8916.dtsi?id=refs/tags/v4.11-rc4
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Regards, Reza
Hi Thierry,
I see you have also sent this mail to Mathieu, who has answered some of the points and CC'ed the Linaro CoreSight mailing list.
I'll give you my spin on a couple of things here.....
> -----Original Message-----
> From: Thierry Laviron
> Sent: 29 June 2017 15:45
> To: Mike Leach
> Subject: Using Coresight in SysFS mode on Juno board
>
> Hi Mike,
>
>
>
> I am currently trying to get trace data using the CoreSight system in SysFS
> mode on my Juno r2 board.
>
>
>
> I found some documentation on how to use it in the
> Documentation/trace/coresight.txt file of the perf-opencsd-4.11 branch of the
> OpenCSD repository.
>
>
>
> This document says that I can retrieve the trace data from /dev/ using dd, for
> example in my case that would be
>
> root@juno-debian:~# dd if=/dev/20070000.etr of=~/cstrace.bin
>
>
>
> However, I am assuming this produces a dump of the memory buffer as it was
> when I stopped trace collection,
>
> And that I do not have the full trace data generated (because it does not fit on
> the buffer).
>
> I would like to be able to capture a continuous stream of data from the ETR, but
> did not find how should I do that.
>
It is not possible to read trace while still collecting it - the process you are tracing must be stopped while trace is saved. Perf can achieve this as it is integrated into the kernel, but this is difficult to achieve from the sysfs interface.
As Mathieu says, you need to limit the amount of trace to the application you are tracing - but even so, the rate of trace collection can easily overflow buffers.
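For reference, the usual sysfs capture cycle is a stop-then-read sequence along the lines of coresight.txt; a minimal sketch, assuming an ETR sink at 20070000.etr and using <addr>.etm and ./my_app as placeholders for the source name and the workload:

# enable the sink first, then the source
echo 1 > /sys/bus/coresight/devices/20070000.etr/enable_sink
echo 1 > /sys/bus/coresight/devices/<addr>.etm/enable_source
# run the workload to be traced
./my_app
# stop collection, then read back whatever is left in the buffer
echo 0 > /sys/bus/coresight/devices/<addr>.etm/enable_source
dd if=/dev/20070000.etr of=~/cstrace.bin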
>
>
> I am writing a C program. Can I open a read access to the ETR buffer like this?
>
> open("/dev/20070000.etr", O_RDONLY);
>
>
>
> and then read its content, to write somewhere else? (e.g. to a file on the disc)?
>
>
>
> As a second step, I am also trying to filter the trace generated. I found some
> useful documentation in
>
> Documentation/ABI/testing/sysfs-bus-coresight-devices-etm4x
>
> However, while this is very useful to understand what are the purpose of the
> different files that appear in the
>
> /sys/bus/coresight/devices/<mmap>.etm/ folders, I am not sure of the format
> to put stuff in.
>
>
>
> For example, I want to use the Context ID comparator, so the ETM traces only
> the process I am interested in.
>
> I assume I need to write the PID of my process in ctxid_pid, probably write 0x1
> in ctxid_idx to activate it, and leave 0x0 in ctxid_mask
>
> according to the ETM v4.3 architecture specification.
>
> But I feel that I am missing something else, as it seems the ETM is not taking
> the filter into account.
>
i) you will need PID => context ID (CONTEXTIDR) tracking enabled in your kernel (CONFIG_PID_IN_CONTEXTIDR on arm64).
ii) you need to set up the ViewInst event resource selector to select a context ID event to start and stop the trace, in addition to setting the context ID comparators.
Additionally you will need some address range enabled as well - though by default the etm drivers set up the full address range under sysfs.
The hardware registers needed for all this are described in the ETM TRM, but at present I don't know of any docs that map the sysfs names onto the relevant HW registers.
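To make the sysfs side a little more concrete, here is a rough and hypothetical sequence only, using the file names from the etm4x ABI document; it deliberately leaves out the ViewInst event/resource programming mentioned above, and <addr>, <pid>, <start> and <end> are placeholders:

cd /sys/bus/coresight/devices/<addr>.etm
# select context ID comparator 0 and program it with the target PID
echo 0 > ctxid_idx
echo <pid> > ctxid_pid
echo 0 > ctxid_masks
# keep an address range comparator covering the code of interest
echo 0 > addr_idx
echo <start> <end> > addr_range
# then enable the source (with a sink already enabled) and run the workload
echo 1 > enable_source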
Regards
Mike
>
>
> If there is more relevant documentation on this that I have not found, I would
> appreciate if you could point me to it.
>
> If not, and what I am trying to do will not work, I would welcome some advice
> on how to do it properly.
>
>
>
> Thanks in advance.
>
>
>
> Best regards,
>
>
>
> Thierry Laviron
On 29 June 2017 at 05:46, Leo Yan <leo.yan(a)linaro.org> wrote:
> Hi Mathieu, Mike,
Good morning Leo,
>
> Guodong and I are planning to enable CoreSight on Hikey960, but we
> are not quite sure whether you have a requirement for this or not.
I currently don't have the bandwidth to work on this.
> Guodong told me that so far the community has not had many requests for
> CoreSight enabling on Hikey960, so I want to check with you whether you
> are interested in CoreSight work on this platform; if you think there is
> a strong requirement we can start the related enabling with HiSilicon.
I would be delighted to have CS support on Hikey960 - current
platforms are well supported but the passage of time can't be ignored.
>
> Mathieu/Chunyan previously put a lot of effort into enabling Hikey. As
> you can see from the section below, the Hikey960 documentation for the
> CoreSight module is still very poor; so if you think this is important
> for your work, Guodong and I will sync with HiSilicon on CoreSight
> enabling ASAP (we depend heavily on HiSilicon to provide the clock and
> CoreSight topology info). From the Hikey experience this took a very
> long time, but we can summarize the check points based on previous
> experience and accelerate things a bit (if necessary, I'm glad to work
> in the HiSilicon lab to enable it).
>
> If you think this platform is redundant with the others, I will still
> send the request to HiSilicon and treat it as a low-priority task.
I don't think it's redundant at all...
Before you start implementing anything I'd like to see the CoreSight
topology for this board. Newer designs are getting more creative and
there may be cases we haven't expected in the initial design. If
that's the case I'll spot them right away and offer ways to address
the problems.
Regards,
Mathieu
>
> -----
> 2.7.2 CoreSight Debugging
> The Hi3660 has a powerful debug system that integrates an ARM
> CoreSight system. The CoreSight system supports the following features:
> - Top-level CoreSight and local CoreSight in each cluster. The local
> CoreSight contains the A73 CoreSight and A53 CoreSight.
> - Intrusive debugging (debug) and non-intrusive debugging (trace)
> A73 and A53 support both debug and trace.
> - Software debugging and traditional JTAG debugging
>
> Thanks,
> Leo Yan
Good day Thierry,
On 29 June 2017 at 03:09, Thierry Laviron <Thierry.Laviron(a)arm.com> wrote:
> Hi Mathieu,
>
>
>
> I am currently trying to get trace data using the CoreSight system in SysFS
> mode on my Juno r2 board.
>
>
>
> I found some documentation on how to use it in the
> Documentation/trace/coresight.txt file of the perf-opencsd-4.11 branch of
> the OpenCSD repository.
>
>
>
> This document says that I can retrieve the trace data from /dev/ using dd,
> for example in my case that would be
>
> root@juno-debian:~# dd if=/dev/20070000.etr of=~/cstrace.bin
>
>
>
> However, I am assuming this produces a dump of the memory buffer as it was
> when I stopped trace collection,
That is correct.
>
> And that I do not have the full trace data generated (because it does not
> fit on the buffer).
Also correct. If there was a buffer overflow then you'll only get the
latest trace data.
>
> I would like to be able to capture a continuous stream of data from the ETR,
> but did not find how should I do that.
>
Currently the only way to do that is to use coresight from the perf
interface (see HOWTO.md on github).
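For example, a sketch along the lines of HOWTO.md, assuming perf was built against the openCSD library; the sink name is the ETR from your message and the traced program is a placeholder:

perf record -e cs_etm/@20070000.etr/u --per-thread ./my_app
perf report --stdio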
>
>
> I am writing a C program. Can I open a read access to the ETR buffer like
> this?
>
> open("/dev/20070000.etr", O_RDONLY);
So simply have a read() or a select() blocking on the file descriptor,
waiting for trace data to be produced and consuming it as it is
generated?
>
>
>
> and then read its content, or pipe it somewhere else (e.g. to a file on the
> disc)?
Unfortunately no.
>
>
>
> If there is more relevant documentation on this that I have not found, I
> would appreciate if you could point me to it.
>
> If not, and what I am trying to do will not work, I would welcome some
> advice on how to do it properly.
You are raising an interesting scenario that hasn't occurred before.
When operating from sysfs the problem is to program the tracers to
reduce the amount of trace generated. Otherwise userspace can't
possibly cope and you'd end up with buffer overflows. But even
assuming you have that part covered, there is still the problem of when
to move trace data from the ETR buffer (contiguous or SG list) to the
buffer conveyed by read()/select(). That is a tedious problem that
currently doesn't have a solution.
As I said earlier this is a compelling use case. As such I am copying
the CoreSight mailing list along with Mike and Suzuki. Someone might
have some interest in working on this or some thoughts on how to
address the issue. It's even better if you want to offer a solution -
we'll be happy to provide help and support.
Thanks,
Mathieu
>
>
>
> Thanks in advance.
>
>
>
> Best regards,
>
>
>
> Thierry Laviron
>
On Fri, 26 May 2017 14:12:21 +0100 Mike Leach wrote:
> Hi,
>
> Tried out Sebastian's patches and got some similarities to Kim's results,
> but a couple of differences and some interesting results if you look at
> the disassembly of the resulting routines.
>
> So as per the AutoFDO instructions I built a sort program with no
> optimisations and debug:
> gcc -g sort.c -o sort
> This I profiled on Juno with 3000 iterations.
>
> The resulting disassembly of the bubble_sort routine is in
> bubble-sorts-disass.txt and the dump gcov profile is below...
> --------------------------------
> bubble_sort total:33987051 head:0
> 0: 0
> 1: 0
> 2: 2839
> 3: 2839
> 4: 2839
> 4.1: 8522673
> 4.2: 8519834
> 5: 8517035
> 6: 2104748
> 7: 2104748
> 8: 2104748
> 9: 2104748
> 13: 0
> -------------------------------
> So in my view the swap lines (6:-9:) - see attached sort.c - are run
> less often than the enclosing loop (2:-4:, 4.1:-5:), which is what Kim
> observed with the intel version.
> The synthesized LBR records looked reasonable from comparison with the
> disassembly too.
>
> Trying out the O3 and O3-autofdo builds from this profile resulted in the
> O3 build running marginally faster, but both ran faster than the
> unoptimised debug build.
>
> So now look at the disassemblies from the -O3 and -autofdo-O3 versions
> of the sort routine [bubble-sorts-disass.txt again]. Both appear to
> define a bubble_sort routine, but embed the same / similar code into
> sort_array.
> Unsurprisingly the O3 version is considerably more compact - hence it
> runs faster. I have no idea what the autofdo version is up to, but I
> cannot see how the massive expansion of the routine with compare and
> jump tables is going to help.
>
> So perhaps:-
> 1) the LBR stacks are still not correct - though code and dump
> inspection might suggest otherwise - are there features in the intel
> LBR we are not yet synthesizing?
> 2) There is some adverse interaction between the profiles we are
> generating and the autofdo code generation.
> 3) The amount of coverage in the gcov file is affecting the process - looking
> at the dump above, we only have trace from the bubble_sort routine. I
> did reduce the number of iterations to get more of the program
> captured in coverage but this did not seem to make a difference.
> Mike
Apologies for the delay in replying to this.
Some further thoughts on this.
1) This is not an apples-to-apples comparison. The baseline code will most likely have different optimizations applied for x86-64, which will give rise to different code paths and so different profiles. Also, is someone here able to comment on the extent to which the optimizations applied by the "autofdo-O3" compiler are machine-independent?
I assume that the work done to create that flow has been done on an x86 version of the compiler, and it might be that regressions exist in the A64 compiler that do not exist in x86: I don't know. For example, the unrolling done for the sort.c example might not be a suitable optimization for the target CPU.
This isn't a real-world code example. Bubble sort is sorting random data, so at its heart is an unpredictable compare-and-swap check, and a small inner-loop. The unrolled code, on the other hand, contains many unpredictable branches. It would be better to reproduce this experiment, if not on real-world code then at least on a more sensible benchmark.
2) AIUI, "perf inject --itrace" on the ETM uses systematic block-based sampling to break the trace into LBR records. (That is, after N trace block records it creates a sample with an LBR attached, where a trace block represents a sequence of instructions between two waypoints.) E.g. "perf inject --itrace=il64"
Conversely, also AIUI, the reference method for doing this with Intel PT samples based on a reconstructed view of time. (That is, every N reconstructed clock periods, it creates a sample with an LBR attached.) E.g. "perf inject --itrace=i100usle".
Time-based sampling will generate more samples from code hot spots, where a hot spot is defined as where *time* is spent in the program. The ETM flow will also favour hot spots, obviously, because these will appear more in the trace. However, because the sampling is not time-based, each *range* is as likely to be sampled as any other range.
E.g. if there is a short code sequence that executes in 10 clock periods and a long sequence that executes in 100 clock periods, and both appear equally often in the code, then using time-based sampling the former will appear 10x less often than the latter, but using systematic block-based sampling they appear at the same rate.
Furthermore, from a cursory look at the Intel PT code, it looks to me like the Intel PT perf driver walks through each block, instruction by instruction. If I understand this correctly, then that means that even if sampling were systematic and instruction-based rather than time-based (e.g. would "--itrace=i64i" do this on PT?), then the population for sampling is instructions rather than blocks, and again won't match what cs-etm.c is doing.
E.g. if the short code sequence is 10 instructions and the long sequence is 100 instructions, then with systematic instruction-based sampling the former block will appear 10x less often in the code, whereas with systematic block-based sampling, they appear at the same rate.
One could hack the Intel PT inject tool to implement the same kind of block-based sampling, and see what effect this has (assuming there is a good reason why the ETM inject doesn't implement the time-based sampling -- I've not investigated this). If you have such a sample you can also use the profile_diff tool from AutoFDO to compare the shape of the samples.
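For instance, here is a rough sketch of such a comparison, reusing the create_gcov/dump_gcov invocations already shown in this thread (the file names are placeholders, and I haven't checked profile_diff's exact arguments, so a plain diff of the dumps is shown instead):

# build one profile from the ETM-sourced inject output and one from the PT-sourced one
~/git/autofdo/create_gcov --binary=sort-O3 --profile=inj-etm.data --gcov=sort-etm.gcov -gcov_version=1
~/git/autofdo/create_gcov --binary=sort-O3 --profile=inj-pt.data --gcov=sort-pt.gcov -gcov_version=1
# dump both and compare their shape
~/git/autofdo/dump_gcov -gcov_version=1 sort-etm.gcov > etm.txt
~/git/autofdo/dump_gcov -gcov_version=1 sort-pt.gcov > pt.txt
diff etm.txt pt.txt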
Now, the extent to which this affects the compiler I do not know. E.g. both sampling schemes are OK for telling a compiler which branches are taken, but if the compiler thinks the samples are time-based and so represent code hotspots, then systematic block-based sampling would be misleading.
Mike.
> On 25 May 2017 at 05:12, Kim Phillips <kim.phillips at arm.com> wrote:
> > On Wed, 24 May 2017 12:48:04 -0500
> > Sebastian Pop <sebpop at gmail.com> wrote:
> >
> >> On Wed, May 24, 2017 at 11:36 AM, Mathieu Poirier
> >> <mathieu.poirier at linaro.org> wrote:
> >> > Are the instructions in the autoFDO section of the HOWTO.md on GitHub sufficient
> >> > to test this or there is another way?
> >>
> >> Here is how I tested it: (supposing that perf.data contains an ETM trace)
> >>
> >> # perf inject -i perf.data -o inj --itrace=il64 --strip
> >> # perf report -i inj -D &> dump
> >>
> >> and I inspected the addresses from the last branch stack in the output dump
> >> with the addresses of the disassembled program from:
> >>
> >> # objdump -d sort
> >
> > Re-running the AutoFDO process with these two patches continues to make
> > the resultant executable perform worse, however:
> >
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5306 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5304 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 5851 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 5889 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 5888 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5318 ms
> >
> > The gcov file generated from the inj.data (no matter whether it's
> > --itrace=il64 or --itrace=i100usle) still looks wrong:
> >
> > $ ~/git/autofdo/dump_gcov -gcov_version=1 sort-O3.gcov
> > sort_array total:19309128 head:0
> > 0: 0
> > 1: 0
> > 5: 0
> > 6: 0
> > 7.1: 0
> > 7.3: 0
> > 8.3: 0
> > 15: 2
> > 16: 2
> > 17: 2
> > 10: start total:0
> > 1: 0
> > 11: bubble_sort total:19309119
> > 2: 1566
> > 4: 6266668
> > 5: 6071341
> > 7: 6266668
> > 9: 702876
> > 12: stop total:3
> > 2: 0
> > 3: 1
> > 4: 1
> > 5: 1
> > main total:1 head:0
> > 0: 0
> > 2: 0
> > 4: 1
> > 1: cmd_line total:0
> > 3: 0
> > 4: 0
> > 5: 0
> > 6: 0
> >
> > Whereas the one generated by the intel-pt run looks correct, showing the
> > swap (11: bubble_sort 7,8) as executed fewer times:
> >
> > kim at juno sort-etm$ ~/git/autofdo/dump_gcov -gcov_version=1 ../sort-O3.gcov
> > sort_array total:105658 head:0
> > 0: 0
> > 5: 0
> > 6: 0
> > 7.1: 0
> > 7.3: 0
> > 8.3: 0
> > 16: 0
> > 17: 0
> > 1: printf total:0
> > 2: 0
> > 10: start total:0
> > 1: 0
> > 11: bubble_sort total:105658
> > 2: 14
> > 4: 28740
> > 5: 28628
> > 7: 9768
> > 8: 9768
> > 9: 28740
> > 12: stop total:0
> > 2: 0
> > 3: 0
> > 4: 0
> > 5: printf total:0
> > 2: 0
> > 15: printf total:0
> > 2: 0
> >
> > I have to run the 'perf inject' on the x86 host because of the
> > aforementioned:
> >
> > 0x350 [0x50]: failed to process type: 1
> >
> > problem when trying to run it natively on the aarch64 target.
> >
> > However, it doesn't matter whether I run the create_gcov - like so btw:
> >
> > ~/git/autofdo/create_gcov --binary=sort-O3 --profile=inj.data --gcov=sort-O3.gcov -gcov_version=1
> >
> > on the x86 host or the aarch64 target: I still get the same (negative
> > performance) results.
> >
> > As Sebastian asked, if I take the intel-pt sourced inject
> > generated .gcov onto the target and rebuild sort, the performance
> > improves:
> >
> > $ gcc -g -O3 -fauto-profile=../sort-O3.gcov ./sort.c -o ./sort-O3-autofdo
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5309 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5310 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 4443 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 4443 ms
> >
> > And if I take the ETM-generated gcov and use that to build a new x86_64
> > binary, it indeed performs worse on x86_64 also:
> >
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 1502 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 1500 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 1501 ms
> > $ taskset -c 2 ./sort-O3-autofdo-etmgcov
> > Bubble sorting array of 30000 elements
> > 1907 ms
> > $ taskset -c 2 ./sort-O3-autofdo-etmgcov
> > Bubble sorting array of 30000 elements
> > 1893 ms
> > $ taskset -c 2 ./sort-O3-autofdo-etmgcov
> > Bubble sorting array of 30000 elements
> > 1907 ms
> >
> > Kim
> > _______________________________________________
> > CoreSight mailing list
> > CoreSight at lists.linaro.org
> > https://lists.linaro.org/mailman/listinfo/coresight
>
>
>
> --
> Mike Leach
> Principal Engineer, ARM Ltd.
> Blackburn Design Centre. UK
<snip>
Adds a call to the decode library to activate the barrier packet
detection option.
Adds additional per-trace-source info to associate the CS trace ID with
the incoming stream and dump ID info.
Adds a compile time option to dump raw trace data and packed trace
frames for debugging trace issues.
Updates for v2:
Per: mpoirier...
1/3 Update comment to explain FSYNC 4x flag.
2/3 Change to use struct list_head as base of list for trace IDs.
Merge in change to "RESET DECODER" message from v1 3/3 patch.
3/3 Create init_raw func to combine conditionally compiled code into
single block.
Mike Leach (3):
perf: cs-etm: Active barrier packet option in decoder.
perf: cs-etm: Add channel context item to track packet sources.
perf: cs-etm: Add options to log raw trace data for debug.
tools/perf/Makefile.config | 6 ++
tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 122 +++++++++++++++++++++++-
2 files changed, 123 insertions(+), 5 deletions(-)
--
2.7.4
Adds a call to the decode library to activate the barrier packet
detection option.
Adds additional per-trace-source info to associate the CS trace ID with
the incoming stream and dump ID info.
Adds a compile time option to dump raw trace data and packed trace
frames for debugging trace issues.
Mike Leach (3):
perf: cs-etm: Active barrier packet option in decoder.
perf: cs-etm: Add channel context item to track packet sources.
perf: cs-etm: Add options to log raw trace data for debug.
tools/perf/Makefile.config | 6 ++
tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 108 ++++++++++++++++++++++--
2 files changed, 109 insertions(+), 5 deletions(-)
--
2.7.4