CoreSight

coresight@lists.linaro.org

Re: Using Coresight in SysFS mode on Juno board

by Mathieu Poirier

Good day Thierry,

On 29 June 2017 at 03:09, Thierry Laviron <Thierry.Laviron(a)arm.com> wrote:
> Hi Mathieu,
>
> I am currently trying to get trace data using the CoreSight system in SysFS
> mode on my Juno r2 board.
>
> I found some documentation on how to use it in the
> Documentation/trace/coresight.txt file of the perf-opencsd-4.11 branch of
> the OpenCSD repository.
>
> This document says that I can retrieve the trace data from /dev/ using dd,
> for example in my case that would be
>
> root@juno-debian:~# dd if=/dev/20070000.etr of=~/cstrace.bin
>
> However, I am assuming this produces a dump of the memory buffer as it was
> when I stopped trace collection,

That is correct.

> And that I do not have the full trace data generated (because it does not
> fit on the buffer).

Also correct. If there was a buffer overflow then you'll only get the latest trace data.

> I would like to be able to capture a continuous stream of data from the ETR,
> but did not find how should I do that.

Currently the only way to do that is to use coresight from the perf interface (see HOWTO.md on github).

> I am writing a C program. Can I open a read access to the ETR buffer like
> this?
>
> open("/dev/20070000.etr", O_RDONLY);

So simply have a read() or a select() blocking on the file descriptor, waiting for trace data to be produced and consuming it as it is generated?

> and then read its content, or pipe it somewhere else (e.g. to a file on the
> disc)?

Unfortunately no.

> If there is more relevant documentation on this that I have not found, I
> would appreciate if you could point me to it.
>
> If not, and what I am trying to do will not work, I would welcome some
> advice on how to do it properly.

You are raising an interesting scenario that hasn't occurred before. When operating from sysFS the problem is to program the tracers to reduce the amount of traces generated. Otherwise userspace can't possibly cope and you'd end up with buffer overflows. But let's assume you got that part covered, there is still a problem of when to move trace data from the ETR buffer (contiguous or SG list) to the buffer conveyed by read/select(). That is a tedious problem that currently doesn't have a solution.

As I said earlier this is a compelling use case. As such I am copying the coresight mailing list along with Mike and Suzuki. Someone might have some interest in working on this or some thoughts on how to address the issue. It's even better if you want to offer a solution - we'll be happy to provide help and support.

Thanks,
Mathieu

> Thanks in advance.
>
> Best regards,
>
> Thierry Laviron
>
> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose the
> contents to any other person, use it for any purpose, or store or copy the
> information in any medium. Thank you.
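
For reference, the one thing that does work from sysFS mode, the one-shot snapshot read that the dd command in the quoted text performs, can be sketched as below. This is only a minimal sketch in Python rather than C: the function name copy_trace_snapshot and the chunk size are illustrative choices, and the device path in the usage note is simply the one from the thread. It reads until end-of-file and therefore captures the buffer as it stood when tracing stopped; it does not provide the continuous streaming discussed above.

```python
import os


def copy_trace_snapshot(src_path, dst_path, chunk_size=4096):
    """Copy everything currently readable from src_path into dst_path.

    Returns the number of bytes copied. The read loop stops at end-of-file,
    so this is a one-shot snapshot, not a continuous stream.
    """
    total = 0
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        with open(dst_path, "wb") as dst:
            while True:
                chunk = os.read(src_fd, chunk_size)
                if not chunk:  # empty read means end-of-file
                    break
                dst.write(chunk)
                total += len(chunk)
    finally:
        os.close(src_fd)
    return total
```

On the Juno board from the thread this would presumably be called as copy_trace_snapshot("/dev/20070000.etr", "cstrace.bin") after trace collection has been stopped.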

8 years

RE: [PATCH 1/2] perf inject: correct recording of branch address and destination

by Michael Williams

On Fri, 26 May 2017 14:12:21 +0100 Mike Leach wrote:
> Hi,
>
> Tried out Sebastian's patches and got some similarities to Kim but a
> couple of differences and some interesting results if you look at the
> disassembly of the resulting routines.
>
> So as per the AutoFDO instructions I built a sort program with no
> optimisations and debug:
> gcc -g sort.c -o sort
> This I profiled on juno- with 3000 iterations
>
> The resulting disassembly of the bubble_sort routine is in
> bubble-sorts-disass.txt and the dump gcov profile is below...
> --------------------------------
> bubble_sort total:33987051 head:0
> 0: 0
> 1: 0
> 2: 2839
> 3: 2839
> 4: 2839
> 4.1: 8522673
> 4.2: 8519834
> 5: 8517035
> 6: 2104748
> 7: 2104748
> 8: 2104748
> 9: 2104748
> 13: 0
> -------------------------------
> So in my view - the swap lines (6:-9:) - see attached sort.c, are run
> less than the enclosing loop (2:-4:,4.1:-5:) - which is what Kim
> observed with the intel version.
> The synthesized LBR records looked reasonable from comparison with the
> disassembly too.
>
> Trying out the O3 and O3-autofdo from this profile resulted in O3
> running marginally faster, but both faster than unoptimised debug.
>
> So now look at the disassemblies from the -O3 and -autofdo-O3 versions
> of the sort routine [bubble-sorts-disass.txt again]. Both appear to
> define a bubble_sort routine, but embed the same / similar code into
> sort_array.
> Unsurprisingly the O3 version is considerably more compact - hence it
> runs faster. I have no idea what the autofdo version is up to, but
> I cannot see how the massive expansion of the routine with compare and
> jump tables is going to help.
>
> So perhaps:-
> 1) the LBR stacks are still not correct - though code and dump
> inspection might suggest otherwise - are there features in the intel
> LBR we are not yet synthesizing?
> 2) There is some adverse interaction with the profiles we are
> generating and the autofdo code generation.
> 3) The amount of coverage in the file is hitting the process - looking
> at gcov above then we only have trace from the bubble sort routine. I
> did reduce the number of iterations to get more of the program
> captured in coverage but this did not seem to make a difference.
>
> Mike

Apologies for the delay in replying to this. Some further thoughts on this.

1) This is not an apples-to-apples comparison. The baseline code will most likely have different optimizations applied for x86-64, which will give rise to different code paths and so different profiles. Also is someone here able to comment on to what extent the optimizations applied by the "autofdo-O3" compiler are machine independent? I assume that the work done to create that flow has been done on an x86 version of the compiler, and it might be that regressions exist in the A64 compiler that do not exist in x86: I don't know.

For example, the unrolling done for the sort.c example might not be a suitable optimization for the target CPU.

This isn't a real-world code example. Bubble sort is sorting random data, so at its heart is an unpredictable compare-and-swap check, and a small inner-loop. The unrolled code, on the other hand, contains many unpredictable branches. It would be better to reproduce this experiment, if not on real-world code then at least on a more sensible benchmark.

2) AIUI, "perf inject --itrace" on the ETM uses systematic block-based sampling to break the trace into LBR records. (That is, after N trace block records it creates a sample with an LBR attached, where a trace block represents a sequence of instructions between two waypoints.) E.g. "perf inject --itrace=il64"

Conversely, also AIUI, the reference method for doing this with Intel PT samples based on a reconstructed view of time. (That is, every N reconstructed clock periods, it creates a sample with an LBR attached.) E.g. "perf inject --itrace=i100usle".

Time-based sampling will generate more samples from code hot spots, where a hot spot is defined as where *time* is spent in the program. The ETM flow will also favour hot spots, obviously, because these will appear more in the trace. However, because the sampling is not time-based, each *range* is as likely to be sampled as any other range.

E.g. if there is a short code sequence that executes in 10 clock periods and a long sequence that executes in 100 clock periods, and both appear equally often in the code, then using time-based sampling the former will appear 10x less often than the latter, but using systematic block-based sampling they appear at the same rate.

Furthermore, from a cursory look at the Intel PT code, it looks to me like the Intel PT perf driver walks through each block, instruction by instruction. If I understand this correctly, then that means that even if sampling were systematic and instruction-based rather than time-based (e.g. would "--itrace=i64i" do this on PT?), then the population for sampling is instructions rather than blocks, and again won't match what cs-etm.c is doing.

E.g. if the short code sequence is 10 instructions and the long sequence is 100 instructions, then with systematic instruction-based sampling the former block will appear 10x less often in the samples, whereas with systematic block-based sampling, they appear at the same rate.

One could hack the Intel PT inject tool to implement the same kind of block-based sampling, and see what effect this has (assuming there is a good reason why the ETM inject doesn't implement the time-based sampling -- I've not investigated this). If you have such a sample you can also use the profile_diff tool from AutoFDO to compare the shape of the samples.

Now, the extent to which this affects the compiler I do not know. E.g. both sampling schemes are OK for telling a compiler which branches are taken, but if the compiler thinks the samples are time-based and so represent code hotspots, then systematic block-based sampling would be misleading.

Mike.

> On 25 May 2017 at 05:12, Kim Phillips <kim.phillips at arm.com> wrote:
> > On Wed, 24 May 2017 12:48:04 -0500
> > Sebastian Pop <sebpop at gmail.com> wrote:
> >
> >> On Wed, May 24, 2017 at 11:36 AM, Mathieu Poirier
> >> <mathieu.poirier at linaro.org> wrote:
> >> > Are the instructions in the autoFDO section of the HOWTO.md on GitHub sufficient
> >> > to test this or there is another way?
> >>
> >> Here is how I tested it: (supposing that perf.data contains an ETM trace)
> >>
> >> # perf inject -i perf.data -o inj --itrace=il64 --strip
> >> # perf report -i inj -D &> dump
> >>
> >> and I inspected the addresses from the last branch stack in the output dump
> >> with the addresses of the disassembled program from:
> >>
> >> # objdump -d sort
> >
> > Re-running the AutoFDO process with these two patches continue to make
> > the resultant executable perform worse, however:
> >
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5306 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5304 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 5851 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 5889 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 5888 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5318 ms
> >
> > The gcov file generated from the inj.data (no matter whether it's
> > --itrace=il64 or --itrace=i100usle) still looks wrong:
> >
> > $ ~/git/autofdo/dump_gcov -gcov_version=1 sort-O3.gcov
> > sort_array total:19309128 head:0
> > 0: 0
> > 1: 0
> > 5: 0
> > 6: 0
> > 7.1: 0
> > 7.3: 0
> > 8.3: 0
> > 15: 2
> > 16: 2
> > 17: 2
> > 10: start total:0
> > 1: 0
> > 11: bubble_sort total:19309119
> > 2: 1566
> > 4: 6266668
> > 5: 6071341
> > 7: 6266668
> > 9: 702876
> > 12: stop total:3
> > 2: 0
> > 3: 1
> > 4: 1
> > 5: 1
> > main total:1 head:0
> > 0: 0
> > 2: 0
> > 4: 1
> > 1: cmd_line total:0
> > 3: 0
> > 4: 0
> > 5: 0
> > 6: 0
> >
> > Whereas the one generated by intel-pt run looks correct, showing the
> > swap (11: bubble_sort 7,8) as executed less times:
> >
> > kim at juno sort-etm$ ~/git/autofdo/dump_gcov -gcov_version=1 ../sort-O3.gcov
> > sort_array total:105658 head:0
> > 0: 0
> > 5: 0
> > 6: 0
> > 7.1: 0
> > 7.3: 0
> > 8.3: 0
> > 16: 0
> > 17: 0
> > 1: printf total:0
> > 2: 0
> > 10: start total:0
> > 1: 0
> > 11: bubble_sort total:105658
> > 2: 14
> > 4: 28740
> > 5: 28628
> > 7: 9768
> > 8: 9768
> > 9: 28740
> > 12: stop total:0
> > 2: 0
> > 3: 0
> > 4: 0
> > 5: printf total:0
> > 2: 0
> > 15: printf total:0
> > 2: 0
> >
> > I have to run the 'perf inject' on the x86 host because of the
> > aforementioned:
> >
> > 0x350 [0x50]: failed to process type: 1
> >
> > problem when trying to run it natively on the aarch64 target.
> >
> > However, it doesn't matter whether I run the create_gcov - like so btw:
> >
> > ~/git/autofdo/create_gcov --binary=sort-O3 --profile=inj.data --gcov=sort-O3.gcov -gcov_version=1
> >
> > on the x86 host or the aarch64 target: I still get the same (negative
> > performance) results.
> >
> > As Sebastian asked, if I take the intel-pt sourced inject
> > generated .gcov onto the target and rebuild sort, the performance
> > improves:
> >
> > $ gcc -g -O3 -fauto-profile=../sort-O3.gcov ./sort.c -o ./sort-O3-autofdo
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5309 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 5310 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 4443 ms
> > $ taskset -c 2 ./sort-O3-autofdo
> > Bubble sorting array of 30000 elements
> > 4443 ms
> >
> > And if I take the ETM-generated gcov and use that to build a new x86_64
> > binary, it indeed performs worse on x86_64 also:
> >
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 1502 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 1500 ms
> > $ taskset -c 2 ./sort-O3
> > Bubble sorting array of 30000 elements
> > 1501 ms
> > $ taskset -c 2 ./sort-O3-autofdo-etmgcov
> > Bubble sorting array of 30000 elements
> > 1907 ms
> > $ taskset -c 2 ./sort-O3-autofdo-etmgcov
> > Bubble sorting array of 30000 elements
> > 1893 ms
> > $ taskset -c 2 ./sort-O3-autofdo-etmgcov
> > Bubble sorting array of 30000 elements
> > 1907 ms
> >
> > Kim
> > _______________________________________________
> > CoreSight mailing list
> > CoreSight at lists.linaro.org
> > https://lists.linaro.org/mailman/listinfo/coresight
>
> --
> Mike Leach
> Principal Engineer, ARM Ltd.
> Blackburn Design Centre. UK

<snip>
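
The difference between the two sampling schemes described in point 2) of the reply above can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is not what perf inject, cs-etm.c or the Intel PT decoder actually implement, and the function names, the (name, cost) trace representation and the sampling periods are all made up for illustration. The "cost" of a block stands for either clock periods or instructions, matching the two examples in the mail.

```python
from collections import Counter


def block_based_samples(trace, period):
    """Systematic block-based sampling: take every `period`-th block record,
    regardless of how long the block takes to execute."""
    counts = Counter()
    for index, (name, _cost) in enumerate(trace, start=1):
        if index % period == 0:
            counts[name] += 1
    return counts


def cost_based_samples(trace, period):
    """Cost-based sampling: take one sample every `period` cost units
    (clock periods or instructions) and attribute it to the block that
    was executing when the sample point was reached."""
    counts = Counter()
    elapsed = 0
    next_sample = period
    for name, cost in trace:
        elapsed += cost
        while next_sample <= elapsed:
            counts[name] += 1
            next_sample += period
    return counts


if __name__ == "__main__":
    # A short block (cost 10) and a long block (cost 100) that execute
    # equally often, as in the example from the mail.
    trace = [("short", 10), ("long", 100)] * 700
    # Period 7 is chosen so that it does not line up with the repeating
    # pattern of the toy trace (an assumption made to avoid aliasing).
    print("block-based:", dict(block_based_samples(trace, 7)))
    print("cost-based: ", dict(cost_based_samples(trace, 7)))
```

With this toy trace the block-based scheme yields 100 samples for each block, while the cost-based scheme yields 1000 samples for the short block and 10000 for the long one, which is the 10x skew described in the mail.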

8 years

[PATCH v2 0/3] perf: adds barrier FSYNC packet + raw trace dump

by Mike Leach

Adds in call to decode library to activate the barrier packet detection option.
Adds in additional per trace source info to associate CS trace ID with incoming stream and dump ID info.
Adds in compile time option to dump raw trace data and packed trace frames for debugging trace issues.

Updates for v2:
Per: mpoirier...
1/3 Update comment to explain FSYNC 4x flag.
2/3 Change to use struct list_head as base of list for trace IDs. Merge in change to "RESET DECODER" message from v1 3/3 patch.
3/3 Create init_raw func to combine conditionally compiled code into single block.

Mike Leach (3):
  perf: cs-etm: Active barrier packet option in decoder.
  perf: cs-etm: Add channel context item to track packet sources.
  perf: cs-etm: Add options to log raw trace data for debug.

 tools/perf/Makefile.config | 6 ++
 tools/perf/util/cs-etm-decoder/cs-etm-decoder.c | 122 +++++++++++++++++++++++-
 2 files changed, 123 insertions(+), 5 deletions(-)

--
2.7.4

8 years

[PATCH v2 0/9] Adds trace return stack handling to decoder library.

by Mike Leach

ETMv4 and PTM trace have a runtime option called "return stack". This is where the hardware saves potential return addresses from Branch+Link instructions onto a stack and uses the top of stack value if it matches for indirect branch instructions, rather than explicitly trace the branch value. This reduces the number of trace bytes in the stream.

If this option is active then the decoder must mirror the return stack from BL instructions and use it when it would ordinarily expect an explicitly traced address on an indirect branch instruction.

V2 changes:
This set applies onto 0.6.1 of the master branch which contains a bugfix for the instruction follower branching from A32 to T32.
Tests on mixed T32/A32 code showed an issue with the return stack push/pop not saving ISA state. This is now added.

Mike Leach (9):
  opencsd: Add log message call to trace component base class.
  opencsd: new error type for return stack handling.
  opencsd: Add return stack object to library
  opencsd: Add return stack handling to PTM decoder
  opencsd: Add test trace snapshot for PTM return stack.
  opencsd: ETMv4 decoder: Implement trace return stack handling.
  opencsd: ETMv4 - juno add test snapshot for return stack testing
  opencsd: PTM test snapshot for return stack with A32 and T32 mixed code.
  opencsd: Update README and versions for v0.7

 README.md | 15 +-
 decoder/build/linux/ref_trace_decode_lib/makefile | 1 +
 .../ref_trace_decode_lib.vcxproj | 2 +
 .../ref_trace_decode_lib.vcxproj.filters | 6 +
 decoder/include/common/trc_component.h | 8 +
 decoder/include/common/trc_ret_stack.h | 120 +
 decoder/include/etmv4/trc_pkt_decode_etmv4i.h | 5 +
 decoder/include/ocsd_if_types.h | 1 +
 decoder/include/ocsd_if_version.h | 6 +-
 decoder/include/ptm/trc_pkt_decode_ptm.h | 3 +
 decoder/source/etmv4/trc_pkt_decode_etmv4i.cpp | 66 +-
 decoder/source/ocsd_error.cpp | 1 +
 decoder/source/ptm/trc_pkt_decode_ptm.cpp | 51 +-
 decoder/source/trc_ret_stack.cpp | 122 +
 decoder/tests/snapshots/juno-ret-stck/cpu_0.ini | 16 +
 decoder/tests/snapshots/juno-ret-stck/cpu_1.ini | 16 +
 decoder/tests/snapshots/juno-ret-stck/cpu_2.ini | 16 +
 decoder/tests/snapshots/juno-ret-stck/cpu_3.ini | 16 +
 decoder/tests/snapshots/juno-ret-stck/cpu_4.ini | 16 +
 decoder/tests/snapshots/juno-ret-stck/cpu_5.ini | 16 +
 decoder/tests/snapshots/juno-ret-stck/cstrace.bin | Bin 0 -> 65536 bytes
 .../tests/snapshots/juno-ret-stck/device_10.ini | 18 +
 .../tests/snapshots/juno-ret-stck/device_11.ini | 18 +
 decoder/tests/snapshots/juno-ret-stck/device_6.ini | 18 +
 decoder/tests/snapshots/juno-ret-stck/device_7.ini | 18 +
 decoder/tests/snapshots/juno-ret-stck/device_8.ini | 18 +
 decoder/tests/snapshots/juno-ret-stck/device_9.ini | 18 +
 .../tests/snapshots/juno-ret-stck/kernel_dump.bin | Bin 0 -> 7340032 bytes
 decoder/tests/snapshots/juno-ret-stck/snapshot.ini | 20 +
 decoder/tests/snapshots/juno-ret-stck/trace.ini | 23 +
 .../tests/snapshots/tc2-ptm-rstk-t32/PTM_0_2.bin | Bin 0 -> 27884 bytes
 .../tests/snapshots/tc2-ptm-rstk-t32/README.txt | 1 +
 .../tests/snapshots/tc2-ptm-rstk-t32/device1.ini | 357 +
 .../tests/snapshots/tc2-ptm-rstk-t32/device2.ini | 129 +
 .../tests/snapshots/tc2-ptm-rstk-t32/device3.ini | 129 +
 .../tests/snapshots/tc2-ptm-rstk-t32/device4.ini | 129 +
 .../tests/snapshots/tc2-ptm-rstk-t32/device5.ini | 89 +
 .../tests/snapshots/tc2-ptm-rstk-t32/device6.ini | 89 +
 .../tc2-ptm-rstk-t32/ds-5_trace_dump/a15_rs.txt | 10005 +++++++++++++++++++
 .../mem_Cortex-A15_0_0_VECTORS.bin | Bin 0 -> 632 bytes
 .../mem_Cortex-A15_0_1_RO_CODE.bin | Bin 0 -> 6576 bytes
 .../mem_Cortex-A15_0_2_RO_DATA.bin | Bin 0 -> 304 bytes
 .../mem_Cortex-A15_0_3_RW_DATA.bin | Bin 0 -> 16 bytes
 .../mem_Cortex-A15_0_4_ZI_DATA.bin | Bin 0 -> 576 bytes
 .../mem_Cortex-A15_0_5_ARM_LIB_HEAP.bin | Bin 0 -> 262144 bytes
 .../mem_Cortex-A15_0_6_ARM_LIB_STACK.bin | Bin 0 -> 65536 bytes
 .../mem_Cortex-A15_0_7_IRQ_STACK.bin | Bin 0 -> 65536 bytes
 .../tc2-ptm-rstk-t32/mem_Cortex-A15_0_8_TTB.bin | Bin 0 -> 16384 bytes
 .../tests/snapshots/tc2-ptm-rstk-t32/snapshot.ini | 16 +
 decoder/tests/snapshots/tc2-ptm-rstk-t32/trace.ini | 24 +
 decoder/tests/snapshots/trace_cov_a15/PTM_0_2.bin | Bin 0 -> 36 bytes
 decoder/tests/snapshots/trace_cov_a15/README.txt | 1 +
 decoder/tests/snapshots/trace_cov_a15/device1.ini | 357 +
 decoder/tests/snapshots/trace_cov_a15/device2.ini | 129 +
 decoder/tests/snapshots/trace_cov_a15/device3.ini | 129 +
 decoder/tests/snapshots/trace_cov_a15/device4.ini | 129 +
 decoder/tests/snapshots/trace_cov_a15/device5.ini | 89 +
 decoder/tests/snapshots/trace_cov_a15/device6.ini | 89 +
 .../trace_cov_a15/mem_Cortex-A15_0_0_VECTORS.bin | Bin 0 -> 632 bytes
 .../trace_cov_a15/mem_Cortex-A15_0_1_RO_CODE.bin | Bin 0 -> 6576 bytes
 .../trace_cov_a15/mem_Cortex-A15_0_2_RO_DATA.bin | Bin 0 -> 304 bytes
 .../trace_cov_a15/mem_Cortex-A15_0_3_RW_DATA.bin | Bin 0 -> 16 bytes
 .../trace_cov_a15/mem_Cortex-A15_0_4_ZI_DATA.bin | Bin 0 -> 576 bytes
 .../mem_Cortex-A15_0_5_ARM_LIB_HEAP.bin | Bin 0 -> 262144 bytes
 .../mem_Cortex-A15_0_6_ARM_LIB_STACK.bin | Bin 0 -> 65536 bytes
 .../trace_cov_a15/mem_Cortex-A15_0_7_IRQ_STACK.bin | Bin 0 -> 65536 bytes
 .../trace_cov_a15/mem_Cortex-A15_0_8_TTB.bin | Bin 0 -> 16384 bytes
 decoder/tests/snapshots/trace_cov_a15/snapshot.ini | 16 +
 decoder/tests/snapshots/trace_cov_a15/trace.ini | 24 +
 69 files changed, 12563 insertions(+), 22 deletions(-)
 create mode 100644 decoder/include/common/trc_ret_stack.h
 create mode 100644 decoder/source/trc_ret_stack.cpp
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/cpu_0.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/cpu_1.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/cpu_2.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/cpu_3.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/cpu_4.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/cpu_5.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/cstrace.bin
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/device_10.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/device_11.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/device_6.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/device_7.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/device_8.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/device_9.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/kernel_dump.bin
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/snapshot.ini
 create mode 100644 decoder/tests/snapshots/juno-ret-stck/trace.ini
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/PTM_0_2.bin
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/README.txt
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/device1.ini
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/device2.ini
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/device3.ini
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/device4.ini
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/device5.ini
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/device6.ini
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/ds-5_trace_dump/a15_rs.txt
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/mem_Cortex-A15_0_0_VECTORS.bin
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/mem_Cortex-A15_0_1_RO_CODE.bin
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/mem_Cortex-A15_0_2_RO_DATA.bin
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/mem_Cortex-A15_0_3_RW_DATA.bin
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/mem_Cortex-A15_0_4_ZI_DATA.bin
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/mem_Cortex-A15_0_5_ARM_LIB_HEAP.bin
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/mem_Cortex-A15_0_6_ARM_LIB_STACK.bin
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/mem_Cortex-A15_0_7_IRQ_STACK.bin
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/mem_Cortex-A15_0_8_TTB.bin
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/snapshot.ini
 create mode 100644 decoder/tests/snapshots/tc2-ptm-rstk-t32/trace.ini
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/PTM_0_2.bin
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/README.txt
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/device1.ini
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/device2.ini
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/device3.ini
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/device4.ini
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/device5.ini
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/device6.ini
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/mem_Cortex-A15_0_0_VECTORS.bin
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/mem_Cortex-A15_0_1_RO_CODE.bin
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/mem_Cortex-A15_0_2_RO_DATA.bin
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/mem_Cortex-A15_0_3_RW_DATA.bin
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/mem_Cortex-A15_0_4_ZI_DATA.bin
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/mem_Cortex-A15_0_5_ARM_LIB_HEAP.bin
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/mem_Cortex-A15_0_6_ARM_LIB_STACK.bin
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/mem_Cortex-A15_0_7_IRQ_STACK.bin
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/mem_Cortex-A15_0_8_TTB.bin
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/snapshot.ini
 create mode 100644 decoder/tests/snapshots/trace_cov_a15/trace.ini

--
2.7.4
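
The mirroring described in the cover letter above can be sketched as a toy model. This is a hedged illustration only, not the code in trc_ret_stack.h or trc_ret_stack.cpp: the class and function names, the stack depth, the overflow policy (discard the oldest entry) and the choice to leave the stack untouched when an address was traced explicitly are all assumptions made for the sketch. What it does take from the text is that each entry carries the ISA state alongside the return address, which is the issue the V2 changes mention.

```python
class ReturnStack:
    """Toy decoder-side mirror of a hardware trace return stack."""

    def __init__(self, depth=3):
        # depth is an illustrative assumption, not taken from the patch set
        self.depth = depth
        self.entries = []  # oldest entry first

    def push(self, return_addr, isa):
        """Called when the decoder follows a Branch+Link instruction."""
        if len(self.entries) == self.depth:
            # Assumed overflow policy: drop the oldest entry.
            self.entries.pop(0)
        self.entries.append((return_addr, isa))

    def pop(self):
        """Return the newest (address, isa) pair, or None if empty."""
        return self.entries.pop() if self.entries else None


def resolve_indirect_branch(stack, traced_target=None):
    """Pick the destination of an indirect branch.

    traced_target is an (address, isa) pair when the trace stream carried
    the address explicitly; otherwise the mirrored stack supplies it.
    """
    if traced_target is not None:
        # Assumption for this sketch: an explicit address leaves the
        # mirrored stack untouched.
        return traced_target
    entry = stack.pop()
    if entry is None:
        raise ValueError("no traced address and the return stack is empty")
    return entry
```

For example, after push(0x1004, "A32") and push(0x2002, "T32"), two indirect branches without an explicitly traced address resolve to (0x2002, "T32") and then (0x1004, "A32"), so a return from T32 code into A32 code restores the right ISA.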

8 years

[PATCH 0/8] Adds trace return stack handling to decoder library.

by Mike Leach

8 years

[PATCH 0/3] perf: adds barrier FSYNC packet + raw trace dump

by Mike Leach

8 years

[PATCH v2 0/6] coresight: barrier packet

by Mathieu Poirier

Hi Mike,

Here is the second patchset that adds barrier packets to traces collected from ETB and ETR devices. It applies cleanly on top of perf-opencsd-master (4.12-rc1).

Let me know how those work out for you.

Thanks,
Mathieu

Mathieu Poirier (6):
  coresight: Correct buffer lost increment
  coresight: etf: Add barrier packet for synchronisation
  coresight: etb10: Remove useless conversion to LE
  coresight: etb10: Add barrier packet for synchronisation
  coresight: etr: Correct buffer lost increment
  coresight: etr: Add barrier packet for synchronisation

 drivers/hwtracing/coresight/coresight-etb10.c | 36 ++++++++++++++-----------
 drivers/hwtracing/coresight/coresight-priv.h | 2 ++
 drivers/hwtracing/coresight/coresight-tmc-etf.c | 15 +++++++++--
 drivers/hwtracing/coresight/coresight-tmc-etr.c | 15 +++++++++--
 drivers/hwtracing/coresight/coresight.c | 8 ++++++
 5 files changed, 57 insertions(+), 19 deletions(-)

--
2.7.4

8 years

[PATCH 00/11] OpenCSD v0.6 - barrier packet and debug updates

by Mike Leach

This patch set provides the following additional features:-

i) Support for the FSYNC barrier packets inserted by perf.
ii) Moves the packet printers from the test code into the main library to allow use by client programs. Additional APIs provided to create and use these packet printers.

Plus additional minor fixes and docs changes.

Mathieu Poirier (1):
  opencsd: update content to work with any kernel version

Mike Leach (10):
  opencsd: remove deprecated C-API functions
  opencsd: Allow FSYNCs to be used as special frames.
  opencsd: Update message logger for string print callback.
  opencsd: Move packet printer classes from test code to main library.
  opencsd: Update C-API to for addtional features.
  opencsd: tests: Add command line print to test log
  opencsd: etmv4: Minor print string mod for trace info packet.
  opencsd: Update error logger interface.
  opencsd: Add packet printer API to decode tree and C-API
  opencsd: update README, HOWTO and versions for v0.6

 HOWTO.md | 43 +-
 README.md | 4 +-
 TODO | 16 +-
 decoder/build/linux/makefile | 2 -
 decoder/build/linux/rctdl_c_api_lib/makefile | 1 -
 decoder/build/linux/ref_trace_decode_lib/makefile | 10 +-
 .../rctdl_c_api_lib/rctdl_c_api_lib.vcxproj | 2 -
 .../rctdl_c_api_lib.vcxproj.filters | 6 -
 .../ref_trace_decode_lib/ref_trace_decode_lib.sln | 21 -
 .../ref_trace_decode_lib.vcxproj | 8 +
 .../ref_trace_decode_lib.vcxproj.filters | 30 +
 decoder/include/c_api/ocsd_c_api_deprc_fn.h | 233 ------
 decoder/include/c_api/ocsd_c_api_types.h | 2 +
 decoder/include/c_api/opencsd_c_api.h | 59 +-
 decoder/include/common/ocsd_dcd_tree.h | 18 +-
 decoder/include/common/ocsd_error_logger.h | 4 +-
 decoder/include/common/ocsd_msg_logger.h | 18 +-
 decoder/include/common/trc_frame_deformatter.h | 1 +
 decoder/include/common/trc_pkt_proc_base.h | 10 +-
 decoder/include/interfaces/trc_error_log_i.h | 3 +
 decoder/include/ocsd_if_types.h | 3 +-
 decoder/include/ocsd_if_version.h | 6 +-
 decoder/include/opencsd.h | 4 +
 decoder/include/pkt_printers/gen_elem_printer.h | 95 +++
 decoder/include/pkt_printers/item_printer.h | 94 +++
 decoder/include/pkt_printers/pkt_printer_t.h | 189 +++++
 decoder/include/pkt_printers/raw_frame_printer.h | 69 ++
 decoder/include/pkt_printers/trc_pkt_printers.h | 43 +
 decoder/include/pkt_printers/trc_print_fact.h | 60 ++
 decoder/source/c_api/ocsd_c_api.cpp | 48 ++
 decoder/source/c_api/ocsd_c_api_deprc_fn.cpp | 200 -----
 decoder/source/c_api/ocsd_c_api_obj.h | 35 +-
 decoder/source/etmv4/trc_pkt_elem_etmv4i.cpp | 2 +-
 decoder/source/ocsd_dcd_tree.cpp | 111 ++-
 decoder/source/ocsd_msg_logger.cpp | 34 +-
 decoder/source/pkt_printers/raw_frame_printer.cpp | 104 +++
 decoder/source/pkt_printers/trc_print_fact.cpp | 123 +++
 decoder/source/trc_frame_deformatter.cpp | 108 ++-
 decoder/source/trc_frame_deformatter_impl.h | 3 +-
 .../build/linux/simple_pkt_print_c_api/makefile | 82 --
 decoder/tests/build/linux/trc_pkt_lister/makefile | 3 +-
 .../simple_pkt_print_c_api.vcxproj | 333 --------
 .../simple_pkt_print_c_api.vcxproj.filters | 22 -
 .../trc_pkt_lister/trc_pkt_lister.vcxproj | 6 +-
 .../trc_pkt_lister/trc_pkt_lister.vcxproj.filters | 14 +-
 decoder/tests/source/c_api_pkt_print_test.c | 100 ++-
 decoder/tests/source/gen_elem_printer.h | 96 ---
 decoder/tests/source/item_printer.h | 94 ---
 decoder/tests/source/pkt_printer_t.h | 188 -----
 decoder/tests/source/raw_frame_printer.cpp | 96 ---
 decoder/tests/source/raw_frame_printer.h | 71 --
 decoder/tests/source/simple_pkt_c_api.c | 923 ---------------------
 decoder/tests/source/trc_pkt_lister.cpp | 179 +---
 53 files changed, 1413 insertions(+), 2616 deletions(-)
 delete mode 100644 decoder/include/c_api/ocsd_c_api_deprc_fn.h
 create mode 100644 decoder/include/pkt_printers/gen_elem_printer.h
 create mode 100644 decoder/include/pkt_printers/item_printer.h
 create mode 100644 decoder/include/pkt_printers/pkt_printer_t.h
 create mode 100644 decoder/include/pkt_printers/raw_frame_printer.h
 create mode 100644 decoder/include/pkt_printers/trc_pkt_printers.h
 create mode 100644 decoder/include/pkt_printers/trc_print_fact.h
 delete mode 100644 decoder/source/c_api/ocsd_c_api_deprc_fn.cpp
 create mode 100644 decoder/source/pkt_printers/raw_frame_printer.cpp
 create mode 100644 decoder/source/pkt_printers/trc_print_fact.cpp
 delete mode 100644 decoder/tests/build/linux/simple_pkt_print_c_api/makefile
 delete mode 100644 decoder/tests/build/win-vs2015/simple_pkt_print_c_api/simple_pkt_print_c_api.vcxproj
 delete mode 100644 decoder/tests/build/win-vs2015/simple_pkt_print_c_api/simple_pkt_print_c_api.vcxproj.filters
 delete mode 100644 decoder/tests/source/gen_elem_printer.h
 delete mode 100644 decoder/tests/source/item_printer.h
 delete mode 100644 decoder/tests/source/pkt_printer_t.h
 delete mode 100644 decoder/tests/source/raw_frame_printer.cpp
 delete mode 100644 decoder/tests/source/raw_frame_printer.h
 delete mode 100644 decoder/tests/source/simple_pkt_c_api.c

--
2.7.4

8 years

enabling coresight etm on hikey

by Sebastian Pop

Hi,

we started looking at how to enable collection of branch traces with coresight etm on the hikey boards that are the reference platform for the android linux-4.9 work.

Does somebody from Linaro have access to the description of where the coresight components are located for the hikey devices?

We would appreciate help on enabling linux-perf collection of traces on the hikey.

Thanks,
Sebastian

8 years, 1 month

Re: [PATCH 2/2] perf tools: new inject capabilitity for CoreSight traces

by Kim Phillips

[re-adding cc list, assuming didn't hit reply-all?]

On Tue, 14 Mar 2017 19:12:08 +0000
Mike Leach <mike.leach(a)linaro.org> wrote:

> On 14 March 2017 at 16:10, Kim Phillips <kim.phillips(a)arm.com> wrote:
>
> > still results in:
> >
> > util/cs-etm.c:1466:27: error: ‘cs_etm_global_header_fmts’ defined but not
> > used [-Werror=unused-const-variable=]
> >  static const char * const cs_etm_global_header_fmts[] = {
> >                            ^~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > What toolchain and/or version are you using?
>
> I'm using a linaro build of gcc 4.9.
>
> aarch64-linux-gnu-gcc (Linaro GCC 4.9-2015.05) 4.9.3 20150413 (prerelease)

Linaro appear to have removed that release from their repo:

http://releases.linaro.org/components/toolchain/binaries/

So I used this one - which AFAICT is the closest to your version - to
cross-build both the kernel with your config, and perf:

aarch64-linux-gnu-gcc (Linaro GCC 4.9-2016.02) 4.9.4 20151028 (prerelease)

Building perf didn't require the patch I sent due to an old gcc bug
that apparently finally got fixed recently:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28901

I then ran the resulting kernel (which still had the extra patch, so
commit 0d15341 == c50837 + the patch), and perf:

root@juno:~# ./perf --version
perf version 4.11.rc1.gc50837
root@juno:~# strings -a perf | grep "GCC: ("
GCC: (Linaro GCC 4.9-2016.02) 4.9.4 20151028 (prerelease)
root@juno:~# dmesg | grep gcc
[ 0.000000] Linux version 4.11.0-rc1-g0d15341 (kim@dupont) (gcc version 4.9.4 20151028 (prerelease) (Linaro GCC 4.9-2016.02) ) #3 SMP PREEMPT Tue Mar 14 22:41:34 CDT 2017
root@juno:~# taskset -c 2 ./perf record -e cs_etm/@20070000.etr/u --per-thread taskset -c 2 uname
[ 870.355660] coresight-replicator-qcom 20120000.replicator: REPLICATOR enabled
[ 870.362736] coresight-funnel 20150000.funnel: FUNNEL inport 0 enabled
[ 870.369127] coresight-tmc 20010000.etf: TMC-ETF enabled
[ 870.374304] coresight-funnel 20040000.funnel: FUNNEL inport 0 enabled
[ 870.380698] coresight-funnel 220c0000.funnel: FUNNEL inport 1 enabled
[ 870.387858] coresight-funnel 220c0000.funnel: FUNNEL inport 1 disabled
[ 870.394325] coresight-funnel 20040000.funnel: FUNNEL inport 0 disabled
[ 870.400806] coresight-tmc 20010000.etf: TMC disabled
[ 870.405722] coresight-funnel 20150000.funnel: FUNNEL inport 0 disabled
[ 870.412184] coresight-replicator-qcom 20120000.replicator: REPLICATOR disabled
[ 870.419350] coresight-tmc 20070000.etr: TMC-ETR disabled
[ 870.425083] coresight-replicator-qcom 20120000.replicator: REPLICATOR enabled
[ 870.432153] coresight-funnel 20150000.funnel: FUNNEL inport 0 enabled
[ 870.438542] coresight-tmc 20010000.etf: TMC-ETF enabled
[ 870.443718] coresight-funnel 20040000.funnel: FUNNEL inport 0 enabled
[ 870.450112] coresight-funnel 220c0000.funnel: FUNNEL inport 1 enabled
Linux
[ 870.476156] coresight-funnel 220c0000.funnel: FUNNEL inport 1 disabled
[ 870.482625] coresight-funnel 20040000.funnel: FUNNEL inport 0 disabled
[ 870.489106] coresight-tmc 20010000.etf: TMC disabled
[ 870.494023] coresight-funnel 20150000.funnel: FUNNEL inport 0 disabled
[ 870.500485] coresight-replicator-qcom 20120000.replicator: REPLICATOR disabled
[ 870.507651] coresight-tmc 20070000.etr: TMC-ETR disabled
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.015 MB perf.data ]

which appears to have worked the first time.

Then I tried a second time - literally up-arrow, enter - and got the
same exact hard hang as I get with the modern compilers the first time:

root@juno:~# taskset -c 2 ./perf record -e cs_etm/@20070000.etr/u --per-thread taskset -c 2 uname
[ 1965.355162] coresight-replicator-qcom 20120000.replicator: REPLICATOR enabled
[ 1965.362238] coresight-funnel 20150000.funnel: FUNNEL inport 0 enabled
[ 1965.368629] coresight-tmc 20010000.etf: TMC-ETF enabled
[ 1965.373807] coresight-funnel 20040000.funnel: FUNNEL inport 0 enabled
[ 1965.380201] coresight-funnel 220c0000.funnel: FUNNEL inport 1 enabled
Linux
[ 1965.405984] coresight-funnel 220c0000.funnel: FUNNEL inport 1 disabled
[ 1965.412453] coresight-funnel 20040000.funnel: FUNNEL inport 0 disabled
[ 1965.418934] coresight-tmc 20010000.etf: TMC disabled
[ 1965.423850] coresight-funnel 20150000.funnel: FUNNEL inport 0 disabled
[ 1965.430312] coresight-replicator-qcom 20120000.replicator: REPLICATOR disabled
[ 1965.437478] coresight-tmc 20070000.etr: TMC-ETR disabled
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote

which is along the same instability lines as the other time that
execution behaviour differed depending on whether it executed first or
not...

> > > > $ sudo taskset -c 2 ./perf record -e cs_etm/@20070000.etr/u --per-thread taskset -c 2 uname
> > > > failed to mmap with 12 (Cannot allocate memory)
> > >
> > > That said I get this too if I do enable sinks. However as you say after the
> > > initial attempt the problem disappears.
> >
> > Right, at least we have one problem reproduced on both sides now.

...here.

> > It hung before completing that last sentence.
> >
> > Perhaps the bug is exacerbated by toolchain and/or host and target
> > rootfs distribution differences? My juno target runs debian, and I
> > recently upgraded to stretch in order to get a version of gcc that
> > would support autoFDO.
>
> My juno has a debian-jessie-developer root-fs from
> http://releases.linaro.org/debian/images/developer-arm64/16.04/
> At present I cannot build autofdo - I still have some package issues, but
> for perf and the kernel I am x-compiling on my ubuntu VM anyway, so the
> installed compilers have no effect on the perf runs.

OK, can you try a more modern toolchain please? The one you're using
isn't available anymore, AutoFDO requires gcc 5 and higher, and you're
not seeing the build failures others see, but most importantly, it
should make it easier to reproduce the hard lockup: at least that's the
case on the Juno r2.

Thanks,

Kim
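
The reproduction described in this message (the same perf command run twice in a row, with the second run hard-hanging the board) could be scripted roughly as below. This is only a sketch, assuming a POSIX shell on the target; the function name repeat_logged and the log path in the usage note are hypothetical and are not part of perf or the kernel tree. It writes a marker and calls sync before each run, so the last attempted iteration should still be on disk after a hard lockup.

```shell
#!/bin/sh
# Hypothetical helper (not part of perf): run one command several times
# and record progress on disk before each attempt.
# Usage: repeat_logged RUNS LOGFILE COMMAND [ARGS...]
repeat_logged() {
    runs=$1
    log=$2
    shift 2
    i=1
    while [ "$i" -le "$runs" ]; do
        # Marker written and flushed first, so it survives a hard hang.
        echo "run $i: starting" >> "$log"
        sync
        # Record a non-zero exit status instead of aborting the loop.
        "$@" || echo "run $i: exit status $?" >> "$log"
        echo "run $i: finished" >> "$log"
        i=$((i + 1))
    done
}
```

With the command from this thread, an invocation might look like: repeat_logged 5 /root/cs-repro.log taskset -c 2 ./perf record -e cs_etm/@20070000.etr/u --per-thread taskset -c 2 uname. After a reboot, a final "starting" line with no matching "finished" line identifies the run that hung.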

8 years, 1 month
