On Tue, May 16, 2017 at 1:03 PM, Sebastian Pop s.pop@samsung.com wrote:
On 05/16/2017 07:49 AM, Sebastian Pop wrote:
On 05/16/2017 07:17 AM, Kim Phillips wrote:
I'm using OpenCSD's perf-opencsd-4.11 branch, currently e0cf71e "coresight: tmc: minor fix for output log".
I'm attaching the perf.data file in a separate email to you.
Thanks Kim. I got your perf.data and I will try to see why the perf inject "compiler" crashes on aarch64.
I am not able to reproduce the error on Juno-r0 with the perf.data from Kim. I rebuilt the ETM decoder from master as of today 054c07caa2eb06ae74426b79cb3875bf2569d091 linux perf as indicated at sha e0cf71e.
# export LD_LIBRARY_PATH=/root/etm/decoder-may/decoder/tests/bin/linu x-arm64/dbg # /root/etm/OpenCSD-may/tools/perf/perf inject -i ~/perf.data -o /tmp/inj --itrace=il64 --strip /home/kim/sort-etm/sort-O3 with build id 69ba0f7aec800793a0d0f30d2acd50c11e87630d not found, continuing without symbols /lib/aarch64-linux-gnu/libc-2.24.so with build id 049352a85d29cb69c86c57f362dd10b7e2e62eba not found, continuing without symbols /lib/aarch64-linux-gnu/ld-2.24.so with build id f4fddd122d808f647d41d3b883f5cff43ca44f41 not found, continuing without symbols # echo $? 0 # ls -alh /tmp/inj -rw------- 1 root root 11K May 16 09:47 /tmp/inj # dump_gcov sort_3k.gcov -gcov_version=1
<empty> # /root/etm/OpenCSD-may/tools/perf/perf report -D -i inj | grep 'branch stack: nr:0' | wc -l 1
There is only one LBR event which does not have a branch stack.
However on a trace that I collect this way: # gcc -v [...] gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4) # gcc -O3 -g3 sort.c -o sort_3k # /root/etm/OpenCSD-may/tools/perf/perf record -e cs_etm/@20070000.etr/u --per-thread ./sort_3k Bubble sorting array of 3000 elements 77 ms [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.494 MB perf.data ] # /root/etm/OpenCSD-may/tools/perf/perf inject -f -i perf.data -o inj --itrace=i100usl --strip # /root/etm/OpenCSD-may/tools/perf/perf report -D -i inj | grep 'branch stack: nr:64' | wc -l 30040 # /root/etm/OpenCSD-may/tools/perf/perf report -D -i inj | grep 'branch stack: nr:39' | wc -l 1 # create_gcov --binary=./sort_3k --profile=inj --gcov=sort_3k.gcov -gcov_version=1 # dump_gcov sort_3k.gcov -gcov_version=1 sort_array total:684723 head:0 0: 1 3.3: 1 4.3: 0 7: 0 1: printf total:0 2: 0 6: bubble_sort total:684721 2: 138 4: 1 5: 0 7: 130873 9: 553709
There are 30040 * 64 + 39 = 1922599 branches recorded as "Last Branch Record" events.
On x86_64-linux with intel-pt, I see the following coverage file: $ gcc -v [...] gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) $ gcc -O3 -g3 sort.c -o sort_3k $ perf record -e intel_pt//u ./sort_3k WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted, check /proc/sys/kernel/kptr_restrict.
Samples in kernel functions may not be resolved if a suitable vmlinux file is not found in the buildid cache or in the vmlinux path.
Samples in kernel modules won't be resolved at all.
If some relocation was applied (e.g. kexec) symbols may be misresolved even with a suitable vmlinux or kallsyms file.
Cannot read kernel map Couldn't record kernel reference relocation symbol Symbol resolution may be skewed if relocation was used (e.g. kexec). Check /proc/kallsyms permission or run as root. Bubble sorting array of 3000 elements 39 ms [ perf record: Woken up 51 times to write data ] [ perf record: Captured and wrote 3.160 MB perf.data ] $ perf inject -i perf.data -o inj --itrace=i100usle --strip $ create_gcov --binary=./sort_3k --profile=inj --gcov=sort_3k.gcov -gcov_version=1 $ dump_gcov sort_3k.gcov -gcov_version=1 sort_array total:76281 head:0 0: 4 3.3: 4 4.3: 0 7: 0 1: printf total:0 2: 0 6: bubble_sort total:76273 2: 4 4: 19538 5: 19509 7: 8842 8: 8842 9: 19538 $ perf report -D -i inj 2> /dev/null | grep 'branch stack: nr:64' | wc -l 471 $ perf report -D -i inj 2> /dev/null | grep 'branch stack: nr:' | wc -l 491
There are in between 471 * 64 = 30144 and 491 * 64 branches recorded in as LBR events, as the --itrace option is sampling through the trace data every 100us. The sampling is not yet implemented in "perf inject" for aarch64.
If we sample every 10 micro seconds, we get more LBR events, and the profile is scaled up: $ perf inject -i perf.data -o inj --itrace=i10usle --strip $ perf report -D -i inj10 2> /dev/null | grep 'branch stack: nr:' | wc -l 4821 $ perf report -D -i inj10 2> /dev/null | grep 'branch stack: nr:64' | wc -l 4695 $ create_gcov --binary=./sort_3k --profile=inj10 --gcov=sort_3k_10.gcov -gcov_version=1 $ dump_gcov sort_3k_10.gcov -gcov_version=1 sort_array total:775053 head:0 0: 27 3.3: 56 4.3: 56 7: 0 1: printf total:0 2: 0 6: bubble_sort total:774914 2: 29 4: 195863 5: 195415 7: 93872 8: 93872 9: 195863
Now, let's investigate why the coverage samples are different on x86_64 and aarch64.
Here is again the source code for the hot code:
void bubble_sort (int *a, int n) { int i, t, s = 1; while (s) { s = 0; for (i = 1; i < n; i++) { // the condition and the inner loop are hot. if (a[i] < a[i - 1]) { // the code in the "then clause" is colder than the condition, // as most of the elements get sorted over time and do not need to be swapped. t = a[i]; a[i] = a[i - 1]; a[i - 1] = t; s = 1; } } } }
The coverage on aarch64:
6: bubble_sort total:684721 2: 138 4: 1 5: 0 7: 130873 9: 553709
"7:" is the code in the "then clause". This is colder than "9:". "9:" is the code of the innermost loop and the condition "a[i] < a[i-1]".
Looking at the assembler generated by gcc for aarch64, the inner loop is in L3 and L4:
# gcc -O3 sort.c -S -o - [...] bubble_sort: sub w5, w1, #2 add x5, x5, 1 add x5, x0, x5, lsl 2 .p2align 2 .L2: cmp w1, 1 // cold outer loop ble .L1 mov x2, x0 mov w6, 0 .p2align 2 .L4: ldp w4, w3, [x2] // hot inner loop cmp w3, w4 // hot bge .L3 // hot mov w6, 1 // colder code in "then clause" stp w3, w4, [x2] // colder code .L3: add x2, x2, 4 // hot cmp x2, x5 // hot bne .L4 // hot cbnz w6, .L2 // cold outer loop .L1: ret [...]
and here is what GCC generates for x86_64 $ gcc -O3 sort.c -S -o - [...] bubble_sort: .LFB40: .cfi_startproc leal -2(%rsi), %eax leaq 8(%rdi,%rax,4), %r8 .p2align 4,,10 .p2align 3 .L2: cmpl $1, %esi jle .L1 leaq 4(%rdi), %rax xorl %r9d, %r9d .p2align 4,,10 .p2align 3 .L4: movl (%rax), %edx movl -4(%rax), %ecx cmpl %ecx, %edx jge .L3 movl %ecx, (%rax) movl %edx, -4(%rax) movl $1, %r9d .L3: addq $4, %rax cmpq %r8, %rax jne .L4 testl %r9d, %r9d jne .L2 .L1: rep ret [...]
Here is the x86_64 coverage.
6: bubble_sort total:774914 2: 29 4: 195863 5: 195415 7: 93872 8: 93872 9: 195863
It seems to me that the debug info may be more precise on x86_64 than on aarch64, and create_gcov is able to identify in a more precise way which instruction belongs to which basic block in a more fine grain way than on aarch64. For example, "4:" and "9:" have the exact same number of event samples, and the same is true for "7:" and "8:". These may be because some instructions get tagged in the debug info as belonging to different basic blocks.
I'm adding Dehao to the "CC:" list to check my interpretation of what create_gcov does differently on x86_64 than on aarch64.
Could you copy-paste the code with line number (the one you pasted does not match the profile). Could you also provide the the assembly of the -g1 arm binary so that I can examine the debug info?
Also, it would be helpful if you can use sample_merger to generate the text version of the perf.data file and attach that to the thread too (as well as the full objdump of the bubble_sort function). This will help evaluate if profile is accurate at binary level.
Thanks, Dehao
Thanks, Sebastian