On Tue, May 16, 2017 at 1:03 PM, Sebastian Pop <s.pop@samsung.com> wrote:


On 05/16/2017 07:49 AM, Sebastian Pop wrote:


On 05/16/2017 07:17 AM, Kim Phillips wrote:
I'm using OpenCSD's perf-opencsd-4.11 branch, currently e0cf71e
"coresight: tmc: minor fix for output log".

I'm attaching the perf.data file in a separate email to you.
Thanks Kim.
I got your perf.data and I will try to see why the perf inject "compiler" crashes on aarch64.

I am not able to reproduce the error on Juno-r0 with the perf.data from Kim.
I rebuilt the ETM decoder from master as of today (054c07caa2eb06ae74426b79cb3875bf2569d091)
and linux perf at the indicated sha e0cf71e.

# export LD_LIBRARY_PATH=/root/etm/decoder-may/decoder/tests/bin/linux-arm64/dbg
# /root/etm/OpenCSD-may/tools/perf/perf inject -i ~/perf.data -o /tmp/inj --itrace=il64 --strip
/home/kim/sort-etm/sort-O3 with build id 69ba0f7aec800793a0d0f30d2acd50c11e87630d not found, continuing without symbols
/lib/aarch64-linux-gnu/libc-2.24.so with build id 049352a85d29cb69c86c57f362dd10b7e2e62eba not found, continuing without symbols
/lib/aarch64-linux-gnu/ld-2.24.so with build id f4fddd122d808f647d41d3b883f5cff43ca44f41 not found, continuing without symbols
# echo $?
0
# ls -alh /tmp/inj
-rw------- 1 root root 11K May 16 09:47 /tmp/inj
# dump_gcov sort_3k.gcov -gcov_version=1
<empty>
# /root/etm/OpenCSD-may/tools/perf/perf report -D -i inj | grep 'branch stack: nr:0' | wc -l
1

There is only one LBR event which does not have a branch stack.

However, on a trace that I collected this way:
# gcc -v
[...]
gcc version 5.4.0 20160609 (Ubuntu/Linaro 5.4.0-6ubuntu1~16.04.4)
# gcc -O3 -g3 sort.c -o sort_3k
# /root/etm/OpenCSD-may/tools/perf/perf record -e cs_etm/@20070000.etr/u --per-thread ./sort_3k
Bubble sorting array of 3000 elements
77 ms
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.494 MB perf.data ]
# /root/etm/OpenCSD-may/tools/perf/perf inject -f -i perf.data -o inj --itrace=i100usl --strip
# /root/etm/OpenCSD-may/tools/perf/perf report -D -i inj | grep 'branch stack: nr:64' | wc -l
30040
# /root/etm/OpenCSD-may/tools/perf/perf report -D -i inj | grep 'branch stack: nr:39' | wc -l
1
# create_gcov --binary=./sort_3k --profile=inj --gcov=sort_3k.gcov -gcov_version=1
# dump_gcov sort_3k.gcov -gcov_version=1
sort_array total:684723 head:0
  0: 1
  3.3: 1
  4.3: 0
  7: 0
  1: printf total:0
    2: 0
  6: bubble_sort total:684721
    2: 138
    4: 1
    5: 0
    7: 130873
    9: 553709

There are 30040 * 64 + 39 = 1922599 branches recorded as "Last Branch Record" events.
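
As an aside, the whole breakdown of branch stack sizes can be obtained in one pass
(a sketch using standard shell tools; I did not run it above):

# /root/etm/OpenCSD-may/tools/perf/perf report -D -i inj 2> /dev/null | grep -o 'branch stack: nr:[0-9]*' | sort | uniq -c

Each output line gives a branch stack size and the number of LBR events carrying a stack of that size.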

On x86_64-linux with intel-pt, I see the following coverage file:
$ gcc -v
[...]
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
$ gcc -O3 -g3 sort.c -o sort_3k
$ perf record -e intel_pt//u ./sort_3k
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Cannot read kernel map
Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
Bubble sorting array of 3000 elements
39 ms
[ perf record: Woken up 51 times to write data ]
[ perf record: Captured and wrote 3.160 MB perf.data ]
$ perf inject -i perf.data -o inj --itrace=i100usle --strip
$ create_gcov --binary=./sort_3k --profile=inj --gcov=sort_3k.gcov -gcov_version=1
$ dump_gcov sort_3k.gcov -gcov_version=1
sort_array total:76281 head:0
  0: 4
  3.3: 4
  4.3: 0
  7: 0
  1: printf total:0
    2: 0
  6: bubble_sort total:76273
    2: 4
    4: 19538
    5: 19509
    7: 8842
    8: 8842
    9: 19538
$ perf report -D -i inj 2> /dev/null | grep 'branch stack: nr:64' | wc -l
471
$ perf report -D -i inj 2> /dev/null | grep 'branch stack: nr:' | wc -l
491

There are between 471 * 64 = 30144 and 491 * 64 = 31424 branches recorded as LBR events,
because the --itrace option samples the trace data every 100us.
This sampling is not yet implemented in "perf inject" for aarch64.
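
For reference, my reading of the --itrace strings used above (based on perf's itrace
documentation, tools/perf/Documentation/itrace.txt; worth double-checking):

  i100us   synthesize "instructions" events, one every 100 microseconds of traced time
  l        attach a last-branch (LBR-style) branch stack to each synthesized event
  l64      same, with up to 64 entries per stack (the il64 run earlier in this mail)
  e        also synthesize error events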

If we sample every 10 microseconds, we get more LBR events, and the profile is scaled up accordingly:
$ perf inject -i perf.data -o inj10 --itrace=i10usle --strip
$ perf report -D -i inj10 2> /dev/null | grep 'branch stack: nr:' | wc -l
4821
$ perf report -D -i inj10 2> /dev/null | grep 'branch stack: nr:64' | wc -l
4695
$ create_gcov --binary=./sort_3k --profile=inj10 --gcov=sort_3k_10.gcov -gcov_version=1
$ dump_gcov sort_3k_10.gcov -gcov_version=1
sort_array total:775053 head:0
  0: 27
  3.3: 56
  4.3: 56
  7: 0
  1: printf total:0
    2: 0
  6: bubble_sort total:774914
    2: 29
    4: 195863
    5: 195415
    7: 93872
    8: 93872
    9: 195863

Now, let's investigate why the coverage samples are different on x86_64 and aarch64.

Here again is the source code of the hot function:

void bubble_sort (int *a, int n) {
  int i, t, s = 1;
  while (s) {
    s = 0;
    for (i = 1; i < n; i++) {
      // the condition and the inner loop are hot.
      if (a[i] < a[i - 1]) {
        // the code in the "then clause" is colder than the condition,
        // as most of the elements get sorted over time and do not need to be swapped.
        t = a[i];
        a[i] = a[i - 1];
        a[i - 1] = t;
        s = 1;
      }
    }
  }
}

The coverage on aarch64:

  6: bubble_sort total:684721
    2: 138
    4: 1
    5: 0
    7: 130873
    9: 553709

"7:" is the code in the "then clause".  This is colder than "9:".
"9:" is the code of the innermost loop and the condition "a[i] < a[i-1]".

Looking at the assembly generated by gcc for aarch64,
the inner loop is in .L3 and .L4:

# gcc -O3  sort.c -S -o -
[...]
bubble_sort:
        sub     w5, w1, #2
        add     x5, x5, 1
        add     x5, x0, x5, lsl 2
        .p2align 2
.L2:
        cmp     w1, 1                 // cold outer loop
        ble     .L1
        mov     x2, x0
        mov     w6, 0
        .p2align 2
.L4:
        ldp     w4, w3, [x2]          // hot inner loop
        cmp     w3, w4                // hot
        bge     .L3                   // hot
        mov     w6, 1                 // colder code in "then clause"
        stp     w3, w4, [x2]          // colder code
.L3:
        add     x2, x2, 4             // hot
        cmp     x2, x5                // hot
        bne     .L4                   // hot
        cbnz    w6, .L2               // cold outer loop
.L1:
        ret
[...]

And here is what GCC generates for x86_64:
$ gcc -O3  sort.c -S -o -
[...]
bubble_sort:
.LFB40:
        .cfi_startproc
        leal    -2(%rsi), %eax
        leaq    8(%rdi,%rax,4), %r8
        .p2align 4,,10
        .p2align 3
.L2:
        cmpl    $1, %esi
        jle     .L1
        leaq    4(%rdi), %rax
        xorl    %r9d, %r9d
        .p2align 4,,10
        .p2align 3
.L4:
        movl    (%rax), %edx
        movl    -4(%rax), %ecx
        cmpl    %ecx, %edx
        jge     .L3
        movl    %ecx, (%rax)
        movl    %edx, -4(%rax)
        movl    $1, %r9d
.L3:
        addq    $4, %rax
        cmpq    %r8, %rax
        jne     .L4
        testl   %r9d, %r9d
        jne     .L2
.L1:
        rep ret
[...]
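
Note that both listings above were generated without -g, so they carry no line information.
To see which source line GCC attributes to each instruction, the assembly can be
regenerated with debug info (a sketch, not run here):

$ gcc -O3 -g1 sort.c -S -o sort_3k.s
$ grep -n '\.loc' sort_3k.s   # .loc directives give the file/line GCC assigns to the instructions that follow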

Here is the x86_64 coverage.

  6: bubble_sort total:774914
    2: 29
    4: 195863
    5: 195415
    7: 93872
    8: 93872
    9: 195863

It seems to me that the debug info may be more precise on x86_64 than on aarch64,
so create_gcov is able to identify at a finer grain which instruction belongs to
which basic block on x86_64 than on aarch64.
For example, "4:" and "9:" have exactly the same number of event samples,
and the same is true for "7:" and "8:".  This may be because some instructions
get tagged in the debug info as belonging to different basic blocks.
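
One way to check this (a sketch; I have not run it on these binaries) would be to compare
the DWARF line tables of the two sort_3k binaries, which is the information create_gcov
has to rely on when mapping sampled addresses back to source lines:

$ objdump -d -l sort_3k                  # disassembly interleaved with source file:line annotations
$ objdump --dwarf=decodedline sort_3k    # decoded line table: address -> file:line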

I'm adding Dehao to the "CC:" list to check my interpretation of what create_gcov does
differently on x86_64 than on aarch64.

Could you copy-paste the code with line numbers? (The one you pasted does not match the profile.) Could you also provide the assembly of the -g1 arm binary so that I can examine the debug info?

Also, it would be helpful if you could use sample_merger to generate the text version of the perf.data file and attach that to the thread too (as well as the full objdump of the bubble_sort function). This will help evaluate whether the profile is accurate at the binary level.

Thanks,
Dehao
 

Thanks,
Sebastian