New subject: Help sought on which approach to take for tracing with CoreSight

22 Apr 2021

Hi,
Please ensure you reply to the list as well - this gives you a better
chance of getting a timely response & may find people who know more
about CSAL.
On Wed, 21 Apr 2021 at 20:36, Dominik Huber dominik.huber@fau.de wrote:
...
Hello,
Thanks for all the help. Especially the explanation of the trace was
very insightful.
For now, I am going to try implementing the sysfs tracing, since it
seemed easier at first glance. I found the documentation for the .ini
files within the snapshots, and am now trying to create my own version
of them. But I'm struggling to get the addresses for all the devices,
e.g. within cpu_0.ini. I've read in the discovery.md from CSAL that they
can be extracted by reading the ROM table of my Cortex-A53, but I'm
doing/understanding something wrong.
  I cross-compiled the CSAL library and the csinfo-folder (which btw
only compiled after inserting "MODULE_LICENSE("");" into the csinfo.c
file).
  Then I copied the resulting csinfo.ko to my Hikey620 where I tried
"sudo make" and it returned:
Load CoreSight reporting module... expect "Resource temporarily unavailable"
  insmod csinfo.ko
  insmod: ERROR: could not insert module csinfo.ko: Resource temporarily
unavailable
  Makefile:17: recipe for target 'load' failed
  make: *** [load] Error 1
If I understood correctly, that is intended, but should also give me the
base address of the ROM-table afterward. But there was no other output.
I do have CONFIG_DEVMEM enabled. I tried "sudo insmod csinfo.ko"
directly as well, with the same result.
Looking at the code it appears that the output is printk(KERN_INFO ... )
Check the printk level is correct on your system.
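As a rough illustration of that check: the first field of /proc/sys/kernel/printk is the console log level, and a message is printed on the console only if its level is numerically lower than that value. KERN_INFO is level 6. The helper name below is made up for this sketch; it only parses the sysctl text and does not touch the module. Messages filtered from the console still land in the kernel ring buffer, so "dmesg" should show them either way.

```python
KERN_INFO = 6  # numeric level of printk(KERN_INFO ...)


def reaches_console(printk_sysctl, msg_level=KERN_INFO):
    """Return True if a message of msg_level would be shown on the console.

    printk_sysctl is the text of /proc/sys/kernel/printk, e.g. "4 4 1 7".
    Only the first field (console_loglevel) matters here.
    """
    console_loglevel = int(printk_sysctl.split()[0])
    return msg_level < console_loglevel


# Typical use on the target (hypothetical, not run here):
#   with open("/proc/sys/kernel/printk") as f:
#       print(reaches_console(f.read()))
```

With a console log level of 4, KERN_INFO output from the module would not appear on the console, only in the ring buffer.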
...
I could neither find the physical base address for the Cortex-A53 nor a
way to read the "CBAR_EL1" register, which apparently contains that
information. At least not without a debugger.
Since I don't know the correct address, I just tried "sudo ./csscan.py
0x0", and this was the result:
  @0x0   0x000 0x000 r0.0 unexpected CIDR: 0x000000f8
  class:0
  What does this output mean?
As far as I can tell to get valid output from this tool, you need to
give it a valid input address / range of addresses.
CIDR is the component ID register - this is used by the tool to match
to known component values. If it says unexpected then it means just
that - the value found is not a recognised CIDR for the tool.
See the CoreSight Specification for how CIDR and other IDR values are
used and what is expected.
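A minimal sketch of what such a check looks like, assuming the four CIDR bytes are packed into one 32-bit value with CIDR3 in the top byte (this packing, and the function and table names, are assumptions for illustration and are not taken from csscan.py). A valid component has the fixed preamble 0xB105x00D, where x in bits [15:12] is the component class.

```python
CIDR_PREAMBLE_MASK = 0xFFFF0FFF   # everything except the class field
CIDR_PREAMBLE = 0xB105000D        # fixed preamble bytes B1 05 x0 0D

# A few well-known class values; not an exhaustive list.
COMPONENT_CLASSES = {
    0x1: "ROM table",
    0x9: "CoreSight component",
    0xF: "generic IP (PrimeCell)",
}


def decode_cidr(cidr):
    """Return (preamble_ok, class_number, class_name) for a packed CIDR."""
    preamble_ok = (cidr & CIDR_PREAMBLE_MASK) == CIDR_PREAMBLE
    comp_class = (cidr >> 12) & 0xF
    name = COMPONENT_CLASSES.get(comp_class, "other/unknown class")
    return preamble_ok, comp_class, name


print(decode_cidr(0xB105900D))  # a well-formed CoreSight component
print(decode_cidr(0x000000F8))  # the value read at address 0x0
print(decode_cidr(0x00001A3F))  # the value read at 0xE011000
```

Under this assumption, neither value reported in this thread carries the preamble, which fits the "unexpected CIDR" message: the addresses scanned are probably not CoreSight component base addresses.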
Regards
Mike
...
For the address 0xE011000 I got:
  @0xe011000   0x200 0x000 r0.7 unexpected CIDR: 0x00001a3f
  ROM table
  Does this help me? Unfortunately, this is not persistent information,
since after a reboot it showed something different.
How can I find out the addresses, that I need for the .ini files?
Regards,
Dominik
On 12/04/2021 11:44, Mike Leach wrote:
...
Hi,
On Sat, 10 Apr 2021 at 13:59, Dominik Huber dominik.huber@fau.de wrote:
...
Am 09.04.2021 um 17:01 schrieb Mike Leach:
...
Hi Dominik
On Fri, 9 Apr 2021 at 15:22, Dominik Huber dominik.huber@fau.de wrote:
...
Hello,
I want to gather trace data of closed source binaries using CoreSight
ETMv4 on a Hikey620. I want to know the source and destination address
for all taken jumps of the traced program, like in the output of "perf
script". It would be great if I could get feedback on how to achieve this.
I'm not sure where to turn to with such a broad CoreSight problem, so
I'm sorry if you are not the right ones to turn to, but I'd be happy for
any help or advice you might have.
This is exactly the right place to ask for help on this!
I'm glad to hear that!
...
...
My main problem is that I don't know which approaches are promising to
try. Below I describe two ideas that I tried but where I got stuck after
a while. Are they any good for my use case? If yes, then how can I solve
the respective problems that have come up, or where can I look to solve
them? If not, are there maybe better ways to approach this, which I've
overlooked until now?
After hearing a presentation from Mathieu Poirier, I thought sysFS was
the (only) way to go. However, the decoded trace seems to show only the
jump address, instead of both the source and destination addresses, and
I did not find a register to change that.
What do you mean by decoded trace here & what are you using to decode the trace?
If you look at the ETM spec / OpenCSD documentation you will see that
to fully decode trace there is a two step process.

1) convert the trace byte stream into trace packets. This will
require some minimal information regarding the configuration of the
ETM
2) convert the trace packets into the fully decoded execution trace.
This requires access to the binary images executed during the trace
session.
The reason for this is that trace packets are highly compressed, and
contain the minimum of address information. Decode requires that the
decoder will walk the binary images to deduce which branches are taken
and not taken.
Address information is included in the trace packets only where it
cannot be deduced from the code - and this is only ever target address
information - the source address can always be deduced from the code
walk.
I used ptm2human for decoding. I also tried the c_api_pkt_print_test.c
from OpenCSD, which decodes a single packet. However, depending on how I
collected the trace data, it does not always produce decoded data.
ptm2human performs the 1st stage of decode - byte stream to trace
packets. The output from this is the same as the packet only decode
from the OpenCSD library / trc_pkt_lister app. This will print out the
trace packets - but be aware that in both cases the addresses that you
see in the trace are not all the branch addresses used in the executed
application.
...
Once
I have a working proof-of-concept, I intended to write myself an OpenCSD
decoder. Because of that, and because the decoding often just seemed "to
work", I postponed any thoughts about the packet processing.
What are these "binary images"? Until now, I got only a cstrace.bin from
/dev/[sink], which I used as only Input for any decoding. How can I get
them? Can I, by using them, obtain the destination addresses of branch
instructions?
cstrace.bin is the binary trace data. The binary images I refer to are
the memory images of the executed code - the decoder walks this memory
image to deduce the path of the executed trace based on the opcodes
encountered.
These are either the binary files of your application and any loaded
.so files, or may be a memory dump of the locations these were loaded
during the trace session. Either will do to decode.
See https://github.com/Linaro/OpenCSD/blob/master/decoder/docs/prog_guide/prog_g...
So to successfully decode a trace session to obtain the source and
target branch addresses you want, you will need the following:-

1) The captured binary trace data.
2) The configuration registers of the ETM.
3) The program binaries for all applications and .so libraries active
during the trace session and their load addresses.
Be aware that ETM will trace everything running on a core - so it
may be necessary for any analysis of a particular program to filter
out anything unrelated.
...
...
...
Also, the trace gathered seems to lose some of the branch addresses.
Inserting a sleep instruction after each regular instruction into my
test program, fixed that. But since it should also work for closed
source binaries and has to be fast this is probably not an option.
Then I tried to copy the way "perf record" is tracing, and extract the
relevant code parts. But then I realized, that perf record doesn't use
sysFS, apart from enabling the sink in "util/cs-etm.c" (which apparently
is not used, and not even deactivated afterward).
perf uses the driver in a similar way to sysfs. It does in fact
activate and de-activate the sources and sinks as the perf events are
run on any CPU.
perf also records the binaries used during the trace session - so that
full trace decode is possible. With sysfs you can get raw trace data -
but relating this to the binaries being executed is far more
difficult.
So is it your recommendation to use large parts of the perf source code
for my project because it already makes use of these "binary images"?
I was not recommending re-use of perf source code - I was recommending
using perf to capture and analyse trace. If this is not sufficient -
or you want to write your own application, then perf serves as an
example of the information you need to collect during a trace session
to successfully decode the trace. perf does two tasks - capture -
using perf record, and analysis using perf report. These are very
separate elements - the first happens in kernel space, the second,
often offline as a separate perf program, in user space.
...
Up until now, I only thought about using a few hundred lines, because I
believe perf has a lot of overhead (both in performance and code length)
just to get branch instructions from an executed binary.
The most important factors for me are that

1) the trace includes all branches (including the jump destination), and
2) it is fast (faster than using hardware emulation and extracting the
addresses from there)
What about the approaches to use the CoreSight driver directly or to use
CSAL for trace collection? Would they maybe be better suited?
Which method is best for capturing trace is really for you to
assess based on your requirements. CSAL / direct driver access
/ sysfs are all options - though I am not aware of anyone actually
using a direct driver access method so could not really advise here.
Remember that whatever method you choose, in order to get the data
that you require you will need to collect the additional information I
describe above.
Once you have this data you can then pass it to the OpenCSD library
for full decode. This will output the executed trace ranges. You are
then free to analyse these ranges to synthesise source / target data
for branches.
Below is a short example of the trace output from the library - I have
annotated this '***' to explain what is happening. This was generated
using trc_pkt_lister test program from OpenCSD, running on one of the
supplied test captures.
The I_ packets are the trace packets - as would be output from
ptm2human, the OCSD_GEN_TRC packets are the output of the library -
fully decoded trace ranges.
Idx:356; ID:12; [0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x80 ];    I_ASYNC : Alignment Synchronisation.
*** start of trace - align the decoder to the incoming byte stream for
the ETM programmed with trace ID 0x12
Idx:369; ID:12; [0x01 0x01 0x00 ];    I_TRACE_INFO : Trace Info.;
INFO=0x0 { CC.0 }
*** Some setup information regarding trace configuration - used
internally by the decoder.
Idx:372; ID:12; [0xf7 ];    I_ATOM_F1 : Atom format 1.; E
*** atom packet - skipped - cannot decode until we get an address /
context packet
Idx:373; ID:12; [0x85 0x22 0x12 0x4d 0x00 0x00 0x00 0x00 0x00 0x30 ];
   I_ADDR_CTXT_L_64IS0 : Address & Context, Long, 64 bit, IS0.;
Addr=0x00000000004D2488; Ctxt: AArch64,EL0, NS;
*** address and context packet - this gives the trace start address
for the decoder + EL and ISA context.
Idx:384; ID:12; [0xf7 ];    I_ATOM_F1 : Atom format 1.; E
*** Atom packet. This indicates that the program executed a number of
instructions until it encountered a P0 element instruction (P0
element instructions - previously referred to as waypoints in PTM - are
instructions that change the address flow of the program. These are
primarily branch instructions but the full range of P0 instruction
types is defined by the ETM4 protocol). This instruction was executed.
Idx:373; ID:12; OCSD_GEN_TRC_ELEM_PE_CONTEXT((ISA=A64) EL0N; 64-bit; )
*** OpenCSD library outputs the context information for the client
application. Where traced, the context also includes the Context ID
register, which may contain the PID of the program running on the CPU
under certain kernel configurations.
Idx:384; ID:12; OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec
range=0x4d2488:[0x4d2494] num_i(3) last_sz(4) (ISA=A64) E iBR A64:ret
)
*** OpenCSD outputs the executed instruction trace range - three
instructions at addresses 0x4d2488 - 0x4d2493. This is output in
response to the atom packet.
This range was calculated by starting @ address 0x4d2488, walking
through the program image from that address, examining opcodes until
it found a P0 element (branch instruction) which it associates with
the atom packet.
This was an indirect branch (iBR A64:ret ) and also an aarch64 return
instruction. This branch was taken (E). At this point we do not know
the destination of the branch - so we are waiting for a target address
packet in the input stream.
Idx:385; ID:12; [0x9d 0x48 0x5f 0x4d 0x00 0x00 0x00 0x00 0x00 ];
I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0x00000000004DBF20;
*** Address packet - this updates the decoder with the target of the
prior indirect branch - which can now continue decoding
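To show how the address is packed into these bytes, here is a small sketch that unpacks the 8 address bytes following the header of a long 64-bit IS0 address packet. It assumes the layout in which the first two payload bytes carry address bits [8:2] and [15:9] in their low 7 bits and the remaining six bytes carry bits [63:16]. The function name is made up for this sketch and is not part of OpenCSD; the two byte sequences are copied from the packets shown in this listing.

```python
def decode_long_addr64_is0(payload):
    """Unpack a 64-bit IS0 address from the 8 bytes after the header byte."""
    if len(payload) < 8:
        raise ValueError("need 8 address bytes")
    addr = (payload[0] & 0x7F) << 2       # address bits [8:2]
    addr |= (payload[1] & 0x7F) << 9      # address bits [15:9]
    for i in range(2, 8):                 # address bits [63:16], bytewise
        addr |= payload[i] << (8 * i)
    return addr


# Idx:373, I_ADDR_CTXT_L_64IS0: header 0x85, then 8 address bytes
# (the trailing 0x30 context byte is not part of the address).
start = decode_long_addr64_is0([0x22, 0x12, 0x4D, 0, 0, 0, 0, 0])
# Idx:385, I_ADDR_L_64IS0: header 0x9d, then 8 address bytes.
target = decode_long_addr64_is0([0x48, 0x5F, 0x4D, 0, 0, 0, 0, 0])
print(hex(start), hex(target))
```

The two results match the Addr= fields printed for these packets. In practice the OpenCSD library does this unpacking, so client code normally never handles these bytes directly.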
Idx:394; ID:12; [0xde ];    I_ATOM_F4 : Atom format 4.; NENE
*** A single packet carrying 4 atoms - representing 4 waypoint
instructions that were taken (E) or not taken (N). You will note that
this single packet will decode into 4 separate executed instruction
ranges - with no further address packets visible in the trace.
Idx:394; ID:12; OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec
range=0x4dbf20:[0x4dbf24] num_i(1) last_sz(4) (ISA=A64) N BR   <cond>)
*** OpenCSD executed instruction range - 1 instruction - was a not
taken branch. This range started @ 0x04DBF20 - taken from the address
packet above.
At this point the client program using the library can deduce the most
recent taken branch information - as 0x4d2490=>0x4dbf20
Idx:394; ID:12; OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec
range=0x4dbf24:[0x4dbf2c] num_i(2) last_sz(4) (ISA=A64) E BR  b+link )
*** OpenCSD executed instruction range - 2 instructions - last
instruction was a taken branch + link - in this case the decoder can
calculate the target address from the instruction opcode - so no
address data needs to appear in the trace.
Idx:394; ID:12; OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec
range=0x4d1d88:[0x4d1db4] num_i(11) last_sz(4) (ISA=A64) N BR
<cond>)
*** OpenCSD executed instruction range - 11 instructions, starting at
the address calculated from the previous range. So the start of the
range was the result of a branch 0x4dbf28=>0x4d1d88
Ends in a not taken branch.
Idx:394; ID:12; OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec
range=0x4d1db4:[0x4d1dbc] num_i(2) last_sz(4) (ISA=A64) E BR   <cond>)
*** OpenCSD executed instruction range - 2 instructions, last is a
direct branch which will allow us to calculate the target address.
Throughout this process the decoder maintains a current trace address
from the incoming address packets, and by walking through the executed
opcodes - calculating branch targets where possible. This is why the
program and library binaries (or a memory dump of their load
locations) are required for full trace decode and obtaining the
information you require.
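The deduction of source / target pairs from the decoded ranges can be sketched as follows. This is an illustration only: the InstrRange tuple and taken_branches function are made-up names, not OpenCSD types, and the sketch assumes the ranges are consecutive with no exception or trace discontinuity elements between them. The branch source is the last instruction of a range that ended in a taken (E) branch, and the target is the start of the next range.

```python
from typing import List, NamedTuple, Tuple


class InstrRange(NamedTuple):
    start: int    # first executed address in the range
    end: int      # address just past the last instruction (exclusive)
    last_sz: int  # size in bytes of the last instruction
    taken: bool   # True for E (branch taken), False for N


def taken_branches(ranges: List[InstrRange]) -> List[Tuple[int, int]]:
    """Return (source, target) for every taken branch between ranges."""
    pairs = []
    for prev, nxt in zip(ranges, ranges[1:]):
        if prev.taken:
            source = prev.end - prev.last_sz  # address of the branch itself
            pairs.append((source, nxt.start))
    return pairs


# The five ranges from the listing above.
ranges = [
    InstrRange(0x4D2488, 0x4D2494, 4, True),
    InstrRange(0x4DBF20, 0x4DBF24, 4, False),
    InstrRange(0x4DBF24, 0x4DBF2C, 4, True),
    InstrRange(0x4D1D88, 0x4D1DB4, 4, False),
    InstrRange(0x4D1DB4, 0x4D1DBC, 4, True),
]
for src, dst in taken_branches(ranges):
    print(f"{src:#x} => {dst:#x}")
```

This yields the two branches noted in the annotations. The final range also ends in a taken branch, but its target only becomes known once the next range arrives.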
Again, I would recommend reading the ETM protocol spec and the OpenCSD
library documentation to get a full understanding of the protocols
and how decoding works.
Regards
Mike
...
...
If you are interested in tracing a particular binary & this is a
userspace program then you may wish to try:-
perf record -e cs_etm//u --per-thread <program-to-trace>
to ensure that any trace collected is related to the program you are
interested in. You can then use the facilities of perf report / perf
script to examine the trace.
Thanks, but I luckily already knew about that one.
...
Regards
Mike
...
So there is another way to gather trace, maybe by interacting with the
CoreSight driver directly. But looking into the "perf report" source
code I couldn't find it yet.
Thanks and regards,
Dominik

CoreSight mailing list
CoreSight@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/coresight
--
Mike Leach
Principal Engineer, ARM Ltd.
Manchester Design Centre. UK
