Hello, I want to gather trace data of closed source binaries using CoreSight ETMv4 on a Hikey620. I want to know the source and destination address for all taken jumps of the traced program, like in the output of "perf script". It would be great if I could get feedback on how to achieve this. I'm not sure where to turn to with such a broad CoreSight problem, so I'm sorry if you are not the right ones to turn to, but I'd be happy for any help or advice you might have.
My main problem is that I don't know which approaches are promising. Below I describe two ideas that I tried, but I got stuck with both after a while. Are they any good for my use case? If yes, how can I solve the respective problems that have come up, or where can I look to solve them? If not, are there better approaches that I have overlooked?
After hearing a presentation from Mathieu Poirier, I thought sysfs was the (only) way to go. However, the decoded trace seems to show only the jump address, instead of both the source and destination addresses, and I did not find a register to change that. Also, the gathered trace seems to lose some of the branch addresses. Inserting a sleep instruction after each regular instruction in my test program fixed that. But since this should also work for closed source binaries and has to be fast, this is probably not an option.
Then I tried to copy the way "perf record" traces and to extract the relevant code parts. But then I realized that perf record doesn't use sysfs, apart from enabling the sink in "util/cs-etm.c" (which apparently is not used, and is not even deactivated afterward). So there is another way to gather trace, maybe by interacting with the CoreSight driver directly. But looking into the "perf report" source code, I couldn't find it yet.
Thanks and regards,
Dominik
Hi Dominik
On Fri, 9 Apr 2021 at 15:22, Dominik Huber dominik.huber@fau.de wrote:
Hello, I want to gather trace data of closed source binaries using CoreSight ETMv4 on a Hikey620. I want to know the source and destination address for all taken jumps of the traced program, like in the output of "perf script". It would be great if I could get feedback on how to achieve this. I'm not sure where to turn to with such a broad CoreSight problem, so I'm sorry if you are not the right ones to turn to, but I'd be happy for any help or advice you might have.
This is exactly the right place to ask for help on this!
My main problem is, that I don't know which approaches are promising to try. Below I describe two Ideas that I tried but where I got stuck after a while. Are they any good for my use case? If yes, then how can I solve the respective problems that have come up, or where can I look to solve them? If not, are there maybe better ways to approach this, which I've overlooked until now?
After hearing a presentation from Mathieu Poirier, I thought sysFS was the (only) way to go. However, the decoded trace seems to show only the jump address, instead of both the source and destination addresses, and I did not find a register to change that.
What do you mean by decoded trace here, and what are you using to decode the trace? If you look at the ETM spec / OpenCSD documentation you will see that fully decoding trace is a two step process:
1) Convert the trace byte stream into trace packets. This requires some minimal information regarding the configuration of the ETM.
2) Convert the trace packets into the fully decoded execution trace. This requires access to the binary images executed during the trace session.
The reason for this is that trace packets are highly compressed and contain the minimum of address information. The decoder has to walk the binary images to deduce which branches are taken and not taken. Address information is included in the trace packets only where it cannot be deduced from the code - and this is only ever target address information - the source address can always be deduced from the code walk.
Also, the trace gathered seems to lose some of the branch addresses. Inserting a sleep instruction after each regular instruction into my test program, fixed that. But since it should also work for closed source binaries and has to be fast this is probably not an option.
Then I tried to copy the way "perf record" is tracing, and extract the relevant code parts. But then I realized, that perf record doesn't use sysFS, apart from enabling the sink in "util/cs-etm.c" (which apparently is not used, and not even deactivated afterward).
perf uses the driver in a similar way to sysfs. It does in fact activate and de-activate the sources and sinks as the perf events are run on any CPU. perf also records the binaries used during the trace session - so that full trace decode is possible. With sysfs you can get raw trace data - but relating this to the binaries being executed is far more difficult.
If you are interested in tracing a particular binary & this is a userspace program then you may wish to try:
perf record -e cs_etm//u --per-thread <program-to-trace>
to ensure that any trace collected is related to the program you are interested in. You can then use the facilities of perf report / perf script to examine the trace.
Regards
Mike
So there is another way to gather trace, maybe by interacting with the CoreSight driver directly. But looking into the "perf report" source code I couldn't find it yet.
Thanks and regards,
Dominik
CoreSight mailing list CoreSight@lists.linaro.org https://lists.linaro.org/mailman/listinfo/coresight
-- Mike Leach Principal Engineer, ARM Ltd. Manchester Design Centre. UK
Am 09.04.2021 um 17:01 schrieb Mike Leach:
Hi Dominik
On Fri, 9 Apr 2021 at 15:22, Dominik Huber dominik.huber@fau.de wrote:
Hello, I want to gather trace data of closed source binaries using CoreSight ETMv4 on a Hikey620. I want to know the source and destination address for all taken jumps of the traced program, like in the output of "perf script". It would be great if I could get feedback on how to achieve this. I'm not sure where to turn to with such a broad CoreSight problem, so I'm sorry if you are not the right ones to turn to, but I'd be happy for any help or advice you might have.
This is exactly the right place to ask for help on this!
I'm glad to hear that!
My main problem is, that I don't know which approaches are promising to try. Below I describe two Ideas that I tried but where I got stuck after a while. Are they any good for my use case? If yes, then how can I solve the respective problems that have come up, or where can I look to solve them? If not, are there maybe better ways to approach this, which I've overlooked until now?
After hearing a presentation from Mathieu Poirier, I thought sysFS was the (only) way to go. However, the decoded trace seems to show only the jump address, instead of both the source and destination addresses, and I did not find a register to change that.
What do you mean by decoded trace here & what are you using to decode the trace? If you look at the ETM spec / OpenCSD documentation you will see that to fully decode trace there is a two step process.
1) convert the trace byte stream into trace packets. This will require some minimal information regarding the configuration of the ETM
2) convert the trace packets into the fully decoded execution trace. This requires access to the binary images executed during the trace session.
The reason for this is that trace packets are highly compressed, and contain the minimum of address information. Decode requires that the decoder will walk the binary images to deduce which branches are taken and not taken. Only where address information cannot be deduced from this code is it included in the trace packets - and this is only ever target address information - the source address can always be deduced from the code walk.
I used ptm2human for decoding. I also tried the c_api_pkt_print_test.c from OpenCSD, which decodes a single packet. However, depending on how I collected the trace data, it does not always produce decoded data. Once I have a working proof-of-concept, I intended to write my own OpenCSD decoder. Because of that, and because the decoding often just seemed "to work", I postponed any thoughts about the packet processing. What are these "binary images"? Until now, I only got a cstrace.bin from /dev/[sink], which I used as the only input for any decoding. How can I get them? Can I, by using them, obtain the destination addresses of branch instructions?
Also, the trace gathered seems to lose some of the branch addresses. Inserting a sleep instruction after each regular instruction into my test program, fixed that. But since it should also work for closed source binaries and has to be fast this is probably not an option.
Then I tried to copy the way "perf record" is tracing, and extract the relevant code parts. But then I realized, that perf record doesn't use sysFS, apart from enabling the sink in "util/cs-etm.c" (which apparently is not used, and not even deactivated afterward).
perf uses the driver in a similar way to sysfs. It does in fact activate and de-activate the sources and sinks as the perf events are run on any CPU. perf also records the binaries used during the trace session - so that full trace decode is possible. With sysfs you can get raw trace data - but relating this to the binaries being executed is far more difficult.
So is it your recommendation to use large parts of the perf source code for my project because it already makes use of these "binary images"? Up until now, I only thought about using a few hundred lines, because I believe perf has a lot of overhead (both in performance and code length) just to get branch instructions from an executed binary.
The most important factors for me are that 1) the trace includes all branches (including the jump destination), and 2) it is fast (faster than using hardware emulation and extracting the addresses from there)
What about the approaches to use the CoreSight driver directly or to use CSAL for trace collection? Would they maybe be better suited?
If you are interested in tracing a particular binary & this is a userspace program then you may wish to try:- perf record -e cs_etm//u --per-thread <program-to-trace> to ensure that any trace collected is related to the program you are interested in. You can then use the facilities of perf report / perf script to examine the trace.
Thanks, but I luckily already knew about that one.
Regards
Mike
So there is another way to gather trace, maybe by interacting with the CoreSight driver directly. But looking into the "perf report" source code I couldn't find it yet.
Thanks and regards,
Dominik
Hi,
On Sat, 10 Apr 2021 at 13:59, Dominik Huber dominik.huber@fau.de wrote:
Am 09.04.2021 um 17:01 schrieb Mike Leach:
Hi Dominik
On Fri, 9 Apr 2021 at 15:22, Dominik Huber dominik.huber@fau.de wrote:
Hello, I want to gather trace data of closed source binaries using CoreSight ETMv4 on a Hikey620. I want to know the source and destination address for all taken jumps of the traced program, like in the output of "perf script". It would be great if I could get feedback on how to achieve this. I'm not sure where to turn to with such a broad CoreSight problem, so I'm sorry if you are not the right ones to turn to, but I'd be happy for any help or advice you might have.
This is exactly the right place to ask for help on this!
I'm glad to hear that!
My main problem is, that I don't know which approaches are promising to try. Below I describe two Ideas that I tried but where I got stuck after a while. Are they any good for my use case? If yes, then how can I solve the respective problems that have come up, or where can I look to solve them? If not, are there maybe better ways to approach this, which I've overlooked until now?
After hearing a presentation from Mathieu Poirier, I thought sysFS was the (only) way to go. However, the decoded trace seems to show only the jump address, instead of both the source and destination addresses, and I did not find a register to change that.
What do you mean by decoded trace here & what are you using to decode the trace? If you look at the ETM spec / OpenCSD documentation you will see that to fully decode trace there is a two step process.
1) convert the trace byte stream into trace packets. This will require some minimal information regarding the configuration of the ETM
2) convert the trace packets into the fully decoded execution trace. This requires access to the binary images executed during the trace session.
The reason for this is that trace packets are highly compressed, and contain the minimum of address information. Decode requires that the decoder will walk the binary images to deduce which branches are taken and not taken. Only where address information cannot be deduced from this code is it included in the trace packets - and this is only ever target address information - the source address can always be deduced from the code walk.
I used ptm2human for decoding. I also tried the c_api_pkt_print_test.c from OpenCSD, which decodes a single packet. However, depending on how I collected the trace data, it does not always produce decoded data.
ptm2human performs the 1st stage of decode - byte stream to trace packets. The output from this is the same as the packet only decode from the OpenCSD library / trc_pkt_lister app. This will print out the trace packets - but be aware that in both cases the addresses that you see in the trace are not all the branch addresses used in the executed application.
Once I have a working proof-of-concept, I intended to write myself an OpenCSD decoder. Because of that, and because the decoding often just seemed "to work", I postponed any thoughts about the packet processing. What are these "binary images"? Until now, I got only a cstrace.bin from /dev/[sink], which I used as only Input for any decoding. How can I get them? Can I, by using them, obtain the destination addresses of branch instructions?
cstrace.bin is the binary trace data. The binary images I refer to are the memory images of the executed code - the decoder walks this memory image to deduce the path of the executed trace based on the opcodes encountered. These are either the binary files of your application and any loaded .so files, or may be a memory dump of the locations these were loaded during the trace session. Either will do to decode.
See https://github.com/Linaro/OpenCSD/blob/master/decoder/docs/prog_guide/prog_g...
So to successfully decode a trace session to obtain the source and target branch addresses you want, you will need the following:-
1) The captured binary trace data.
2) The configuration registers of the ETM.
3) The program binaries for all applications and .so libraries active during the trace session, and their load addresses.
Be aware that the ETM will trace everything running on a core - so for any analysis of a particular program it may be necessary to filter out anything unrelated.
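As a rough illustration of item 3, the load addresses of a running userspace program could be collected from /proc/<pid>/maps. The sketch below is only an assumption about how one might do this, not something perf or OpenCSD prescribes; the names `Mapping` and `executable_mappings` and the sample data are made up for the example.

```python
from typing import List, NamedTuple


class Mapping(NamedTuple):
    start: int   # load address of the mapping
    end: int     # exclusive end address
    offset: int  # offset into the backing file
    path: str    # backing file (program binary or .so)


def executable_mappings(maps_text: str) -> List[Mapping]:
    """Return file-backed executable mappings from /proc/<pid>/maps text."""
    result = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        # Lines without a sixth field are anonymous mappings.
        if len(fields) < 6:
            continue
        addr_range, perms, offset, _dev, _inode, path = fields
        # Keep only executable mappings backed by a real file
        # (this skips entries such as [stack] or [vdso]).
        if "x" not in perms or not path.startswith("/"):
            continue
        start, end = (int(a, 16) for a in addr_range.split("-"))
        result.append(Mapping(start, end, int(offset, 16), path.strip()))
    return result


# Invented sample content in the /proc/<pid>/maps format.
SAMPLE = """\
00400000-004e8000 r-xp 00000000 b3:09 1234 /usr/bin/testprog
004f7000-004fa000 rw-p 000e7000 b3:09 1234 /usr/bin/testprog
7f8a200000-7f8a340000 r-xp 00000000 b3:09 5678 /lib/aarch64-linux-gnu/libc.so.6
7ffc100000-7ffc121000 rw-p 00000000 00:00 0 [stack]
"""

for m in executable_mappings(SAMPLE):
    print(f"{m.path}: 0x{m.start:x}-0x{m.end:x} (file offset 0x{m.offset:x})")
```

The maps file has to be read while the traced program is still running, and libraries loaded later (for example via dlopen) would need a second snapshot.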
Also, the trace gathered seems to lose some of the branch addresses. Inserting a sleep instruction after each regular instruction into my test program, fixed that. But since it should also work for closed source binaries and has to be fast this is probably not an option.
Then I tried to copy the way "perf record" is tracing, and extract the relevant code parts. But then I realized, that perf record doesn't use sysFS, apart from enabling the sink in "util/cs-etm.c" (which apparently is not used, and not even deactivated afterward).
perf uses the driver in a similar way to sysfs. It does in fact activate and de-activate the sources and sinks as the perf events are run on any CPU. perf also records the binaries used during the trace session - so that full trace decode is possible. With sysfs you can get raw trace data - but relating this to the binaries being executed is far more difficult.
So is it your recommendation to use large parts of the perf source code for my project because it already makes use of these "binary images"?
I was not recommending re-use of perf source code - I was recommending using perf to capture and analyse trace. If this is not sufficient - or you want to write your own application, then perf serves as an example of the information you need to collect during a trace session to successfully decode the trace. perf does two tasks - capture - using perf record, and analysis using perf report. These are very separate elements - the first happens in kernel space, the second, often offline as a separate perf program, in user space.
Up until now, I only thought about using a few hundred lines, because I believe perf has a lot of overhead (both in performance and code length) just to get branch instructions from an executed binary.
The most important factors for me are that
- the trace includes all branches (including the jump destination), and
- it is fast (faster than using hardware emulation and extracting the addresses from there)
What about the approaches to use the CoreSight driver directly or to use CSAL for trace collection? Would they maybe be better suited?
The best method used for capturing trace is really an assessment for you to decide based on your requirements. CSAL / direct driver access / sysfs are all options - though I am not aware of anyone actually using a direct driver access method so could not really advise here.
Remember that whatever method you choose, in order to get the data that you require you will need to collect the additional information I describe above. Once you have this data you can then pass it to the OpenCSD library for full decode. This will output the executed trace ranges. You are then free to analyse these ranges to synthesise source / target data for branches.
Below is a short example of the trace output from the library - I have annotated this '***' to explain what is happening. This was generated using trc_pkt_lister test program from OpenCSD, running on one of the supplied test captures.
The I_ packets are the trace packets - as would be output from ptm2human, the OCSD_GEN_TRC packets are the output of the library - fully decoded trace ranges.
Idx:356; ID:12; [0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x80 ]; I_ASYNC : Alignment Synchronisation. *** start of trace - align the decoder to the incoming byte stream for the ETM programmed with trace ID 0x12
Idx:369; ID:12; [0x01 0x01 0x00 ]; I_TRACE_INFO : Trace Info.; INFO=0x0 { CC.0 } *** Some setup information regarding trace configuration - used internally by the decoder.
Idx:372; ID:12; [0xf7 ]; I_ATOM_F1 : Atom format 1.; E *** atom packet - skipped - cannot decode until we get an address / context packet
Idx:373; ID:12; [0x85 0x22 0x12 0x4d 0x00 0x00 0x00 0x00 0x00 0x30 ]; I_ADDR_CTXT_L_64IS0 : Address & Context, Long, 64 bit, IS0.; Addr=0x00000000004D2488; Ctxt: AArch64,EL0, NS; *** address and context packet - this gives the trace start address for the decoder + EL and ISA context.
Idx:384; ID:12; [0xf7 ]; I_ATOM_F1 : Atom format 1.; E *** Atom packet. This indicates that the program executed a number of instructions until it encountered a P0 element instruction. (P0 element instructions - previously referred to as waypoints in PTM - are instructions that change the address flow of the program. These are primarily branch instructions, but the full range of P0 instruction types is defined by the ETM4 protocol.) This instruction was executed.
Idx:373; ID:12; OCSD_GEN_TRC_ELEM_PE_CONTEXT((ISA=A64) EL0N; 64-bit; ) *** OpenCSD library outputs the context information for the client application. Where traced, the context also includes the Context ID register, which may contain the PID of the program running on the CPU under certain kernel configurations.
Idx:384; ID:12; OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x4d2488:[0x4d2494] num_i(3) last_sz(4) (ISA=A64) E iBR A64:ret ) *** OpenCSD outputs the executed instruction trace range: three instructions at addresses 0x4d2488 - 0x4d2493. This is output in response to the atom packet. This range was calculated by starting @ address 0x4d2488, walking through the program image from that address, examining opcodes until it found a P0 element (branch instruction) which it associates with the atom packet. This was an indirect branch (iBR A64:ret ) and also an aarch64 return instruction. This branch was taken (E). At this point we do not know the destination of the branch - so we are waiting for a target address packet in the input stream.
Idx:385; ID:12; [0x9d 0x48 0x5f 0x4d 0x00 0x00 0x00 0x00 0x00 ]; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0x00000000004DBF20; *** Address packet - this updates the decoder with the target of the prior indirect branch - which can now continue decoding
Idx:394; ID:12; [0xde ]; I_ATOM_F4 : Atom format 4.; NENE *** A single atom packet carrying 4 atoms - representing 4 waypoint instructions that were taken (E) or not taken (N). You will note that this single packet will decode into 4 separate executed instruction ranges - with no further address packets visible in the trace.
Idx:394; ID:12; OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x4dbf20:[0x4dbf24] num_i(1) last_sz(4) (ISA=A64) N BR <cond>) *** OpenCSD executed instruction range - 1 instruction - was a not taken branch. This range started @ 0x04DBF20 - taken from the address packet above. At this point the client program using the library can deduce the most recent taken branch information - as 0x4d2490=>0x4dbf20
Idx:394; ID:12; OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x4dbf24:[0x4dbf2c] num_i(2) last_sz(4) (ISA=A64) E BR b+link ) *** OpenCSD executed instruction range - 2 instructions - last instruction was a taken branch + link - in this case the decoder can calculate the target address from the instruction opcode - so no address data needs to appear in the trace.
Idx:394; ID:12; OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x4d1d88:[0x4d1db4] num_i(11) last_sz(4) (ISA=A64) N BR <cond>) *** OpenCSD executed instruction range - 11 instructions, starting at the address calculated from the previous range. So the start of the range was the result of a branch 0x4dbf28=>0x4d1d88. Ends in a not taken branch.
Idx:394; ID:12; OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x4d1db4:[0x4d1dbc] num_i(2) last_sz(4) (ISA=A64) E BR <cond>) *** OpenCSD executed instruction range - 2 instructions, last is a direct branch which will allow us to calculate the target address.
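For post-processing, the OCSD_GEN_TRC_ELEM_INSTR_RANGE lines in a listing like the one above could be picked out with a small text parser. This is only a sketch that assumes the line layout is exactly as printed here; the output format may differ between OpenCSD versions, and a real client would more likely use the library API directly. The names `InstrRange` and `parse_range` are made up for the example.

```python
import re
from typing import NamedTuple, Optional


class InstrRange(NamedTuple):
    start: int      # first executed address
    end: int        # exclusive end address (the value in brackets)
    num_instr: int  # number of instructions executed in the range
    last_size: int  # size in bytes of the last instruction
    taken: bool     # True for E (taken), False for N (not taken)


# Pattern written against the sample lines in this thread only.
RANGE_RE = re.compile(
    r"OCSD_GEN_TRC_ELEM_INSTR_RANGE\(exec range=0x([0-9a-fA-F]+):\[0x([0-9a-fA-F]+)\]"
    r"\s+num_i\((\d+)\)\s+last_sz\((\d+)\)\s+\(ISA=\w+\)\s+([EN])\b"
)


def parse_range(line: str) -> Optional[InstrRange]:
    """Parse one listing line; return None if it is not an instruction range."""
    m = RANGE_RE.search(line)
    if m is None:
        return None
    start, end, num_i, last_sz, atom = m.groups()
    return InstrRange(int(start, 16), int(end, 16), int(num_i), int(last_sz), atom == "E")


line = ("Idx:384; ID:12; OCSD_GEN_TRC_ELEM_INSTR_RANGE(exec range=0x4d2488:[0x4d2494] "
        "num_i(3) last_sz(4) (ISA=A64) E iBR A64:ret )")
r = parse_range(line)
print(hex(r.start), hex(r.end), r.num_instr, r.taken)
```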
Throughout this process the decoder maintains a current trace address from the incoming address packets, and by walking through the executed opcodes - calculating branch targets where possible. This is why the program and library binaries (or a memory dump of their load locations) are required for full trace decode and obtaining the information you require.
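To make the last step concrete, here is a minimal sketch of how a client might synthesise source / target pairs from consecutive instruction ranges. The range values are the five ranges from the example listing; the function name `taken_branches` and the tuple layout are assumptions made for this illustration. It ignores exceptions and trace discontinuities, which the library reports as separate element types and which a real client would have to handle.

```python
from typing import Iterable, Iterator, Tuple

# (start, exclusive end, size of last instruction, last instruction taken?)
Range = Tuple[int, int, int, bool]


def taken_branches(ranges: Iterable[Range]) -> Iterator[Tuple[int, int]]:
    """Yield (source, target) for each taken branch between consecutive ranges."""
    prev = None
    for cur in ranges:
        if prev is not None:
            _start, end, last_sz, taken = prev
            if taken:
                # Source is the last instruction of the previous range,
                # target is the address where execution resumed.
                yield end - last_sz, cur[0]
        prev = cur


# The five instruction ranges from the example listing, in order.
RANGES = [
    (0x4d2488, 0x4d2494, 4, True),
    (0x4dbf20, 0x4dbf24, 4, False),
    (0x4dbf24, 0x4dbf2c, 4, True),
    (0x4d1d88, 0x4d1db4, 4, False),
    (0x4d1db4, 0x4d1dbc, 4, True),
]

for src, dst in taken_branches(RANGES):
    print(f"0x{src:x} => 0x{dst:x}")
# prints:
# 0x4d2490 => 0x4dbf20
# 0x4dbf28 => 0x4d1d88
```

The final range also ends in a taken branch, but its target is only known once the next range arrives, so nothing is emitted for it here.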
Again, I would recommend reading the ETM protocol spec and the OpenCSD library documentation to get a full understanding of the protocols and how decoding works.
Regards
Mike
If you are interested in tracing a particular binary & this is a userspace program then you may wish to try:- perf record -e cs_etm//u --per-thread <program-to-trace> to ensure that any trace collected is related to the program you are interested in. You can then use the facilities of perf report / perf script to examine the trace.
Thanks, but I luckily already knew about that one.
Regards
Mike
So there is another way to gather trace, maybe by interacting with the CoreSight driver directly. But looking into the "perf report" source code I couldn't find it yet.
Thanks and regards,
Dominik
-- Mike Leach Principal Engineer, ARM Ltd. Manchester Design Centre. UK