Hi

On 15 March 2017 at 17:33, Muhammad Abdul WAHAB <muhammadabdul.wahab@centralesupelec.fr> wrote:
Hi Mike,

Le 15/03/2017 à 18:04, Mike Leach a écrit :
What bandwidth are you interested in? The throughput of the trace infrastructure will depend on the system design, activated trace buffers and trace sinks, trace bus width, and other factors. Trace generation rate is dependent on the code being executed. If the generation rate at the currently traced cores exceeds the rate at which the system can sink the trace data, then trace generation will stall until bandwidth becomes available. This stalling will not affect the operation of the cores - trace will be lost but the core will continue to execute at full speed.
I am using CoreSight components on a Zynq SoC with the Linux 4.4 kernel, with the PTM as trace source and the TPIU as trace sink. I want to evaluate the CoreSight components to find out under what conditions trace will be lost. That's why I am trying to evaluate the maximum trace throughput, but it does not seem obvious how to do it. It will depend on multiple factors, mainly on the traced program. What kind of programs do I need to trace in order to provoke an overflow of the PTM's internal FIFO? Are there any benchmarks to evaluate debug components?
 
Program factors that can affect the amount of trace generated are anything that results in a number of changes in program flow. A long loop with a single test / exit point will trace with few output packets; more complex flow structures will increase the output.
1) Interrupts - these will cause a significant program flow change.
2) Complex code with many indirect branches.

Additionally there are features in the PTM that can be turned on which will increase the amount of trace being generated. These are normally used for trace investigation of specific issues, but you might wish to try switching them on to see if overflows can be provoked for a given program:-
cycle-accurate trace, branch broadcast, and an increased rate of sync packet emission.
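As a rough self-hosted sketch of how these might be switched on through the coresight-etm3x sysfs interface - the node name is a placeholder, and the attribute names and bit values are assumptions you should check against Documentation/ABI/testing/sysfs-bus-coresight-devices-etm3x and coresight-etm.h in your 4.4 tree:

PTM=/sys/bus/coresight/devices/<your-ptm-node>   # placeholder - whatever the PTM appears as on Zynq
cat $PTM/mode                                    # current ETM_MODE_* bitmask
# set cycle-accurate + branch broadcast, assuming CYCACC=BIT(1) and BBROAD=BIT(5)
# as in coresight-etm.h - verify against your kernel source:
echo 0x22 > $PTM/mode
cat $PTM/sync_freq                               # sync packet frequency - reducing this should give
                                                 # more frequent syncs; see the PFT/ETM docs for meaning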

The PTM will not overflow in isolation - only back pressure from the trace sink (TPIU) being unable to output existing trace will eventually fill any internal buffers and cause the PTM to drop trace and emit overflow packets. Also affecting this will be other active trace paths - in a multicore system you may have N core/PTM combinations tracing through a funnel into the TPIU.

The effective output clock frequency and output port width will in general have the largest effect on whether the TPIU is able to keep up with the incoming trace.
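To put rough numbers on that (illustration only, ignoring protocol overheads): a 16-bit trace port captured at 250 MHz can move at most 2 bytes x 250 MHz = 500 MB/s of trace. Narrow the port to 4 bits or halve the clock and the available bandwidth falls proportionally, so a program that traced cleanly before may now overflow.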

There are no standard benchmarks I know of to evaluate trace components or a given trace system.

In general trace technique is about balancing the amount of trace generated with the problem being investigated. It would be unusual for any significant application or system to be traced in its entirety from start to finish - even if the TPIU is able to keep up with trace generation, the capture device will still have a limited (even if it is measured in GB) capture buffer.

So we use different techniques to control the generation of trace, or focus on specific areas:-
1) trace filters - limit the traced address range to functions / libraries of interest (see the sketch after this list).
2) trace triggers - set a point in the program where we believe an error occurs and capture trace before / after the event. Generally we can control the % ratio of trace before and after the event, and halt trace automatically once that ratio is achieved. This event might be a kernel panic for example - trace up to the crash is a good way of determining what caused the issue.
3) if we are using an external debugger then we can set breakpoints and trace up to a breakpoint.
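As an (untested) illustration of 1) on a self-hosted system, the PTM address comparators are exposed through the same sysfs node. The attribute names come from the kernel's sysfs-bus-coresight-devices-etm3x ABI file, but the node name, the addresses, and the exact write format are assumptions to verify on your board:

PTM=/sys/bus/coresight/devices/<your-ptm-node>   # placeholder
echo 0 > $PTM/addr_idx                           # select the first address comparator pair
# restrict trace to a single address range, e.g. the function/library of interest:
echo 0x00010000 0x00020000 > $PTM/addr_range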

In Linux, pinning the traced application to a core and only starting trace on that core is also a useful technique.
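For example, a sketch of the sysfs flow described in Documentation/trace/coresight.txt - the device node names and the application name below are placeholders for whatever appears under /sys/bus/coresight/devices on your Zynq:

echo 1 > /sys/bus/coresight/devices/<tpiu-node>/enable_sink        # select the TPIU as sink
echo 1 > /sys/bus/coresight/devices/<cpu0-ptm-node>/enable_source  # start tracing from CPU0's PTM
taskset -c 0 ./my_app                                              # pin the application to CPU0
echo 0 > /sys/bus/coresight/devices/<cpu0-ptm-node>/enable_source  # stop tracing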

Are you currently designing a SoC or are you using an existing device?

If you are in the design stage then I recommend contacting ARM support who may be able to help with energy estimates for these components - hardware emulation should also help in this case.

For existing devices it rather depends on whether your SoC has any self power monitoring capability. If so then using this with trace powered and running or not should give you an answer. Otherwise you may need to contact the device manufacturer.
I am using Zynq SoC. You are right. I don't think there is any self power monitoring capability in Zynq. I need to ask somebody from Xilinx about this.
That would really depend on the purpose of your evaluation - what are you trying to discover? Are you using them with an external debugger / on a bare-metal system, or self-hosted on a Linux-based board?
I am trying to find out what type of programs I will not be able to trace on my Zynq SoC using the CoreSight components (PTM as trace source, TPIU as trace sink). Would the same program be successfully traced using the ETB as the trace sink? That's why I thought about evaluating the trace bandwidth.

In general the ETB is much smaller than any capture device - typically 8k - 64k when compared to MB/GB for external capture devices. Best use of ETBs tends to be as a ring buffer with the capture triggers described above. In the OpenCSD work we can use perf to capture trace - it will copy out of the ETF/ETB periodically, but this will result in some trace loss compared to larger buffers (the buffer will wrap).

To put some numbers on this, we ran an example on the ARM Juno board using perf to monitor the 'uname' command. 'uname' is about as short an application as you can get. perf handles the trace enable / disable as the program is scheduled, so we are in effect only tracing this application on the currently active core, along with kernel calls made by the application.
On Juno we can either use an ETF (in ETB mode) 64k in size, or the ETR - this traces to a 3MB buffer in system memory.
With the ETF/ETB we typically get between 64 and 90 k of trace data - this varies quite a lot between runs and will depend on how often the buffer has wrapped and when perf decides to read it.
With the ETR to system memory we get about 165k of trace data - this does not vary much - the variance can probably be accounted for by interrupts causing stops / restarts of the application and trace. We have not checked line by line that we have traced every instruction of the application, but the lack of variance would suggest we have.
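For reference, the perf side of that experiment boils down to something like the following. This assumes a kernel and perf build with the cs_etm PMU / OpenCSD support we are working on; on kernels of this era the sink is typically selected via its sysfs enable_sink file first, and the node name below is a placeholder:

echo 1 > /sys/bus/coresight/devices/<etf-or-etr-node>/enable_sink   # choose the capture buffer
perf record -e cs_etm// --per-thread uname                          # trace only the uname thread
perf report --stdio                                                 # needs an OpenCSD-enabled perf to decode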

This brings me to the final point - if you can trace the entire program, you are going to have MB/GB of trace data, representing millions of instructions (a single PTM trace byte might represent 16 waypoint instructions, plus tens of instructions in between each waypoint). You now need to be able to search this trace for the point of investigation.
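As a very rough illustration of the volumes involved: a core retiring on the order of 10^9 instructions per second, compressed to even a few hundred instructions per trace byte on average, still produces several MB of trace per second of execution - so minutes of run time quickly reach the GB range.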

Therefore the trace techniques above again become useful - to reduce the amount of trace to a usable level around the target of investigation. Trace of a working program is usually dull - unless you are looking for coverage statistics.

 

You are correct - the operation of the PTM will not slow down the associated core. It is coupled directly to the core and monitors certain signals which allow waypoint trace to be generated. Waypoints are flow control changes that cannot be directly inferred from the opcodes being executed. The trace decoder will then take the waypoint trace packets and the memory image of the opcodes executed to decode the program flow over time. So not every instruction is traced as an output packet.

I recommend reading the PFT architecture manual for further explanation of program flow (PTM) trace:
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0035b/index.html
Thanks.

Thank you for your quick reply.
Best Regards,
M.Abdul WAHAB
 
Best Regards

Mike
--
Mike Leach
Principal Engineer, ARM Ltd.
Blackburn Design Centre. UK