David,

Please find my replies below (marked CB>). I hope they help.
Sorry for all the "I cannot say" answers, which are due to legal constraints, but you should get in touch with the ARM people at Linaro to get the errata list. It will not give you much more information, but at least you will have their official communication. They could also provide more details from a HW point of view.
The issue is referenced by ARM as 751469.

Best regards,
Cyril


Texas Instruments France SA, 821 Avenue Jack Kilby, 06270 Villeneuve Loubet. 036 420 040 R.C.S Antibes. Capital de EUR 753.920




On Fri, Sep 16, 2011 at 3:04 PM, David Long <dave.long@linaro.org> wrote:
On Fri, 2011-09-16 at 11:30 +0200, Bianconi, Cyril wrote:
I don't think that the A9 issue is the same as the A8 one. However, the effects are the same, i.e. it's hard to use the PMU.

I cannot share the A9 errata document as-is for legal reasons, but I believe that I can explain the issue.
The issue happens when counters overflow (so I am not sure that it impacts OProfile).

Overflow is the only way of getting a counter interrupt, right?  Then it's a fundamental problem for oprofile.
 
CB> Yes, to my understanding, this is the only way. I'm not an OProfile expert and don't know how it behaves internally. Here are my assumptions behind the "not sure that this impacts OProfile":
CB> As I remember, the counter is 32 bits wide, so the interrupt should fire only after about 2 billion cycles, i.e. after about 2 s for a device running at 1 GHz.
CB> OProfile monitors the duration of processes or functions. My high-level view is that OProfile reads this profiling counter at "system transitions" such as interrupts, context switches, ...
CB> This means that the monitored activity would have to run for longer than 2 s without being preempted by the system in order to hit the issue. Is such a use case realistic? Or maybe I missed something.
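
[Editor's sketch, not part of the original exchange] The timing estimate above can be checked with a few lines of Python. This is only a rough illustration: the function name and the 100,000-event preset are hypothetical, and the idea that a sampling tool might preset the counter close to its maximum is an assumption that the thread does not confirm. A 32-bit counter starting from zero wraps after 2**32 counts (about 4.3 billion); the "about 2 billion" figure corresponds to 2**31.

```python
def seconds_to_overflow(start_value, clock_hz, counter_bits=32):
    """Time until a counter that increments once per clock cycle wraps."""
    remaining = (1 << counter_bits) - start_value
    return remaining / clock_hz

CLOCK_HZ = 1_000_000_000  # 1 GHz, as in the estimate above

# Counter started at zero: a full 2**32-count wrap.
full_wrap = seconds_to_overflow(0, CLOCK_HZ)

# Counter started halfway: 2**31 counts, the "about 2 billion" figure.
half_wrap = seconds_to_overflow(1 << 31, CLOCK_HZ)

# Hypothetical preset: a tool that wants an interrupt every 100,000 cycles
# could start the counter at 2**32 - 100_000 (assumption, not from the thread).
preset_wrap = seconds_to_overflow((1 << 32) - 100_000, CLOCK_HZ)

print(f"{full_wrap:.2f} s, {half_wrap:.2f} s, {preset_wrap * 1e6:.0f} us")
```

If a profiler does preset its counters this way, overflows would occur far more often than once every few seconds, which would make lost interrupts matter much more.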


Theoretically, an interrupt should fire in this case. In reality, this interrupt is randomly lost.
The workaround proposed by ARM is to use 2 counters: counter0, and counter1 initialized at counter0+1. If one interrupt is lost, the other one should fire just after.
We have noticed that this may not be sufficient and that a third counter should be used to bring the share of lost interrupts close to 0%.
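
[Editor's sketch, not part of the original exchange] One way to picture the staggered-counter workaround is to compute start values that make the counters overflow one event apart. The helper names and the sample period below are hypothetical, and the sketch assumes 32-bit counters that all count the same event; it does not touch real PMU registers.

```python
COUNTER_MAX = 1 << 32  # assumes 32-bit counters

def staggered_presets(period, n_counters):
    """Start values where counter k begins at counter0 + k."""
    if period < n_counters:
        raise ValueError("period must be at least the number of counters")
    base = COUNTER_MAX - period  # counter0 overflows after `period` events
    return [base + k for k in range(n_counters)]

def events_until_overflow(preset):
    """Number of counted events before a counter started at `preset` wraps."""
    return COUNTER_MAX - preset

# Three counters, hypothetical period of 100,000 events.
presets = staggered_presets(period=100_000, n_counters=3)
gaps = [events_until_overflow(p) for p in presets]
print(gaps)
```

With these values the overflows land on consecutive events, so if one interrupt is lost the next counter's interrupt arrives almost immediately.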

So, even with three counters there's still a statistical chance of failure?

CB> ARM did not explain the root cause of the issue and only proposed a workaround, so it's quite difficult to know the probability of the issue.
CB> However, you are right that there is always a statistical chance of failure. You can only reduce the probability.
CB> I saw the following percentages of missed interrupts in my tests (runs of a few tens of seconds):
CB> 1 counter: about 28%
CB> 2 counters: about 5.5%
CB> 3 counters: about 0%
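
[Editor's sketch, not part of the original exchange] As a rough sanity check on these figures, one can compare them with what a naive model would predict if each counter's interrupt were lost independently with the single-counter probability. The independence assumption is purely illustrative; the root cause is not described in this thread, so the real behaviour may well differ.

```python
def independent_miss_rate(p_single, n_counters):
    """Chance that all n interrupts are lost, if losses were independent."""
    return p_single ** n_counters

# Approximate figures quoted above (fractions of missed interrupts).
observed = {1: 0.28, 2: 0.055, 3: 0.0}

for n, seen in observed.items():
    predicted = independent_miss_rate(observed[1], n)
    print(f"{n} counter(s): observed ~{seen:.1%}, independent model {predicted:.1%}")
```

Under this naive model two counters would still miss about 7.8% and three about 2.2%, so the measured results are somewhat better than independent losses would suggest.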
 

Note: ARM fixed this HW issue quite "late", so I think that most of the devices on the market are impacted.

Are there part numbers that we can be reasonably sure do work, say perhaps the 4460?

CB> For legal reasons, I don't think that I can provide the revision of the A9 in the 4460. However, ARM fixed the issue late, i.e. in A9 r3p0, so the r0, r1 and r2 "series" are impacted. I don't think that the 4460 uses r3. Maybe the official TI representative at Linaro can provide you with this information.
 

Thanks,
-dl