On Fri, Sep 16, 2011 at 3:04 PM, David Long
<dave.long@linaro.org> wrote:
On Fri, 2011-09-16 at 11:30 +0200, Bianconi, Cyril wrote:
I don't think that the A9 issue is the same as the A8 one. However, the effects are the same, i.e. it's hard to use the PMU.
I cannot share the A9 errata document as-is for legal reasons, but I believe that I can explain the issue.
The issue happens when counters overflow (so I am not sure that this impacts OProfile).
Overflow is the only way of getting a counter interrupt, right? Then it's a fundamental problem for oprofile.
CB> Yes, to my understanding, this is the only way. I'm not an OProfile expert and don't know how it behaves internally. Here are my assumptions behind the "not sure that this impacts OProfile" remark:
CB> As I remember, the counter is 32 bits wide, so the interrupt should fire only after about 2 billion cycles, i.e. after about 2s on a device running at 1GHz.
CB> OProfile monitors the duration of processes or functions. My high-level view is that OProfile looks at this profiling counter at "system transitions" such as interrupts, context switches, ...
CB> This means that the monitored activity would have to run for longer than 2s without being preempted by the system in order to hit the issue. Is such a use case realistic? Or maybe I missed something.
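For reference, the overflow-time estimate above can be checked with a few lines of arithmetic. This is only a sketch: it assumes a 32-bit up-counter that increments once per cycle at 1GHz, and the helper name is made up for illustration. Under those assumptions a full wrap from zero takes 2**32 cycles, while the "about 2 billion cycles" figure corresponds to 2**31, i.e. a counter starting at half range.

```python
# Hypothetical helper (not from OProfile or the kernel): time until an
# up-counter of the given width wraps, starting from `start`.
def seconds_to_overflow(bits, clock_hz, start=0):
    events_left = (1 << bits) - start   # increments remaining before the wrap
    return events_left / clock_hz

CLOCK_HZ = 1_000_000_000  # 1GHz, as assumed in the message above

full_wrap = seconds_to_overflow(32, CLOCK_HZ)             # 2**32 cycles from zero
half_range = seconds_to_overflow(32, CLOCK_HZ, 1 << 31)   # 2**31 cycles left

print(f"{full_wrap:.2f} s, {half_range:.2f} s")  # prints "4.29 s, 2.15 s"
```

Either way the order of magnitude is a few seconds, provided the counter is left free-running rather than preloaded close to its maximum.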
Theoretically, an interrupt should fire in this case. In reality, this interrupt is lost randomly.
The workaround proposed by ARM is to use 2 counters: counter 0, and counter 1 initialized to counter 0 + 1. If one interrupt is lost, the other one should fire just after.
We have noticed that this may not be sufficient and that a third counter should be used to bring the proportion of lost interrupts close to 0%.
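The staggered initialization described above can be sketched as follows. This is an illustration only, not ARM's or OProfile's code: the function names and the sampling period are hypothetical, and it assumes 32-bit up-counters that raise their interrupt on the wrap to zero.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit counters

def staggered_preloads(period, n_counters=2):
    """Preload values where counter k starts at counter 0 + k (mod 2**32)."""
    base = (-period) & MASK32   # counter 0 wraps after `period` events
    return [(base + k) & MASK32 for k in range(n_counters)]

def events_until_overflow(preload):
    """Number of increments before a counter holding `preload` wraps to zero."""
    return (MASK32 + 1) - preload

# Hypothetical sampling period of 100000 events, with three counters.
for k, value in enumerate(staggered_preloads(100_000, 3)):
    print(k, hex(value), events_until_overflow(value))
```

With these preloads the counters overflow on consecutive events, so if one overflow interrupt is dropped, another counter's interrupt arrives within one or two events of it.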
So, even with three counters there's still a statistical chance of failure?
CB> ARM did not explain the root cause of their issue but only proposed a workaround, so it's quite difficult to know the probability of the issue.
CB> However, you are right that there is always a statistical chance of failure. You can only reduce the probability.
CB> I saw the following percentages of missed interrupts in my tests (a few tens of seconds each):
CB> 1 counter: about 28%
CB> 2 counters: about 5.5%
CB> 3 counters: about 0%
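As a rough sanity check on these figures, one can compare them with what a naive model would predict if each counter lost its interrupt independently with the single-counter rate of 28%. That independence is purely an assumption made for this sketch (the root cause is unknown), and the helper name is hypothetical.

```python
def predicted_miss_rate(single_miss, n_counters):
    """Chance that all n interrupts are lost, assuming independent losses."""
    return single_miss ** n_counters

# Measured values quoted above ("about 0%" is taken as 0.0 here).
observed = {1: 0.28, 2: 0.055, 3: 0.0}

for n, seen in observed.items():
    model = predicted_miss_rate(0.28, n)
    print(f"{n} counter(s): independent model {model:.1%}, observed about {seen:.1%}")
```

The independent model gives roughly 7.8% for two counters and 2.2% for three, somewhat higher than the measured values. Given how short the test runs were, this should not be over-interpreted, but it does fit the point that adding counters only reduces the probability of a lost interrupt.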
Note: This HW issue was fixed by ARM quite "late", so I think that most of the devices on the market are likely impacted.
Are there part numbers that we can be reasonably sure do work, say perhaps the 4460?
CB> For legal reasons, I don't think that I can provide the revision of the A9 in the 4460. However, ARM fixed it late, i.e. in A9 r3p0, so the r0, r1 and r2 "series" are impacted. I don't think that the 4460 uses r3. Maybe the official TI representative at Linaro can provide you with this info.
Thanks,
-dl