David Gilbert <david.gilbert@linaro.org> writes:
On 5 May 2011 18:44, Måns Rullgård <mans@mansr.com> wrote:
The relative performance of NEON vs non-NEON seems to depend a lot on the size (relative to cache), alignment, and whether or not any prefetching (explicit PLD, automatic, or preload engine) is used.
Yes, agreed - Neon does very well in non-aligned cases (I have some graphs for non-aligned cases but am still working on it). I'd been wondering about using Neon only for the non-aligned cases.
For large copies (much larger than L2) NEON with prefetching wins in my testing (don't have numbers handy right now).
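To make "explicit prefetching" concrete, here is a minimal C sketch of a copy loop with a software prefetch hint. It is not the NEON routine discussed in this thread. The name copy_with_prefetch, the 64-byte chunk and the 256-byte prefetch distance are illustrative assumptions, and __builtin_prefetch is a GCC/Clang builtin (on ARM it is normally emitted as PLD).

```c
#include <stddef.h>
#include <string.h>

/* Assumed tuning values, not measured ones. */
#define PREFETCH_DISTANCE 256   /* bytes ahead of the read pointer */
#define CHUNK 64                /* bytes copied per loop iteration */

/* Copy n bytes between non-overlapping buffers, hinting upcoming
 * source data into the cache before it is needed. */
void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= CHUNK) {
        /* Only hint addresses that are still inside the source buffer. */
        if (n > PREFETCH_DISTANCE)
            __builtin_prefetch(s + PREFETCH_DISTANCE, 0, 0);
        memcpy(d, s, CHUNK);    /* fixed size, so the compiler may inline it */
        d += CHUNK;
        s += CHUNK;
        n -= CHUNK;
    }
    memcpy(d, s, n);            /* remaining tail, possibly zero bytes */
    return dst;
}
```

A useful prefetch distance depends on the core and on memory latency, so the value above would have to be tuned per platform.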
OK, I've not tried stuff larger than about 256k chunks - I'm a) not sure it's really common to have really big copies and
Not in the kernel at least, and that's the focus of this email, if not (at least not obviously) the wiki page.
b) with the larger L2 caches it seems less and less likely that copies will be larger than the cache. Of course, that all comes down to the workload and exactly which cases you care about, which is very difficult to pin down.
What really matters is whether source, destination, or both are already in some cache. Even a small copy of data currently not in any cache will perform similarly to a large one. Really tiny copies are another matter, but they should probably be inlined anyway.
For already cached data, things may be different. The A8 is also special with the fast path between L2 and NEON which the A9 lacks for obvious reasons.
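One way to separate the cached and uncached cases when measuring is sketched below in C. This is a rough harness under stated assumptions, not the benchmark behind any numbers in this thread. The names evict_caches and time_copy_ns are hypothetical, the 8 MiB scratch size is simply assumed to exceed the last-level cache, and timespec_get (C11) reads wall-clock time, so single measurements are noisy.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumed to be larger than the last-level cache of the target. */
#define EVICT_BYTES (8u * 1024u * 1024u)

/* Approximate cache eviction: write through a large scratch buffer so
 * that previously cached lines are likely to be displaced. */
void evict_caches(void)
{
    static volatile unsigned char *scratch;

    if (!scratch)
        scratch = malloc(EVICT_BYTES);
    if (!scratch)
        return;                     /* allocation failed, skip eviction */
    for (size_t i = 0; i < EVICT_BYTES; i += 32)
        scratch[i] = (unsigned char)i;
}

/* Time one memcpy of n bytes, in nanoseconds.
 * cold != 0: try to evict both buffers first.
 * cold == 0: do a warm-up copy so both buffers are likely cached. */
int64_t time_copy_ns(void *dst, const void *src, size_t n, int cold)
{
    struct timespec t0, t1;

    if (cold)
        evict_caches();
    else
        memcpy(dst, src, n);

    timespec_get(&t0, TIME_UTC);
    memcpy(dst, src, n);
    timespec_get(&t1, TIME_UTC);

    return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (int64_t)(t1.tv_nsec - t0.tv_nsec);
}
```

Calling time_copy_ns many times for each size, once with cold set and once without, and comparing medians would give a rough picture of how much the cache state matters relative to the copy size.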
So what I don't yet understand is whether the A8 or the A9 is the special case. If the A8 is special and the A9's current behaviour is what other cores will show in the future, then fine; if the A9 is special, then Neon may well be good in the future.
I think both are special in different ways. If I were to guess, I'd say the direct L2 path of the A8 is unlikely to show up again, whereas some of the weak points of the A9 (e.g. 64-bit data paths vs 128-bit in A8) will probably go away in some future core.