On 5 May 2011 18:44, Måns Rullgård firstname.lastname@example.org wrote:
The relative performance of NEON vs non-NEON seems to depend a lot on the size (relative to cache), alignment, and whether or not any prefetching (explicit PLD, automatic, or preload engine) is used.
Yes, agreed - NEON does very well in non-aligned cases (I have some graphs for those but am still working on them). I'd been wondering about using NEON only for the non-aligned cases.
For large copies (much larger than L2) NEON with prefetching wins in my testing (don't have numbers handy right now).
OK, I've not tried anything larger than about 256k chunks - I'm a) not sure really big copies are all that common and b) with larger L2 caches it seems less and less likely that a copy will exceed the cache size. Of course, that all comes down to the workload and exactly which cases you care about, which is very difficult to pin down.
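To make the "explicit PLD" idea concrete, here is a minimal sketch of a copy loop that issues a prefetch hint ahead of the read position. It is not code from either poster: `copy_with_prefetch`, `PREFETCH_AHEAD` and `CHUNK` are hypothetical names and values chosen for illustration, and it relies on the GCC/Clang builtin `__builtin_prefetch`, which on ARM targets is normally emitted as a PLD instruction. The right prefetch distance depends on the core and the memory system, so treat the numbers as placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed tuning values, not measurements from the thread. */
#define PREFETCH_AHEAD 256  /* how far ahead of the read position to hint, in bytes */
#define CHUNK 64            /* bytes copied per loop iteration */

/* Hypothetical helper: copy n bytes from src to dst (non-overlapping),
 * hinting the cache about source data that will be read soon. */
static void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= CHUNK) {
        /* Only hint while the target address is still inside the buffer. */
        if (n > PREFETCH_AHEAD)
            __builtin_prefetch(s + PREFETCH_AHEAD, 0 /* read */, 0 /* streaming */);

        memcpy(d, s, CHUNK);  /* fixed size, so the compiler can expand it inline */
        d += CHUNK;
        s += CHUNK;
        n -= CHUNK;
    }

    if (n > 0)
        memcpy(d, s, n);      /* tail of fewer than CHUNK bytes */

    return dst;
}
```

Whether a hint like this helps is exactly the open question here: for data already in cache it is extra work, while for copies much larger than L2 it may hide memory latency.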
For already cached data, things may be different. The A8 is also special with the fast path between L2 and NEON which the A9 lacks for obvious reasons.
So the thing I don't yet understand is whether the A8 is special or the A9 is special. If the A8 is special and the A9's current behaviour is what other cores will do in the future, then fine; if the A9 is special, then NEON may well be good in the future.
I have also observed significant variation depending on the relative alignment of source and destination buffers, probably some cache effect.
Any claim of X being faster than Y should also specify the sizes for which the claim is valid. I have previously looked mostly at large copies, while you seem to be focused on small sizes. That is probably the reason for our different experiences. I'll try to be more specific from now on.
It also gets very difficult to present - I have one set of data which is basically that set of graphs but iterated over different source and destination alignments - and that's just a sea of graphs!
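For anyone wanting to reproduce this kind of sweep, the following is a rough sketch of a harness that times a copy routine over a grid of sizes and source/destination offsets and emits one CSV row per combination. It is an assumption about how such a measurement could be structured, not the harness used for the graphs discussed here: `bench_copy`, `run_sweep`, the size list and the offset list are all hypothetical, and timing uses POSIX `clock_gettime`.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any routine with memcpy's signature can be measured. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

#define ALIGN 64u  /* assumed base alignment; offsets are relative to it */

/* Round a pointer up to the next ALIGN-byte boundary. */
static unsigned char *align_up(unsigned char *p)
{
    uintptr_t a = ((uintptr_t)p + ALIGN - 1) & ~(uintptr_t)(ALIGN - 1);
    return p + (a - (uintptr_t)p);
}

/* Time `iters` copies of `size` bytes at the given offsets.
 * Returns throughput in MB/s, or -1.0 on allocation failure or a bad copy. */
static double bench_copy(copy_fn fn, size_t size, size_t src_off,
                         size_t dst_off, int iters)
{
    assert(fn != NULL && src_off < ALIGN && dst_off < ALIGN && iters > 0);

    size_t len = size + 2 * ALIGN;  /* room for aligning up plus the offset */
    unsigned char *src_raw = malloc(len);
    unsigned char *dst_raw = malloc(len);
    double result = -1.0;

    if (src_raw != NULL && dst_raw != NULL) {
        unsigned char *src = align_up(src_raw) + src_off;
        unsigned char *dst = align_up(dst_raw) + dst_off;
        memset(src_raw, 0xA5, len);
        memset(dst_raw, 0, len);

        copy_fn volatile f = fn;  /* volatile so the calls cannot be elided */
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
            f(dst, src, size);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (double)(t1.tv_sec - t0.tv_sec)
                    + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
        if (secs < 1e-9)
            secs = 1e-9;          /* avoid dividing by zero on coarse clocks */

        if (memcmp(dst, src, size) == 0)
            result = (double)size * iters / secs / 1e6;
    }

    free(src_raw);
    free(dst_raw);
    return result;
}

/* Sweep sizes and offsets, writing CSV rows; returns the number of rows. */
static int run_sweep(copy_fn fn, FILE *out)
{
    static const size_t sizes[] = { 64, 4096, 262144 };  /* up to 256k */
    static const size_t offs[]  = { 0, 1, 4, 8 };
    int rows = 0;

    fprintf(out, "size,src_off,dst_off,MB_per_s\n");
    for (size_t s = 0; s < sizeof sizes / sizeof sizes[0]; s++)
        for (size_t i = 0; i < sizeof offs / sizeof offs[0]; i++)
            for (size_t j = 0; j < sizeof offs / sizeof offs[0]; j++) {
                int iters = (int)((4u << 20) / sizes[s]);  /* ~4 MiB moved per cell */
                double mbps = bench_copy(fn, sizes[s], offs[i], offs[j], iters);
                fprintf(out, "%zu,%zu,%zu,%.1f\n",
                        sizes[s], offs[i], offs[j], mbps);
                rows++;
            }
    return rows;
}
```

Calling `run_sweep(memcpy, stdout)` from a `main` gives one flat table that can be pivoted into a heat map per size, which may be easier to read than a separate graph per alignment pair. Because every buffer is copied repeatedly, these numbers reflect the warm-cache case for sizes that fit in L2; cold-cache and larger-than-L2 behaviour would need a different setup.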