David Gilbert david.gilbert@linaro.org writes:
On 5 May 2011 18:17, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
On 5 May 2011 16:08, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with use of NEON for large copies. For small copies, plain ARM might be faster since the overhead of preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
My tests here: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
At the top of the page: Do not rely on or use the numbers.
That's OK; we put that in there since we're still experimenting.
At the bottom of the page are sets of graphs for A9 (left) and A8 (right); on A9 the Neon memcpys (red and green) top out much lower than their best non-Neon equivalents (black and cyan).
That page is rather fuzzy on exactly what code was being tested as well as how the tests were performed. Without some actual code with which one can reproduce the results, those figures should not be used as a basis for any decisions.
I'm happy to post my test harness; I've copied and pasted the main memcpy speed test below; give me a day or two and I can clean the whole thing up to run standalone.
Thanks. It's easier to have a meaningful discussion when the details are known.
Also, when I showed those numbers to the guys at ARM they all said it was a bad idea to use Neon on A9 for memory manipulation workloads.
I have heard many claims passed around concerning memcpy on A9, none of which I have been able to reproduce myself. Some allegedly came from people at ARM.
What code do you base your claims on :-)
My own testing wherein the Bionic NEON memcpy vastly outperformed both glibc and Bionic ARMv5 memcpy.
Can you provide me with some actual results? You seem to be disputing my actual numbers that agree with the comments from the guys from ARM with an argument saying that you have seen the opposite - which I'm happy to believe, but I'd like to understand why.
Note also the graphs I produced for memset show similar behaviour: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemset
In this case, memset on both A9s is slower with Neon than without.
The relative performance of NEON vs non-NEON seems to depend a lot on the size (relative to cache), alignment, and whether or not any prefetching (explicit PLD, automatic, or preload engine) is used. For large copies (much larger than L2) NEON with prefetching wins in my testing (don't have numbers handy right now). For already cached data, things may be different. The A8 is also special with the fast path between L2 and NEON which the A9 lacks for obvious reasons.
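To make the "explicit PLD" variant concrete, here is a minimal sketch of a block copy that issues a prefetch hint ahead of the read pointer. It is not any of the routines measured in this thread. The names copy_with_prefetch, BLOCK and PREFETCH_AHEAD are hypothetical, the distance of 256 bytes is an untuned guess, and the inner fixed-size memcpy only stands in for the LDM/STM or NEON block a real implementation would use. __builtin_prefetch is a GCC/Clang builtin; on ARM targets it is normally emitted as PLD.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tuning constants, not taken from any measured code. */
#define BLOCK          64   /* bytes copied per loop iteration */
#define PREFETCH_AHEAD 256  /* how far ahead of the read pointer to hint */

/* Copy n bytes from src to dst (buffers must not overlap), issuing one
 * read-prefetch hint per block while enough source data remains. */
void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(dst != NULL && src != NULL);

    while (n >= BLOCK) {
        /* Only hint at addresses that are still inside the source buffer. */
        if (n > PREFETCH_AHEAD)
            __builtin_prefetch(s + PREFETCH_AHEAD, 0, 0);
        memcpy(d, s, BLOCK);  /* stand-in for an LDM/STM or NEON block */
        d += BLOCK;
        s += BLOCK;
        n -= BLOCK;
    }
    memcpy(d, s, n);  /* tail of fewer than BLOCK bytes */
    return dst;
}
```

Whether the hint helps at all depends on the core, the prefetch distance and the copy size, so the constant would have to be swept rather than assumed.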
I have also observed significant variation depending on the relative alignment of source and destination buffers, probably some cache effect.
Any claim of X being faster than Y should also specify for which sizes the claim is valid. I have previously looked mostly at large copies, while you seem to be focused on small sizes. That is probably the reason for our different experiences. I'll try to be more specific from now on.
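As a starting point for comparing numbers, a measurement that states size and alignment explicitly could look roughly like the sketch below. It is not the harness behind the wiki graphs or my own tests. The function memcpy_throughput and its parameters are hypothetical names, the 64-byte base alignment is an assumption, and it times whatever memcpy the C library provides, calling it through a volatile function pointer so the compiler cannot inline or drop the call.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for clock_gettime() */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define ALIGN 64  /* assumed base alignment for both buffers */

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

static unsigned char *align_up(unsigned char *p)
{
    uintptr_t a = ((uintptr_t)p + ALIGN - 1) & ~(uintptr_t)(ALIGN - 1);
    return p + (a - (uintptr_t)p);
}

/* Time `iters` copies of `size` bytes with the source and destination
 * placed src_off and dst_off bytes past an ALIGN-byte boundary.
 * Returns MB/s, 0.0 if the run was too short to time, -1.0 on error. */
double memcpy_throughput(size_t size, size_t src_off, size_t dst_off,
                         unsigned iters)
{
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    unsigned char *src_raw, *dst_raw, *src, *dst;
    double start, elapsed;
    unsigned i;
    int ok;

    assert(size > 0 && iters > 0 && src_off < ALIGN && dst_off < ALIGN);

    src_raw = malloc(size + 2 * ALIGN);
    dst_raw = malloc(size + 2 * ALIGN);
    if (src_raw == NULL || dst_raw == NULL) {
        free(src_raw);
        free(dst_raw);
        return -1.0;
    }
    src = align_up(src_raw) + src_off;
    dst = align_up(dst_raw) + dst_off;
    memset(src, 0x5a, size);  /* also touches the pages before timing */
    memset(dst, 0, size);

    start = now_seconds();
    for (i = 0; i < iters; i++)
        copy(dst, src, size);
    elapsed = now_seconds() - start;

    ok = memcmp(dst, src, size) == 0;
    free(src_raw);
    free(dst_raw);
    if (!ok)
        return -1.0;
    if (elapsed <= 0.0)
        return 0.0;
    return (double)size * (double)iters / elapsed / 1e6;
}
```

Calling this for sizes from well below L1 to well above L2, and for a few source/destination offset pairs, would give figures that can be quoted together with the size and alignment they apply to. Repeated copies of a small buffer measure the cached case only, so large-copy behaviour needs sizes much larger than L2.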