On 5 May 2011 17:45, Deepak Saxena dsaxena@plexity.net wrote:
On May 05 2011, at 16:46, David Gilbert was caught saying:
On 5 May 2011 16:08, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
Not quite: a) NEON memcpy/memset is worse on A9 than non-NEON versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with use of NEON for large copies. For small copies, plain ARM might be faster since the overhead of preparing for a properly aligned NEON loop is avoided.
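To show where that setup overhead comes from, here is a minimal sketch in C using GCC NEON intrinsics (this is not any of the implementations under discussion, and the 64-byte cutoff is a placeholder, not a measured crossover):

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

void *neon_memcpy(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Small copies: the prologue below costs more than it saves,
     * so fall back to a plain byte loop.  64 is a guess. */
    if (n < 64) {
        while (n--)
            *d++ = *s++;
        return dst;
    }

    /* Align the destination to 16 bytes so the NEON stores hit
     * aligned addresses; this is the per-call overhead that hurts
     * small copies. */
    while ((uintptr_t)d & 15) {
        *d++ = *s++;
        n--;
    }

    /* Main loop: 16 bytes per iteration via quadword load/store. */
    while (n >= 16) {
        vst1q_u8(d, vld1q_u8(s));
        d += 16;
        s += 16;
        n -= 16;
    }

    /* Tail. */
    while (n--)
        *d++ = *s++;
    return dst;
}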
What do you base your claims on?
My tests here: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
at the bottom of the page are sets of graphs for A9 (left) and A8 (right); on A9 the NEON memcpys (red and green) top out much lower than the best non-NEON equivalents (black and cyan). I've seen different results for badly misaligned copies, where the NEON vld/vst work very well.
Looking at the top part of the page, it looks like NEON has an obvious advantage for large copies; however, I'm not sure how often we do copies of that magnitude in the kernel (I would hope rarely), and I don't think we have numbers tracking average copy sizes for different workloads. I don't think a one-size-fits-all approach is ideal; instead we should provide both build-time and runtime configurability (something similar to the RAID code's boot-up performance tests) to allow selection of the appropriate memcpy implementation.
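Roughly what I have in mind, as a hypothetical sketch in the spirit of the RAID xor boot-up benchmark (none of these names exist in the kernel, and read_cycles() stands in for whatever cycle source we would actually use):

#include <stddef.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Candidate implementations, assumed to exist elsewhere;
 * both names are hypothetical. */
void *memcpy_arm(void *dst, const void *src, size_t n);
void *memcpy_neon(void *dst, const void *src, size_t n);

/* Hypothetical cycle source (e.g. a CP15 cycle counter read). */
unsigned long long read_cycles(void);

static memcpy_fn best_memcpy = memcpy_arm;      /* safe default */

static unsigned long long time_one(memcpy_fn fn, void *dst,
                                   const void *src, size_t n)
{
    unsigned long long start = read_cycles();
    int i;

    for (i = 0; i < 16; i++)        /* average over a few runs */
        fn(dst, src, n);
    return read_cycles() - start;
}

/* Run once at boot; afterwards callers go through best_memcpy. */
void memcpy_select(void *dst, const void *src, size_t n)
{
    if (time_one(memcpy_neon, dst, src, n) <
        time_one(memcpy_arm, dst, src, n))
        best_memcpy = memcpy_neon;
}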
The top part of the page is A8. The graphs at the bottom of the page go up to 256k (log scale), so they do cover the large-copy case, and you can see that even after the cliff where the working set drops out of cache, the non-NEON versions are still winning on A9.
If people believe it's worth breaking the context-switching taboo and putting a NEON version into the kernel, then yes, I agree it's something you'd want to do as a build-time and/or runtime selection - but that's quite a big taboo to break.
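For reference, the reason it's a taboo: the kernel doesn't save the NEON/VFP register file on kernel entry, so an in-kernel NEON memcpy would have to bracket itself with an explicit save/restore of the current task's VFP state on every call. A sketch, with hypothetical helper names since mainline has no interface for this:

#include <stddef.h>

/* Hypothetical helpers - no such interface exists in mainline,
 * which is exactly the taboo being discussed. */
void vfp_state_save(void);      /* spill d0-d31 and FPSCR for current task */
void vfp_state_restore(void);   /* put the user's VFP state back */
void *memcpy_neon(void *dst, const void *src, size_t n);

void *kernel_neon_memcpy(void *dst, const void *src, size_t n)
{
    void *ret;

    /* Preemption would also have to stay off across this region,
     * or a context switch could see half-clobbered NEON state. */
    vfp_state_save();
    ret = memcpy_neon(dst, src, n);
    vfp_state_restore();

    return ret;
}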
I don't see the connection between Thumb2 and memcpy performance. Thumb2 can do anything 32-bit ARM can.
There are purists who say to write everything in Thumb2 now; however, there is an interesting question of which is faster, and IMHO the ARM code is likely to be a bit faster in most cases.
Do we have numbers for this? :)
Hmm - not for memcpy specifically; I think we have some benchmark figures showing that gcc-generated Thumb2 code is slower than the equivalent ARM code in most cases (as expected). My belief is that the icache advantage of Thumb2 code should be pretty much irrelevant for one or two small routines like memcpy, and the addition of IT instructions and the reduced flexibility of condition codes have got to hurt somewhere - but I haven't tried the code built either way.
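To make the IT point concrete: the assembly in the comments below is roughly what a compiler might emit for a one-instruction conditional, not output I've actually captured.

/* Conditional execution of a single instruction, e.g. abs(): */
static inline int iabs(int x)
{
    return x < 0 ? -x : x;
}

/* ARM mode: almost any instruction can carry a condition code:
 *
 *     cmp     r0, #0
 *     rsblt   r0, r0, #0      @ reverse-subtract only if negative
 *
 * Thumb2: the conditional instruction needs a preceding IT
 * ("if-then") instruction, costing an extra two bytes and an
 * extra issue slot:
 *
 *     cmp     r0, #0
 *     it      lt
 *     rsblt   r0, r0, #0
 */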
Dave