Re: Optimized kernel memcpy/memset

5 May 2011


On 5 May 2011 16:08, Måns Rullgård mans@mansr.com wrote:
...
David Gilbert david.gilbert@linaro.org writes:
...
Not quite:
  a) Neon memcpy/memset is worse on A9 than non-neon versions (better
on A8 typically)
That is not my experience at all.  On the contrary, I've seen memcpy
throughput on A9 roughly double with use of NEON for large copies.
For small copies, plain ARM might be faster since the overhead of
preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
My tests here:
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
At the bottom of the page are sets of graphs for A9 (left) and A8 (right);
on A9 the Neon memcpy's (red and green) top out much lower than their best
non-neon equivalents (black and cyan).  I've seen different results for very
non-aligned copies, where the vld/vst on Neon work very well.
Also, when I showed those numbers to the guys at ARM they all said it was
a bad idea to use Neon on A9 for memory manipulation workloads.
What code do you base your claims on :-)
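For anyone wanting to reproduce this kind of comparison, here is a minimal
sketch of a throughput measurement. It is not the harness behind the graphs
linked above; the names copy_fn and copy_throughput_mbs are hypothetical, and
it assumes a POSIX system with clock_gettime. Offsets are relative to whatever
alignment malloc returns.

```c
#define _POSIX_C_SOURCE 199309L  /* assumption: POSIX clock_gettime is available */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical type: any routine with memcpy's signature (libc, NEON, ...). */
typedef void *(*copy_fn)(void *, const void *, size_t);

/*
 * Hypothetical helper: copy `size` bytes `iters` times and return the
 * throughput in MB/s (1 MB = 1e6 bytes). src_off and dst_off shift the
 * buffers away from malloc's alignment to exercise unaligned copies.
 * Returns -1.0 on allocation failure or if the copy produced wrong data.
 */
double copy_throughput_mbs(copy_fn copy, size_t size,
                           size_t src_off, size_t dst_off, unsigned iters)
{
    assert(copy != NULL && size > 0 && iters > 0);

    /* Call through a volatile pointer so the compiler cannot elide the copies. */
    copy_fn volatile fn = copy;
    unsigned char *src = malloc(size + src_off);
    unsigned char *dst = malloc(size + dst_off);
    struct timespec t0, t1;
    double secs, result;
    unsigned i;

    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, size + src_off);
    memset(dst, 0, size + dst_off);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < iters; i++)
        fn(dst + dst_off, src + src_off, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        secs = 1e-9;  /* guard against a timer too coarse to see the loop */

    /* Sanity check: a fast but wrong copy is not a useful result. */
    if (memcmp(dst + dst_off, src + src_off, size) != 0)
        result = -1.0;
    else
        result = (double)size * (double)iters / secs / 1e6;

    free(src);
    free(dst);
    return result;
}
```

Calling it as copy_throughput_mbs(memcpy, 65536, 0, 0, 10000) and again with
offsets such as 1 and 3 gives aligned and unaligned figures for the same
routine; passing a NEON and a non-NEON implementation in turn gives the
comparison being argued about. Small sizes mostly measure cache and call
overhead, so a range of sizes is needed before drawing conclusions.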
...
I don't see the connection between Thumb2 and memcpy performance.
Thumb2 can do anything 32-bit ARM can.
There are the purists who say write everything in Thumb2 now; however,
there is an interesting question of which is faster, and IMHO the ARM code
is likely to be a bit faster in most cases.
Dave
