Re: Optimized kernel memcpy/memset

5 May 2011


On 5 May 2011 18:44, Måns Rullgård <mans@mansr.com> wrote:
...
> The relative performance of NEON vs non-NEON seems to depend a lot on
> the size (relative to cache), alignment, and whether or not any
> prefetching (explicit PLD, automatic, or preload engine) is used.
Yes, agreed - Neon does very well in non-aligned cases (I have some graphs
for non-aligned cases but am still working on it). I'd been wondering about
using Neon only for the non-aligned cases.
...
> For large copies (much larger than L2) NEON with prefetching wins in my
> testing (don't have numbers handy right now).
OK, I've not tried anything larger than about 256k chunks. I'm a) not sure
it's really common to have really big copies, and b) with the larger L2 caches
it seems less and less likely that copies will be larger than the cache.
Of course that all comes down to workload and exactly which cases you care
about, which is very difficult to pin down.
...
> For already cached data,
> things may be different.  The A8 is also special with the fast path
> between L2 and NEON which the A9 lacks for obvious reasons.
So the thing I don't yet understand is whether A8 is special or A9 is special;
if A8 is special and the current behaviour that A9 possesses is what
will happen on other cores in the future then fine; if A9 is special then
Neon may well be good in the future.
...
> I have also observed significant variation depending on the relative
> alignment of source and destination buffers, probably some cache effect.
> Any claim of X being faster than Y should also specify for which sizes
> the claim is valid.  I have previously looked mostly at large copies, while
> you seem to be focused on small sizes.  That is probably the reason for
> our different experiences.  I'll try to be more specific from now on.
It also gets very difficult to present - I have one set of data which is
basically that set of graphs but iterated over different source and
destination alignments - and that's just a sea of graphs!
Dave
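
The point made above, that any "X is faster than Y" claim needs its sizes and alignments stated, can be illustrated with a small sweep harness. This is only a sketch under assumptions, not the benchmark used in this thread: the names `copy_fn`, `align_up`, `bench_copy` and `print_sweep`, the 64-byte base alignment, the offset list and the 16 MB-per-point budget are all hypothetical choices. It times with standard C `clock()`, so it reports CPU time and says nothing about prefetch or cold-cache behaviour.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by memcpy and any candidate replacement routine. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Round p up to the next multiple of align (must be a power of two). */
static unsigned char *align_up(unsigned char *p, size_t align)
{
    uintptr_t v = (uintptr_t)p;
    assert(align != 0 && (align & (align - 1)) == 0);
    return (unsigned char *)((v + align - 1) & ~(uintptr_t)(align - 1));
}

/*
 * Time iters copies of size bytes, with source and destination placed
 * src_off and dst_off bytes past a 64-byte boundary (assumed line size).
 * Returns MB/s, 0.0 if the run was too short for clock() to measure,
 * or -1.0 on allocation failure or an incorrect copy.
 */
static double bench_copy(copy_fn fn, size_t size, size_t src_off,
                         size_t dst_off, int iters)
{
    unsigned char *src_raw = malloc(size + src_off + 64);
    unsigned char *dst_raw = malloc(size + dst_off + 64);
    /* Volatile pointer so the compiler cannot elide the repeated calls. */
    copy_fn volatile call = fn;
    double result = -1.0;

    if (src_raw != NULL && dst_raw != NULL) {
        unsigned char *src = align_up(src_raw, 64) + src_off;
        unsigned char *dst = align_up(dst_raw, 64) + dst_off;
        clock_t start, end;
        int i;

        memset(src, 0xA5, size);
        memset(dst, 0, size);

        start = clock();
        for (i = 0; i < iters; i++)
            call(dst, src, size);
        end = clock();

        if (memcmp(dst, src, size) == 0) {
            double secs = (double)(end - start) / CLOCKS_PER_SEC;
            result = secs > 0.0 ? (double)size * iters / secs / 1e6 : 0.0;
        }
    }
    free(src_raw);
    free(dst_raw);
    return result;
}

/* Print one line per (size, source offset, destination offset) point. */
static void print_sweep(copy_fn fn)
{
    static const size_t offsets[] = { 0, 1, 4, 8 }; /* arbitrary sample */
    const size_t n = sizeof offsets / sizeof offsets[0];
    size_t size, s, d;

    for (size = 64; size <= 256 * 1024; size *= 4) {
        int iters = (int)((16u << 20) / size); /* about 16 MB per point */
        for (s = 0; s < n; s++)
            for (d = 0; d < n; d++)
                printf("%7zu src+%zu dst+%zu %10.1f MB/s\n", size,
                       offsets[s], offsets[d],
                       bench_copy(fn, size, offsets[s], offsets[d], iters));
    }
}
```

Calling `print_sweep(memcpy)` from `main` gives a baseline table; a NEON or other candidate routine with the same signature could be passed in its place. Even this small grid produces over a hundred data points per routine, which is the "sea of graphs" problem in miniature.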
