Re: Optimized kernel memcpy/memset

5 May 2011


David Gilbert <david.gilbert@linaro.org> writes:
...
> On 5 May 2011 18:44, Måns Rullgård <mans@mansr.com> wrote:
...
>> The relative performance of NEON vs non-NEON seems to depend a lot on
>> the size (relative to cache), alignment, and whether or not any
>> prefetching (explicit PLD, automatic, or preload engine) is used.
>
> Yes, agreed - Neon does very well in non-aligned cases (I have some graphs
> for non-aligned cases but am still working on it). I'd been wondering about
> using Neon only for the non-aligned cases.
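
A minimal sketch of the "NEON only for the non-aligned cases" idea, to make the dispatch concrete. Everything here is an assumption for illustration: the names `copy_dispatch`, `copy_aligned` and `copy_unaligned` are hypothetical, the 4-byte alignment test is an arbitrary choice, and both paths simply call `memcpy` as stand-ins for a tuned non-NEON routine and a NEON routine respectively.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical path labels, returned so a caller can see which path ran. */
enum copy_path { COPY_PATH_ALIGNED, COPY_PATH_UNALIGNED };

/* True when both pointers are 4-byte aligned (threshold is an assumption). */
static int both_word_aligned(const void *dst, const void *src)
{
    return (((uintptr_t)dst | (uintptr_t)src) & (sizeof(uint32_t) - 1)) == 0;
}

/* Stand-in for a non-NEON routine tuned for aligned buffers. */
static void copy_aligned(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Stand-in for a NEON routine that tolerates unaligned buffers. */
static void copy_unaligned(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Pick a path from pointer alignment alone, then copy n bytes. */
enum copy_path copy_dispatch(void *dst, const void *src, size_t n)
{
    assert((dst && src) || n == 0);

    if (both_word_aligned(dst, src)) {
        copy_aligned(dst, src, n);
        return COPY_PATH_ALIGNED;
    }
    copy_unaligned(dst, src, n);
    return COPY_PATH_UNALIGNED;
}
```

Whether the alignment check pays for itself would have to be measured per core; the sketch only shows where such a decision would sit.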
...
>> For large copies (much larger than L2) NEON with prefetching wins in
>> my testing (don't have numbers handy right now).
>
> OK, I've not tried stuff larger than about 256k chunks - I'm a) not sure
> it's really common to have really big copies and

Not in the kernel at least, and that's the focus of this email, if not
(at least not obviously) the wiki page.
...
> b) with the larger L2 caches it seems less and less likely the copies
> will be larger than it.  Of course that's all down to workload and
> exactly the cases you care about etc that's very difficult to pin
> down.

What really matters is whether source, destination, or both are already
in some cache.  Even a small copy of data currently not in any cache
will perform similarly to a large one.  Really tiny copies are another
matter, but they should probably be inlined anyway.
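
A minimal sketch of what inlining really tiny copies could look like, assuming a hypothetical helper `copy_small_or_call` and a hypothetical cut-off `SMALL_COPY_MAX` of 16 bytes (the thread gives no number). Small sizes are handled by an inline byte loop, and anything larger falls through to the out-of-line `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-off in bytes; the right value depends on the core. */
#define SMALL_COPY_MAX 16

/* Copy n bytes, avoiding the function call for very small sizes. */
static inline void *copy_small_or_call(void *dst, const void *src, size_t n)
{
    assert((dst && src) || n == 0);

    if (n <= SMALL_COPY_MAX) {
        unsigned char *d = dst;
        const unsigned char *s = src;

        /* Plain byte loop; short enough for the compiler to inline. */
        while (n--)
            *d++ = *s++;
        return dst;
    }
    /* Larger copies go to the regular out-of-line routine. */
    return memcpy(dst, src, n);
}
```

When the size is a small compile-time constant, compilers such as GCC can typically expand `memcpy` inline on their own, so a helper like this mainly matters for small sizes known only at run time.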
...
...
>> For already cached data, things may be different.  The A8 is also
>> special with the fast path between L2 and NEON which the A9 lacks for
>> obvious reasons.
>
> So the thing I don't yet understand is whether A8 is special or A9 is special;
> if A8 is special and the current behaviour that A9 possesses is what
> will happen on other cores in the future then fine; if A9 is special then
> Neon may well be good in the future.

I think both are special in different ways.  If I were to guess, I'd say
the direct L2 path of the A8 is unlikely to show up again, whereas some
of the weak points of the A9 (e.g. 64-bit data paths vs 128-bit in A8)
will probably go away in some future core.
-- 
Måns Rullgård
mans@mansr.com
