Re: Optimized kernel memcpy/memset

5 May 2011


      David Gilbert david.gilbert@linaro.org writes:
...
On 5 May 2011 17:45, Deepak Saxena dsaxena@plexity.net wrote:
...
On May 05 2011, at 16:46, David Gilbert was caught saying:
...
On 5 May 2011 16:08, Måns Rullgård mans@mansr.com wrote:
...
David Gilbert david.gilbert@linaro.org writes:
...
Not quite:
  a) Neon memcpy/memset is worse on A9 than non-neon versions (better
on A8 typically)
That is not my experience at all.  On the contrary, I've seen memcpy
throughput on A9 roughly double with use of NEON for large copies.
For small copies, plain ARM might be faster since the overhead of
preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
My tests here:
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
at the bottom of the page are sets of graphs for A9 (left) and A8
(right); on A9 the Neon memcpy's (red and green) top out much lower
than their non-neon best equivalents (black and cyan).  I've seen
different results for very non-aligned copies, where the vld/vst on
Neon work very well.
Looking at the top part of the page, it looks like when doing large size
copies, NEON has an obvious advantage; however, I'm not sure how often
we do copies of that magnitude in the kernel (I would hope rarely) but
I don't know that we have numbers tracking average copy sizes for
different workloads. I don't think going for a one-size-fits-all
approach is ideal; instead we should provide both build
and runtime configurability (something similar to the RAID
code's boot-up performance tests) to allow for selection of the
appropriate memcpy implementation.
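A rough userspace sketch of that kind of boot-time selection, in the spirit of the RAID code's startup benchmark. It is not kernel code: the names (`memcpy_fn`, `struct candidate`, `copy_bytewise`, `copy_libc`, `select_fastest`) are hypothetical, and the byte loop merely stands in for one of the real ARM or NEON variants being discussed.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Hypothetical candidate table entry: a label plus the routine to time. */
struct candidate {
    const char *name;
    memcpy_fn fn;
};

/* Stand-in for one implementation: a plain byte-at-a-time loop. */
static void *copy_bytewise(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Stand-in for another implementation: whatever the C library provides. */
static void *copy_libc(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Time `reps` copies of `n` bytes and return the elapsed seconds. */
static double time_candidate(memcpy_fn fn, void *dst, const void *src,
                             size_t n, int reps)
{
    struct timespec t0, t1;
    volatile unsigned char sink;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        fn(dst, src, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* Read the result so the copies are not treated as dead code. */
    sink = ((unsigned char *)dst)[n - 1];
    (void)sink;

    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Run every candidate once at the given size and return the quickest one.
 * Returns NULL if there are no candidates or the buffers cannot be allocated. */
static const struct candidate *select_fastest(const struct candidate *c,
                                              size_t count, size_t n, int reps)
{
    unsigned char *src = malloc(n);
    unsigned char *dst = malloc(n);
    const struct candidate *best = NULL;
    double best_secs = 0.0;

    if (!src || !dst || count == 0) {
        free(src);
        free(dst);
        return NULL;
    }
    memset(src, 0xa5, n);

    for (size_t i = 0; i < count; i++) {
        double secs = time_candidate(c[i].fn, dst, src, n, reps);

        if (!best || secs < best_secs) {
            best = &c[i];
            best_secs = secs;
        }
    }

    free(src);
    free(dst);
    return best;
}
```

A caller would build a table such as `{{"bytewise", copy_bytewise}, {"libc", copy_libc}}`, call `select_fastest()` once at startup with a representative copy size, and route later copies through the returned `fn` pointer. Picking that representative size is exactly the open question above, since nobody in the thread has numbers for typical kernel copy sizes.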
The top part of the page is A8.  The graphs at the bottom of the page
go up to 256k (log scale), so they do cover the large case, and you can
see that after the cliff where it drops out of the cache the non-neon is
still winning for A9.
That is still well within the OMAP4 L2 cache (1MB) and the same size as
the OMAP3 L2.  It would have been interesting to extend the graphs up to
8MB or so to ensure the caches become mostly irrelevant.
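For illustration, a sweep of that sort could look roughly like the following userspace sketch. The names (`struct sweep_point`, `sweep_memcpy`) are hypothetical, it times the C library's `memcpy` rather than any of the routines on the wiki page, and it makes no attempt to control alignment or cache state between sizes.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* One measurement: block size and observed throughput. */
struct sweep_point {
    size_t bytes;
    double mb_per_s;
};

/* Time memcpy at doubling block sizes from min_bytes up to max_bytes.
 * Each size copies the same total volume (4 * max_bytes) so the points
 * are comparable. Returns the number of points written to `out`. */
static size_t sweep_memcpy(size_t min_bytes, size_t max_bytes,
                           struct sweep_point *out, size_t cap)
{
    /* Calling through a volatile pointer keeps the copies from being elided. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    const size_t total = 4 * max_bytes;
    unsigned char *src = malloc(max_bytes);
    unsigned char *dst = malloc(max_bytes);
    size_t count = 0;

    if (!src || !dst || min_bytes == 0) {
        free(src);
        free(dst);
        return 0;
    }
    memset(src, 0xa5, max_bytes);
    memset(dst, 0, max_bytes);

    for (size_t n = min_bytes; n <= max_bytes && count < cap; n *= 2) {
        size_t reps = total / n;
        struct timespec t0, t1;
        double secs;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < reps; i++)
            copy(dst, src, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (double)(t1.tv_sec - t0.tv_sec) +
               (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
        out[count].bytes = n;
        /* Report 0 if the clock did not advance, rather than dividing by zero. */
        out[count].mb_per_s =
            secs > 0.0 ? (double)(n * reps) / secs / 1e6 : 0.0;
        count++;
    }

    free(src);
    free(dst);
    return count;
}
```

Calling `sweep_memcpy(1024, 8u * 1024 * 1024, pts, 32)` yields one point per power of two between 1 KiB and 8 MiB, so the largest blocks are well past a 1MB L2.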
...
If people believe it's worth breaking the context-switching taboo and
putting a neon version into the kernel then yes I agree it's something
you'd want to do as a build and/or runtime selection - but that's
quite a big taboo to break.
I doubt it's worth the trouble, even if NEON is faster.  The kernel
shouldn't be doing much copying, certainly not of huge blocks at a time.
-- 
Måns Rullgård
mans@mansr.com
