Re: Optimized kernel memcpy/memset

6 May 2011


Hi,
On Thu, May 05, 2011 at 03:47:08PM +0100, David Gilbert wrote:
...
Hi Kiko,
On 5 May 2011 15:21, Christian Robottom Reis <kiko@linaro.org> wrote:
...
Hey there,
I was asked today in the board meeting about the use of NEON
routines in the kernel; I said we had looked into this but hadn't done
it because a) it wasn't conclusively better and b) if better, it would
need to be done conditionally per-platform. But I wanted to double-check
that's actually true (and I'm copying Vijay to keep me honest). I have
some references:
Not quite:
  a) Neon memcpy/memset is worse on A9 than non-neon versions (better
on A8 typically)
  b) In general I don't believe fpu or Neon code can be used
internally to the kernel.
Dave
...
http://lists.linaro.org/pipermail/linaro-toolchain/2011-January/000722.html
http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc...
http://www.spinics.net/lists/arm-kernel/msg106503.html
http://dev.gentoo.org/~armin76/arm/memcpy-neon_result.txt
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy?hig...
https://wiki.linaro.org/WorkingGroups/ToolChain/StringRoutines?highlight=%28...
There may still be potential for a non-NEON optimised memcpy/memset for
Cortex-A9; however, the existing kernel routines are pretty good.
One important thing to observe is that NEON is, first and foremost, a
computation engine.  It isn't specifically designed for speeding up bulk
memory copies, so this probably isn't the first thing we should focus on
if we want to make a case for using NEON in the kernel.
Conversely, targeting NEON use at computational tasks is likely to deliver
much more consistent gains.
Secondly, VFP/NEON context switch overheads will tend towards the worst case
if NEON is used for memcpy(), simply because memcpy is used very often.
Microbenchmarks of core memcpy performance don't inform us about such system-
level effects.  We'd need metrics for the cost and frequency of those context
switches to get a better idea of the impact.  Even so, the ideal tradeoff may
not be the same on all platforms.
Some fruitful work might therefore involve:
  * Create infrastructure to allow NEON/VFP to be used in kernel-space (other
    architectures provide an example of how this can be done).
  * Add instrumentation to gather metrics on the context switching behaviour
    and cost.
  * Port some no-brainer functionality (such as CRC32) to use NEON, instrument
    and benchmark as appropriate.
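For the CRC32 item, any NEON port would need a known-good scalar baseline to check results against. A minimal sketch of one follows, assuming the common reflected CRC-32 (polynomial 0xEDB88320) with the zlib-style pre- and post-inversion. The name crc32_ref is hypothetical, and the kernel's own CRC helpers may use different conventions, so treat this as a test oracle only.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bit-at-a-time reference CRC-32 (reflected polynomial 0xEDB88320).
 * Pass crc = 0 for the first buffer; pass the previous return value
 * to continue over a further buffer.
 */
uint32_t crc32_ref(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;                       /* undo the final inversion / set initial value */
    while (len--) {
        crc ^= *buf++;
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return ~crc;                      /* final inversion */
}
```

The standard check value for this CRC variant is 0xCBF43926 for the nine ASCII bytes "123456789", which gives a quick sanity test before comparing an optimised routine against the reference on larger random buffers.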
These will allow a properly quantified case to be presented to upstream: if
a clear benefit is demonstrated, I doubt that "taboos" will present too much
of an obstacle.
Needless to say, any benchmarking should be done on multiple platforms, at
least A8 and A9.
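To compare candidate routines across platforms, the same small harness can be built and run unchanged on each board. The sketch below is a user-space illustration only, with hypothetical names (copy_fn, time_copy); it uses the standard clock() function, which measures CPU time, and it says nothing about the system-level context-switch effects discussed earlier.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any routine with memcpy()'s signature can be plugged in. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/*
 * Average CPU seconds per call of fn copying `size` bytes over `iters`
 * calls. Returns a negative value on bad arguments or allocation failure.
 */
double time_copy(copy_fn fn, size_t size, unsigned long iters)
{
    if (size == 0 || iters == 0)
        return -1.0;

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, size);

    /* volatile pointer so the compiler cannot drop or hoist the calls */
    copy_fn volatile call = fn;

    clock_t start = clock();
    for (unsigned long i = 0; i < iters; i++)
        call(dst, src, size);
    clock_t end = clock();

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC / (double)iters;
}
```

Calling time_copy(memcpy, size, iters) over a range of sizes gives a baseline; a candidate implementation is then passed in place of memcpy and the per-size figures compared on each platform.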
Once the above work is done, we have the option to add memcpy to the mix --
however, as discussed in this thread, this isn't a no-brainer everywhere and
has subtleties; so it's probably best kept orthogonal from the tasks above.
The above work is not currently in the planning for 11.11, so if we want any
of it to happen we will need to take account of this in the planning.
...
...
Incidentally, this ties into the question sent earlier this week which
had to do with Nico's work item in:
https://blueprints.launchpad.net/linux-linaro/+spec/other-kernel-thumb2
Which IIRC Nico says probably isn't worth it, right?
I thought dmart had done a lot of that?
The NEON task was never really in my queue: its presence in the Thumb-2
blueprint seems a bit strange actually.  I believe there was no significant
work done on this in the 10.05 cycle.
Cheers
---Dave
