Hi,
On Thu, May 05, 2011 at 03:47:08PM +0100, David Gilbert wrote:
Hi Kiko,
On 5 May 2011 15:21, Christian Robottom Reis <kiko@linaro.org> wrote:
Hey there,
I was asked today in the board meeting about the use of NEON routines in the kernel; I said we had looked into this but hadn't done it because a) it wasn't conclusively better and b) if better, it would need to be done conditionally per-platform. But I wanted to double-check that's actually true (and I'm copying Vijay to keep me honest). I have some references:
Not quite:
a) NEON memcpy/memset is worse on A9 than the non-NEON versions (though typically better on A8).
b) In general, I don't believe FPU or NEON code can be used inside the kernel.
Dave
http://lists.linaro.org/pipermail/linaro-toolchain/2011-January/000722.html
http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc...
http://www.spinics.net/lists/arm-kernel/msg106503.html
http://dev.gentoo.org/~armin76/arm/memcpy-neon_result.txt
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy?hig...
https://wiki.linaro.org/WorkingGroups/ToolChain/StringRoutines?highlight=%28...
There may still be potential for a non-NEON optimised memcpy/memset for Cortex-A9; however, the existing kernel routines are already pretty good.
One important thing to observe is that NEON is, first and foremost, a computation engine. It isn't specifically designed for speeding up bulk memory copies, so this probably isn't the first thing we should focus on if we want to make a case for using NEON in the kernel.
Conversely, targeting NEON use at computational tasks is likely to deliver much more consistent gains.
Secondly, VFP/NEON context switch overheads will tend towards the worst case if NEON is used for memcpy(), simply because memcpy is used very often. Microbenchmarks of core memcpy performance don't inform us about such system-level effects. We'd need metrics for the cost and frequency of those context switches to get a better idea of the impact. Even so, the ideal tradeoff may not be the same on all platforms.
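To make the shape of that tradeoff concrete, here is a back-of-envelope model. It is only a sketch: the function names are made up for illustration, and every number in the usage note is a hypothetical placeholder, not a measurement from any platform. The real inputs would have to come from the instrumentation discussed in this thread.

```python
def net_cycles_saved(copies, cycles_saved_per_copy,
                     extra_ctx_switches, cycles_per_ctx_switch):
    """Net cycles gained (positive) or lost (negative) by using NEON copies.

    copies                -- number of copy calls in the measured interval
    cycles_saved_per_copy -- average per-call gain from a microbenchmark
    extra_ctx_switches    -- additional VFP/NEON state save/restores caused
                             by the kernel touching NEON registers
    cycles_per_ctx_switch -- cost of one such save/restore
    """
    gain = copies * cycles_saved_per_copy
    overhead = extra_ctx_switches * cycles_per_ctx_switch
    return gain - overhead


def break_even_saving_per_copy(copies, extra_ctx_switches,
                               cycles_per_ctx_switch):
    """Per-copy saving needed just to pay for the extra context switches."""
    if copies <= 0:
        raise ValueError("copies must be positive")
    return extra_ctx_switches * cycles_per_ctx_switch / copies
```

With placeholder inputs of 1000 copies, 50 cycles saved per copy, 200 extra state switches and 300 cycles per switch, net_cycles_saved() comes out negative (-10000), and break_even_saving_per_copy() says each copy would need to save 60 cycles before NEON pays for itself. The point is only that a microbenchmark win per call can be wiped out by the system-level term.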
Some fruitful work might therefore involve:

 * Create infrastructure to allow NEON/VFP to be used in kernel-space (other architectures provide an example of how this can be done).
 * Add instrumentation to gather metrics on the context switching behaviour and cost.
 * Port some no-brainer functionality (such as CRC32) to use NEON, instrument and benchmark as appropriate.
These will allow a properly quantified case to be presented to upstream: if a clear benefit is demonstrated, I doubt that "taboos" will present too much of an obstacle.
Needless to say, any benchmarking should be done on multiple platforms, at least A8 and A9.
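For the CRC32 item, any accelerated port would need a known-good scalar reference to be checked against before its benchmark numbers mean anything. A minimal sketch of such a reference follows, assuming the common reflected CRC-32 (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF); which variant and seed convention a kernel port should match is an assumption to confirm against the existing kernel routine. The helper find_mismatches() is a hypothetical name used only for illustration.

```python
def crc32_bitwise(data: bytes) -> int:
    """Slow, obviously-correct CRC-32 (reflected, polynomial 0xEDB88320)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF


def find_mismatches(candidate, buffers):
    """Return the buffers for which candidate() disagrees with the reference."""
    return [buf for buf in buffers if candidate(buf) != crc32_bitwise(buf)]
```

A candidate implementation would be run through find_mismatches() with buffers of assorted lengths (including empty and odd-sized ones) on each platform, and only benchmarked once the returned list is empty.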
Once the above work is done, we have the option to add memcpy to the mix -- however, as discussed in this thread, this isn't a no-brainer everywhere and has subtleties; so it's probably best kept separate from the tasks above.
The above work is not currently planned for 11.11, so if we want any of it to happen we will need to account for it in the planning.
Incidentally, this ties into the question sent earlier this week which had to do with Nico's work item in:
https://blueprints.launchpad.net/linux-linaro/+spec/other-kernel-thumb2
Which IIRC Nico says probably isn't worth it, right?
I thought dmart had done a lot of that?
The NEON task was never really in my queue: its presence in the Thumb-2 blueprint seems a bit strange actually. I believe there was no significant work done on this in the 10.05 cycle.
Cheers ---Dave