David Gilbert david.gilbert@linaro.org writes:
Hi Kiko,
On 5 May 2011 15:21, Christian Robottom Reis kiko@linaro.org wrote:
Hey there,
I was asked today in the board meeting about the use of NEON routines in the kernel; I said we had looked into this but hadn't done it because a) it wasn't conclusively better and b) if better, it would need to be done conditionally per-platform. But I wanted to double-check that's actually true (and I'm copying Vijay to keep me honest). I have some references:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with the use of NEON for large copies. For small copies, plain ARM might be faster, since the overhead of preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
b) In general I don't believe fpu or Neon code can be used internally to the kernel.
That is true. There is currently no support for the context save and restore it would require.
http://lists.linaro.org/pipermail/linaro-toolchain/2011-January/000722.html
http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc...
http://www.spinics.net/lists/arm-kernel/msg106503.html
http://dev.gentoo.org/~armin76/arm/memcpy-neon_result.txt
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy?hig...
https://wiki.linaro.org/WorkingGroups/ToolChain/StringRoutines?highlight=%28...
There may still be potential for a non-NEON optimised memcpy/memset for Cortex-A9; however, the kernel routines are already pretty good.
Incidentally, this ties into the question sent earlier this week which had to do with Nico's work item in:
https://blueprints.launchpad.net/linux-linaro/+spec/other-kernel-thumb2
Which IIRC Nico says probably isn't worth it, right?
I thought dmart had done a lot of that?
I don't see the connection between Thumb2 and memcpy performance. Thumb2 can do anything 32-bit ARM can.