Hey there,
I was asked today in the board meeting about the use of NEON routines in the kernel; I said we had looked into this but hadn't done it because a) it wasn't conclusively better and b) if better, it would need to be done conditionally per-platform. But I wanted to double-check that's actually true (and I'm copying Vijay to keep me honest). I have some references:
http://lists.linaro.org/pipermail/linaro-toolchain/2011-January/000722.html
http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc...
http://www.spinics.net/lists/arm-kernel/msg106503.html
http://dev.gentoo.org/~armin76/arm/memcpy-neon_result.txt
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy?hig...
https://wiki.linaro.org/WorkingGroups/ToolChain/StringRoutines?highlight=%28...
Incidentally, this ties into the question sent earlier this week which had to do with Nico's work item in:
https://blueprints.launchpad.net/linux-linaro/+spec/other-kernel-thumb2
Which IIRC Nico says probably isn't worth it, right?
Hi Kiko,
On 5 May 2011 15:21, Christian Robottom Reis kiko@linaro.org wrote:
Hey there,
I was asked today in the board meeting about the use of NEON routines in the kernel; I said we had looked into this but hadn't done it because a) it wasn't conclusively better and b) if better, it would need to be done conditionally per-platform. But I wanted to double-check that's actually true (and I'm copying Vijay to keep me honest). I have some references:
Not quite:
a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
b) In general I don't believe fpu or Neon code can be used internally to the kernel.
Dave
http://lists.linaro.org/pipermail/linaro-toolchain/2011-January/000722.html
http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc...
http://www.spinics.net/lists/arm-kernel/msg106503.html
http://dev.gentoo.org/~armin76/arm/memcpy-neon_result.txt
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy?hig...
https://wiki.linaro.org/WorkingGroups/ToolChain/StringRoutines?highlight=%28...
There may still be potential for a non-neon optimised memcpy/memset for Cortex-A9; however, the kernel routines are pretty good.
Incidentally, this ties into the question sent earlier this week which had to do with Nico's work item in:
https://blueprints.launchpad.net/linux-linaro/+spec/other-kernel-thumb2
Which IIRC Nico says probably isn't worth it, right?
I thought dmart had done a lot of that?
Dave
On Thu, May 05, 2011 at 03:47:08PM +0100, David Gilbert wrote:
Hi Kiko,
On 5 May 2011 15:21, Christian Robottom Reis kiko@linaro.org wrote:
Hey there,
I was asked today in the board meeting about the use of NEON routines in the kernel; I said we had looked into this but hadn't done it because a) it wasn't conclusively better and b) if better, it would need to be done conditionally per-platform. But I wanted to double-check that's actually true (and I'm copying Vijay to keep me honest). I have some references:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
Yes. Internal hardware differences, apparently.
b) In general I don't believe fpu or Neon code can be used internally to the kernel.
Technically it *can*, but you'll then have to be responsible for dealing with all the extra register save/restores for context switches. Normal wisdom is that it's just not worth that cost unless you're doing an extended amount of such code (e.g. RAID block checksums using Neon).
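To make that concrete, kernel-mode Neon use would have to be bracketed roughly like the sketch below. This is purely hypothetical - ARM has no such helpers today - and is modeled on x86's kernel_fpu_begin()/kernel_fpu_end():

/*
 * Hypothetical sketch only: kernel_neon_begin()/kernel_neon_end() are
 * assumed helpers modeled on x86's kernel_fpu_begin()/kernel_fpu_end().
 * They would disable preemption and save/restore the VFP/NEON register
 * file so a context switch never sees half-saved state - that
 * save/restore is exactly the cost described above.
 */
void raid_checksum_neon(void *buf, size_t len)
{
        kernel_neon_begin();            /* assumed: save NEON state */
        do_neon_checksum(buf, len);     /* assumed NEON routine */
        kernel_neon_end();              /* assumed: restore state */
}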
Cheers,
On 5 May 2011 17:57, Steve McIntyre steve.mcintyre@linaro.org wrote:
Technically it *can*, but you'll then have to be responsible for dealing with all the extra register save/restores for context switches. Normal wisdom is that it's just not worth that cost unless you're doing an extended amount of such code (e.g. RAID block checksums using Neon).
I've always thought that all the crypto stuff in the kernel just *begs* to be vectorized -and should be, IMHO. But that's a lot of work, admittedly.
Konstantinos
David Gilbert david.gilbert@linaro.org writes:
Hi Kiko,
On 5 May 2011 15:21, Christian Robottom Reis kiko@linaro.org wrote:
Hey there,
I was asked today in the board meeting about the use of NEON routines in the kernel; I said we had looked into this but hadn't done it because a) it wasn't conclusively better and b) if better, it would need to be done conditionally per-platform. But I wanted to double-check that's actually true (and I'm copying Vijay to keep me honest). I have some references:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with use of NEON for large copies. For small copies, plain ARM might be faster since the overhead of preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
b) In general I don't believe fpu or Neon code can be used internally to the kernel.
That is true. There is currently no support for the context save and restore it would require.
http://lists.linaro.org/pipermail/linaro-toolchain/2011-January/000722.html
http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc...
http://www.spinics.net/lists/arm-kernel/msg106503.html
http://dev.gentoo.org/~armin76/arm/memcpy-neon_result.txt
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy?hig...
https://wiki.linaro.org/WorkingGroups/ToolChain/StringRoutines?highlight=%28...
There may still be potential for a non-neon optimised memcpy/memset for Cortex-A9; however, the kernel routines are pretty good.
Incidentally, this ties into the question sent earlier this week which had to do with Nico's work item in:
https://blueprints.launchpad.net/linux-linaro/+spec/other-kernel-thumb2
Which IIRC Nico says probably isn't worth it, right?
I thought dmart had done a lot of that?
I don't see the connection between Thumb2 and memcpy performance. Thumb2 can do anything 32-bit ARM can.
On 5 May 2011 16:08, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with use of NEON for large copies. For small copies, plain ARM might be faster since the overhead of preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
My tests here: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
at the bottom of the page are sets of graphs for A9 (left) and A8 (right); on A9 the Neon memcpy's (red and green) top out much lower than their non-neon best equivalents (black and cyan). I've seen different results for very non-aligned copies, where the vld/vst on Neon work very well.
Also, when I showed those numbers to the guys at ARM they all said it was a bad idea to use Neon on A9 for memory manipulation workloads.
What code do you base your claims on :-)
I don't see the connection between Thumb2 and memcpy performance. Thumb2 can do anything 32-bit ARM can.
There are the purists who say to write everything in Thumb2 now; however, there is an interesting question of which is faster, and IMHO the ARM code is likely to be a bit faster in most cases.
Dave
On May 05 2011, at 16:46, David Gilbert was caught saying:
On 5 May 2011 16:08, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with use of NEON for large copies. For small copies, plain ARM might be faster since the overhead of preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
My tests here: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
at the bottom of the page are sets of graphs for A9 (left) and A8 (right); on A9 the Neon memcpy's (red and green) top out much lower than their non-neon best equivalents (black and cyan). I've seen different results for very non-aligned copies, where the vld/vst on Neon work very well.
Looking at the top part of the page, it looks like NEON has an obvious advantage for large copies; however, I'm not sure how often we do copies of that magnitude in the kernel (I would hope rarely), and I don't know that we have numbers tracking average copy sizes for different workloads. I don't think a one-size-fits-all approach is ideal; instead we should provide both build-time and runtime configurability (something similar to the RAID code's boot-up performance tests) to allow selection of the appropriate memcpy implementation.
I don't see the connection between Thumb2 and memcpy performance. Thumb2 can do anything 32-bit ARM can.
There are the purists who say to write everything in Thumb2 now; however, there is an interesting question of which is faster, and IMHO the ARM code is likely to be a bit faster in most cases.
Do we have numbers for this? :)
~Deepak
On 5 May 2011 17:45, Deepak Saxena dsaxena@plexity.net wrote:
On May 05 2011, at 16:46, David Gilbert was caught saying:
On 5 May 2011 16:08, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with use of NEON for large copies. For small copies, plain ARM might be faster since the overhead of preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
My tests here: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
at the bottom of the page are sets of graphs for A9 (left) and A8 (right); on A9 the Neon memcpy's (red and green) top out much lower than their non-neon best equivalents (black and cyan). I've seen different results for very non-aligned copies, where the vld/vst on Neon work very well.
Looking at the top part of the page, it looks like NEON has an obvious advantage for large copies; however, I'm not sure how often we do copies of that magnitude in the kernel (I would hope rarely), and I don't know that we have numbers tracking average copy sizes for different workloads. I don't think a one-size-fits-all approach is ideal; instead we should provide both build-time and runtime configurability (something similar to the RAID code's boot-up performance tests) to allow selection of the appropriate memcpy implementation.
The top part of the page is A8. The graphs at the bottom of the page go up to 256k (log scale), so they do cover the large case, and you can see that after the cliff where it drops out of the cache the non-neon is still winning for A9.
If people believe it's worth breaking the context-switching taboo and putting a neon version into the kernel then yes I agree it's something you'd want to do as a build and/or runtime selection - but that's quite a big taboo to break.
I don't see the connection between Thumb2 and memcpy performance. Thumb2 can do anything 32-bit ARM can.
There are the purists who say to write everything in Thumb2 now; however, there is an interesting question of which is faster, and IMHO the ARM code is likely to be a bit faster in most cases.
Do we have numbers for this? :)
Hmm - not for memcpy specifically; I think we have some benchmark figures showing that thumb gcc output is slower than arm gcc output in most cases (as expected). My belief is that the icache advantage of thumb code should be pretty much irrelevant for one or two small routines like memcpy, and the addition of the IT instructions and the reduced flexibility of condition codes has got to hurt somewhere - but I haven't tried the code built both ways.
Dave
David Gilbert david.gilbert@linaro.org writes:
On 5 May 2011 17:45, Deepak Saxena dsaxena@plexity.net wrote:
On May 05 2011, at 16:46, David Gilbert was caught saying:
On 5 May 2011 16:08, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with use of NEON for large copies. For small copies, plain ARM might be faster since the overhead of preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
My tests here: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
at the bottom of the page are sets of graphs for A9 (left) and A8 (right); on A9 the Neon memcpy's (red and green) top out much lower than their non-neon best equivalents (black and cyan). I've seen different results for very non-aligned copies, where the vld/vst on Neon work very well.
Looking at the top part of the page, it looks like NEON has an obvious advantage for large copies; however, I'm not sure how often we do copies of that magnitude in the kernel (I would hope rarely), and I don't know that we have numbers tracking average copy sizes for different workloads. I don't think a one-size-fits-all approach is ideal; instead we should provide both build-time and runtime configurability (something similar to the RAID code's boot-up performance tests) to allow selection of the appropriate memcpy implementation.
The top part of the page is A8. The graphs at the bottom of the page go up to 256k (log scale), so they do cover the large case, and you can see that after the cliff where it drops out of the cache the non-neon is still winning for A9.
That is still well within the OMAP4 L2 cache (1MB) and the same size as the OMAP3 L2. It would have been interesting to extend the graphs up to 8MB or so to ensure the caches become mostly irrelevant.
If people believe it's worth breaking the context-switching taboo and putting a neon version into the kernel then yes I agree it's something you'd want to do as a build and/or runtime selection - but that's quite a big taboo to break.
I doubt it's worth the trouble, even if NEON is faster. The kernel shouldn't be doing much copying, certainly not of huge blocks at a time.
On Thu, 5 May 2011, Måns Rullgård wrote:
David Gilbert david.gilbert@linaro.org writes:
On 5 May 2011 17:45, Deepak Saxena dsaxena@plexity.net wrote:
On May 05 2011, at 16:46, David Gilbert was caught saying:
On 5 May 2011 16:08, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with use of NEON for large copies. For small copies, plain ARM might be faster since the overhead of preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
My tests here: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
at the bottom of the page are sets of graphs for A9 (left) and A8 (right); on A9 the Neon memcpy's (red and green) top out much lower than their non-neon best equivalents (black and cyan). I've seen different results for very non-aligned copies, where the vld/vst on Neon work very well.
Looking at the top part of the page, it looks like NEON has an obvious advantage for large copies; however, I'm not sure how often we do copies of that magnitude in the kernel (I would hope rarely), and I don't know that we have numbers tracking average copy sizes for different workloads. I don't think a one-size-fits-all approach is ideal; instead we should provide both build-time and runtime configurability (something similar to the RAID code's boot-up performance tests) to allow selection of the appropriate memcpy implementation.
The top part of the page is A8. The graphs at the bottom of the page go up to 256k (log scale), so they do cover the large case, and you can see that after the cliff where it drops out of the cache the non-neon is still winning for A9.
That is still well within the OMAP4 L2 cache (1MB) and the same size as the OMAP3 L2. It would have been interesting to extend the graphs up to 8MB or so to ensure the caches become mostly irrelevant.
Please look at the subject line above again.
If you do perform 8MB memcpy calls in kernel space you have a bigger problem.
Nicolas
On Thu, 5 May 2011, David Gilbert wrote:
If people believe it's worth breaking the context-switching taboo and putting a neon version into the kernel then yes I agree it's something you'd want to do as a build and/or runtime selection - but that's quite a big taboo to break.
There is no taboo. Only numbers.
The cost of using Neon in the kernel is non-negligible. It is also hard to measure, as it depends on the actual Neon usage simultaneously happening in user space or in other concurrent kernel contexts. This is not something that a dedicated benchmark can evaluate.
There _are_ cases for Neon to be used in the kernel, i.e. those where the initial cost is offset by the gain. The first that comes to mind is crypto of course. But there are also simple things like CRC32, which is used all over the place by BTRFS for example. And that is the actual test case I think we should focus our efforts on, given that BTRFS is going to be the next major filesystem on Linux. Last time I tried BTRFS on ARM, the CRC32 computation was dominating CPU usage big time. CRC32 is easy to understand, easy to validate, and will provide the right reason for creating the needed infrastructure to manipulate the Neon context in kernel space. Once that's in place we could move on to other targets such as crypto, which is already complex enough without having to bother with the Neon context handling.
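For concreteness, the scalar reference is tiny - a minimal sketch of bit-at-a-time CRC32C (the Castagnoli polynomial BTRFS uses); real implementations are table-driven, and a Neon version would have to beat those:

#include <stdint.h>
#include <stddef.h>

/* Minimal bit-at-a-time CRC32C (reflected Castagnoli polynomial
   0x82F63B78). This is only the reference an optimized table or
   Neon version must match, not production code. */
uint32_t crc32c(uint32_t crc, const uint8_t *buf, size_t len)
{
        int k;

        crc = ~crc;
        while (len--) {
                crc ^= *buf++;
                for (k = 0; k < 8; k++)
                        crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
        }
        return ~crc;
}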
The memcpy case is not interesting. Not at all. Most kernel memcpy calls are for small size copies. The large copy instances are just bad and misdesigned in the first place if they rely on memcpy (maybe they should simply have a custom copy function, maybe implemented with Neon). And I doubt the small memcpy's are going to gain anything from Neon. Even on X86 they don't do it, while they do have a CRC32 function using SSE2. Maybe we could use Neon for copy_page() which is one of those custom bulk copy functions, but I've never seen memcpy() in kernel space show up on any profile.
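A Neon copy_page() could be as simple as this sketch (userspace intrinsics shown; a 4096-byte page and 16-byte alignment are assumptions, and the in-kernel version would still need the context handling above):

#include <arm_neon.h>
#include <stdint.h>

/* Sketch: bulk page copy with 16-byte Neon loads/stores. */
void copy_page_neon(void *dst, const void *src)
{
        const uint8_t *s = src;
        uint8_t *d = dst;
        int i;

        for (i = 0; i < 4096; i += 16)
                vst1q_u8(d + i, vld1q_u8(s + i));
}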
Nicolas
On 5 May 2011 18:59, Nicolas Pitre nicolas.pitre@linaro.org wrote:
On Thu, 5 May 2011, David Gilbert wrote:
If people believe it's worth breaking the context-switching taboo and putting a neon version into the kernel then yes I agree it's something you'd want to do as a build and/or runtime selection - but that's quite a big taboo to break.
There is no taboo. Only numbers.
The cost of using Neon in the kernel is non-negligible. It is also hard to measure, as it depends on the actual Neon usage simultaneously happening in user space or in other concurrent kernel contexts. This is not something that a dedicated benchmark can evaluate.
Agreed, and that's also why it's partly a taboo; if it's only numbers, but numbers based on some set of benchmarks that no two people are going to agree on, it's very difficult. It would have to show a good win on something complex and well agreed upon.
There _are_ cases for Neon to be used in the kernel, i.e. those where the initial cost is offset by the gain. The first that comes to mind is crypto of course. But there are also simple things like CRC32, which is used all over the place by BTRFS for example. And that is the actual test case I think we should focus our efforts on, given that BTRFS is going to be the next major filesystem on Linux. Last time I tried BTRFS on ARM, the CRC32 computation was dominating CPU usage big time. CRC32 is easy to understand, easy to validate, and will provide the right reason for creating the needed infrastructure to manipulate the Neon context in kernel space. Once that's in place we could move on to other targets such as crypto, which is already complex enough without having to bother with the Neon context handling.
Yes - while I've not actually looked at coding CRC32 or the crypto things, I agree that they feel like they have much more room to work with; it's outside the scope of what I was asked to look at, however.
The memcpy case is not interesting. Not at all. Most kernel memcpy calls are for small size copies. The large copy instances are just bad and misdesigned in the first place if they rely on memcpy (maybe they should simply have a custom copy function, maybe implemented with Neon).
Even outside the kernel vast memcpy's are fairly rare as far as I can tell - everyone knows they're going to hurt so people try and avoid them; the other thing is that people have been optimising ARM memcpy for decades and it appears to me to be hitting cache/bus bandwidths somewhere (although I don't have any figures for what those bandwidths are) - there may be some scope for optimising the smaller memcpy cases (e.g. taking advantage of things like the newer cbz to cut a few instructions out) - from my graphs the slope up to the point at which the non-neon code plateaus is quite gradual, which suggests it might be possible to optimise it a bit. (Oddly, the one case where my graph shows the neon winning is in small (~32 byte) cases, where it's almost certainly not worth the pain in the kernel of protecting the context switch.)
And I doubt the small memcpy's are going to gain anything from Neon. Even on X86 they don't do it, while they do have a CRC32 function using SSE2. Maybe we could use Neon for copy_page() which is one of those custom bulk copy functions, but I've never seen memcpy() in kernel space show up on any profile.
Dave
David Gilbert david.gilbert@linaro.org writes:
The memcpy case is not interesting. Not at all. Most kernel memcpy calls are for small size copies. The large copy instances are just bad and misdesigned in the first place if they rely on memcpy (maybe they should simply have a custom copy function, maybe implemented with Neon).
Even outside the kernel vast memcpy's are fairly rare as far as I can tell - everyone knows they're going to hurt so people try and avoid them;
If only that were true. I have long since lost count of the times I have (in vain) told people to lose the memcpys in order to improve performance.
On Thu, 5 May 2011, Måns Rullgård wrote:
David Gilbert david.gilbert@linaro.org writes:
The memcpy case is not interesting. Not at all. Most kernel memcpy calls are for small size copies. The large copy instances are just bad and misdesigned in the first place if they rely on memcpy (maybe they should simply have a custom copy function, maybe implemented with Neon).
Even outside the kernel vast memcpy's are fairly rare as far as I can tell - everyone knows they're going to hurt so people try and avoid them;
If only that were true. I have long since lost count of the times I have (in vain) told people to lose the memcpys in order to improve performance.
Keep on doing it. No amount of memcpy optimization will ever beat the performance of zero copy.
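The difference is structural, as in this sketch (names illustrative):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct buf_view {
        const uint8_t *data;
        size_t len;
};

/* Copying consumer: an allocation plus touching every byte. */
uint8_t *consume_copy(const uint8_t *data, size_t len)
{
        uint8_t *p = malloc(len);
        if (p)
                memcpy(p, data, len);
        return p;
}

/* Zero-copy consumer: O(1) - just records where the data already
   lives, no matter how large the payload is. */
struct buf_view consume_view(const uint8_t *data, size_t len)
{
        struct buf_view v = { data, len };
        return v;
}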
Nicolas
On Thu, 5 May 2011, David Gilbert wrote:
Yes - while I've not actually looked at coding CRC32 or the crypto things, I agree that they feel like they have much more room to work with; it's outside the scope of what I was asked to look at, however.
Well, you said that the current memcpy code in the kernel is quite good, which is nice not only because I wrote it :-) but it might also indicate that the Neon optimization efforts would have a bigger return on investment elsewhere.
The memcpy case is not interesting. Not at all. Most kernel memcpy calls are for small size copies. The large copy instances are just bad and misdesigned in the first place if they rely on memcpy (maybe they should simply have a custom copy function, maybe implemented with Neon).
Even outside the kernel vast memcpy's are fairly rare as far as I can tell - everyone knows they're going to hurt so people try and avoid them; the other thing is that people have been optimising ARM memcpy for decades and it appears to me to be hitting cache/bus bandwidths somewhere (although I don't have any figures for what those bandwidths are) - there may be some scope for optimising the smaller memcpy cases (e.g. taking advantage of things like the newer cbz to cut a few instructions out) - from my graphs the slope up to the point at which the non-neon code plateaus is quite gradual, which suggests it might be possible to optimise it a bit.
Indeed.
Nicolas
David Gilbert david.gilbert@linaro.org writes:
On 5 May 2011 16:08, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with use of NEON for large copies. For small copies, plain ARM might be faster since the overhead of preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
My tests here: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
At the top of the page: "Do not rely on or use the numbers."
at the bottom of the page are sets of graphs for A9 (left) and A8 (right); on A9 the Neon memcpy's (red and green) top out much lower than their non-neon best equivalents (black and cyan).
That page is rather fuzzy on exactly what code was being tested as well as how the tests were performed. Without some actual code with which one can reproduce the results, those figures should not be used as a basis for any decisions.
Also, when I showed those numbers to the guys at ARM they all said it was a bad idea to use Neon on A9 for memory manipulation workloads.
I have heard many claims passed around concerning memcpy on A9, none of which I have been able to reproduce myself. Some allegedly came from people at ARM.
What code do you base your claims on :-)
My own testing wherein the Bionic NEON memcpy vastly outperformed both glibc and Bionic ARMv5 memcpy.
I don't see the connection between Thumb2 and memcpy performance. Thumb2 can do anything 32-bit ARM can.
There are the purists who say to write everything in Thumb2 now; however, there is an interesting question of which is faster, and IMHO the ARM code is likely to be a bit faster in most cases.
Code with many conditional instructions may be faster in ARM mode since it avoids the IT instructions. Other than that I don't see why it should matter. The instruction prefetching should make any misalignment of 32-bit instructions irrelevant. If anything, the usually smaller Thumb2 code should decrease I-cache pressure and increase performance.
On 5 May 2011 18:17, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
On 5 May 2011 16:08, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with use of NEON for large copies. For small copies, plain ARM might be faster since the overhead of preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
My tests here: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
At the top of the page: "Do not rely on or use the numbers."
That's OK; we put that in there since we're still experimenting.
at the bottom of the page are sets of graphs for A9 (left) and A8 (right); on A9 the Neon memcpy's (red and green) top out much lower than their non-neon best equivalents (black and cyan).
That page is rather fuzzy on exactly what code was being tested as well as how the tests were performed. Without some actual code with which one can reproduce the results, those figures should not be used as a basis for any decisions.
I'm happy to post my test harness; I've copied and pasted the main memcpy speed test below. Give me a day or two and I can clean the whole thing up to run stand-alone.
Also, when I showed those numbers to the guys at ARM they all said it was a bad idea to use Neon on A9 for memory manipulation workloads.
I have heard many claims passed around concerning memcpy on A9, none of which I have been able to reproduce myself. Some allegedly came from people at ARM.
What code do you base your claims on :-)
My own testing wherein the Bionic NEON memcpy vastly outperformed both glibc and Bionic ARMv5 memcpy.
Can you provide me with some actual results? You seem to be disputing my actual numbers - which agree with the comments from the guys at ARM - with an argument saying you have seen the opposite; I'm happy to believe that, but I'd like to understand why.
Note also the graphs I produced for memset show similar behaviour: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemset
in this case memset on both A9s is slower in the Neon case than the non-Neon.
I don't see the connection between Thumb2 and memcpy performance. Thumb2 can do anything 32-bit ARM can.
There are the purists who say to write everything in Thumb2 now; however, there is an interesting question of which is faster, and IMHO the ARM code is likely to be a bit faster in most cases.
Code with many conditional instructions may be faster in ARM mode since it avoids the IT instructions. Other than that I don't see why it should matter. The instruction prefetching should make any misalignment of 32-bit instructions irrelevant. If anything, the usually smaller Thumb2 code should decrease I-cache pressure and increase performance.
As per my comment in a separate mail, I don't see that the i-cache pressure should be relevant for a small core routine.
Here is my memcpy harness; feel free to point out any silly mistakes!
/* Enough declarations to build this stand-alone (link with -lrt for
   clock_gettime on older glibc). memcpyfunc matches the standard
   memcpy signature; local_memset is one of my own routines - any
   memset would do here. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stddef.h>
#include <time.h>
#include <sys/mman.h>

typedef void *(*memcpyfunc)(void *dst, const void *src, size_t n);
extern void *local_memset(void *s, int c, size_t n);

void memcpyspeed(memcpyfunc mcf, const char *id,
                 unsigned int srcalign, unsigned int dstalign)
{
    const size_t largesttest = 256 * 1024;      /* Power of 2 */
    const unsigned long loopsforlargest = 2048; /* Number of loops to use
                                                   for largest test */
    const int sizeadd[] = { 0, 1, 3, 4, 7, 15, 31, -1 };
    struct timespec tbefore, tafter;
    size_t testsize;
    unsigned long numloops;
    unsigned int sizeoffset;

    /* I'm assuming the mmap will conveniently give us something page
       aligned if the allocation size is reasonably large */
    void *srcbase, *dstbase;
    srcbase = mmap(NULL, largesttest + 8192, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (MAP_FAILED == srcbase)
        perror("memcpyspeed: Failed to mmap src area");

    dstbase = mmap(NULL, largesttest + 8192, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (MAP_FAILED == dstbase)
        perror("memcpyspeed: Failed to mmap dst area");

    /* Wipe the mapped areas through just to force them to be allocated */
    local_memset(srcbase, '.', largesttest + 8192);
    local_memset(dstbase, '*', largesttest + 8192);

    /* Run over the range of sizes, adjusting the number of loops to keep
       the same amount of memory accessed */
    for (testsize = largesttest, numloops = loopsforlargest;
         testsize > 0;
         testsize /= 2, numloops *= 2) {
        /* For stuff larger than 32 try a few odd sizes */
        for (sizeoffset = 0;
             ((testsize >= 32) && (sizeadd[sizeoffset] >= 0)) ||
             ((testsize < 32) && (sizeoffset == 0));
             sizeoffset++) {
            unsigned long l;
            double nsdiff;
            double mbtransferred;

            clock_gettime(CLOCK_REALTIME, &tbefore);

            for (l = 0; l < numloops; l++) {
                mcf((char *)dstbase + dstalign,
                    (char *)srcbase + srcalign,
                    testsize + sizeadd[sizeoffset]);
            }

            clock_gettime(CLOCK_REALTIME, &tafter);

            nsdiff = (double)(tafter.tv_nsec - tbefore.tv_nsec);
            nsdiff += 1000000000.0 * (tafter.tv_sec - tbefore.tv_sec);
            /* 2x is because it's a copy */
            mbtransferred = 2.0 * ((double)(testsize + sizeadd[sizeoffset]) *
                                   (double)numloops) / (1024.0 * 1024.0);

            printf("%s-s%u-d%u: ,%lu, loops of ,%lu, bytes=%lf MB, "
                   "transferred in ,%lf ns, giving, %lf MB/s\n",
                   id, srcalign, dstalign, numloops,
                   (unsigned long)(testsize + sizeadd[sizeoffset]),
                   mbtransferred, nsdiff,
                   (1000000000.0 * mbtransferred) / nsdiff);
        }
    }

    /* Unmap the full lengths that were mapped */
    munmap(srcbase, largesttest + 8192);
    munmap(dstbase, largesttest + 8192);
}
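A trivial driver for it, using the typedef above, would be something like:

#include <string.h>

/* Time the libc memcpy at two source alignments; any routine with
   the standard memcpy signature can be passed in the same way. */
int main(void)
{
    memcpyspeed(memcpy, "libc", 0, 0);  /* aligned src and dst */
    memcpyspeed(memcpy, "libc", 1, 0);  /* misaligned source */
    return 0;
}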
Dave
David Gilbert david.gilbert@linaro.org writes:
On 5 May 2011 18:17, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
On 5 May 2011 16:08, Måns Rullgård mans@mansr.com wrote:
David Gilbert david.gilbert@linaro.org writes:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with use of NEON for large copies. For small copies, plain ARM might be faster since the overhead of preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
My tests here: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
At the top of the page: "Do not rely on or use the numbers."
That's OK; we put that in there since we're still experimenting.
at the bottom of the page are sets of graphs for A9 (left) and A8 (right); on A9 the Neon memcpy's (red and green) top out much lower than their non-neon best equivalents (black and cyan).
That page is rather fuzzy on exactly what code was being tested as well as how the tests were performed. Without some actual code with which one can reproduce the results, those figures should not be used as a basis for any decisions.
I'm happy to post my test harness; I've copied and pasted the main memcpy speed test below. Give me a day or two and I can clean the whole thing up to run stand-alone.
Thanks. It's easier to have a meaningful discussion when the details are known.
Also, when I showed those numbers to the guys at ARM they all said it was a bad idea to use Neon on A9 for memory manipulation workloads.
I have heard many claims passed around concerning memcpy on A9, none of which I have been able to reproduce myself. Some allegedly came from people at ARM.
What code do you base your claims on :-)
My own testing wherein the Bionic NEON memcpy vastly outperformed both glibc and Bionic ARMv5 memcpy.
Can you provide me with some actual results? You seem to be disputing my actual numbers - which agree with the comments from the guys at ARM - with an argument saying you have seen the opposite; I'm happy to believe that, but I'd like to understand why.
Note also the graphs I produced for memset show similar behaviour: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemset
in this case memset on both A9s is slower in the Neon case than the non-Neon.
The relative performance of NEON vs non-NEON seems to depend a lot on the size (relative to cache), alignment, and whether or not any prefetching (explicit PLD, automatic, or preload engine) is used. For large copies (much larger than L2) NEON with prefetching wins in my testing (don't have numbers handy right now). For already cached data, things may be different. The A8 is also special with the fast path between L2 and NEON which the A9 lacks for obvious reasons.
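(For anyone unfamiliar, "explicit PLD" means a copy loop shaped roughly like the sketch below; GCC's __builtin_prefetch generally lowers to PLD on ARM, and the 64-byte distance is a tuning assumption, not a measured optimum:)

#include <stddef.h>
#include <stdint.h>

/* Sketch of a byte copy with explicit prefetch ahead of the read
   stream. Real routines copy in much larger units; only the
   prefetch pattern is the point here. */
void copy_with_pld(uint8_t *dst, const uint8_t *src, size_t n)
{
        size_t i;

        for (i = 0; i < n; i++) {
                if ((i & 63) == 0)
                        __builtin_prefetch(src + i + 64);
                dst[i] = src[i];
        }
}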
I have also observed significant variation depending on the relative alignment of source and destination buffers, probably some cache effect.
Any claim of X being faster than Y should also specify for which sizes the claim is valid. I have previously looked mostly at large copies, while you seem to be focused on small sizes. That is probably the reason for our different experiences. I'll try to be more specific from now on.
On 5 May 2011 18:44, Måns Rullgård mans@mansr.com wrote:
The relative performance of NEON vs non-NEON seems to depend a lot on the size (relative to cache), alignment, and whether or not any prefetching (explicit PLD, automatic, or preload engine) is used.
Yes, agreed - Neon does very well in non-aligned cases (I have some graphs for non-aligned cases but am still working on them). I'd been wondering about using Neon only for the non-aligned cases.
For large copies (much larger than L2) NEON with prefetching wins in my testing (don't have numbers handy right now).
OK, I've not tried stuff larger than about 256k chunks - I'm a) not sure it's really common to have really big copies and b) with the larger L2 caches it seems less and less likely the copies will be larger than them. Of course that's all down to workload and exactly the cases you care about, etc., which is very difficult to pin down.
For already cached data, things may be different. The A8 is also special with the fast path between L2 and NEON which the A9 lacks for obvious reasons.
So the thing I don't yet understand is whether A8 is special or A9 is special; if A8 is special, and the behaviour A9 currently shows is what will happen on other cores in the future, then fine; if A9 is special, then Neon may well be good in the future.
I have also observed significant variation depending on the relative alignment of source and destination buffers, probably some cache effect.
Any claim of X being faster than Y should also specify for which sizes claim is valid. I have previously looked mostly at large copies, while you seem to be focused on small sizes. That is probably the reason for our different experiences. I'll try to be more specific from now on.
It also gets very difficult to present - I have one set of data which is basically that set of graphs but iterated over different source and destination alignments - and that's just a sea of graphs!
Dave
David Gilbert david.gilbert@linaro.org writes:
On 5 May 2011 18:44, Måns Rullgård mans@mansr.com wrote:
The relative performance of NEON vs non-NEON seems to depend a lot on the size (relative to cache), alignment, and whether or not any prefetching (explicit PLD, automatic, or preload engine) is used.
Yes, agreed - Neon does very well in non-aligned cases (I have some graphs for non-aligned cases but am still working on it). I'd been wondering about using Neon only for the non-aligned cases.
For large copies (much larger than L2) NEON with prefetching wins in my testing (don't have numbers handy right now).
OK, I've not tried stuff larger than about 256k chunks - I'm a) not sure it's really common to have really big copies and
Not in the kernel at least, and that's the focus of this email, if not (at least not obviously) the wiki page.
b) with the larger L2 caches it seems less and less likely the copies will be larger than them. Of course that's all down to workload and exactly the cases you care about, etc., which is very difficult to pin down.
What really matters is whether source, destination, or both are already in some cache. Even a small copy of data currently not in any cache will perform similarly to a large one. Really tiny copies are another matter, but they should probably be inlined anyway.
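(For a fixed tiny size the compiler already inlines the copy - e.g. both of these typically become a couple of loads and stores on ARM, with no memcpy call emitted:)

#include <stdint.h>
#include <string.h>

struct pair { uint32_t a, b; };

void tiny_assign(struct pair *d, const struct pair *s)
{
        *d = *s;                        /* struct assignment */
}

void tiny_memcpy(struct pair *d, const struct pair *s)
{
        memcpy(d, s, sizeof(*d));       /* constant-size memcpy */
}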
For already cached data, things may be different. The A8 is also special with the fast path between L2 and NEON which the A9 lacks for obvious reasons.
So the thing I don't yet understand is whether A8 is special or A9 is special; if A8 is special, and the behaviour A9 currently shows is what will happen on other cores in the future, then fine; if A9 is special, then Neon may well be good in the future.
I think both are special in different ways. If I were to guess, I'd say the direct L2 path of the A8 is unlikely to show up again, whereas some of the weak points of the A9 (e.g. 64-bit data paths vs 128-bit in A8) will probably go away in some future core.
On Thu, May 05, 2011 at 04:08:01PM +0100, Måns Rullgård wrote:
Incidentally, this ties into the question sent earlier this week which had to do with Nico's work item in:
https://blueprints.launchpad.net/linux-linaro/+spec/other-kernel-thumb2
Which IIRC Nico says probably isn't worth it, right?
I thought dmart had done a lot of that?
I don't see the connection between Thumb2 and memcpy performance. Thumb2 can do anything 32-bit ARM can.
Well, the work item above is also about providing optimized memory routines that come out of the TCWG; if NEON isn't interesting, are any of the optimized Thumb2 versions that the toolchain team worked on worth looking at?
2011/5/6 Christian Robottom Reis kiko@linaro.org:
On Thu, May 05, 2011 at 04:08:01PM +0100, Måns Rullgård wrote:
Incidentally, this ties into the question sent earlier this week which had to do with Nico's work item in:
https://blueprints.launchpad.net/linux-linaro/+spec/other-kernel-thumb2
Which IIRC Nico says probably isn't worth it, right?
I thought dmart had done a lot of that?
I don't see the connection between Thumb2 and memcpy performance. Thumb2 can do anything 32-bit ARM can.
Well, the work item above is also about providing optimized memory routines that come out of the TCWG; if NEON isn't interesting, are any of the optimized Thumb2 versions that the toolchain team worked on worth looking at?
I don't think there are that many things that are vastly useful for the kernel, but here is a summary. (I intend to write a full report at some point, but am still fighting SPEC for some benchmark stats and some of the corner cases of these routines.)
Note also that these graphs were put together as I was working on the routines and aren't consistently from the same machine/libc etc. - when I write the report up I'll gather a full, consistent set.
memset: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemset
On A9 the kernel's memset is pretty good - at around 64 bytes or so it beats everything else, and it is at the top at larger sizes on the Exynos I tried it on (on a 400 MHz Vexpress there were some points where my own implementation beat it). On A8, interestingly, the Neon version I wrote is really much faster than the ARM versions - maybe this is actually worth a try for page clearing, even with the context-switching costs?
Note that the kernel's memset takes a shortcut by not returning the correct result as per the C spec, which probably helps it in the short cases (and frankly I don't think anyone actually ever cares about the return value).
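(The shortcut being: per the C spec, memset must hand back its first argument, as in this reference version; skipping that saves preserving a register on the short paths:)

#include <stddef.h>

/* Reference semantics only: a C-spec-conforming memset returns its
   first argument. */
void *memset_ref(void *s, int c, size_t n)
{
        unsigned char *p = s;

        while (n--)
                *p++ = (unsigned char)c;
        return s;
}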
strlen: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrlen
I've got a nice fast strlen that uses uadd8 - it's only really of benefit, though, if there are lots of longer strings.
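(uadd8 is essentially a per-byte zero test done four bytes at a time; the portable C analogue of the trick looks like the sketch below - the real routine is assembly and handles the unaligned head more cleverly:)

#include <stddef.h>
#include <stdint.h>

/* Nonzero iff some byte of x is zero (classic word trick). */
#define HAS_ZERO_BYTE(x) ((((x) - 0x01010101u) & ~(x)) & 0x80808080u)

size_t strlen_words(const char *s)
{
        const char *p = s;
        const uint32_t *w;

        while ((uintptr_t)p & 3)        /* byte-wise until aligned */
                if (*p++ == '\0')
                        return p - s - 1;
        for (w = (const uint32_t *)p; !HAS_ZERO_BYTE(*w); w++)
                ;                       /* four bytes per iteration */
        for (p = (const char *)w; *p; p++)
                ;                       /* locate the exact byte */
        return p - s;
}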
strchr: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr
I've got two strchr implementations; one is the absolutely simplest one I could write (but taking advantage of a modern cbz), and another also uses uadd8 to do a few bytes per iteration. The really simple one is at least as good as the libc and kernel ones for short (<50 byte) cases and no worse on A9 for longer cases; on A8 the more complex libc one wins out after about 16 characters. The uadd8 version is much faster on longer strchrs, but I think those are so rare it's not worth it.
The 'simple strchr' is now in Ubuntu Natty's eglibc.
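(In C terms the really simple one is just this shape - the shipped version is assembly, with the two exit tests mapping naturally onto cbz:)

#include <stddef.h>

/* One byte per iteration; matches the terminator too, as strchr
   must when c == '\0'. */
char *strchr_simple(const char *s, int c)
{
        for (;; s++) {
                if (*s == (char)c)
                        return (char *)s;
                if (*s == '\0')
                        return NULL;
        }
}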
memchr: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemchr
Memchr is my best performance win - though it's probably not that heavily used; again it uses uadd8 (you can tell I like that) and it's much faster on longer runs. Where the length parameter is small it falls back to a simple loop; its worst case is where you pass a large block of memory (and hence it uses the more complex loop) but the result is found in the first few bytes.
This is in Ubuntu Natty's eglibc.
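(Structurally it looks like the sketch below; the real version does the wide test with uadd8, and the 16-byte threshold here is illustrative rather than the tuned value:)

#include <stddef.h>
#include <stdint.h>

void *memchr_sketch(const void *src, int c, size_t n)
{
        const unsigned char *p = src;
        const unsigned char t = (unsigned char)c;

        if (n >= 16) {
                const uint32_t rep = 0x01010101u * t;
                const uint32_t *w;

                while (((uintptr_t)p & 3) && n) {       /* align head */
                        if (*p == t)
                                return (void *)p;
                        p++;
                        n--;
                }
                for (w = (const uint32_t *)p; n >= 4; w++, n -= 4) {
                        uint32_t x = *w ^ rep;  /* zero byte where *p == t */
                        if ((x - 0x01010101u) & ~x & 0x80808080u)
                                break;
                }
                p = (const unsigned char *)w;
        }
        for (; n; n--, p++)             /* simple loop / tail */
                if (*p == t)
                        return (void *)p;
        return NULL;
}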
memcpy: Updated! https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
As discussed previously, see the memcpy charts above - I've added a new 2x2 set at the bottom comparing aligned/misaligned (only by 1 byte), and also added a non-neon memcpy I just wrote.
My non-neon memcpy is similar to the kernel/libc's - it's a bit less smart about copying n*32+1 bytes (which are the spiky bits you can see) but seems a little faster at the start and end of the ranges - nothing really to distinguish it. (It doesn't know about co-misaligned yet - as in both source and dest misaligned by 1 byte - which you can see in the lower-right graph.) It is, however, abysmal in the non-aligned case - hint: don't bother taking advantage of v7's non-aligned load/stores.
For non-aligned copies Neon wins - one of mine or Bionic's Neon routines; mine seems to prefer a non-aligned source, Bionic's a non-aligned destination - and Bionic's really drops off when it runs out of cache.
(I have a run cooking at the moment with a much wider set of misalignments but it takes ages)
Dave
On 6 May 2011 19:57, David Gilbert david.gilbert@linaro.org wrote:
2011/5/6 Christian Robottom Reis kiko@linaro.org:
I don't think there are that many things that are vastly useful for the kernel, but here is a summary. (I intend to write a full report at some point, but am still fighting SPEC for some benchmark stats and some of the corner cases of these routines.)
Along those lines, I'm going to again refer to Altivec use in other kernels/OSes, perhaps it might give some ideas to people:
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/arch/powerpc/oea/altivec.c?rev=1...
In NetBSD/powerpc, Altivec is used for VM page zeroing and copying. I don't really know the speed gains, but given that Altivec's bus to the L2 cache was 128 bits wide (compared to the 32 bits of the integer unit), I guess it was probably worth incorporating such a change. (No, I don't have NetBSD installed, but I did code Altivec for 5 years so I know some things in detail.)
I'm just mentioning this without knowing if this is actually possible or relevant in the Linux kernel. If it isn't, just ignore my post :)
Regards
Konstantinos
Hi,
On Thu, May 05, 2011 at 03:47:08PM +0100, David Gilbert wrote:
Hi Kiko,
On 5 May 2011 15:21, Christian Robottom Reis kiko@linaro.org wrote:
Hey there,
I was asked today in the board meeting about the use of NEON routines in the kernel; I said we had looked into this but hadn't done it because a) it wasn't conclusively better and b) if better, it would need to be done conditionally per-platform. But I wanted to double-check that's actually true (and I'm copying Vijay to keep me honest). I have some references:
Not quite:
a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
b) In general I don't believe fpu or Neon code can be used internally to the kernel.
Dave
http://lists.linaro.org/pipermail/linaro-toolchain/2011-January/000722.html
http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc...
http://www.spinics.net/lists/arm-kernel/msg106503.html
http://dev.gentoo.org/~armin76/arm/memcpy-neon_result.txt
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy?hig...
https://wiki.linaro.org/WorkingGroups/ToolChain/StringRoutines?highlight=%28...
There may still be potential for a non-neon optimised memcpy/memset for Cortex-A9; however, the kernel routines are pretty good.
One important thing to observe is that NEON is, first and foremost, a computation engine. It isn't specifically designed for speeding up bulk memory copies, so this probably isn't the first thing we should focus on if we want to make a case for using NEON in the kernel.
Conversely, targeting NEON use at computational tasks is likely to deliver much more consistent gains.
Secondly, VFP/NEON context switch overheads will tend towards the worst case if NEON is used for memcpy(), simply because memcpy is used very often. Microbenchmarks of core memcpy performance don't inform us about such system-level effects. We'd need metrics for the cost and frequency of those context switches to get a better idea of the impact. Even so, the ideal tradeoff may not be the same on all platforms.
Some fruitful work therefore might involve:
* Create infrastructure to allow NEON/VFP to be used in kernel-space (other architectures provide an example of how this can be done).
* Add instrumentation to gather metrics on the context switching behaviour and cost.
* Port some no-brainer functionality (such as CRC32) to use NEON, instrument and benchmark as appropriate.
These will allow a properly quantified case to be presented to upstream: if a clear benefit is demonstrated, I doubt that "taboos" will present too much of an obstacle.
Needless to say, any benchmarking should be done on multiple platforms, at least A8 and A9.
Once the above work is done, we have the option to add memcpy to the mix -- however, as discussed in this thread, this isn't a no-brainer everywhere and has subtleties; so it's probably best kept orthogonal from the tasks above.
The above work is not currently in the plan for 11.11, so if we want any of it to happen we will need to take account of it in the planning.
Incidentally, this ties into the question sent earlier this week which had to do with Nico's work item in:
https://blueprints.launchpad.net/linux-linaro/+spec/other-kernel-thumb2
Which IIRC Nico says probably isn't worth it, right?
I thought dmart had done a lot of that?
The NEON task was never really in my queue: its presence in the Thumb-2 blueprint seems a bit strange actually. I believe there was no significant work done on this in the 10.05 cycle.
Cheers ---Dave
On 5 May 2011 17:21, Christian Robottom Reis kiko@linaro.org wrote:
Hey there,
I was asked today in the board meeting about the use of NEON routines in the kernel; I said we had looked into this but hadn't done it because a) it wasn't conclusively better and b) if better, it would need to be done conditionally per-platform. But I wanted to double-check that's actually true (and I'm copying Vijay to keep me honest). I have some references:
http://lists.linaro.org/pipermail/linaro-toolchain/2011-January/000722.html
http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc...
http://www.spinics.net/lists/arm-kernel/msg106503.html
http://dev.gentoo.org/~armin76/arm/memcpy-neon_result.txt
https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy?hig...
https://wiki.linaro.org/WorkingGroups/ToolChain/StringRoutines?highlight=%28...
Back in 2003-2004, IIRC, Freescale played with the idea of using Altivec (the PowerPC SIMD engine) inside the kernel, and published a paper on it:
http://cache.freescale.com/files/32bit/doc/app_note/AN2581.pdf
All of it is a good read, but for the hasty, I'd suggest skipping to section 3.3: in essence it says that, due to potential problems in context switching (i.e. if a user-mode application contends with the kernel for the SIMD unit), performance might drop for both due to excessive context switching. OTOH, a new SIMD memcpy used in specific cases, or even combined with some other functionality - for example the TCP checksum in this case - might prove quite rewarding and is probably the proper way to use a SIMD unit inside the kernel. I'm sure this is independent of the actual SIMD unit in question; whatever applies to NEON might apply as well to Altivec or SSE*.
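(The shape of that combined approach in scalar form, as a sketch - the running byte sum stands in for the real TCP checksum, and a SIMD version would vectorize this single pass; Linux's csum_partial_copy* helpers have this same fused structure:)

#include <stddef.h>
#include <stdint.h>

/* One pass over the data does both the copy and the checksum, so
   the bytes are only pulled through the cache once. */
uint32_t copy_and_sum(uint8_t *dst, const uint8_t *src, size_t n)
{
        uint32_t sum = 0;
        size_t i;

        for (i = 0; i < n; i++) {
                dst[i] = src[i];
                sum += src[i];
        }
        return sum;
}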
Regards
Konstantinos
On Thu, 5 May 2011, Christian Robottom Reis wrote:
Hey there,
I was asked today in the board meeting about the use of NEON
routines in the kernel; I said we had looked into this but hadn't done it because a) it wasn't conclusively better and b) if better, it would need to be done conditionally per-platform. But I wanted to double-check that's actually true (and I'm copying Vijay to keep me honest).
Please see my previous answer.
Incidentally, this ties into the question sent earlier this week which had to do with Nico's work item in:
https://blueprints.launchpad.net/linux-linaro/+spec/other-kernel-thumb2
Which IIRC Nico says probably isn't worth it, right?
Thumb2 and Neon are orthogonal.
Nicolas
Hi there,
I would like to try the same things on my DevKit8000 board and see the performance. But since the kernel does not support NEON by default, how can we enable that SIMD unit? Could someone give me a pointer on where to start?