On 5 May 2011 18:59, Nicolas Pitre nicolas.pitre@linaro.org wrote:
On Thu, 5 May 2011, David Gilbert wrote:
If people believe it's worth breaking the context-switching taboo and putting a neon version into the kernel then yes I agree it's something you'd want to do as a build and/or runtime selection - but that's quite a big taboo to break.
There is no taboo. Only numbers.
The cost of using Neon in the kernel is non-negligible. It is also hard to measure, as it depends on the actual Neon usage simultaneously happening in user space or in other concurrent kernel contexts. This is not something that a dedicated benchmark can evaluate.
Agreed, and that's also why it's partly a taboo; if it's only numbers, but numbers based on some set of benchmarks that no two people are going to agree on, it's very difficult. It would have to show a good win on something complex and well agreed upon.
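To make the "only numbers" argument concrete, here is a minimal sketch of the break-even trade-off, assuming a deliberately simple linear cost model: a fixed per-call cost for saving and restoring the Neon context, plus a per-byte cost for each implementation. The function name, the parameters and the cycle figures are all hypothetical and not measurements from this thread. As noted above, the real context cost depends on concurrent Neon use, so the fixed overhead is a moving target in practice.

```c
#include <stddef.h>

/*
 * Hypothetical cost model (all values in CPU cycles, all assumed):
 *   scalar path: len * scalar_per_byte
 *   Neon path:   ctx_overhead + len * neon_per_byte
 *
 * Returns the smallest buffer length at which the Neon path is strictly
 * cheaper, or 0 if Neon never wins under this model.
 */
size_t neon_break_even_len(double ctx_overhead,
                           double scalar_per_byte,
                           double neon_per_byte)
{
    double gain = scalar_per_byte - neon_per_byte;

    if (gain <= 0.0)
        return 0;   /* no per-byte gain, so the overhead is never repaid */

    /* Neon wins once len > ctx_overhead / gain. */
    return (size_t)(ctx_overhead / gain) + 1;
}
```

With made-up figures of 1000 cycles of context overhead, 2.0 cycles/byte scalar and 0.5 cycles/byte Neon, the model says Neon only pays off for buffers of 667 bytes or more, which is the shape of the argument for targeting bulk operations rather than small copies.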
There _are_ cases for Neon to be used in the kernel, i.e. those where the initial cost is offset by the gain. The first that comes to mind is crypto, of course. But there are also simple things like CRC32, which is used all over the place by BTRFS, for example. And that is the actual test case I think we should focus our efforts on, given that BTRFS is going to be the next major filesystem on Linux. Last time I tried BTRFS on ARM, the CRC32 computation was dominating CPU usage big time. CRC32 is easy to understand, easy to validate, and will provide the right reason for creating the needed infrastructure to manipulate the Neon context in kernel space. Once that's in place we could move to other targets such as crypto, which is already complex enough without having to bother with the Neon context handling.
Yes, while I've not actually looked at coding CRC32 or the crypto things, I agree that they feel like they have much more room to work with; it's outside the scope of what I was asked to look at, however.
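As an illustration of the "easy to validate" point, here is a minimal sketch of a scalar reference against which any optimised (e.g. Neon) CRC routine could be checked. It is plain portable C, not the kernel's crc32/crc32c API, and the function and macro names are made up for this example. The polynomial is a parameter because the thread says CRC32 while BTRFS checksums use the Castagnoli variant (CRC32C); both are shown in reflected form, with the conventional all-ones initial value and final inversion assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Reflected polynomials: IEEE 802.3 CRC32 and Castagnoli CRC32C. */
#define CRC32_POLY_IEEE       0xEDB88320u
#define CRC32_POLY_CASTAGNOLI 0x82F63B78u

/* Bit-at-a-time reference: slow, but short enough to check by eye. */
uint32_t crc32_bitwise(uint32_t poly, const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            /* XOR in the polynomial only when the low bit is set. */
            crc = (crc >> 1) ^ (poly & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Precompute the 256-entry table used by the byte-at-a-time variant. */
void crc32_make_table(uint32_t poly, uint32_t table[256])
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;

        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (poly & (0u - (c & 1u)));
        table[i] = c;
    }
}

/* Byte-at-a-time variant: one table lookup per input byte. */
uint32_t crc32_table(const uint32_t table[256], const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--)
        crc = (crc >> 8) ^ table[(crc ^ *p++) & 0xFFu];
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check input is the nine ASCII bytes "123456789", which should give 0xCBF43926 for the IEEE polynomial and 0xE3069283 for Castagnoli. A faster implementation can be validated by comparing it against crc32_bitwise() over random buffers of varying length and alignment.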
The memcpy case is not interesting. Not at all. Most kernel memcpy calls are for small size copies. The large copy instances are just bad and misdesigned in the first place if they rely on memcpy (maybe they should simply have a custom copy function, maybe implemented with Neon).
Even outside the kernel, vast memcpy's are fairly rare as far as I can tell; everyone knows they're going to hurt, so people try to avoid them. The other thing is that people have been optimising ARM memcpy for decades, and it appears to me to be hitting cache/bus bandwidth limits somewhere (although I don't have any figures for what those bandwidths are). There may be some scope for optimising the smaller memcpy cases (e.g. taking advantage of things like the newer cbz to cut a few instructions out): from my graphs, the slope up to the point at which the non-Neon code plateaus is quite gradual, which suggests it might be possible to optimise it a bit. (Oddly, the one case where my graph shows Neon winning is the small, ~32 byte cases, where it's almost certainly not worth the pain in the kernel of protecting the context switch.)
And I doubt the small memcpy's are going to gain anything from Neon. Even on x86 they don't do it, while they do have a CRC32 function using SSE4.2. Maybe we could use Neon for copy_page(), which is one of those custom bulk copy functions, but I've never seen memcpy() in kernel space show up on any profile.
Dave