Re: Optimized kernel memcpy/memset

5 May 2011


On 5 May 2011 18:59, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
...
On Thu, 5 May 2011, David Gilbert wrote:
...
If people believe it's worth breaking the context-switching taboo and
putting a neon version into the kernel then yes I agree it's something
you'd want to do as a build and/or runtime selection - but that's
quite a big taboo to break.
There is no taboo.  Only numbers.
The cost of using Neon in the kernel is non-negligible.  It is also
hard to measure as it depends on the actual Neon usage simultaneously
happening in user space or in other concurrent kernel contexts.  This is
not something that a dedicated benchmark can evaluate.
Agreed, and that's also why it's partly a taboo: if it's only numbers,
but numbers based on some set of benchmarks that no two people are going
to agree on, it's very difficult.  It would have to show a good win on
something complex and well agreed upon.
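
To make the "only numbers" argument concrete, here is a minimal sketch of
a break-even calculation.  It assumes a deliberately simple model: a fixed
one-off cost for saving/restoring the Neon context plus a per-byte cost
for each routine.  The struct, the function name and every figure fed to
it are hypothetical; as noted above, the real context cost depends on
concurrent Neon use and is not a single constant.

```c
#include <stddef.h>

/* Hypothetical cost model: every field is an illustrative assumption,
 * not a measurement from any real system. */
struct neon_cost_model {
    double ctx_overhead_ns;     /* one-off cost of saving/restoring Neon state */
    double scalar_ns_per_byte;  /* per-byte cost of the non-Neon routine */
    double neon_ns_per_byte;    /* per-byte cost of the Neon routine */
};

/* Returns the smallest buffer size (in bytes) at which the Neon path is
 * strictly faster overall, or 0 if Neon never wins under this model. */
size_t neon_break_even_bytes(const struct neon_cost_model *m)
{
    double gain = m->scalar_ns_per_byte - m->neon_ns_per_byte;

    if (gain <= 0.0)
        return 0;

    /* Neon wins when ctx + n * neon < n * scalar, i.e. n > ctx / gain. */
    return (size_t)(m->ctx_overhead_ns / gain) + 1;
}
```

With made-up figures of 1000 ns context overhead, 1.0 ns/byte scalar and
0.5 ns/byte Neon, the model puts the break-even point at 2001 bytes, so
anything smaller would be a net loss.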
...
There _are_ cases for Neon to be used in the kernel i.e. those where the
initial cost is offset by the gain.  The first that comes to mind is
crypto of course.  But there are also simple things like CRC32 which is
used all over the place by BTRFS for example.  And that is the actual
test case I think we should focus our efforts on, given that BTRFS is
going to be the next major filesystem on Linux.  Last time I tried BTRFS
on ARM, the CRC32 computation was dominating CPU usage big time. CRC32
is easy to understand, easy to validate, and will provide the right
reason for creating the needed infrastructure to manipulate the Neon
context in kernel space.  Once that's in place we could move to other
targets such as crypto which is already complex enough without having to
bother with the Neon context handling.
Yes, while I've not actually looked at coding CRC32 or the crypto things,
I agree that they feel like they have much more room to work with;
that's outside the scope of what I was asked to look at, however.
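
As an illustration of the "easy to validate" point, a minimal
bit-at-a-time reference is sketched below.  It assumes the CRC32C
(Castagnoli) variant, which is the one BTRFS uses for its checksums; the
thread itself just says CRC32.  The function name crc32c_ref is
hypothetical and this is not the kernel's implementation, only a slow
baseline that an optimised (for example Neon) version could be checked
against.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Slow on purpose: meant as a reference to validate faster versions.
 * Pass 0 as crc for a fresh checksum, or a previous result to continue. */
uint32_t crc32c_ref(uint32_t crc, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++) {
            /* Shift right; XOR in the polynomial if the low bit was set. */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

The standard check value for this variant is 0xE3069283 for the ASCII
string "123456789", which gives a one-line sanity test for any faster
implementation.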
...
The memcpy case is not interesting.  Not at all.  Most kernel memcpy
calls are for small size copies.  The large copy instances are just bad
and misdesigned in the first place if they rely on memcpy (maybe they
should simply have a custom copy function, maybe implemented with Neon).
Even outside the kernel, vast memcpy's are fairly rare as far as I can
tell; everyone knows they're going to hurt, so people try to avoid them.
The other thing is that people have been optimising ARM memcpy for
decades, and it appears to me to be hitting cache/bus bandwidth limits
somewhere (although I don't have any figures for what those bandwidths
are).  There may be some scope for optimising the smaller memcpy cases
(e.g. taking advantage of things like the newer cbz instruction to cut a
few instructions out): from my graphs, the slope up to the point at which
the non-Neon code plateaus is quite gradual, which suggests it might be
possible to optimise it a bit.
(Oddly, the one case where my graph shows Neon winning is the small,
~32 byte case, where it's almost certainly not worth the pain in the
kernel of protecting the context switch.)
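
For anyone wanting to reproduce the kind of size-versus-throughput curve
described above, here is a minimal user-space sketch.  It is not the
harness behind the graphs in this thread; the function name and the use
of the C library's clock() for timing are assumptions made for
illustration, and it times whatever memcpy the C library provides rather
than any kernel routine.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Returns memcpy throughput in bytes per second for copies of `size`
 * bytes repeated `iters` times, or a negative value on failure
 * (allocation error, zero size, or an interval too short to time). */
double copy_bytes_per_sec(size_t size, unsigned long iters)
{
    /* Calling through a volatile function pointer keeps the compiler
     * from inlining or discarding the copies being timed. */
    void *(*volatile copy_fn)(void *, const void *, size_t) = memcpy;
    unsigned char *src = malloc(size);
    unsigned char *dst = calloc(size ? size : 1, 1);
    double result = -1.0;

    if (src && dst && size > 0) {
        clock_t start;
        double secs;

        memset(src, 0xA5, size);
        start = clock();
        for (unsigned long i = 0; i < iters; i++)
            copy_fn(dst, src, size);
        secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        /* Only report a figure if the copy was correct and measurable. */
        if (memcmp(dst, src, size) == 0 && secs > 0.0)
            result = (double)size * (double)iters / secs;
    }
    free(src);
    free(dst);
    return result;
}
```

Calling it for a range of sizes (say 8, 32, 128, ... bytes) and plotting
the results should show the gradual slope and eventual plateau, though
the absolute numbers will depend heavily on the machine, the C library
and the iteration count.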
...
And I doubt the small memcpy's are going to gain anything from Neon.
Even on x86 they don't do it, while they do have a CRC32 function using
SSE4.2.  Maybe we could use Neon for copy_page() which is one of those
custom bulk copy functions, but I've never seen memcpy() in kernel space
show up on any profile.
Dave
