Re: Optimized kernel memcpy/memset

6 May 2011


2011/5/6 Christian Robottom Reis kiko@linaro.org:
...
On Thu, May 05, 2011 at 04:08:01PM +0100, Måns Rullgård wrote:
...
...
...
Incidentally, this ties into the question sent earlier this week which
had to do with Nico's work item in:
https://blueprints.launchpad.net/linux-linaro/+spec/other-kernel-thumb2
Which IIRC Nico says probably isn't worth it, right?
I thought dmart had done a lot of that?
I don't see the connection between Thumb2 and memcpy performance.
Thumb2 can do anything 32-bit ARM can.
Well, the work item above is also about providing optimized memory
routines that come out of the TCWG; if NEON isn't interesting, are any
of the optimized Thumb2 versions that the toolchain team worked on that
are worth looking at?
I don't think there are that many things that are vastly useful for the kernel,
but here is a summary (I intend to write a full report at some point but
am still fighting SPEC for some benchmark stats and some of the corner cases
of these routines)
Note also that these graphs were put together as I was working on the routines
and aren't consistently on the same machine/libc etc - when I write the report
up I'll gather a full consistent set.
memset:
   https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemset
On A9 the kernel's memset is pretty good - at around 64 bytes or so it
beats everything else, and it is at the top at larger sizes on the Exynos I
tried it on (on a 400 MHz Vexpress there were some points where my own
implementation beat it).  On A8, interestingly, the Neon version I wrote is
really much faster than the ARM versions - maybe this is actually worth a try
for page clearing, even with the context switching costs?
Note that the kernel's memset takes a shortcut by not returning the
correct result as per the C spec which probably helps it in the short cases
(and frankly I don't think anyone actually ever cares about the return value).
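To make the return-value point concrete, here is a minimal C sketch of what
the C standard asks for: memset returns its first argument unchanged.  The
name sketch_memset is made up for illustration; this is not the kernel's or
anyone's optimized routine, just the contract.  A routine that advances the
destination pointer in place has to keep a copy of the original around to
return it, and that is the small cost a shortcut avoids.

```c
#include <stddef.h>

/* Hypothetical name. Byte-at-a-time for clarity, not speed. */
void *sketch_memset(void *dest, int c, size_t n)
{
    unsigned char *p = dest;        /* working pointer; dest stays intact */

    while (n--)
        *p++ = (unsigned char)c;    /* value is converted to unsigned char */

    return dest;                    /* C standard: return the original pointer */
}
```

A caller that writes memset(buf, 0, len) and ignores the result never notices
the difference, which is why the shortcut is mostly harmless in practice.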
strlen:
   https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrlen
I've got a nice fast strlen that uses uadd8 - it's only really of benefit
though if there are lots of longer strings.
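For readers who haven't met the trick: uadd8 does four byte-wide adds in one
instruction and records a per-byte carry in the GE flags, which lets you test
a whole word for a zero byte at once.  The sketch below is NOT the assembly
discussed here; it is a portable C analogue using the well-known
"(w - 0x01010101) & ~w & 0x80808080" zero-byte test, with a made-up name
(sketch_strlen).  It assumes that reading the rest of an aligned 4-byte word
past the terminator is harmless, which is the usual assumption in such
routines but is not strictly conforming C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes in w is zero. */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Hypothetical name; a word-at-a-time strlen sketch. */
size_t sketch_strlen(const char *s)
{
    const char *p = s;

    /* Byte steps until p is 4-byte aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Four bytes per iteration while no zero byte is present.
     * Assumption: the aligned word holding the terminator is readable. */
    for (;;) {
        uint32_t w;
        memcpy(&w, p, sizeof w);
        if (has_zero_byte(w))
            break;
        p += 4;
    }

    /* Locate the terminator inside the final word. */
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

The alignment prologue and the final byte scan are fixed overhead, which is
consistent with the routine only paying off on longer strings.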
strchr:
   https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialStrchr
I've got two strchr implementations; one is the absolutely simplest one I
could write (but taking advantage of a modern cbz) and another also using
the uadd8 to do a few bytes/iteration.  The really simple one is at least
as good as the libc and kernel ones for short (<50 byte) cases and no worse
on A9 for longer cases; on A8 the more complex libc one wins out after about 16
characters.
The uadd8 version is much faster on longer strchr's but I think those
are so rare it's not worth it.
The 'simple strchr' is now in Ubuntu Natty's eglibc.
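As a rough idea of what "absolutely simplest" means, here is a one-byte-per-
iteration strchr in C.  This is a sketch with a made-up name (sketch_strchr),
not the code that went into eglibc; the end-of-string test is the sort of
compare-against-zero-and-branch that cbz can do in one instruction on Thumb2,
though what a compiler actually emits for this is not guaranteed.

```c
#include <stddef.h>

/* Hypothetical name; the plainest possible strchr. */
char *sketch_strchr(const char *s, int c)
{
    const char ch = (char)c;

    for (;; s++) {
        if (*s == ch)
            return (char *)s;   /* also matches the terminator when c == 0 */
        if (*s == '\0')
            return NULL;        /* end of string, no match */
    }
}
```

Checking for a match before checking for the terminator is what makes a
search for '\0' return a pointer to the end of the string, as the C standard
requires.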
memchr:
  https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemchr
Memchr is my best performance win - but is probably not that heavily used;
again it's using uadd8 (you can tell I like that) and it's much faster on longer
runs.  Where the length parameter is small it falls back to a simple loop;
its worst case is where you pass a large block of memory (and hence it uses
the more complex loop) but the result is found in the first few bytes.
This is in Ubuntu Natty's eglibc.
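The shape described above (simple loop for small lengths, a multi-byte loop
for large ones) can be sketched in portable C.  Again this is not the uadd8
assembly: it swaps in the same word-at-a-time zero-byte test after XORing
each word with the search byte repeated four times.  The name sketch_memchr
and the threshold SMALL_LIMIT are invented for illustration; the real
cut-over point isn't stated here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SMALL_LIMIT 16   /* hypothetical threshold, not from the real routine */

/* Plain byte loop, used for small lengths and for tails. */
static const void *byte_scan(const unsigned char *p, unsigned char ch, size_t n)
{
    for (; n != 0; n--, p++)
        if (*p == ch)
            return p;
    return NULL;
}

/* Hypothetical name; a two-path memchr sketch. */
void *sketch_memchr(const void *buf, int c, size_t n)
{
    const unsigned char *p = buf;
    const unsigned char ch = (unsigned char)c;

    if (n >= SMALL_LIMIT) {
        const uint32_t pattern = 0x01010101u * ch;  /* ch in every byte */

        /* Byte steps until p is 4-byte aligned. */
        while (n != 0 && ((uintptr_t)p & 3u) != 0) {
            if (*p == ch)
                return (void *)p;
            p++;
            n--;
        }

        /* Whole words, never reading past the n bytes supplied. */
        while (n >= 4) {
            uint32_t w;
            memcpy(&w, p, sizeof w);
            w ^= pattern;                       /* matching bytes become 0 */
            if ((w - 0x01010101u) & ~w & 0x80808080u)
                break;                          /* a match is in this word */
            p += 4;
            n -= 4;
        }
    }

    /* Small input, the matching word, or the tail. */
    return (void *)byte_scan(p, ch, n);
}
```

The worst case mentioned above shows up directly: with a large n and a match
in the first few bytes, the call still pays for the pattern setup and the
alignment prologue before it finds anything.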
memcpy:
  Updated! https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
As discussed previously, see the memcpy charts above - I've added a new 2x2 set
at the bottom comparing aligned/misaligned (only by 1 byte), and also added
a non-neon memcpy I just wrote.
My non-neon memcpy is similar to the kernel/libc's - it's a bit less smart
about copying n*32+1 bytes (which are the spiky bits you can see) but seems a
little faster at the start and end of the ranges - nothing really to distinguish
it.
(It doesn't know about co-misaligned yet - as in both source/dest misaligned
by 1 byte which you can see in the lower right graph).
It is however abysmal in the non-aligned case - hint: Don't bother taking
advantage of v7's non-aligned load/stores.
For non-aligned, Neon wins; one of mine or Bionic's Neon routines - I seem
to prefer non-aligned source, Bionic seems to prefer non-aligned
destination; and Bionic's really drops off when it runs out of cache.
(I have a run cooking at the moment with a much wider set of misalignments
but it takes ages)
Dave
