On 5 May 2011 18:17, Måns Rullgård <mans@mansr.com> wrote:
David Gilbert <david.gilbert@linaro.org> writes:
On 5 May 2011 16:08, Måns Rullgård <mans@mansr.com> wrote:
David Gilbert <david.gilbert@linaro.org> writes:
Not quite: a) Neon memcpy/memset is worse on A9 than non-neon versions (better on A8 typically)
That is not my experience at all. On the contrary, I've seen memcpy throughput on A9 roughly double with use of NEON for large copies. For small copies, plain ARM might be faster since the overhead of preparing for a properly aligned NEON loop is avoided.
What do you base your claims on?
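For concreteness, the kind of NEON bulk-copy inner loop being discussed looks roughly like this; a minimal sketch using the GCC/ARM NEON intrinsics vld1q_u8/vst1q_u8, which assumes 16-byte-aligned pointers and a multiple-of-16 length, and is not the Bionic or glibc implementation:

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: copies len bytes using 128-bit NEON loads and stores.
   Assumes src/dst are 16-byte aligned and len is a multiple of 16;
   a real memcpy also handles unaligned heads and tails. */
static void neon_copy_aligned(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i;
    for (i = 0; i < len; i += 16) {
        uint8x16_t v = vld1q_u8(src + i);  /* 128-bit NEON load  */
        vst1q_u8(dst + i, v);              /* 128-bit NEON store */
    }
}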
My tests here: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemcpy
At the top of the page: "Do not rely on or use the numbers."
That's OK; we put that in there since we're still experimenting.
At the bottom of the page are sets of graphs for A9 (left) and A8 (right); on A9 the NEON memcpys (red and green) top out much lower than the best non-NEON equivalents (black and cyan).
That page is rather fuzzy on exactly what code was being tested as well as how the tests were performed. Without some actual code with which one can reproduce the results, those figures should not be used as a basis for any decisions.
I'm happy to post my test harness; I've copied and pasted the main memcpy speed test below. Give me a day or two and I can clean the whole thing up to run stand-alone.
Also, when I showed those numbers to the guys at ARM they all said it was a bad idea to use Neon on A9 for memory manipulation workloads.
I have heard many claims passed around concerning memcpy on A9, none of which I have been able to reproduce myself. Some allegedly came from people at ARM.
What code do you base your claims on? :-)
My own testing wherein the Bionic NEON memcpy vastly outperformed both glibc and Bionic ARMv5 memcpy.
Can you provide me with some actual results? You seem to be disputing my numbers, which agree with the comments from the guys at ARM, on the grounds that you have seen the opposite - which I'm happy to believe, but I'd like to understand why.
Note also the graphs I produced for memset show similar behaviour: https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemset
In this case memset on both A9s is slower with NEON than without.
I don't see the connection between Thumb2 and memcpy performance. Thumb2 can do anything 32-bit ARM can.
There are purists who say to write everything in Thumb2 now; however, there is an interesting question of which is faster, and IMHO the ARM code is likely to be a bit faster in most cases.
Code with many conditional instructions may be faster in ARM mode since it avoids the IT instructions. Other than that I don't see why it should matter. The instruction prefetching should make possible misalignment of 32-bit instructions irrelevant. If anything, the usually smaller Thumb2 code should decrease I-cache pressure and increase performance.
As per my comment in a separate mail, I don't see that the i-cache pressure should be relevant for a small core routine.
Here is my memcpy harness; feel free to point out any silly mistakes!
/* Excerpt from the harness: memcpyfunc and local_memset are defined
   elsewhere in the full harness. */
#define _GNU_SOURCE  /* for MAP_ANONYMOUS */
#include <stdio.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>

void memcpyspeed(memcpyfunc mcf, const char *id,
                 unsigned int srcalign, unsigned int dstalign)
{
    const size_t largesttest = 256 * 1024;      /* Power of 2 */
    const unsigned long loopsforlargest = 2048; /* Number of loops to use for largest test */
    const int sizeadd[] = {0, 1, 3, 4, 7, 15, 31, -1};
    struct timespec tbefore, tafter;
    size_t testsize;
    unsigned long numloops;
    unsigned int sizeoffset;

    /* I'm assuming the mmap will conveniently give us something page
       aligned if the allocation size is reasonably large */
    char *srcbase, *dstbase;

    srcbase = mmap(NULL, largesttest + 8192, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (MAP_FAILED == srcbase) {
        perror("memcpyspeed: Failed to mmap src area");
        return;
    }

    dstbase = mmap(NULL, largesttest + 8192, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (MAP_FAILED == dstbase) {
        perror("memcpyspeed: Failed to mmap dst area");
        return;
    }

    /* Wipe the mapped areas through just to force them to be allocated */
    local_memset(srcbase, '.', largesttest + 8192);
    local_memset(dstbase, '*', largesttest + 8192);

    /* Run over the range of sizes, adjusting the number of loops to keep
       the same amount of memory accessed */
    for (testsize = largesttest, numloops = loopsforlargest;
         testsize > 0;
         testsize /= 2, numloops *= 2) {
        /* For sizes of 32 or more, also try a few odd sizes */
        for (sizeoffset = 0;
             ((testsize >= 32) && (sizeadd[sizeoffset] >= 0)) ||
             ((testsize < 32) && (sizeoffset == 0));
             sizeoffset++) {
            unsigned long l;
            double nsdiff;
            double mbtransferred;

            clock_gettime(CLOCK_REALTIME, &tbefore);

            for (l = 0; l < numloops; l++) {
                mcf(dstbase + dstalign, srcbase + srcalign,
                    testsize + sizeadd[sizeoffset]);
            }

            clock_gettime(CLOCK_REALTIME, &tafter);

            nsdiff = (double)(tafter.tv_nsec - tbefore.tv_nsec);
            nsdiff += 1000000000.0 * (tafter.tv_sec - tbefore.tv_sec);
            /* 2x is because a copy both reads and writes the data */
            mbtransferred = 2.0 * ((double)(testsize + sizeadd[sizeoffset]) *
                                   (double)numloops) / (1024.0 * 1024.0);

            printf("%s-s%u-d%u: ,%lu, loops of ,%zu, bytes=%lf MB, "
                   "transferred in ,%lf ns, giving, %lf MB/s\n",
                   id, srcalign, dstalign, numloops,
                   testsize + sizeadd[sizeoffset], mbtransferred, nsdiff,
                   (1000000000.0 * mbtransferred) / nsdiff);
        }
    }

    munmap(srcbase, largesttest + 8192);
    munmap(dstbase, largesttest + 8192);
}
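Until I post the cleaned-up stand-alone version, a minimal driver along these lines should let the excerpt above build; the memcpyfunc typedef and local_memset here are just placeholders for what the full harness defines, and would go above memcpyspeed:

/* Placeholder definitions: the real harness provides its own memcpyfunc
   typedef and local_memset; these stand-ins only make the excerpt build. */
#include <string.h>

typedef void *(*memcpyfunc)(void *dst, const void *src, size_t n);

static void *local_memset(void *s, int c, size_t n)
{
    return memset(s, c, n);
}

int main(void)
{
    /* Time the libc memcpy with both buffers page aligned... */
    memcpyspeed(memcpy, "libc", 0, 0);
    /* ...and again with the source offset by one byte. */
    memcpyspeed(memcpy, "libc", 1, 0);
    return 0;
}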
Dave