Hi,
   I've been looking at some basic libc routine optimisation and have a curious problem with memset and wondered if
anyone can offer some insights.

Some graphs and links to code are on https://wiki.linaro.org/WorkingGroups/ToolChain/Benchmarks/InitialMemset

I've written a simple memset in two varieties, with and without Neon, and tested them on a Beagle (C4) and a Panda
board. The Neon version is a bit faster than the non-Neon version on the Beagle but a LOT slower on the
Panda, and I'd like to understand why. I'm guessing it's some form of cache interaction.

The graphs on that page are all generated by timing a loop that repeatedly memsets the same area of memory; the X axis
is the size of the memset.  Prior to the test loop the area is read into cache (I came to the conclusion that the A8
doesn't write-allocate?).  There are two sets of graphs: absolute ones with MB/s on the Y axis, and relative ones (below
the absolute ones) that are relative to the performance of the libc routines.  (The ones below those pairs are just older versions.)
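
For anyone who wants to reproduce the measurement, the timing method described above can be sketched in C roughly
as follows. This is only an illustration and not the code behind the graphs: the helper name memset_mb_per_s, the use
of clock_gettime, and the empty inline asm used as a compiler barrier (GCC/Clang specific) are all assumptions.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: run `iters` memsets of `size` bytes over the same
 * buffer and return the throughput in MB/s (1 MB = 10^6 bytes). */
static double memset_mb_per_s(void *(*fn)(void *, int, size_t),
                              unsigned char *buf, size_t size, long iters)
{
    volatile unsigned char sink = 0;
    struct timespec t0, t1;
    double secs;
    size_t i;
    long n;

    assert(buf != NULL && size > 0 && iters > 0);

    /* Read the whole area first so it is in cache before timing starts
     * (this matters on a core that does not allocate lines on write). */
    for (i = 0; i < size; i++)
        sink += buf[i];

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (n = 0; n < iters; n++) {
        fn(buf, (int)(n & 0xff), size);
        /* Compiler barrier so the repeated stores are not optimised away. */
        __asm__ volatile("" : : "r"(buf) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return 0.0; /* timer resolution too coarse for this run */
    return ((double)size * (double)iters) / (secs * 1e6);
}
```

Calling it as memset_mb_per_s(memset, buf, size, iters) for a range of sizes gives the libc baseline; passing another
routine with the same signature gives the curve to compare against it. The buffer should be initialised (e.g. from
calloc) before it is passed in, since the helper reads it.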

If you look at the top left graph on that page you can see that on the Beagle (left) my Neon routine beats my Thumb (non-Neon) routine
a bit (both beating libc).  If you look on the top right you see the Panda performance with my Thumb code being the fastest and generally
following libc, but the Neon code (red line) topping out at about 2.5 GB/s, which is substantially below the peak of the libc and ARM code.

The core loop of the Neon code (see the bzr link for the full thing) is:

4:
  subs  r4,r4,#32
  vst2.8  {d0,d1,d2,d3}, [ r3:256 ]!
  bne   4b

while the core of the non-Neon version is:

4:
  subs  r4,r4,#16
  stmia r3!,{r1,r5,r6,r7}
  bne   4b
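
For readers who don't read ARM assembly, the non-Neon core loop corresponds roughly to the C below: the fill byte is
replicated into a 32-bit word and four words (16 bytes) are stored per iteration. This is a sketch only; the name
memset_words and the byte-wise head and tail handling are assumptions, since the real routine's setup code is only
in the bzr link and not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C analogue of the stmia loop: fill n bytes at dst with c. */
static void *memset_words(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char b = (unsigned char)c;
    /* Replicate the byte into all four byte lanes of a word. */
    uint32_t w = (uint32_t)b * 0x01010101u;

    assert(dst != NULL || n == 0);

    /* Head (assumed): byte stores until p is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        *p++ = b;
        n--;
    }

    /* Core: 16 bytes per iteration, like stmia r3!,{r1,r5,r6,r7}. */
    while (n >= 16) {
        uint32_t v[4] = { w, w, w, w };
        memcpy(p, v, sizeof v); /* memcpy avoids aliasing problems */
        p += 16;
        n -= 16;
    }

    /* Tail (assumed): whatever is left, byte by byte. */
    while (n > 0) {
        *p++ = b;
        n--;
    }
    return dst;
}
```

The Neon loop has the same shape but stores 32 bytes per iteration from d0-d3, so it issues half as many store
instructions for the same amount of data.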


I've also tried vst1 and vstm in the Neon loop and it still won't match the non-Neon version.

All suggestions welcome, plus I'd appreciate it if anyone can suggest which particular limit it's hitting - does
anyone have figures for the theoretical bus, L1 and L2 write bandwidths for a Panda (and Beagle)?

Thanks in advance,

Dave