Memcpy and memset

Michael Hope michael.hope at
Tue Sep 6 02:46:16 UTC 2011

On Fri, Sep 2, 2011 at 4:51 AM, David Gilbert <david.gilbert at> wrote:
> Hi Michael,
>  I've just committed a pair of memcpy's into src/linaro-a9 - memcpy.S
> that is armv7 and memcpy-hybrid.S that is a Neon hybrid which uses neon
> for non-aligned cases and for large (128K or larger) copies.   I've also
> (accidentally) wired the memcpy-hybrid one into the (I wasn't sure what
> the right way to do this was - the neon_sources seemed a good place for
> it, but there is nothing currently in there that turns off the non-neon
> version).
>  I'd be interested in seeing the results for both; I've got a bit of a
> soft spot for the hybrid solution.
>  On the memset, yes the 'and' that you added is fine - but I started
> having a play and have some performance results (on -t 128) that I
> don't really understand:
> 1) and r1,#0xff
>    orr  r1,r1,r1,lsl#8
>    orr  r1,r1,r1,lsl#16
>   That's your solution - and fastest at somewhere around 2270MB/s for
> me - by the TRM I reckon that should be 3 cycles.
> 2) lsls r1,#24
>    orr  r1,r1,r1,lsr#8
>    orr  r1,r1,r1,lsr#16
>    lsl isn't explicitly listed in the TRM, so I assumed that was the
> same as a move with a constant shift, which my reading is that it's a
> single cycle; and the lsls is 2 bytes - so you would think that should
> be as fast as yours but 2 bytes smaller - except it's reliably down at
> 2228MB/s - so it is slower.
> 3) Thinking it was an alignment issue I tried adding a mov r5,r1 to
> the front of that, and got 2248MB/s - so being faster with an extra
> instruction it probably was an alignment issue?
> 4) I also tried a pair of bfi's:
>   bfi r1,r1, #8, #8
>   bfi r1,r1, #16, #16
>   That came out at 2228MB/s - and is 4 cycles by the book.

Unfortunately you can't tell the performance from the latency.
Attached is a micro benchmark that has the three different versions
(and, ubx, lsl).  After compensating for the loop time, I got:

 * lsl: 1.006 s
 * ubx: 0.876 s
 * and: 0.918 s

even though ubx has a latency of two cycles.
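
For reference, here is a rough C sketch of what the and, lsl and bfi
sequences quoted above each compute. It is not the attached benchmark
and it says nothing about timing; the function names are made up for
illustration, and the bfi version is only a plain-C emulation of the
bit-field insert. It just shows that all three end up with the low
byte of r1 replicated into every byte of the word, so any speed
difference comes from scheduling and alignment rather than from the
result.

```c
#include <stdint.h>

/* Hypothetical helper names; each mirrors one quoted sequence. */

/* and r1,#0xff ; orr r1,r1,r1,lsl#8 ; orr r1,r1,r1,lsl#16 */
static uint32_t splat_and(uint32_t c)
{
    c &= 0xffu;
    c |= c << 8;
    c |= c << 16;
    return c;
}

/* lsls r1,#24 ; orr r1,r1,r1,lsr#8 ; orr r1,r1,r1,lsr#16 */
static uint32_t splat_lsl(uint32_t c)
{
    c <<= 24;          /* the shift itself discards the upper bytes */
    c |= c >> 8;
    c |= c >> 16;
    return c;
}

/* bfi r1,r1,#8,#8 ; bfi r1,r1,#16,#16 (emulated with masks) */
static uint32_t splat_bfi(uint32_t c)
{
    /* insert bits 0..7 at bits 8..15 */
    c = (c & ~0x0000ff00u) | ((c & 0xffu) << 8);
    /* insert bits 0..15 at bits 16..31 */
    c = (c & 0x0000ffffu) | ((c & 0xffffu) << 16);
    return c;
}
```

Note that only the first version needs an explicit mask; the other two
overwrite or shift out whatever was in the upper bytes.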

I then took the AND version and shifted it to the start of the file.
This small change in alignment pushed it up to 1.048 s, which is 14 %
slower than the 0.918 s it took before.

-- Michael
