NEON intrinsics and stack access - linaro-toolchain

31 Mar 2011


For reference.  We know that the NEON intrinsics in GCC have issues.
I came across this page:
 http://hilbert-space.de/?p=22
which has a colour to greyscale conversion done using intrinsics.
gcc-linaro-4.5-2011.03-0 does poorly because it saves intermediate
values on the stack and reloads them.  The core of the loop is:
.L3:
    mov	ip, r4
    vld3.8	{d16-d18}, [r6]
    vstmia	r4, {d16-d18}
    ldmia	ip!, {r0, r1, r2, r3}
    mov	sl, r9
    adds	r7, r7, #1
    adds	r6, r6, #24
    stmia	sl!, {r0, r1, r2, r3}
    fldd	d16, [sp, #24]
    fldd	d18, [sp, #32]
    ldmia	ip, {r0, r1}
    vmull.u8	q8, d16, d19
    stmia	sl, {r0, r1}
    vmlal.u8	q8, d18, d20
    fldd	d18, [sp, #40]
    vmlal.u8	q8, d18, d21
    vshrn.i16	d16, q8, #8
    vst1.8	{d16}, [r5]
    adds	r5, r5, #8
    cmp	r8, r7
    bgt	.L3
llvm-2.9~svn128540 does much better:
    vld3.8	{d20, d21, d22}, [r1]!
    add	r3, r3, #1
    cmp	r3, r2
    vmull.u8	q12, d21, d16
    vmlal.u8	q12, d20, d17
    vmlal.u8	q12, d22, d18
    vshrn.i16	d19, q12, #8
    vst1.8	{d19}, [r0]!
    blt	.LBB0_1
and may actually be better than the hand-written assembler on Nils's
page due to scheduling the loop comparison earlier.
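
For anyone checking either listing against expected output, here is a
rough scalar model in portable C of what each lane of the loop computes:
one widening multiply (vmull.u8), two widening multiply-accumulates
(vmlal.u8), then a narrowing shift right by 8 (vshrn.i16).  The weights
77/151/28, the R,G,B channel order and the names grey_pixel and
grey_convert are my assumptions for illustration; they are not taken
from the listings above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed weights (not stated in this post).  They sum to 256, so the
   16-bit accumulator cannot overflow: 255 * 256 = 65280 < 65536. */
enum { W_R = 77, W_G = 151, W_B = 28 };

/* Scalar model of one lane: vmull.u8, two vmlal.u8, then vshrn.i16 #8.
   The channel order (r, g, b) is an assumption. */
static uint8_t grey_pixel(uint8_t r, uint8_t g, uint8_t b)
{
    uint16_t acc = (uint16_t)(r * W_R);   /* vmull.u8 */
    acc = (uint16_t)(acc + g * W_G);      /* vmlal.u8 */
    acc = (uint16_t)(acc + b * W_B);      /* vmlal.u8 */
    return (uint8_t)(acc >> 8);           /* vshrn.i16 #8 */
}

/* src holds 3 interleaved bytes per pixel, which is the layout that
   vld3.8 de-interleaves into three D registers. */
static void grey_convert(uint8_t *dest, const uint8_t *src, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++) {
        dest[i] = grey_pixel(src[3 * i], src[3 * i + 1], src[3 * i + 2]);
    }
}
```

This only models the arithmetic.  It says nothing about register
allocation, which is where the GCC output goes wrong.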
Richard S, were you looking into this?
-- Michael