For reference: we know that the NEON intrinsics in GCC have issues.
I came across this page: http://hilbert-space.de/?p=22
which has a colour-to-greyscale conversion written with intrinsics. gcc-linaro-4.5-2011.03-0 does poorly, spilling intermediate values to the stack and copying them through core registers. The core of the loop is:

.L3:
	mov	ip, r4
	vld3.8	{d16-d18}, [r6]
	vstmia	r4, {d16-d18}
	ldmia	ip!, {r0, r1, r2, r3}
	mov	sl, r9
	adds	r7, r7, #1
	adds	r6, r6, #24
	stmia	sl!, {r0, r1, r2, r3}
	fldd	d16, [sp, #24]
	fldd	d18, [sp, #32]
	ldmia	ip, {r0, r1}
	vmull.u8	q8, d16, d19
	stmia	sl, {r0, r1}
	vmlal.u8	q8, d18, d20
	fldd	d18, [sp, #40]
	vmlal.u8	q8, d18, d21
	vshrn.i16	d16, q8, #8
	vst1.8	{d16}, [r5]
	adds	r5, r5, #8
	cmp	r8, r7
	bgt	.L3

llvm-2.9~svn128540 does much better:

	vld3.8	{d20, d21, d22}, [r1]!
	add	r3, r3, #1
	cmp	r3, r2
	vmull.u8	q12, d21, d16
	vmlal.u8	q12, d20, d17
	vmlal.u8	q12, d22, d18
	vshrn.i16	d19, q12, #8
	vst1.8	{d19}, [r0]!
	blt	.LBB0_1

and may actually be better than the hand-written assembler on Nils's page, because it schedules the loop comparison earlier.
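
For anyone without the page to hand, here is a rough scalar C sketch of what both loops compute per pixel: three 8-bit by 8-bit multiplies accumulated in 16 bits (the vmull.u8/vmlal.u8 sequence), then a narrowing shift right by 8 (vshrn.i16). The name grey_ref, the R, G, B byte order and the weights 77/151/28 are my assumptions for illustration; the listings above only show that three weight registers are used, not their values. It is meant as a reference for checking compiler output, not as the code from the page.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed weights (not visible in the listings above). They sum to 256,
 * so the 16-bit accumulator cannot overflow: 255 * 256 = 65280. */
enum { W_R = 77, W_G = 151, W_B = 28 };

/* Hypothetical scalar reference: src holds n_pixels interleaved 3-byte
 * pixels (assumed R, G, B order), dst receives one grey byte per pixel. */
void grey_ref(uint8_t *dst, const uint8_t *src, size_t n_pixels)
{
    for (size_t i = 0; i < n_pixels; i++) {
        /* Mirrors vmull.u8 followed by two vmlal.u8. */
        uint16_t acc = (uint16_t)(src[0] * W_R);
        acc = (uint16_t)(acc + src[1] * W_G);
        acc = (uint16_t)(acc + src[2] * W_B);

        /* Mirrors vshrn.i16 #8: keep the high byte of the sum. */
        dst[i] = (uint8_t)(acc >> 8);
        src += 3;
    }
}
```

The NEON versions handle 8 pixels per iteration, so the output of either compiler can be compared against this for any pixel count that is a multiple of 8.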
Richard S, were you looking into this?
-- Michael