For reference. We know that the NEON intrinsics in GCC have issues.
I came across this page: http://hilbert-space.de/?p=22
which has a colour to greyscale conversion done using intrinsics. gcc-linaro-4.5-2011.03-0 does poorly because it saves intermediate values on the stack. The core of the loop is:
.L3:
        mov       ip, r4
        vld3.8    {d16-d18}, [r6]
        vstmia    r4, {d16-d18}
        ldmia     ip!, {r0, r1, r2, r3}
        mov       sl, r9
        adds      r7, r7, #1
        adds      r6, r6, #24
        stmia     sl!, {r0, r1, r2, r3}
        fldd      d16, [sp, #24]
        fldd      d18, [sp, #32]
        ldmia     ip, {r0, r1}
        vmull.u8  q8, d16, d19
        stmia     sl, {r0, r1}
        vmlal.u8  q8, d18, d20
        fldd      d18, [sp, #40]
        vmlal.u8  q8, d18, d21
        vshrn.i16 d16, q8, #8
        vst1.8    {d16}, [r5]
        adds      r5, r5, #8
        cmp       r8, r7
        bgt       .L3
llvm-2.9~svn128540 does much better:
        vld3.8    {d20, d21, d22}, [r1]!
        add       r3, r3, #1
        cmp       r3, r2
        vmull.u8  q12, d21, d16
        vmlal.u8  q12, d20, d17
        vmlal.u8  q12, d22, d18
        vshrn.i16 d19, q12, #8
        vst1.8    {d19}, [r0]!
        blt       .LBB0_1
and may actually be better than the hand-written assembler on Nils's page, because it schedules the loop comparison earlier.
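For readers without Nils's page to hand, the computation being compiled looks roughly like the sketch below. It is reconstructed from the generated code in this thread, not copied from the page. The function names, the scalar fallback for the tail, and the assumption that the three interleaved bytes are R, G, B in that order are all illustrative. The weights 77, 151 and 28 are the vmov.i8 constants in the generated code; they sum to 256, so the widened 16-bit accumulator cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Weights as seen in the vmov.i8 constants; 77 + 151 + 28 == 256. */
enum { R_WEIGHT = 77, G_WEIGHT = 151, B_WEIGHT = 28 };

/* Scalar reference (hypothetical helper): one grey byte per 3-byte pixel.
   Assumes the bytes are interleaved as R, G, B. */
static void grey_scalar(uint8_t *dest, const uint8_t *src, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        unsigned sum = R_WEIGHT * src[3 * i]
                     + G_WEIGHT * src[3 * i + 1]
                     + B_WEIGHT * src[3 * i + 2];
        dest[i] = (uint8_t)(sum >> 8);
    }
}

/* Intrinsics version (hypothetical name): 8 pixels per iteration where
   NEON is available, with the scalar loop handling any remaining pixels. */
static void grey_convert(uint8_t *dest, const uint8_t *src, size_t pixels)
{
    size_t done = 0;
#if HAVE_NEON
    const uint8x8_t rfac = vdup_n_u8(R_WEIGHT);
    const uint8x8_t gfac = vdup_n_u8(G_WEIGHT);
    const uint8x8_t bfac = vdup_n_u8(B_WEIGHT);

    for (; done + 8 <= pixels; done += 8) {
        /* vld3.8: de-interleave 24 bytes into three 8-lane vectors. */
        uint8x8x3_t rgb = vld3_u8(src + 3 * done);

        /* vmull.u8 / vmlal.u8: widening multiply, then multiply-accumulate. */
        uint16x8_t acc = vmull_u8(rgb.val[0], rfac);
        acc = vmlal_u8(acc, rgb.val[1], gfac);
        acc = vmlal_u8(acc, rgb.val[2], bfac);

        /* vshrn.i16 #8 and vst1.8: narrow back to bytes and store. */
        vst1_u8(dest + done, vshrn_n_u16(acc, 8));
    }
#endif
    grey_scalar(dest + done, src + 3 * done, pixels - done);
}
```

The uint8x8x3_t value returned by vld3_u8 is the kind of multi-vector type whose handling decides whether a compiler keeps the three vectors in registers or bounces them through the stack.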
Richard S, were you looking into this?
-- Michael
Michael Hope <michael.hope@linaro.org> writes:
Richard S, were you looking into this?
Yeah, this is actually the first thing I had to tackle as part of the auto-vectorisation support for vld/vst. If it wasn't fixed, the same problems would affect the auto-vectoriser. The main patches are:
- http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01631.html
When we're dealing with multiple vectors (vld2-vld4, etc.), allow GCC to access the individual vectors in-place, rather than forcing all the vectors to the stack and loading individual ones from there.
Now upstream.
- http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01996.html
Changes the intrinsic patterns to use memory operands instead of register pointer operands. Among other things, this allows the vld3.8 and vst1.8 to take the post-incremented addresses, as they do in the LLVM code.
Only posted on Tuesday, awaiting review.
- Allow types like uint32x4x3_t to be stored in registers. I started a discussion related to this:
http://gcc.gnu.org/ml/gcc/2011-03/msg00342.html
and I think the outcome means that the implementation I wanted should be OK. I tested the patch I've been using last night, and plan to submit it today.
The build I tested last night gives:
        cmp       r2, #0
        add       r3, r2, #7
        movlt     r2, r3
        mov       r2, r2, asr #3
        cmp       r2, #0
        vmov.i8   d21, #77   @ v8qi
        vmov.i8   d22, #151  @ v8qi
        vmov.i8   d23, #28   @ v8qi
        bxle      lr
        mov       r3, #0
.L3:
        vld3.8    {d18-d20}, [r1]!
        vmull.u8  q8, d18, d21
        vmlal.u8  q8, d19, d22
        vmlal.u8  q8, d20, d23
        add       r3, r3, #1
        vshrn.i16 d16, q8, #8
        cmp       r3, r2
        vst1.8    {d16}, [r0]!
        bne       .L3
        bx        lr
which seems to be the same as the LLVM code, but scheduled differently.
Richard
linaro-toolchain@lists.linaro.org