Re: NEON intrinsics and stack access

31 Mar 2011


Michael Hope <michael.hope@linaro.org> writes:
...
For reference.  We know that the NEON intrinsics in GCC have issues.
I came across this page:
 http://hilbert-space.de/?p=22
which has a colour to greyscale conversion done using intrinsics.
gcc-linaro-4.5-2011.03-0 does poorly because it saves intermediate
values on the stack.  The core of the loop is:
.L3:
   mov	ip, r4
   vld3.8	{d16-d18}, [r6]
   vstmia	r4, {d16-d18}
   ldmia	ip!, {r0, r1, r2, r3}
   mov	sl, r9
   adds	r7, r7, #1
   adds	r6, r6, #24
   stmia	sl!, {r0, r1, r2, r3}
   fldd	d16, [sp, #24]
   fldd	d18, [sp, #32]
   ldmia	ip, {r0, r1}
   vmull.u8	q8, d16, d19
   stmia	sl, {r0, r1}
   vmlal.u8	q8, d18, d20
   fldd	d18, [sp, #40]
   vmlal.u8	q8, d18, d21
   vshrn.i16	d16, q8, #8
   vst1.8	{d16}, [r5]
   adds	r5, r5, #8
   cmp	r8, r7
   bgt	.L3
llvm-2.9~svn128540 does much better:
   vld3.8	{d20, d21, d22}, [r1]!
   add	r3, r3, #1
   cmp	r3, r2
   vmull.u8	q12, d21, d16
   vmlal.u8	q12, d20, d17
   vmlal.u8	q12, d22, d18
   vshrn.i16	d19, q12, #8
   vst1.8	{d19}, [r0]!
   blt	.LBB0_1
and may actually be better than the hand-written assembler on Nils's
page due to scheduling the loop comparison earlier.
Richard S, were you looking into this?
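
For readers without the linked page to hand, the kernel under discussion can be sketched roughly as follows.  This is a reconstruction from the assembly in this thread, not a copy of the original source: the function name grey_convert, the RFAC/GFAC/BFAC names and the red/green/blue channel order are assumptions, and the weights 77, 151 and 28 are read off the vmov.i8 constants in the last listing of this thread.  The intrinsics path is only compiled when __ARM_NEON is defined; otherwise a scalar loop computing the same values is used, so the sketch can be checked on any machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Weights as seen in the vmov.i8 constants; they sum to 256, which is
   why the result is narrowed with a shift right by 8.  The channel
   names are an assumption for illustration. */
enum { RFAC = 77, GFAC = 151, BFAC = 28 };

/* Convert n interleaved 3-byte pixels to one grey byte each, 8 pixels
   per iteration.  A tail of fewer than 8 pixels is not processed. */
void grey_convert(uint8_t *dest, const uint8_t *src, int n)
{
    int i;

    assert(dest != NULL && src != NULL);
    n /= 8;                               /* number of 8-pixel blocks */

#if defined(__ARM_NEON)
    uint8x8_t rfac = vdup_n_u8(RFAC);
    uint8x8_t gfac = vdup_n_u8(GFAC);
    uint8x8_t bfac = vdup_n_u8(BFAC);

    for (i = 0; i < n; i++) {
        /* vld3.8: de-interleave 24 bytes into three 8-byte vectors */
        uint8x8x3_t rgb = vld3_u8(src);
        /* vmull.u8 / vmlal.u8: widening multiply-accumulate to 16 bits */
        uint16x8_t temp = vmull_u8(rgb.val[0], rfac);
        temp = vmlal_u8(temp, rgb.val[1], gfac);
        temp = vmlal_u8(temp, rgb.val[2], bfac);
        /* vshrn.i16 #8 then vst1.8: narrow back to bytes and store */
        vst1_u8(dest, vshrn_n_u16(temp, 8));
        src += 8 * 3;
        dest += 8;
    }
#else
    /* Scalar fallback computing the same per-pixel value. */
    for (i = 0; i < n * 8; i++) {
        unsigned sum = RFAC * src[3 * i]
                     + GFAC * src[3 * i + 1]
                     + BFAC * src[3 * i + 2];
        dest[i] = (uint8_t)(sum >> 8);
    }
#endif
}
```

The rgb.val[0..2] accesses are where the stack traffic in the GCC listing comes from: the three-vector value is spilled and the individual vectors are reloaded one at a time.
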

Yeah, this is actually the first thing I had to tackle as part of the
auto-vectorisation support for vld/vst.  If it wasn't fixed, the same
problems would affect the auto-vectoriser.  The main patches are:
- http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01631.html
    When we're dealing with multiple vectors (vld2-vld4, etc.),
    allow GCC to access the individual vectors in-place, rather
    than forcing all the vectors to the stack and loading
    individual ones from there.
    Now upstream.
- http://gcc.gnu.org/ml/gcc-patches/2011-03/msg01996.html
    Changes the intrinsic patterns to use memory operands
    instead of register pointer operands.  Among other things,
    this allows the vld3.8 and vst1.8 to take the post-incremented
    addresses, as it does in the LLVM code.
    Only posted on Tuesday, awaiting review.
- Allow types like uint32x4x3_t to be stored in registers.
    I started a discussion related to this:
        http://gcc.gnu.org/ml/gcc/2011-03/msg00342.html
    and I think the outcome means that the implementation I wanted
    should be OK.  I tested the patch I've been using last night,
    and plan to submit it today.
The build I tested last night gives:
    cmp	r2, #0
    add	r3, r2, #7
    movlt	r2, r3
    mov	r2, r2, asr #3
    cmp	r2, #0
    vmov.i8	d21, #77  @ v8qi
    vmov.i8	d22, #151  @ v8qi
    vmov.i8	d23, #28  @ v8qi
    bxle	lr
    mov	r3, #0
.L3:
    vld3.8	{d18-d20}, [r1]!
    vmull.u8	q8, d18, d21
    vmlal.u8	q8, d19, d22
    vmlal.u8	q8, d20, d23
    add	r3, r3, #1
    vshrn.i16	d16, q8, #8
    cmp	r3, r2
    vst1.8	{d16}, [r0]!
    bne	.L3
    bx	lr
which seems to be the same as the LLVM code, but scheduled differently.
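
A note on the prologue of that listing, for anyone reading it cold: the cmp / add #7 / movlt / asr #3 sequence looks like a signed division of the element count by 8, rounding toward zero, presumably from an n / 8 in the C source.  A minimal C model of those four instructions is sketched below.  The names div8_trunc and check_div8 are made up for illustration, and the sketch assumes that >> on a negative int is an arithmetic shift, which C leaves implementation-defined but which holds for GCC.

```c
#include <assert.h>

/* Model of the prologue: bias negative inputs by 7, then shift right
   arithmetically by 3.  Assumes >> on a negative int is an arithmetic
   shift (implementation-defined in C, true for GCC). */
int div8_trunc(int n)
{
    if (n < 0)          /* cmp r2, #0 ; add r3, r2, #7 ; movlt r2, r3 */
        n += 7;
    return n >> 3;      /* mov r2, r2, asr #3 */
}

/* Compare the model against C's own truncating division on a small
   range that covers both signs and several multiples of 8. */
void check_div8(void)
{
    int n;

    for (n = -64; n <= 64; n++)
        assert(div8_trunc(n) == n / 8);
}
```

Without the bias, a plain arithmetic shift would round toward negative infinity, so -1 would give -1 rather than 0.  The following cmp / bxle then returns early when the resulting block count is zero or negative.
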
Richard
