NEON vectorization improvements - preliminary notes

Ira Rosen IRAR at
Wed Sep 22 12:23:17 BST 2010

Hi Julian,

Here are some thoughts about your report.

> Automatic vector size selection/mixed-size vectors
> ==================================================

I think we (I) need to cooperate with Richard Guenther: ask him about
committing his patch to 4.6 (they are probably planning to merge vect256
into 4.7?), offer help, etc. Looks like the last patch was committed to
vect256 in May... What do you think?

I can try to apply his patch and see how it behaves on ARM, once I have
access to an ARM board.

> Unimplemented GCC vector pattern names
> ======================================

> movmisalign<mode>
> -----------------
> Implemented by:

Are you waiting for approval from ARM maintainers?
Can I help somehow?
I think this patch is very important. Without it only aligned accesses can
be vectorized.
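
For illustration, a loop of the following shape contains an access whose
alignment differs from that of the array itself. The function name and the
one-element offset are assumptions made for this sketch and are not taken
from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: src[i + 1] starts one element past the base of
   src, so even if src is aligned to the vector size, a vector load from
   &src[i + 1] generally is not. */
void copy_shifted(int32_t *dst, const int32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i + 1];
}
```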

> vec_extract_even
(and interleave)
> ----------------

As a quick solution, we can add those VZIP and VUZP mappings. In the long
term, however, I think we need to exploit NEON's strided loads and stores.
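
A minimal scalar sketch of what the even/odd extraction and the interleave
compute, assuming two-way interleaved data. The function names are
hypothetical. A two-element strided load would produce the same split
directly while loading.

```c
#include <stddef.h>
#include <stdint.h>

/* Split interleaved pairs {x0, y0, x1, y1, ...} into even[] = {x0, x1, ...}
   and odd[] = {y0, y1, ...}. */
void deinterleave2(const int16_t *in, int16_t *even, int16_t *odd,
                   size_t npairs)
{
    for (size_t i = 0; i < npairs; i++) {
        even[i] = in[2 * i];
        odd[i]  = in[2 * i + 1];
    }
}

/* Inverse operation: merge even[] and odd[] back into interleaved pairs. */
void interleave2(const int16_t *even, const int16_t *odd, int16_t *out,
                 size_t npairs)
{
    for (size_t i = 0; i < npairs; i++) {
        out[2 * i]     = even[i];
        out[2 * i + 1] = odd[i];
    }
}
```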

> sdot_prod<mode>, udot_prod<mode>
> ----------------------------------

dot_prod (va, vb, acc) = { va.V0 * vb.V0 + acc.V0,  va.V1 * vb.V1 +
acc.V1, ... }
meaning it's a multiply-add, where acc and the result are twice the
length of va and vb.
And yes, it is effectively several parallel dot-product operations, as you
wrote. At the end of the vector loop we have a vector of partial results,
which we have to reduce to a scalar result in a reduction epilogue.
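
The partial-results-plus-epilogue structure can be sketched in scalar C.
This is only a model: the lane count of 4, the short/int types, and the
function name are assumptions for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* assumed number of accumulator lanes */

int32_t dot_product_model(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc[LANES] = {0};  /* vector of partial results */
    size_t i = 0;

    /* "Vector" loop: each lane accumulates its own partial dot product. */
    for (; i + LANES <= n; i += LANES)
        for (size_t l = 0; l < LANES; l++)
            acc[l] += (int32_t)a[i + l] * (int32_t)b[i + l];

    /* Reduction epilogue: collapse the partial results to a scalar. */
    int32_t sum = 0;
    for (size_t l = 0; l < LANES; l++)
        sum += acc[l];

    /* Scalar tail for the elements that do not fill a whole vector. */
    for (; i < n; i++)
        sum += (int32_t)a[i] * (int32_t)b[i];

    return sum;
}
```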

> ssum_widen<mode>3, usum_widen<mode>3
> ------------------------------------
> Implemented, but called widen_[us]sum<mode>3. Doc or code bug? (Doc, I

This is how it is mapped in genopinit.c:

  "set_optab_handler (ssum_widen_optab, $A, CODE_FOR_$(widen_ssum$I$a3$))",
  "set_optab_handler (usum_widen_optab, $A, CODE_FOR_$(widen_usum$I$a3$))",

So, it is implemented.
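
For reference, the scalar loop shape that a widening sum corresponds to is
a sum of narrow elements into a wider accumulator. A minimal sketch, with a
hypothetical function name and short/int chosen as example types:

```c
#include <stddef.h>
#include <stdint.h>

/* Each 16-bit element is widened to 32 bits before being added, so the
   accumulator does not wrap where a 16-bit sum would. */
int32_t widen_sum_scalar(const int16_t *a, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (int32_t)a[i];
    return sum;
}
```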

> vec_pack_trunc_<mode>
> ---------------------
> Not implemented. ARM have a patch:

This is implemented in

> vec_pack_ssat_<mode>, vec_pack_usat_<mode>
> ------------------------------------------
> Not implemented (probably easy). VQMOVN. (VQMOVUN wouldn't be needed).

The only target that implements that is mips, so I am not sure it is

> vec_widen_[us]mult_{hi,lo}_<mode>
> ----------------------------------

This is used for widening multiplication:

int a[N];
short b[N], c[N];

for i
 a[i] = b[i] * c[i]

which gets vectorized as follows:

vector int v0, v1;
for i
  v0 = vec_widen_smult_hi (b[8i:8i+7], c[8i:8i+7]);
  v1 = vec_widen_smult_lo (b[8i:8i+7], c[8i:8i+7]);
  a[8i:8i+3] = v0;
  a[8i+4:8i+7] = v1;

I think that on NEON we can just use VMULL (and one store) to do this. Of
course, that requires support on the vectorizer side, probably including
multiple vector size support, unless it can be abstracted out somehow...

(After writing that, I checked ;), and these two are actually
there, implemented with two instructions each (if I read it correctly). So
with the current implementation we need 6 instructions instead of,
hopefully, only 2.)
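
A scalar model of the hi/lo scheme may make the data flow clearer. It is a
sketch under assumptions: the function names are hypothetical, n is assumed
to be a multiple of 8, and which half the "hi" and "lo" patterns select
depends on the target's endianness, so the halves are simply numbered 0
and 1 here.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of one vec_widen_smult_{hi,lo} step: multiply one half (4 of 8)
   of the short elements and produce 4 int results. half is 0 for
   elements 0..3 and 1 for elements 4..7. */
static void widen_mult_half(const int16_t *b, const int16_t *c,
                            int half, int32_t out[4])
{
    for (int l = 0; l < 4; l++)
        out[l] = (int32_t)b[4 * half + l] * (int32_t)c[4 * half + l];
}

/* a[i] = b[i] * c[i], processed 8 short elements at a time: two
   half-width widening multiplies and two stores per iteration. */
void widen_mult_model(int32_t *a, const int16_t *b, const int16_t *c,
                      size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        widen_mult_half(b + i, c + i, 0, a + i);
        widen_mult_half(b + i, c + i, 1, a + i + 4);
    }
}
```

A single widening multiply over all the elements of a narrower input
vector would replace each pair of half-width steps, which is where the
hoped-for saving comes from.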

> vec_unpack[su]_{hi,lo}_<mode>
> -----------------------------
> Not implemented. (Do ARM have a patch for this one?)

I see them in

> NEON capabilities not covered by the vectorizer
> ===============================================

I would start from a typical benchmark and see what features are required.

> The goal is a 15% speed improvement in EEMBC relative to FSF GCC 4.5.0
Does this mean improving a single EEMBC benchmark by 15%?
Do you have a copy of EEMBC? I have a very old version, without DENBench,
which looks interesting according to EEMBC's site. Other than that,
TeleBench and Consumer might have vectorization potential.

We have holidays until October 3, so I probably will not be able to respond
until then.

