NEON vectorization: use of specialized load/store instructions
IRAR at il.ibm.com
Thu Oct 14 13:35:28 UTC 2010
Julian Brown <julian at codesourcery.com> wrote on 11/10/2010 04:29:15 PM:
> In further followups (at the risk of misrepresenting Joseph & Paul
> Brook's opinions!), there seemed to be general agreement that a scheme
> something like that outlined below, with "permuting" loads/stores and
> some way of handling multiple in-register layouts for vectors seems
> like it will be a necessary addition to the vectorizer, going forward.
Let me check that I understand the problem first. In big-endian mode, the
VLD1 and VST1 instructions follow the array numbering of elements, while all
other memory instructions (VLDR, VLDM, VSTR, VSTM) do not. So, do we have two
problems here? The first is that VLD1/VST1 and VLDR, etc. can't be mixed in
one computation. The second is that access to a single element is incorrect
when VLDR, etc. are used. Is that right?

In addition, we need to think about how to represent VLD2/3, so that the
vectorizer can use them. Right?
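
To make sure we mean the same thing, here is a toy model in plain C of the
two load behaviours as I understand them. It is not real NEON code: the
type vec4x16 and the helper names are made up for illustration, and I am
assuming a 64-bit register with four 16-bit lanes, lane 0 being the least
significant.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 4

/* Hypothetical model of a 64-bit D register: four 16-bit lanes,
   lane 0 = least significant 16 bits. */
typedef struct { uint16_t lane[LANES]; } vec4x16;

/* VLD1.16-like load (assumed behaviour): memory element i lands in
   lane i, whatever the endianness. */
static vec4x16 load_array_order(const uint16_t *mem)
{
    vec4x16 v;
    for (int i = 0; i < LANES; i++)
        v.lane[i] = mem[i];
    return v;
}

/* VLDR-like load in big-endian mode (assumed behaviour): the register
   is filled as one 64-bit value, so the lowest-addressed element ends
   up in the most significant lane. */
static vec4x16 load_doubleword_be(const uint16_t *mem)
{
    vec4x16 v;
    for (int i = 0; i < LANES; i++)
        v.lane[LANES - 1 - i] = mem[i];
    return v;
}

/* Read one lane by its register lane number. */
static uint16_t extract_lane16(vec4x16 v, int lane)
{
    assert(lane >= 0 && lane < LANES);
    return v.lane[lane];
}
```

With mem = {10, 20, 30, 40}, lane 0 holds 10 after the first load and 40
after the second. If that model is right, it shows both problems: the two
kinds of load disagree on layout, and extracting "element 0" by lane number
gives the wrong value after the VLDR-like load.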
> I'm thinking (without having much idea about how feasible such an idea
> is) of something along the lines of a function (in the mathematical
> sense) attached to each vector value manipulated by the vectorizer, to
> map that value's element numberings to and from memory offsets.
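
If I read this right, such a function could be as simple as a per-value
layout tag plus a pair of index mappings. The sketch below is only my guess
at the shape of the idea, in plain C rather than anything from GCC; the enum
vec_layout and all function names are hypothetical, and I only model the two
layouts discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout tag attached to each vector value. */
enum vec_layout {
    LAYOUT_ARRAY_ORDER,  /* lane i holds memory element i */
    LAYOUT_REVERSED      /* lane i holds memory element n-1-i */
};

/* Map a register lane number to the memory element index it holds. */
static size_t lane_to_mem_index(enum vec_layout layout, size_t lane,
                                size_t nlanes)
{
    assert(lane < nlanes);
    return layout == LAYOUT_ARRAY_ORDER ? lane : nlanes - 1 - lane;
}

/* Inverse mapping: which lane holds a given memory element. */
static size_t mem_index_to_lane(enum vec_layout layout, size_t idx,
                                size_t nlanes)
{
    assert(idx < nlanes);
    return layout == LAYOUT_ARRAY_ORDER ? idx : nlanes - 1 - idx;
}

/* Lane-wise operations on two values only make sense if both values
   use the same layout; otherwise a permute would be needed first. */
static int layouts_compatible(enum vec_layout a, enum vec_layout b)
{
    return a == b;
}
```

Both layouts here happen to be their own inverse, which would not hold for
more general permutations such as those needed for VLD2/3.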
Joseph Myers <joseph at codesourcery.com> wrote on 08/10/2010 02:54:29 AM:
> Make it possible to describe in generic RTL a permuting
> vector load whose alignment requirement is element alignment, describe
> vld1 that way, and teach the vectorizer how to use such loads and stores.
Does that mean that the vectorizer will be aware of specific instructions?
I can see several places where the order of elements is important in the
vectorizer's code generation:
- interleave_high/low and widening operations - but I am not sure that the
current implementation suits NEON best, so maybe those are less important.
- extraction of the scalar result in a reduction
> The ARM implementations of reduction operations
> fortuitously calculate the results across all elements simultaneously,
> so when one of those elements is extracted, we still get the right
So, does that mean that's not a problem?
- various scalar/invariant vectors, including initializations for reduction
- the order of elements in loads and stores should match
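
On the reduction item, this is how I picture the claim quoted above, again
as a toy model in plain C and not real NEON code. The type vec4x32 and the
function names are hypothetical, and I am assuming the reduction leaves the
total in every lane.

```c
#include <assert.h>
#include <stdint.h>

#define RED_LANES 4

/* Hypothetical model of a vector with four 32-bit lanes. */
typedef struct { uint32_t lane[RED_LANES]; } vec4x32;

/* Model of a reduction that computes the sum across all lanes and
   leaves that sum in every lane of the result (assumed behaviour). */
static vec4x32 reduce_add_all_lanes(vec4x32 v)
{
    uint32_t sum = 0;
    for (int i = 0; i < RED_LANES; i++)
        sum += v.lane[i];

    vec4x32 r;
    for (int i = 0; i < RED_LANES; i++)
        r.lane[i] = sum;
    return r;
}

/* Extract a scalar from a given lane of the reduction result. */
static uint32_t extract_reduction_lane(vec4x32 v, int lane)
{
    assert(lane >= 0 && lane < RED_LANES);
    return v.lane[lane];
}
```

In this model, extracting lane 0 or lane 3 of the reduced vector gives the
same scalar, so the lane numbering does not matter for that step. It would
matter again if the reduction left the result in only one lane.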