NEON vectorization: use of specialized load/store instructions
IRAR at il.ibm.com
Mon Oct 18 10:39:59 UTC 2010
Joseph Myers <joseph at codesourcery.com> wrote on 14/10/2010 05:18:37 PM:
> On Thu, 14 Oct 2010, Ira Rosen wrote:
> > Let me check that I understand the problem first: the problem is that
> > VLD1 and VST1 instructions in big endian mode follow the array numbering
> > of elements, while all other memory instructions (VLDR, VLDM, VSTR, VSTM)
> > do not. So, do we have two problems here? The first one is that VLD1/VST1
> > and VLDR, etc. can't be mixed in one computation. And the second one is
> > that access to a single element is incorrect when VLDR, etc. are used.
> > Is that correct?
> In terms of the native lane numbering used in NEON instructions, VLD1 and
> VST1 respect array ordering and are the instructions that can be used with
> single-element accesses, while the other instructions do not respect the
> ordering and cannot be so used without adjusting the element numbers.
> In terms of the architecture-independent RTL semantics, VLDR, VLDM, VSTR
> and VSTM respect array ordering and can be used with single-element
> accesses, while VLD1 and VST1 do not respect the ordering and cannot be
> used without adjusting element numbers.
> The VLDR etc. order is the one required to be used for argument passing
> and return of vectors, and is the only one readily available when vectors
> are loaded/stored using core registers rather than NEON registers.
> Thus, when generic RTL is generated from a NEON intrinsic (defined using
> native lane numbering) in big-endian mode, the lane number is adjusted to
> make the generic RTL correct, and when assembly code is generated from
> generic RTL the reverse adjustment is made.
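A minimal sketch in C of the lane-number adjustment described above. The helper name adjust_lane is hypothetical, and the sketch assumes the two numberings are simply reversed within a vector of nunits elements; a real back end may need a more involved mapping (for example for quad-word registers). Under that assumption the adjustment is its own inverse, so the same function can serve for both the forward and the reverse adjustment.

```c
#include <stddef.h>

/* Hypothetical helper, not GCC API: translate a lane number between the
   native NEON numbering and the generic (array-order) numbering.
   ASSUMPTION: in big-endian mode the two numberings are plain mirror
   images of each other within a vector of `nunits` elements.
   In little-endian mode no adjustment is made.  */
static unsigned adjust_lane(unsigned lane, unsigned nunits, int big_endian)
{
    return big_endian ? nunits - 1 - lane : lane;
}
```

Applying adjust_lane twice with the same arguments returns the original lane number, which models "the reverse adjustment" made when assembly is emitted.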
> > In addition, we need to think about how to represent VLD2/3, so the
> > vectorizer can use them. Right?
> Yes. (I think code using arrays of red/green/blue values is the sort of
> real-world (and benchmark) code expected to be vectorized using VLD3.)
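For illustration, the kind of loop meant here can be sketched in scalar C as follows. The function name deinterleave_rgb and the choice of 8-bit channels are assumptions made for the example; the loop only models the de-interleaving access pattern of a 3-element structure load and is not how the vectorizer represents it.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a de-interleaving load of 3-element structures:
   pixel i is stored as rgb[3*i], rgb[3*i+1], rgb[3*i+2], and each
   channel is gathered into its own contiguous array.  A vectorized
   version would fetch several pixels at once with one stride-3 load.  */
static void deinterleave_rgb(const uint8_t *rgb, size_t npixels,
                             uint8_t *r, uint8_t *g, uint8_t *b)
{
    for (size_t i = 0; i < npixels; i++) {
        r[i] = rgb[3 * i + 0];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}
```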
> > Joseph Myers <joseph at codesourcery.com> wrote on 08/10/2010 02:54:29 AM:
> > > Make it possible to describe in generic RTL a permuting
> > > vector load whose alignment requirement is element alignment,
> > > describe vld1 that way, and teach the vectorizer how to use such loads
> > > and stores.
> > Does that mean that the vectorizer will be aware of specific
> I would imagine that it would need to know what permutations are
> available, yes (GIMPLE and RTL would have some form of general permuting
> load/store operation, which the vectorizer would only generate where
> relevant instructions exist for the chosen permutation).
So, there will be a new tree code, e.g. PERM_LOAD_EXPR, and the vectorizer
will use it for misaligned loads in big endian (or maybe for little endian
as well), and for strided loads. The vectorizer will check whether the
instruction is supported, passing the desired stride (1, 2 or 3) as input,
and will receive a mask in return. It will use the mask to permute all other
relevant vectors (such as vectors of constants) if necessary, keeping the
generic GIMPLE and RTL correct. And later, when assembly code is generated,
everything should be permuted again?
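
To make the question concrete, here is a rough sketch in C of what such a mask could look like for a strided load. Everything in it is an assumption for illustration: the name strided_load_mask, the flat array layout of the mask, and the convention that mask[k] gives the memory-order index of the element placed at position k of the concatenated result vectors. It is not an existing GCC interface.

```c
#include <stddef.h>

/* Hypothetical sketch, not GCC API: compute the element permutation
   performed by a load with the given stride that fills `stride` result
   vectors of `nunits` elements each.
   mask[v * nunits + i] is the memory-order index of the element that
   lands in lane i of result vector v.  The caller must supply room for
   stride * nunits entries.  With stride == 1 the mask is the identity.  */
static void strided_load_mask(unsigned stride, unsigned nunits,
                              unsigned *mask)
{
    for (unsigned v = 0; v < stride; v++)
        for (unsigned i = 0; i < nunits; i++)
            mask[v * nunits + i] = i * stride + v;
}
```

The same mask could then be applied to the other vectors taking part in the computation (e.g. vectors of constants) so that they line up with the permuted data.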
> Joseph S. Myers
> joseph at codesourcery.com