[gnu-arm-releases] Re: [PATCH, WIP] NEON quadword vectors in big-endian mode (#10061, #7306)
julian at codesourcery.com
Wed Dec 1 11:24:01 UTC 2010
On Wed, 1 Dec 2010 11:16:16 +0200
Ira Rosen <ira.rosen at linaro.org> wrote:
> >> > v0 = MEM_REF (addr)
> >> > v1 = MEM_REF (addr + 8B)
> >> > v2 = MEM_REF (addr + 16B)
> >> > builtin (v0, v1, v2, stride=3, reg_stride=1,...)
> > Would the builtin be changing the semantics of the preceding MEM_REF
> > codes? If so I don't like this much (the potential for the builtin
> > getting "separated" from the MEM_REFS by optimisation passes and
> > causing subtle breakage seems too high). But if eliding the builtin
> > would simply cause the code to degrade into separate loads/stores,
> > I guess that would be OK.
> The meaning of the builtin (or maybe a new tree code would be better?)
> is that the elements of v0, v1 and v2 are deinterleaved. I wanted the
> MEM_REFs, since we actually have three data accesses here, and
> something (builtin or tree code) to indicate the deinterleaving. Since
> the vectors are passed to the builtin, I don't think it's a problem if
> the statements get separated. When the expander sees the builtin, it
> has to remove the loads it created for the MEM_REFs and create a new
> "vector load multiple and deinterleave". Is that possible?
I can imagine it might get awkward if the builtin ends up in a
different basic block from the MEMs -- but I don't know enough about
the tree optimization passes to say if that's likely to ever happen.
> If there are other uses of the accesses, i.e., MEM_REF (addr) is used
> somewhere else in the loop, the vectorizer will have to create another
> v0' = MEM_REF (addr) for it.
OK, I see. This representation sounds quite reasonable to me.
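To make the intended semantics concrete, here is a minimal Python model of the deinterleaving the builtin would express (the flat `mem` list, index-based `addr`, and the 8-lane width are illustrative assumptions, not GCC's actual representation):

```python
def vld3(mem, addr, lanes=8):
    """Model a vld3-style structured load: read 3*lanes interleaved
    elements starting at addr and deinterleave them into three
    vectors (stride=3, reg_stride=1)."""
    interleaved = mem[addr:addr + 3 * lanes]
    v0 = interleaved[0::3]  # elements 0, 3, 6, ...
    v1 = interleaved[1::3]  # elements 1, 4, 7, ...
    v2 = interleaved[2::3]  # elements 2, 5, 8, ...
    return v0, v1, v2
```

The point of the model is just that the three MEM_REFs cover the raw accesses, while the builtin (or tree code) supplies the element shuffle on top.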
> >> > to be expanded into:
> >> >
> >> > <regular RTL mem refs> (addr)
> >> > NOTE (...)
> >> I guess we can do something similar to load_multiple here (but it
> >> probably requires changes in neon.md as well).
> > Yeah, I like that idea. So we might have something like:
> > (parallel
> > [(set (reg) (mem addr))
> > (set (reg+2) (mem (plus addr 8)))
> > (set (reg+4) (mem (plus addr 16)))])
> > That should work fine I think -- but how to do register-allocation
> > on these values remains an open problem (since GCC has no direct
> > way of saying "allocate this set of registers contiguously"). ARM
> > load & store multiple are only used in a couple of places, where
> > hard regnos are already known, so aren't directly comparable.
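As a side note, the correspondence the parallel above encodes -- register number stepping by 2 and memory address stepping by 8 bytes -- can be pinned down with a tiny sketch (function name and defaults are made up for illustration):

```python
def expand_load_multiple(base_reg, addr, count, reg_step=2, mem_step=8):
    """Enumerate the (regno, address) pairs that a load-multiple
    parallel stands for: reg, reg+2, reg+4, ... loaded from
    addr, addr+8, addr+16, ..."""
    return [(base_reg + i * reg_step, addr + i * mem_step)
            for i in range(count)]
```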
> PowerPC also has load/store multiple, but I guess they are generated
> in the same phase as for ARM. Maybe there are other architectures
> that allocate contiguous registers, but do so earlier?
I don't know about other architectures which do that.
> Also, as you probably know, we have to keep in mind that the registers
> do not have to be contiguous: d, d+2, d+4 is fine as well - and this
> case is very important for us since this is the way to work with
> quadword vectors.
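The constraint itself is easy to state; a hypothetical helper, just to nail down the stride-1 vs. stride-2 rule for D-register lists:

```python
def valid_neon_reg_list(regnos, stride):
    """Check that a list of D-register numbers forms a legal NEON
    register list: consecutive entries separated by `stride`
    (1 for a plain doubleword list, 2 when stepping across the
    even-numbered halves of quadword registers)."""
    return all(b - a == stride for a, b in zip(regnos, regnos[1:]))
```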
> > Choices I can think of are:
> > 1. Use "big integer" modes (TImode, OImode, XImode...), as in the
> > present patterns, but with (post-reload?) splitters to create
> > the parallel RTX as above. So, prior to reload, the RTL would look
> > different (like the existing patterns, with an UNSPEC?), so as to
> > allocate the "big" integer to consecutive vector registers. This
> > doesn't really gain anything, and I think it'd really be best if
> > those types could be removed from the NEON backend anyway.
> > 2. Use "big vector" modes (representing multiple vector registers --
> > up to e.g. V16SImode). We'd have to make sure these *never* end
> > up in core registers somehow, since that would certainly lead to
> > reload failure. Then the parallel might be written something like
> > this (for vld1 with four D registers):
> > (parallel
> > [(use (reg:V8SI 0))
> > (set (subreg:V2SI (match_dup 0) 0) (mem addr))
> > (set (subreg:V2SI (match_dup 0) 8) (mem (plus addr 8)))
> > (set (subreg:V2SI (match_dup 0) 16) (mem (plus addr 16)))
> > (set (subreg:V2SI (match_dup 0) 24) (mem (plus addr 24)))])
> > Or perhaps the same but with vec_select instead of subreg (the
> > patch I'm working on suggests that subreg on vector types works
> > fine, most of the time). This would require altering vld*/vst*
> > intrinsics -- but that's something I'm planning to do anyway, and
> > probably also tweaking the way "foo.val[X]" accesses (again for
> > intrinsics) are expanded in the front-ends, as a NEON-specific hack.
> > The main worry is that I'm not sure how well the register allocator
> > & reload will handle these large vectors.
> > The vectorizer would need a way of extracting elements or vectors
> > from these extra-wide vectors: in terms of RTL, subreg or
> > vec_select should suffice for that.
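The subreg-style extraction amounts to slicing elements out of the wide value at a byte offset; a rough Python model (the 4-byte SImode element size and the function name are assumptions for illustration):

```python
def subreg(wide, byte_offset, nelems, elem_bytes=4):
    """Model a (subreg:V2SI (reg:V8SI ...) byte_offset): pull
    nelems elements out of a wide vector value starting at the
    given byte offset."""
    start = byte_offset // elem_bytes
    return wide[start:start + nelems]
```

So for a V8SI value, byte offset 8 lands on element 2, giving the second V2SI half-register's worth of data.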
> Why does the vectorizer have to know about this?
> > 3. Treat vld* & vst* like (or even as?) libcalls. Regular function
> > calls have kind-of similar constraints on register usage to these
> > multi-register operations (i.e. arguments must be in consecutive
> > registers), so as a hack we could reuse some of that mechanism (or
> > create a similar mechanism), provided that we can live with vld* &
> > vst* always working on a fixed list of registers. E.g. we'd end up
> > with RTL:
> > Store:
> > (set (reg:V2SI d0) (...))
> > (set (reg:V2SI d1) (...))
> > (call_insn (fake_vst1) (use (reg:V2SI d0)) (use (reg:V2SI d1)))
> > Load:
> > (parallel (set (reg:V2SI d0) (call_insn (fake_vld1)))
> > (set (reg:V2SI d1) (...)))
> > (set (...) (reg:V2SI d0))
> > (set (...) (reg:V2SI d1))
> > (Not necessarily with actual call_insn RTL: I just wrote it like
> > that to illustrate the general idea.)
> > One could envisage further hacks to lift the restriction on the
> > fixed registers used, e.g. by allocating them in a round-robin
> > fashion per function. Doing things this way would also require
> > intrinsic-expansion changes, so isn't necessarily any easier than
> > (2).
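The per-function round-robin idea might look something like this sketch (entirely hypothetical names and parameters, just to show the shape of the hack):

```python
from itertools import count

def make_round_robin_allocator(nregs=32, group=4):
    """Hand out fixed contiguous D-register groups in round-robin
    order within a function, as a stand-in for real contiguous-
    register support in the allocator."""
    counter = count()
    def alloc():
        base = (next(counter) * group) % nregs
        return list(range(base, base + group))
    return alloc
```

Copies in and out of the fixed groups would still be needed around each vld*/vst*, exactly as with the libcall-style approach.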
> Is having a scan for special instructions before/after/during register
> allocation not an option?
I guess it is -- a pass before register allocation is broadly
equivalent to choice (3) above: knowing that the register allocator
isn't smart enough to deal with contiguous register sets, such a pass
would replace those registers with hard regs (copying values in/out as
appropriate) before allocation proper.
The ideal solution would be to deal with those registers during the
register allocation phase, i.e. by adding generalised support for
allocating contiguous (or stride-2, ...) registers to the register
allocator. I'm kind of assuming this is intractable in practice though,
useful as it'd be. I'm far from an expert in the "new" RA though --
maybe others have input on how feasible it might be to add such support?