Automatic vector size selection/mixed-size vectors
==================================================

The "vect256" branch now has a vectorization factor argument for
UNITS_PER_SIMD_WORD (allowing selection of different vector sizes).
Patches to support that would need backporting to 4.5 if that looks
useful. Could investigate the feasibility of doing that.

Currently UNITS_PER_SIMD_WORD is only used in
tree-vect-stmts.c:get_vectype_for_scalar_type (which itself is used
in several places).

Generally (check assumption) I think that wider vectors may make
inner loops more efficient, but may increase the size of
setup/teardown code (e.g. setup: increased versioning; teardown:
increased insns for reduction ops). More importantly, sometimes
larger vectors may inhibit vectorization altogether. Ideally we want
to calculate costs per vector size, per loop (or per other
vectorization opportunity).

Using the vect256 bits is probably much easier than the alternatives.

ARMv6 SIMD operations
=====================

It looks like several of the ARMv6 instructions may be useful to the
vectorizer, or even just to regular integer code. Some of the
instructions are supported already, but it's possible that we could
support more -- particularly if combine is now able to recognize
longer instruction sequences. GCC already has V4QI and V2HI modes
enabled on ARM.

PKH
---

Pack halfword. May be usable by combine (or may be too complicated).

QADD16, QADD8, QASX, QSUB16, QSUB8, QSAX
UQADD16, UQADD8, UQASX, UQSUB16, UQSUB8, UQSAX
----------------------------------------------

Saturating adds/subtracts. No use to the vectorizer or combine at
present.

REV, REV16, REVSH
-----------------

Byte-reversal operations. Unlikely to be usable without builtins; REV
is currently supported that way.

SADD8, SADD16, UADD8, UADD16
----------------------------

Packed addition of bytes/halfwords (setting the GE flags). Should be
usable by the vectorizer.

SEL
---

Select bytes depending on the GE flags. Can probably be used by the
vectorizer to implement vcond on core registers, as sketched below.
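
For illustration, this is the kind of scalar loop a vcond pattern
would cover (a sketch only; the function name is made up). Each
per-byte unsigned comparison could in principle become a USUB8 (which
sets the four GE flags) followed by SEL, handling four bytes per core
register:

  /* Per-byte select: r[i] gets a[i] where a[i] >= b[i], else b[i].
     USUB8 sets GE[i] for each unsigned byte lane where a[i] >= b[i];
     SEL then picks each byte from one operand or the other.  */
  void
  select_bytes_u8 (unsigned char *r, const unsigned char *a,
                   const unsigned char *b, int n)
  {
    int i;
    for (i = 0; i < n; i++)
      r[i] = (a[i] >= b[i]) ? a[i] : b[i];
  }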

SHADD8, SHADD16, SHSUB8, SHSUB16
UHADD8, UHADD16, UHSUB8, UHSUB16
--------------------------------

Packed additions & subtractions, halving the results before writing
them to the destination register. Probably can't be used by the
vectorizer at present.

SMLAD, SMLALD
-------------

Two packed 16-bit multiplies, adding both results to a 32-bit (SMLAD)
or 64-bit (SMLALD) accumulator. The pattern can be written in RTL,
and is possibly recognizable by combine.

SMLSD, SMLSLD
-------------

Adds the difference of two packed 16-bit multiplies to an
accumulator. Again, this can be written in RTL, but will combine be
able to do anything with it?

SMMLA, SMMLS, SMMUL
-------------------

32x32-bit multiplies returning the most significant 32 bits of the
result, with optional accumulate or subtract. Can probably be added
quite easily, if combine plays nicely.

SMUAD, SMUSD
------------

Packed multiply with "sideways" add or subtract before writing to the
destination. Could probably be recognized by combine.

SMULBB, SMULBT, SMULTB, SMULTT
------------------------------

(ARMv5TE instructions.) Supported. There are no unsigned variants of
these.

SSAT, SSAT16, USAT, USAT16
--------------------------

Saturate (signed or unsigned) to a power-of-two range given by a bit
position. No use to the vectorizer.

SSUB8, SSUB16, USUB8, USUB16
----------------------------

Packed 8- or 16-bit subtraction, setting flag bits. Could potentially
be used by the vectorizer.

SASX, SSAX, UASX, USAX
----------------------

[Un]signed add/subtract with exchange, or [un]signed subtract/add
with exchange. May be usable from regular code, but might be too much
for combine. (Maybe the intermediate pseudo-instruction trick might
work, though?)

SXTAB, SXTAH, UXTAB, UXTAH
--------------------------

Sign- or zero-extend a byte (SXTAB/UXTAB) or halfword (SXTAH/UXTAH)
and add. Already supported.

SXTAB16, UXTAB16
----------------

Extract two 8-bit values from a shifted register, sign- or
zero-extend them to 16 bits, and add them to 16-bit values from
another register, e.g. to add to wider accumulators. May be usable by
the vectorizer.

SXTB, UXTB, SXTH, UXTH
----------------------

Sign-extend or zero-extend bytes or halfwords. Supported.

SXTB16, UXTB16
--------------

Packed widening ops (two bytes to two halfwords). Could potentially
be used by the vectorizer.

SHASX, SHSAX, UHASX, UHSAX
--------------------------

[Un]signed halving add/subtract (or subtract/add) with exchange, on
16-bit elements. Probably can't be used by the vectorizer at present.
Possibly combinable, but probably too complex.

UMAAL
-----

Unsigned 32x32->64-bit multiply with two follow-on 32-bit adds to the
64-bit result. Might be combinable at a push.

USAD8, USADA8
-------------

Sum of absolute differences of four byte pairs, writing (USAD8) or
accumulating (USADA8) the result into a 32-bit register. Probably not
usable by the vectorizer or combine; the scalar form is sketched
below.
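
For reference, the scalar equivalent of what USAD8/USADA8 compute,
four byte lanes at a time (a sketch; the function name is made up):

  /* Sum of absolute differences over unsigned bytes.  USAD8 computes
     four lanes' worth of |a[i] - b[i]| and sums them; USADA8 also
     adds the running accumulator.  */
  unsigned int
  sad_u8 (const unsigned char *a, const unsigned char *b, int n)
  {
    unsigned int acc = 0;
    int i;
    for (i = 0; i < n; i++)
      acc += (a[i] > b[i]) ? (a[i] - b[i]) : (b[i] - a[i]);
    return acc;
  }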

Loops with more than two basic blocks (if statements, etc.)
===========================================================

Use of specialized load instructions
====================================

Unimplemented GCC vector pattern names
======================================

movmisalign<mode>
-----------------

Implemented by:
http://gcc.gnu.org/ml/gcc-patches/2010-08/msg00214.html
In SG++/Linaro 4.5 already.

vec_extract_even<mode>
----------------------

Not implemented. This can be done using VUZP, keeping only the
even-elements half of the result.

vec_extract_odd<mode>
---------------------

Not implemented. This can be done using VUZP, keeping only the
odd-elements half of the result. Ideally a vec_extract_even paired
with a vec_extract_odd would only create the one insn...

vec_interleave_high<mode>
-------------------------

Not implemented. Can be done using VZIP, keeping only the high half
of the result.

vec_interleave_low<mode>
------------------------

Not implemented. Can be done using VZIP, keeping only the low half of
the result. (Similarly, a paired vec_interleave_low &
vec_interleave_high would ideally only create one insn.)

vec_init<mode>
--------------

Implemented. There is probably some scope for adding more cleverness
for initialising values in vectors: arm.c:neon_expand_vector_init
knows some tricks already.

sdot_prod<mode>, udot_prod<mode>
--------------------------------

Not implemented. It's not entirely clear what operation this should
support: I think it's several parallel dot-product operations, not
one big dot-product. So the most natural thing to implement would be
something like e.g.:

  VMULL.s8   qTMP, d1, d2
  VPADD.s16  dTMP2, dTMPlo, dTMPhi
  VADD.s16   d0, dTMP2, d3

We could possibly use VPADAL instead of VPADD, with d3 wider than
dTMP2, if my reading is correct. In that case we wouldn't need the
VADD. We can definitely do something here, though it's a little
unclear what at present. (The scalar reduction this pattern would
match is sketched below.)
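
For reference, the kind of reduction loop the dot-product pattern is
presumably meant to catch (a sketch; the function name is made up):
narrow elements multiplied and accumulated into wider partial sums,
which is what the VMULL/VPADD sequence above implements lane-wise.

  /* 8-bit x 8-bit products accumulated into a 16-bit sum; the
     vectorizer would keep several partial sums in parallel and
     reduce them after the loop.  */
  short
  dot_prod_s8 (const signed char *a, const signed char *b, int n)
  {
    short acc = 0;
    int i;
    for (i = 0; i < n; i++)
      acc += a[i] * b[i];
    return acc;
  }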

ssum_widen<mode>3, usum_widen<mode>3
------------------------------------

Implemented, but called widen_[us]sum<mode>3. Doc or code bug? (Doc,
I think.)

vec_pack_trunc_<mode>
---------------------

Not implemented. ARM have a patch:
http://gcc.gnu.org/ml/gcc-patches/2010-08/msg02175.html

vec_pack_ssat_<mode>, vec_pack_usat_<mode>
------------------------------------------

Not implemented (probably easy): VQMOVN. (VQMOVUN wouldn't be
needed.)

vec_pack_sfix_trunc_<mode>, vec_pack_ufix_trunc_<mode>
------------------------------------------------------

Not implemented. Can be done for D registers by converting via a Q
register and a separate narrowing insn:

  (massage d1 & d2 into qTMP)
  VCVT.s32.f32  qTMP2, qTMP
  VMOVN.s32     d0, qTMP2

Usual caveats about register allocation & introducing copies apply.
Used on other targets for DFmode to SImode conversions: might not be
useful for NEON (which would only be supporting V2SF to V4HI
conversions).

vec_unpack[su]_{hi,lo}_<mode>
-----------------------------

Not implemented. (Do ARM have a patch for this one?)

vec_unpack[su]_float_{hi,lo}_<mode>
-----------------------------------

Not implemented.

vec_widen_[us]mult_{hi,lo}_<mode>
---------------------------------

Not implemented.

vrotl<mode>3
vrotr<mode>3
------------

Not implemented (no NEON insns).

vec_set<mode>
vec_extract<mode>
reduc_[us]{min,max}_<mode>
vec_shl_<mode>
vec_shr_<mode>
vashl<mode>3
vashr<mode>3
vlshr<mode>3
------------

Implemented.

NEON capabilities not covered by the vectorizer
===============================================

Any other missed opportunities
==============================