Agenda for tomorrow's call.
Ramana Radhakrishnan
ramana.radhakrishnan at linaro.org
Tue Nov 15 12:06:00 UTC 2011
On 15 November 2011 09:19, Richard Sandiford
<richard.sandiford at linaro.org> wrote:
> Revital Eres <revital.eres at linaro.org> writes:
>>> chain, so what makes the SMS version of it worse than the non-SMS version?
>>
>> I attached the SMS dump file. The problematic loop is the one with
>> "SMS succeeded 36 2" (there are three loops in total in this file).
>> Due to these accumulators min ii is 36 which seems to cause SMS to
>> take wrong decisions.
>>
>> SMS iis 36 36 72 (rec_mii, mii, maxii)
>
> OK, so the minimum ii comes from each dependency in the chain of
> 4 accumulations having a latency of 9 cycles. But the A9 TRM says:
>
> If a multiply-accumulate follows a multiply or another
> multiply-accumulate, and depends on the result of that first
> instruction, then if the dependency between both instructions are of the
> same type and size, the processor uses a special multiplier accumulator
> forwarding. This special forwarding means the multiply instructions can
> issue back-to-back because the result of the first instruction in cycle
> 5 is forwarded to the accumulator of the second instruction in cycle
> 4. If the size and type of the instructions do not match, then Dd or Qd
> is required in cycle 3. This applies to combinations of the
> multiply-accumulate instructions VMLA, VMLS, VQDMLA, and VQDMLS, and the
> multiply instructions VMUL and VQDMUL.
>
> So I think the problem is that successive VMLAs don't in fact have a
> latency of 9. However, this doesn't seem to be modelled in the ARM
> backend, either through bypasses or in a sched-reorder hook.
> In contrast, the A8 pipeline description has:
This should be identical for both the A8 and A9 descriptions.
;; Instructions using this reservation read their (D|Q)n operands at N2,
;; their (D|Q)m operands at N1, their (D|Q)d operands at N3, and
;; produce a result at N6 on cycle 4.
(define_insn_reservation "cortex_a8_neon_mla_qqq_32_qqd_32_scalar" 9
  (and (eq_attr "tune" "cortexa8")
       (eq_attr "neon_type" "neon_mla_qqq_32_qqd_32_scalar"))
  "cortex_a8_neon_dp_4")
I thought I had spotted the bypass for this, but you are right: there is
no bypass that handles this particular case.
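
As a rough sanity check on the figures in the dump, here is a minimal
Python sketch of the recurrence bound. It assumes the four accumulations
form a single loop-carried dependence cycle with distance 1, which is my
reading of the dump rather than something it states. The name rec_mii and
its parameters are illustrative only and are not taken from GCC's modulo
scheduler.

```python
from math import ceil

def rec_mii(latencies, distance=1):
    """Recurrence-constrained minimum II for one dependence cycle.

    latencies: modelled latency of each dependence edge in the cycle.
    distance:  iteration distance of the loop-carried dependence
               (assumed to be 1 here).
    """
    return ceil(sum(latencies) / distance)

# Four chained accumulations, each modelled with a 9-cycle latency,
# as in the current pipeline description.
print(rec_mii([9] * 4))

# Hypothetical: the same chain with a shorter effective latency per edge.
# The value 1 is a placeholder, not a figure from the A9 TRM.
placeholder_latency = 1
print(rec_mii([placeholder_latency] * 4))
```

With the 9-cycle latency this reproduces the rec_mii of 36 reported in
the dump, so any bypass that lowers the modelled latency of the
accumulator dependency would lower the bound proportionally.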
>
> ;; A multiply with a single-register result or an MLA, followed by an
> ;; MLA with an accumulator dependency, has its result forwarded so two
> ;; such instructions can issue back-to-back.
> (define_bypass 1 "cortex_a8_mul,cortex_a8_mla,cortex_a8_smulwy"
>                "cortex_a8_mla"
>                "arm_mac_accumulator_is_mul_result")
>
But that models only the scalar bypasses for the A8, i.e. back-to-back
issue of a multiply followed by an MLA. The A9 descriptions should
handle this with appropriate issue restrictions.
> I'm not sure from the A9 description whether "following" means
> "immediately following", or whether gaps between instructions are
> allowed (and, in the latter case, whether the gap can be filled with
> arbitrary instructions, or whether restrictions apply, such as
> "anything but another NEON multiplication"). Ramana, do you know?
I don't know the answer to that specific question and will have to try
a few experiments.
>
> Anyway, I think this explains why the non-SMS loop executes more
> quickly than GCC expects, and why the SMS loop is slower than it
> needs to be. It might be worth comparing the two loops with
> -mtune=cortex-a8.
>
> Richard
>