[RFC] NEON vs. ARM register selection
ams at codesourcery.com
Fri Mar 2 12:29:05 UTC 2012
As you know, the compiler currently has difficulties choosing between
whether to do an operation in NEON or not.
As I see it there are three problems:
1. Simply, is it profitable?
NEON can do many DImode operations in one or two instructions
where 2 to 10 normal ARM/Thumb instructions would be required
(not to mention the added register pressure), but there is a
cost associated with moving the inputs to NEON, and the results
If the data can stay in NEON for more than one operation,
then that's even better.
If the data must be loaded from memory, and the result stored back
to memory, then it's only a question of whether the register space
is available, or not.
Currently these decisions are made in the IRA/reload passes.
2. Values that originate in hard-registers stay there.
This applies to function parameters, mostly, but also in general
where the result of an operation is allocated first.
If there is no instruction that can use the value there then the
value is 'reloaded' to a more suitable register. If there is any
alternative that avoids the move then the register allocator will
use it, regardless of the relatives costs of the other
This problem is reduced where an operation and move can happen in
one instruction, but NEON instructions do not do this much. We can
write insns that appear to do it, but these output multiple
instructions (see my recent core-SI=>NEON-DI extend patch).
3. It all happens too late.
The decision whether to use NEON or not is not made until register
allocation time. Naturally this means that most of the optimization
passes are already completed.
Part of the problem is that the operation almost certainly needs
splitting (into whatever form was chosen) and this might not be
straight forward, post-reload. (However, the split1 pass is
already quite late, so perhaps this isn't such a big deal.)
Another part of the problem is that passes such as the two
lower-subreg passes make assumptions about the register width which
are not accurate if the operation is to end up in NEON.
There are other, lesser problems, such as it being hard to adjust the
costs for different cores (A8 in particular) and the cost of generating
an immediate constant can't be known until it's known what instructions
will be used to generate it.
These problems are not specific to NEON, of course. I believe IWMMXT
suffers from the same issues. Likewise the C6X port, and also the i386
MMX to some degree. Anything that has instructions that only operate on
a subset of registers, basically.
So, Bernd has suggested an outline of a solution. I've quizzed him on
this, added a few of my own ideas, and probably a good selection of
misunderstandings, bad assumptions, and general cock ups, and come up
with something I can write here for comment. I can post something to
upstream later if it doesn't get totally shot down now.
The basic idea is that we add a new RTL optimization pass (or two) that
assesses the usage of pseudo registers, and makes recommendations about
what register class each should end up in, if there's a choice. These
recommendations would then be used by later passes to get a better use
of NEON. I might call this the "prealloc" pass, or something.
Firstly, for each pseudo-register in a function, the pass would look at
the insn constraints for each "def" and "use", and see how the registers
relate to one another. This might determine things like "if rN is in
class A, then rM must be also in class A".
E.g. if you have two registers with constraints like this:
.. (and 'r' and 'w' do not overlap) then you know that there is a choice
between one mode or another, whereas this:
.. would impose no restrictions and we can carry on as normal.
Having done that we'd end up with sets of pseudo-registers that must
make a decision one way or the other, and we'd know where the operations
are that would force a move from one class to the other.
There's a fair amount of handwavium in there at present, because I've
not worked out what to do with overlapping register classes (think
VFP_LO_REGS) and all the other complications.
Secondly, the pass would consider the costs of each alternative, and
store a recommended register class for each pseudo-register in a table
somewhere. It would also create new pseudos and insert extra move
instructions at the register file boundaries where an existing register
would have had split recommendations (this would solve problem 2 above).
Again, there's handwavium in "consider the costs". This isn't too hard
for size-optimization (assuming the "length" attributes on the insn is
correct), but more difficult for speed optimization. Factors to include
would be the move costs (here the A8 issues would be addresses) and the
relative speeds of the operations in both alternatives. Also, the
various possible transition points between the two modes might need some
Thirdly, the subsequent passes would need to be modified, as would some
of the back-end bits and bobs.
1. Lower-subreg would need to detect 'word_mode' based on the
recommended register class, not the global value.
2. The many split patterns in the machine description could be adjusted
so that, instead of simply conditionalizing on "reload_completed", they
split at split1 if that's the best option. (Maybe it would be profitable
to insert a new, earlier split pass specifically for this case to take
advantage of the likes of combine? I mean, ideally this decision would
have been made at expand time, if it could have been?) It might be
useful to *not* split too soon, in some cases, so that the register
allocator can still make the final decision based on register pressure,
and whatever other factors it uses. Of course, the existing late-split
option would need to be retained in case the prealloc pass is disabled,
in any case.
3. Various passes would have to be taught not to remove seemingly
superfluous register moves where they actually move between register
4. Pretty much nothing would need doing to register allocation! The
extra moves should make allocation a register pressure management issue,
rather than a question of making it work. DImode operations preallocated
to core-registers may already have been lowered, one way or the other
(by split1) so there's no decision left there, and if no lowering was
necessary then that option ought to be obviously cheaper. If it insists
on making contrary decisions then it can be taught to use the
recommendation as a hint, perhaps? In specific problem cases it would
also be possible to use instruction attributes to disable (or strongly
discourage) certain alternatives based on the recommended class.
5. The existing 'onlya8'/'nota8' nonsense can be removed.
6. The register move cost can be set correctly for each core.
7. If a constant is destined for a NEON register, most likely,
arm_gen_constant can use the NEON immediate rules to determine the cost.
There's clearly a lot of thought that needs to go into the
pseudo-register scan and decision making logic, but the whole thing
doesn't look like it'll boil down to very much code in the end.
There's also the question of where to put the pass? Too early and you'd
need to put a second one in to reassess the much changed RTL later, and
too late and lower-subreg won't be able to use it.
It's possible that it might be better to treat it more like the
data-flow analysis where it's not actually a stand-alone pass, but
rather a tool other passes can use? That might depend how
computationally expensive it is.
Any thoughts anyone? Might something like this actually work? Would it
be worth spending the time on this?
More information about the linaro-toolchain