GCC Optimization Brain Storming Session


Andrew Stubbs

26 Nov 2010

11:11 a.m.

Hi All,

As we discussed on Monday, I think it might be helpful to get a number of knowledgeable people together on a call to discuss GCC optimization opportunities.

So, I'd like to get some idea of who would like to attend, and we'll try to find a slot we can all make. I'm on vacation next week, so I expect it'll be in two or three weeks' time.

Before we get there, I'd like to have a list of ideas to discuss. Partly so that we don't forget anything, and partly so that people can have a think about them before the day.

I'm really looking for bigger picture stuff, rather than individual poor code generation bugs.

So here's a few to kick off:

* Costs tuning.
  - GCC 4.6 has a new costs model, but are we making full use of it?
  - What about optimizing for size?
  - Do the optimizers take any notice? [1]

* Instruction set coverage.
  - Are there any ARM/Thumb2 instructions that we are not taking advantage of? [2]
  - Do we test that we use the instructions we do have? [3]

* Constant pools - it might be a very handy space optimization to have small functions share one constant pool, but the way the passes work one function at a time makes this hard. (LP:625233)

* NEON - There's already a lot of work going on here, and I don't want it to hog all our time, but it might be worth touching on.

What else? I'm not the most experienced person with GCC internals, and I'm relatively new to the ARM specific parts of those, so somebody else must be able to come up with something far more exciting!

So, please, get brain-storming!

Andrew

[1] We discovered recently that combine is happy to take two insns and combine them into a pattern that matches a splitter that then explodes into three insns (partly due to being no longer able to generate pseudo-registers).

[2] For example, I just wrote a patch to add addw and subw support (not yet submitted).

[3] LP:643479 is an example of a case where we don't.


Mark Mitchell

26 Nov

6:40 p.m.

On 11/26/2010 3:11 AM, Andrew Stubbs wrote:

...

As we discussed on Monday, I think it might be helpful to get a number of knowledgeable people together on a call to discuss GCC optimization opportunities.

I'm not sure I qualify as knowledgeable, and my schedule is a mess, so please don't try to schedule around me -- but please do let me know when the call will be.

For big-picture issues, I'd be interested in looking at inlining, both on its own, and in the context of profile-directed feedback. My guess is that we're not inlining enough in hot spots, and too much in cold spots. (The latter is important, in that I expect that overall system performance is impacted by larger binaries, leading to more cache misses and page faults.)

As an extension of that idea, and tying into the idea of optimizing for size, I suspect that (a) we ought to be optimizing for size automatically in cold code, and (b) we're probably not doing a great job of optimizing for size. (For example, and making the whole problem circular, I've seen cases where inlining would definitely reduce code size -- but we don't do it. I suspect that cases where many of the arguments to a relatively small function are constants should be considered for inlining even when optimizing for size; you're often going to simplify away the entire function. Perhaps we need to actually be able to try inlining, and back out if it is unprofitable.)
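To make the constant-argument case concrete, here is a minimal sketch. The names `scale_clamp` and `brightness_default` are invented for illustration and do not come from any code discussed in this thread. When every argument at a call site is a constant, inlining lets the compiler fold the whole body into a single constant, so the inlined form can be smaller than the call sequence it replaces:

```c
#include <stdint.h>

/* Hypothetical small helper: scale a value and clamp it to [lo, hi]. */
static int32_t scale_clamp(int32_t x, int32_t scale, int32_t lo, int32_t hi)
{
    int32_t v = x * scale;
    if (v < lo)
        return lo;
    if (v > hi)
        return hi;
    return v;
}

/* All four arguments are constants. If scale_clamp is inlined here, the
   multiply and both comparisons can be folded at compile time and the
   body reduces to "return 255". Without inlining, the caller has to set
   up four constant arguments and make a call. */
int32_t brightness_default(void)
{
    return scale_clamp(100, 3, 0, 255);
}
```

Whether a given GCC version actually makes this choice at -Os is the open question raised above; the sketch only shows why the inlined form can win on size.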

Both of these things require some significant infrastructure work. They aren't ARM-specific back-end changes. But, I suspect that they're important in terms of allowing GCC to take full advantage of ARM CPUs.

-- Mark Mitchell CodeSourcery mark@codesourcery.com (650) 331-3385 x713

Christian Robottom Reis

8:29 p.m.

On Fri, Nov 26, 2010 at 10:40:34AM -0800, Mark Mitchell wrote:

...

As an extension of that idea, and tying into the idea of optimizing for size, I suspect that (a) we ought to be optimizing for size automatically in cold code, and (b) we're probably not doing a great job of optimizing for size.

Somewhat OT, and I'm not sure this has been mentioned before, though I guess Mark has seen it:

http://embed.cs.utah.edu/embarrassing/

This group basically runs a set of size benchmarks against multiple compilers. I wonder how a) this would look for a survey of ARM compilers and b) if we could get them to run a benchmark for us if we provided them with hardware (and if not, if they'd provide us with the code).

Note also that the same group has a tool for generating random C code with volatiles which we could use for some level of validation:

http://www.cs.utah.edu/~eeide/emsoft08/

I've inlined below a sample program generated with randprog (you'll want to apt-get install libboost-program-options1.40-dev to build it):

----------------------------------------------------------------------
/*
 * This is a RANDOMLY GENERATED PROGRAM.
 *
 * Generator: randprog 1.0.0
 * Options:   (none)
 * Seed:      3881625688
 */

#include <stdint.h>
uint16_t context = 0;

#if defined(__AVR_ARCH__)
#  include "platform_avr.h"
#elif defined (__MSP430__)
#  include "platform_msp430.h"
#else
#  include "platform_generic.h"
#endif

#include "random_runtime.h"

/* --- GLOBAL VARIABLES --- */
uint8_t g_4 = -4L;

/* --- FORWARD DECLARATIONS --- */
int32_t func_1(void);

/* --- FUNCTIONS --- */
/* ------------------------------------------ */
/*
 * reads : g_4
 * writes:
 */
int32_t func_1(void)
{
    {
        int8_t l_2 = 0L;
        for (l_2 = -1; (l_2 == 0); l_2 -= 0)
        {
            uint16_t l_3 = 1L;
            return l_3;
        }
        return g_4;
    }
}

/* ---------------------------------------- */
int main(int argc, char *argv[])
{
    platform_main_begin();
    /* Call the first function */
    func_1();
    crcBytes(g_4);
    platform_main_end(context);
    return 0;
}

-- Christian Robottom Reis | [+55] 16 9112 6430 | http://launchpad.net/~kiko Linaro Engineering VP | [ +1] 612 216 4935 | http://async.com.br/~kiko

Michael Hope

29 Nov

12:18 a.m.

On Sat, Nov 27, 2010 at 12:11 AM, Andrew Stubbs ams@codesourcery.com wrote:

...

Hi All,

As we discussed on Monday, I think it might be helpful to get a number of knowledgeable people together on a call to discuss GCC optimization opportunities.

So, I'd like to get some idea of who would like to attend, and we'll try to find a slot we can all make. I'm on vacation next week, so I expect it'll be in two or three weeks' time.

To add to the mix:

Some ideas that are logged as blueprints:
 * Using ARMv5 saturated instructions (https://blueprints.launchpad.net/gcc-linaro/+spec/armv5-saturated-ops)
 * Using ARMv6 SIMD instructions (https://blueprints.launchpad.net/gcc-linaro/+spec/armv6-simd)
 * Using ARMv7 unaligned accesses (https://blueprints.launchpad.net/gcc-linaro/+spec/unaligned-accesses)
 * Changing the built-in memcpy to use unaligned (https://blueprints.launchpad.net/gcc-linaro/+spec/unaligned-memcpy)

The following areas have been suggested. I don't know if they're still valid:

Register allocator: The register allocator is designed around the needs of architectures with a low register count and restrictive register classes. The ARM architecture has many general purpose registers. Different assumptions may give better code.

Conditional instructions: The ARM and, to a lesser extent, Thumb-2 ISAs allow conditional execution of instructions. This can be used in many situations to eliminate an expensive branch. The middle end expands and transforms branches. The ARM backend tries to recombine the RTL back into conditional instructions, but often can't due to the middle end transforms.

-- Michael

David Gilbert

9:32 a.m.

On 29 November 2010 00:18, Michael Hope michael.hope@linaro.org wrote:

...

To add to the mix:

Some ideas that are logged as blueprints:
 * Using ARMv5 saturated instructions (https://blueprints.launchpad.net/gcc-linaro/+spec/armv5-saturated-ops)
 * Using ARMv6 SIMD instructions (https://blueprints.launchpad.net/gcc-linaro/+spec/armv6-simd)

Those are quite nice instructions; certainly they seem useful for string ops of various types if misused creatively.
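For reference, this is roughly what a saturating add looks like when written in portable C. It is a sketch only: the function name `sat_add32` is made up, and whether GCC maps this particular form onto a saturating instruction is exactly what the blueprint would need to establish. On cores with the ARMv5TE DSP extension, the operation corresponds to what a single QADD computes.

```c
#include <stdint.h>

/* Portable signed saturating add: the result is clamped to the int32_t
   range instead of wrapping around on overflow. */
int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;   /* cannot overflow in 64 bits */

    if (sum > INT32_MAX)
        return INT32_MAX;
    if (sum < INT32_MIN)
        return INT32_MIN;
    return (int32_t)sum;
}
```

For example, `sat_add32(INT32_MAX, 1)` yields `INT32_MAX` rather than wrapping to a negative value.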

...

 * Using ARMv7 unaligned accesses (https://blueprints.launchpad.net/gcc-linaro/+spec/unaligned-accesses)
 * Changing the built-in memcpy to use unaligned (https://blueprints.launchpad.net/gcc-linaro/+spec/unaligned-memcpy)

The interesting challenge here is figuring out how expensive unaligned accesses are, and whether the cost trade-offs are the same on different chips.
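A small sketch of the source pattern in question (the helper name `load_u32_unaligned` is an assumption for illustration). Using memcpy is the portable way to read a word from an address with no alignment guarantee. A compiler targeting a core with hardware unaligned support is free to turn it into a single word load, while on other cores it has to fall back to narrower loads and shifts, which is where the cost question comes in:

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from an address that may not be 4-byte aligned.
   The fixed-size memcpy leaves the choice of instructions (one word
   load, or several narrower loads) to the compiler and target. */
uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```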

The following areas have been suggested. I don't know if they're still valid:

...

Register allocator: The register allocator is designed around the needs of architectures with a low register count and restrictive register classes. The ARM architecture has many general purpose registers. Different assumptions may give better code.

Conditional instructions: The ARM and, to a lesser extent, Thumb-2 ISAs allow conditional execution of instructions. This can be used in many situations to eliminate an expensive branch. The middle end expands and transforms branches. The ARM backend tries to recombine the RTL back into conditional instructions, but often can't due to the middle end transforms.

GCC is quite creative in avoiding branches by doing lots of masking and logic; it'll be interesting to see how much there is to gain here.
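As a rough illustration of the two styles being compared (hand-written C, not GCC output, with invented function names): the first form is the natural candidate for a compare followed by a conditionally executed move, and the second is the kind of mask-and-logic sequence that avoids the branch without conditional execution.

```c
#include <stdint.h>

/* Straightforward form: a compare and a select. On ARM this can become
   a CMP followed by a conditionally executed MOV, with no branch. */
int32_t max_select(int32_t a, int32_t b)
{
    return (a > b) ? a : b;
}

/* Branch-free form using masking: mask is all ones when a > b and zero
   otherwise, so exactly one of the two operands survives the AND. */
int32_t max_mask(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a > b);
    return (a & mask) | (b & ~mask);
}
```

Both functions return the same value for all inputs; which one produces better code on a given core is the thing worth measuring.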

Dave

Andrew Stubbs

9 Dec

2:42 p.m.

On 26/11/10 11:11, Andrew Stubbs wrote:

...

As we discussed on Monday, I think it might be helpful to get a number of knowledgeable people together on a call to discuss GCC optimization opportunities.

So, I'd like to get some idea of who would like to attend, and we'll try to find a slot we can all make. I'm on vacation next week, so I expect it'll be in two or three weeks' time.

I've now collected together all the ideas people sent to me here: https://wiki.linaro.org/AndrewStubbs/Sandbox/GCCoptimizations

Suggested meeting times:

Either Wednesday 15th December at 9am UTC
Or Thursday 16th December at 6pm UTC

I've selected these times so that Michael can attend. Obviously the former is more palatable to Europeans, and the latter allows Americans to join, if we have any interested (Mark?).

If I don't get any replies voting for 6pm, I'll default to the 9am slot.

I've checked the times don't clash with anything else in the Linaro Events calendar. Michael, will we be able to use your meeting number then?

There are quite a lot of topics on that list to discuss, so we'll discuss them in order of which seem most interesting to those present, and see how far we get.

If we want another session we can arrange one then. Those attending the January sprint in Dallas might also get an opportunity then.

Andrew

Mark Mitchell

3:09 p.m.

On 12/9/2010 6:42 AM, Andrew Stubbs wrote:

...

Either Wednesday 15th December at 9am UTC
Or Thursday 16th December at 6pm UTC

I've selected these times so that Michael can attend. Obviously the former is more palatable to Europeans, and the latter allows Americans to join, if we have any interested (Mark?).

I'm still interested, but I can't commit; I'm in an off-site meeting Tuesday-Friday of next week and am not sure when I'll be able to get away. So, I definitely do not want anyone to do anything inconvenient on my account; just pick the best time for you, and let me know when it is.

Thank you,

-- Mark Mitchell CodeSourcery mark@codesourcery.com (650) 331-3385 x713

Julian Brown

3:18 p.m.

On Thu, 09 Dec 2010 14:42:49 +0000 Andrew Stubbs ams@codesourcery.com wrote:

...

On 26/11/10 11:11, Andrew Stubbs wrote:

...
As we discussed on Monday, I think it might be helpful to get a number of knowledgeable people together on a call to discuss GCC optimization opportunities.

So, I'd like to get some idea of who would like to attend, and we'll try to find a slot we can all make. I'm on vacation next week, so I expect it'll be in two or three weeks' time.

I've now collected together all the ideas people sent to me here: https://wiki.linaro.org/AndrewStubbs/Sandbox/GCCoptimizations

It might be worth talking about even the oldest differences between the ARM instruction sets and "traditional RISC", e.g.:

* Load/store multiple instructions: GCC only generates these from load_multiple/store_multiple (from a couple of places where hard registers are already known, e.g. function prologues & epilogues), or using peepholes. All ARM chips support much more generality for the latter: e.g. sparse sets of registers, and transfers of more than four registers. Maybe a pass can be added to somehow take better advantage of those.

* Conditional execution: we already know improvements can probably be made here.

* Shifted operands in ALU ops: I think GCC generally does quite well at using these: maybe there are places where they can be used more effectively though. There are certainly several "special tricks" you can do with these, though GCC probably knows most of them by now.

* Conditional flag-setting: I'm not sure if there's much more to be gained from this, but it might be worth looking at also.

Hand-written assembler code is, of course, likely to use all of the above extensively...
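Two of the items above can be illustrated with short C fragments. This is a sketch with invented names (`copy6`, `word_address`), showing source shapes where hand-written ARM code would typically reach for these features. It makes no claim about what any particular GCC version emits for them.

```c
#include <stdint.h>

/* Load/store multiple: six consecutive word loads followed by six
   consecutive word stores. The restrict qualifiers say the buffers do
   not overlap, so all loads may be grouped ahead of the stores. A
   hand-written version would use one LDM and one STM, moving more than
   four registers per instruction. */
void copy6(uint32_t *restrict dst, const uint32_t *restrict src)
{
    uint32_t a = src[0], b = src[1], c = src[2];
    uint32_t d = src[3], e = src[4], f = src[5];

    dst[0] = a; dst[1] = b; dst[2] = c;
    dst[3] = d; dst[4] = e; dst[5] = f;
}

/* Shifted operand: base + (index << 2) fits a single ARM ADD whose
   second operand is shifted (add r0, r0, r1, lsl #2), so no separate
   shift instruction is required. */
uint32_t word_address(uint32_t base, uint32_t index)
{
    return base + (index << 2);
}
```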

Julian

Ramana Radhakrishnan

9:42 p.m.

Sent from my iPad

On 9 Dec 2010, at 15:19, "Julian Brown" julian@codesourcery.com wrote:

...

On Thu, 09 Dec 2010 14:42:49 +0000 Andrew Stubbs ams@codesourcery.com wrote:

...
On 26/11/10 11:11, Andrew Stubbs wrote:

...
As we discussed on Monday, I think it might be helpful to get a number of knowledgeable people together on a call to discuss GCC optimization opportunities.

So, I'd like to get some idea of who would like to attend, and we'll try to find a slot we can all make. I'm on vacation next week, so I expect it'll be in two or three weeks' time.

I've now collected together all the ideas people sent to me here: https://wiki.linaro.org/AndrewStubbs/Sandbox/GCCoptimizations

It might be worth talking about even the oldest differences between the ARM instruction sets and "traditional RISC", e.g.:

Load/store multiple instructions: GCC only generates these from load_multiple/store_multiple (from a couple of places where hard registers are already known, e.g. function prologues & epilogues), or using peepholes. All ARM chips support much more generality for the latter: e.g. sparse sets of registers, and transfers of more than four registers. Maybe a pass can be added to somehow take better advantage of those.

Conditional execution: we already know improvements can probably be made here.

Shifted operands in ALU ops: I think GCC generally does quite well at using these: maybe there are places where they can be used more effectively though. There are certainly several "special tricks" you can do with these, though GCC probably knows most of them by now.

Conditional flag-setting: I'm not sure if there's much more to be gained from this, but it might be worth looking at also.

Hand-written assembler code is, of course, likely to use all of the above extensively...

Julian


Ramana Radhakrishnan

9:51 p.m.

Hit send accidentally on my last mail. Sorry

On 9 Dec 2010, at 15:19, "Julian Brown" julian@codesourcery.com wrote:

...

On Thu, 09 Dec 2010 14:42:49 +0000 Andrew Stubbs ams@codesourcery.com wrote:

...
On 26/11/10 11:11, Andrew Stubbs wrote:

...
As we discussed on Monday, I think it might be helpful to get a number of knowledgeable people together on a call to discuss GCC optimization opportunities.

So, I'd like to get some idea of who would like to attend, and we'll try to find a slot we can all make. I'm on vacation next week, so I expect it'll be in two or three weeks' time.

I've now collected together all the ideas people sent to me here: https://wiki.linaro.org/AndrewStubbs/Sandbox/GCCoptimizations

It might be worth talking about even the oldest differences between the ARM instruction sets and "traditional RISC", e.g.:

Load/store multiple instructions: GCC only generates these from load_multiple/store_multiple (from a couple of places where hard registers are already known, e.g. function prologues & epilogues), or using peepholes. All ARM chips support much more generality for the latter: e.g. sparse sets of registers, and transfers of more than four registers. Maybe a pass can be added to somehow take better advantage of those.

Also keep in mind the constraint that increasing register numbers get increasing addresses: either this pass has to run after register allocation, or we need to teach GCC to assign registers according to these constraints.

It might also have some effect on the generated schedules, since we might now change the order in which values are produced.
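The ordering constraint can be written down as a small check. This is a toy model, not GCC code: the `word_load` struct and `ldm_compatible` function are hypothetical, and the model ignores other restrictions such as base-register writeback or special registers. It only captures the rule that, within one load-multiple, the lowest-numbered register receives the word at the lowest address.

```c
#include <stdbool.h>
#include <stddef.h>

/* One word-sized load: destination register number and byte offset
   from a common base register. Hypothetical type for illustration. */
struct word_load {
    int reg;
    int offset;
};

/* Returns true if the loads, given in order of ascending offset, could
   be expressed as one load-multiple: offsets must be consecutive words
   and register numbers must be strictly increasing. Fewer than two
   loads leaves nothing to merge, so that case returns false. */
bool ldm_compatible(const struct word_load *loads, size_t n)
{
    if (n < 2)
        return false;

    for (size_t i = 1; i < n; i++) {
        if (loads[i].offset != loads[i - 1].offset + 4)
            return false;
        if (loads[i].reg <= loads[i - 1].reg)
            return false;
    }
    return true;
}
```

Run before register allocation, a check like this has no register numbers to inspect, which is why the pass either has to run afterwards or the allocator has to be steered toward ascending assignments.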

Ramana

...

Conditional execution: we already know improvements can probably be made here.

Shifted operands in ALU ops: I think GCC generally does quite well at using these: maybe there are places where they can be used more effectively though. There are certainly several "special tricks" you can do with these, though GCC probably knows most of them by now.

Conditional flag-setting: I'm not sure if there's much more to be gained from this, but it might be worth looking at also.

Hand-written assembler code is, of course, likely to use all of the above extensively...

Julian


Peter Maydell

11:56 p.m.

On 9 December 2010 21:51, Ramana Radhakrishnan ramrad01@arm.com wrote:

...

On 9 Dec 2010, at 15:19, "Julian Brown" julian@codesourcery.com wrote:

...
* Load/store multiple instructions: GCC only generates these from load_multiple/store_multiple (from a couple of places where hard registers are already known, e.g. function prologues & epilogues), or using peepholes. All ARM chips support much more generality for the latter: e.g. sparse sets of registers, and transfers of more than four registers. Maybe a pass can be added to somehow take better advantage of those.

Also keep in mind the constraint that increasing register numbers get increasing addresses: either this pass has to run after register allocation, or we need to teach GCC to assign registers according to these constraints.

There was an example that came up on irc the other day of a bit of inline asm that wanted to use ldm/stm and so had to explicitly assign all its inputs/outputs to fixed registers (which in turn meant it didn't work when compiled for thumb because it was using r7 which was the frame pointer). Presumably if gcc could assign registers with that sort of constraint it would make this kind of asm a bit cleaner too...

-- PMM
