Effect of SMS register move scheduling

Richard Sandiford richard.sandiford at linaro.org
Thu Aug 25 12:41:20 UTC 2011


Richard Sandiford <richard.sandiford at linaro.org> writes:
> Revital Eres <revital.eres at linaro.org> writes:
>> btw, do you also have numbers of how much SMS (hopefully) improves
>> performance on top of the vectorized code?
>
> OK, here's a comparison of:
>
>     -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
>     -fno-auto-inc-dec
>
> vs:
>
>     -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
>     -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec

Revital pointed out that I'd forgotten to list:

    -O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize

for both cases, which does make quite a big difference :-)

I looked at the mjpegenc regression, and the register pressure looks OK.
I think it maxes out at around 20 vector double registers if you just
consider the loop body.  So I think this is actually a regalloc failure
rather than an SMS one per se.

Using -fira-algorithm=priority removes all but one spill from the loop.
I ran another test comparing:

   -O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
   -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
   -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec

with:

   -O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
   -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
   -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
   -fira-algorithm=priority

(soon this lot won't fit in my emacs window).  I've attached the
results below; "before" is the first set of flags and "after" adds
-fira-algorithm=priority.  In both cases, the compiler was current trunk
with my move-scheduling patch applied.
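
For reference, the two configurations above differ only in the final
flag.  A minimal Python sketch that assembles both command lines is
below.  The flag lists are copied from this message; the compiler name
"gcc", the file name "loop.c" and the helper command_line() are
placeholders for illustration, not what was used for these runs.

```python
import shlex

# Flags common to both runs, copied from the lists above.
COMMON_FLAGS = [
    "-O2", "-ffast-math", "-funsafe-loop-optimizations", "-ftree-vectorize",
    "-mcpu=cortex-a8", "-mfpu=neon", "-mfloat-abi=softfp",
    "-mvectorize-with-neon-quad",
    "-fmodulo-sched", "-fmodulo-sched-allow-regmoves", "-fno-auto-inc-dec",
]

# The only flag that differs between the two runs.
EXTRA_FLAGS = ["-fira-algorithm=priority"]

baseline_flags = COMMON_FLAGS
priority_flags = COMMON_FLAGS + EXTRA_FLAGS


def command_line(cc, source, flags):
    """Return a shell-quoted compile command; cc and source are placeholders."""
    return shlex.join([cc, *flags, "-c", source])


if __name__ == "__main__":
    # "gcc" and "loop.c" are hypothetical; substitute the real cross compiler.
    print(command_line("gcc", "loop.c", baseline_flags))
    print(command_line("gcc", "loop.c", priority_flags))
```

Keeping the shared flags in one list makes it harder to change one
configuration and forget the other.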

I haven't rerun an SMS-vs-non-SMS test, but based on previous results,
mjpegenc and aacsbr-2 become faster with SMS than without.

This doesn't hide the fact that SMS doesn't take register pressure
into account.  But if I haven't completely miscalculated (and I might
have) it seems that even if SMS did have some pressure-tracking
capability, it probably wouldn't have triggered for mjpegenc,
at least not unless it was very conservative.

Richard


a3dec
  before:  500000 runs take 4.61386s
  after:   500000 runs take 4.57584s
  speedup: x1.01
aacsbr-1
  before:  5000000 runs take 4.37384s
  after:   5000000 runs take 4.3739s
  speedup: x1
aacsbr-2
  before:  5000000 runs take 3.09015s
  after:   5000000 runs take 2.30728s
  speedup: x1.34
aacsbr-3
  before:  4000000 runs take 5.63489s
  after:   4000000 runs take 5.63391s
  speedup: x1
aes
  before:  500000 runs take 16.9729s
  after:   500000 runs take 16.9731s
  speedup: x1
avs
  before:  1000000 runs take 2.23682s
  after:   1000000 runs take 2.31372s
  speedup: x0.967
cdgraphics
  before:  1000000 runs take 2.40585s
  after:   1000000 runs take 2.39774s
  speedup: x1
dwt
  before:  2000000 runs take 9.10098s
  after:   2000000 runs take 9.10086s
  speedup: x1
dxa
  before:  2000000 runs take 4.40613s
  after:   2000000 runs take 4.40619s
  speedup: x1
mjpegenc
  before:  500000 runs take 7.31085s
  after:   500000 runs take 3.04492s
  speedup: x2.4
qtrle
  before:  1000000 runs take 4.54471s
  after:   1000000 runs take 4.51578s
  speedup: x1.01
resample
  before:  1000000 runs take 1.91022s
  after:   1000000 runs take 1.92822s
  speedup: x0.991
rgb2rgb-rgb24tobgr16
  before:  1000000 runs take 1.15643s
  after:   1000000 runs take 1.15585s
  speedup: x1
rgb2rgb-rgb24tobgr32
  before:  2000000 runs take 4.5513s
  after:   2000000 runs take 4.5513s
  speedup: x1
rgb2rgb-rgb32tobgr24
  before:  2000000 runs take 3.59665s
  after:   2000000 runs take 3.59671s
  speedup: x1
rgb2rgb-shuffle-bytes
  before:  500000 runs take 2.24115s
  after:   500000 runs take 2.23947s
  speedup: x1
rgb2rgb-yuy2toyv12
  before:  500000 runs take 4.64447s
  after:   500000 runs take 4.51465s
  speedup: x1.03
rgb2rgb-yv12touyvy
  before:  1500000 runs take 3.49857s
  after:   1500000 runs take 4.60797s
  speedup: x0.759
twinvq
  before:  500000 runs take 0.452393s
  after:   500000 runs take 0.4505s
  speedup: x1
wmavoice
  before:  500000 runs take 0.865448s
  after:   500000 runs take 0.868072s
  speedup: x0.997
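
The speedup figures above are the "before" time divided by the "after"
time, so values below 1 are slowdowns.  A minimal Python sketch of that
calculation follows.  Rounding to three significant digits is an
assumption, chosen because it reproduces the listed values; the script
that generated the listing is not shown in this message, and the
function name speedup() is hypothetical.

```python
def speedup(before_s, after_s):
    """Return before/after as 'x<ratio>', using three significant digits."""
    ratio = before_s / after_s
    # ".3g" drops trailing zeros, so a ratio of 1.00 is printed as "1".
    return "x" + format(ratio, ".3g")


# A few (before, after) timings in seconds, copied from the listing above.
RESULTS = {
    "aacsbr-1": (4.37384, 4.3739),
    "aacsbr-2": (3.09015, 2.30728),
    "mjpegenc": (7.31085, 3.04492),
    "rgb2rgb-yv12touyvy": (3.49857, 4.60797),
}

if __name__ == "__main__":
    for name, (before_s, after_s) in RESULTS.items():
        print(name, speedup(before_s, after_s))
```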
