Effect of SMS register move scheduling
Richard Sandiford
richard.sandiford at linaro.org
Thu Aug 25 12:41:20 UTC 2011
Richard Sandiford <richard.sandiford at linaro.org> writes:
> Revital Eres <revital.eres at linaro.org> writes:
>> btw, do you also have numbers of how much SMS (hopefully) improves
>> performance on top of the vectorized code?
>
> OK, here's a comparison of:
>
> -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
> -fno-auto-inc-dec
>
> vs:
>
> -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
> -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec

Revital pointed out that I'd forgotten to list:

  -O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize

for both cases, which does make quite a big difference :-)

I looked at the mjpegenc regression, and the register pressure looks OK.
I think it maxes out at around 20 vector double registers (of the 32
that NEON provides) if you just consider the loop body. So I think this
is actually a regalloc failure rather than an SMS one per se.
-fira-algorithm=priority removes all but one spill from the loop.

I ran another test comparing:

  -O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
  -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
  -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec

with:

  -O2 -ffast-math -funsafe-loop-optimizations -ftree-vectorize
  -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
  -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
  -fira-algorithm=priority

(soon this lot won't fit in my emacs window). I've attached the
results below. In both cases, the compiler was current trunk with my
move-scheduling patch applied.

I haven't rerun an SMS-vs-non-SMS test, but based on previous results,
mjpegenc and aacsbr-2 become faster with SMS than without.
This doesn't hide the fact that SMS doesn't take register pressure
into account. But if I haven't completely miscalculated (and I might
have) it seems that even if SMS did have some pressure-tracking
capability, it probably wouldn't have triggered for mjpegenc,
at least not unless it was very conservative.
Richard
a3dec
before: 500000 runs take 4.61386s
after: 500000 runs take 4.57584s
speedup: x1.01
aacsbr-1
before: 5000000 runs take 4.37384s
after: 5000000 runs take 4.3739s
speedup: x1
aacsbr-2
before: 5000000 runs take 3.09015s
after: 5000000 runs take 2.30728s
speedup: x1.34
aacsbr-3
before: 4000000 runs take 5.63489s
after: 4000000 runs take 5.63391s
speedup: x1
aes
before: 500000 runs take 16.9729s
after: 500000 runs take 16.9731s
speedup: x1
avs
before: 1000000 runs take 2.23682s
after: 1000000 runs take 2.31372s
speedup: x0.967
cdgraphics
before: 1000000 runs take 2.40585s
after: 1000000 runs take 2.39774s
speedup: x1
dwt
before: 2000000 runs take 9.10098s
after: 2000000 runs take 9.10086s
speedup: x1
dxa
before: 2000000 runs take 4.40613s
after: 2000000 runs take 4.40619s
speedup: x1
mjpegenc
before: 500000 runs take 7.31085s
after: 500000 runs take 3.04492s
speedup: x2.4
qtrle
before: 1000000 runs take 4.54471s
after: 1000000 runs take 4.51578s
speedup: x1.01
resample
before: 1000000 runs take 1.91022s
after: 1000000 runs take 1.92822s
speedup: x0.991
rgb2rgb-rgb24tobgr16
before: 1000000 runs take 1.15643s
after: 1000000 runs take 1.15585s
speedup: x1
rgb2rgb-rgb24tobgr32
before: 2000000 runs take 4.5513s
after: 2000000 runs take 4.5513s
speedup: x1
rgb2rgb-rgb32tobgr24
before: 2000000 runs take 3.59665s
after: 2000000 runs take 3.59671s
speedup: x1
rgb2rgb-shuffle-bytes
before: 500000 runs take 2.24115s
after: 500000 runs take 2.23947s
speedup: x1
rgb2rgb-yuy2toyv12
before: 500000 runs take 4.64447s
after: 500000 runs take 4.51465s
speedup: x1.03
rgb2rgb-yv12touyvy
before: 1500000 runs take 3.49857s
after: 1500000 runs take 4.60797s
speedup: x0.759
twinvq
before: 500000 runs take 0.452393s
after: 500000 runs take 0.4505s
speedup: x1
wmavoice
before: 500000 runs take 0.865448s
after: 500000 runs take 0.868072s
speedup: x0.997
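
For anyone who wants to recompute the speedup column from the raw timings, here is a minimal Python sketch. It assumes the speedup is the "before" time divided by the "after" time, which is consistent with the figures listed above, and that each entry follows the name/before/after/speedup layout shown. The function name `speedups` and the `sample` string are illustrative only and are not part of the benchmark harness.

```python
import re

# Matches lines such as "before: 500000 runs take 7.31085s".
TIMING = re.compile(r"^(before|after):\s+\d+ runs take ([0-9.]+)s$")


def speedups(text):
    """Return {benchmark: before_seconds / after_seconds} parsed from a report."""
    results = {}
    name = None
    times = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = TIMING.match(line)
        if match:
            times[match.group(1)] = float(match.group(2))
            if name is not None and "before" in times and "after" in times:
                # Assumption: speedup = before / after, so values above 1
                # mean the "after" build is faster.
                results[name] = times["before"] / times["after"]
        elif line and not line.startswith("speedup:"):
            # Any other non-empty line is treated as a benchmark name.
            name = line
            times = {}
    return results


sample = """\
mjpegenc
before: 500000 runs take 7.31085s
after: 500000 runs take 3.04492s
speedup: x2.4
rgb2rgb-yv12touyvy
before: 1500000 runs take 3.49857s
after: 1500000 runs take 4.60797s
speedup: x0.759
"""

for bench, ratio in speedups(sample).items():
    print(f"{bench}: x{ratio:.3g}")
```

Feeding the whole attachment through `speedups` and sorting by ratio makes it easy to pick out the entries that got slower, such as rgb2rgb-yv12touyvy.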