Effect of SMS register move scheduling
Revital Eres
revital.eres at linaro.org
Thu Aug 25 07:18:28 UTC 2011
Hi Richard,
> The effect on my flawed libav microbenchmarks was much greater
> than I imagined. I used the options:
Yeah, that indeed looks impressive!
btw, do you also have numbers on how much SMS (hopefully) improves
performance on top of the vectorized code?
Thanks,
Revital
> -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
> -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
>
> The "before" code was from trunk, the "after" code was trunk + the
> register scheduling patch alone (not the IV patch). Only the tests
> that have different "before" and "after" code are run. The results were:
>
> a3dec
> before: 500000 runs take 4.68384s
> after: 500000 runs take 4.61395s
> speedup: x1.02
> aes
> before: 500000 runs take 20.0523s
> after: 500000 runs take 16.9722s
> speedup: x1.18
> avs
> before: 1000000 runs take 15.4698s
> after: 1000000 runs take 2.23676s
> speedup: x6.92
> dxa
> before: 2000000 runs take 18.5848s
> after: 2000000 runs take 4.40607s
> speedup: x4.22
> mjpegenc
> before: 500000 runs take 28.6987s
> after: 500000 runs take 7.31342s
> speedup: x3.92
> resample
> before: 1000000 runs take 10.418s
> after: 1000000 runs take 1.91016s
> speedup: x5.45
> rgb2rgb-rgb24tobgr16
> before: 1000000 runs take 1.60513s
> after: 1000000 runs take 1.15643s
> speedup: x1.39
> rgb2rgb-yv12touyvy
> before: 1500000 runs take 3.50122s
> after: 1500000 runs take 3.49887s
> speedup: x1
> twinvq
> before: 500000 runs take 0.452423s
> after: 500000 runs take 0.452454s
> speedup: x1
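
For anyone who wants to re-check the speedup column, here is a minimal Python sketch that recomputes it as before time divided by after time. The timings are copied from the results quoted above; the names `results` and `speedup` are just illustrative, not part of any benchmark harness.

```python
# (benchmark, runs, seconds before, seconds after), copied from the quoted results
results = [
    ("a3dec", 500000, 4.68384, 4.61395),
    ("aes", 500000, 20.0523, 16.9722),
    ("avs", 1000000, 15.4698, 2.23676),
    ("dxa", 2000000, 18.5848, 4.40607),
    ("mjpegenc", 500000, 28.6987, 7.31342),
    ("resample", 1000000, 10.418, 1.91016),
    ("rgb2rgb-rgb24tobgr16", 1000000, 1.60513, 1.15643),
    ("rgb2rgb-yv12touyvy", 1500000, 3.50122, 3.49887),
    ("twinvq", 500000, 0.452423, 0.452454),
]


def speedup(before, after):
    """Speedup factor: how many times faster the 'after' build runs."""
    return before / after


for name, runs, before, after in results:
    # The run count is the same before and after, so it cancels out of the ratio.
    print(f"{name}: x{speedup(before, after):.2f}")
```

Since each test uses the same number of runs before and after, the ratio of total times is also the ratio of per-run times.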
>
> Taking resample as an example: before the patch we had an ii of 27,
> stage count of 6, and 12 vector moves. Vector moves can't be dual
> issued, and there was only one free slot, so even in theory, this loop
> takes 27 + 12 - 1 = 38 cycles. Unfortunately, there were so many new
> registers that we spilled quite a few.
>
> After the patch we have an ii of 28, a stage count of 3, and no moves,
> so in theory, one iteration should take 28 cycles. We also don't spill.
> So I think the difference really is genuine. (The large difference
> in moves between ii=27 and ii=28 is because in the ii=27 schedule,
> a lot of A--(T,N,0)-->B (intra-cycle true) dependencies were scheduled
> with time(B) == time(A) + ii + 1.)
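
The per-iteration estimates in the two paragraphs above can be written out as a small Python sketch. It assumes the simple model described there: each vector move that cannot be placed in a free issue slot costs one extra cycle on top of the ii. The function name `theoretical_cycles` and its parameters are hypothetical, and the model ignores spills and everything else outside the kernel.

```python
def theoretical_cycles(ii, vector_moves=0, free_slots=0):
    """Estimated cycles per kernel iteration under the simple model above.

    Moves that do not fit into a free issue slot each add one cycle
    (assumption: vector moves cannot be dual issued).
    """
    extra = max(vector_moves - free_slots, 0)
    return ii + extra


# Before the patch: ii of 27, 12 vector moves, one free slot.
before = theoretical_cycles(ii=27, vector_moves=12, free_slots=1)
# After the patch: ii of 28, no moves.
after = theoretical_cycles(ii=28)

print(before, after, before / after)
```

Under this model the schedule with the larger ii is still the cheaper one (28 cycles versus 38). The estimate only covers the moves; the measured gain on resample is far larger, and the quoted text also notes that the "before" code spilled while the "after" code does not.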
>
> I also saw benefits in one test in a "real" benchmark, which I can't
> post here.
>
> Richard