Effect of SMS register move scheduling

Revital Eres revital.eres at linaro.org
Thu Aug 25 07:18:28 UTC 2011


Hi Richard,

> The effect on my flawed libav microbenchmarks was much greater
> than I imagined.  I used the options:

Yeah, that indeed looks impressive!

Btw, do you also have numbers for how much SMS (hopefully) improves
performance on top of the vectorized code?

Thanks,
Revital

>    -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -mvectorize-with-neon-quad
>    -fmodulo-sched -fmodulo-sched-allow-regmoves -fno-auto-inc-dec
>
> The "before" code was from trunk, the "after" code was trunk + the
> register scheduling patch alone (not the IV patch).  Only the tests
> that have different "before" and "after" code are run.  The results were:
>
> a3dec
>  before:  500000 runs take 4.68384s
>  after:   500000 runs take 4.61395s
>  speedup: x1.02
> aes
>  before:  500000 runs take 20.0523s
>  after:   500000 runs take 16.9722s
>  speedup: x1.18
> avs
>  before:  1000000 runs take 15.4698s
>  after:   1000000 runs take 2.23676s
>  speedup: x6.92
> dxa
>  before:  2000000 runs take 18.5848s
>  after:   2000000 runs take 4.40607s
>  speedup: x4.22
> mjpegenc
>  before:  500000 runs take 28.6987s
>  after:   500000 runs take 7.31342s
>  speedup: x3.92
> resample
>  before:  1000000 runs take 10.418s
>  after:   1000000 runs take 1.91016s
>  speedup: x5.45
> rgb2rgb-rgb24tobgr16
>  before:  1000000 runs take 1.60513s
>  after:   1000000 runs take 1.15643s
>  speedup: x1.39
> rgb2rgb-yv12touyvy
>  before:  1500000 runs take 3.50122s
>  after:   1500000 runs take 3.49887s
>  speedup: x1
> twinvq
>  before:  500000 runs take 0.452423s
>  after:   500000 runs take 0.452454s
>  speedup: x1
>
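
As a quick cross-check of the speedup column above, here is a minimal
Python sketch that recomputes the ratio for three of the quoted tests.
The timings are copied from the results; the names `timings` and
`speedup` are only illustrative and not part of any benchmark harness.

```python
# Timings in seconds, copied from the quoted results: (before, after).
timings = {
    "resample": (10.418, 1.91016),
    "avs": (15.4698, 2.23676),
    "twinvq": (0.452423, 0.452454),
}


def speedup(before, after):
    """Ratio of run times; a value above 1 means 'after' is faster."""
    return before / after


for name, (before, after) in timings.items():
    print(f"{name}: x{speedup(before, after):.2f}")
```

The ratios agree with the quoted figures once rounded to two decimals.
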
> Taking resample as an example: before the patch we had an ii of 27,
> stage count of 6, and 12 vector moves.  Vector moves can't be dual
> issued, and there was only one free slot, so even in theory, this loop
> takes 27 + 12 - 1 = 38 cycles.  Unfortunately, there were so many new
> registers that we spilled quite a few.
>
> After the patch we have an ii of 28, a stage count of 3, and no moves,
> so in theory, one iteration should take 28 cycles.  We also don't spill.
> So I think the difference really is genuine.  (The large difference
> in moves between ii=27 and ii=28 is because in the ii=27 schedule,
> a lot of A--(T,N,0)-->B (intra-cycle true) dependencies were scheduled
> with time(B) == time(A) + ii + 1.)
>
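
The cycle estimate above can be restated as a small Python sketch. The
helper `cycles_per_iteration` is hypothetical and only encodes the
reasoning in the quoted text (each move that does not fit in a free
issue slot costs one extra cycle); it is not how GCC itself models the
schedule.

```python
def cycles_per_iteration(ii, moves, free_slots=0):
    """Illustrative estimate: ii plus one cycle per move that finds no free slot.

    free_slots defaults to 0; with no moves its value does not matter.
    """
    return ii + max(moves - free_slots, 0)


# Numbers taken from the resample example in the quoted text.
before = cycles_per_iteration(ii=27, moves=12, free_slots=1)
after = cycles_per_iteration(ii=28, moves=0)

print(before, after)                      # 38 28
print(f"x{before / after:.2f} in theory")
```

Under these assumptions the moves alone would account for only a modest
theoretical gain, which fits the remark that the "before" code also
spilled quite a few registers.
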
> I also saw benefits in one test in a "real" benchmark, which I can't
> post here.
>
> Richard