Basic libav profiling
richard.sandiford at linaro.org
Thu Aug 18 08:21:50 UTC 2011
Michael Hope <michael.hope at linaro.org> writes:
> On Tue, Aug 16, 2011 at 11:32 PM, Richard Sandiford
> <richard.sandiford at linaro.org> wrote:
>> Michael Hope <michael.hope at linaro.org> writes:
>>> I put a build harness around libav and gathered some profiling data. See:
>>> bzr branch lp:~linaro-toolchain-dev/+junk/libav-suite
>>> It includes a Makefile that builds a C only, h.264 only decoder and
>>> two Creative Commons licensed videos to use as input.
>> Thanks for putting this together.
>>> README.rst has the basic commands for running ffmpeg and initial perf
>>> results showing the hot functions. Dave, 20 % of the time is spent in
>>> memcpy() so you might want to have a look.
>>> The vectoriser has no effect. GCC 4.5 is ~17 % faster than 4.6. I'll
>>> look into extracting and harnessing the functions themselves later
>>> this week.
>> I had a look why auto-vectorisation wasn't having much effect.
>> It looks from your profile that most of the hot functions are
>> operating on 16x16 blocks of pixels with an unknown line stride.
>> So the C code looks like:
>>   for (i = 0; i < 16; i++)
>>     {
>>       x[0] = OP (x[0]);
>>       ...
>>       x[15] = OP (x[15]);
>>       x += stride;
>>     }
>> Because of the unknown stride, we're relying on SLP rather than
>> loop-based vectorisation to handle this kind of loop. The problem
>> is that SLP is being run _as_ a loop optimisation. At the moment,
>> the gimple data-ref analysis code assumes that, during a loop
>> optimisation, only simple induction variables are of interest,
>> so it treats all of the x[...] references above as unrepresentable.
>> If I move SLP outside the loop optimisations (just as a proof of concept),
>> then that problem goes away.
>> I talked about this with Ira, who said that SLP had been placed
>> where it is because ivopts (a later loop optimisation) obfuscates
>> things too much. As Ira said, we should probably look at (conditionally)
>> removing the assumption that only IVs are of interest during loop
>> optimisations.
>> Another problem is that SLP supports a much smaller range of
>> optimisations than the loop-based vectoriser. There's no support
>> for promotion, demotion, or conditional expressions. This affects
>> things like the weight_h264_pixels* functions, which contain
>> conditional moves.
> I had a poke about. GCC isn't too happy about unrolled loops either.
Right. Sorry, I should have been clearer, but this hand-unrolling was
the trigger for this loop being SLP's job, rather than the normal loop
vectoriser's. So the loop above was exactly the kind of loop you
describe (OP was the same for each x[...]).
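
To make the shape concrete, here is a minimal C sketch of the two forms being compared: the hand-unrolled row (the x[...] form, which falls to SLP) and the equivalent rolled inner loop (which the loop vectoriser would normally look at). The OP macro body and the function names op_block_unrolled and op_block_rolled are made up for illustration and are not taken from libav.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-pixel operation standing in for OP; the real
   functions do something more interesting.  */
#define OP(v) ((uint8_t) ((v) + 1))

/* Hand-unrolled form: one statement per pixel of a 16-pixel row,
   then a step to the next row by a stride only known at run time.  */
void
op_block_unrolled (uint8_t *x, ptrdiff_t stride)
{
  int i;

  for (i = 0; i < 16; i++)
    {
      x[0] = OP (x[0]);   x[1] = OP (x[1]);
      x[2] = OP (x[2]);   x[3] = OP (x[3]);
      x[4] = OP (x[4]);   x[5] = OP (x[5]);
      x[6] = OP (x[6]);   x[7] = OP (x[7]);
      x[8] = OP (x[8]);   x[9] = OP (x[9]);
      x[10] = OP (x[10]); x[11] = OP (x[11]);
      x[12] = OP (x[12]); x[13] = OP (x[13]);
      x[14] = OP (x[14]); x[15] = OP (x[15]);
      x += stride;
    }
}

/* Rolled form: the same work with an explicit inner loop over the row.  */
void
op_block_rolled (uint8_t *x, ptrdiff_t stride)
{
  int i, j;

  for (i = 0; i < 16; i++)
    {
      for (j = 0; j < 16; j++)
        x[j] = OP (x[j]);
      x += stride;
    }
}
```

Both functions compute the same result. The only difference is whether the 16 per-pixel statements appear as straight-line code or as an inner loop.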
SLP should still (in theory) be able to optimise the loop body as
straight-line code. The problem is that it doesn't yet support the
same range of operations.
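
As a rough illustration of the operations in question, the sketch below combines all three in one straight-line row body: promotion (uint8_t to int), a conditional expression (the clamp), and demotion (int back to uint8_t). It is only loosely modelled on a weighting function and is not libav's weight_h264_pixels* code; the names clip_u8 and weight_rows_sketch, the parameter list, and the 4-pixel row width are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp an int to 0..255 using conditional expressions, then demote
   the result to uint8_t.  */
static inline uint8_t
clip_u8 (int v)
{
  return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t) v);
}

/* Hypothetical weighting of 4-pixel rows.  Each pixel is promoted to
   int, scaled, shifted, offset, clamped and stored back as uint8_t.
   WEIGHT is assumed to be non-negative so that the right shift is
   always applied to a non-negative value.  */
void
weight_rows_sketch (uint8_t *x, ptrdiff_t stride, int height,
                    int weight, int shift, int offset)
{
  int i;

  for (i = 0; i < height; i++)
    {
      x[0] = clip_u8 (((x[0] * weight) >> shift) + offset);
      x[1] = clip_u8 (((x[1] * weight) >> shift) + offset);
      x[2] = clip_u8 (((x[2] * weight) >> shift) + offset);
      x[3] = clip_u8 (((x[3] * weight) >> shift) + offset);
      x += stride;
    }
}
```

Each statement in the row body is isomorphic to its neighbours, which is the pattern SLP looks for, but every one of them also needs the widening, the conditional and the narrowing.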