Basic libav profiling
Ira Rosen
ira.rosen at linaro.org
Thu Aug 18 05:56:18 UTC 2011
On 18 August 2011 02:43, Michael Hope <michael.hope at linaro.org> wrote:
> On Thu, Aug 18, 2011 at 11:11 AM, Michael Hope <michael.hope at linaro.org> wrote:
>> On Tue, Aug 16, 2011 at 11:32 PM, Richard Sandiford
>> <richard.sandiford at linaro.org> wrote:
>>> Michael Hope <michael.hope at linaro.org> writes:
>>>> I put a build harness around libav and gathered some profiling data. See:
>>>> bzr branch lp:~linaro-toolchain-dev/+junk/libav-suite
>>>>
>>>> It includes a Makefile that builds a C only, h.264 only decoder and
>>>> two Creative Commons licensed videos to use as input.
>>>
>>> Thanks for putting this together.
>>>
>>>> README.rst has the basic commands for running ffmpeg and initial perf
>>>> results showing the hot functions. Dave, 20 % of the time is spent in
>>>> memcpy() so you might want to have a look.
>>>>
>>>> The vectoriser has no effect. GCC 4.5 is ~17 % faster than 4.6. I'll
>>>> look into extracting and harnessing the functions themselves later
>>>> this week.
>>>
>>> I had a look at why auto-vectorisation wasn't having much effect.
>>> It looks from your profile that most of the hot functions are
>>> operating on 16x16 blocks of pixels with an unknown line stride.
>>> So the C code looks like:
>>>
>>> for (i = 0; i < 16; i++)
>>> {
>>> x[0] = OP (x[0]);
>>> ...
>>> x[15] = OP (x[15]);
>>> x += stride;
>>> }
>>>
>>> Because of the unknown stride, we're relying on SLP rather than
>>> loop-based vectorisation to handle this kind of loop. The problem
>>> is that SLP is being run _as_ a loop optimisation. At the moment,
>>> the gimple data-ref analysis code assumes that, during a loop
>>> optimisation, only simple induction variables are of interest,
>>> so it treats all of the x[...] references above as unrepresentable.
>>> If I move SLP outside the loop optimisations (just as a proof of concept),
>>> then that problem goes away.
>>>
>>> I talked about this with Ira, who said that SLP had been placed
>>> where it is because ivopts (a later loop optimisation) obfuscates
>>> things too much. As Ira said, we should probably look at (conditionally)
>>> removing the assumption that only IVs are of interest during loop
>>> optimisations.
>>>
>>> Another problem is that SLP supports a much smaller range of
>>> optimisations than the loop-based vectoriser. There's no support
>>> for promotion, demotion, or conditional expressions. This affects
>>> things like the weight_h264_pixels* functions, which contain
>>> conditional moves.
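To make the loop shape above concrete, here is a minimal sketch. OP is a hypothetical stand-in (a right shift by one), the row is cut from sixteen pixels to four for brevity, and both function names are made up for illustration; none of it is taken from libav. The first function writes the row out by hand, which is the straight-line form SLP has to handle; the second does the same work with an inner loop, for comparison.

```c
#include <stdint.h>

/* Hypothetical stand-in for OP: halve the pixel value. */
#define OP(v) ((uint8_t)((v) >> 1))

/* 16 rows, 4 pixels per row written out by hand.
 * The distance between rows (stride) is only known at run time. */
void block_rows(uint8_t *x, int stride)
{
    for (int i = 0; i < 16; i++) {
        x[0] = OP(x[0]);
        x[1] = OP(x[1]);
        x[2] = OP(x[2]);
        x[3] = OP(x[3]);
        x += stride;
    }
}

/* Same computation with the row expressed as an inner loop. */
void block_rows_inner_loop(uint8_t *x, int stride)
{
    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 4; j++)
            x[j] = OP(x[j]);
        x += stride;
    }
}
```

Both functions touch only the first four bytes of each row, so a buffer of 16 * stride bytes with stride >= 4 is enough to try them.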
>>
>> I had a poke about. GCC isn't too happy about unrolled loops either.
>> put_h264_chroma_mc8_8_c() is defined via a macro in dsputil_template.c
>> and is manually unwound by eight as:
>>
>> for(i=0; i<h; i++){\
>>     OP(dst[0], (A*src[0] + B*src[1] + C*src[stride+0] + D*src[stride+1]));\
>>     OP(dst[1], (A*src[1] + B*src[2] + C*src[stride+1] + D*src[stride+2]));\
>>     OP(dst[2], (A*src[2] + B*src[3] + C*src[stride+2] + D*src[stride+3]));\
>>     OP(dst[3], (A*src[3] + B*src[4] + C*src[stride+3] + D*src[stride+4]));\
>>     OP(dst[4], (A*src[4] + B*src[5] + C*src[stride+4] + D*src[stride+5]));\
>>     OP(dst[5], (A*src[5] + B*src[6] + C*src[stride+5] + D*src[stride+6]));\
>>     OP(dst[6], (A*src[6] + B*src[7] + C*src[stride+6] + D*src[stride+7]));\
>>     OP(dst[7], (A*src[7] + B*src[8] + C*src[stride+7] + D*src[stride+8]));\
>>     dst+= stride;\
>>     src+= stride;\
>> }\
>>
>> where OP is an assignment.
>>
>> Reducing this to:
>>
>> #define A 3
>> #define B 4
>>
>> void unrolled(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
>> {
>> h /= 8;
>> for (int i = 0; i < h; i++) {
>> dst[0] = A*src[0] + B*src[0+1];
>> dst[1] = A*src[1] + B*src[1+1];
>> dst[2] = A*src[2] + B*src[2+1];
>> dst[3] = A*src[3] + B*src[3+1];
>> dst[4] = A*src[4] + B*src[4+1];
>> dst[5] = A*src[5] + B*src[5+1];
>> dst[6] = A*src[6] + B*src[6+1];
>> dst[7] = A*src[7] + B*src[7+1];
>> dst += 8;
>> src += 8;
>> }
>> }
>>
>> void plain(uint8_t * __restrict dst, uint8_t * __restrict src, int h)
>> {
>> for (int i = 0; i < h; i++) {
>> dst[i] = A*src[i] + B*src[i+1];
>> }
>> }
>>
>> plain() gets vectorised where unrolled() doesn't.
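One thing to keep in mind when comparing the two variants: unrolled() writes 8 * (h / 8) bytes while plain() writes h bytes, so they only produce the same output when h is a multiple of 8. The sketch below restates both functions in self-contained form and adds a small check; variants_agree(), the 64-byte limit and the test pattern are assumptions made up for this illustration and are not part of the harness.

```c
#include <stdint.h>
#include <string.h>

#define A 3
#define B 4

/* Manually unwound by eight, as in the reduced test case. */
static void unrolled(uint8_t * __restrict dst, const uint8_t * __restrict src, int h)
{
    h /= 8;
    for (int i = 0; i < h; i++) {
        dst[0] = A*src[0] + B*src[1];
        dst[1] = A*src[1] + B*src[2];
        dst[2] = A*src[2] + B*src[3];
        dst[3] = A*src[3] + B*src[4];
        dst[4] = A*src[4] + B*src[5];
        dst[5] = A*src[5] + B*src[6];
        dst[6] = A*src[6] + B*src[7];
        dst[7] = A*src[7] + B*src[8];
        dst += 8;
        src += 8;
    }
}

/* One element per iteration. */
static void plain(uint8_t * __restrict dst, const uint8_t * __restrict src, int h)
{
    for (int i = 0; i < h; i++)
        dst[i] = A*src[i] + B*src[i+1];
}

/* Hypothetical helper: returns 1 if both variants produce identical
 * output for n bytes.  n must be a multiple of 8 and at most 64;
 * any other n is rejected with 0. */
int variants_agree(int n)
{
    uint8_t src[65], d1[64], d2[64];   /* src needs n + 1 readable bytes */

    if (n < 0 || n > 64 || n % 8 != 0)
        return 0;
    for (int i = 0; i < 65; i++)
        src[i] = (uint8_t)(i * 7 + 1);
    memset(d1, 0, sizeof d1);
    memset(d2, 0, sizeof d2);
    unrolled(d1, src, n);
    plain(d2, src, n);
    return memcmp(d1, d2, sizeof d1) == 0;
}
```

This only checks that the two forms compute the same thing; whether each one is vectorised still has to be read from the compiler's vectoriser dump or the generated assembly.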
>
> How can I tell the vectoriser that an input is a multiple of something?
Unfortunately, I don't think you can.
> For example, this code:
>
> struct image
> {
> uint8_t d[4096];
> } __attribute__((aligned(128)));
>
> void fixed(struct image * __restrict dst, struct image * __restrict src, int h)
> {
> for (int i = 0; i < 16; i++) {
> dst->d[i] = A*src->d[i] + B*src->d[i+1];
> }
> }
>
> is lovely with no peeling or argument checking.
>
> I'd like to do a specialisation of a function where I assert that the
> height is a multiple of 16 without unrolling the loop myself.
> Something like:
>
> void multiple(struct image * __restrict dst, struct image * __restrict src, int h)
> {
>     h &= ~15;
>
>     for (int i = 0; i < h; i++) {
>         dst->d[i] = A*src->d[i] + B*src->d[i+1];
>     }
> }
>
> The inner loop looks good but it still includes a prologue that tests
> for h < vector size and an epilogue that handles any remaining bytes.
> The epilogue is only a code size problem as it's normally skipped.
> Still, the skipping requires a branch...
Yes, that would be a nice feature, although I think such hints are rare.
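For reference, one way such an assumption can at least be written down in the source is GCC's __builtin_unreachable(), as in the sketch below. The function name multiple_hinted() is made up for this illustration, and it is an assumption, untested here, that any particular GCC release will use the hint to drop the prologue or epilogue; the generated assembly has to be checked. Note that calling the function with an h that breaks the promise is undefined behaviour.

```c
#include <stdint.h>

#define A 3
#define B 4

struct image
{
    uint8_t d[4096];
} __attribute__((aligned(128)));

/* Hypothetical variant of multiple(): instead of masking h, promise the
 * compiler that h is a positive multiple of 16.  h must also be at most
 * 4080 so that src->d[i+1] stays inside the array. */
void multiple_hinted(struct image * __restrict dst,
                     const struct image * __restrict src, int h)
{
    /* Any call that reaches this branch has undefined behaviour, so the
     * compiler is allowed to assume the condition is false. */
    if (h <= 0 || (h & 15) != 0)
        __builtin_unreachable();

    for (int i = 0; i < h; i++) {
        dst->d[i] = A*src->d[i] + B*src->d[i+1];
    }
}
```

Unlike the h &= ~15 version, this does not round h down at run time, so the caller is responsible for passing a valid height.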
Ira
>
> -- Michael