Jan Seiffert has been looking into vectorising the adler32 routine in zlib. On the A9 there's a 3.0x improvement to be had on blocks that fit in the L1 cache and a 2.1x improvement for larger blocks.
See: http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-May/002556.html
for the work in progress.
-- Michael
---------- Forwarded message ---------- From: Michael Hope michael.hope@linaro.org Date: Mon, May 2, 2011 at 10:35 AM Subject: Re: [Zlib-devel] [3/8][RFC V3 Patch] Add special ARM Adler32 version To: zlib-devel@madler.net
On Mon, May 2, 2011 at 9:48 AM, Jan Seiffert kaffeemonster@googlemail.com wrote:
2011/4/24 Jan Seiffert kaffeemonster@googlemail.com:
This adds a NEON version, an iWMMXt version for the Intel (now Marvell) StrongARM, and an ARMv6 DSP-instruction version of Adler32.
Thanks again to Edwin Török, the NEON and ARMv6 DSP versions are now tested and fixed.
The good news is NEON: an i.MX515 @ 800 MHz (armv7l) with NEON

-------- orig ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 4010 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 2990 ms
a: 0x733CB174, 10000 * 159998 bytes t: 4060 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 4050 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 4060 ms
a: 0x1902A382, 10000 * 159984 bytes t: 4060 ms
-------- vec ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 1450 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 1450 ms
a: 0x733CB174, 10000 * 159998 bytes t: 1460 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 1450 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 1460 ms
a: 0x1902A382, 10000 * 159984 bytes t: 1450 ms
speedup: 2.765517
Hi Jan. I see similar numbers. I wasn't sure what you were using to benchmark this so I wrote my own little stub that did the seed=0x0CB4B676 version over data from rand(). I ran each test five times and picked the lowest user time as the best. All were built with gcc-linaro-4.5-2011.04 with -mfpu=neon -mtune=cortex-a9. The results were:
Cortex-A9 @ 1 GHz:
Plain C: 4.094 s
ARMv6:   4.578 s
NEON:    1.985 s

Cortex-A8 @ 720 MHz:
Plain C: 4.164 s
NEON:    1.819 s
NEON:    1.570 s (with -mtune=cortex-a8)
It's interesting how the slower A8 does better than the A9. It's probably due to the A8 having a wider connection to the L2 cache: running the same test on 16 k of data, so that it fits in the L1 cache, gives:
Cortex-A8: 5.234 s
Cortex-A9: 3.969 s
The ratio here is 0.760, which is very similar to the 0.72 ratio between the clock frequencies.
-- Michael
2011/5/2 Michael Hope michael.hope@linaro.org: [snip]
Hi Jan.
Hi Michael, Linaro Devs
I see similar numbers.
Great to hear ;) It means I'm not totally on the wrong track.
I wasn't sure what you were using to benchmark this
A little program which contains the different code versions and a test loop. As written there, "10000 * 160000 bytes" means 10000 calls with a 160000-byte buffer. The different lines come from tests with buffer + offset, len - offset, to test different alignments.
The buffer is filled with 0xff (the worst input for adler32, because it may overflow the internal sums earlier than other input, so all internal loop limits must be sized for 0xff). The time is measured with times().
Oh, and if it's not clear, this is only the adler32 speedup, because only adler32 is run in a loop.
so I wrote my own little stub that did the seed=0x0CB4B676 version over data from rand().
Yeah, that's also a valid test. Maybe you want to call srand(0) or something to get a repeatable result.
I ran each test five times and picked the lowest user time as the best. All were built with gcc-linaro-4.5-2011.04 with -mfpu=neon -mtune=cortex-a9. The results were:
Cortex-A9 @ 1 GHz:
Plain C: 4.094 s
ARMv6:   4.578 s
NEON:    1.985 s
What's the exact make and model of your cortex-a9?
Cortex-A8 @ 720 MHz:
Plain C: 4.164 s
NEON:    1.819 s
NEON:    1.570 s (with -mtune=cortex-a8)
It's interesting how the slower A8 does better than the A9. It's probably due to the A8 having a wider connection to the L2 cache: running the same test on 16 k of data, so that it fits in the L1 cache, gives:
Cortex-A8: 5.234 s
Cortex-A9: 3.969 s
The ratio here is 0.760, which is very similar to the 0.72 ratio between the clock frequencies.
Yepp, the cache connection is important. Most vector units are, at least for this task, very fast and only constrained by internal brain damage or by cache/memory. Look at the Altivec numbers: http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-April/002544.htm... As long as the data fits into the cache it's a 6.6x speedup; after that, 1.3x.
-- Michael
Greetings Jan
PS: Michael, may I forward your numbers to the zlib mailing list?
On Mon, May 2, 2011 at 11:13 AM, Jan Seiffert kaffeemonster@googlemail.com wrote:
2011/5/2 Michael Hope michael.hope@linaro.org: Hi Michael, Linaro Devs
I see similar numbers.
Great to hear ;) It means I'm not totally on the wrong track.
Note that I've sent the results to zlib-dev. The copy to linaro-dev was forwarded on rather than cross-posting.
I wasn't sure what you were using to benchmark this
A little program which contains the different code versions and a test loop. As written there, "10000 * 160000 bytes" means 10000 calls with a 160000-byte buffer. The different lines come from tests with buffer + offset, len - offset, to test different alignments.
The buffer is filled with 0xff (the worst input for adler32, because it may overflow the internal sums earlier than other input, so all internal loop limits must be sized for 0xff). The time is measured with times().
Oh, and if it's not clear, this is only the adler32 speedup, because only adler32 is run in a loop.
Any idea on how much time in a zlib decompress is spent in adler32?
so I wrote my own little stub that did the seed=0x0CB4B676 version over data from rand().
Yeah, that's also a valid test. Maybe you want to call srand(0) or something to get a repeatable result.
Yip, I had srand(1234) so the results should be repeatable.
It's interesting how the slower A8 does better than the A9. It's probably due to the A8 having a wider connection to the L2 cache: running the same test on 16 k of data, so that it fits in the L1 cache, gives:
Cortex-A8: 5.234 s
Cortex-A9: 3.969 s
The ratio here is 0.760, which is very similar to the 0.72 ratio between the clock frequencies.
Yepp, the cache connection is important. Most vector units are, at least for this task, very fast and only constrained by internal brain damage or by cache/memory. Look at the Altivec numbers: http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-April/002544.htm... As long as the data fits into the cache it's a 6.6x speedup; after that, 1.3x.
Makes sense. The A8 has a direct connection to the 256 k of L2 cache, so the 160 k x 10,000 test runs as fast as the 16 k x 100,000 test.
-- Michael
2011/5/2 Michael Hope michael.hope@linaro.org:
On Mon, May 2, 2011 at 11:13 AM, Jan Seiffert kaffeemonster@googlemail.com wrote:
2011/5/2 Michael Hope michael.hope@linaro.org: Hi Michael, Linaro Devs
I see similar numbers.
Great to hear ;) It means I'm not totally on the wrong track.
Note that I've sent the results to zlib-dev. The copy to linaro-dev was forwarded on rather than cross-posting.
Yep, it came over zlib-dev a little later.
[snip]
Oh, and if it's not clear, this is only the adler32 speedup, because only adler32 is run in a loop.
Any idea on how much time in a zlib decompress is spent in adler32?
Heavily depends on the data, or, more correctly, on the compression ratio: adler32 is calculated over the uncompressed data. So if you have data that compresses well, it will be fast to decompress (mostly inflate_fast doing memsets), and adler32 becomes a greater part of the whole (assuming that compression is always slow).

That at least was my use case. I get sparse bitfields compressed with deflate sent over the network. Compressed they are 300 to 30000 bytes (more towards the lower numbers, because they are mostly sparsely populated); uncompressed they are 128k. My adler32 part was >= 33% of the whole zlib time.

Last year in April, Stefan Fuhrmann also proposed x86 vector code (http://mail.madler.net/pipermail/zlib-devel_madler.net/2010-April/001938.htm...); he said he was doing stuff with svn (so compressed text?). He saw a total adler32 time of ~15%; by making adler32 3 times as fast he could lower that to ~5%.
So no, this code will not magically make your compression/decompression n times faster, but every bit helps, and I mean, if it comes for free...
[snip]
-- Michael
Greetings Jan