Jan Seiffert has been looking into vectorising the adler32 routine in zlib. On the Cortex-A9 there's a 3.0x improvement to be had on blocks that fit in the L1 cache and a 2.1x improvement for larger blocks.
See: http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-May/002556.html
for the work in progress.
-- Michael
---------- Forwarded message ----------
From: Michael Hope michael.hope@linaro.org
Date: Mon, May 2, 2011 at 10:35 AM
Subject: Re: [Zlib-devel] [3/8][RFC V3 Patch] Add special ARM Adler32 version
To: zlib-devel@madler.net
On Mon, May 2, 2011 at 9:48 AM, Jan Seiffert kaffeemonster@googlemail.com wrote:
2011/4/24 Jan Seiffert kaffeemonster@googlemail.com:
This adds a NEON version, an iWMMXt version for Intel (now Marvell) StrongARM, and a version of Adler32 for ARMv6 DSP instructions.
Thanks again to Edwin Török, the NEON and ARMv6 DSP versions are now tested and fixed.
The good news is NEON:
an i.MX515@800MHz (arm7l) with NEON
-------- orig ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 4010 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 2990 ms
a: 0x733CB174, 10000 * 159998 bytes t: 4060 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 4050 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 4060 ms
a: 0x1902A382, 10000 * 159984 bytes t: 4060 ms
-------- vec ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 1450 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 1450 ms
a: 0x733CB174, 10000 * 159998 bytes t: 1460 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 1450 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 1460 ms
a: 0x1902A382, 10000 * 159984 bytes t: 1450 ms
speedup: 2.765517
Hi Jan. I see similar numbers. I wasn't sure what you were using to benchmark this so I wrote my own little stub that did the seed=0x0CB4B676 version over data from rand(). I ran each test five times and picked the lowest user time as the best. All were built with gcc-linaro-4.5-2011.04 with -mfpu=neon -mtune=cortex-a9. The results were:
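For anyone who wants to reproduce this, a stub along these lines should do. It is only a rough sketch and not the code I actually ran: adler32_plain is a textbook scalar Adler-32 rather than zlib's own routine, bench_adler32 is a made-up helper name, and chaining the checksum from one pass into the next is an assumption made here so the compiler cannot drop the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define ADLER_BASE 65521u /* largest prime below 2^16 */
#define ADLER_NMAX 5552u  /* most bytes that can be summed before a reduction */

/* Plain C Adler-32: s1 is the running byte sum, s2 the sum of the s1 values. */
static uint32_t adler32_plain(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;

    while (len > 0) {
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n--) {
            s1 += *buf++;
            s2 += s1;
        }
        /* Reduce once per chunk so the 32-bit sums never overflow. */
        s1 %= ADLER_BASE;
        s2 %= ADLER_BASE;
    }
    return (s2 << 16) | s1;
}

/* Hypothetical harness: checksum len bytes from rand() iters times,
 * store the CPU seconds used in *secs and return the final checksum. */
static uint32_t bench_adler32(size_t len, unsigned iters, double *secs)
{
    unsigned char *buf;
    uint32_t a = 1; /* Adler-32 starts from 1 */
    clock_t t0;
    size_t i;
    unsigned k;

    assert(secs != NULL);
    buf = malloc(len ? len : 1);
    if (buf == NULL) {
        *secs = -1.0; /* signal allocation failure */
        return 0;
    }
    srand(1); /* fixed seed so every run sees the same data */
    for (i = 0; i < len; i++)
        buf[i] = (unsigned char)(rand() & 0xff);

    t0 = clock();
    for (k = 0; k < iters; k++)
        a = adler32_plain(a, buf, len); /* each pass depends on the previous one */
    *secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    free(buf);
    return a;
}
```

Calling bench_adler32(160000, 10000, &secs) matches the block size and repeat count in Jan's output, and a length of 16384 gives the case that fits in the L1 cache. The checksums will not match Jan's, since the input data differs.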
Cortex-A9 @ 1 GHz:
  Plain C: 4.094 s
  ARMv6:   4.578 s
  NEON:    1.985 s

Cortex-A8 @ 720 MHz:
  Plain C: 4.164 s
  NEON:    1.819 s
  NEON:    1.570 s (with -mtune=cortex-a8)
It's interesting that the slower A8 does better than the A9. That's probably because the A8 has wider access to the L2 cache: running the same test on 16 k of data, so that it fits in the L1 cache, gives:
  Cortex-A8: 5.234 s
  Cortex-A9: 3.969 s
The ratio here (A9 time over A8 time) is about 0.76, which is close to the 0.72 ratio between the clock frequencies (720 MHz to 1 GHz).
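The ratios can be rechecked from the timings quoted in this message with a few lines of C. The helper names ratio and print_ratios are only for illustration, and the numbers are copied straight from the results above.

```c
#include <assert.h>
#include <stdio.h>

/* Divide one measurement by another; the caller decides which way round. */
static double ratio(double numerator, double denominator)
{
    assert(denominator > 0.0);
    return numerator / denominator;
}

/* Print the comparisons discussed in this message. */
static void print_ratios(void)
{
    /* Large blocks: plain C time over NEON time. */
    printf("A9 NEON speedup:                    %.2f\n", ratio(4.094, 1.985));
    printf("A8 NEON speedup (-mtune=cortex-a8): %.2f\n", ratio(4.164, 1.570));
    /* 16 k blocks that fit in L1: A9 time over A8 time. */
    printf("A9/A8 time, L1-resident:            %.3f\n", ratio(3.969, 5.234));
    /* Clock frequencies: 720 MHz over 1 GHz. */
    printf("A8/A9 clock:                        %.3f\n", ratio(720.0, 1000.0));
}
```

With the L1-resident data the time ratio and the clock ratio land within a few percent of each other, which is what you would expect if both cores do about the same work per cycle once the L2 is out of the picture.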
-- Michael