2011/5/2 Michael Hope michael.hope@linaro.org: [snip]
---------- Forwarded message ----------
From: Michael Hope michael.hope@linaro.org
Date: Mon, May 2, 2011 at 10:35 AM
Subject: Re: [Zlib-devel] [3/8][RFC V3 Patch] Add special ARM Adler32 version
To: zlib-devel@madler.net
On Mon, May 2, 2011 at 9:48 AM, Jan Seiffert kaffeemonster@googlemail.com wrote:
2011/4/24 Jan Seiffert kaffeemonster@googlemail.com:
This adds three versions of Adler32: a NEON version, an iWMMXt version for Intel (now Marvell) StrongARM, and a version for ARMv6 DSP instructions.
Thanks again to Edwin Török, the NEON and ARMv6 DSP versions are now tested and fixed.
The good news is NEON: an i.MX515@800MHz (arm7l) with NEON
-------- orig ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 4010 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 2990 ms
a: 0x733CB174, 10000 * 159998 bytes t: 4060 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 4050 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 4060 ms
a: 0x1902A382, 10000 * 159984 bytes t: 4060 ms
-------- vec ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 1450 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 1450 ms
a: 0x733CB174, 10000 * 159998 bytes t: 1460 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 1450 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 1460 ms
a: 0x1902A382, 10000 * 159984 bytes t: 1450 ms
speedup: 2.765517
Hi Jan.
Hi Michael, Linaro Devs
I see similar numbers.
Great to hear ;) Means I'm not totally on the wrong track.
I wasn't sure what you were using to benchmark this
A little program which contains the different code versions and a test loop. As it is written there: "10000 * 160000 bytes" means 10000 calls with a 160000-byte buffer. The different lines come from tests with buffer + offset, len - offset, to exercise different alignments.
The buffer is filled with 0xff (the worst input for adler32, because it may overflow the internal sums earlier than any other input, so all internal loop limits must be sized for 0xff). The time is measured with times().
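To make the 0xff worst case concrete, here is a small sketch of the bound the reference adler32.c in zlib documents for its NMAX constant: the largest number of 0xff bytes that can be summed before the 32-bit accumulators must be reduced modulo 65521. The function name adler32_max_block is made up for this example, and the vector versions in the patch may well use different block limits for their narrower lanes.

```c
#include <stdint.h>

#define ADLER_BASE 65521u   /* largest prime below 65536 */

/*
 * Largest n such that n bytes of 0xff, added to sums that both start
 * at their largest reduced value (ADLER_BASE - 1), still fit in 32 bits:
 *
 *     255 * n * (n + 1) / 2 + (n + 1) * (ADLER_BASE - 1) <= 2^32 - 1
 *
 * adler32_max_block is a hypothetical helper name, not a zlib function.
 */
static unsigned adler32_max_block(void)
{
    unsigned n = 0;

    for (;;) {
        uint64_t next = (uint64_t)n + 1;
        /* worst-case value of the second sum after `next` bytes */
        uint64_t worst = 255u * next * (next + 1) / 2
                       + (next + 1) * (ADLER_BASE - 1u);

        if (worst > UINT32_MAX)
            return n;       /* `next` bytes would overflow, n is the limit */
        n++;
    }
}
```

The search returns 5552, which matches the NMAX value in zlib's adler32.c. Any input with smaller byte values reaches the overflow point later, which is why 0xff is the case the loops have to be sized for.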
Oh, and if it's not clear, this is only the adler32 speedup, because only adler32 is run in a loop.
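For readers who want to reproduce this kind of measurement, a minimal sketch of such a test loop follows. It is not the actual benchmark program: adler32_ref is a plain byte-wise stand-in for the version under test, the offsets 0, 1, 2, 4, 8, 16 are inferred from the buffer lengths in the output above, and chaining the checksum from call to call as well as counting only user time are assumptions of this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/times.h>
#include <unistd.h>

#define BASE 65521u

/* Plain byte-at-a-time Adler-32, a stand-in for the code under test. */
static uint32_t adler32_ref(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t a = adler & 0xffffu, b = adler >> 16;

    while (len--) {
        a = (a + *buf++) % BASE;
        b = (b + a) % BASE;
    }
    return (b << 16) | a;
}

/*
 * Checksum buf + offset, len - offset `calls` times. Returns the user
 * time in ms and stores the final checksum in *out. The checksum is
 * chained from call to call (an assumption of this sketch) so that the
 * compiler cannot drop the repeated work.
 */
static long bench_adler32(const unsigned char *buf, size_t len, size_t offset,
                          unsigned calls, uint32_t *out)
{
    struct tms start, end;
    long ticks = sysconf(_SC_CLK_TCK);
    uint32_t a = 1;
    unsigned i;

    times(&start);
    for (i = 0; i < calls; i++)
        a = adler32_ref(a, buf + offset, len - offset);
    times(&end);

    *out = a;
    return (long)((end.tms_utime - start.tms_utime) * 1000 / ticks);
}

/* Fill a buffer with 0xff and print one line per alignment offset. */
static void run_alignment_tests(size_t len, unsigned calls)
{
    static const size_t offsets[] = { 0, 1, 2, 4, 8, 16 };
    unsigned char *buf = malloc(len);
    size_t i;

    if (!buf)
        return;
    memset(buf, 0xff, len);
    for (i = 0; i < sizeof offsets / sizeof offsets[0]; i++) {
        uint32_t a;
        long ms = bench_adler32(buf, len, offsets[i], calls, &a);

        printf("a: 0x%08lX, %u * %lu bytes t: %ld ms\n", (unsigned long)a,
               calls, (unsigned long)(len - offsets[i]), ms);
    }
    free(buf);
}
```

Calling run_alignment_tests(160000, 10000) from main gives output in the same layout as above. The checksum values will not necessarily match the ones quoted, since the starting value used in the original program is not stated here.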
so I wrote my own little stub that did the seed=0x0CB4B676 version over data from rand().
Yeah, that's also a valid test. Maybe you want to srand(0) or something to get a reliable result.
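A sketch of what the srand() suggestion amounts to, with a made-up helper name fill_random: reseed before filling the buffer, so that every run checksums the same bytes and timings stay comparable from run to run.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Fill buf with pseudo-random bytes from rand(). Reseeding first makes
 * the contents identical on every run with the same C library.
 * fill_random is a hypothetical helper name for this example.
 */
static void fill_random(unsigned char *buf, size_t len, unsigned seed)
{
    size_t i;

    srand(seed);
    for (i = 0; i < len; i++)
        buf[i] = (unsigned char)(rand() & 0xff);
}
```

The sequence rand() produces for a given seed is specific to the C library, so checksums from such a buffer are only comparable between machines that use the same libc.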
I ran each test five times and picked the lowest user time as the best. All were built with gcc-linaro-4.5-2011.04 with -mfpu=neon -mtune=cortex-a9. The results were:
Cortex-A9 @ 1 GHz:
  Plain C: 4.094 s
  ARMv6: 4.578 s
  NEON: 1.985 s
What's the exact make and model of your Cortex-A9?
Cortex-A8 @ 720 MHz:
  Plain C: 4.164 s
  NEON: 1.819 s
  NEON: 1.570 s (with -mtune=cortex-a8)
It's interesting how the slower A8 does better than the A9. It's probably due to the A8 having wider access to the L2 cache, because running the same test on only 16 k of data, so that it fits in the L1 cache, gives:
Cortex-A8: 5.234 s
Cortex-A9: 3.969 s
The ratio here is 0.760 which is very similar to the ratio between the clock frequencies.
Yepp, cache connection is important. Most vector units are, at least for this task, very fast and only constrained by internal brain damage or cache/memory. Look at the Altivec numbers: http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-April/002544.htm... As long as the data fits into the cache the speedup is 6.6; after that it drops to 1.3.
-- Michael
Greetings Jan
PS: Michael, may I forward your numbers to the zlib mailing list?