Re: zlib NEON improvements

1 May 2011


      2011/5/2 Michael Hope michael.hope@linaro.org:
[snip]
...
---------- Forwarded message ----------
From: Michael Hope michael.hope@linaro.org
Date: Mon, May 2, 2011 at 10:35 AM
Subject: Re: [Zlib-devel] [3/8][RFC V3 Patch] Add special ARM Adler32 version
To: zlib-devel@madler.net
On Mon, May 2, 2011 at 9:48 AM, Jan Seiffert
kaffeemonster@googlemail.com wrote:
...
2011/4/24 Jan Seiffert kaffeemonster@googlemail.com:
...
This adds a NEON version, an iWMMXt version for Intel (now Marvell)
StrongARM, and a version for ARMv6 DSP instructions of Adler32.
Thanks again to Edwin Török, the NEON and ARMv6 DSP versions are now
tested and fixed.
The good news is NEON:
an i.MX515 @ 800 MHz (armv7l) with NEON
        -------- orig ------
               a: 0x0CB4B676, 10000 * 160000 bytes     t: 4010 ms
               a: 0x25BEB273, 10000 * 159999 bytes     t: 2990 ms
               a: 0x733CB174, 10000 * 159998 bytes     t: 4060 ms
               a: 0x1144AF76, 10000 * 159996 bytes     t: 4050 ms
               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 4060 ms
               a: 0x1902A382, 10000 * 159984 bytes     t: 4060 ms
        -------- vec ------
               a: 0x0CB4B676, 10000 * 160000 bytes     t: 1450 ms
               a: 0x25BEB273, 10000 * 159999 bytes     t: 1450 ms
               a: 0x733CB174, 10000 * 159998 bytes     t: 1460 ms
               a: 0x1144AF76, 10000 * 159996 bytes     t: 1450 ms
               a: 0x3F4ECB8A, 10000 * 159992 bytes     t: 1460 ms
               a: 0x1902A382, 10000 * 159984 bytes     t: 1450 ms
        speedup: 2.765517
Hi Jan.
Hi Michael, Linaro Devs
...
I see similar numbers.
Great to hear ;)
Means I'm not totally on the wrong track.
...
I wasn't sure what you were using to benchmark this
A little program which contains the different code versions and a test loop.
As it is written there:
"10000 * 160000 bytes" means 10000 calls with a 160000-byte buffer.
The different lines are from tests with buffer + offset, len - offset,
to test different alignments.
The buffer is filled with 0xff (the worst input for adler32, because
it may overflow the internal sums earlier than other input, so all
internal looping must be sized for 0xff).
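
To make the 0xff argument concrete, here is a minimal sketch (not taken
from the patch) that computes how many bytes can be summed into 32-bit
accumulators before a modulo reduction becomes mandatory. The function
name adler32_nmax and its parameters are hypothetical. It assumes both
sums start at their largest reduced value, base - 1, and that every
input byte has the value max_byte.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: largest block length n such that the second
 * Adler-32 sum cannot exceed UINT32_MAX when both sums start at
 * base - 1 and every byte equals max_byte. After n such bytes:
 *   s2 <= max_byte * n * (n + 1) / 2 + (n + 1) * (base - 1)
 */
static uint32_t adler32_nmax(uint32_t base, uint32_t max_byte)
{
    assert(base > 1 && max_byte > 0);
    uint64_t n = 0;
    for (;;) {
        uint64_t m = n + 1; /* candidate block length */
        uint64_t worst = (uint64_t)max_byte * m * (m + 1) / 2
                       + (m + 1) * (uint64_t)(base - 1);
        if (worst > UINT32_MAX)
            break;          /* m bytes could overflow, keep n */
        n = m;
    }
    return (uint32_t)n;
}
```

Calling adler32_nmax(65521, 255) gives 5552, which matches the NMAX
constant in zlib's adler32.c. A smaller max_byte allows longer blocks,
which is why 0xff is the input that dictates the loop sizing.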
The time is measured with times().
Oh, and if it's not clear, this is only the adler32 speedup, because
only adler32 is run in a loop.
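
The test loop described above could look roughly like the following
sketch. It is not the actual benchmark program: adler32_ref, adler32_fn
and bench_ms are hypothetical names, the reference checksum is the
naive per-byte version, and chaining each result into the next call is
an assumption made here so the compiler cannot drop the loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/times.h>
#include <unistd.h>

#define BASE 65521u /* largest prime below 2^16 */

/* Naive reference Adler-32: reduces both sums after every byte. */
static uint32_t adler32_ref(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;
    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % BASE;
        s2 = (s2 + s1) % BASE;
    }
    return (s2 << 16) | s1;
}

typedef uint32_t (*adler32_fn)(uint32_t, const unsigned char *, size_t);

/*
 * Run `calls` checksums over buf + offset, len - offset and return the
 * user CPU time in milliseconds as reported by times(). The final
 * checksum is stored in *result.
 */
static long bench_ms(adler32_fn fn, const unsigned char *buf, size_t len,
                     size_t offset, unsigned calls, uint32_t *result)
{
    struct tms start, end;
    uint32_t a = 1; /* Adler-32 initial value */

    assert(offset <= len);
    times(&start);
    for (unsigned i = 0; i < calls; i++)
        a = fn(a, buf + offset, len - offset); /* chained: an assumption */
    times(&end);

    *result = a;
    return (long)((end.tms_utime - start.tms_utime) * 1000
                  / sysconf(_SC_CLK_TCK));
}
```

With a 160000-byte buffer set to 0xff, calling bench_ms with offsets 0,
1, 2, 4, 8 and 16 and calls = 10000 reproduces the six buffer lengths
in the table above (160000 down to 159984).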
...
so I wrote my own little stub that did the
seed=0x0CB4B676 version over data from rand().
Yeah, that's also a valid test. Maybe you want to srand(0) or
something to get a reliable result.
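
A minimal sketch of the fixed-seed idea, assuming a hypothetical helper
fill_random. Seeding with srand() before filling the buffer means every
run on the same C library checksums the same bytes, so timings stay
comparable between runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: fill buf with pseudo-random bytes from a fixed seed. */
static void fill_random(unsigned char *buf, size_t len, unsigned seed)
{
    assert(buf != NULL);
    srand(seed); /* same seed gives the same sequence on the same libc */
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)(rand() & 0xff);
}
```

Note that the rand() sequence is implementation-defined, so the data is
reproducible per platform, not necessarily across different C libraries.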
...
I ran each test five
times and picked the lowest user time as the best.  All were built
with gcc-linaro-4.5-2011.04 with -mfpu=neon -mtune=cortex-a9.  The
results were:
Cortex-A9 @ 1 GHz:
 Plain C: 4.094 s
 ARMv6: 4.578 s
 NEON: 1.985 s
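
The "lowest of five runs" selection can be sketched as below. The name
best_time is hypothetical, and the samples are assumed to be user times
in seconds collected by whatever timing method is in use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the smallest of n timing samples. */
static double best_time(const double *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    double best = samples[0];
    for (size_t i = 1; i < n; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}
```

Taking the minimum rather than the mean filters out runs disturbed by
other activity on the machine, since such noise can only add time.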
What's the exact make and model of your Cortex-A9?
...
Cortex-A8 @ 720 MHz:
 Plain C: 4.164 s
 NEON: 1.819 s
 NEON: 1.570 s (with -mtune=cortex-a8)
It's interesting how the slower A8 does better than the A9.  It's
probably due to the A8 having wider access to the L2 cache, as running
the same test on 16 k of data, so that it fits in the L1 cache,
gives:
Cortex-A8: 5.234 s
Cortex-A9: 3.969 s
The ratio here is 0.760 which is very similar to the ratio between the
clock frequencies.
Yepp, cache connection is important. Most vector units are, at least
for this task, very fast and only constrained by internal brain damage
or cache/memory.
Look at the Altivec numbers:
http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-April/002544.htm...
As long as it fits into the cache, a 6.6 speedup; after that, a 1.3 speedup.
...
-- Michael
Greetings
Jan
PS:
Michael, may I forward your numbers to the zlib mailing list?
-- 
Murphy's Law of Combat
Rule #3: "Never forget that your weapon was manufactured by the
lowest bidder"
