Re: zlib NEON improvements

2 May 2011


On Mon, May 2, 2011 at 11:13 AM, Jan Seiffert <kaffeemonster@googlemail.com> wrote:
...
2011/5/2 Michael Hope michael.hope@linaro.org:
Hi Michael, Linaro Devs
...
I see similar numbers.
Great to hear ;)
Means I'm not totally on the wrong track.
Note that I've sent the results to zlib-dev.  The copy to linaro-dev
was forwarded on rather than cross-posting.
...
...
I wasn't sure what you were using to benchmark this
A little program which contains the different code versions and a test loop.
As it is written there:
"10000 * 160000 bytes" means 10000 calls with a 160000-byte buffer.
The different lines are from tests with buffer + offset, len - offset,
to test different alignments.
The buffer is filled with 0xff (the worst input for adler32, because
it may overflow the internal sums earlier than other input, so all
internal looping must be sized for 0xff).
The time is measured with times().
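
To make the 0xff worst case concrete, here is a small sketch that computes how many 0xff bytes can be summed before the 32-bit sums have to be reduced. It uses the bound quoted in zlib's adler32.c comment for NMAX (255n(n+1)/2 + (n+1)(BASE-1) <= 2^32-1); the function name adler_max_block is made up for this example and is not part of zlib.

```c
#include <stdint.h>

#define ADLER_BASE 65521u  /* largest prime below 65536, the Adler-32 modulus */

/* Hypothetical helper: largest n such that n bytes of 0xff can be
 * accumulated without overflowing a 32-bit sum. Worst case assumed:
 * both running sums start at ADLER_BASE - 1 and every byte is 255. */
static unsigned adler_max_block(void)
{
    uint64_t n = 0;
    for (;;) {
        uint64_t next = n + 1;
        /* Upper bound on the second sum after `next` bytes of 0xff. */
        uint64_t worst = 255u * next * (next + 1) / 2
                       + (next + 1) * (ADLER_BASE - 1);
        if (worst > UINT32_MAX)
            return (unsigned)n;  /* `next` bytes would overflow */
        n = next;
    }
}
```

This should come out at 5552, the NMAX value zlib uses, so any vectorised inner loop has to be sized to stay within a comparable bound for its own accumulator layout.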
Oh, and if it's not clear, this is only the adler32 speedup, because
only adler32 is run in a loop.
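
A minimal sketch of a test loop of this shape, assuming a POSIX system for times(). This is not the actual benchmark program: adler32_ref is a plain byte-at-a-time stand-in for the versions under test, bench_adler32 is a made-up name, and feeding each result back in as the next seed is my own choice so the compiler cannot hoist the call out of the loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/times.h>
#include <unistd.h>

/* Plain byte-at-a-time Adler-32; a stand-in for the code being measured. */
static uint32_t adler32_ref(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff, s2 = (adler >> 16) & 0xffff;
    while (len--) {
        s1 = (s1 + *buf++) % 65521u;
        s2 = (s2 + s1) % 65521u;
    }
    return (s2 << 16) | s1;
}

/* Fill buf with 0xff, checksum buf+offset `calls` times, and return the
 * user CPU time in seconds. The final checksum is stored in *result. */
static double bench_adler32(unsigned char *buf, size_t len, size_t offset,
                            unsigned calls, uint32_t *result)
{
    struct tms start, end;
    uint32_t sum = 1;  /* standard Adler-32 start value */

    memset(buf, 0xff, len);
    times(&start);
    for (unsigned i = 0; i < calls; i++)
        sum = adler32_ref(sum, buf + offset, len - offset);  /* chained seed */
    times(&end);

    *result = sum;
    return (double)(end.tms_utime - start.tms_utime)
         / (double)sysconf(_SC_CLK_TCK);
}
```

Calling bench_adler32(buf, 160000, offset, 10000, &r) for a range of small offsets would correspond to the "10000 * 160000 bytes" lines, one line per alignment.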
Any idea on how much time in a zlib decompress is spent in adler32?
...
...
so I wrote my own little stub that did the
seed=0x0CB4B676 version over data from rand().
Yeah, that's also a valid test. Maybe you want to srand(0) or
something to get a reliable result.
Yip, I had srand(1234) so the results should be repeatable.
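
For reference, a sketch of what seeding the generator buys you. The helper name fill_random is hypothetical and the actual stub may have filled its buffer differently; the only assumption is that the data comes from rand() after srand(1234).

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: fill buf with pseudo-random bytes. Re-seeding with
 * the same value makes rand() replay the same sequence, so repeated runs
 * on the same C library see identical input. */
static void fill_random(unsigned char *buf, size_t len, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)(rand() & 0xff);
}
```

With zlib, the checksum over such a buffer would then be started from the seed mentioned above, along the lines of adler32(0x0CB4B676, buf, len). Note that rand() differs between C libraries, so the numbers are repeatable per platform rather than comparable byte for byte across platforms.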
...
...
It's interesting how the slower A8 does better than the A9.  It's
probably due to the A8 having wider access to the L2 cache: running
the same test on 16 k of data, so that it fits in the L1 cache,
gives:
Cortex-A8: 5.234 s
Cortex-A9: 3.969 s
The ratio here is 0.760 which is very similar to the ratio between the
clock frequencies.
Yepp, cache connection is important. Most vector units are, at least
for this task, very fast and only constrained by internal brain damage
or by cache/memory.
Look at the Altivec numbers:
http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-April/002544.htm...
As long as it fits into the cache, a 6.6x speedup; after that, a 1.3x speedup.
Makes sense.  The A8 has a direct connection to the 256 k of L2 cache
so the 160 k x 10,000 test runs as fast as the 16 k x 100,000 test.
-- Michael
