On Mon, May 2, 2011 at 11:13 AM, Jan Seiffert kaffeemonster@googlemail.com wrote:
2011/5/2 Michael Hope michael.hope@linaro.org: Hi Michael, Linaro Devs
I see similar numbers.
Great to hear ;) Means I'm not totally on the wrong track.
Note that I've sent the results to zlib-dev. The copy to linaro-dev was forwarded on rather than cross-posting.
I wasn't sure what you were using to benchmark this
A little program which contains the different code versions and a test loop. As it is written there: "10000 * 160000 bytes" means 10000 calls with a 160000-byte buffer. The different lines are from tests with (buffer + offset, len - offset) to test different alignments.
The buffer is filled with 0xff (the worst input for adler32, because it may overflow the internal sums earlier than other input, so all internal loop bounds must be sized for 0xff). The time is measured with times().
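To illustrate why 0xff is the worst case: with 32-bit sums, the number of bytes that can be accumulated before a modulo is needed is bounded by the all-0xff input. The sketch below recomputes that bound; the function name adler_nmax is made up for illustration, and the inequality is the one given in the comment in zlib's adler32.c.

```c
#include <stdint.h>

#define ADLER_BASE 65521u  /* largest prime smaller than 65536 */

/*
 * Hypothetical helper: find the largest n such that
 *   255*n*(n+1)/2 + (n+1)*(BASE-1) <= 2^32 - 1
 * i.e. how many 0xff bytes can be summed into 32-bit s1/s2
 * (both starting at BASE-1) before s2 could overflow.
 */
static unsigned adler_nmax(void)
{
    uint64_t n = 0;

    /* Keep going while the bound still holds for n + 1. */
    while (255u * (n + 1) * (n + 2) / 2 + (n + 2) * (uint64_t)(ADLER_BASE - 1)
           <= 0xffffffffu)
        n++;

    return (unsigned)n;
}
```

Any input with smaller byte values overflows later, so a loop bound that is safe for 0xff is safe for everything else.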
Oh, and if it's not clear, this is only the adler32 speedup, because only adler32 is run in a loop.
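For readers who want to reproduce this kind of measurement, here is a minimal sketch of such a test loop. It is not the actual benchmark program: adler32_ref and bench_adler32 are hypothetical names, the scalar checksum stands in for whichever code version is under test, and feeding each result back in as the next starting value is my own choice so the compiler cannot drop repeated calls.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/times.h>
#include <unistd.h>

#define ADLER_BASE 65521u
#define ADLER_NMAX 5552u   /* bytes that can be summed before a modulo is due */

/* Plain scalar Adler-32, stand-in for the version being benchmarked. */
static uint32_t adler32_ref(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;

    while (len > 0) {
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n--) {
            s1 += *buf++;
            s2 += s1;
        }
        s1 %= ADLER_BASE;
        s2 %= ADLER_BASE;
    }
    return (s2 << 16) | s1;
}

/*
 * Run `calls` checksums over (buf + offset, len - offset) and return the
 * user CPU time in seconds, measured with times(). The final checksum is
 * stored in *result.
 */
static double bench_adler32(const unsigned char *buf, size_t len,
                            size_t offset, long calls, uint32_t *result)
{
    struct tms start, end;
    uint32_t sum = 1;
    long i;

    times(&start);
    for (i = 0; i < calls; i++)
        sum = adler32_ref(sum, buf + offset, len - offset);
    times(&end);

    *result = sum;
    return (double)(end.tms_utime - start.tms_utime)
           / (double)sysconf(_SC_CLK_TCK);
}
```

A run matching the description above would memset() a 160000-byte buffer to 0xff and call bench_adler32(buf, 160000, offset, 10000, &sum) once per offset of interest.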
Any idea on how much time in a zlib decompress is spent in adler32?
so I wrote my own little stub that did the seed=0x0CB4B676 version over data from rand().
Yeah, that's also a valid test. Maybe you want to srand(0) or something to get a reliable result.
Yip, I had srand(1234) so the results should be repeatable.
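A sketch of the repeatable-input idea, assuming nothing about the real stub beyond srand() with a fixed value followed by rand(): fill_random and is_repeatable are hypothetical names, and taking the low 8 bits of rand() is my own choice.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Fill buf with pseudo-random bytes; the same seed gives the same bytes. */
static void fill_random(unsigned char *buf, size_t len, unsigned seed)
{
    size_t i;

    srand(seed);
    for (i = 0; i < len; i++)
        buf[i] = (unsigned char)(rand() & 0xff);
}

/* Returns 1 if two fills with the same seed match, 0 if not, -1 on error. */
static int is_repeatable(size_t len, unsigned seed)
{
    unsigned char *a = malloc(len);
    unsigned char *b = malloc(len);
    int same = -1;

    if (a != NULL && b != NULL) {
        fill_random(a, len, seed);
        fill_random(b, len, seed);
        same = memcmp(a, b, len) == 0;
    }
    free(a);
    free(b);
    return same;
}
```

The filled buffer would then be handed to the adler32 version under test. Note that the sequence is only repeatable against the same C library, since rand() implementations differ between platforms.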
It's interesting how the slower A8 does better than the A9. It's probably due to the A8 having wider access to the L2 cache: running the same test on 16 k of data, so that it fits in the L1 cache, gives:
Cortex-A8: 5.234 s
Cortex-A9: 3.969 s
The ratio here is 0.760 which is very similar to the ratio between the clock frequencies.
Yepp, cache connection is important. Most vector units are, at least for this task, very fast and only constrained by internal brain damage or cache/memory. Look at the Altivec numbers: http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-April/002544.htm... As long as it fits into the cache, the speedup is 6.6x; after that, it drops to 1.3x.
Makes sense. The A8 has a direct connection to the 256 k of L2 cache so the 160k x 10,000 test runs as fast as the 16 k x 100,000 test.
-- Michael