Jan Seiffert has been looking into vectorising the adler32 routine in zlib. On the A9 there's a 3.0x improvement to be had on blocks that fit in the L1 cache and a 2.1x improvement for larger blocks.
See: http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-May/002556.html
for the work in progress.
-- Michael
---------- Forwarded message ---------- From: Michael Hope michael.hope@linaro.org Date: Mon, May 2, 2011 at 10:35 AM Subject: Re: [Zlib-devel] [3/8][RFC V3 Patch] Add special ARM Adler32 version To: zlib-devel@madler.net
On Mon, May 2, 2011 at 9:48 AM, Jan Seiffert kaffeemonster@googlemail.com wrote:
2011/4/24 Jan Seiffert kaffeemonster@googlemail.com:
This adds a NEON version, an iWMMXt version for the Intel (now Marvell) StrongARM, and an ARMv6 DSP-instruction version of Adler32.
Thanks again to Edwin Török, the NEON and ARMv6 DSP versions are now tested and fixed.
The good news is NEON: an i.MX515 @ 800 MHz (armv7l) with NEON

-------- orig ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 4010 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 2990 ms
a: 0x733CB174, 10000 * 159998 bytes t: 4060 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 4050 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 4060 ms
a: 0x1902A382, 10000 * 159984 bytes t: 4060 ms
-------- vec ------
a: 0x0CB4B676, 10000 * 160000 bytes t: 1450 ms
a: 0x25BEB273, 10000 * 159999 bytes t: 1450 ms
a: 0x733CB174, 10000 * 159998 bytes t: 1460 ms
a: 0x1144AF76, 10000 * 159996 bytes t: 1450 ms
a: 0x3F4ECB8A, 10000 * 159992 bytes t: 1460 ms
a: 0x1902A382, 10000 * 159984 bytes t: 1450 ms
speedup: 2.765517
Hi Jan. I see similar numbers. I wasn't sure what you were using to benchmark this so I wrote my own little stub that did the seed=0x0CB4B676 version over data from rand(). I ran each test five times and picked the lowest user time as the best. All were built with gcc-linaro-4.5-2011.04 with -mfpu=neon -mtune=cortex-a9. The results were:
Cortex-A9 @ 1 GHz:
Plain C: 4.094 s
ARMv6:   4.578 s
NEON:    1.985 s

Cortex-A8 @ 720 MHz:
Plain C: 4.164 s
NEON:    1.819 s
NEON:    1.570 s (with -mtune=cortex-a8)
It's interesting how the slower A8 does better than the A9. It's probably due to the A8 having a wider connection to the L2 cache: running the same test on 16 k of data, so that it fits in the L1 cache, gives:
Cortex-A8: 5.234 s
Cortex-A9: 3.969 s
The ratio here is 0.760, which is very similar to the 0.72 ratio between the clock frequencies.
-- Michael
2011/5/2 Michael Hope michael.hope@linaro.org: [snip]
Hi Jan.
Hi Michael, Linaro Devs
I see similar numbers.
Great to hear ;) It means I'm not totally on the wrong track.
I wasn't sure what you were using to benchmark this
A little program which contains the different code versions and a test loop. As written there, "10000 * 160000 bytes" means 10000 calls with a 160000-byte buffer. The different lines come from tests with buffer + offset, len - offset, to test different alignments.
The buffer is filled with 0xff (the worst input for adler32, because it may overflow the internal sums earlier than other input, so all internal loop limits must be sized for 0xff). The time is measured with times().
Oh, and if it's not clear, this is only the adler32 speedup, because only adler32 is run in a loop.
so I wrote my own little stub that did the seed=0x0CB4B676 version over data from rand().
Yeah, that's also a valid test. Maybe you want to call srand(0) or something to get a repeatable result.
I ran each test five times and picked the lowest user time as the best. All were built with gcc-linaro-4.5-2011.04 with -mfpu=neon -mtune=cortex-a9. The results were:
Cortex-A9 @ 1 GHz:
Plain C: 4.094 s
ARMv6:   4.578 s
NEON:    1.985 s
What's the exact make and model of your cortex-a9?
Cortex-A8 @ 720 MHz:
Plain C: 4.164 s
NEON:    1.819 s
NEON:    1.570 s (with -mtune=cortex-a8)
It's interesting how the slower A8 does better than the A9. It's probably due to the A8 having a wider connection to the L2 cache: running the same test on 16 k of data, so that it fits in the L1 cache, gives:
Cortex-A8: 5.234 s
Cortex-A9: 3.969 s
The ratio here is 0.760, which is very similar to the 0.72 ratio between the clock frequencies.
Yepp, the cache connection is important. Most vector units are, at least for this task, very fast and only constrained by internal brain damage or by cache/memory. Look at the Altivec numbers: http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-April/002544.htm... As long as the data fits into the cache it's a 6.6x speedup; after that, 1.3x.
-- Michael
Greetings Jan
PS: Michael, may I forward your numbers to the zlib mailing list?
On Mon, May 2, 2011 at 11:13 AM, Jan Seiffert kaffeemonster@googlemail.com wrote:
2011/5/2 Michael Hope michael.hope@linaro.org: Hi Michael, Linaro Devs
I see similar numbers.
Great to hear ;) It means I'm not totally on the wrong track.
Note that I've sent the results to zlib-dev. The copy to linaro-dev was forwarded on rather than cross-posting.
I wasn't sure what you were using to benchmark this
A little program which contains the different code versions and a test loop. As written there, "10000 * 160000 bytes" means 10000 calls with a 160000-byte buffer. The different lines come from tests with buffer + offset, len - offset, to test different alignments.
The buffer is filled with 0xff (the worst input for adler32, because it may overflow the internal sums earlier than other input, so all internal loop limits must be sized for 0xff). The time is measured with times().
Oh, and if it's not clear, this is only the adler32 speedup, because only adler32 is run in a loop.
Any idea on how much time in a zlib decompress is spent in adler32?
so I wrote my own little stub that did the seed=0x0CB4B676 version over data from rand().
Yeah, that's also a valid test. Maybe you want to call srand(0) or something to get a repeatable result.
Yip, I had srand(1234) so the results should be repeatable.
It's interesting how the slower A8 does better than the A9. It's probably due to the A8 having a wider connection to the L2 cache: running the same test on 16 k of data, so that it fits in the L1 cache, gives:
Cortex-A8: 5.234 s
Cortex-A9: 3.969 s
The ratio here is 0.760, which is very similar to the 0.72 ratio between the clock frequencies.
Yepp, the cache connection is important. Most vector units are, at least for this task, very fast and only constrained by internal brain damage or by cache/memory. Look at the Altivec numbers: http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-April/002544.htm... As long as the data fits into the cache it's a 6.6x speedup; after that, 1.3x.
Makes sense. The A8 has a direct connection to the 256 k of L2 cache, so the 160 k x 10,000 test runs as fast as the 16 k x 100,000 test.
-- Michael
2011/5/2 Michael Hope michael.hope@linaro.org:
On Mon, May 2, 2011 at 11:13 AM, Jan Seiffert kaffeemonster@googlemail.com wrote:
2011/5/2 Michael Hope michael.hope@linaro.org: Hi Michael, Linaro Devs
I see similar numbers.
Great to hear ;) It means I'm not totally on the wrong track.
Note that I've sent the results to zlib-dev. The copy to linaro-dev was forwarded on rather than cross-posting.
Yep, it came over zlib-dev a little later.
[snip]
Oh, and if it's not clear, this is only the adler32 speedup, because only adler32 is run in a loop.
Any idea on how much time in a zlib decompress is spent in adler32?
Heavily depends on the data, or, more correctly, on the compression ratio: adler32 is calculated over the uncompressed data. So if you have data that compresses well, it will be fast to decompress (mostly inflate_fast doing memsets), and adler32 becomes a greater part of the whole (assuming that compression is always slow).

That at least was my use case. I get sparse bitfields compressed with deflate sent over the network. Compressed they are 300 to 30000 bytes (more towards the lower numbers, because they are mostly sparsely populated); uncompressed they are 128k. My adler32 part was >= 33% of the whole zlib time.

Last year in April, Stefan Fuhrmann also proposed x86 vector code (http://mail.madler.net/pipermail/zlib-devel_madler.net/2010-April/001938.htm...); he said he was doing stuff with svn (so compressed text?). He saw a total adler32 time of ~15%; by making adler32 3 times as fast he could lower that to ~5%.
So no, this code will not magically make your compression/decompression n times faster, but every bit helps, and I mean, if it comes for free...
[snip]
-- Michael
Greetings Jan