In prep for Linaro Connect & the Ubuntu Developer Summit next week I've put together some performance measurements comparing libjpeg8c and libjpeg-turbo compiled with its libjpeg8 compatibility setting. Quality settings of 95 and 75 were used, with image sizes of 640x480 and 3136x2352.
Hardware used includes the Freescale i.MX53 Quick Start board and an Intel Core 2 Duo in my Lenovo T400.
The results, including both the raw numbers and pretty graphs, can be found here:
https://wiki.linaro.org/TomGall/LibJpeg8
It is my hope that at LC/UDS we will be able to use these numbers to convince Ubuntu to reconsider its switch to libjpeg8 and instead move to libjpeg-turbo. The 2x-4x across-the-board performance improvement story is compelling, not to mention the technical side of it as well.
On 10/26/2011 10:54 PM, Tom Gall wrote:
In prep for Linaro Connect & the Ubuntu Developer Summit next week I've put together some performance measurements comparing libjpeg8c and libjpeg-turbo compiled with its libjpeg8 compatibility setting. Quality settings of 95 and 75 were used, with image sizes of 640x480 and 3136x2352.
Hardware used includes the Freescale i.MX53 Quick Start board and an Intel Core 2 Duo in my Lenovo T400.
The results, including both the raw numbers and pretty graphs, can be found here:
https://wiki.linaro.org/TomGall/LibJpeg8
It is my hope that at LC/UDS we will be able to use these numbers to convince Ubuntu to reconsider its switch to libjpeg8 and instead move to libjpeg-turbo. The 2x-4x across-the-board performance improvement story is compelling, not to mention the technical side of it as well.
I doubt that Ubuntu will reconsider this for Precise, but I see that you did schedule a session for UDS/Connect [1]. It would be good if you could provide the relevant information for the session:
- Performance data from your wiki in a precise/12.04 environment, not just for ARM but for all supported Ubuntu architectures. Performance data from a natty environment doesn't really help.
How does this compare to a libjpeg8 targeted to newer CPUs? Such a library could be used via hwcap.
- A test rebuild for packages build-depending on libjpeg*-dev. Not sure if this will catch all issues, but it's a start. That should give an estimate for sourceful and sourceless changes needed, and for which packages you'll have to maintain a delta compared to Debian.
Thanks, Matthias
[1] https://blueprints.launchpad.net/libjpeg-turbo/+spec/linaro-gfxmm-libjpeg-tu...
On 27 October 2011 12:38, Matthias Klose doko@ubuntu.com wrote:
I doubt that Ubuntu will reconsider this for Precise, but I see that you did schedule a session for UDS/Connect [1]. It would be good if you could provide the relevant information for the session:
- Performance data from your wiki in a precise/12.04 environment, not just for ARM but for all supported Ubuntu architectures. Performance data from a natty environment doesn't really help.
How does this compare to a libjpeg8 targeted to newer CPUs? Such a library could be used via hwcap.
Fedora switched to libjpeg-turbo and reports in their release notes:
"The libjpeg library has been replaced by libjpeg-turbo library which has same API/ABI but is at least twice faster on all primary architectures and about 25% faster on secondary architectures."
- A test rebuild for packages build-depending on libjpeg*-dev. Not sure if this will catch all issues, but it's a start. That should give an estimate for sourceful and sourceless changes needed, and for which packages you'll have to maintain a delta compared to Debian.
Since libjpeg-turbo is API/ABI compatible, it would not even require a rebuild. This actually makes the transition smoother than the transition to the libjpeg8 fork of the original libjpeg.
Riku
On Wed, Oct 26, 2011 at 03:54:52PM -0500, Tom Gall wrote:
Hardware used includes the Freescale i.MX53 Quick Start board and an Intel Core 2 Duo in my Lenovo T400.
The results, including both the raw numbers and pretty graphs, can be found here.
Again, wow, thanks for such a thorough analysis. I think this is indeed very good material for discussing with Ubuntu. Do we have a session scheduled with them to talk about this?
I have a question: any idea why the gap between 8c and turbo8 is so much more impressive on x86(_64) than on ARM? Could there be low-hanging fruit left in the NEON codepaths?
On Thu, Oct 27, 2011 at 9:45 PM, Christian Robottom Reis kiko@linaro.org wrote:
On Wed, Oct 26, 2011 at 03:54:52PM -0500, Tom Gall wrote:
Hardware used includes the Freescale i.MX53 Quick Start board and an Intel Core 2 Duo in my Lenovo T400.
The results, including both the raw numbers and pretty graphs, can be found here.
Again, wow, thanks for such a thorough analysis. I think this is indeed very good material for discussing with Ubuntu. Do we have a session scheduled with them to talk about this?
I have a question: any idea why the gap between 8c and turbo8 is so much more impressive on x86(_64) than on ARM?
Out of curiosity, how much is "much more impressive"? Which case in particular has caught your attention?
Could there be low-hanging fruit left in the NEON codepaths?
Yes. The currently missing ARM NEON optimizations for chroma upsampling/downsampling and grayscale color conversion definitely affect the tjbench results for subsampled formats and grayscale.
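As an example of the kind of routine that currently has no NEON version, plain h2v1 chroma upsampling (without smoothing) just duplicates each chroma sample horizontally, which maps naturally onto a VZIP of a register with itself. A rough sketch with intrinsics (made-up function name, not the actual libjpeg-turbo code):

#include <arm_neon.h>

static void h2v1_upsample_row(const uint8_t *in, uint8_t *out, int in_width)
{
    int x = 0;
    /* 16 input chroma samples -> 32 output samples per iteration */
    for (; x + 16 <= in_width; x += 16) {
        uint8x16_t c = vld1q_u8(in + x);
        uint8x16x2_t dup = vzipq_u8(c, c);       /* interleave c with itself */
        vst1q_u8(out + 2 * x, dup.val[0]);       /* a a b b c c d d ...      */
        vst1q_u8(out + 2 * x + 16, dup.val[1]);
    }
    for (; x < in_width; x++) {                  /* scalar tail */
        out[2 * x] = in[x];
        out[2 * x + 1] = in[x];
    }
}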
Also, the Huffman decoder optimizations in libjpeg-turbo (which are C code, not SIMD) seem to provide only a barely measurable improvement on ARM, while the Huffman speedup is clearly more impressive on x86. As a result, this gives libjpeg-turbo more of an edge over IJG jpeg on x86.
On Thu, Oct 27, 2011 at 10:30:14PM +0300, Siarhei Siamashka wrote:
On Thu, Oct 27, 2011 at 9:45 PM, Christian Robottom Reis kiko@linaro.org wrote:
On Wed, Oct 26, 2011 at 03:54:52PM -0500, Tom Gall wrote:
Hardware used includes the Freescale i.MX53 Quick Start board and an Intel Core 2 Duo in my Lenovo T400.
The results, including both the raw numbers and pretty graphs, can be found here.
Again, wow, thanks for such a thorough analysis. I think this is indeed very good material for discussing with Ubuntu. Do we have a session scheduled with them to talk about this?
I have a question: any idea why the gap between 8c and turbo8 is so much more impressive on x86(_64) than on ARM?
Out of curiosity, how much is "much more impressive"? Which case in particular has caught your attention?
Sorry; here are the graphs showing where I was surprised to see a 3-4x gap:
https://wiki.linaro.org/TomGall/LibJpeg8?action=AttachFile&do=view&t...
https://wiki.linaro.org/TomGall/LibJpeg8?action=AttachFile&do=view&t...
https://wiki.linaro.org/TomGall/LibJpeg8?action=AttachFile&do=view&t...
https://wiki.linaro.org/TomGall/LibJpeg8?action=AttachFile&do=view&t...
Could there be low-hanging fruit left in the NEON codepaths?
Yes. The currently missing ARM NEON optimizations for chroma upsampling/downsampling and grayscale color conversion definitely affect the tjbench results for subsampled formats and grayscale.
Also, the Huffman decoder optimizations in libjpeg-turbo (which are C code, not SIMD) seem to provide only a barely measurable improvement on ARM, while the Huffman speedup is clearly more impressive on x86. As a result, this gives libjpeg-turbo more of an edge over IJG jpeg on x86.
We should definitely consider these for future work, then. Thanks for the reply,
On 10/27/11 2:30 PM, Siarhei Siamashka wrote:
Also, the Huffman decoder optimizations in libjpeg-turbo (which are C code, not SIMD) seem to provide only a barely measurable improvement on ARM, while the Huffman speedup is clearly more impressive on x86. As a result, this gives libjpeg-turbo more of an edge over IJG jpeg on x86.
In general, the Huffman codec improvements produce a greater speedup on 64-bit vs. 32-bit and a greater speedup when compressing vs. decompressing. So, whereas libjpeg-turbo's Huffman codec realizes about a 25-50% improvement vs. the libjpeg Huffman codec when doing compression using 64-bit code, it only realizes a few percent speedup vs. libjpeg when doing decompression using 32-bit code. The Huffman algorithm uses a single register as a bit bucket, and the fewer times it has to shift in new bits to that register, the faster it is. That's why it's so much faster on 64-bit vs. 32-bit.
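To illustrate the idea, here is a minimal sketch of such a bit bucket in C (not the actual libjpeg-turbo code; the names bitreader, ensure_bits, and peek_bits are made up for illustration). On a 64-bit build, uintptr_t gives a 64-bit bucket, so the refill branch is taken roughly half as often as on 32-bit:

#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *next;   /* next input byte */
    size_t left;           /* bytes remaining in the input */
    uintptr_t bucket;      /* pending bits, register-sized */
    int count;             /* number of valid bits in the bucket */
} bitreader;

/* Ensure at least `need` bits are available, refilling a byte at a time. */
static int ensure_bits(bitreader *br, int need)
{
    while (br->count < need) {          /* the hot refill branch */
        if (br->left == 0)
            return 0;                   /* out of data */
        br->bucket = (br->bucket << 8) | *br->next++;
        br->left--;
        br->count += 8;
    }
    return 1;
}

/* Peek at the top `n` bits of the bucket without consuming them. */
static unsigned peek_bits(const bitreader *br, int n)
{
    return (unsigned)(br->bucket >> (br->count - n)) & ((1u << n) - 1);
}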
The Huffman codec is probably the single biggest piece of low-hanging fruit in the entire code base, since it represents something like 40-50% of total execution time in many cases. I've spent hundreds of hours looking at it, and the basic problem with the 32-bit code seems to be register exhaustion. After trying many different approaches, the C code, as currently written, seems to produce the best possible performance on 32-bit x86 without sacrificing any performance on 64-bit x86. However, that doesn't mean that it couldn't be improved upon-- perhaps even dramatically-- by using hand-written assembly. Other codecs, such as the Intel Performance Primitives, manage to produce similar Huffman performance on both 64-bit and 32-bit. libjpeg-turbo can mostly match their 64-bit performance but not their 32-bit performance, which leads me to believe that they're doing something fundamentally different with their Huffman codec. Perhaps they are even using SIMD instructions, although I have spent much time investigating that as well, and I couldn't manage to find a method that didn't require moving data back and forth between the SIMD registers and the regular registers (because you can't branch when using SIMD instructions, and branching is somewhat critical to the Huffman algorithm.)
If someone could manage to fix, or even improve, the way registers are used in the 32-bit Huffman codec, it would greatly benefit both ARM and x86.
I have spent much time investigating that as well, and I couldn't manage to find a method that didn't require moving data back and forth between the SIMD registers and the regular registers (because you can't branch when using SIMD instructions, and branching is somewhat critical to the Huffman algorithm.)
You've probably looked at this, but on x86, the pmovmskb instruction (_mm_movemask_epi8() intrinsic) is pretty good for branching on the result of a SIMD compare.
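For example (a trivial SSE2 sketch, not libjpeg-turbo code; any_byte_matches is a made-up name):

#include <emmintrin.h>  /* SSE2 */

/* Returns nonzero if any of the 16 bytes in `block` equals `needle`. */
static int any_byte_matches(__m128i block, unsigned char needle)
{
    __m128i key = _mm_set1_epi8((char)needle);
    __m128i eq  = _mm_cmpeq_epi8(block, key);  /* 0xFF where equal, 0x00 elsewhere */
    int mask    = _mm_movemask_epi8(eq);       /* one bit per byte lane (pmovmskb) */

    if (mask != 0)   /* scalar branch driven by the SIMD compare result */
        return 1;
    return 0;
}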
-Justin
On Thu, Oct 27, 2011 at 3:59 PM, DRC dcommander@users.sourceforge.net wrote:
On 10/27/11 2:30 PM, Siarhei Siamashka wrote:
Also, the Huffman decoder optimizations in libjpeg-turbo (which are C code, not SIMD) seem to provide only a barely measurable improvement on ARM, while the Huffman speedup is clearly more impressive on x86. As a result, this gives libjpeg-turbo more of an edge over IJG jpeg on x86.
In general, the Huffman codec improvements produce a greater speedup on 64-bit vs. 32-bit and a greater speedup when compressing vs. decompressing. So, whereas libjpeg-turbo's Huffman codec realizes about a 25-50% improvement vs. the libjpeg Huffman codec when doing compression using 64-bit code, it only realizes a few percent speedup vs. libjpeg when doing decompression using 32-bit code. The Huffman algorithm uses a single register as a bit bucket, and the fewer times it has to shift in new bits to that register, the faster it is. That's why it's so much faster on 64-bit vs. 32-bit.
The Huffman codec is probably the single biggest piece of low-hanging fruit in the entire code base, since it represents something like 40-50% of total execution time in many cases. I've spent hundreds of hours looking at it, and the basic problem with the 32-bit code seems to be register exhaustion. After trying many different approaches, the C code, as currently written, seems to produce the best possible performance on 32-bit x86 without sacrificing any performance on 64-bit x86. However, that doesn't mean that it couldn't be improved upon-- perhaps even dramatically-- by using hand-written assembly. Other codecs, such as the Intel Performance Primitives, manage to produce similar Huffman performance on both 64-bit and 32-bit. libjpeg-turbo can mostly match their 64-bit performance but not their 32-bit performance, which leads me to believe that they're doing something fundamentally different with their Huffman codec. Perhaps they are even using SIMD instructions, although I have spent much time investigating that as well, and I couldn't manage to find a method that didn't require moving data back and forth between the SIMD registers and the regular registers (because you can't branch when using SIMD instructions, and branching is somewhat critical to the Huffman algorithm.)
If someone could manage to fix, or even improve, the way registers are used in the 32-bit Huffman codec, it would greatly benefit both ARM and x86.
Here's one way that NEON could be employed to accelerate Huffman decoding. The most common 32 symbols typically account for over 99% of Huffman codes in a JPEG image, and are typically encoded with codons of length 2-10 bits. Four 128-bit registers can hold these 32 codons as left-justified 16-bit values. You can pull 64 bits of the decode buffer into a NEON register, and act on it as follows:
* replicate its leading 16 bits into all 8 slots of a uint16x8_t with a VDUP instruction;
* compare against the 32 left-justified codons using four VCGE instructions;
* sum up the resulting 0's and -1's with four VADDs and two VPADDs;
* negate the result with a VSUB to get either a table index from 0-31 or a 32 (indicating that it wasn't one of the first 32 codons);
* look this up in a table of codon sizes (32 bytes stored in two 128-bit registers) with a VTBL (this is the largest VTBL operation, and the reason why we limit ourselves to the first 32 codons); this gets you the number of bits by which to shift the decode buffer (64 bits is the largest block you can shift with a VSHL);
* in addition to performing the shift, VADD the shift to a running total of bits consumed, and VTST it into a "logical accumulator". The purpose of this is to detect whether the result of the VTBL lookup is ever zero, i.e., whether the index was ever out of range, implying that we hit a codon outside the first 32.
You might as well repeat this sequence 8 times, slotting the table indices one by one into a vector, without pausing to test for misses. One more VTBL maps these 8 indices to symbols (from another table of 32 bytes stored in two 128-bit registers). Then move the whole batch over to the ARM side along with the "logical accumulator" and the count of consumed bits, fix up the corner cases (codon outside the first 32, too many bits consumed), and set up for the next round.
I obviously haven't tested this yet, but I would expect it to significantly reduce the time spent in Huffman decoding, largely because it should drastically reduce the number of branch operations. On typical images, 9 times out of 10 through the loop, a single test-and-branch verifies that all 8 lookups succeeded and loops back. There may be a few branches in the code that shifts data into the decode buffer, but they are normally executed once per batch of 8 codons -- and even those can probably be converted to conditional operations by a sufficiently smart compiler.
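For concreteness, here is a rough, untested intrinsics sketch of just the classification step (assumed table layout; the VTBL/VSHL bookkeeping and the batching described above are left out):

#include <arm_neon.h>

/* `codons` holds the 32 most common Huffman codes, left-justified in 16 bits
 * and sorted ascending, spread across four uint16x8_t values.  `window` is
 * the leading 16 bits of the decode buffer. */
static unsigned classify_window(uint16_t window, const uint16x8_t codons[4])
{
    /* Replicate the window into all 8 lanes (VDUP). */
    uint16x8_t w = vdupq_n_u16(window);

    /* Compare against the 32 left-justified codons (four VCGEs): each lane
     * becomes 0xFFFF (-1) where window >= codon, 0 otherwise. */
    uint16x8_t ge0 = vcgeq_u16(w, codons[0]);
    uint16x8_t ge1 = vcgeq_u16(w, codons[1]);
    uint16x8_t ge2 = vcgeq_u16(w, codons[2]);
    uint16x8_t ge3 = vcgeq_u16(w, codons[3]);

    /* Sum the 0's and -1's (VADDs and VPADDs) and negate, giving the number
     * of table codons that are <= window, in the range 0..32. */
    uint16x8_t sum8 = vaddq_u16(vaddq_u16(ge0, ge1), vaddq_u16(ge2, ge3));
    uint16x4_t sum4 = vadd_u16(vget_low_u16(sum8), vget_high_u16(sum8));
    uint16x4_t sum2 = vpadd_u16(sum4, sum4);
    uint16x4_t sum1 = vpadd_u16(sum2, sum2);
    unsigned count = (uint16_t)(0 - vget_lane_u16(sum1, 0));

    /* In the full scheme above, this value indexes the codon-size and symbol
     * tables via VTBL, and out-of-range values are fixed up on the ARM side
     * by the ordinary scalar decoder. */
    return count;
}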
Anyone want to play at coding this up with me next week during Linaro Connect?
Cheers, - Michael
On Thu, Oct 27, 2011 at 12:59 PM, DRC dcommander@users.sourceforge.net wrote:
On 10/27/11 2:30 PM, Siarhei Siamashka wrote:
Also, the Huffman decoder optimizations in libjpeg-turbo (which are C code, not SIMD) seem to provide only a barely measurable improvement on ARM, while the Huffman speedup is clearly more impressive on x86. As a result, this gives libjpeg-turbo more of an edge over IJG jpeg on x86.
In general, the Huffman codec improvements produce a greater speedup on 64-bit vs. 32-bit and a greater speedup when compressing vs. decompressing. So, whereas libjpeg-turbo's Huffman codec realizes about a 25-50% improvement vs. the libjpeg Huffman codec when doing compression using 64-bit code, it only realizes a few percent speedup vs. libjpeg when doing decompression using 32-bit code. The Huffman algorithm uses a single register as a bit bucket, and the fewer times it has to shift in new bits to that register, the faster it is. That's why it's so much faster on 64-bit vs. 32-bit.
The Huffman codec is probably the single biggest piece of low-hanging fruit in the entire code base, since it represents something like 40-50% of total execution time in many cases. I've spent hundreds of hours looking at it, and the basic problem with the 32-bit code seems to be register exhaustion. After trying many different approaches, the C code, as currently written, seems to produce the best possible performance on 32-bit x86 without sacrificing any performance on 64-bit x86. However, that doesn't mean that it couldn't be improved upon-- perhaps even dramatically-- by using hand-written assembly. Other codecs, such as the Intel Performance Primitives, manage to produce similar Huffman performance on both 64-bit and 32-bit. libjpeg-turbo can mostly match their 64-bit performance but not their 32-bit performance, which leads me to believe that they're doing something fundamentally different with their Huffman codec. Perhaps they are even using SIMD instructions, although I have spent much time investigating that as well, and I couldn't manage to find a method that didn't require moving data back and forth between the SIMD registers and the regular registers (because you can't branch when using SIMD instructions, and branching is somewhat critical to the Huffman algorithm.)
If someone could manage to fix, or even improve, the way registers are used in the 32-bit Huffman codec, it would greatly benefit both ARM and x86.
On Thu, Oct 27, 2011 at 1:45 PM, Christian Robottom Reis kiko@linaro.org wrote:
On Wed, Oct 26, 2011 at 03:54:52PM -0500, Tom Gall wrote:
Hardware used includes the Freescale i.MX53 Quick Start board and an Intel Core 2 Duo in my Lenovo T400.
The results, including both the raw numbers and pretty graphs, can be found here.
Again, wow, thanks for such a thorough analysis. I think this is indeed very good material for discussing with Ubuntu. Do we have a session scheduled with them to talk about this?
The blueprint is: https://blueprints.launchpad.net/libjpeg-turbo/+spec/linaro-gfxmm-libjpeg-tu...
The session is set for Wednesday at 10am EST.
I have a question: any idea why the gap between 8c and turbo8 is so much more impressive on x86(_64) than on ARM? Could there be low-hanging fruit left in the NEON codepaths?
On 27/10/11 21:45, Christian Robottom Reis wrote:
Again, wow, thanks for such a thorough analysis. I think this is indeed very good material for discussing with Ubuntu. Do we have a session scheduled with them to talk about this?
Hi folks
there is a session created: https://blueprints.launchpad.net/libjpeg-turbo/+spec/linaro-gfxmm-libjpeg-tu...
Go subscribe to it... We may need to move it to a bigger room :-)
Best regards,