I have spent much time investigating that as well, and I couldn't manage to find a method that didn't require moving data back and forth between the SIMD registers and the regular registers (because you can't branch when using SIMD instructions, and branching is somewhat critical to the Huffman algorithm.)
You've probably looked at this, but on x86, the pmovmskb instruction (_mm_movemask_epi8() intrinsic) is pretty good for branching on the result of a SIMD compare.
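For readers unfamiliar with the pattern, here is a minimal sketch of branching on a SIMD compare via pmovmskb. It is an illustration only, not code from libjpeg-turbo. The helper name find_first_ff16 is hypothetical, and scanning for 0xFF bytes is just a convenient example (JPEG entropy-coded data uses 0xFF as the marker prefix). A scalar fallback is included so the sketch still builds where SSE2 is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: return the index of the first 0xFF byte in a
 * 16-byte block, or -1 if the block contains none. */
static int find_first_ff16(const uint8_t *block)
{
    assert(block != NULL);
#if HAVE_SSE2
    /* Compare all 16 bytes against 0xFF at once. */
    __m128i v  = _mm_loadu_si128((const __m128i *)block);
    __m128i eq = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)-1));

    /* pmovmskb: collapse the 16 compare results into a 16-bit integer
     * in a general-purpose register, one bit per byte lane. */
    int mask = _mm_movemask_epi8(eq);

    /* An ordinary scalar branch on the SIMD result. */
    if (mask == 0)
        return -1;

    /* The lowest set bit corresponds to the first matching byte. */
    int index = 0;
    while ((mask & 1) == 0) {
        mask >>= 1;
        index++;
    }
    return index;
#else
    /* Scalar fallback for targets without SSE2. */
    for (int i = 0; i < 16; i++) {
        if (block[i] == 0xFF)
            return i;
    }
    return -1;
#endif
}
```

Note that this still moves one value from a SIMD register to a regular register per 16 bytes, so it reduces the traffic described above rather than eliminating it.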
-Justin
On Thu, Oct 27, 2011 at 3:59 PM, DRC <dcommander@users.sourceforge.net> wrote:
On 10/27/11 2:30 PM, Siarhei Siamashka wrote:
Also, the Huffman decoder optimizations in libjpeg-turbo (which are C code, not SIMD) seem to provide only a barely measurable improvement on ARM, while the Huffman speedup is clearly more impressive on x86. As a result, libjpeg-turbo gains more over IJG jpeg on x86.
In general, the Huffman codec improvements produce a greater speedup on 64-bit vs. 32-bit and a greater speedup when compressing vs. decompressing. So, whereas libjpeg-turbo's Huffman codec realizes about a 25-50% improvement vs. the libjpeg Huffman codec when doing compression using 64-bit code, it only realizes a few percent speedup vs. libjpeg when doing decompression using 32-bit code. The Huffman algorithm uses a single register as a bit bucket, and the fewer times it has to shift in new bits to that register, the faster it is. That's why it's so much faster on 64-bit vs. 32-bit.
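To make the bit-bucket point concrete, here is a simplified sketch that counts how often the bucket must be refilled at two widths. It is not the libjpeg-turbo implementation: the bit_reader type and the br_init, br_get, and count_refills names are hypothetical, the 32-bit bucket is emulated inside a 64-bit integer, and JPEG details such as 0xFF byte stuffing are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit reader. The unread bits sit in the low `count` bits
 * of `bits`; `width` (32 or 64) emulates the size of the register. */
typedef struct {
    const uint8_t *src;
    size_t len, pos;
    uint64_t bits;
    int count;
    int width;
    unsigned refills;   /* how many times the bucket was topped up */
} bit_reader;

static void br_init(bit_reader *br, const uint8_t *src, size_t len,
                    int width)
{
    assert(width == 32 || width == 64);
    br->src = src;
    br->len = len;
    br->pos = 0;
    br->bits = 0;
    br->count = 0;
    br->width = width;
    br->refills = 0;
}

/* Shift in whole bytes until another byte would no longer fit.
 * Reads past the end of the input are padded with zero bytes. */
static void br_refill(bit_reader *br)
{
    br->refills++;
    while (br->count <= br->width - 8) {
        uint8_t byte = br->pos < br->len ? br->src[br->pos++] : 0;
        br->bits = (br->bits << 8) | byte;
        br->count += 8;
    }
}

/* Return the next n bits (1 <= n <= 16), most significant bit first. */
static unsigned br_get(bit_reader *br, int n)
{
    assert(n >= 1 && n <= 16);
    if (br->count < n)
        br_refill(br);
    br->count -= n;
    return (unsigned)((br->bits >> br->count) & ((1u << n) - 1u));
}

/* Read the whole buffer in nbits-sized pieces and report the number
 * of refills that were needed. */
static unsigned count_refills(const uint8_t *src, size_t len, int width,
                              int nbits)
{
    bit_reader br;
    br_init(&br, src, len, width);
    size_t total = len * 8 / (size_t)nbits;
    for (size_t i = 0; i < total; i++)
        (void)br_get(&br, nbits);
    return br.refills;
}
```

Reading 64 bytes in 4-bit pieces takes 16 refills with the 32-bit bucket and 8 with the 64-bit one. Each refill is a branch plus a run of loads and shifts on the hot path, which is the cost the wider register avoids.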
The Huffman codec is probably the single biggest piece of low-hanging fruit in the entire code base, since it represents something like 40-50% of total execution time in many cases. I've spent hundreds of hours looking at it, and the basic problem with the 32-bit code seems to be register exhaustion. After trying many different approaches, the C code, as currently written, seems to produce the best possible performance on 32-bit x86 without sacrificing any performance on 64-bit x86. However, that doesn't mean that it couldn't be improved upon, perhaps even dramatically, by using hand-written assembly. Other codecs, such as the Intel Performance Primitives, manage to produce similar Huffman performance on both 64-bit and 32-bit. libjpeg-turbo can mostly match their 64-bit performance but not their 32-bit performance, which leads me to believe that they're doing something fundamentally different with their Huffman codec. Perhaps they are even using SIMD instructions, although I have spent much time investigating that as well, and I couldn't manage to find a method that didn't require moving data back and forth between the SIMD registers and the regular registers (because you can't branch when using SIMD instructions, and branching is somewhat critical to the Huffman algorithm.)
If someone could manage to fix, or even improve, the way registers are used in the 32-bit Huffman codec, it would greatly benefit both ARM and x86.
_______________________________________________
Libjpeg-turbo-devel mailing list
Libjpeg-turbo-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/libjpeg-turbo-devel