Here's one way that NEON could be employed to accelerate Huffman decoding. The 32 most common symbols typically account for over 99% of the Huffman codes in a JPEG image, and are generally encoded with codons 2-10 bits long. Four 128-bit registers can hold these 32 codons as left-justified 16-bit values. You can pull 64 bits of the decode buffer into a NEON register and act on it as follows:
* replicate its leading 16 bits into all 8 slots of a uint16x8_t with a VDUP instruction;
* compare against the 32 left-justified codons using four VCGE instructions;
* sum up the resulting 0's and -1's with four VADDs and two VPADDs;
* negate the result with a VSUB to get either a table index from 0-31, or 32 (indicating that it wasn't one of the first 32 codons);
* look this up in a table of codon sizes (32 bytes stored in two 128-bit registers) with a VTBL (this is the largest VTBL operation, and the reason for limiting to the first 32 codons); this gives the number of bits by which to shift the decode buffer (64 bits is the largest block you can shift with a VSHL);
* in addition to performing the shift, VADD the shift amount to a running total of bits consumed, and VTST it into a "logical accumulator". The purpose of this is to detect whether the result of the VTBL lookup is ever zero, i.e. whether the index was ever out of range, implying that we hit a codon outside the first 32.
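To make the register traffic concrete, here is a rough, untested sketch of that single-symbol step in NEON intrinsics. The table names (codon_hi, codon_len_tbl) and the exact threshold layout are placeholders of mine, not anything in libjpeg-turbo, and for readability it round-trips the index through a scalar, which the real inner loop would keep in NEON registers:

#include <arm_neon.h>
#include <stdint.h>

/* Placeholder tables:
 *   codon_hi[4]   - 32 left-justified 16-bit thresholds, ascending, laid out so
 *                   that the count of thresholds <= the prefix gives the table index
 *   codon_len_tbl - 32 codon lengths in bits; VTBL returns 0 for an index >= 32 */
static uint16x8_t codon_hi[4];
static uint8x8x4_t codon_len_tbl;

/* Decode one codon from the top of a 64-bit decode buffer.  Returns the table
 * index (0-31) or 32 on a miss; advances the buffer and the bits-consumed count. */
static inline unsigned decode_one(uint64_t *bitbuf, unsigned *consumed)
{
    /* VDUP: replicate the leading 16 bits into all 8 lanes */
    uint16x8_t top = vdupq_n_u16((uint16_t)(*bitbuf >> 48));

    /* four VCGEs: compare against the 32 thresholds (each lane becomes 0 or -1) */
    uint16x8_t ge0 = vcgeq_u16(top, codon_hi[0]);
    uint16x8_t ge1 = vcgeq_u16(top, codon_hi[1]);
    uint16x8_t ge2 = vcgeq_u16(top, codon_hi[2]);
    uint16x8_t ge3 = vcgeq_u16(top, codon_hi[3]);

    /* four VADDs and two VPADDs: horizontal sum of the 0's and -1's */
    uint16x8_t sum8 = vaddq_u16(vaddq_u16(ge0, ge1), vaddq_u16(ge2, ge3));
    uint16x4_t sum4 = vadd_u16(vget_low_u16(sum8), vget_high_u16(sum8));
    sum4 = vpadd_u16(sum4, sum4);
    sum4 = vpadd_u16(sum4, sum4);

    /* negate: the match count is the table index, 0..32 */
    unsigned idx = (uint16_t)(0 - vget_lane_u16(sum4, 0));

    /* VTBL: look up the codon length; an out-of-range index yields 0 */
    uint8x8_t len_v = vtbl4_u8(codon_len_tbl, vdup_n_u8((uint8_t)idx));
    unsigned len = vget_lane_u8(len_v, 0);

    /* shift the decode buffer and keep the running total of bits consumed
     * (the real loop would also VTST len into the "logical accumulator") */
    *bitbuf <<= len;
    *consumed += len;
    return idx;
}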
You might as well repeat this sequence 8 times, slotting the table indices one by one into a vector, without pausing to test for misses. One more VTBL maps these 8 indices to symbols (from another table of 32 bytes stored in two 128-bit registers). Then move the whole batch over to the ARM side along with the "logical accumulator" and the count of consumed bits, fix up the corner cases (codon outside the first 32, too many bits consumed), and set up for the next round.
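Sketched the same way (again untested; symbol_tbl and the scalar loop are just placeholders for what would be straight-line NEON code), the batch step might look like:

/* 32 most common symbols, stored in codon order, for the final VTBL */
static uint8x8x4_t symbol_tbl;

/* Decode 8 codons back to back, map them to symbols with one more VTBL, and
 * return nonzero only if the whole batch can be accepted without fix-ups. */
static int decode_batch(uint64_t *bitbuf, unsigned *consumed, uint8_t symbols_out[8])
{
    uint8_t idx[8];
    unsigned miss = 0, start = *consumed;

    for (int i = 0; i < 8; i++) {
        unsigned k = decode_one(bitbuf, consumed);
        miss |= k >> 5;          /* stand-in for the VTST "logical accumulator" */
        idx[i] = (uint8_t)k;
    }

    /* one more VTBL maps all 8 indices to symbols at once */
    vst1_u8(symbols_out, vtbl4_u8(symbol_tbl, vld1_u8(idx)));

    /* fix-up check on the ARM side: a codon outside the first 32, or more
     * bits consumed than the 64-bit buffer actually held */
    return miss == 0 && (*consumed - start) <= 64;
}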
I obviously haven't tested this yet, but I would expect it to significantly reduce the time spent in Huffman decoding, largely because it should drastically reduce the number of branch operations. On typical images, nine times out of ten through the loop, a single test-and-branch verifies that all 8 lookups succeeded and loops back. There may be a few branches in the code that shifts new data into the decode buffer, but those normally execute only once per batch of 8 codons -- and even they could probably be converted to conditional operations by a sufficiently smart compiler.
Anyone want to play at coding this up with me next week during Linaro Connect?
Cheers, - Michael
On Thu, Oct 27, 2011 at 12:59 PM, DRC <dcommander@users.sourceforge.net> wrote:
On 10/27/11 2:30 PM, Siarhei Siamashka wrote:
Also huffman decoder optimizations (which are C code, not SIMD) in libjpeg-turbo seem to be providing only some barely measurable improvement on ARM, while huffman speedup is clearly more impressive on x86. This gives libjpeg-turbo more points over IJG jpeg on x86 as a result.
In general, the Huffman codec improvements produce a greater speedup on 64-bit vs. 32-bit and a greater speedup when compressing vs. decompressing. So, whereas libjpeg-turbo's Huffman codec realizes about a 25-50% improvement vs. the libjpeg Huffman codec when doing compression using 64-bit code, it only realizes a few percent speedup vs. libjpeg when doing decompression using 32-bit code. The Huffman algorithm uses a single register as a bit bucket, and the fewer times it has to shift in new bits to that register, the faster it is. That's why it's so much faster on 64-bit vs. 32-bit.
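(For illustration only -- this isn't libjpeg-turbo's actual code -- the bit-bucket scheme described above boils down to something like the following, and a 64-bit bucket simply reaches the refill loop half as often as a 32-bit one:)

#include <stdint.h>

typedef uint64_t bit_buf;   /* uint32_t on a 32-bit build: refills twice as often */

static bit_buf bucket;      /* bits awaiting decode, left-justified */
static int bits_left;       /* number of valid bits currently in the bucket */

static void refill(const uint8_t **src)
{
    /* shift in whole bytes until the bucket can't hold another one */
    while (bits_left <= (int)(sizeof(bit_buf) * 8 - 8)) {
        bucket |= (bit_buf)*(*src)++ << (sizeof(bit_buf) * 8 - 8 - bits_left);
        bits_left += 8;
    }
}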
The Huffman codec is probably the single biggest piece of low-hanging fruit in the entire code base, since it represents something like 40-50% of total execution time in many cases. I've spent hundreds of hours looking at it, and the basic problem with the 32-bit code seems to be register exhaustion. After trying many different approaches, the C code, as currently written, seems to produce the best possible performance on 32-bit x86 without sacrificing any performance on 64-bit x86. However, that doesn't mean that it couldn't be improved upon-- perhaps even dramatically-- by using hand-written assembly. Other codecs, such as the Intel Performance Primitives, manage to produce similar Huffman performance on both 64-bit and 32-bit. libjpeg-turbo can mostly match their 64-bit performance but not their 32-bit performance, which leads me to believe that they're doing something fundamentally different with their Huffman codec. Perhaps they are even using SIMD instructions, although I have spent much time investigating that as well, and I couldn't manage to find a method that didn't require moving data back and forth between the SIMD registers and the regular registers (because you can't branch when using SIMD instructions, and branching is somewhat critical to the Huffman algorithm.)
If someone could manage to fix, or even improve, the way registers are used in the 32-bit Huffman codec, it would greatly benefit both ARM and x86.