[Libjpeg-turbo-devel] libjpeg8c vs libjpeg-turbo with libjpeg8 compat on
Michael K. Edwards
m.k.edwards at gmail.com
Thu Oct 27 23:23:13 UTC 2011
Here's one way that NEON could be employed to accelerate Huffman
decoding. The most common 32 symbols typically account for over 99%
of Huffman codes in a JPEG image, and are typically encoded with
codons of length 2-10 bits. Four 128-bit registers can hold these 32
codons as left-justified 16-bit values. You can pull 64 bits of the
decode buffer into a NEON register, and act on it as follows:
* replicate its leading 16 bits into all 8 slots in a uint16x8_t
with a VDUP instruction;
* compare against the 32 left-justified codons using four VCGE instructions;
* sum up the resulting 0's and -1's with four VADDs and two VPADDs;
* negate the result with a VSUB to get either a table index from
0-31 or a 32 (indicating that it wasn't one of the first 32 codons);
* look this up in a table of codon sizes (32 bytes stored in two
128-bit registers) with a VTBL (this is the largest VTBL operation,
and the reason why we limit to the first 32 codons); this gets you a
number of bits by which to shift the decode buffer (64 bits is the
largest block you can shift with a VSHL);
* in addition to performing the shift, VADD the shift to a running
total of bits consumed, and VTST it into a "logical accumulator". The
purpose of this is to detect whether the result of the VTBL look up is
ever zero, i. e., if the index was ever out of range, implying that we
hit a codon outside the first 32.
You might as well repeat this sequence 8 times, slotting the table
indices one by one into a vector, without pausing to test for misses.
One more VTBL maps these 8 indices to symbols (from another table of
32 bytes stored in two 128-bit registers). Then move the whole batch
over to the ARM side along with the "logical accumulator" and the
count of consumed bits, fix up the corner cases (codon outside the
first 32, too many bits consumed), and set up for the next round.
I obviously haven't tested this yet; but I would expect it to
significantly reduce the time spent in Huffman decoding, largely
because it should drastically reduce the number of branch operations.
On typical images, 9 times out of 10 through the loop, a single
test-and-branch verifies that all 8 lookups succeeded and loops back.
There may be a few branches in the code that shifts data into the
decode buffer, but they are normally executed once per batch of 8
codons -- and even those can probably be converted to conditional
operations by a sufficiently smart compiler.
Anyone want to play at coding this up with me next week during Linaro Connect?
On Thu, Oct 27, 2011 at 12:59 PM, DRC <dcommander at users.sourceforge.net> wrote:
> On 10/27/11 2:30 PM, Siarhei Siamashka wrote:
>> Also huffman decoder optimizations (which are C code, not SIMD) in
>> libjpeg-turbo seem to be providing only some barely measurable
>> improvement on ARM, while huffman speedup is clearly more impressive
>> on x86. This gives libjpeg-turbo more points over IJG jpeg on x86 as a
> In general, the Huffman codec improvements produce a greater speedup on
> 64-bit vs. 32-bit and a greater speedup when compressing vs.
> decompressing. So, whereas libjpeg-turbo's Huffman codec realizes about
> a 25-50% improvement vs. the libjpeg Huffman codec when doing
> compression using 64-bit code, it only realizes a few percent speedup
> vs. libjpeg when doing decompression using 32-bit code. The Huffman
> algorithm uses a single register as a bit bucket, and the fewer times it
> has to shift in new bits to that register, the faster it is. That's why
> it's so much faster on 64-bit vs. 32-bit.
> The Huffman codec is probably the single biggest piece of low-hanging
> fruit in the entire code base, since it represents something like 40-50%
> of total execution time in many cases. I've spent hundreds of hours
> looking at it, and the basic problem with the 32-bit code seems to be
> register exhaustion. After trying many different approaches, the C
> code, as currently written, seems to produce the best possible
> performance on 32-bit x86 without sacrificing any performance on 64-bit
> x86. However, that doesn't mean that it couldn't be improved upon--
> perhaps even dramatically-- by using hand-written assembly. Other
> codecs, such as the Intel Performance Primitives, manage to produce
> similar Huffman performance on both 64-bit and 32-bit. libjpeg-turbo
> can mostly match their 64-bit performance but not their 32-bit
> performance, which leads me to believe that they're doing something
> fundamentally different with their Huffman codec. Perhaps they are even
> using SIMD instructions, although I have spent much time investigating
> that as well, and I couldn't manage to find a method that didn't require
> moving data back and forth between the SIMD registers and the regular
> registers (because you can't branch when using SIMD instructions, and
> branching is somewhat critical to the Huffman algorithm.)
> If someone could manage to fix, or even improve, the way registers are
> used in the 32-bit Huffman codec, it would greatly benefit both ARM and x86.
> The demand for IT networking professionals continues to grow, and the
> demand for specialized networking skills is growing even more rapidly.
> Take a complimentary Learning at Cisco Self-Assessment and learn
> about Cisco certifications, training, and career opportunities.
> Libjpeg-turbo-devel mailing list
> Libjpeg-turbo-devel at lists.sourceforge.net
More information about the linaro-dev