Here's one way that NEON could be employed to accelerate Huffman decoding. The 32 most common symbols typically account for over 99% of the Huffman codes in a JPEG image, and are generally encoded with codons 2-10 bits long. Four 128-bit registers can hold these 32 codons as left-justified 16-bit values. You can pull 64 bits of the decode buffer into a NEON register and act on it as follows:
* replicate its leading 16 bits into all 8 slots of a uint16x8_t with a VDUP instruction;
* compare against the 32 left-justified codons using four VCGE instructions;
* sum up the resulting 0's and -1's with four VADDs and two VPADDs;
* negate the result with a VSUB to get either a table index from 0-31, or 32 (indicating that it wasn't one of the first 32 codons);
* look this up in a table of codon sizes (32 bytes stored in two 128-bit registers) with a VTBL (this is the largest VTBL operation, and the reason for limiting to the first 32 codons); this gives the number of bits by which to shift the decode buffer (64 bits is the largest block you can shift with a VSHL);
* in addition to performing the shift, VADD the shift amount to a running total of bits consumed, and VTST it into a "logical accumulator". The purpose of this is to detect whether the result of the VTBL lookup is ever zero, i.e. whether the index was ever out of range, implying that we hit a codon outside the first 32.
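To make the register traffic concrete, here is a rough, untested sketch of that single-symbol step in NEON intrinsics. The table names (codon_hi, codon_len_tbl) and the exact threshold layout are placeholders of mine, not anything in libjpeg-turbo, and for readability it round-trips the index through a scalar, which the real inner loop would keep in NEON registers:

#include <arm_neon.h>
#include <stdint.h>

/* Placeholder tables:
 *   codon_hi[4]   - 32 left-justified 16-bit thresholds, ascending, laid out so
 *                   that the count of thresholds <= the prefix gives the table index
 *   codon_len_tbl - 32 codon lengths in bits; VTBL returns 0 for an index >= 32 */
static uint16x8_t codon_hi[4];
static uint8x8x4_t codon_len_tbl;

/* Decode one codon from the top of a 64-bit decode buffer.  Returns the table
 * index (0-31) or 32 on a miss; advances the buffer and the bits-consumed count. */
static inline unsigned decode_one(uint64_t *bitbuf, unsigned *consumed)
{
    /* VDUP: replicate the leading 16 bits into all 8 lanes */
    uint16x8_t top = vdupq_n_u16((uint16_t)(*bitbuf >> 48));

    /* four VCGEs: compare against the 32 thresholds (each lane becomes 0 or -1) */
    uint16x8_t ge0 = vcgeq_u16(top, codon_hi[0]);
    uint16x8_t ge1 = vcgeq_u16(top, codon_hi[1]);
    uint16x8_t ge2 = vcgeq_u16(top, codon_hi[2]);
    uint16x8_t ge3 = vcgeq_u16(top, codon_hi[3]);

    /* four VADDs and two VPADDs: horizontal sum of the 0's and -1's */
    uint16x8_t sum8 = vaddq_u16(vaddq_u16(ge0, ge1), vaddq_u16(ge2, ge3));
    uint16x4_t sum4 = vadd_u16(vget_low_u16(sum8), vget_high_u16(sum8));
    sum4 = vpadd_u16(sum4, sum4);
    sum4 = vpadd_u16(sum4, sum4);

    /* negate: the match count is the table index, 0..32 */
    unsigned idx = (uint16_t)(0 - vget_lane_u16(sum4, 0));

    /* VTBL: look up the codon length; an out-of-range index yields 0 */
    uint8x8_t len_v = vtbl4_u8(codon_len_tbl, vdup_n_u8((uint8_t)idx));
    unsigned len = vget_lane_u8(len_v, 0);

    /* shift the decode buffer and keep the running total of bits consumed
     * (the real loop would also VTST len into the "logical accumulator") */
    *bitbuf <<= len;
    *consumed += len;
    return idx;
}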
You might as well repeat this sequence 8 times, slotting the table indices one by one into a vector, without pausing to test for misses. One more VTBL maps these 8 indices to symbols (from another table of 32 bytes stored in two 128-bit registers). Then move the whole batch over to the ARM side along with the "logical accumulator" and the count of consumed bits, fix up the corner cases (codon outside the first 32, too many bits consumed), and set up for the next round.
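Sketched the same way (again untested; symbol_tbl and the scalar loop are just placeholders for what would be straight-line NEON code), the batch step might look like:

/* 32 most common symbols, stored in codon order, for the final VTBL */
static uint8x8x4_t symbol_tbl;

/* Decode 8 codons back to back, map them to symbols with one more VTBL, and
 * return nonzero only if the whole batch can be accepted without fix-ups. */
static int decode_batch(uint64_t *bitbuf, unsigned *consumed, uint8_t symbols_out[8])
{
    uint8_t idx[8];
    unsigned miss = 0, start = *consumed;

    for (int i = 0; i < 8; i++) {
        unsigned k = decode_one(bitbuf, consumed);
        miss |= k >> 5;          /* stand-in for the VTST "logical accumulator" */
        idx[i] = (uint8_t)k;
    }

    /* one more VTBL maps all 8 indices to symbols at once */
    vst1_u8(symbols_out, vtbl4_u8(symbol_tbl, vld1_u8(idx)));

    /* fix-up check on the ARM side: a codon outside the first 32, or more
     * bits consumed than the 64-bit buffer actually held */
    return miss == 0 && (*consumed - start) <= 64;
}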
I obviously haven't tested this yet, but I would expect it to significantly reduce the time spent in Huffman decoding, largely because it should drastically reduce the number of branch operations. On typical images, nine times out of ten through the loop, a single test-and-branch verifies that all 8 lookups succeeded and loops back. There may be a few branches in the code that shifts new data into the decode buffer, but those normally execute only once per batch of 8 codons -- and even they could probably be converted to conditional operations by a sufficiently smart compiler.
Anyone want to play at coding this up with me next week during Linaro Connect?
Cheers, - Michael
On Thu, Oct 27, 2011 at 12:59 PM, DRC <dcommander@users.sourceforge.net> wrote:
On 10/27/11 2:30 PM, Siarhei Siamashka wrote:
Also huffman decoder optimizations (which are C code, not SIMD) in libjpeg-turbo seem to be providing only some barely measurable improvement on ARM, while huffman speedup is clearly more impressive on x86. This gives libjpeg-turbo more points over IJG jpeg on x86 as a result.
In general, the Huffman codec improvements produce a greater speedup on 64-bit vs. 32-bit and a greater speedup when compressing vs. decompressing. So, whereas libjpeg-turbo's Huffman codec realizes about a 25-50% improvement vs. the libjpeg Huffman codec when doing compression using 64-bit code, it only realizes a few percent speedup vs. libjpeg when doing decompression using 32-bit code. The Huffman algorithm uses a single register as a bit bucket, and the fewer times it has to shift in new bits to that register, the faster it is. That's why it's so much faster on 64-bit vs. 32-bit.
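(For illustration only -- this isn't libjpeg-turbo's actual code -- the bit-bucket scheme described above boils down to something like the following, and a 64-bit bucket simply reaches the refill loop half as often as a 32-bit one:)

#include <stdint.h>

typedef uint64_t bit_buf;   /* uint32_t on a 32-bit build: refills twice as often */

static bit_buf bucket;      /* bits awaiting decode, left-justified */
static int bits_left;       /* number of valid bits currently in the bucket */

static void refill(const uint8_t **src)
{
    /* shift in whole bytes until the bucket can't hold another one */
    while (bits_left <= (int)(sizeof(bit_buf) * 8 - 8)) {
        bucket |= (bit_buf)*(*src)++ << (sizeof(bit_buf) * 8 - 8 - bits_left);
        bits_left += 8;
    }
}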
The Huffman codec is probably the single biggest piece of low-hanging fruit in the entire code base, since it represents something like 40-50% of total execution time in many cases. I've spent hundreds of hours looking at it, and the basic problem with the 32-bit code seems to be register exhaustion. After trying many different approaches, the C code, as currently written, seems to produce the best possible performance on 32-bit x86 without sacrificing any performance on 64-bit x86. However, that doesn't mean that it couldn't be improved upon-- perhaps even dramatically-- by using hand-written assembly. Other codecs, such as the Intel Performance Primitives, manage to produce similar Huffman performance on both 64-bit and 32-bit. libjpeg-turbo can mostly match their 64-bit performance but not their 32-bit performance, which leads me to believe that they're doing something fundamentally different with their Huffman codec. Perhaps they are even using SIMD instructions, although I have spent much time investigating that as well, and I couldn't manage to find a method that didn't require moving data back and forth between the SIMD registers and the regular registers (because you can't branch when using SIMD instructions, and branching is somewhat critical to the Huffman algorithm.)
If someone could manage to fix, or even improve, the way registers are used in the 32-bit Huffman codec, it would greatly benefit both ARM and x86.