Re: [Libjpeg-turbo-devel] libjpeg8c vs libjpeg-turbo with libjpeg8 compat on

27 Oct 2011

      ...
I have spent much time investigating
that as well, and I couldn't manage to find a method that didn't require
moving data back and forth between the SIMD registers and the regular
registers (because you can't branch when using SIMD instructions, and
branching is somewhat critical to the Huffman algorithm.)
You've probably looked at this, but on x86, the pmovmskb instruction
(_mm_movemask_epi8() intrinsic) is pretty good for branching on the
result of a SIMD compare.
-Justin
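For readers unfamiliar with the instruction Justin mentions, here is a minimal sketch of branching on the result of an SSE2 byte compare via _mm_movemask_epi8(). The function name first_ff_in_block() and the choice of scanning for 0xFF bytes (which need special handling in a JPEG entropy-coded stream) are illustrative assumptions, not code from libjpeg-turbo. Note that the mask still has to travel from a SIMD register into a general-purpose register, which is the cost DRC describes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86 only) */

/* Hypothetical helper: return the index of the first 0xFF byte in the
 * 16 bytes at p, or -1 if the block contains none. */
static int first_ff_in_block(const unsigned char *p)
{
    __m128i block = _mm_loadu_si128((const __m128i *)p);  /* unaligned load */
    __m128i ff    = _mm_set1_epi8((char)0xFF);
    __m128i eq    = _mm_cmpeq_epi8(block, ff);  /* 0xFF in each matching lane */

    /* pmovmskb: gather the top bit of every byte lane into a 16-bit
     * integer held in a general-purpose register. */
    int mask = _mm_movemask_epi8(eq);

    if (mask == 0)      /* ordinary scalar branch on the SIMD result */
        return -1;

    int i = 0;
    while (!(mask & 1)) {   /* portable scan for the lowest set bit */
        mask >>= 1;
        i++;
    }
    return i;
}
```

A caller could use the -1 result to take a fast path that copies all 16 bytes at once and fall back to byte-at-a-time handling only when an 0xFF is present.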
On Thu, Oct 27, 2011 at 3:59 PM, DRC dcommander@users.sourceforge.net wrote:
...
On 10/27/11 2:30 PM, Siarhei Siamashka wrote:
...
Also huffman decoder optimizations (which are C code, not SIMD) in
libjpeg-turbo seem to be providing only some barely measurable
improvement on ARM, while huffman speedup is clearly more impressive
on x86. This gives libjpeg-turbo more points over IJG jpeg on x86 as a
result.
In general, the Huffman codec improvements produce a greater speedup on
64-bit vs. 32-bit and a greater speedup when compressing vs.
decompressing.  So, whereas libjpeg-turbo's Huffman codec realizes about
a 25-50% improvement vs. the libjpeg Huffman codec when doing
compression using 64-bit code, it only realizes a few percent speedup
vs. libjpeg when doing decompression using 32-bit code.  The Huffman
algorithm uses a single register as a bit bucket, and the fewer times it
has to shift in new bits to that register, the faster it is.  That's why
it's so much faster on 64-bit vs. 32-bit.
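The bit-bucket idea can be sketched roughly as follows. This is a simplified illustration, not libjpeg-turbo's actual encoder: the names bit_bucket, put_bits(), flush_bytes() and finish() are hypothetical, the register width is modelled by a capacity field rather than a real 32-bit or 64-bit variable, and JPEG's 0xFF byte stuffing is left out. The flushes counter shows how much less often a wider bucket has to be emptied.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit bucket; capacity models the register width (32 or 64). */
typedef struct {
    uint64_t bits;          /* accumulator; the low `count` bits are valid */
    int count;              /* number of valid bits currently held */
    int capacity;           /* 32 or 64 */
    unsigned char *out;     /* output buffer (caller guarantees enough room) */
    size_t out_len;         /* bytes written so far */
    unsigned long flushes;  /* how many times the bucket was emptied */
} bit_bucket;

/* Write out every whole byte held in the bucket, most significant first. */
static void flush_bytes(bit_bucket *b)
{
    while (b->count >= 8) {
        b->count -= 8;
        b->out[b->out_len++] = (unsigned char)(b->bits >> b->count);
    }
    b->flushes++;
}

/* Append the low `size` bits of `code` (size <= 16, as for Huffman codes). */
static void put_bits(bit_bucket *b, uint32_t code, int size)
{
    if (b->count + size > b->capacity)  /* no room left: empty the bucket */
        flush_bytes(b);
    b->bits = (b->bits << size) | (code & ((1u << size) - 1));
    b->count += size;
}

/* Pad the final partial byte with 1-bits and flush what remains. */
static void finish(bit_bucket *b)
{
    int pad = (8 - b->count % 8) % 8;
    put_bits(b, (1u << pad) - 1, pad);
    flush_bytes(b);
}
```

Feeding the same sequence of codes through a bucket with capacity 32 and one with capacity 64 produces identical output bytes, but the 64-bit bucket takes the flush branch less than half as often.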
The Huffman codec is probably the single biggest piece of low-hanging
fruit in the entire code base, since it represents something like 40-50%
of total execution time in many cases.  I've spent hundreds of hours
looking at it, and the basic problem with the 32-bit code seems to be
register exhaustion.  After trying many different approaches, the C
code, as currently written, seems to produce the best possible
performance on 32-bit x86 without sacrificing any performance on 64-bit
x86.  However, that doesn't mean that it couldn't be improved upon--
perhaps even dramatically-- by using hand-written assembly.  Other
codecs, such as the Intel Performance Primitives, manage to produce
similar Huffman performance on both 64-bit and 32-bit.  libjpeg-turbo
can mostly match their 64-bit performance but not their 32-bit
performance, which leads me to believe that they're doing something
fundamentally different with their Huffman codec.  Perhaps they are even
using SIMD instructions, although I have spent much time investigating
that as well, and I couldn't manage to find a method that didn't require
moving data back and forth between the SIMD registers and the regular
registers (because you can't branch when using SIMD instructions, and
branching is somewhat critical to the Huffman algorithm.)
If someone could manage to fix, or even improve, the way registers are
used in the 32-bit Huffman codec, it would greatly benefit both ARM and x86.

