I still posit that it's possible to avoid many of those inefficiencies by using a sufficiently large buffer in libjpeg-turbo and using an in-memory source/destination manager. Much of the inefficiency in the code relates to the buffering that it does to avoid reading the entire image into memory.
I also hasten to point out that not all of the compute-intensive parts of the code are NEON-accelerated. The general speedup we're seeing in NEON vs non-NEON is about 1.5-2x rather than the 3-4x we see with x86-64. Not sure whether ARM is 64-bit, but using 64-bit code will improve Huffman en/decoding performance significantly. It may also be the case that the hand-tuned code I wrote in the Huffman codec is making performance assumptions based on x86 that aren't true for ARM. It would be interesting to see what the speedup is with the unoptimized Huffman code out of libjpeg. At least on x86, Huffman can account for 40% of the compute time, so optimizing it further has a potentially big pay-off. However, I've personally spent hundreds of hours getting it where it is, and I have a gut feeling that further optimization of it would require dropping down to assembly.
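To put a rough number on that pay-off, here is a minimal sketch that applies Amdahl's law to the 40% figure above. The 40% share is the x86 number quoted in this message and may not hold on ARM, and the stage speedups in the loop are purely hypothetical values chosen for illustration.

```python
def overall_speedup(fraction, stage_speedup):
    """Amdahl's law: whole-run speedup when only `fraction` of the time gets faster."""
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)


# Share of decode time spent in Huffman coding (x86 figure from the message;
# assumed, not measured, for ARM).
HUFFMAN_FRACTION = 0.40

# Hypothetical speedups of the Huffman stage alone.
for s in (1.5, 2.0, 4.0):
    total = overall_speedup(HUFFMAN_FRACTION, s)
    print(f"Huffman {s:.1f}x faster -> whole decode {total:.2f}x faster")

# Ceiling if Huffman coding cost nothing at all.
upper_bound = 1.0 / (1.0 - HUFFMAN_FRACTION)
print(f"upper bound: {upper_bound:.2f}x")
```

Under these assumptions, even an infinitely fast Huffman stage caps the overall gain at about 1.67x, and doubling Huffman speed gives 1.25x overall.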
On Jun 29, 2011, at 3:03 PM, Måns Rullgård mans@mansr.com wrote:
Vladimir Pantelic vladoman@gmail.com writes:
Mandeep Kumar wrote:
Hi All,
I have done some benchmarking on an OMAP4 running Ubuntu for various versions of libjpeg. Benchmarks were collected with a modified version of djpeg that prints the time taken for decoding in ms. The sample used for benchmarking is a 12MP image downloaded from a photography website. Here are the results:
...
libjpeg-turbo trunk version that has NEON patches (5 runs). *http://libjpeg-turbo.svn.sourceforge.net/viewvc/libjpeg-turbo/*
Decoding Time for Run 1: 1068 ms
Decoding Time for Run 2: 1065 ms
Decoding Time for Run 3: 1093 ms
Decoding Time for Run 4: 1066 ms
Decoding Time for Run 5: 1067 ms
*Median Decoding Time: 1067 ms*
One remark:
a 12MP image decoded in 1067ms equals ~12MP/s decoding speed.
decoding a 640x480 MJPEG file on a 1GHz OMAP4 using libavcodec gives me an average decoding time per frame of ~10ms which yields:
640x480/10ms = ~30MP/s
so roughly 2.5 times faster.
Either I am doing something wrong or this libjpeg-turbo is not so turbo.
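The throughput comparison above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes "12MP" means exactly 12,000,000 pixels, takes the 1067 ms median for libjpeg-turbo, and uses the ~10 ms per 640x480 frame quoted for libavcodec. The function and variable names are made up for illustration.

```python
def megapixels_per_second(pixels, milliseconds):
    """Decoding throughput in megapixels per second."""
    return (pixels / 1e6) / (milliseconds / 1000.0)


# Assumption: the "12MP" sample is taken as exactly 12e6 pixels.
turbo_mps = megapixels_per_second(12_000_000, 1067)  # libjpeg-turbo median run
lavc_mps = megapixels_per_second(640 * 480, 10)      # libavcodec MJPEG frame

ratio = lavc_mps / turbo_mps

print(f"libjpeg-turbo: {turbo_mps:.1f} MP/s")
print(f"libavcodec:    {lavc_mps:.1f} MP/s")
print(f"ratio:         {ratio:.1f}x")
```

With these inputs the figures come out at about 11.2 MP/s and 30.7 MP/s, a ratio near 2.7, in line with the rounded estimate above. The two measurements use different images, so this is a rough comparison at best.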
Libjpeg (turbo or regular) is full of inefficiencies. I guess they all add up.
-- Måns Rullgård mans@mansr.com