I did some quick and dirty profiling of libpng decoding on a BeagleBoard-xM.
This is the result with one image:
    46.18%  pngbench  pngbench            [.] inflate_fast
    26.12%  pngbench  pngbench            [.] png_read_filter_row
     7.81%  pngbench  pngbench            [.] inflate
     5.65%  pngbench  pngbench            [.] memcpy
     4.26%  pngbench  pngbench            [.] adler32
     2.39%  pngbench  pngbench            [.] crc32
     1.78%  pngbench  [kernel.kallsyms]   [k] __copy_to_user
     1.76%  pngbench  [kernel.kallsyms]   [k] __do_softirq
     1.40%  pngbench  pngbench            [.] inflate_table
     1.02%  pngbench  [kernel.kallsyms]   [k] __memzero
And another:
    64.79%  pngbench  pngbench            [.] inflate_fast
     8.61%  pngbench  pngbench            [.] memcpy
     7.46%  pngbench  pngbench            [.] adler32
     5.10%  pngbench  pngbench            [.] crc32
     3.49%  pngbench  pngbench            [.] inflate
     3.16%  pngbench  [kernel.kallsyms]   [k] __copy_to_user
     1.33%  pngbench  [kernel.kallsyms]   [k] __memzero
And a third:
    47.00%  pngbench  pngbench            [.] png_read_filter_row
    28.52%  pngbench  pngbench            [.] inflate_fast
     5.12%  pngbench  pngbench            [.] memcpy
     4.23%  pngbench  pngbench            [.] crc32
     3.85%  pngbench  pngbench            [.] adler32
     1.60%  pngbench  [kernel.kallsyms]   [k] __memzero
     1.56%  pngbench  pngbench            [.] inflate_table
     1.50%  pngbench  [kernel.kallsyms]   [k] __copy_to_user
     1.38%  pngbench  [kernel.kallsyms]   [k] __do_softirq
     0.78%  pngbench  pngbench            [.] inflate
Two of these images are encoded with predictive filters, so png_read_filter_row() accounts for a substantial share of the decoding time. PNG defines several filter types of differing cost, which explains the varying amounts of time spent in that function above. When no filter is used, decoding time is dominated by zlib decompression.
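To illustrate what png_read_filter_row() has to do, here is a minimal sketch of reconstructing the PNG "Sub" filter (filter type 1 in the PNG specification); this follows the spec, not libpng's actual implementation. Each output byte depends on the reconstructed byte one pixel earlier in the row, and that serial dependency is exactly what a NEON version would have to work around.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of PNG "Sub" filter reconstruction, per the PNG spec:
 *   Recon(x) = Filt(x) + Recon(x - bpp)   (modulo 256)
 * where bpp is the number of bytes per pixel. The first bpp bytes
 * have no left neighbour and are left unchanged. */
void unfilter_sub(uint8_t *row, size_t rowbytes, size_t bpp)
{
    for (size_t i = bpp; i < rowbytes; i++)
        row[i] = (uint8_t)(row[i] + row[i - bpp]);
}
```

The other filter types (Up, Average, Paeth) additionally reference the previous row, which makes them somewhat friendlier to vectorisation than Sub.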
Two checksum functions feature in these profiles: Adler-32 is the checksum zlib uses to verify data integrity, and CRC-32 is the one used by the PNG container.
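For reference, Adler-32 as defined in RFC 1950 amounts to two running sums modulo 65521; the sketch below is a straightforward rendering of that definition, not zlib's optimised adler32(), which defers the expensive modulo across many bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u /* largest prime below 2^16, per RFC 1950 */

/* Naive Adler-32 over a buffer, starting from the standard
 * initial value a = 1, b = 0. */
uint32_t adler32_simple(const uint8_t *buf, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}
```

Both sums are independent of everything except the byte stream itself, so the checksums are comparatively easy to vectorise, and at a few percent each they are minor contributors anyway.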
Optimising the png_read_filter_row() function with NEON is possible in principle, although the effort of hooking this up in libpng might be non-trivial. Assuming a 4x speedup of this function, the overall decoding performance improvement would be up to ~1.6x, depending on the image. This should definitely be investigated further.
A worryingly large amount of time is also spent in memcpy(). If some of these calls could be eliminated, a further ~10% might be gained, though this is likely to be quite difficult.
Optimising zlib is of course also possible in theory, but is probably even more difficult.