An interesting revelation while benchmarking all of this is that our
chunkmemset_avx seems to be slower than chunkmemset_sse in a lot of use
cases. That will be an interesting function to attempt to optimize.
Right now though, we're basically beating Google for all PNG decode and
encode benchmarks. Some combinations of flags have us basically trading
blows, but we're as much as about 14% faster than Chromium's zlib
patches.
While we're here, add a more direct benchmark of the folded copy method
versus the explicit copy + checksum.
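
As a rough illustration of the two paths being compared, here is a
minimal scalar sketch. It is not the code from this change: the names
adler32_sketch, copy_then_checksum_sketch and adler32_fold_copy_sketch
are hypothetical, and the real routines are expected to be vectorized.
The explicit path copies the buffer and then walks the source a second
time to checksum it; the folded path loads each byte once, stores it and
accumulates it in the same loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ADLER_BASE 65521u /* largest prime below 2^16 */
#define ADLER_NMAX 5552u  /* bytes that can be summed before reducing */

/* Plain Adler-32 over src (hypothetical name, scalar reference). */
static uint32_t adler32_sketch(uint32_t adler, const uint8_t *src, size_t len) {
    uint32_t a = adler & 0xffffu;
    uint32_t b = (adler >> 16) & 0xffffu;
    while (len > 0) {
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n--) {
            a += *src++;
            b += a;
        }
        a %= ADLER_BASE;
        b %= ADLER_BASE;
    }
    return (b << 16) | a;
}

/* Explicit path: copy first, then make a second pass for the checksum. */
static uint32_t copy_then_checksum_sketch(uint32_t adler, uint8_t *dst,
                                          const uint8_t *src, size_t len) {
    assert(len == 0 || (dst != NULL && src != NULL));
    memcpy(dst, src, len);
    return adler32_sketch(adler, src, len);
}

/* Folded path: one pass that stores and sums each byte as it is loaded. */
static uint32_t adler32_fold_copy_sketch(uint32_t adler, uint8_t *dst,
                                         const uint8_t *src, size_t len) {
    uint32_t a = adler & 0xffffu;
    uint32_t b = (adler >> 16) & 0xffffu;
    assert(len == 0 || (dst != NULL && src != NULL));
    while (len > 0) {
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;
        while (n--) {
            uint8_t c = *src++;
            *dst++ = c;
            a += c;
            b += a;
        }
        a %= ADLER_BASE;
        b %= ADLER_BASE;
    }
    return (b << 16) | a;
}
```

Both paths must return the same checksum and leave the same bytes in the
destination; a benchmark of this kind only measures how much the second
pass over the source costs.
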
We are guarding its usage behind a lot of preprocessor macros, as the
other methods are not yet implemented and calling this version
implicitly bypasses the faster adler implementations.
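
The guard described above might look roughly like the following sketch.
Everything here is an assumption for illustration: the macro
FOLD_COPY_SKETCH, the struct functable_sketch and the function names are
hypothetical and not taken from the tree. The idea is that the folded
routine is only reachable when the build opts in and an entry has been
populated; otherwise the caller falls back to memcpy plus whatever
checksum implementation the table already selects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t (*checksum_fn)(uint32_t adler, const uint8_t *src, size_t len);
typedef uint32_t (*fold_copy_fn)(uint32_t adler, uint8_t *dst,
                                 const uint8_t *src, size_t len);

/* Hypothetical dispatch table; fold_copy stays NULL until a variant exists. */
struct functable_sketch {
    checksum_fn checksum;
    fold_copy_fn fold_copy;
};

/* Simple reference Adler-32 standing in for a faster vectorized version. */
static uint32_t adler32_ref_sketch(uint32_t adler, const uint8_t *src, size_t len) {
    uint32_t a = adler & 0xffffu;
    uint32_t b = (adler >> 16) & 0xffffu;
    for (size_t i = 0; i < len; i++) {
        a = (a + src[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Copy src to dst and return the updated checksum. */
static uint32_t copy_with_checksum_sketch(const struct functable_sketch *ft,
                                          uint32_t adler, uint8_t *dst,
                                          const uint8_t *src, size_t len) {
    assert(ft != NULL && ft->checksum != NULL);
#ifdef FOLD_COPY_SKETCH /* hypothetical build-time guard */
    if (ft->fold_copy != NULL)
        return ft->fold_copy(adler, dst, src, len);
#endif
    /* Fallback keeps using the (possibly vectorized) checksum entry. */
    memcpy(dst, src, len);
    return ft->checksum(adler, src, len);
}
```

Once every target has a folded variant, the #ifdef can go away and the
NULL check alone, or no check at all, decides which path runs.
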
When more versions are written for faster vectorizations, the functable
entries will be populated and the preprocessor macros removed. This
round, the copy + checksum is not employing as many tricks as one would
hope with a "folded" checksum routine. The reason for this is the
particularly tricky case of dealing with unaligned buffers.
Implementations that only target CPUs without a huge penalty for
unaligned loads will be much faster.
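
One common way to cope with an unaligned source, shown here only as a
sketch and not necessarily what this change does, is to peel off a
scalar head until the pointer reaches a 16-byte boundary, run the vector
loop over whole 16-byte blocks, and finish with a scalar tail. The
helper names and the split_sketch type below are hypothetical. Note that
this only aligns the source; in a copy + checksum routine the
destination can still be misaligned by a different amount, which is part
of what makes the case tricky.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to advance p to the next 16-byte boundary (0 if aligned). */
static size_t bytes_to_align16_sketch(const void *p) {
    return (size_t)((16u - ((uintptr_t)p & 15u)) & 15u);
}

/* How a buffer of len bytes divides into head, aligned body and tail. */
typedef struct {
    size_t head; /* scalar bytes before the first aligned block */
    size_t body; /* bytes covered by whole 16-byte aligned blocks */
    size_t tail; /* scalar bytes left after the last whole block */
} split_sketch;

static split_sketch split_for_aligned_loads_sketch(const void *src, size_t len) {
    split_sketch s;
    s.head = bytes_to_align16_sketch(src);
    if (s.head > len)
        s.head = len; /* short buffer: everything is handled as head */
    s.body = (len - s.head) & ~(size_t)15;
    s.tail = len - s.head - s.body;
    assert(s.head + s.body + s.tail == len);
    return s;
}
```

On targets where unaligned loads are cheap, the head can be skipped
entirely and the vector loop can start at the first byte.
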
Fancier methods that minimized rebasing, while having the potential to
be faster, ended up being slower because the compiler structured the
code in a way that either spilled to the stack or trampolined out of a
loop and back into it instead of just jumping over the first load and
store.
Revisiting this for AVX512, where registers are more abundant and more
advanced loads exist, may be prudent.
Improves performance of inflate by up to 6% on an A-73 Hikey running at
2.36 GHz when executing the Chromium benchmark on the snappy data set.
In a few cases inflate is slower by up to 0.8%. Overall performance of
inflate is better by about 0.3%.