Inflate used to allocate state during init, but the window was allocated only
when/if needed and could be resized, which required another free/alloc round.
- Now, we allocate state and a 32K window during init, allowing the latency cost
of allocs to be done during init instead of at one or more times later.
- Total memory allocation is about the same when requesting a 32K window, but
if no window or a smaller window was requested, then it is an increase.
- While doing alloc(), we now store a pointer to the corresponding free(), avoiding crashes
with applications that incorrectly set alloc/free pointers after running the init function.
- After init has succeeded, inflate can no longer fail because of a failed malloc.
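
A minimal sketch of the idea above, not the actual zlib-ng code: state and a
32K window come from one allocation made during init, and the free() that
matches the alloc() is remembered inside the state. All demo_* names and the
struct layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_WSIZE 32768u /* fixed 32K window, allocated up front */

typedef void *(*demo_alloc_fn)(void *opaque, size_t size);
typedef void (*demo_free_fn)(void *opaque, void *ptr);

typedef struct demo_state {
    demo_free_fn saved_free; /* free() captured when alloc() was called */
    void *opaque;            /* opaque pointer captured at the same time */
    unsigned char *window;   /* points into the same allocation as the state */
    size_t wsize;
} demo_state;

void *demo_default_alloc(void *opaque, size_t size) {
    (void)opaque;
    return malloc(size);
}

void demo_default_free(void *opaque, void *ptr) {
    (void)opaque;
    free(ptr);
}

/* Allocate state and window together; this is the only point that can fail. */
demo_state *demo_init(demo_alloc_fn zalloc, demo_free_fn zfree, void *opaque) {
    assert(zalloc != NULL && zfree != NULL);
    demo_state *s = (demo_state *)zalloc(opaque, sizeof(demo_state) + DEMO_WSIZE);
    if (s == NULL)
        return NULL;
    s->saved_free = zfree;
    s->opaque = opaque;
    s->window = (unsigned char *)(s + 1); /* window follows the state struct */
    s->wsize = DEMO_WSIZE;
    return s;
}

/* Release using the stored pointer, not whatever the caller has set since. */
void demo_end(demo_state *s) {
    if (s != NULL)
        s->saved_free(s->opaque, s);
}
```

Because demo_end() uses the stored pointer, an application that overwrites
its alloc/free pointers after init still frees with the matching function.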
Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
An interesting revelation while benchmarking all of this is that our
chunkmemset_avx seems to be slower in a lot of use cases than
chunkmemset_sse. That will be an interesting function to attempt to
optimize.
Right now though, we're basically beating Google for all PNG decode and
encode benchmarks. There are some variations of flags that can
basically have us trading blows, but we're up to 14% faster
than Chromium's zlib patches.
While we're here, add a more direct benchmark of the folded copy method
versus the explicit copy + checksum.
While we're here, also simplify the "fold" signature, as reducing the
number of rebases and horizontal sums did not prove to be meaningfully
faster (slower in many circumstances).
These are very simple wrappers that do nothing clever but serve as a
shim interface for implementing versions which do cleverly track the
number of scalar sums performed so that we can minimize rebasing and
also have an efficient copy elision.
This serves as the baseline as each vectorization gets its own commit.
That way the PR will be bisectable.
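
As a rough illustration of what such a baseline shim could look like (a
sketch only; the name demo_adler32_fold_copy and its signature are
hypothetical, not the signatures used in this commit), the scalar version
simply checksums the input and then copies it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ADLER_BASE 65521u /* largest prime smaller than 65536 */

/* Plain scalar Adler-32, reducing after every byte for simplicity. */
uint32_t demo_adler32(uint32_t adler, const uint8_t *buf, size_t len) {
    uint32_t a = adler & 0xffffu;
    uint32_t b = (adler >> 16) & 0xffffu;
    for (size_t i = 0; i < len; i++) {
        a += buf[i];
        if (a >= DEMO_ADLER_BASE)
            a -= DEMO_ADLER_BASE;
        b += a;
        if (b >= DEMO_ADLER_BASE)
            b -= DEMO_ADLER_BASE;
    }
    return (b << 16) | a;
}

/* Baseline "fold copy": nothing clever, just checksum + memcpy. */
uint32_t demo_adler32_fold_copy(uint32_t adler, uint8_t *dst,
                                const uint8_t *src, size_t len) {
    assert(len == 0 || (dst != NULL && src != NULL));
    adler = demo_adler32(adler, src, len);
    memcpy(dst, src, len);
    return adler;
}
```

A vectorized version can keep the same signature and fuse the two passes,
which is what makes the scalar shim a useful baseline to bisect against.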
https://github.com/powturbo/TurboBench links zlib and zlib-ng into the
same binary, causing non-static symbol conflicts. Fix by using PREFIX()
for flush_pending(), bi_reverse(), inflate_ensure_window() and all of
the IBM Z symbols.
Note: do not use an explicit zng_, since one of the long-term goals is
to be able to link two versions of zlib-ng into the same binary for
benchmarking [1].
[1] https://github.com/zlib-ng/zlib-ng/pull/1248#issuecomment-1096648932
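
A sketch of how a PREFIX()-style macro keeps non-static symbols from
colliding. The macro definition below is an assumption for illustration
(DEMO_SYMBOL_PREFIX and the DEMO_CONCAT helpers are hypothetical names); the
real definition lives in the project's build headers and may differ.

```c
#include <assert.h>

/* Hypothetical build-time knob: each build can choose its own prefix. */
#ifndef DEMO_SYMBOL_PREFIX
#  define DEMO_SYMBOL_PREFIX zng_
#endif

/* Two-level concat so DEMO_SYMBOL_PREFIX is expanded before pasting. */
#define DEMO_CONCAT2(a, b) a##b
#define DEMO_CONCAT(a, b) DEMO_CONCAT2(a, b)
#define PREFIX(name) DEMO_CONCAT(DEMO_SYMBOL_PREFIX, name)

/* Expands to zng_bi_reverse with the default prefix above.
 * Reverses the low `len` bits of `code`. */
unsigned PREFIX(bi_reverse)(unsigned code, int len) {
    assert(len >= 1 && len <= 15);
    unsigned res = 0;
    do {
        res |= code & 1u;
        code >>= 1;
        res <<= 1;
    } while (--len > 0);
    return res >> 1;
}
```

Since callers also spell the name as PREFIX(bi_reverse), changing the prefix
for one build renames the symbol everywhere without touching call sites.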
This brings back a bit of the performance that may have been sacrificed
by reverting the reorganized inflate window. Doing a copy at the same
time as a CRC is basically free.
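
To illustrate why the copy is nearly free, here is a bit-at-a-time sketch
that touches each byte once for both the copy and the CRC-32 update. The
name demo_crc32_copy is hypothetical, and the real code uses folding with
SIMD rather than this scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy src to dst while updating a standard (reflected) CRC-32. */
uint32_t demo_crc32_copy(uint32_t crc, uint8_t *dst,
                         const uint8_t *src, size_t len) {
    assert(len == 0 || (dst != NULL && src != NULL));
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        uint8_t byte = src[i]; /* loaded once, used for copy and CRC */
        dst[i] = byte;
        crc ^= byte;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```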
This commit significantly improves inflate performance by reorganizing the window buffer into a contiguous window and pending output buffer. The goal of this layout is to reduce branching, improve cache locality, and enable the use of CRC folding with gzip input.
The window buffer is allocated as a multiple of the user-selected window size. In this commit, a factor of 2 is used.
The layout of the window buffer is divided into two sections. The first section, window offset [0, wsize), is reserved for history that has already been output. The second section, window offset [wsize, 2 * wsize), is reserved for buffering pending output that hasn't been flushed to the user's output buffer yet.
The history section grows downwards, towards the window offset of 0. The pending output section grows upwards, towards the end of the buffer. As a result, all of the possible distance/length data that may need to be copied is contiguous. This removes the need to stitch together output from 2 separate buffers.
In the case of gzip input, crc folding is used to copy the pending output to the user's buffers.
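
The layout described above can be sketched roughly as follows. This is a toy
model under stated assumptions, not the code in this commit: the demo_*
names, the struct fields, and the flush policy are all hypothetical, and the
flush uses a plain memcpy where the gzip path would use a CRC-folding copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint8_t *window; /* 2 * wsize bytes */
    size_t wsize;
    size_t whave;    /* history bytes, stored at [wsize - whave, wsize) */
    size_t pending;  /* pending bytes, stored at [wsize, wsize + pending) */
} demo_window;

/* Append one literal byte to the pending output section. */
void demo_put_literal(demo_window *w, uint8_t c) {
    assert(w->pending < w->wsize);
    w->window[w->wsize + w->pending++] = c;
}

/* Copy a match: the source is always contiguous with the destination,
 * whether it starts in history or in pending output. */
void demo_copy_match(demo_window *w, size_t dist, size_t len) {
    assert(dist >= 1 && dist <= w->whave + w->pending);
    assert(len <= w->wsize - w->pending);
    uint8_t *dst = w->window + w->wsize + w->pending;
    const uint8_t *src = dst - dist;
    for (size_t i = 0; i < len; i++) /* byte-wise so overlapping matches work */
        dst[i] = src[i];
    w->pending += len;
}

/* Flush pending output to the user buffer, then keep the most recent
 * bytes (at most wsize) as history ending at offset wsize. */
size_t demo_flush(demo_window *w, uint8_t *out) {
    size_t n = w->pending;
    memcpy(out, w->window + w->wsize, n);
    size_t total = w->whave + n;
    size_t keep = total < w->wsize ? total : w->wsize;
    memmove(w->window + w->wsize - keep, w->window + w->wsize + n - keep, keep);
    w->whave = keep;
    w->pending = 0;
    return n;
}
```

Note that demo_copy_match() needs no branch for "match spans two buffers",
which is the stitching the contiguous layout is meant to remove.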
Co-authored-by: Nathan Moinvaziri <nathan@nathanm.com>
inflateSync() is used to skip invalid deflate data, which means
that the check value that was being computed is no longer useful.
This commit turns off the check value computation, and furthermore
allows a successful return if the compressed data terminated in a
graceful manner. This commit also fixes a bug in the case that
inflateSync() is used before a header is ever processed. In that
case, there is no knowledge of a trailer, so the remainder is
treated as raw.
This verifies that the state has been initialized, that it is the
expected type of state, deflate or inflate, and that at least the
first several bytes of the internal state have not been clobbered.
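
A sketch of the kind of check described, assuming a state that keeps a
back-pointer to its stream and a mode field with a known valid range. The
demo_* types and the mode range values are hypothetical and are not the real
internal definitions.

```c
#include <stddef.h>

typedef struct demo_stream demo_stream;

typedef struct {
    demo_stream *strm; /* back-pointer, set at init; first bytes of the state */
    int mode;          /* current inflate mode */
} demo_inflate_state;

struct demo_stream {
    void *state;       /* NULL until init has run */
};

/* Hypothetical valid range of inflate modes. */
enum { DEMO_MODE_FIRST = 16180, DEMO_MODE_LAST = 16211 };

/* Returns 0 if the state looks like a valid inflate state, 1 otherwise. */
int demo_inflate_state_check(const demo_stream *strm) {
    if (strm == NULL || strm->state == NULL)
        return 1; /* never initialized */
    const demo_inflate_state *s = (const demo_inflate_state *)strm->state;
    if (s->strm != strm)
        return 1; /* clobbered, or the state belongs to another stream */
    if (s->mode < DEMO_MODE_FIRST || s->mode > DEMO_MODE_LAST)
        return 1; /* not an inflate state, or mode was overwritten */
    return 0;
}
```

Giving deflate states a disjoint range of mode or status values is one way a
check like this can tell the two kinds of state apart.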
The undocumented (except in these commit comments) function
inflateValidate(strm, check) can be called after an inflateInit(),
inflateInit2(), or inflateReset2() with check equal to zero to
turn off the check value (CRC-32 or Adler-32) computation and
comparison. Calling with check not equal to zero turns checking
back on. This should only be called immediately after the init or
reset function. inflateReset() does not change the state, so a
previous inflateValidate() setting will remain in effect.
This also turns off validation of the gzip header CRC when
present.
This should only be used when a zlib or gzip stream has already
been checked, and repeated decompressions of the same stream no
longer need to be validated.
This commit was cherry-picked and was not complete, resulting in a few
problems with gcc on 64-bit Windows.
This reverts commit edd7a72e05.
Conflicts:
arch/x86/crc_folding.c
arch/x86/fill_window_sse.c
deflate.c
deflate.h
match.c
trees.c