Inflate used to allocate state during init, but the window would be allocated
when/if needed, and resizing it required a new free/alloc round.
- Now we allocate state and a 32K window during init, paying the latency cost
of allocations once during init instead of at one or more points later.
- Total memory allocation is about the same when requesting a 32K window, but
if no window or a smaller window was requested, then it is an increase.
- While doing alloc(), we now store a pointer to the corresponding free(), avoiding
crashes with applications that incorrectly change the alloc/free pointers after
running the init function (see the sketch after this list).
- After init has succeeded, inflate can no longer fail due to a failing malloc.
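A minimal sketch of the bookkeeping idea (struct and field names are illustrative, not zlib-ng's actual layout):
```
/* Capture the free() pointer at allocation time so that a later,
 * incorrectly swapped zfree cannot be applied to memory that came
 * from the original allocator. Names are hypothetical. */
typedef struct alloc_bookkeeping_s {
    free_func zfree;        /* the free() that matches the alloc() used */
    void *opaque;           /* the matching opaque pointer */
    unsigned char *window;  /* the 32K window allocated during init */
} alloc_bookkeeping;
```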
Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
If inflate is invoked with Z_FINISH and it deems a window was not
necessary, there's a corner case where we never checksum the bytes.
Detect this by checking the window size against zero and the value
of the flush parameter.
This should fix issue #1600, and possibly #1565 as well.
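A hedged sketch of that detection (identifiers are stand-ins; the actual zlib-ng condition may be shaped differently):
```
/* With Z_FINISH, output can bypass the window entirely, so the
 * updatewindow() path that normally checksums was never taken. */
if (state->wsize == 0 && flush == Z_FINISH && (state->wrap & 4)) {
    /* checksum the produced output directly instead;
     * update_check(), out_buf and out_len are hypothetical names */
    strm->adler = state->check = update_check(state->check, out_buf, out_len);
}
```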
The inflate() functions never leave state->bits greater than 24, so
an inflatePrime() call could not cause this. The only way this
could have happened would be by using inflatePrime() to fill the
bit buffer with 32 bits, and then calling inflatePrime() a *second*
time asking to insert zero bits, for some reason. This commit
assures that a shift by 32 bits does not occur even in that case.
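A sketch of one way to assure that (zlib's actual fix may be structured differently; names follow zlib's inflatePrime() conventions):
```
/* Inside inflatePrime(): state->bits may legitimately be 32 here if a
 * previous call filled the bit buffer completely. */
if (bits < 0) {                  /* existing semantics: reset buffer */
    state->hold = 0;
    state->bits = 0;
    return Z_OK;
}
if (bits > 16 || state->bits + (unsigned)bits > 32)
    return Z_STREAM_ERROR;
if (bits == 0)                   /* nothing to insert: return early so */
    return Z_OK;                 /* "<< state->bits" can never be << 32 */
value &= (1L << bits) - 1;
state->hold += (unsigned)value << state->bits;
state->bits += (unsigned)bits;
return Z_OK;
```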
This should reduce the cost of indirection that occurs when calling functable
chunk copying functions inside inflate_fast. It should also allow the compiler
to optimize the inflate fast path for the specific architecture.
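An illustrative before/after (identifiers are hypothetical, not zlib-ng's exact names):
```
/* Before: every chunk copy in the hot loop pays an indirect call
 * through the dispatch table, which the compiler cannot inline. */
out = functable.chunkcopy(out, from, len);

/* After: inflate_fast is compiled per architecture with the chunk
 * primitives resolved statically, so calls inline and the loop is
 * optimized for that architecture's vector width. */
out = CHUNKCOPY(out, from, len);
```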
There is no hardware control for DFLTCC window size, and because of
that supporting small windows for deflate is not trivial: one has to
make sure that DFLTCC does not emit large distances, which most likely
entails somehow trimming the window and/or input in order to make sure
that whave + avail_in <= wsize.
But inflate is much easier: one only has to allocate enough space. Do
that in dfltcc_alloc_window(), and also introduce ZCOPY_WINDOW() in
order to copy everything, not just what the software implementation
cares about.
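A minimal sketch of the allocation idea, assuming HB_SIZE is the hardware history-buffer size (the real dfltcc_alloc_window() in zlib-ng may differ in signature and detail):
```
/* Over-allocate the inflate window so it can hold the larger of the
 * software window and the DFLTCC history buffer. */
static unsigned char *alloc_window_sketch(zng_stream *strm, size_t wsize) {
    size_t bytes = wsize < HB_SIZE ? HB_SIZE : wsize;
    return (unsigned char *)strm->zalloc(strm->opaque, 1, (unsigned)bytes);
}
```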
After this change, software and hardware window formats no longer
match: the software will use wbits and wsize, and the hardware will use
HB_BITS and HB_SIZE. Unlike deflate, inflate does not switch between
software and hardware implementations mid-stream, which leaves only
inflateSetDictionary() and inflateGetDictionary() interesting.
This increases the size of the `codes` array by 1920 bytes (33%), but
improves performance a little. Root table size is still limited by the
maximum code length in use, so tiny files typically see no change to
table-building time, as they don't use longer codes.
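The clamp that keeps tiny inputs cheap already lives in the table builder; paraphrased from zlib's inftrees.c:
```
/* inflate_table() never builds a root table wider than the longest
 * code actually present, so raising the default root size does not
 * slow down streams that only use short codes: */
if (root > max)      /* max = length of the longest code in use */
    root = max;      /* a tiny file keeps a tiny root table */
```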
If the extra field was larger than the space the user provided with
inflateGetHeader(), and if multiple calls of inflate() delivered
the extra header data, then there could be a buffer overflow of the
provided space. This commit assures that the provided space is not
exceeded.
See #1323.
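The fix follows the pattern of the upstream zlib change for this bug (zlib-style names, slightly simplified):
```
/* Copy extra-field bytes only while they still fit in the space the
 * user provided via inflateGetHeader(); the rest of the input is
 * still parsed, but nothing is written past extra_max. */
if (state->head != Z_NULL && state->head->extra != Z_NULL &&
    (len = state->head->extra_len - state->length) <
        state->head->extra_max) {
    memcpy(state->head->extra + len, next,
           len + copy > state->head->extra_max ?
               state->head->extra_max - len : copy);
}
```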
Negative windowBits arguments are eventually turned positive in
deflateInit2_ and inflateInit2_ (more precisely in inflateReset2).
Such values are used to indicate that raw deflate/inflate should
be performed.
If a user supplies INT32_MIN for windowBits, the code will compute
-INT32_MIN, which does not fit into int32_t. In fact, this is
undefined behavior in C and should be avoided.
Clearly this is a user error, but given the careful validation of
input arguments a few lines later in deflateInit2_ I think this
might be of interest.
Proof of Concept:
- Compile zlib-ng with gcc -ftrapv or -fsanitize=undefined
- Compile and run this program:
```
#include <limits.h>
#include <stdio.h>
#include <zlib-ng.h>

int main(void) {
    zng_stream de_stream = { 0 }, in_stream = { 0 };
    int result;

    result = zng_deflateInit2(&de_stream, 0, Z_DEFLATED, INT32_MIN,
                              MAX_MEM_LEVEL, Z_DEFAULT_STRATEGY);
    printf("zng_deflateInit2: %d\n", result);

    result = zng_inflateInit2(&in_stream, INT32_MIN);
    printf("zng_inflateInit2: %d\n", result);
    return 0;
}
```
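One possible guard (illustrative, not necessarily how the library resolved it) is to validate the negative range before negating:
```
/* Raw deflate/inflate only defines windowBits in -15..-8, so reject
 * anything below that range before computing -windowBits; this keeps
 * -INT32_MIN from ever being evaluated. */
if (windowBits < -15)
    return Z_STREAM_ERROR;
if (windowBits < 0) {
    wrap = 0;                  /* raw mode: no header, no checksum */
    windowBits = -windowBits;  /* now provably in 8..15, no overflow */
}
```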
An interesting revelation while benchmarking all of this is that our
chunkmemset_avx seems to be slower in a lot of use cases than
chunkmemset_sse. That will be an interesting function to attempt to
optimize.
Right now, though, we're basically beating Google for all PNG decode and
encode benchmarks. There are some variations of flags that can have us
trading blows, but we're as much as 14% faster than Chromium's zlib
patches.
While we're here, add a more direct benchmark of the folded copy method
versus the explicit copy + checksum.
Also simplify the "fold" signature, as reducing the number of rebases
and horizontal sums did not prove to be meaningfully faster (it was
slower in many circumstances).
We are guarding its usage behind a lot of preprocessor macros, as the
other methods are not yet implemented and calling this version would
implicitly bypass the faster adler implementations.
When more versions are written for faster vectorizations, the functable
entries will be populated and preprocessor macros removed. This round,
the copy + checksum is not employing as many tricks as one would hope
with a "folded" checksum routine. The reason for this is the
particularly tricky case of dealing with unaligned buffers.
Implementations for CPUs that don't pay a huge penalty for unaligned
loads will end up much faster.
Fancier methods that minimized rebasing, while having the potential to
be faster, ended up being slower because the compiler structured the
code in a way that either spilled to the stack or trampolined out of
the loop and back into it instead of just jumping over the first load
and store.
Revisiting this for AVX512, where more registers are abundant and more
advanced loads exist, may be prudent.
These are very simple wrappers that do nothing clever but serve as a
shim interface for implementing versions which do cleverly track the
number of scalar sums performed so that we can minimize rebasing and
also have an efficient copy elision.
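A minimal scalar sketch of such a wrapper, fusing the copy into the running Adler-32 sums (simplified: a real version defers the modulo over NMAX-sized blocks; the name is hypothetical):
```
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u   /* largest prime below 2^16 */

static uint32_t adler32_fold_copy_sketch(uint32_t adler, uint8_t *dst,
                                         const uint8_t *src, size_t len) {
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;
    while (len--) {
        *dst++ = *src;      /* the fused copy: no second pass needed */
        s1 += *src++;
        s2 += s1;
        s1 %= ADLER_BASE;   /* per byte only for clarity */
        s2 %= ADLER_BASE;
    }
    return (s2 << 16) | s1;
}
```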
This serves as the baseline as each vectorization gets its own commit.
That way the PR will be bisectable.
Pretty much every time updatewindow has been called, a checksum was
implicitly performed, except on s/390 or when state->wrap & 4 == 0. The
inflateSetDictionary function instead calls this checksum separately
before invoking updatewindow, and checks whether it matches the initial
checksum (which comes from parsing the DICTID section of the header).
Instead, we can give updatewindow a "copy" parameter, which is the
state->wrap value that is being checked anyway, and move the 3rd-bit
check to the caller rather than the callee.
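Illustratively, at the call sites (signatures simplified):
```
/* Before: updatewindow() decided internally from state->wrap whether
 * to checksum the bytes it copies into the window. */
updatewindow(strm, strm->next_out, out - strm->avail_out);

/* After: the caller performs the 3rd-bit test and passes it in, so
 * inflateSetDictionary() can pass 0 and skip the redundant checksum. */
updatewindow(strm, strm->next_out, out - strm->avail_out, state->wrap & 4);
```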
Currently deflate and inflate both use a common state struct. There are
several variables in this struct that we don't need for inflate, and
more may be coming in the future. Therefore, split them into two separate
structs. This in turn requires splitting ZALLOC_STATE and ZCOPY_STATE
macros.
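A sketch of the shape of the split (representative fields only; the real member lists are much longer):
```
typedef struct inflate_state_s {
    unsigned char *window;   /* sliding window: inflate needs this */
    uint32_t hold;           /* bit buffer, huffman tables, ... */
} inflate_state;

typedef struct deflate_state_s {
    uint16_t *head;          /* hash chains: deflate-only */
    uint16_t *prev;          /* pending buffer, match state, ... */
} deflate_state;

/* The alloc macros split to match (copy macros likewise): */
#define ZALLOC_INFLATE_STATE(strm) \
    ((inflate_state *)ZALLOC(strm, 1, sizeof(inflate_state)))
#define ZALLOC_DEFLATE_STATE(strm) \
    ((deflate_state *)ZALLOC(strm, 1, sizeof(deflate_state)))
```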
https://github.com/powturbo/TurboBench links zlib and zlib-ng into the
same binary, causing non-static symbol conflicts. Fix by using PREFIX()
for flush_pending(), bi_reverse(), inflate_ensure_window() and all of
the IBM Z symbols.
Note: do not use an explicit zng_, since one of the long-term goals is
to be able to link two versions of zlib-ng into the same binary for
benchmarking [1].
[1] https://github.com/zlib-ng/zlib-ng/pull/1248#issuecomment-1096648932
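The existing PREFIX() convention already provides this (simplified from zlib-ng's zbuild.h-style setup):
```
#ifdef ZLIB_COMPAT
#  define PREFIX(x) x           /* compat build keeps zlib's names */
#else
#  define PREFIX(x) zng_ ## x   /* native build gets the zng_ prefix */
#endif

/* PREFIX(flush_pending) therefore expands to zng_flush_pending or
 * flush_pending depending on the build, so the prefix is applied
 * consistently without ever being hard-coded. */
```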
This brings back a bit of the performance that may have been sacrificed
by reverting the reorganized inflate window. Doing a copy at the same
time as a CRC is basically free.
This was found to have a significant impact on a highly compressible PNG
for both the encode and decode. Some deltas show performance improving
as much as 60%+.
For scenarios where "dist" does not evenly divide our chunk size, we
simply repeat the bytes as many times as possible into our vector
registers. We then copy the entire vector and advance by the largest
multiple of dist that fits in a chunk (a sketch follows below).
If dist happens to be 1, there's no reason not to just call memset from
libc (this is likely to be just as fast, if not faster).
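A scalar model of the approach described above (the real implementation uses vector registers and zlib-ng's chunk abstractions; every name here is illustrative):
```
#include <string.h>

#define CHUNK_SIZE 16   /* stand-in for the vector register width */

static void chunkmemset_sketch(unsigned char *out, unsigned dist,
                               unsigned len) {
    if (dist == 1) {              /* byte fill: just defer to libc */
        memset(out, out[-1], len);
        return;
    }
    if (dist >= CHUNK_SIZE) {     /* pattern can't repeat within a */
        while (len--) {           /* chunk; plain byte-wise copy   */
            *out = *(out - dist);
            out++;
        }
        return;
    }
    unsigned char chunk[CHUNK_SIZE];
    unsigned filled = 0;
    /* Repeat the dist-byte pattern into the chunk as often as fits. */
    while (filled + dist <= CHUNK_SIZE) {
        memcpy(chunk + filled, out - dist, dist);
        filled += dist;
    }
    /* Store whole chunks but advance only by filled bytes, i.e. by
     * dist * (CHUNK_SIZE / dist); the overlapping bytes are rewritten
     * consistently because the pattern's period divides filled. */
    while (len >= CHUNK_SIZE) {
        memcpy(out, chunk, CHUNK_SIZE);
        out += filled;
        len -= filled;
    }
    while (len--) {               /* scalar tail */
        *out = *(out - dist);
        out++;
    }
}
```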