
180 Commits

Author SHA1 Message Date
Mika Lindqvist
43b2703435 Fix shift overflow in inflate and send_code. 2025-02-08 21:43:51 +01:00
Adam Stylinski
94aacd8bd6 Try to simplify the inflate loop by collapsing most cases to chunksets 2024-10-23 21:20:11 +02:00
Hans Kristian Rosbach
dae668dbff Reorder variables in inflate functions to reduce padding holes
due to variable alignment requirements.
2024-10-10 13:22:50 +02:00
Hans Kristian Rosbach
a5c20ed67e Add variable 'wbufsize' to track window buffer including padding, to allow
the chunkset code to spill garbage data into the padding area if available.
2024-10-08 15:51:12 +02:00
Hans Kristian Rosbach
39e9c86ec0 Don't use 'dmax' and 'sane' variables unless their checks have been compiled in. 2024-10-08 15:51:12 +02:00
Hans Kristian Rosbach
e024068dac Revert "Split chunkcopy_safe to allow the first part to be inlined more often."
This reverts commit 6b8efe7868.

New and improved chunkcopy_safe is coming soon.
2024-09-17 14:12:24 +02:00
Hans Kristian Rosbach
6b8efe7868 Split chunkcopy_safe to allow the first part to be inlined more often. 2024-09-13 12:48:43 +02:00
Pavel P
2c801bd43a Cast result of zalloc to char * to avoid warnings
+ remove unnecessary cast when using `original_buf`
2024-08-09 13:34:43 +02:00
Hans Kristian Rosbach
037c6f84b5 Simplify inflate window management now that there is no need to
worry about failed allocs other than during init.
2024-05-30 13:59:40 +02:00
Hans Kristian Rosbach
63e1d460aa Rewrite inflate memory allocation.
Inflate used to allocate state during init, but window would be allocated
when/if needed and could be resized and that required a new free/alloc round.

- Now, we allocate state and a 32K window during init, allowing the latency cost
  of allocs to be done during init instead of at one or more times later.
- Total memory allocation is about the same when requesting a 32K window, but
  if no window or a smaller window was requested, then it is an increase.
- While doing alloc(), we now store pointer to corresponding free(), avoiding crashes
  with applications that incorrectly set alloc/free pointers after running init function.
- After init has succeeded, inflate will no longer possibly fail due to a failing malloc.

Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
2024-05-28 16:35:13 +02:00
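
The allocation scheme described in 63e1d460aa can be sketched roughly as follows. This is a minimal illustration, not zlib-ng's actual code: the names `sketch_state`, `sketch_init`, and `sketch_end` are hypothetical, and placing the state and the window in a single block is an assumption made to keep the example short. It shows the two ideas from the message: the window is allocated up front during init, and the matching free function is stored at allocation time so that later changes to the caller's pointers cannot cause a mismatch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is not zlib-ng's real layout. */
typedef void *(*alloc_fn)(size_t size);
typedef void (*free_fn)(void *ptr);

#define SKETCH_WINDOW_SIZE (32u * 1024u)  /* 32K window, as in the commit message */

typedef struct sketch_state {
    free_fn        release;  /* free() matching the alloc() used during init */
    unsigned char *window;   /* assumption: window lives in the same block as the state */
    size_t         wsize;    /* window size in bytes */
} sketch_state;

/* Allocate state and window once, during init. Returns NULL if the allocation fails. */
static sketch_state *sketch_init(alloc_fn alloc, free_fn release) {
    unsigned char *block = alloc(sizeof(sketch_state) + SKETCH_WINDOW_SIZE);
    if (block == NULL)
        return NULL;

    sketch_state *s = (sketch_state *)block;
    s->release = release;                       /* remember how to free this block */
    s->window  = block + sizeof(sketch_state);  /* window follows the state */
    s->wsize   = SKETCH_WINDOW_SIZE;
    return s;
}

/* Release everything with the free function captured at init time. */
static void sketch_end(sketch_state *s) {
    if (s != NULL)
        s->release(s);
}
```

Once `sketch_init` has succeeded, nothing in this sketch allocates again, which mirrors the message's point that a failing malloc can only surface during init.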
Ilya Leoshkevich
05ef29eda5 IBM zSystems DFLTCC: Inline DFLTCC states into zlib states
Currently DFLTCC states are allocated using hook macros, complicating
memory management. Inline them into zlib states and remove the hooks.
2024-05-15 11:28:10 +02:00
Vladislav Shchapov
af8169a724 Replace conditional call to functable.force_init with macro FUNCTABLE_INIT
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-03-06 23:32:15 +01:00
Vladislav Shchapov
c694bcdaf6 Add option to disable runtime CPU detection
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-03-06 23:32:15 +01:00
Vladislav Shchapov
fe0a6407da Explicitly indicate functions are conditionally dispatched
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-03-06 23:32:15 +01:00
Mark Adler
a4c236c4f0 Fix bug in inflateSync() for data held in bit buffer.
madler/zlib#5af7cef45eeef86ddf6ab00b4e363c1eecaf47b6
2024-02-07 19:15:56 +01:00
Adam Stylinski
c2cd8d49d5 Removing some outdated comments
These were left on my part, the inline copy + checksum is the very
thing the function is doing.
2024-01-29 20:00:10 +01:00
Rye Mutt
a61926d3f9 Fix memory corruption introduced in 61e181c8ae 2023-12-29 15:52:14 +01:00
Vladislav Shchapov
ba4a78133e Initialize functable earlier, during inflateInit
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2023-12-21 16:12:00 +01:00
Adam Stylinski
90b6c36427 Fix an issue with regard to finishing out the window
if inflate is invoked with Z_FINISH, and it deems a window was not
necessary, there's a corner case where we never checksum the bytes.
Detect this by checking the window size against zero and the value
of the flush parameter.

This should fix issue #1600, and possibly #1565 as well.
2023-11-24 13:41:46 +01:00
Hans Kristian Rosbach
61e181c8ae Make sure inflateCopy() allocates window with the necessary buffer for chunked operations.
Based on Chromium bugfix https://chromium-review.googlesource.com/c/chromium/src/+/4876445
2023-09-29 13:32:52 +02:00
Mark Adler
045a278d86 Assure that inflatePrime() can't shift a 32-bit integer by 32 bits.
The inflate() functions never leave state->bits greater than 24, so
an inflatePrime() call could not cause this. The only way this
could have happened would be by using inflatePrime() to fill the
bit buffer with 32 bits, and then calling inflatePrime() a *second*
time asking to insert zero bits, for some reason. This commit
assures that a shift by 32 bits does not occur even in that case.
2023-04-26 14:01:14 +02:00
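
The corner case described in 045a278d86 can be illustrated with a small stand-alone bit buffer. This is a sketch under stated assumptions, not the library's code: `bitbuf` and `bitbuf_prime` are hypothetical names, and negative bit counts are simply rejected here for brevity. The early return for a zero-bit request is what keeps a full 32-bit buffer from ever being used as a shift count.

```c
#include <stdint.h>

/* Hypothetical bit buffer; it only loosely mirrors the state described above. */
typedef struct bitbuf {
    uint32_t hold;  /* accumulated bits */
    unsigned bits;  /* number of valid bits in hold, 0..32 */
} bitbuf;

/* Insert the low `bits` bits of `value`. Returns 0 on success, -1 on bad input. */
static int bitbuf_prime(bitbuf *b, int bits, int value) {
    if (bits == 0)
        return 0;  /* nothing to insert, so b->bits is never used as a shift count */
    if (bits < 0 || bits > 16 || b->bits + (unsigned)bits > 32)
        return -1; /* simplification: negative counts are rejected in this sketch */

    /* Here bits >= 1 and b->bits + bits <= 32, so b->bits <= 31: the shift is defined. */
    uint32_t v = (uint32_t)value & ((1u << bits) - 1u);
    b->hold += v << b->bits;
    b->bits += (unsigned)bits;
    return 0;
}
```

Without the zero-bit early return, filling the buffer with two 16-bit inserts and then asking for zero bits would pass the range check and evaluate `v << 32`, which is undefined for a 32-bit operand in C.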
Vladislav Shchapov
20d8fa8af1 Replace global CPU feature flag variables with local variable in init_functable
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2023-03-06 13:26:09 +01:00
Mika T. Lindqvist
c970422caa Fix definition of z_size_t to match documentation of legacy zlib API. 2023-02-23 12:17:34 +01:00
Nathan Moinvaziri
fa9bfeddcf Use named defines instead of hard coded numbers. 2023-02-18 20:30:55 +01:00
Hans Kristian Rosbach
cf5bb01da9 Fix prefixing for internal functions calloc/cfree 2023-02-09 01:54:19 +01:00
Mika T. Lindqvist
f43f4ddb90 Fix ambiguous shift warning in inflateCopy. 2023-02-08 15:21:20 +01:00
Nathan Moinvaziri
aa1109bb2e Use arch-specific versions of inflate_fast.
This should reduce the cost of indirection that occurs when calling functable
chunk copying functions inside inflate_fast. It should also allow the compiler
to optimize the inflate fast path for the specific architecture.
2023-02-05 17:51:46 +01:00
Nathan Moinvaziri
b047c7247f Prefix shared functions to prevent symbol conflict when linking native api against compat api. 2023-01-09 15:10:11 +01:00
Ilya Leoshkevich
3eab3173ac IBM zSystems DFLTCC: Support inflate with small window
There is no hardware control for DFLTCC window size, and because of
that supporting small windows for deflate is not trivial: one has to
make sure that DFLTCC does not emit large distances, which most likely
entails somehow trimming the window and/or input in order to make sure
that whave + avail_in <= wsize.

But inflate is much easier: one only has to allocate enough space. Do
that in dfltcc_alloc_window(), and also introduce ZCOPY_WINDOW() in
order to copy everything, not just what the software implementation
cares about.

After this change, software and hardware window formats no longer
match: the software will use wbits and wsize, and the hardware will use
HB_BITS and HB_SIZE. Unlike deflate, inflate does not switch between
software and hardware implementations mid-stream, which leaves only
inflateSetDictionary() and inflateGetDictionary() interesting.
2022-12-11 12:03:12 +01:00
Cameron Cawley
baf0fd1234 Enable and fix -Wimplicit-fallthrough warnings 2022-10-23 15:00:46 +02:00
Dougall Johnson
2d110b17b8 Inflate: Increase max root table sizes to 10 and 9
This increases the size of the `codes` array by 1920 bytes (33%), but
improves performance a little. Root table size is still limited by the
maximum code length in use, so tiny files typically see no change to
table-building time, as they don't use longer codes.
2022-09-25 17:45:00 +02:00
Mika Lindqvist
9309904fe7 If the extra field was larger than the space the user provided with
inflateGetHeader(), and if multiple calls of inflate() delivered
the extra header data, then there could be a buffer overflow of the
provided space. This commit assures that provided space is not
exceeded.

See #1323.
2022-09-05 11:27:40 +02:00
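
The overflow described in 9309904fe7 comes from copying extra-field data across several calls without accounting for what earlier calls already stored. A minimal sketch of a bounded copy follows; `copy_extra_chunk` and its parameters are hypothetical and do not correspond to the library's internal names.

```c
#include <stddef.h>
#include <string.h>

/* Copy one chunk of extra-field data into a caller-provided buffer.
 *   extra     - caller's buffer (may be NULL if the caller does not want the data)
 *   extra_max - capacity of that buffer in bytes
 *   done      - extra-field bytes already delivered by earlier calls
 *   src, len  - the chunk delivered by this call
 * Returns the number of bytes actually stored. Hypothetical helper. */
static size_t copy_extra_chunk(unsigned char *extra, size_t extra_max,
                               size_t done, const unsigned char *src, size_t len) {
    if (extra == NULL || done >= extra_max)
        return 0;                         /* no buffer, or buffer already full */

    size_t room = extra_max - done;       /* space left after earlier chunks */
    size_t n = len < room ? len : room;   /* never write past extra_max */
    memcpy(extra + done, src, n);
    return n;
}
```

The key point is that the limit is computed from the remaining room, not from the full capacity, so a second or third chunk cannot run past the end of the provided space.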
Tobias Stoeckmann
956ff05383 Handle invalid windowBits in init functions
Negative windowBits arguments are eventually turned positive in
deflateInit2_ and inflateInit2_ (more precisely in inflateReset2).
Such values are used to indicate that raw deflate/inflate should
be performed.

If a user supplies INT32_MIN for windowBits, the code will perform
-INT32_MIN which does not fit into int32_t. In fact, this is
undefined behavior in C and should be avoided.

Clearly this is a user error, but given the careful validation of
input arguments a few lines later in deflateInit2_ I think this
might be of interest.

Proof of Concept:

- Compile zlib-ng with gcc -ftrapv or -fsanitize=undefined
- Compile and run this program:

```
 #include <limits.h>
 #include <stdint.h>  /* INT32_MIN */
 #include <stdio.h>
 #include <zlib-ng.h>

 int main(void) {
  zng_stream de_stream = { 0 }, in_stream = { 0 };
  int result;

  /* windowBits = INT32_MIN: negating it overflows int32_t */
  result = zng_deflateInit2(&de_stream, 0, Z_DEFLATED, INT32_MIN,
      MAX_MEM_LEVEL, Z_DEFAULT_STRATEGY);
  printf("zng_deflateInit2: %d\n", result);

  result = zng_inflateInit2(&in_stream, INT32_MIN);
  printf("zng_inflateInit2: %d\n", result);

  return 0;
 }
```
2022-06-16 14:08:55 +02:00
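
One way to avoid the undefined negation described in 956ff05383 is to range-check the argument before flipping its sign. The sketch below is illustrative only: `normalize_window_bits` is a hypothetical helper, the lower bound of -15 is taken from the documented raw-deflate range, and the usual upper-range validation is left out.

```c
#include <stdint.h>

/* Split a windowBits argument into a magnitude and a "raw stream" flag.
 * Returns 0 on success, -1 if the value is rejected. Hypothetical helper;
 * it does not validate the positive range. */
static int normalize_window_bits(int32_t window_bits, int32_t *magnitude, int *raw) {
    *raw = 0;
    if (window_bits < 0) {
        if (window_bits < -15)
            return -1;               /* rejects INT32_MIN before any negation happens */
        *raw = 1;
        window_bits = -window_bits;  /* safe: the value is in -15..-1 here */
    }
    *magnitude = window_bits;
    return 0;
}
```

Because the comparison happens before the negation, `-INT32_MIN` is never evaluated, so builds with `-ftrapv` or `-fsanitize=undefined` have nothing to report for this input.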
Nathan Moinvaziri
d43822b9a7 zlib 1.2.12 2022-06-13 15:58:03 +02:00
Hans Kristian Rosbach
28b029c726 Simplify version and struct size checking, and ensure we do it the same way everywhere. 2022-06-03 10:21:01 +02:00
Hans Kristian Rosbach
2f4e2372a2 Simplify zlib-ng native API by removing version and struct size checks.
This should be backwards compatible with applications compiled for 2.0.x.
2022-06-03 10:21:01 +02:00
Adam Stylinski
d79984b5bc Adding avx512_vnni inline + copy elision
Interesting revelation while benchmarking all of this is that our
chunkmemset_avx seems to be slower in a lot of use cases than
chunkmemset_sse.  That will be an interesting function to attempt to
optimize.

Right now though, we're basically beating google for all PNG decode and
encode benchmarks.  There are some variations of flags that can
basically have us trading blows, but we're about as much as 14% faster
than chromium's zlib patches.

While we're here, add a more direct benchmark of the folded copy method
versus the explicit copy + checksum.
2022-05-23 16:13:39 +02:00
Adam Stylinski
b8269bb7d4 Added inlined AVX512 adler checksum + copy
While we're here, also simplify the "fold" signature, as reducing the
number of rebases and horizontal sums did not prove to be meaningfully
faster (slower in many circumstances).
2022-05-23 16:13:39 +02:00
Adam Stylinski
21f461e238 Adding an SSE42 optimized copy + adler checksum implementation
We are protecting its usage around a lot of preprocessor macros as the
other methods are not yet implemented and calling this version bypasses
the faster adler implementations implicitly.

When more versions are written for faster vectorizations, the functable
entries will be populated and preprocessor macros removed. This round,
the copy + checksum is not employing as many tricks as one would hope
with a "folded" checksum routine.  The reason for this is the
particularly tricky case of dealing with unaligned buffers.  The
implementations which don't have CPUs in the mix that have a huge
penalty for unaligned loads will have a much faster implementation.

Fancier methods that minimized rebasing, while having the potential to
be faster, ended up being slower because the compiler structured the
code in a way that ended up either spilling to the stack or trampolining
out of a loop and back in it instead of just jumping over the first load
and store.

Revisiting this for AVX512, where more registers are abundant and more
advanced loads exist, may be prudent.
2022-05-23 16:13:39 +02:00
Adam Stylinski
b1389ac2d5 Create adler32_fold_c* functions
These are very simple wrappers that do nothing clever but serve as a
shim interface for implementing versions which do cleverly track the
number of scalar sums performed so that we can minimize rebasing and
also have an efficient copy elision.

This serves as the baseline as each vectorization gets its own commit.
That way the PR will be bisectable.
2022-05-23 16:13:39 +02:00
Adam Stylinski
84f116a3d7 Fixed regression introduced by inlining CRC + copy
Pretty much every time updatewindow has been called, implicitly a
checksum was performed unless on s/390 or (state->wrap & 4) == 0. The
inflateSetDictionary function instead separately calls this checksum
before invoking update window and checks the checksum to see if it
matches the initial checksum (a property that happens from parsing the
DICTID section of the headers).

Instead, we can make updatewindow have a "copy" parameter, which is the
state->wrap value that is being checked anyway.  We instead move the 3rd
bit check to be checked by the caller rather than the callee.
2022-04-29 11:10:58 +02:00
Ilya Leoshkevich
c592b1b332 IBM Z DFLTCC: Split deflate and inflate states
Currently deflate and inflate both use a common state struct. There are
several variables in this struct that we don't need for inflate, and
more may be coming in the future. Therefore split them in two separate
structs. This in turn requires splitting ZALLOC_STATE and ZCOPY_STATE
macros.
2022-04-28 12:01:57 +02:00
Ilya Leoshkevich
9be98893aa Use PREFIX() for some of the Z_INTERNAL symbols
https://github.com/powturbo/TurboBench links zlib and zlib-ng into the
same binary, causing non-static symbol conflicts. Fix by using PREFIX()
for flush_pending(), bi_reverse(), inflate_ensure_window() and all of
the IBM Z symbols.

Note: do not use an explicit zng_, since one of the long-term goals is
to be able to link two versions of zlib-ng into the same binary for
benchmarking [1].

[1] https://github.com/zlib-ng/zlib-ng/pull/1248#issuecomment-1096648932
2022-04-27 10:37:43 +02:00
Nathan Moinvaziri
c882034d48 Use __msan_unpoison to unpoison end of window for when it needs to read the past < chunksize bytes in the window. See #1245.
Co-authored-by: Adam Stylinski <kungfujesus06@gmail.com>
2022-04-14 00:00:27 +02:00
Adam Stylinski
8550a90de4 Leverage inline CRC + copy
This brings back a bit of the performance that may have been sacrificed
by reverting the reorganized inflate window. Doing a copy at the same
time as a CRC is basically free.
2022-03-31 16:11:15 +02:00
Nathan Moinvaziri
6c4beb611d Revert "Reorganize inflate window layout"
This reverts commit dc3b60841d.
2022-03-23 11:30:35 +01:00
Nathan Moinvaziri
097f789fa2 Revert "DFLTCC update for window optimization from Jim & Nathan"
This reverts commit b4ca25afab.
2022-03-23 11:30:35 +01:00
Adam Stylinski
49a6bb5d41 Speed up chunkcopy and memset
This was found to have a significant impact on a highly compressible PNG
for both the encode and decode.  Some deltas show performance improving
as much as 60%+.

For the scenarios where "dist" does not evenly divide our chunk size,
we simply repeat the bytes as many times as possible into our vector
registers.  We then copy the entire vector and advance by the largest
multiple of dist that fits in the chunk.

If dist happens to be 1, there's no reason to not just call memset from
libc (this is likely to be just as fast if not faster).
2022-03-16 11:42:19 +01:00
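
The replication strategy described in 49a6bb5d41 can be sketched in scalar C. This is not the vectorized implementation: `pattern_fill` is a hypothetical name, `CHUNK` stands in for a vector register width, and a plain byte array plays the role of the register. Unlike the approach in the message, which stores the whole vector, this sketch writes only the useful bytes so that it never touches memory past the requested length.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 16  /* stand-in for a vector register width (assumption) */

/* Fill out[0..len) by repeating the `dist` bytes that immediately precede `out`.
 * Requires dist >= 1 and at least `dist` valid bytes before `out`. */
static void pattern_fill(unsigned char *out, size_t dist, size_t len) {
    const unsigned char *from = out - dist;

    if (dist == 1) {               /* single repeated byte: memset does the job */
        memset(out, *from, len);
        return;
    }
    if (dist > CHUNK) {            /* pattern does not fit in a chunk: byte-wise fallback */
        for (size_t i = 0; i < len; i++)
            out[i] = from[i];
        return;
    }

    unsigned char chunk[CHUNK];
    size_t reps = CHUNK / dist;    /* whole copies of the pattern that fit */
    size_t step = reps * dist;     /* useful bytes per chunk; also the advance amount */
    for (size_t i = 0; i < step; i++)
        chunk[i] = from[i % dist];

    while (len >= step) {          /* each store stays aligned to the pattern phase */
        memcpy(out, chunk, step);
        out += step;
        len -= step;
    }
    memcpy(out, chunk, len);       /* tail shorter than one step */
}
```

Advancing by a multiple of `dist` keeps every store in phase with the pattern, which is why the same prepared chunk can be reused for the whole run.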
Nathan Moinvaziri
a639a3d43f Use cpu_check_features in inflate and deflate. 2022-01-23 16:39:48 +01:00
Nathan Moinvaziri
a5a0b40e17 Move cpu_feature includes out of zutil.h. 2022-01-23 16:39:48 +01:00