Inflate used to allocate state during init, but the window would be allocated
when/if needed, and resizing it required a new free/alloc round.
- Now we allocate state and a 32K window during init, paying the latency cost
of allocations once during init instead of at one or more points later.
- Total memory allocation is about the same when requesting a 32K window, but
if no window or a smaller window was requested, then it is an increase.
- While doing alloc(), we now store a pointer to the corresponding free(), avoiding
crashes with applications that incorrectly change the alloc/free pointers after
running the init function (see the sketch after this list).
- After init has succeeded, inflate can no longer fail due to a failing malloc.
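A minimal sketch of the bookkeeping idea (struct and field names are illustrative, not zlib-ng's actual layout):
```
/* Capture the free() pointer at allocation time so that a later,
 * incorrectly swapped zfree cannot be applied to memory that came
 * from the original allocator. Names are hypothetical. */
typedef struct alloc_bookkeeping_s {
    free_func zfree;        /* the free() that matches the alloc() used */
    void *opaque;           /* the matching opaque pointer */
    unsigned char *window;  /* the 32K window allocated during init */
} alloc_bookkeeping;
```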
Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
If inflate is invoked with Z_FINISH and it deems a window was not
necessary, there's a corner case where we never checksum the bytes.
Detect this by checking the window size against zero and the value
of the flush parameter.
This should fix issue #1600, and possibly #1565 as well.
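A hedged sketch of that detection (identifiers are stand-ins; the actual zlib-ng condition may be shaped differently):
```
/* With Z_FINISH, output can bypass the window entirely, so the
 * updatewindow() path that normally checksums was never taken. */
if (state->wsize == 0 && flush == Z_FINISH && (state->wrap & 4)) {
    /* checksum the produced output directly instead;
     * update_check(), out_buf and out_len are hypothetical names */
    strm->adler = state->check = update_check(state->check, out_buf, out_len);
}
```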
The inflate() functions never leave state->bits greater than 24, so
an inflatePrime() call could not cause this. The only way this
could have happened would be by using inflatePrime() to fill the
bit buffer with 32 bits, and then calling inflatePrime() a *second*
time asking to insert zero bits, for some reason. This commit
assures that a shift by 32 bits does not occur even in that case.
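A sketch of one way to assure that (zlib's actual fix may be structured differently; names follow zlib's inflatePrime() conventions):
```
/* Inside inflatePrime(): state->bits may legitimately be 32 here if a
 * previous call filled the bit buffer completely. */
if (bits < 0) {                  /* existing semantics: reset buffer */
    state->hold = 0;
    state->bits = 0;
    return Z_OK;
}
if (bits > 16 || state->bits + (unsigned)bits > 32)
    return Z_STREAM_ERROR;
if (bits == 0)                   /* nothing to insert: return early so */
    return Z_OK;                 /* "<< state->bits" can never be << 32 */
value &= (1L << bits) - 1;
state->hold += (unsigned)value << state->bits;
state->bits += (unsigned)bits;
return Z_OK;
```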
This should reduce the cost of indirection that occurs when calling functable
chunk copying functions inside inflate_fast. It should also allow the compiler
to optimize the inflate fast path for the specific architecture.
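An illustrative before/after (identifiers are hypothetical, not zlib-ng's exact names):
```
/* Before: every chunk copy in the hot loop pays an indirect call
 * through the dispatch table, which the compiler cannot inline. */
out = functable.chunkcopy(out, from, len);

/* After: inflate_fast is compiled per architecture with the chunk
 * primitives resolved statically, so calls inline and the loop is
 * optimized for that architecture's vector width. */
out = CHUNKCOPY(out, from, len);
```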
There is no hardware control for DFLTCC window size, and because of
that supporting small windows for deflate is not trivial: one has to
make sure that DFLTCC does not emit large distances, which most likely
entails somehow trimming the window and/or input in order to make sure
that whave + avail_in <= wsize.
But inflate is much easier: one only has to allocate enough space. Do
that in dfltcc_alloc_window(), and also introduce ZCOPY_WINDOW() in
order to copy everything, not just what the software implementation
cares about.
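A minimal sketch of the allocation idea, assuming HB_SIZE is the hardware history-buffer size (the real dfltcc_alloc_window() in zlib-ng may differ in signature and detail):
```
/* Over-allocate the inflate window so it can hold the larger of the
 * software window and the DFLTCC history buffer. */
static unsigned char *alloc_window_sketch(zng_stream *strm, size_t wsize) {
    size_t bytes = wsize < HB_SIZE ? HB_SIZE : wsize;
    return (unsigned char *)strm->zalloc(strm->opaque, 1, (unsigned)bytes);
}
```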
After this change, software and hardware window formats no longer
match: the software will use wbits and wsize, and the hardware will use
HB_BITS and HB_SIZE. Unlike deflate, inflate does not switch between
software and hardware implementations mid-stream, which leaves only
inflateSetDictionary() and inflateGetDictionary() interesting.
This increases the size of the `codes` array by 1920 bytes (33%), but
improves performance a little. Root table size is still limited by the
maximum code length in use, so tiny files typically see no change to
table-building time, as they don't use longer codes.
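The clamp that keeps tiny inputs cheap already lives in the table builder; paraphrased from zlib's inftrees.c:
```
/* inflate_table() never builds a root table wider than the longest
 * code actually present, so raising the default root size does not
 * slow down streams that only use short codes: */
if (root > max)      /* max = length of the longest code in use */
    root = max;      /* a tiny file keeps a tiny root table */
```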
If the extra field was larger than the space the user provided with
inflateGetHeader(), and if multiple calls of inflate() delivered
the extra header data, then there could be a buffer overflow of the
provided space. This commit assures that the provided space is not
exceeded.
See #1323.
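The fix follows the pattern of the upstream zlib change for this bug (zlib-style names, slightly simplified):
```
/* Copy extra-field bytes only while they still fit in the space the
 * user provided via inflateGetHeader(); the rest of the input is
 * still parsed, but nothing is written past extra_max. */
if (state->head != Z_NULL && state->head->extra != Z_NULL &&
    (len = state->head->extra_len - state->length) <
        state->head->extra_max) {
    memcpy(state->head->extra + len, next,
           len + copy > state->head->extra_max ?
               state->head->extra_max - len : copy);
}
```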
Negative windowBits arguments are eventually turned positive in
deflateInit2_ and inflateInit2_ (more precisely in inflateReset2).
Such values are used to indicate that raw deflate/inflate should
be performed.
If a user supplies INT32_MIN for windowBits, the code will compute
-INT32_MIN, which does not fit into int32_t. In fact, this is
undefined behavior in C and should be avoided.
Clearly this is a user error, but given the careful validation of
input arguments a few lines later in deflateInit2_ I think this
might be of interest.
Proof of Concept:
- Compile zlib-ng with gcc -ftrapv or -fsanitize=undefined
- Compile and run this program:
```
#include <limits.h>
#include <stdio.h>
#include <zlib-ng.h>

int main(void) {
    zng_stream de_stream = { 0 }, in_stream = { 0 };
    int result;

    result = zng_deflateInit2(&de_stream, 0, Z_DEFLATED, INT32_MIN,
                              MAX_MEM_LEVEL, Z_DEFAULT_STRATEGY);
    printf("zng_deflateInit2: %d\n", result);

    result = zng_inflateInit2(&in_stream, INT32_MIN);
    printf("zng_inflateInit2: %d\n", result);
    return 0;
}
```
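One possible guard (illustrative, not necessarily how the library resolved it) is to validate the negative range before negating:
```
/* Raw deflate/inflate only defines windowBits in -15..-8, so reject
 * anything below that range before computing -windowBits; this keeps
 * -INT32_MIN from ever being evaluated. */
if (windowBits < -15)
    return Z_STREAM_ERROR;
if (windowBits < 0) {
    wrap = 0;                  /* raw mode: no header, no checksum */
    windowBits = -windowBits;  /* now provably in 8..15, no overflow */
}
```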
An interesting revelation while benchmarking all of this is that our
chunkmemset_avx seems to be slower in a lot of use cases than
chunkmemset_sse. That will be an interesting function to attempt to
optimize.
Right now, though, we're basically beating Google for all PNG decode and
encode benchmarks. There are some variations of flags that can have us
trading blows, but we're as much as 14% faster than Chromium's zlib
patches.
While we're here, add a more direct benchmark of the folded copy method
versus the explicit copy + checksum.
Also simplify the "fold" signature, as reducing the number of rebases
and horizontal sums did not prove to be meaningfully faster (it was
slower in many circumstances).
We are guarding its usage behind a lot of preprocessor macros, as the
other methods are not yet implemented and calling this version would
implicitly bypass the faster adler implementations.
When more versions are written for faster vectorizations, the functable
entries will be populated and preprocessor macros removed. This round,
the copy + checksum is not employing as many tricks as one would hope
with a "folded" checksum routine. The reason for this is the
particularly tricky case of dealing with unaligned buffers.
Implementations for CPUs that don't pay a huge penalty for unaligned
loads will end up much faster.
Fancier methods that minimized rebasing, while having the potential to
be faster, ended up being slower because the compiler structured the
code in a way that either spilled to the stack or trampolined out of
the loop and back into it instead of just jumping over the first load
and store.
Revisiting this for AVX512, where more registers are abundant and more
advanced loads exist, may be prudent.
These are very simple wrappers that do nothing clever but serve as a
shim interface for implementing versions which do cleverly track the
number of scalar sums performed so that we can minimize rebasing and
also have an efficient copy elision.
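A minimal scalar sketch of such a wrapper, fusing the copy into the running Adler-32 sums (simplified: a real version defers the modulo over NMAX-sized blocks; the name is hypothetical):
```
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521u   /* largest prime below 2^16 */

static uint32_t adler32_fold_copy_sketch(uint32_t adler, uint8_t *dst,
                                         const uint8_t *src, size_t len) {
    uint32_t s1 = adler & 0xffff;
    uint32_t s2 = (adler >> 16) & 0xffff;
    while (len--) {
        *dst++ = *src;      /* the fused copy: no second pass needed */
        s1 += *src++;
        s2 += s1;
        s1 %= ADLER_BASE;   /* per byte only for clarity */
        s2 %= ADLER_BASE;
    }
    return (s2 << 16) | s1;
}
```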
This serves as the baseline as each vectorization gets its own commit.
That way the PR will be bisectable.
Pretty much every time updatewindow has been called, a checksum was
implicitly performed, except on s/390 or when state->wrap & 4 == 0. The
inflateSetDictionary function instead calls this checksum separately
before invoking updatewindow, and checks whether it matches the initial
checksum (which comes from parsing the DICTID section of the header).
Instead, we can give updatewindow a "copy" parameter, which is the
state->wrap value that is being checked anyway, and move the 3rd-bit
check to the caller rather than the callee.
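Illustratively, at the call sites (signatures simplified):
```
/* Before: updatewindow() decided internally from state->wrap whether
 * to checksum the bytes it copies into the window. */
updatewindow(strm, strm->next_out, out - strm->avail_out);

/* After: the caller performs the 3rd-bit test and passes it in, so
 * inflateSetDictionary() can pass 0 and skip the redundant checksum. */
updatewindow(strm, strm->next_out, out - strm->avail_out, state->wrap & 4);
```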
Currently deflate and inflate both use a common state struct. There are
several variables in this struct that we don't need for inflate, and
more may be coming in the future. Therefore, split them into two separate
structs. This in turn requires splitting ZALLOC_STATE and ZCOPY_STATE
macros.
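A sketch of the shape of the split (representative fields only; the real member lists are much longer):
```
typedef struct inflate_state_s {
    unsigned char *window;   /* sliding window: inflate needs this */
    uint32_t hold;           /* bit buffer, huffman tables, ... */
} inflate_state;

typedef struct deflate_state_s {
    uint16_t *head;          /* hash chains: deflate-only */
    uint16_t *prev;          /* pending buffer, match state, ... */
} deflate_state;

/* The alloc macros split to match (copy macros likewise): */
#define ZALLOC_INFLATE_STATE(strm) \
    ((inflate_state *)ZALLOC(strm, 1, sizeof(inflate_state)))
#define ZALLOC_DEFLATE_STATE(strm) \
    ((deflate_state *)ZALLOC(strm, 1, sizeof(deflate_state)))
```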
https://github.com/powturbo/TurboBench links zlib and zlib-ng into the
same binary, causing non-static symbol conflicts. Fix by using PREFIX()
for flush_pending(), bi_reverse(), inflate_ensure_window() and all of
the IBM Z symbols.
Note: do not use an explicit zng_, since one of the long-term goals is
to be able to link two versions of zlib-ng into the same binary for
benchmarking [1].
[1] https://github.com/zlib-ng/zlib-ng/pull/1248#issuecomment-1096648932
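The existing PREFIX() convention already provides this (simplified from zlib-ng's zbuild.h-style setup):
```
#ifdef ZLIB_COMPAT
#  define PREFIX(x) x           /* compat build keeps zlib's names */
#else
#  define PREFIX(x) zng_ ## x   /* native build gets the zng_ prefix */
#endif

/* PREFIX(flush_pending) therefore expands to zng_flush_pending or
 * flush_pending depending on the build, so the prefix is applied
 * consistently without ever being hard-coded. */
```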
This brings back a bit of the performance that may have been sacrificed
by reverting the reorganized inflate window. Doing a copy at the same
time as a CRC is basically free.
This was found to have a significant impact on a highly compressible PNG
for both the encode and decode. Some deltas show performance improving
as much as 60%+.
For scenarios where "dist" does not evenly divide our chunk size, we
simply repeat the bytes as many times as possible into our vector
registers. We then copy the entire vector and advance by the largest
multiple of dist that fits in a chunk (a sketch follows below).
If dist happens to be 1, there's no reason not to just call memset from
libc (this is likely to be just as fast, if not faster).
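A scalar model of the approach described above (the real implementation uses vector registers and zlib-ng's chunk abstractions; every name here is illustrative):
```
#include <string.h>

#define CHUNK_SIZE 16   /* stand-in for the vector register width */

static void chunkmemset_sketch(unsigned char *out, unsigned dist,
                               unsigned len) {
    if (dist == 1) {              /* byte fill: just defer to libc */
        memset(out, out[-1], len);
        return;
    }
    if (dist >= CHUNK_SIZE) {     /* pattern can't repeat within a */
        while (len--) {           /* chunk; plain byte-wise copy   */
            *out = *(out - dist);
            out++;
        }
        return;
    }
    unsigned char chunk[CHUNK_SIZE];
    unsigned filled = 0;
    /* Repeat the dist-byte pattern into the chunk as often as fits. */
    while (filled + dist <= CHUNK_SIZE) {
        memcpy(chunk + filled, out - dist, dist);
        filled += dist;
    }
    /* Store whole chunks but advance only by filled bytes, i.e. by
     * dist * (CHUNK_SIZE / dist); the overlapping bytes are rewritten
     * consistently because the pattern's period divides filled. */
    while (len >= CHUNK_SIZE) {
        memcpy(out, chunk, CHUNK_SIZE);
        out += filled;
        len -= filled;
    }
    while (len--) {               /* scalar tail */
        *out = *(out - dist);
        out++;
    }
}
```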