For reasons that aren't quite so clear, using the masked writes here
did not pipeline very well. Either setting up the mask stalled things
or masked moves have issues overlapping regular moves. Simply putting
the masked moves behind a branch that is rarely taken seemed to do the
trick and improved the ILP. While here, the masked loads were put behind
the same branch in case there was ever a hazard from overreading.
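Roughly, the shape of that change looks like the sketch below (AVX-512BW
intrinsics; rem, out, and from are illustrative names, not the actual code):

    #include <immintrin.h>
    #include <stdint.h>

    /* Illustrative only: keep the masked forms off the hot path so the
       common case is plain full-width loads and stores. */
    static inline void copy_chunk(uint8_t *out, const uint8_t *from, size_t rem) {
        if (rem < 64) {                                        /* rarely taken */
            __mmask64 k = ((uint64_t)1 << rem) - 1;            /* low rem bits set */
            __m512i v = _mm512_maskz_loadu_epi8(k, from);      /* masked load: no overread */
            _mm512_mask_storeu_epi8(out, k, v);                /* masked store: no overwrite */
        } else {
            _mm512_storeu_si512((void *)out,
                                _mm512_loadu_si512((const void *)from));
        }
    }
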
This takes advantage of the fact that on AVX512 architectures, masked
moves are incredibly cheap. There are many places where we have to
fall back to the safe C implementation of chunkcopy_safe because of the
assumed overwriting that occurs. We're able to sidestep most of the
branching needed here by simply controlling the bounds of our writes
with a mask.
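As a hedged sketch of the idea with 256-bit vectors (AVX-512VL/BW; safe_end,
out, and chunk are illustrative names), the store is clipped to the writable
span instead of branching into a scalar fallback:

    #include <immintrin.h>
    #include <stdint.h>

    /* Clip a 32-byte store to however many bytes remain writable. */
    static inline void store_bounded(uint8_t *out, const uint8_t *safe_end, __m256i chunk) {
        size_t writable = (size_t)(safe_end - out) + 1;   /* bytes we may touch */
        __mmask32 k = (writable >= 32) ? (__mmask32)0xFFFFFFFFu
                                       : (__mmask32)(((uint32_t)1 << writable) - 1);
        _mm256_mask_storeu_epi8(out, k, chunk);
    }
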
This gives us appreciable gains on a number of fronts. The first is
that we're inlining a pretty hot function that was getting dispatched
to regularly. Another is that we're able to do a safe lagged copy for
a smaller distance, so CHUNKCOPY gets its teeth back here for smaller
sizes without having to do another dispatch to a function.
We're also now doing two overlapping writes at once and letting the CPU
do its store forwarding. This was an enhancement @dougallj had suggested
a while back.
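One way to picture the overlapping-store idea (a sketch only, not the actual
code; chunk is assumed to already hold the repeating pattern built by the
permutation-table step described later, and adv is a multiple of dist):

    #include <immintrin.h>
    #include <stdint.h>

    /* Two full-width stores whose ranges overlap whenever dist does not divide
       32; the second store lands on a multiple of dist so the pattern stays
       phase-aligned, and later loads from this region can be serviced out of
       the store buffer via store forwarding. */
    static inline uint8_t *store_pair(uint8_t *out, __m256i chunk, size_t adv) {
        _mm256_storeu_si256((__m256i *)out, chunk);
        _mm256_storeu_si256((__m256i *)(out + adv), chunk);
        return out + 2 * adv;
    }
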
Additionally, the "half chunk mag" here is fundamentally less
complicated because it doesn't require synthesizing cross lane permutes
with a blend operation, so we can optimistically do that first if the
len is small enough that a full 32 byte chunk doesn't make any sense.
Put length 16 in the length checking ladder and take care of it there
since it's also a simple case to handle. We kind of went out of our way
to pretend 128 bit vectors didn't exist when using AVX2, but this can be
handled in a single instruction. Strangely, the intrinsic takes a vector
register operand while the instruction itself expects a memory operand
for the source. This also means we don't have to handle this case in our
"GET_CHUNK_MAG" function.
For most realistic use cases, this doesn't make a ton of difference.
However, for things which are highly compressible and enjoy very large
run length encodes in the window, this is a huge win.
We leverage a permutation table to swizzle the contents of the memory
chunk into a vector register and then splat that over memory with a fast
copy loop.
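In spirit (not the exact implementation), for a dist of at most 16 bytes this
looks something like the following; splat_pattern and the index-table setup
are illustrative stand-ins for the real lookup tables:

    #include <immintrin.h>
    #include <stdint.h>

    /* Replicate the dist-byte pattern preceding `out` across a vector with a
       table-driven byte shuffle, then splat it over the destination,
       advancing by the largest multiple of dist per store. */
    static uint8_t *splat_pattern(uint8_t *out, size_t dist, size_t len) {
        uint8_t idx[16];
        for (size_t i = 0; i < 16; i++)
            idx[i] = (uint8_t)(i % dist);     /* in practice this comes from a lookup table */
        __m128i perm = _mm_loadu_si128((const __m128i *)idx);
        /* reads up to 16 bytes behind/at out; chunk-style APIs assume slack,
           and the shuffle only ever selects the first dist (valid) bytes */
        __m128i src  = _mm_loadu_si128((const __m128i *)(out - dist));
        __m128i pat  = _mm_shuffle_epi8(src, perm);   /* repeating pattern, phase 0 */
        size_t adv = 16 - (16 % dist);                /* largest multiple of dist per store */
        while (len >= 16) {
            _mm_storeu_si128((__m128i *)out, pat);
            out += adv;
            len -= adv;
        }
        /* tail bytes handled elsewhere (e.g. a bounded/masked store) */
        return out;
    }
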
In essence, where this helps, it helps a lot. Where it doesn't, it does
no measurable damage to the runtime.
This commit also simplifies a chunkcopy_safe call for determining a
distance. Using labs is enough to give the same behavior as before,
with the added benefit that no predication is required _and_, most
importantly, static analysis by GCC's string fortification can't throw a
fit because it conveys better to the compiler that the input into
__builtin_memcpy will always be in range.
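A sketch of the shape of that simplification (names are illustrative, and
plain memcpy stands in for the builtin):

    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    /* Clamp the copy length by the absolute distance between the pointers, so
       the copied ranges can never overlap and the compiler can see that the
       memcpy length is always in range. No branch on pointer order is needed. */
    static inline uint8_t *copy_within(uint8_t *out, const uint8_t *from, uint64_t len) {
        uint64_t safelen = (uint64_t)labs((long)(out - from));
        uint64_t n = len < safelen ? len : safelen;
        memcpy(out, from, (size_t)n);
        return out + n;
    }
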
It would seem that on some platforms, namely those which are
!UNALIGNED64_OK, there was a likelihood of chunkmemset_safe_c copying all
the bytes before passing control flow to chunkcopy, a function which is
explicitly unsafe to be called with a zero length copy.
This fixes that bug for those platforms.
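A minimal sketch of the guard (structure assumed, not the verbatim fix, with
a function pointer standing in for the platform CHUNKCOPY routine):

    #include <stdint.h>

    /* Don't hand a zero-length copy to the chunked path: if the safe scalar
       copy already consumed everything, return early. */
    static uint8_t *finish_copy(uint8_t *out, const uint8_t *from, unsigned len,
                                uint8_t *(*chunkcopy)(uint8_t *, const uint8_t *, unsigned)) {
        if (len == 0)
            return out;
        return chunkcopy(out, from, len);
    }
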
This was found to have a significant impact on a highly compressible PNG
for both the encode and decode. Some deltas show performance improving
by 60% or more.
For the scenarios where the "dist" does not evenly divide our chunk
size, we simply repeat the bytes as many times as possible into our
vector registers. We then store the entire vector and advance by the
largest multiple of dist that fits in a chunk (dist times the quotient
of the chunk size divided by dist).
If dist happens to be 1, there's no reason to not just call memset from
libc (this is likely to be just as fast if not faster).
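For example (a sketch; out and len are illustrative names):

    #include <string.h>
    #include <stdint.h>

    /* A distance of 1 is just a byte fill: the byte right behind `out`
       repeated `len` times, which libc's memset handles as fast as anything. */
    static inline uint8_t *fill_dist1(uint8_t *out, size_t len) {
        memset(out, out[-1], len);
        return out + len;
    }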