Commit Graph

34 Commits

Author SHA1 Message Date
Hans Kristian Rosbach
509f6b5818 Since we long ago made unaligned reads safe (by using memcpy or intrinsics),
it is time to replace the UNALIGNED_OK checks, which have since really only been
used to select the optimal comparison sizes for each arch.
2024-12-21 00:46:48 +01:00
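
As an illustration of the memcpy-based unaligned reads the message refers to, a minimal sketch (the helper name is illustrative, not a zlib-ng identifier):

    #include <stdint.h>
    #include <string.h>

    /* Read 8 bytes from a possibly unaligned address. The memcpy is folded
       into a single load on architectures with fast unaligned access and into
       byte-wise loads elsewhere, so callers no longer need an UNALIGNED_OK
       guard around the comparison width they pick. */
    static inline uint64_t load_u64_unaligned(const unsigned char *p) {
        uint64_t v;
        memcpy(&v, p, sizeof(v));
        return v;
    }
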
Hans Kristian Rosbach
037ab0fd35 Revert "Since we long ago made unaligned reads safe (by using memcpy or intrinsics),"
This reverts commit 80fffd72f3.
It was mistakenly pushed to develop instead of going through a PR and the appropriate reviews.
2024-12-17 23:09:31 +01:00
Hans Kristian Rosbach
80fffd72f3 Since we long ago made unaligned reads safe (by using memcpy or intrinsics),
it is time to replace the UNALIGNED_OK checks, which have since really only been
used to select the optimal comparison sizes for each arch.
2024-12-17 23:02:32 +01:00
Adam Stylinski
43d74a223b Improve pipelining for AVX512 chunking
For reasons that aren't quite so clear, using the masked writes here
did not pipeline very well. Either setting up the mask stalled things
or masked moves have issues overlapping regular moves. Simply putting
the masked moves behind a branch that is rarely taken seemed to do the
trick in improving the ILP. While here, put masked loads behind the same
branch in case there was ever a hazard from overreading.
2024-12-10 22:17:14 +01:00
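
A rough sketch of the branch arrangement described above, assuming AVX512BW plus BMI2 and illustrative names; the hot path issues plain full-width moves and only the rarely taken tail uses the mask:

    #include <immintrin.h>

    /* Copy one chunk of up to 64 bytes. */
    static void copy_chunk_avx512(unsigned char *out, const unsigned char *from, size_t len) {
        if (len >= 64) {                      /* common case: plain 64-byte move */
            __m512i chunk = _mm512_loadu_si512((const void *)from);
            _mm512_storeu_si512((void *)out, chunk);
        } else {                              /* rare tail: masked load and store */
            __mmask64 k = (__mmask64)_bzhi_u64(~0ULL, (unsigned)len);
            __m512i chunk = _mm512_maskz_loadu_epi8(k, (const void *)from);
            _mm512_mask_storeu_epi8((void *)out, k, chunk);
        }
    }
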
Adam Stylinski
0ed5ac8289 Make an AVX512 inflate fast with low cost masked writes
This takes advantage of the fact that on AVX512 architectures, masked
moves are incredibly cheap. There are many places where we have to
fall back to the safe C implementation of chunkcopy_safe because of the
assumed overwriting that occurs. We're able to sidestep most of the branching
needed here by simply controlling the bounds of our writes with a mask.
2024-11-20 22:14:44 +01:00
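
A minimal sketch of the mask-bounded write itself, assuming AVX512BW and illustrative names; deriving a byte mask from the remaining length bounds both the load and the store, which is what lets these paths avoid the chunkcopy_safe fallback:

    #include <immintrin.h>

    /* Copy min(len, 64) bytes; the mask keeps the load and the store from
       touching anything past out + len. */
    static void copy_tail_masked(unsigned char *out, const unsigned char *from, size_t len) {
        __mmask64 k = _cvtu64_mask64((len >= 64) ? ~0ULL : ((1ULL << len) - 1));
        __m512i chunk = _mm512_maskz_loadu_epi8(k, (const void *)from);
        _mm512_mask_storeu_epi8((void *)out, k, chunk);
    }
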
Adam Stylinski
94aacd8bd6 Try to simplify the inflate loop by collapsing most cases to chunksets 2024-10-23 21:20:11 +02:00
Adam Stylinski
e874b34e1a Make chunkset_avx2 half chunk aware
This gives us appreciable gains on a number of fronts.  The first is that
we're inlining a pretty hot function that was getting dispatched to
regularly. Another is that we're able to do a safe lagged copy of a
distance that is smaller, so CHUNKCOPY gets its teeth back here for
smaller sizes, without having to do another dispatch to a function.

We're also now doing two overlapping writes at once and letting the CPU
do its store forwarding. This was an enhancement @dougallj had suggested
a while back.

Additionally, the "half chunk mag" here is fundamentally less
complicated because it doesn't require synthesizing cross-lane permutes
with a blend operation, so we can optimistically do that first if the
len is small enough that a full 32 byte chunk doesn't make any sense.
2024-10-12 13:21:03 +02:00
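
A sketch of an overlap-tolerant 128-bit lagged copy in the spirit described above, assuming dist >= 16 and illustrative names; when the copy overlaps itself, later loads read bytes an earlier store just wrote and are served through store-to-load forwarding:

    #include <immintrin.h>

    /* Lagged copy with dist >= 16, so every 16-byte load is fully defined.
       Like CHUNKCOPY, it may overwrite up to 15 bytes past out + len. */
    static unsigned char *lagged_copy16(unsigned char *out, int dist, int len) {
        const unsigned char *from = out - dist;
        while (len > 0) {
            _mm_storeu_si128((__m128i *)out, _mm_loadu_si128((const __m128i *)from));
            out  += 16;
            from += 16;
            len  -= 16;
        }
        return out;
    }
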
Adam Stylinski
b52e703417 Simplify avx2 chunkset a bit
Put length 16 in the length checking ladder and take care of it there
since it's also a simple case to handle. We kind of went out of our way
to pretend 128-bit vectors didn't exist when using AVX2, but this can be
handled in a single instruction. Strangely the intrinsic uses vector
register operands but the instruction itself assumes a memory operand
for the source. This also means we don't have to handle this case in our
"GET_CHUNK_MAG" function.
2024-10-12 13:21:03 +02:00
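
The single instruction in question is vbroadcasti128; a minimal illustration (names are illustrative):

    #include <immintrin.h>

    /* Replicate a 16-byte pattern into both 128-bit lanes of a ymm register.
       The compiler folds the load into vbroadcasti128, whose source operand
       is always memory even though the intrinsic takes a register. */
    static __m256i broadcast_halfchunk(const unsigned char *from) {
        return _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i *)from));
    }
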
Cameron Cawley
7cca3e6fd7 Inline CHUNKCOPY and CHUNKUNROLL
This slightly decreases the shared library size on x86_64 when both SSE2 and SSSE3 are enabled.
2024-02-22 20:20:42 +01:00
Vladislav Shchapov
0b856b7351 Remove always true arch conditions.
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-01-25 10:21:49 +01:00
Cameron Cawley
b09215f75a Enable use of _mm_shuffle_epi8 on machines without SSE4.1 2023-04-01 17:27:49 +02:00
Adam Stylinski
ef0cf5ca17 Improved chunkset substantially where it's heavily used
For most realistic use cases, this doesn't make a ton of difference.
However, for things which are highly compressible and enjoy very large
run length encodes in the window, this is a huge win.

We leverage a permutation table to swizzle the contents of the memory
chunk into a vector register and then splat that over memory with a fast
copy loop.

In essence, where this helps, it helps a lot.  Where it doesn't, it does
no measurable damage to the runtime.

This commit also simplifies a chunkcopy_safe call for determining a
distance.  Using labs is enough to give the same behavior as before,
with the added benefit that no predication is required _and_, most
importantly, static analysis by GCC's string fortification can't throw a
fit because it conveys better to the compiler that the input into
builtin_memcpy will always be in range.
2022-05-23 16:13:29 +02:00
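
A sketch of the permutation-table idea for a single distance (dist == 3), assuming SSSE3; the table, names, and tail handling are illustrative:

    #include <immintrin.h>
    #include <stdint.h>

    /* Swizzle the 3-byte window pattern across all 16 lanes with pshufb, then
       splat the resulting register over the output. Advancing by 15 bytes
       (five whole 3-byte periods) per store keeps the pattern in phase. */
    static void splat_dist3(unsigned char *out, const unsigned char *from, int len) {
        static const uint8_t perm3[16] = {0,1,2, 0,1,2, 0,1,2, 0,1,2, 0,1,2, 0};
        uint32_t w = from[0] | ((uint32_t)from[1] << 8) | ((uint32_t)from[2] << 16);
        __m128i pat = _mm_shuffle_epi8(_mm_cvtsi32_si128((int)w),
                                       _mm_loadu_si128((const __m128i *)perm3));
        while (len >= 16) {
            _mm_storeu_si128((__m128i *)out, pat);
            out += 15;
            len -= 15;
        }
        while (len-- > 0) {       /* byte-exact tail */
            *out = *(out - 3);
            out++;
        }
    }
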
Dženan Zukić
ae433e7ee1 Remove trailing whitespace in several source code files 2022-04-27 10:38:10 +02:00
Nathan Moinvaziri
80ab22f100 Fixed signed/unsigned warning in chunkmemset.
chunkset_tpl.h(107,24): warning C4018: '>': signed/unsigned mismatch
2022-03-27 19:17:21 +02:00
Adam Stylinski
e81c083fda Fix a latent issue with chunkmemset
It would seem that on some platforms, namely those which are
!UNALIGNED64_OK, there was a likelihood of chunkmemset_safe_c copying all
the bytes before passing control flow to chunkcopy, a function which is
explicitly unsafe to be called with a zero length copy.

This fixes that bug for those platforms.
2022-03-18 15:56:57 +01:00
Adam Stylinski
49a6bb5d41 Speed up chunkcopy and memset
This was found to have a significant impact on a highly compressible PNG
for both the encode and decode.  Some deltas show performance improving
as much as 60%+.

For the scenarios where the "dist" does not evenly divide our chunk
size, we simply repeat the bytes as many times as possible into our
vector registers.  We then copy the entire vector and then advance by the
quotient of our chunk size divided by our dist value.

If dist happens to be 1, there's no reason to not just call memset from
libc (this is likely to be just as fast if not faster).
2022-03-16 11:42:19 +01:00
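
A rough sketch of the scheme described above, using a plain byte array where the real code uses a vector register; the helper name, chunk size, and tail handling are illustrative, and 2 <= dist <= CHUNK_SIZE is assumed:

    #include <string.h>

    #define CHUNK_SIZE 16

    /* Repeat the dist-byte pattern preceding out across len bytes. */
    static void chunkmemset_sketch(unsigned char *out, unsigned dist, unsigned len) {
        const unsigned char *from = out - dist;

        if (dist == 1) {            /* single-byte period: plain memset is as fast */
            memset(out, *from, len);
            return;
        }

        unsigned char chunk[CHUNK_SIZE];
        unsigned adv = (CHUNK_SIZE / dist) * dist;   /* whole periods per chunk store */
        for (unsigned i = 0; i < CHUNK_SIZE; i++)
            chunk[i] = from[i % dist];

        /* Store the full chunk but advance by only the whole periods so the
           pattern stays in phase; finish the remainder byte by byte. */
        while (len >= CHUNK_SIZE) {
            memcpy(out, chunk, CHUNK_SIZE);
            out += adv;
            len -= adv;
        }
        while (len-- > 0) {
            *out = *(out - dist);
            out++;
        }
    }
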
Nathan Moinvaziri
363a95fb9b Introduce zmemcpy to use unaligned access for architectures we know support unaligned access, otherwise use memcpy. 2022-02-10 16:10:48 +01:00
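
A sketch along the lines the message describes (the ifdef and helper name illustrate the approach and are not copied from zlib-ng):

    #include <stdint.h>
    #include <string.h>

    static inline void zmemcpy_4_sketch(void *dst, const void *src) {
    #if defined(UNALIGNED_OK)
        /* Direct access, relying on the target tolerating unaligned loads/stores. */
        *(uint32_t *)dst = *(const uint32_t *)src;
    #else
        memcpy(dst, src, 4);        /* safe on strict-alignment targets */
    #endif
    }
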
Nathan Moinvaziri
ab0a6d9fa7 Remove zutil.h includes from many files to prevent zlib.h being included. 2022-01-29 17:03:22 +01:00
Sergey Markelov
e9d0177fea Fix hangs on macOS due to loading of misaligned addresses in chunkmemset_8. 2021-09-03 11:23:38 +02:00
Nathan Moinvaziri
b937afdc75 Remove extra division operation in chunkcopy. 2021-06-26 10:26:23 +02:00
Mika Lindqvist
21050e06c5 Cast calculation of safe length to unsigned int to avoid compiler warnings. 2021-06-21 17:45:45 +02:00
Mika Lindqvist
4521023854 [chunkcopy_safe] Don't call chunkcopy().
* chunkcopy() can read or write more than the safe length if the length is not multiple of chunk size.
2021-06-21 11:01:32 +02:00
Nathan Moinvaziri
815faea92c Reduce number of branches in partial chunk copy based on chunk size. 2021-06-18 09:19:48 +02:00
Nathan Moinvaziri
e2705f826e Added assert in chunkcopy to detect invalid length. 2021-06-13 20:56:54 +02:00
Nathan Moinvaziri
76b9605f80 Calculate from and out buffer advance only once in chunkcopy. 2021-06-13 20:56:54 +02:00
Nathan Moinvaziri
616ab24060 Only need to add rem if it is greater than zero in chunkmemset. 2021-06-13 20:56:54 +02:00
Mika Lindqvist
4af20eae03 [CHUNKMEMSET_SAFE] Precalculate "from".
* limit len to minimum of len and left
2021-06-11 19:53:08 +02:00
Mika Lindqvist
ce4409c124 [CHUNKCOPY_SAFE] Fix off-by-one error
* When chunk size was more than 8 bytes, the comparison logic failed if safe length was one less than chunk size.
2021-06-11 19:53:08 +02:00
Nathan Moinvaziri
cce302ee0b Fixed conditional expression is constant maintainer warnings.
chunkset_tpl.h(42,47): warning C4127: conditional expression is constant
  functable.c(381,44): warning C4127: conditional expression is constant
2020-11-02 17:01:58 +01:00
Mika Lindqvist
6575fbffea Remove chunkmemset_3 and chunkmemset_6 on ARM/AArch64 as they need 3 chunks...
* Don't unroll distances smaller than chunk size.
2020-09-19 09:52:01 +02:00
Nathan Moinvaziri
4bc5bd65e5 Added AVX support to chunkset functions. 2020-09-11 13:01:28 +02:00
Nathan Moinvaziri
7cffba4dd6 Rename ZLIB_INTERNAL to Z_INTERNAL for consistency. 2020-08-31 12:33:16 +02:00
Nathan Moinvaziri
d5d1f7e81b Fixed extra symbols added to ABI when zlib-compat specified. 2020-08-02 18:32:25 +02:00
Nathan Moinvaziri
2026830cea Rename from memchunk to chunkset. 2020-06-28 11:16:05 +02:00