Commit Graph

146 Commits

Author SHA1 Message Date
yintong
10b51fa592 riscv: add crc32 optimization using zbc extension
2025-04-27 18:23:50 +02:00
Adam Stylinski
46fc33f39d SSE4.1 optimized chorba
This is ~25-30% faster than the SSE2 variant on a core2 quad. The main reason
for this has to do with the fact that, while incurring far fewer shifts,
an entirely separate stack buffer has to be managed that is the size of
the L1 cache on most CPUs. This was one of the main reasons the 32k
specialized function was slower for the scalar counterpart, despite auto
vectorizing. The auto vectorized loop was setting up the stack buffer at
unaligned offsets, which is detrimental to performance pre-nehalem.
Additionally, we were losing a fair bit of time to the zero
initialization, which we are now doing more selectively.

There are a ton of loads and stores happening, and for sure we are bound
on the fill buffer + store forwarding. An SSE2 version of this code is
probably possible by simply replacing the shifts with unpacks with zero
and the palignr's with shufpd's. I'm just not sure it'll be all that worth
it, though. We are gating against SSE4.1 not because we are using specifically
a 4.1 instruction but because that marks when Wolfdale came out and palignr
became a lot faster.
2025-04-15 14:11:12 +02:00
Hans Kristian Rosbach
00a3168d5d Add AVX512 version of compare256
Improve the speed of sub-16 byte matches by first using a
128-bit intrinsic, after that use only 512-bit intrinsics.
This requires us to overlap on the last run, but this is cheaper than
processing the tail using a 256-bit and then a 128-bit run.

Change the benchmark steps so they hit the chunk boundaries of either
function less often, which gives fairer benchmarks.
2025-04-14 23:28:38 +02:00
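The run layout described in the compare256 commit above can be sketched in portable C. This is only an illustration, not the committed code: `first_mismatch` is a hypothetical scalar stand-in for one SIMD compare plus mask scan, and the offsets (16, 80, 144, then a final run at 192) are an assumption derived from 256 = 16 + 3 * 64 + 48, which forces the last 64-byte run to overlap 16 bytes that were already checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: index of the first differing byte within n bytes,
 * or n if all n bytes match. Stands in for one SIMD compare + mask scan. */
static size_t first_mismatch(const uint8_t *a, const uint8_t *b, size_t n) {
    size_t i = 0;
    while (i < n && a[i] == b[i])
        i++;
    return i;
}

/* Sketch: one 16-byte run, three 64-byte runs covering bytes 16..207,
 * then a final 64-byte run at offset 192 that re-checks bytes 192..207. */
static uint32_t compare256_overlap_sketch(const uint8_t *src0, const uint8_t *src1) {
    size_t len = first_mismatch(src0, src1, 16);
    if (len != 16)
        return (uint32_t)len;

    for (size_t off = 16; off < 208; off += 64) {
        len = first_mismatch(src0 + off, src1 + off, 64);
        if (len != 64)
            return (uint32_t)(off + len);
    }

    /* Bytes 192..207 are already known to match, so the overlap cannot
     * report a false early mismatch; any mismatch found is at >= 208. */
    len = first_mismatch(src0 + 192, src1 + 192, 64);
    return (uint32_t)(192 + len);
}
```

The overlap costs one redundant 16-byte comparison but keeps every run after the first at a single width, which is the trade-off the commit message describes.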
Adam Stylinski
724dc0cfb4 Explicit SSE2 vectorization of Chorba CRC method
The version that's currently in the generic implementation for 32768
byte buffers leverages the stack. It manages to autovectorize but
unfortunately the trips to the stack hurt its performance for CPUs which
need this the most. This version is explicitly SIMD vectorized and
doesn't use trips to the stack.  In my testing it's ~10% faster than the
"small" variant, and about 42% faster than the "32768" variant.
2025-03-28 20:43:59 +01:00
Adam Stylinski
50e9ca06e2 Fold a copy into the adler32 function for UPDATEWINDOW for neon
So a lot of alterations had to be done to make this not worse and
so far, it's not really better, either. I had to force inlining for
the adler routine, I had to remove the x4 load instruction otherwise
pipelining stalled, and I had to use restrict pointers with a copy
idiom for GCC to inline a copy routine for the tail.

Still, we see a small benefit in benchmarks, particularly with buffer
sizes of our window or larger. There's also an added benefit that
this will fix #1824.
2025-03-05 22:17:55 +01:00
Hans Kristian Rosbach
f411580733 Clean up internal crc32 function handling.
Mark crc32_c and crc32_braid functions as internal, and remove prefix.
Reorder contents of generic_functions, and remove Z_INTERNAL hints from declarations.
Add test/benchmark output to indicate whether Chorba is used.
2025-02-18 23:59:16 +01:00
Sam Russell
b33ba962c2 implement chorba algorithm 2025-02-15 14:31:50 +01:00
Cameron Cawley
721c488aff Rename most ACLE references to ARMv8 2025-02-12 13:54:30 +01:00
Hans Kristian Rosbach
bf05e882b8 Continued cleanup of old UNALIGNED_OK checks
- Remove obsolete checks
- Fix checks that are inconsistent
- Stop compiling compare256/longest_match variants that never get called
- Improve how the generic compare256 functions are handled.
- Allow overriding OPTIMAL_CMP

This simplifies the code and avoids having a lot of code in the compiled library that can never be executed.
2024-12-26 22:14:46 +01:00
Adam Stylinski
7020cb3f74 Enable AVX2 functions to be built with BMI2 instructions
While these are technically different instructions, no such CPU exists
that has AVX2 that doesn't have BMI2. Enabling BMI2 allows us to
eliminate several flag stalls by having flagless versions of shifts, and
allows us to not clobber and move around GPRs so much in scalar code.
There's usually a sizeable benefit for enabling it. Since we're building
with BMI2 for AVX2 functions, let's also just make sure the CPU claims
to support it (just to cover our bases).
2024-12-07 22:32:29 +01:00
Adam Stylinski
0ed5ac8289 Make an AVX512 inflate fast with low cost masked writes
This takes advantage of the fact that on AVX512 architectures, masked
moves are incredibly cheap. There are many places where we have to
fall back to the safe C implementation of chunkcopy_safe because of the
assumed overwriting that occurs. We're able to sidestep most of the
branching needed here by simply controlling the bounds of our writes with a mask.
2024-11-20 22:14:44 +01:00
Adam Stylinski
94aacd8bd6 Try to simplify the inflate loop by collapsing most cases to chunksets 2024-10-23 21:20:11 +02:00
Adeel Mujahid
e4fb3803af Address CR feedback 2024-09-01 15:38:30 +02:00
Adeel Mujahid
c5e7d0f373 Fix new Windows SDK build break
Co-authored-by: Jan Kotas <jkotas@microsoft.com>
2024-09-01 15:38:30 +02:00
Tulio Magno Quites Machado Filho
1a15c4b20e Fix illegal instruction usage in Xeon Phi x200 processors
The Xeon Phi x200 family of processors (Knights Landing) supports
AVX512 (F, CD, ER, PF) but does not support AVX512 (VL, DQ, BW).

Because of processors like this, the Intel Software Developer's Manual
suggests the bits AVX512 (DQ,BW,VL) are also tested in EBX together with
AVX512F before deciding to run AVX512 (DQ,BW,VL) instructions.

This also adds a new x86 feature called avx512_common that indicates
that AVX512 (F,DQ,BW,VL) are all available, and starts using it for both
the adler32_avx512 and crc32_vpclmulqdq implementations because they are
both built with -mavx512dq -mavx512bw -mavx512vl.

This has been reported downstream as
https://bugzilla.redhat.com/show_bug.cgi?id=2280347 .
2024-05-19 12:25:01 +02:00
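The combined check described in the Xeon Phi commit above can be sketched as a pure function over the EBX value returned by CPUID leaf 7, subleaf 0. The bit positions are the ones documented in the Intel SDM; the helper name `has_avx512_common` and the macro names are hypothetical and chosen for illustration only. A real detection routine would also have to confirm that the OS saves ZMM state, which this sketch leaves out.

```c
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):EBX feature bits, as documented in the Intel SDM. */
#define EBX_AVX512F  (UINT32_C(1) << 16)
#define EBX_AVX512DQ (UINT32_C(1) << 17)
#define EBX_AVX512BW (UINT32_C(1) << 30)
#define EBX_AVX512VL (UINT32_C(1) << 31)

/* Hypothetical helper: returns 1 only when F, DQ, BW and VL are all set.
 * Testing AVX512F alone would wrongly accept Knights Landing, which
 * reports F, CD, ER and PF but not DQ, BW or VL. */
static int has_avx512_common(uint32_t ebx) {
    const uint32_t need = EBX_AVX512F | EBX_AVX512DQ | EBX_AVX512BW | EBX_AVX512VL;
    return (ebx & need) == need;
}
```

Comparing against the full mask, rather than testing for any non-zero overlap, is what rejects processors that implement only part of the set.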
Vladislav Shchapov
c694bcdaf6 Add option to disable runtime CPU detection
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-03-06 23:32:15 +01:00
Hans Kristian Rosbach
9953f12e21 Move update_hash(), insert_string() and quick_insert_string() out of functable
and remove SSE4.2 and ACLE optimizations. The functable overhead is higher
than the benefit from using optimized functions.
2024-02-23 13:34:10 +01:00
Nathan Moinvaziri
a090529ece Remove deflate_state parameter from update_hash functions. 2024-02-23 13:34:10 +01:00
Vladislav Shchapov
ba9b3cdb61 Rename cpu_functions.h to arch_functions.h.
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-02-22 20:11:46 +01:00
Vladislav Shchapov
305b268b32 Move select for generic functions into generic_functions.h.
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-02-22 20:11:46 +01:00
Vladislav Shchapov
ac25a2ea6a Split CPU features checks and CPU-specific function prototypes and reduce include-dependencies.
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-02-22 20:11:46 +01:00
Nathan Moinvaziri
379eda2e80 Remove type declarations for z_stream/zng_stream from cpu_features. 2024-01-30 20:50:05 +01:00
Vladislav Shchapov
1aa53f40fc Improve x86 intrinsics dependencies.
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-01-25 10:21:49 +01:00
Vladislav Shchapov
0b856b7351 Remove always true arch conditions.
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-01-25 10:21:49 +01:00
Vladislav Shchapov
9d486b5073 Atomic functable
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2023-12-25 20:47:24 +01:00
Vladislav Shchapov
0c32ad4237 Add forced initialization of functable, because deflate captures function pointers from functable
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2023-12-21 16:12:00 +01:00
Simon Hosie
f3211aba34 Add adler32_fold_copy_rvv implementation. 2023-11-28 10:25:35 +01:00
alexsifivetw
fe6aaedaf8 General optimized chunkset 2023-09-28 00:14:26 +02:00
Cameron Cawley
16fe1f885e Add ARMv6 version of slide_hash 2023-09-16 11:11:18 +02:00
alexsifivetw
6eed7416ed Optimize adler32 using rvv 2023-07-16 12:44:25 +02:00
alexsifivetw
2f4ebe2bb6 Optimize slide_hash using RVV 2023-06-23 19:44:22 +02:00
alexsifivetw
de1b640ffb Optimize compare256 with rvv 2023-06-13 12:25:48 +02:00
Cameron Cawley
38aa575129 Ensure that unaligned compare256 variants are only used on little endian systems 2023-04-25 12:07:55 +02:00
Cameron Cawley
1ae7b0545d Rename chunkset_avx to chunkset_avx2 2023-04-19 00:35:28 +02:00
Cameron Cawley
b1aafe5c67 Clean up SSE4.2 detection 2023-04-15 15:22:36 +02:00
Cameron Cawley
b09215f75a Enable use of _mm_shuffle_epi8 on machines without SSE4.1 2023-04-01 17:27:49 +02:00
Vladislav Shchapov
20d8fa8af1 Replace global CPU feature flag variables with local variable in init_functable
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2023-03-06 13:26:09 +01:00
Vladislav Shchapov
fdb87d63a5 Split crc32 pclmulqdq and vpclmulqdq implementations
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2023-02-24 13:25:54 +01:00
Hans Kristian Rosbach
9db6a98894 Sort functable alphabetically 2023-02-17 15:11:25 +01:00
Hans Kristian Rosbach
7e1d80742e Reduce the amount of different defines required for arch-specific optimizations.
Also removed a reference to a nonexistent adler32_sse41 in test/test_adler32.cc.
2023-02-17 15:11:25 +01:00
Hans Kristian Rosbach
6f714ef422 Add missing compare256_neon activation to functable 2023-02-13 00:25:57 +01:00
Hans Kristian Rosbach
5d9ddac4cd Combine some of the checks that were not identical.
Made longest_match and compare256 use the X86_NOCHECK_SSE2 override,
thus now those are also automatically enabled on x86_64.
2023-02-13 00:25:57 +01:00
Hans Kristian Rosbach
c8a6b3ed6b Simplify functable.c 2023-02-13 00:25:57 +01:00
Pavel P
3e75a5c981 Correct inflate_fast function signature 2023-02-08 15:22:22 +01:00
Nathan Moinvaziri
c72cd309ca Remove unused chunk memory functions from functable. 2023-02-05 17:51:46 +01:00
Nathan Moinvaziri
aa1109bb2e Use arch-specific versions of inflate_fast.
This should reduce the cost of indirection that occurs when calling functable
chunk copying functions inside inflate_fast. It should also allow the compiler
to optimize the inflate fast path for the specific architecture.
2023-02-05 17:51:46 +01:00
Pavel P
d144fc06bf Rename local functable variable to ft 2023-02-03 15:50:07 +01:00
Pavel P
709a710f6f Use local functable variable instead of standalone function pointers 2023-02-03 15:50:07 +01:00
Pavel P
df60007e8e Move initialization of functable to init_functable function 2023-02-03 15:50:07 +01:00
Pavel P
fecb03a1a1 Avoid functable redefinition in functable.c
`functable` is already declared by functable.h which is included by functable.c
2023-02-03 15:50:07 +01:00