This is ~25-30% faster than the SSE2 variant on a Core 2 Quad. The main
reason is that, while this approach incurs far fewer shifts, it has to
manage an entirely separate stack buffer the size of the L1 cache on
most CPUs. This was one of the main reasons the 32k specialized function
was slower for the scalar counterpart, despite auto-vectorizing. The
auto-vectorized loop was setting up the stack buffer at unaligned
offsets, which is detrimental to performance pre-Nehalem.
Additionally, we were losing a fair bit of time to the zero
initialization, which we are now doing more selectively.
There are a ton of loads and stores happening, and we are almost
certainly bound on the fill buffers and store forwarding. An SSE2
version of this code is probably possible by simply replacing the shifts
with unpacks against zero and the palignrs with shufpds; I'm just not
sure it would be all that worth it, though. We are gating on SSE4.1 not
because we use a 4.1-specific instruction, but because that level marks
when Wolfdale came out and palignr became a lot faster.
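For illustration, a minimal sketch of the SSE2 substitutions described
above, assuming the 8-byte-offset case (the helper names are mine, not
shipped code):

    #include <emmintrin.h>  /* SSE2 */

    /* Equivalent to _mm_srli_si128(x, 8): the shift becomes an unpack
     * against zero, moving the high 64 bits down and zeroing the rest. */
    static inline __m128i shift_right_8_sse2(__m128i x) {
        return _mm_unpackhi_epi64(x, _mm_setzero_si128());
    }

    /* Equivalent to _mm_alignr_epi8(hi, lo, 8): the high half of lo becomes
     * the low half of the result, the low half of hi becomes the high half. */
    static inline __m128i alignr8_sse2(__m128i hi, __m128i lo) {
        return _mm_castpd_si128(
            _mm_shuffle_pd(_mm_castsi128_pd(lo), _mm_castsi128_pd(hi), 1));
    }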
Improve the speed of sub-16 byte matches by first using a 128-bit
intrinsic; after that, use only 512-bit intrinsics. This requires us to
overlap on the last run, but that is cheaper than processing the tail
with a 256-bit and then a 128-bit run.
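A minimal sketch of the overlapped-tail idea, using a plain copy as the
per-chunk operation and assuming len >= 64 (the function name is
illustrative, not the actual match code):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Process full 64-byte chunks; instead of finishing with 256- and
     * 128-bit runs, redo the last 64 bytes with an overlapping load/store. */
    static void process_overlapped(uint8_t *dst, const uint8_t *src, size_t len) {
        size_t i = 0;
        for (; i + 64 <= len; i += 64) {
            __m512i v = _mm512_loadu_si512((const void *)(src + i));
            _mm512_storeu_si512((void *)(dst + i), v);
        }
        if (i < len) {
            /* Overlaps bytes already handled; harmless for an idempotent op. */
            __m512i v = _mm512_loadu_si512((const void *)(src + len - 64));
            _mm512_storeu_si512((void *)(dst + len - 64), v);
        }
    }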
Change the benchmark steps to avoid hitting the chunk boundaries of one
function or the other as much; this gives fairer benchmarks.
The version that's currently in the generic implementation for 32768-byte
buffers leverages the stack. It manages to autovectorize, but
unfortunately the trips to the stack hurt its performance on the CPUs
that need this the most. This version is explicitly SIMD vectorized and
makes no trips to the stack. In my testing it's ~10% faster than the
"small" variant, and about 42% faster than the "32768" variant.
A lot of alterations had to be made to keep this from being worse, and
so far it's not really better, either. I had to force inlining for the
adler routine, remove the x4 load instruction (otherwise pipelining
stalled), and use restrict pointers with a copy idiom so that GCC would
inline a copy routine for the tail.
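A minimal sketch of the restrict + copy idiom mentioned above (the
helper name is mine):

    #include <stddef.h>
    #include <stdint.h>

    /* With restrict-qualified pointers the compiler knows the two regions
     * cannot overlap, so this plain byte loop is easy to inline and
     * vectorize for short tails instead of going through an out-of-line
     * memcpy call. */
    static inline void copy_tail(uint8_t *restrict dst,
                                 const uint8_t *restrict src, size_t len) {
        for (size_t i = 0; i < len; i++)
            dst[i] = src[i];
    }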
Still, we see a small benefit in benchmarks, particularly when done at
our window size or larger. There's also an added benefit that this will
fix #1824.
Mark crc32_c and crc32_braid functions as internal, and remove prefix.
Reorder contents of generic_functions, and remove Z_INTERNAL hints from declarations.
Add test/benchmark output to indicate whether Chorba is used.
- Remove obsolete checks
- Fix checks that are inconsistent
- Stop compiling compare256/longest_match variants that never get called
- Improve how the generic compare256 functions are handled.
- Allow overriding OPTIMAL_CMP (see the sketch after this list)
This simplifies the code and avoids having a lot of code in the compiled library that can never get executed.
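A minimal sketch of what an overridable OPTIMAL_CMP default could look
like; the values and conditions below are illustrative assumptions, not
the project's actual logic:

    /* Honor a width given on the command line (e.g. -DOPTIMAL_CMP=32),
     * otherwise fall back to an architecture default. */
    #ifndef OPTIMAL_CMP
    #  if defined(__x86_64__) || defined(_M_X64) || defined(__aarch64__) || defined(_M_ARM64)
    #    define OPTIMAL_CMP 64
    #  else
    #    define OPTIMAL_CMP 32
    #  endif
    #endif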
While these are technically different instruction sets, no CPU exists
that has AVX2 but not BMI2. Enabling BMI2 allows us to eliminate several
flag stalls by using the flagless versions of shifts, and lets us avoid
clobbering and moving around GPRs so much in scalar code.
There's usually a sizeable benefit from enabling it. Since we're building
the AVX2 functions with BMI2, let's also make sure the CPU claims to
support it (just to cover our bases).
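A minimal sketch of the combined runtime check, using __get_cpuid_count
and the bit_* constants from GCC/Clang's cpuid.h (the function name is
illustrative; OS support for AVX state saving is assumed to be checked
elsewhere):

    #include <cpuid.h>

    /* Only report AVX2 as usable when BMI2 is present as well, since the
     * AVX2 code paths are compiled with BMI2 enabled. */
    static int cpu_has_avx2_bmi2(void) {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return 0;
        return (ebx & bit_AVX2) != 0 && (ebx & bit_BMI2) != 0;
    }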
This takes advantage of the fact that on AVX512 architectures, masked
moves are incredibly cheap. There are many places where we had to fall
back to the safe C implementation of chunkcopy_safe because of the
assumed overwriting that occurs. We are able to sidestep most of the
branching needed here by simply controlling the bounds of our writes
with a mask.
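A minimal sketch of a mask-bounded copy, assuming AVX512BW and len <= 64
(the helper name is mine):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* A mask with the low `len` bits set confines the load and store to
     * exactly `len` bytes, so nothing past the end of dst is written even
     * though a full 64-byte register is used. */
    static inline void copy_bounded(uint8_t *dst, const uint8_t *src, size_t len) {
        __mmask64 k = (len >= 64) ? ~(__mmask64)0
                                  : (((__mmask64)1 << len) - 1);
        __m512i v = _mm512_maskz_loadu_epi8(k, src);
        _mm512_mask_storeu_epi8(dst, k, v);
    }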
The Xeon Phi x200 family of processors (Knights Landing) supports
AVX512 (F, CD, ER, PF) but does not support AVX512 (VL, DQ, BW).
Because of processors like this, the Intel Software Developer's Manual
suggests that the AVX512 (DQ, BW, VL) bits also be tested in EBX together
with AVX512F before deciding to run AVX512 (DQ, BW, VL) instructions.
This also adds a new x86 feature called avx512_common, indicating that
AVX512 (F, DQ, BW, VL) are all available, and starts using it for both
the adler32_avx512 and crc32_vpclmulqdq implementations, because they
are both built with -mavx512dq -mavx512bw -mavx512vl.
This has been reported downstream as
https://bugzilla.redhat.com/show_bug.cgi?id=2280347 .
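A minimal sketch of the combined detection described above, again using
the bit_* constants from GCC/Clang's cpuid.h (the function name is
illustrative):

    #include <cpuid.h>

    /* AVX512 F, DQ, BW and VL are all reported in EBX of leaf 7,
     * sub-leaf 0, so a single mask covers the whole "avx512_common"
     * group. */
    static int cpu_has_avx512_common(void) {
        unsigned int eax, ebx, ecx, edx;
        const unsigned int mask = bit_AVX512F | bit_AVX512DQ |
                                  bit_AVX512BW | bit_AVX512VL;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return 0;
        return (ebx & mask) == mask;
    }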
This should reduce the cost of the indirection that occurs when calling
the functable chunk-copying functions inside inflate_fast. It should also
allow the compiler to optimize the inflate fast path for the specific
architecture.
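An illustrative sketch of the indirection being removed; the names are
placeholders, not the library's actual API:

    #include <stdint.h>
    #include <string.h>

    typedef uint8_t *(*chunkcopy_fn)(uint8_t *out, const uint8_t *from, unsigned len);

    static uint8_t *chunkcopy_generic(uint8_t *out, const uint8_t *from, unsigned len) {
        memcpy(out, from, len);
        return out + len;
    }

    /* Before: the hot loop dispatches through a function pointer chosen at
     * runtime, so the call cannot be inlined or specialized. */
    static struct { chunkcopy_fn chunkcopy; } functable = { chunkcopy_generic };

    static uint8_t *copy_via_functable(uint8_t *out, const uint8_t *from, unsigned len) {
        return functable.chunkcopy(out, from, len);
    }

    /* After: an arch-specific build of the fast loop calls its chunk copy
     * directly, so the compiler can inline it and optimize the surrounding
     * loop for that target. */
    static inline uint8_t *copy_direct(uint8_t *out, const uint8_t *from, unsigned len) {
        return chunkcopy_generic(out, from, len);
    }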