Commit Graph

179 Commits

Author SHA1 Message Date
Nathan Moinvaziri
dee0ff75f8 Remove NMake build projects
Reduce development burden by getting rid of NMake files that are manually
kept up to date. For continued NMake support please generate NMake project
files using CMake.
2025-04-14 23:18:18 +02:00
Adam Stylinski
724dc0cfb4 Explicit SSE2 vectorization of Chorba CRC method
The version that's currently in the generic implementation for 32768
byte buffers leverages the stack. It manages to autovectorize but
unfortunately the trips to the stack hurt its performance on the CPUs that
need this the most. This version is explicitly SIMD vectorized and avoids
trips to the stack.  In my testing it's ~10% faster than the
"small" variant, and about 42% faster than the "32768" variant.
2025-03-28 20:43:59 +01:00
Sam Russell
b33ba962c2 implement chorba algorithm 2025-02-15 14:31:50 +01:00
Cameron Cawley
721c488aff Rename most ACLE references to ARMv8 2025-02-12 13:54:30 +01:00
Cameron Cawley
d7e121e56b Use GCC's may_alias attribute for unaligned memory access 2024-12-24 12:55:44 +01:00
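A minimal sketch of the technique referenced above, assuming GCC or Clang; the typedef and function names are illustrative, not zlib-ng's actual ones:

    #include <stdint.h>

    /* A 32-bit integer type that may alias anything and carries no alignment
     * requirement, so dereferencing it through an arbitrary byte pointer is
     * well-defined for the compiler. */
    typedef uint32_t __attribute__((may_alias, aligned(1))) unaligned_u32;

    static inline uint32_t load32_unaligned(const void *p) {
        return *(const unaligned_u32 *)p;  /* compiler emits an unaligned load */
    }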
Vladislav Shchapov
775053110c Rename cpu_functions.h to arch_functions.h in depcheck.cpp
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-03-04 11:59:09 +01:00
Mika Lindqvist
1d08728c52 Cleanup and update NMake Makefiles.
* Add depcheck.exe to validate NMake Makefiles
2024-02-24 14:38:49 +01:00
Hans Kristian Rosbach
9953f12e21 Move update_hash(), insert_string() and quick_insert_string() out of functable
and remove SSE4.2 and ACLE optimizations. The functable overhead is higher
than the benefit from using optimized functions.
2024-02-23 13:34:10 +01:00
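To illustrate the trade-off described above (all names here are hypothetical, not zlib-ng's actual layout): a tiny helper reached through a function-pointer table pays an indirect call on every invocation, while a plain static inline version costs nothing to call.

    #include <stdint.h>

    /* Dispatch table: every call to update_hash goes through a pointer,
     * which the compiler cannot inline. */
    struct functable_s {
        uint32_t (*update_hash)(uint32_t h, uint32_t val);
    };

    /* Direct, inlinable fallback: for a helper this small, the indirect
     * call above can cost more than any optimized variant would save. */
    static inline uint32_t update_hash_c(uint32_t h, uint32_t val) {
        return ((h << 5) ^ val) & 0x7fffu;  /* illustrative hash, not zlib-ng's */
    }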
Vladislav Shchapov
ac25a2ea6a Split CPU features checks and CPU-specific function prototypes and reduce include-dependencies.
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-02-22 20:11:46 +01:00
Nathan Moinvaziri
fc63426372 Update copyright years in other source files. 2024-02-07 19:15:56 +01:00
Mark Adler
6c7b9a4c9b Update copyright year in win32 resource files.
madler/zlib#8988e03256e9c80766ac6899e86c3bc57c347efc
2024-02-07 19:15:56 +01:00
Hans Kristian Rosbach
0fddd5f125 Rename crc32_braid.c to crc32.c 2024-02-05 08:17:33 +01:00
Hans Kristian Rosbach
86250d40fa Move compare256 and longest_match C fallbacks to arch/generic 2024-01-19 16:58:53 +01:00
Hans Kristian Rosbach
3416e44ba1 Move slide_hash C fallback to arch/generic 2024-01-19 16:58:53 +01:00
Hans Kristian Rosbach
9a1722a22f Move insert_string and update_hash C fallbacks to arch/generic.
Also add missing insert_string dependencies to Windows makefiles.
2024-01-19 16:58:53 +01:00
Hans Kristian Rosbach
30856c33bf Move chunkset and inffast C fallbacks to arch/generic 2024-01-19 16:58:53 +01:00
Hans Kristian Rosbach
06895bc1b3 Move crc32 C fallbacks to arch/generic 2024-01-19 15:22:34 +01:00
Hans Kristian Rosbach
4e132cc0ec Move adler32 C fallbacks to arch/generic 2024-01-19 15:22:34 +01:00
Hans Kristian Rosbach
6f38b4c5fc Simplify includes 2024-01-19 15:22:34 +01:00
Cameron Cawley
16fe1f885e Add ARMv6 version of slide_hash 2023-09-16 11:11:18 +02:00
Hans Kristian Rosbach
2167377c46 Clean up SSE4.2 support, and no longer use asm fallback or gcc builtin.
Defines with changed meanings:
X86_SSE42 used to mean the compiler supports the CRC asm fallback.
X86_SSE42_CRC_INTRIN used to mean the compiler supports SSE4.2 intrinsics.

X86_SSE42 now means the compiler supports SSE4.2 intrinsics.

This therefore also fixes the adler32_sse42 checks, since those depend on
SSE4.2 intrinsics but were mistakenly checking the X86_SSE42 define.
Now the X86_SSE42 define actually means what it appears to.
2023-08-06 10:17:24 +02:00
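Under the new meaning, the define can guard SSE4.2 intrinsics directly; a hedged sketch (guard layout and function name illustrative):

    #ifdef X86_SSE42
    #  include <nmmintrin.h>  /* SSE4.2 intrinsics */
    #  include <stdint.h>

    /* One step of a hardware-assisted CRC-32C via the SSE4.2 crc32 instruction,
     * the kind of primitive the hash helpers used. */
    static inline uint32_t crc32c_step(uint32_t crc, uint32_t word) {
        return (uint32_t)_mm_crc32_u32(crc, word);
    }
    #endif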
Cameron Cawley
1ae7b0545d Rename chunkset_avx to chunkset_avx2 2023-04-19 00:35:28 +02:00
Cameron Cawley
b09215f75a Enable use of _mm_shuffle_epi8 on machines without SSE4.1 2023-04-01 17:27:49 +02:00
Vladislav Shchapov
fdb87d63a5 Split crc32 pclmulqdq and vpclmulqdq implementations
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2023-02-24 13:25:54 +01:00
Hans Kristian Rosbach
7e1d80742e Reduce the amount of different defines required for arch-specific optimizations.
Also removed a reference to a nonexistent adler32_sse41 in test/test_adler32.cc.
2023-02-17 15:11:25 +01:00
Nathan Moinvaziri
aa1109bb2e Use arch-specific versions of inflate_fast.
This should reduce the cost of indirection that occurs when calling functable
chunk copying functions inside inflate_fast. It should also allow the compiler
to optimize the inflate fast path for the specific architecture.
2023-02-05 17:51:46 +01:00
Mika T. Lindqvist
d5db5aa985 Sync with zlib 1.2.13 and declare compatibility. 2023-02-03 15:49:02 +01:00
Viktor Szakats
89763032d5 cmake: respect custom RC flags and delete GCC_WINDRES
Before this patch, `zlib.rc` was compiled using a manual command [1] when
using the MinGW (and MSYS/Cygwin) toolchains. This method ignores
`CMAKE_RC_FLAGS` and offers no other way to pass a custom flag, breaking
the build in cases where a custom `windres` option is required, e.g.
`--target=` or `-I` on some platforms and configurations, in particular
with `llvm-windres`.

This patch deletes the special case for these toolchains and lets CMake
compile the `.rc` file the default way used for all Windows targets.

I'm not entirely sure why this special case was added back in 2011. My
suspicion is the need to pass `-DGCC_WINDRES`. We could resolve that much
more simply by adding this line for the targets that require it:
   set(CMAKE_RC_FLAGS "${CMAKE_RC_FLAGS} -DGCC_WINDRES")

But the `.rc` line protected by `GCC_WINDRES` works just fine with
`windres` these days. Moreover, that protected line contains obsolete flags
from the 16-bit era, which have had no effect for a long time, as documented here:
<https://docs.microsoft.com/windows/win32/menurc/common-resource-attributes>

So, this patch deletes `GCC_WINDRES` from the project entirely.

[1] dc5a43e
2022-08-17 14:42:34 +02:00
Hans Kristian Rosbach
2f4e2372a2 Simplify zlib-ng native API by removing version and struct size checks.
This should be backwards compatible with applications compiled for 2.0.x.
2022-06-03 10:21:01 +02:00
Nathan Moinvaziri
a6155234a2 Speed up software CRC-32 computation by a factor of 1.5 to 3.
Use the interleaved method of Kadatch and Jenkins in order to make
use of pipelined instructions through multiple ALUs in a single
core. This also speeds up and simplifies the combination of CRCs,
and updates the functions to pre-calculate and use an operator for
CRC combination.

Co-authored-by: Nathan Moinvaziri <nathan@nathanm.com>
2022-05-25 12:04:35 +02:00
Adam Stylinski
d79984b5bc Adding avx512_vnni inline + copy elision
An interesting revelation while benchmarking all of this is that our
chunkmemset_avx seems to be slower in a lot of use cases than
chunkmemset_sse.  That will be an interesting function to attempt to
optimize.

Right now though, we're basically beating Google for all PNG decode and
encode benchmarks.  There are some variations of flags that can
basically have us trading blows, but we're as much as 14% faster
than chromium's zlib patches.

While we're here, add a more direct benchmark of the folded copy method
versus the explicit copy + checksum.
2022-05-23 16:13:39 +02:00
Adam Stylinski
b8269bb7d4 Added inlined AVX512 adler checksum + copy
While we're here, also simplify the "fold" signature, as reducing the
number of rebases and horizontal sums did not prove to be meaningfully
faster (slower in many circumstances).
2022-05-23 16:13:39 +02:00
Adam Stylinski
21f461e238 Adding an SSE42 optimized copy + adler checksum implementation
We are guarding its usage behind a lot of preprocessor macros, as the
other methods are not yet implemented and calling this version would
implicitly bypass the faster Adler-32 implementations.

When more versions are written for faster vectorizations, the functable
entries will be populated and the preprocessor macros removed. This round,
the copy + checksum does not employ as many tricks as one would hope for
in a "folded" checksum routine.  The reason for this is the particularly
tricky case of dealing with unaligned buffers.  Architectures that don't
have CPUs in the mix with a huge penalty for unaligned loads will get a
much faster implementation.

Fancier methods that minimized rebasing, while having the potential to
be faster, ended up being slower because the compiler structured the
code in a way that either spilled to the stack or trampolined out of the
loop and back into it instead of just jumping over the first load and
store.

Revisiting this for AVX512, where more registers are abundant and more
advanced loads exist, may be prudent.
2022-05-23 16:13:39 +02:00
Adam Stylinski
b1389ac2d5 Create adler32_fold_c* functions
These are very simple wrappers that do nothing clever but serve as a
shim interface for implementing versions which do cleverly track the
number of scalar sums performed so that we can minimize rebasing and
also have an efficient copy elision.

This serves as the baseline as each vectorization gets its own commit.
That way the PR will be bisectable.
2022-05-23 16:13:39 +02:00
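A minimal sketch of what such a baseline wrapper can look like: a plain scalar Adler-32 accumulated while copying, with none of the vector tricks added in the follow-up commits (function name and modulus handling are illustrative, not zlib-ng's exact code):

    #include <stddef.h>
    #include <stdint.h>

    #define ADLER_MOD 65521u  /* largest prime below 2^16 */

    /* Checksum the source while copying it into dst: the copy-elision baseline. */
    static uint32_t adler32_fold_copy_c(uint32_t adler, uint8_t *dst,
                                        const uint8_t *src, size_t len) {
        uint32_t s1 = adler & 0xffff;
        uint32_t s2 = (adler >> 16) & 0xffff;
        for (size_t i = 0; i < len; i++) {
            dst[i] = src[i];                 /* copy ...                       */
            s1 = (s1 + src[i]) % ADLER_MOD;  /* ... and accumulate in one pass */
            s2 = (s2 + s1) % ADLER_MOD;
        }
        return (s2 << 16) | s1;
    }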
Nathan Moinvaziri
48f346e806 Implement neon version of compare256.
Co-authored-by: Adam Stylinski <kungfujesus06@gmail.com>
2022-05-06 12:19:35 +02:00
Nathan Moinvaziri
445284c570 Fixed missing crc32_combine exports for zlib 1.2.12. 2022-04-05 13:43:29 +02:00
Nathan Moinvaziri
e38c493337 Move UNALIGNED_OK detection to compile time instead of configure time. 2022-03-17 11:03:26 +01:00
Adam Stylinski
b3260fd0c8 Axe the SSE4 compare256 functions 2022-02-11 09:56:19 +01:00
Nathan Moinvaziri
cc361feaad Rename CPU feature header and source files for consistency. 2022-02-06 16:52:10 +01:00
Adam Stylinski
9146bd472c Marginal improvement by pipelining loads on NEON
The ld1 {4 reg} variant saves us instructions
and adds only 3 cycles of latency to load 3
more NEON/ASIMD registers' worth of data.
2022-02-01 13:31:00 +01:00
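A hedged sketch of the load being described, assuming a compiler that provides the ACLE vld1q_u8_x4 intrinsic (function name illustrative):

    #include <arm_neon.h>
    #include <stdint.h>

    /* Fill four NEON/ASIMD registers (64 bytes) with a single
     * ld1 {v0.16b-v3.16b}, [x0] instead of four separate vld1q_u8 loads. */
    static inline uint8x16x4_t load_four_vectors(const uint8_t *p) {
        return vld1q_u8_x4(p);
    }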
Nathan Moinvaziri
6f179fd301 Added adler32, compare256, crc32, and slide_hash benchmarks using Google Benchmark.
Co-authored-by: Adam Stylinski <kungfujesus06@gmail.com>
2022-01-17 09:10:02 +01:00
Nathan Moinvaziri
66506ace8d Convert compare258 to compare256 and move the 2-byte check into deflate_quick. Prevents having multiple compare258 functions with 2-byte checks. 2022-01-16 17:30:15 +01:00
Nathan Moinvaziri
76c2ddf201 Remove unmaintained and outdated DLL FAQ. 2022-01-14 20:48:15 +01:00
Nathan Moinvaziri
2af7ead293 Rename x86 source files with instruction set version. 2022-01-14 20:43:03 +01:00
Nathan Moinvaziri
ff52d46714 Remove old win32 readme. 2022-01-06 22:13:44 +01:00
Dženan Zukić
714f624d79 Add support for name mangling
This is useful when zlib-ng is embedded into another library,
such as ITK: https://itk.org/

Closes #1025.

Co-authored-by: Mika Lindqvist <postmaster@raasu.org>
2021-10-09 09:19:55 +02:00
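A hedged sketch of the general idea (prefix and macro names are hypothetical, not the project's generated header): public symbols are remapped at preprocessing time so an embedding project can carry its own copy without symbol clashes.

    /* Illustrative mangling header, not zlib-ng's actual one. */
    #ifndef EXAMPLE_NAME_MANGLING_H
    #define EXAMPLE_NAME_MANGLING_H

    #define ZMANGLE(sym) itkzng_##sym  /* hypothetical per-project prefix */

    #define deflate     ZMANGLE(deflate)
    #define deflateEnd  ZMANGLE(deflateEnd)
    #define inflate     ZMANGLE(inflate)
    #define inflateEnd  ZMANGLE(inflateEnd)
    #define adler32     ZMANGLE(adler32)
    #define crc32       ZMANGLE(crc32)

    #endif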
Nathan Moinvaziri
d802e8900f Move crc32 folding functions into functable. 2021-08-13 15:05:34 +02:00
Nathan Moinvaziri
12a975ac9f Rename slide source files to slide_hash to match function name. 2021-07-08 09:33:41 +02:00
Nathan Moinvaziri
e52d08ea92 Separate slide_hash_c in the same way that insert_string_c is separated from deflate.c. 2021-07-08 09:33:41 +02:00
Nathan Moinvaziri
6948789969 Added rolling hash functions for hash table. 2021-06-25 20:09:14 +02:00