Reduce development burden by getting rid of NMake files that are manually
kept up to date. For continued NMake support please generate NMake project
files using CMake.
The version currently in the generic implementation for 32768-byte
buffers uses the stack. It manages to autovectorize, but unfortunately
the trips through the stack hurt its performance on the CPUs that need
this the most. This version is explicitly SIMD vectorized and avoids
trips to the stack. In my testing it's ~10% faster than the "small"
variant, and about 42% faster than the "32768" variant.
Defines changing meaning:
X86_SSE42 used to mean the compiler supports crc asm fallback.
X86_SSE42_CRC_INTRIN used to mean compiler supports SSE4.2 intrinsics.
X86_SSE42 now means compiler supports SSE4.2 intrinsics.
This therefore also fixes the adler32_sse42 checks, since those depend
on SSE4.2 intrinsics but were mistakenly checking the X86_SSE42 define.
Now the X86_SSE42 define actually means what it appears to.
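A minimal sketch of what the new guard implies (the define comes from
this change; the function around it is hypothetical):

    #ifdef X86_SSE42
    #  include <nmmintrin.h>   /* SSE4.2 intrinsics */
    #  include <stdint.h>
    /* e.g. the hardware CRC32C instruction is reachable under this guard */
    static inline uint32_t crc32c_byte(uint32_t crc, uint8_t b) {
        return _mm_crc32_u8(crc, b);
    }
    #endif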
This should reduce the cost of indirection that occurs when calling functable
chunk copying functions inside inflate_fast. It should also allow the compiler
to optimize the inflate fast path for the specific architecture.
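To make the indirection concrete (simplified, hypothetical names, not
zlib-ng's actual symbols): an indirect call through the functable cannot
be inlined into the hot loop, while a compile-time-selected
implementation can be:

    #include <stdint.h>
    #include <string.h>

    /* stand-in for one architecture's chunk copy */
    static uint8_t *chunkcopy_impl(uint8_t *out, const uint8_t *from,
                                   unsigned len) {
        memcpy(out, from, len);
        return out + len;
    }

    /* before: an opaque call through a function pointer per chunk */
    struct functable_s {
        uint8_t *(*chunkcopy)(uint8_t *out, const uint8_t *from, unsigned len);
    };
    static struct functable_s functable = { chunkcopy_impl };
    #define CHUNKCOPY_INDIRECT(o, f, l) functable.chunkcopy((o), (f), (l))

    /* after: resolved at compile time per architecture, so the call can
       be inlined and optimized together with the inflate_fast loop */
    #define CHUNKCOPY_DIRECT(o, f, l) chunkcopy_impl((o), (f), (l))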
Before this patch, `zlib.rc` was compiled using a manual command [1] when
using the MinGW (and MSYS/Cygwin) toolchains. This method ignores
`CMAKE_RC_FLAGS` and offers no other way to pass a custom flag, breaking
the build in cases where a custom `windres` option is required. E.g.
`--target=` or `-I` on some platforms and configurations, in particular
with `llvm-windres`.
This patch deletes the special case for these toolchains and lets CMake
compile the `.rc` file the default way used for all Windows targets.
I'm not entirely sure why this special case was added back in 2011. My
suspicion is the need to pass `-DGCC_WINDRES`. That could be solved much
more simply by adding this line for the targets that require it:
    set(CMAKE_RC_FLAGS "${CMAKE_RC_FLAGS} -DGCC_WINDRES")
But the `.rc` line protected by `GCC_WINDRES` these days works just fine
with `windres`. Moreover, that protected line contains obsolete flags
from the 16-bit era, which have had no effect for a long time, as
documented here:
<https://docs.microsoft.com/windows/win32/menurc/common-resource-attributes>
So, this patch deletes `GCC_WINDRES` from the project entirely.
[1] dc5a43e
Use the interleaved method of Kadatch and Jenkins in order to make
use of pipelined instructions through multiple ALUs in a single
core. This also speeds up and simplifies the combination of CRCs,
and updates the functions to pre-calculate and use an operator for
CRC combination.
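A minimal scalar sketch of the combination math, using zlib's reflected
polynomial (illustrative names; the real code additionally pre-computes
the per-length operator once so it can be reused):

    #include <stddef.h>
    #include <stdint.h>

    #define CRC_POLY 0xedb88320U    /* reflected CRC-32 polynomial */

    /* GF(2) polynomial multiplication modulo CRC_POLY; a must be nonzero */
    static uint32_t multmodp(uint32_t a, uint32_t b) {
        uint32_t m = (uint32_t)1 << 31, p = 0;
        for (;;) {
            if (a & m) {
                p ^= b;
                if ((a & (m - 1)) == 0)
                    break;
            }
            m >>= 1;
            b = (b & 1) ? (b >> 1) ^ CRC_POLY : (b >> 1);
        }
        return p;
    }

    /* x^(8n) mod P by square-and-multiply; in this reflected
       representation x^k is bit 31-k, so x^8 == 1 << 23 */
    static uint32_t x8nmodp(size_t n) {
        uint32_t x = (uint32_t)1 << 23;   /* x^8 */
        uint32_t p = (uint32_t)1 << 31;   /* x^0 == 1 */
        while (n) {
            if (n & 1)
                p = multmodp(x, p);
            x = multmodp(x, x);
            n >>= 1;
        }
        return p;
    }

    /* crc(A||B): shift crc(A) past 8*len(B) zero bits, then xor crc(B).
       Pre-calculating op = x8nmodp(len2) is what makes repeated
       combinations at a fixed length cheap. */
    static uint32_t crc32_combine_sketch(uint32_t crc1, uint32_t crc2,
                                         size_t len2) {
        return multmodp(x8nmodp(len2), crc1) ^ crc2;
    }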
Co-authored-by: Nathan Moinvaziri <nathan@nathanm.com>
An interesting revelation while benchmarking all of this is that our
chunkmemset_avx seems to be slower than chunkmemset_sse in a lot of use
cases. That will be an interesting function to try to optimize.
Right now, though, we're basically beating Google on all PNG decode and
encode benchmarks. Some combinations of flags basically have us trading
blows, but we're as much as 14% faster than Chromium's zlib patches.
While we're here, add a more direct benchmark of the folded copy method
versus the explicit copy + checksum.
While we're here, also simplify the "fold" signature, as reducing the
number of rebases and horizontal sums did not prove meaningfully faster
(it was slower in many circumstances).
We are guarding its usage behind a lot of preprocessor macros, as the
other methods are not yet implemented and calling this version would
implicitly bypass the faster adler implementations.
When more versions are written for faster vectorizations, the functable
entries will be populated and the preprocessor macros removed. This
round, the copy + checksum does not employ as many tricks as one would
hope for in a "folded" checksum routine. The reason is the particularly
tricky case of dealing with unaligned buffers. Implementations targeting
CPUs that don't carry a huge penalty for unaligned loads can be made
much faster.
Fancier methods that minimized rebasing, while having the potential to
be faster, ended up being slower because the compiler structured the
code in a way that either spilled to the stack or trampolined out of the
loop and back into it instead of simply jumping over the first load and
store.
Revisiting this for AVX512, where more registers are abundant and more
advanced loads exist, may be prudent.
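For reference, the essence of a folded copy in plain scalar form
(illustrative names; the vectorized versions keep the same structure,
just with SIMD partial sums):

    #include <stddef.h>
    #include <stdint.h>

    #define ADLER_BASE 65521U   /* largest prime below 65536 */
    #define ADLER_NMAX 5552     /* max bytes before s1/s2 must be reduced */

    static uint32_t adler32_copy_fold(uint32_t adler, uint8_t *dst,
                                      const uint8_t *src, size_t len) {
        uint32_t s1 = adler & 0xffff;
        uint32_t s2 = (adler >> 16) & 0xffff;
        while (len > 0) {
            size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
            len -= n;
            while (n--) {
                *dst++ = *src;     /* the copy rides along with the sums */
                s1 += *src++;
                s2 += s1;
            }
            s1 %= ADLER_BASE;      /* the "rebase": deferred reduction */
            s2 %= ADLER_BASE;
        }
        return (s2 << 16) | s1;
    }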
These are very simple wrappers that do nothing clever, but they serve
as a shim interface for implementing versions that cleverly track the
number of scalar sums performed, so that we can minimize rebasing and
also elide copies efficiently.
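A sketch of such a shim (hypothetical names, assuming a plain adler32
implementation exists elsewhere):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* assumed to exist: the ordinary scalar checksum */
    extern uint32_t adler32_c(uint32_t adler, const uint8_t *buf, size_t len);

    static uint32_t adler32_fold_copy_c(uint32_t adler, uint8_t *dst,
                                        const uint8_t *src, size_t len) {
        memcpy(dst, src, len);              /* nothing clever: copy... */
        return adler32_c(adler, dst, len);  /* ...then checksum */
    }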
This serves as the baseline as each vectorization gets its own commit.
That way the PR will be bisectable.
This is useful when zlib-ng is embedded into another library,
such as ITK: https://itk.org/
Closes #1025.
Co-authored-by: Mika Lindqvist <postmaster@raasu.org>