This is ~25-30% faster than the SSE2 variant on a core2 quad. The main reason
for this has to do with the fact that, while incurring far fewer shifts,
an entirely separate stack buffer has to be managed that is the size of
the L1 cache on most CPUs. This was one of the main reasons the 32k
specialized function was slower for the scalar counterpart, despite auto
vectorizing. The auto vectorized loop was setting up the stack buffer at
unaligned offsets, which is detrimental to performance pre-nehalem.
Additionally, we were losing a fair bit of time to the zero
initialization, which we are now doing more selectively.
There are a ton of loads and stores happening, and for sure we are bound
on the fill buffer + store forwarding. An SSE2 version of this code is
probably possible by simply replacing the shifts with unpacks with zero
and the palignr's with shufpd's. I'm just not sure it'll be all that worth
it, though. We are gating against SSE4.1 not because we are using specifically
a 4.1 instruction but because that marks when Wolfdale came out and palignr
became a lot faster.
While these are technically different instructions, no such CPU exists
that has AVX2 that doesn't have BMI2. Enabling BMI2 allows us to
eliminate several flag stalls by having flagless versions of shifts, and
allows us to not clobber and move around GPRs so much in scalar code.
There's usually a sizeable benefit for enabling it. Since we're building
with BMI2 for AVX2 functions, let's also just make sure the CPU claims
to support it (just to cover our bases).
This takes advantage of the fact that on AVX512 architectures, masked
moves are incredibly cheap. There are many places where we have to
fallback to the safe C implementation of chunkcopy_safe because of the
assumed overwriting that occurs. We're to sidestep most of the branching
needed here by simply controlling the bounds of our writes with a mask.
Compiling zlib-ng with glibc 2.17 (minimum version still supported by
crosstool-ng) fails due to the lack of HWCAP_S390_VX - it was
introduced in glibc 2.23.
Strictly speaking, this is a problem with the feature detection logic
in cmake. However, it's not worth disabling the s390x vectorized CRC32
if the hwcap constant is missing and the compiler intrinsics are
available.
So fix by hardcoding the constant. It's a part of the kernel ABI,
which does not change.
zlib-ng requires some patches to make it compilable on LLVM-mingw.
1. Add -Wno-pedantic-ms-format only if a toolchain is MinGW GCC.
- llvm-mingw does not support it, causing build to break.
2. Include arm_neon.h instead of arm64_neon.h (aarch64 only).
- arm64_neon.h is MSVC only.
- GCC, Clang does not have arm64_neon.h but arm_neon.h on aarch64.
- Also applied to configure and detect-instrinsics.cmake
Native flag should already determine what code will run on the architecture.
This appears to have just been an extra run check with limited benefits. Any
compiler that compiles code not available on the native platform is buggy and
not our problem.
The Visual Studio CMake generator allows to select different toolsets.
One of these toolsets is Clang-Cl.
However, the generator does accept the toolset name case-agnostic, so it
could be "ClangCl", but also "Clangcl" or "clangcl" or ...
This value will be stored verbatim in variable CMAKE_GENERATOR_TOOLSET
by CMake. Therefore, this variable must be matched case-insensitive,
which is what this commit does.
fixes: #1576
Signed-off-by: Deniz Bahadir <deniz@code.bahadir.email>
Defines changing meaning:
X86_SSE42 used to mean the compiler supports crc asm fallback.
X86_SSE42_CRC_INTRIN used to mean compiler supports SSE4.2 intrinsics.
X86_SSE42 now means compiler supports SSE4.2 intrinsics.
This therefore also fixes the adler32_sse42 checks, since those were depending
on SSE4.2 intrinsics but was mistakenly checking the X86_SSE42 define.
Now the X86_SSE42 define actually means what it appears to.
ARM64EC is a new ARM64 variant introduced in Windows 11 that uses an
ABI similar to AMD64, which allows for better interoperability with
emulated AMD64 applications. When enabled in MSVC, it defines _M_AMD64
and _M_ARM64EC, but not _M_ARM64, so we need to check for _M_ARM64EC.
* Add support for cross-compiling using clang 13 and later for PowerPC64 little-endian and big-endian
* Fix detection for availability of Power9 intrinsics
Replace missing '_mm512_set_epi8' with
'_mm512_set_epi32' in test code for configuring;
Add fallback for '-mtune=cascadelake' flag used
when AVX512 is enabled.
glibc defines HWCAP_S390_VX and, since v2.33, its alias
HWCAP_S390_VXRS; musl has only HWCAP_S390_VXRS.
Use the common HWCAP_S390_VXRS, define it as HWCAP_S390_VX if
necessary.