ARM64EC is a new ARM64 variant introduced in Windows 11 that uses an
ABI similar to AMD64, which allows for better interoperability with
emulated AMD64 applications. When ARM64EC is targeted in MSVC, the compiler
defines _M_AMD64 and _M_ARM64EC, but not _M_ARM64, so we need to check for
_M_ARM64EC as well.
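A minimal sketch of the resulting check (the TARGET_ARM64 macro name is
purely illustrative, not something defined by the codebase):

    /* MSVC in ARM64EC mode defines _M_ARM64EC and _M_AMD64 but not
     * _M_ARM64, so ARM64-specific paths must test both macros. */
    #if defined(_M_ARM64) || defined(_M_ARM64EC)
    #  define TARGET_ARM64 1   /* placeholder macro for illustration */
    #else
    #  define TARGET_ARM64 0
    #endif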
The '_mm512_set_epi8' intrinsic is missing in GCC < 9. However, its usage
can easily be eliminated in favor of '_mm512_set_epi32' with no loss in
performance, enabling older GCC versions to benefit from the
AVX512-optimized codepaths.
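As a hedged illustration of the replacement (the ascending 0..63 byte
pattern here is just an example, not necessarily the constant used in the
actual codepath), four bytes are packed into each 32-bit lane:

    #include <immintrin.h>

    /* Equivalent to _mm512_set_epi8(63, 62, ..., 1, 0), but built with
     * _mm512_set_epi32 so it also compiles on GCC < 9. Each 32-bit lane
     * holds four bytes, little-endian within the lane. */
    static inline __m512i ascending_bytes_0_to_63(void) {
        return _mm512_set_epi32(
            0x3f3e3d3c, 0x3b3a3938, 0x37363534, 0x33323130,
            0x2f2e2d2c, 0x2b2a2928, 0x27262524, 0x23222120,
            0x1f1e1d1c, 0x1b1a1918, 0x17161514, 0x13121110,
            0x0f0e0d0c, 0x0b0a0908, 0x07060504, 0x03020100);
    }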
Added debug assert check for value = 0.
Added more details to the comment to avoid future confusion.
Added fallback logic for older MSVC versions, just in case.
We unlocked some ILP by allowing for independent sums in the loop and
reducing these sums outside of the loop. Additionally, the multiplication
by 32 (now 64) is moved outside of this loop. Similar to the Chromium
implementation, this code does straight 8-bit -> 16-bit additions and defers
the fused multiply-accumulate to outside of the loop. However, by unrolling by
another factor of 2, the code is measurably faster. The code does fused
multiply-accumulates back to as many scratch registers as we have room for in
order to maximize ILP for the 16 integer FMAs that need to occur. The compiler
seems to order them such that the destination register is the same register as
the previous instruction, so perhaps it's not actually able to overlap them,
or maybe the Cortex-A73's pipeline is reordering these instructions anyway.
On the Odroid-N2, the Cortex-A73 cores are ~30-44% faster on the adler32 benchmark,
and the Cortex-A53 cores are anywhere from 34-47% faster.
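A structural sketch of the independent-sums idea (this is not the actual
adler32 NEON kernel: the weighted reduction and NMAX overflow handling are
omitted, and the function below just sums bytes):

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Two independent 16-bit accumulators break the loop-carried
     * dependency chain; the horizontal reduction happens once, outside
     * the loop. Assumes len is small enough that the 16-bit lanes cannot
     * overflow. */
    static uint32_t sum_bytes_neon(const uint8_t *buf, size_t len) {
        uint16x8_t acc0 = vdupq_n_u16(0);
        uint16x8_t acc1 = vdupq_n_u16(0);  /* second, independent sum */
        size_t blocks = len / 32;
        while (blocks--) {
            /* unrolled by 2: each load feeds its own accumulator */
            acc0 = vpadalq_u8(acc0, vld1q_u8(buf));
            acc1 = vpadalq_u8(acc1, vld1q_u8(buf + 16));
            buf += 32;
        }
        /* recombine the independent sums outside of the loop */
        uint32x4_t wide = vpaddlq_u16(vaddq_u16(acc0, acc1));
        uint32_t sum = vaddvq_u32(wide);
        for (size_t i = 0; i < len % 32; i++)
            sum += buf[i];
        return sum;
    }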
Now that we have confirmation that the AVX512 variants so far have been
universally better on every capable CPU we've tested them on, there's no
sense in trying to maintain a whitelist.
Now that better benchmarks are in place, it became apparent that masked
broadcast was _not_ faster and it's actually faster to use vmovd, as
suspected. Additionally, for the VNNI variant, we've unlocked some
additional ILP by doing a second dot product in the loop to a different
running sum that gets recombined later. This broke a data dependency
chain and allowed the IPC to reach ~2.75. The result is about a 40-50%
improvement in runtime.
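A hedged sketch of the second dot product (not the real adler32 VNNI code;
the weight rotation and sum-of-sums bookkeeping are left out, and all names
are placeholders). For reference, a plain _mm_cvtsi32_si128 is the intrinsic
that typically compiles to the vmovd mentioned above:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Requires AVX512-VNNI. Two dot-product chains run in parallel so each
     * _mm512_dpbusd_epi32 depends only on its own accumulator; vs2a and
     * vs2b are recombined by the caller, outside of the hot loop. */
    static void dot_accumulate(const uint8_t *buf, size_t vecs,
                               __m512i weights,
                               __m512i *vs2a, __m512i *vs2b) {
        __m512i a = _mm512_setzero_si512();
        __m512i b = _mm512_setzero_si512();
        for (size_t i = 0; i + 1 < vecs; i += 2) {
            __m512i v0 = _mm512_loadu_si512((const void *)(buf + 64 * i));
            __m512i v1 = _mm512_loadu_si512((const void *)(buf + 64 * (i + 1)));
            a = _mm512_dpbusd_epi32(a, v0, weights); /* dependency chain #1 */
            b = _mm512_dpbusd_epi32(b, v1, weights); /* dependency chain #2 */
        }
        *vs2a = a;
        *vs2b = b;
    }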
Additionally, we now call the smaller SIMD-width variants if the input is
too small for the widest one and they happen to be compiled in. This helps
for the impossibly small inputs that are still at least one vector length
long. For size 16 and 32 inputs I was seeing something like sub-10 ns
instead of 50 ns.
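An illustrative sketch of that dispatch (function names, guard macros, and
thresholds here are placeholders, not the real ones):

    /* Idea only: fall back to a narrower compiled-in variant when the
     * input is shorter than one pass of the widest vector, instead of
     * punting straight to the scalar code. All names are hypothetical. */
    static uint32_t adler32_dispatch(uint32_t adler, const uint8_t *buf,
                                     size_t len) {
    #ifdef WITH_AVX512            /* hypothetical build guard */
        if (len >= 64)
            return adler32_avx512(adler, buf, len);
    #endif
    #ifdef WITH_SSSE3             /* hypothetical build guard */
        if (len >= 16)
            return adler32_ssse3(adler, buf, len);
    #endif
        return adler32_c(adler, buf, len);   /* scalar fallback */
    }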
* Merge aarch64 and arm cmake sections.
* Updated MSVC compiler support for ARM and ARM64.
* Moved detection for -mfpu=neon to where the flag is set to simplify add_intrinsics_option.
* Only add ${ACLEFLAG} on aarch64 if not WITH_NEON.
* Rename arch/x86/ctzl.h to fallback_builtins.h.