Commit Graph

23 Commits

Author SHA1 Message Date
Nathan Moinvaziri
8e0e24cd18 Split cpu_features.h by architecture. 2024-01-30 20:50:05 +01:00
Cameron Cawley
a339d85c80 Move the AVX compatibility functions into a separate file 2023-07-20 08:03:17 +02:00
David Korth
8976caa3f0 Handle ARM64EC as ARM64.
ARM64EC is a new ARM64 variant introduced in Windows 11 that uses an
ABI similar to AMD64, which allows for better interoperability with
emulated AMD64 applications. When enabled in MSVC, it defines _M_AMD64
and _M_ARM64EC, but not _M_ARM64, so we need to check for _M_ARM64EC.
2023-07-16 12:42:38 +02:00
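
A minimal sketch of the kind of check this describes, assuming MSVC's predefined macros; the ARCH_ARM64 name below is illustrative, not necessarily what the source tree uses:

    /* MSVC defines _M_ARM64EC (and _M_AMD64) but not _M_ARM64 for ARM64EC
       targets, so treat either macro as an ARM64 build. */
    #if defined(_M_ARM64) || defined(_M_ARM64EC)
    #  define ARCH_ARM64 1
    #endif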
Georgiy Manuilov
6670678012 Add fallback function for '_mm512_set_epi8' intrinsic
The '_mm512_set_epi8' intrinsic is missing in GCC < 9.
However, its usage can easily be eliminated in
favor of '_mm512_set_epi32' with no loss in
performance, enabling older GCC to benefit from
AVX512-optimized codepaths.
2023-03-28 20:36:19 +02:00
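
A hedged illustration of the substitution described above; the byte pattern 0..63 and the function name are only examples, and the snippet assumes AVX512F is enabled at compile time:

    #include <immintrin.h>

    /* Instead of listing 64 bytes with _mm512_set_epi8 (unavailable in GCC < 9),
       pack each group of four bytes into one 32-bit lane for _mm512_set_epi32.
       Shown here for the ascending byte pattern 0, 1, 2, ..., 63. */
    static inline __m512i bytes_0_to_63(void) {
        return _mm512_set_epi32(
            0x3F3E3D3C, 0x3B3A3938, 0x37363534, 0x33323130,
            0x2F2E2D2C, 0x2B2A2928, 0x27262524, 0x23222120,
            0x1F1E1D1C, 0x1B1A1918, 0x17161514, 0x13121110,
            0x0F0E0D0C, 0x0B0A0908, 0x07060504, 0x03020100);
    }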
Hans Kristian Rosbach
17331431e0 Replace __builtin_ctz[ll] fallback functions with branchless implementations.
Added debug assert check for value = 0.
Added more details to the comment to avoid future confusion.
Added fallback logic for older MSVC versions, just in case.
2023-02-07 16:25:46 +01:00
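
One common way to make such a fallback branchless is the De Bruijn multiply trick; the sketch below shows that general technique and is not necessarily the exact code in this commit:

    #include <assert.h>
    #include <stdint.h>

    /* Branchless trailing-zero count. Like __builtin_ctz, the result is
       undefined for x == 0, hence the debug assert mentioned above. */
    static inline int fallback_ctz(uint32_t x) {
        static const int debruijn[32] = {
            0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
            31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
        };
        assert(x != 0);
        /* x & (0 - x) isolates the lowest set bit without a branch. */
        return debruijn[((x & (0u - x)) * 0x077CB531u) >> 27];
    }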
Pavel P
147dd9f0e9 Match __builtin_ctzl/__builtin_ctzll signatures
Make sure the input/output argument types match the original functions from clang/gcc
2023-01-22 21:34:45 +01:00
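
For reference, GCC and Clang declare these builtins with unsigned arguments and an int return, so any fallback has to use the same types:

    /* Signatures the fallbacks need to match:
       int __builtin_ctzl(unsigned long x);
       int __builtin_ctzll(unsigned long long x); */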
Vladislav Shchapov
78284a439f Fix missing intrinsics (MSVS 2015, 2017)
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2022-10-11 21:25:02 +02:00
Cameron Cawley
9c839540ed Move the NEON compatibility defines into a separate file 2022-10-11 12:22:33 +02:00
Shawn Hoffman
8098fde200 fix ACLE detection on msvc/arm64 2022-09-05 11:26:37 +02:00
Sean McBride
16d2762366 Fixed issue #1264: Use fallback for _mm256_zextsi128_si256 on Xcode < 9.3 2022-05-12 12:05:55 +02:00
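
A plausible shape for such a fallback (assuming AVX2 is available; the function name is illustrative and this may not match the fix that was actually applied):

    #include <immintrin.h>

    /* _mm256_zextsi128_si256 guarantees the upper 128 bits are zeroed, unlike
       _mm256_castsi128_si256, so insert the value into a zeroed vector. */
    static inline __m256i zext_si128_si256_fallback(__m128i a) {
        return _mm256_inserti128_si256(_mm256_setzero_si256(), a, 0);
    }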
Mika Lindqvist
db3feb4cf2 Allow bypassing runtime feature check of TZCNT instructions.
* This avoids a conditional branch when it is known at build time that TZCNT instructions are always supported
2022-03-16 11:43:09 +01:00
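
A hypothetical sketch of how such a build-time bypass can work; the macro and variable names here are illustrative only:

    /* If the target already implies BMI support (e.g. built with -mbmi),
       the feature flag collapses to a constant and the conditional branch
       can be optimized away. */
    #if defined(__BMI__)
    #  define have_tzcnt 1
    #else
    extern int have_tzcnt;   /* filled in by runtime CPU detection */
    #endif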
Adam Stylinski
43dbfd6709 Improved adler32 NEON performance by 30-47%
We unlocked some ILP by allowing independent sums in the loop and
reducing these sums outside of the loop. Additionally, the multiplication
by 32 (now 64) is moved outside of this loop. Similar to the Chromium
implementation, this code does straight 8-bit -> 16-bit additions and defers
the fused multiply-accumulate to outside of the loop. However, by unrolling by
another factor of 2, the code is measurably faster. The code does fused
multiply-accumulates back to as many scratch registers as we have room for in
order to maximize ILP for the 16 integer FMAs that need to occur. The compiler
seems to order them such that the destination register is the same register as
the previous instruction, so perhaps it is not actually able to overlap, or
maybe the A73's pipeline is reordering these instructions anyway.

On the Odroid-N2, the Cortex-A73 cores are ~30-44% faster on the adler32 benchmark,
and the Cortex-A53 cores are anywhere from 34-47% faster.
2022-02-24 16:00:51 +01:00
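
A much-simplified scalar illustration of the independent-sums idea described above (a toy example, not the NEON code itself):

    #include <stddef.h>
    #include <stdint.h>

    /* Two accumulators break the single serial dependency chain, letting
       adjacent iterations execute in parallel; the partial sums are only
       reduced once, outside the loop. */
    static uint32_t sum_bytes(const uint8_t *buf, size_t len) {
        uint32_t s0 = 0, s1 = 0;
        size_t i = 0;
        for (; i + 2 <= len; i += 2) {
            s0 += buf[i];
            s1 += buf[i + 1];
        }
        if (i < len)
            s0 += buf[i];
        return s0 + s1;
    }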
Nathan Moinvaziri
cc361feaad Rename CPU feature header and source files for consistency. 2022-02-06 16:52:10 +01:00
Adam Stylinski
9146bd472c Marginal improvement by pipelining loads on NEON
The ld1 {4 reg} variant saves us instructions
and adds only 3 cycles of latency to load 3
more NEON/ASIMD registers' worth of data.
2022-02-01 13:31:00 +01:00
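
A small sketch of the four-register structure load being referred to, using the ACLE intrinsic form (compiler support for vld1q_u8_x4 varies; the function name is illustrative):

    #include <arm_neon.h>

    /* One ld1 {v0.16b-v3.16b} style load brings in 64 bytes at once instead
       of issuing four separate vld1q_u8 loads. */
    static inline uint8x16x4_t load_64_bytes(const uint8_t *p) {
        return vld1q_u8_x4(p);
    }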
Adam Stylinski
429bc4f5d5 Remove the "avx512_well_suited" cpu flag
Now that we have confirmation that the AVX512 variants so far have been
universally better on every capable CPU we've tested them on, there's no
sense in trying to maintain a whitelist.
2022-01-22 20:39:43 +01:00
Adam Stylinski
8437a02b93 Improvements to avx512 adler32 implementations
Now that better benchmarks are in place, it became apparent that masked
broadcast was _not_ faster and it's actually faster to use vmovd, as
suspected. Additionally, for the VNNI variant, we've unlocked some
additional ILP by doing a second dot product in the loop to a different
running sum that gets recombined later. This broke a data dependency
chain and allowed the IPC to be ~2.75. The result is about a 40-50%
improvement in runtime.

Additionally, we now call the smaller SIMD-sized variants if the input
is too small and they happen to be compiled in. This helps for inputs that
are impossibly small yet still large enough to fill a vector length.
For size-16 and size-32 inputs I was seeing something like sub-10 ns instead
of 50 ns.
2022-01-22 20:39:43 +01:00
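
A toy sketch of the two-accumulator dot-product idea using AVX512-VNNI; identifiers are illustrative and this is not the zlib-ng adler32 code:

    #include <immintrin.h>
    #include <stddef.h>

    /* Two independent running sums break the loop-carried dependency on a
       single accumulator; they are recombined after the loop. */
    static __m512i dot_sums(const __m512i *data, size_t n_vec, __m512i weights) {
        __m512i acc0 = _mm512_setzero_si512();
        __m512i acc1 = _mm512_setzero_si512();
        size_t i = 0;
        for (; i + 2 <= n_vec; i += 2) {
            acc0 = _mm512_dpbusd_epi32(acc0, data[i], weights);
            acc1 = _mm512_dpbusd_epi32(acc1, data[i + 1], weights);
        }
        for (; i < n_vec; i++)
            acc0 = _mm512_dpbusd_epi32(acc0, data[i], weights);
        return _mm512_add_epi32(acc0, acc1);
    }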
Nathan Moinvaziri
d8aeacbfa2 Harmonize the CPU architecture preprocessor definitions. 2020-08-20 12:05:11 +02:00
Nathan Moinvaziri
cf3f8bd671 Removed fallback for __builtin_ctzl since it is no longer used. 2020-05-24 13:53:25 +02:00
Nathan Moinvaziri
9bd28d9381 Abstracted out architecture-specific implementations of the 258-byte comparison to compare258. 2020-05-24 13:53:25 +02:00
Nathan Moinvaziri
e09d131b5a Standardize fill_window implementations and abstract out slide_hash_neon for ARM. 2020-05-01 00:21:18 +02:00
Nathan Moinvaziri
e0a711cdde Fixed formatting: 4 spaces for code indent, 2 spaces for preprocessor indent, initial function brace on the same line as the definition; removed extraneous spaces and newlines. 2020-02-07 10:44:20 +01:00
Nathan Moinvaziri
305b2d75e2 Fixed formatting converting tabs to spaces. 2019-10-24 09:18:29 +02:00
Nathan Moinvaziri
ce0076688a Changes to support compilation with MSVC ARM & ARM64 (#386)
* Merge aarch64 and arm cmake sections.
* Updated MSVC compiler support for ARM and ARM64.
* Moved detection for -mfpu=neon to where the flag is set to simplify add_intrinsics_option.
* Only add ${ACLEFLAG} on aarch64 if not WITH_NEON.
* Rename arch/x86/ctzl.h to fallback_builtins.h.
2019-09-04 08:35:23 +02:00