ARM64EC is a new ARM64 variant introduced in Windows 11 that uses an
ABI similar to AMD64, which allows for better interoperability with
emulated AMD64 applications. When ARM64EC is targeted in MSVC, the compiler
defines _M_AMD64 and _M_ARM64EC, but not _M_ARM64, so we need to check for
_M_ARM64EC as well.
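A minimal sketch of the resulting check (the TARGET_ARM64 macro name is
purely illustrative, not something defined by the codebase):

    /* MSVC in ARM64EC mode defines _M_ARM64EC and _M_AMD64 but not
     * _M_ARM64, so ARM64-specific paths must test both macros. */
    #if defined(_M_ARM64) || defined(_M_ARM64EC)
    #  define TARGET_ARM64 1   /* placeholder macro for illustration */
    #else
    #  define TARGET_ARM64 0
    #endif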
The '_mm512_set_epi8' intrinsic is missing in GCC < 9. However, its usage
can easily be eliminated in favor of '_mm512_set_epi32' with no loss in
performance, enabling older GCC versions to benefit from the
AVX512-optimized codepaths.
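As a hedged illustration of the replacement (the ascending 0..63 byte
pattern here is just an example, not necessarily the constant used in the
actual codepath), four bytes are packed into each 32-bit lane:

    #include <immintrin.h>

    /* Equivalent to _mm512_set_epi8(63, 62, ..., 1, 0), but built with
     * _mm512_set_epi32 so it also compiles on GCC < 9. Each 32-bit lane
     * holds four bytes, little-endian within the lane. */
    static inline __m512i ascending_bytes_0_to_63(void) {
        return _mm512_set_epi32(
            0x3f3e3d3c, 0x3b3a3938, 0x37363534, 0x33323130,
            0x2f2e2d2c, 0x2b2a2928, 0x27262524, 0x23222120,
            0x1f1e1d1c, 0x1b1a1918, 0x17161514, 0x13121110,
            0x0f0e0d0c, 0x0b0a0908, 0x07060504, 0x03020100);
    }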
Added debug assert check for value = 0.
Added more details to the comment to avoid future confusion.
Added fallback logic for older MSVC versions, just in case.
We unlocked some ILP by allowing for independent sums in the loop and
reducing these sums outside of the loop. Additionally, the multiplication
by 32 (now 64) is moved outside of this loop. Similar to the Chromium
implementation, this code does straight 8-bit -> 16-bit additions and defers
the fused multiply-accumulate to outside of the loop. However, by unrolling by
another factor of 2, the code is measurably faster. The code does fused
multiply-accumulates back to as many scratch registers as we have room for in
order to maximize ILP for the 16 integer FMAs that need to occur. The compiler
seems to order them such that the destination register is the same register as
the previous instruction, so perhaps it's not actually able to overlap them,
or maybe the Cortex-A73's pipeline is reordering these instructions anyway.
On the Odroid-N2, the Cortex-A73 cores are ~30-44% faster on the adler32 benchmark,
and the Cortex-A53 cores are anywhere from 34-47% faster.
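A structural sketch of the independent-sums idea (this is not the actual
adler32 NEON kernel: the weighted reduction and NMAX overflow handling are
omitted, and the function below just sums bytes):

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Two independent 16-bit accumulators break the loop-carried
     * dependency chain; the horizontal reduction happens once, outside
     * the loop. Assumes len is small enough that the 16-bit lanes cannot
     * overflow. */
    static uint32_t sum_bytes_neon(const uint8_t *buf, size_t len) {
        uint16x8_t acc0 = vdupq_n_u16(0);
        uint16x8_t acc1 = vdupq_n_u16(0);  /* second, independent sum */
        size_t blocks = len / 32;
        while (blocks--) {
            /* unrolled by 2: each load feeds its own accumulator */
            acc0 = vpadalq_u8(acc0, vld1q_u8(buf));
            acc1 = vpadalq_u8(acc1, vld1q_u8(buf + 16));
            buf += 32;
        }
        /* recombine the independent sums outside of the loop */
        uint32x4_t wide = vpaddlq_u16(vaddq_u16(acc0, acc1));
        uint32_t sum = vaddvq_u32(wide);
        for (size_t i = 0; i < len % 32; i++)
            sum += buf[i];
        return sum;
    }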
Now that we have confirmation that the AVX512 variants so far have been
universally better on every capable CPU we've tested them on, there's no
sense in trying to maintain a whitelist.
Now that better benchmarks are in place, it became apparent that masked
broadcast was _not_ faster and it's actually faster to use vmovd, as
suspected. Additionally, for the VNNI variant, we've unlocked some
additional ILP by doing a second dot product in the loop to a different
running sum that gets recombined later. This broke a data dependency
chain and allowed the IPC to reach ~2.75. The result is about a 40-50%
improvement in runtime.
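A hedged sketch of the second dot product (not the real adler32 VNNI code;
the weight rotation and sum-of-sums bookkeeping are left out, and all names
are placeholders). For reference, a plain _mm_cvtsi32_si128 is the intrinsic
that typically compiles to the vmovd mentioned above:

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Requires AVX512-VNNI. Two dot-product chains run in parallel so each
     * _mm512_dpbusd_epi32 depends only on its own accumulator; vs2a and
     * vs2b are recombined by the caller, outside of the hot loop. */
    static void dot_accumulate(const uint8_t *buf, size_t vecs,
                               __m512i weights,
                               __m512i *vs2a, __m512i *vs2b) {
        __m512i a = _mm512_setzero_si512();
        __m512i b = _mm512_setzero_si512();
        for (size_t i = 0; i + 1 < vecs; i += 2) {
            __m512i v0 = _mm512_loadu_si512((const void *)(buf + 64 * i));
            __m512i v1 = _mm512_loadu_si512((const void *)(buf + 64 * (i + 1)));
            a = _mm512_dpbusd_epi32(a, v0, weights); /* dependency chain #1 */
            b = _mm512_dpbusd_epi32(b, v1, weights); /* dependency chain #2 */
        }
        *vs2a = a;
        *vs2b = b;
    }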
Additionally, we now call the smaller SIMD-width variants if the input is
too small for the widest one and they happen to be compiled in. This helps
for the impossibly small inputs that are still at least one vector length
long. For size 16 and 32 inputs I was seeing something like sub-10 ns
instead of 50 ns.
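An illustrative sketch of that dispatch (function names, guard macros, and
thresholds here are placeholders, not the real ones):

    /* Idea only: fall back to a narrower compiled-in variant when the
     * input is shorter than one pass of the widest vector, instead of
     * punting straight to the scalar code. All names are hypothetical. */
    static uint32_t adler32_dispatch(uint32_t adler, const uint8_t *buf,
                                     size_t len) {
    #ifdef WITH_AVX512            /* hypothetical build guard */
        if (len >= 64)
            return adler32_avx512(adler, buf, len);
    #endif
    #ifdef WITH_SSSE3             /* hypothetical build guard */
        if (len >= 16)
            return adler32_ssse3(adler, buf, len);
    #endif
        return adler32_c(adler, buf, len);   /* scalar fallback */
    }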
* Merge aarch64 and arm cmake sections.
* Updated MSVC compiler support for ARM and ARM64.
* Moved detection for -mfpu=neon to where the flag is set to simplify add_intrinsics_option.
* Only add ${ACLEFLAG} on aarch64 if not WITH_NEON.
* Rename arch/x86/ctzl.h to fallback_builtins.h.