zlib-ng/arch/x86/x86_features.h
Adam Stylinski 46fc33f39d SSE4.1 optimized chorba
This is ~25-30% faster than the SSE2 variant on a Core 2 Quad. The main reason
is that, while this approach incurs far fewer shifts, it has to manage an
entirely separate stack buffer that is the size of the L1 cache on most CPUs.
That buffer was one of the main reasons the 32k specialized function was
slower for the scalar counterpart, despite auto-vectorizing: the
auto-vectorized loop was setting up the stack buffer at unaligned offsets,
which is detrimental to performance pre-Nehalem. Additionally, we were losing
a fair bit of time to the zero initialization, which we now do more
selectively.
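For illustration, here is a minimal sketch of the two fixes described above:
keeping the scratch buffer aligned and zeroing only the prefix that is read
before it is written. This is not the actual chorba code; the buffer size,
the alignment macro, and the amount zeroed are all assumptions.

#include <stdint.h>
#include <string.h>

/* Hypothetical sketch only: force the scratch buffer to a 16-byte boundary
 * and zero just the leading region that is consumed before it is overwritten,
 * rather than memset()ing the whole ~L1-sized buffer up front. */
#if defined(_MSC_VER)
#  define SCRATCH_ALIGN __declspec(align(16))
#else
#  define SCRATCH_ALIGN __attribute__((aligned(16)))
#endif

static void chorba_sketch(const uint8_t *buf, size_t len) {
    SCRATCH_ALIGN uint64_t bitbuffer[4096];       /* 32 KiB, roughly an L1 data cache */
    memset(bitbuffer, 0, 64 * sizeof(uint64_t));  /* selective, not the full buffer */
    (void)buf;
    (void)len;                                    /* folding loop elided */
}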

There are a ton of loads and stores happening, and we are surely bound on the
fill buffers and store forwarding. An SSE2 version of this code is probably
possible by replacing the shifts with unpacks against zero and the palignrs
with shufpds, though I'm not sure it would be worth it. We gate on SSE4.1 not
because we use an SSE4.1-specific instruction, but because SSE4.1 marks the
Wolfdale generation, where palignr became a lot faster.
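As a sketch of the substitution the last paragraph suggests (the 8-byte case
shown here is an assumption about which offsets matter; it is not taken from
the real kernel), an SSSE3 palignr that stitches the high half of one vector
onto the low half of the next can be expressed in plain SSE2 with shufpd:

#include <emmintrin.h>  /* SSE2 */
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 */

/* What the current code leans on: concatenate a:b and shift right 8 bytes,
 * i.e. take the high 8 bytes of b followed by the low 8 bytes of a. */
static inline __m128i align8_ssse3(__m128i a, __m128i b) {
    return _mm_alignr_epi8(a, b, 8);
}

/* A possible SSE2-only equivalent for the 8-byte case, as the message
 * suggests. For other byte offsets an SSE2 fallback would instead need
 * _mm_srli_si128/_mm_slli_si128 pairs OR'd together. */
static inline __m128i align8_sse2(__m128i a, __m128i b) {
    return _mm_castpd_si128(
        _mm_shuffle_pd(_mm_castsi128_pd(b), _mm_castsi128_pd(a), 1));
}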
2025-04-15 14:11:12 +02:00

/* x86_features.h -- check for CPU features
 * Copyright (C) 2013 Intel Corporation Jim Kukunas
 * For conditions of distribution and use, see copyright notice in zlib.h
 */

#ifndef X86_FEATURES_H_
#define X86_FEATURES_H_

struct x86_cpu_features {
    int has_avx2;
    int has_avx512f;
    int has_avx512dq;
    int has_avx512bw;
    int has_avx512vl;
    int has_avx512_common; // Enabled when AVX512(F,DQ,BW,VL) are all enabled.
    int has_avx512vnni;
    int has_bmi2;
    int has_sse2;
    int has_ssse3;
    int has_sse41;
    int has_sse42;
    int has_pclmulqdq;
    int has_vpclmulqdq;
    int has_os_save_ymm;
    int has_os_save_zmm;
};

void Z_INTERNAL x86_check_features(struct x86_cpu_features *features);

#endif /* X86_FEATURES_H_ */
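
For context, here is a minimal sketch of how this header might be consumed to
gate the SSE4.1 chorba path at runtime. The dispatch and kernel names below
are illustrative assumptions, not zlib-ng's actual functable code, and in the
real tree zbuild.h (which defines Z_INTERNAL) is included first.

#include <stddef.h>
#include <stdint.h>
#include "x86_features.h"

/* Hypothetical kernel prototypes; the real zlib-ng names differ. */
uint32_t crc32_chorba_sse41(uint32_t crc, const uint8_t *buf, size_t len);
uint32_t crc32_chorba_scalar(uint32_t crc, const uint8_t *buf, size_t len);

static struct x86_cpu_features cpu;

void crc32_init_dispatch(void) {
    x86_check_features(&cpu);
}

uint32_t crc32_dispatch(uint32_t crc, const uint8_t *buf, size_t len) {
    /* Gate on has_sse41: as the commit notes, this is a proxy for "Wolfdale
     * or newer", where palignr is fast, not for an SSE4.1 instruction. */
    return cpu.has_sse41 ? crc32_chorba_sse41(crc, buf, len)
                         : crc32_chorba_scalar(crc, buf, len);
}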