This is ~25-30% faster than the SSE2 variant on a Core 2 Quad. The main reason
for this is that, while far fewer shifts are incurred, an entirely separate
stack buffer the size of the L1 cache on most CPUs has to be managed. This was
one of the main reasons the specialized 32k function was slower for the scalar
counterpart despite auto-vectorizing: the auto-vectorized loop was setting up
the stack buffer at unaligned offsets, which is detrimental to performance
pre-Nehalem. Additionally, we were losing a fair bit of time to the zero
initialization, which we now do more selectively.
There are a ton of loads and stores happening, and for sure we are bound
on the fill buffer + store forwarding. An SSE2 version of this code is
probably possible by simply replacing the shifts with unpacks with zero and
the palignrs with shufpds. I'm just not sure it'll be all that worth it,
though. We are gating on SSE4.1 not because we use a 4.1-specific instruction
but because that marks when Wolfdale came out and palignr became a lot faster.
Improve the speed of sub-16-byte matches by first using a 128-bit intrinsic,
then using only 512-bit intrinsics. This requires us to overlap on the last
run, but that is cheaper than processing the tail with a 256-bit and then a
128-bit run.
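A minimal sketch of the idea (not the actual zlib-ng compare code; the helper name and offsets are illustrative): do one cheap 128-bit compare first so short matches exit early, then cover the remaining 240 bytes with 512-bit compares, letting the last one overlap the previous run instead of falling back to narrower vectors.
```
#include <immintrin.h>
#include <stdint.h>

/* Sketch: find the first mismatching byte in two 256-byte windows.
   Requires AVX-512BW; offsets 16/80/144/192 make the last 512-bit
   compare overlap the previous one by 16 bytes. */
static uint32_t first_diff_256(const uint8_t *a, const uint8_t *b) {
    __m128i xa = _mm_loadu_si128((const __m128i *)a);
    __m128i xb = _mm_loadu_si128((const __m128i *)b);
    unsigned neq = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(xa, xb)) ^ 0xFFFFu;
    if (neq)
        return (uint32_t)__builtin_ctz(neq);      /* mismatch in the first 16 bytes */

    static const int offs[4] = { 16, 80, 144, 192 };
    for (int i = 0; i < 4; i++) {
        __m512i za = _mm512_loadu_si512((const void *)(a + offs[i]));
        __m512i zb = _mm512_loadu_si512((const void *)(b + offs[i]));
        uint64_t m = _mm512_cmpneq_epi8_mask(za, zb);
        if (m)
            return (uint32_t)offs[i] + (uint32_t)__builtin_ctzll(m);
    }
    return 256;                                   /* windows are fully equal */
}
```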
Change the benchmark steps to avoid hitting the chunk boundaries of one
function or the other as much; this gives fairer benchmarks.
The version that's currently in the generic implementation for 32768
byte buffers leverages the stack. It manages to autovectorize but
unfortunately the trips to the stack hurt its performance for CPUs which
need this the most. This version is explicitly SIMD vectorized and
doesn't use trips to the stack. In my testing it's ~10% faster than the
"small" variant, and about 42% faster than the "32768" variant.
The chunkset_tpl comment allows negative dist (out - from) as long as
the length is smaller than the absolute value of dist (i.e. memory does
not overlap). However, this case is currently broken in the RVV override
of CHUNKCOPY -- it compares dist (which is a ptrdiff_t, a value the same
size as size_t but signed) with the result of sizeof (which is a size_t),
and this triggers an implicit conversion from signed to unsigned (thus
losing negative values).
Since the memory is promised not to overlap when dist is negative, just use a
giant memcpy() call to copy everything.
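A minimal sketch of the hazard and the fix, assuming a simplified CHUNKCOPY shape (this is not the actual RVV code): the signed/unsigned comparison makes a negative dist look enormous, while the fix copies the promised-non-overlapping case in one shot.
```
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch only: a negative dist would be converted to a huge size_t in a
   comparison like "dist >= sizeof(vtype)", so handle it explicitly. */
static uint8_t *chunkcopy_sketch(uint8_t *out, const uint8_t *from, unsigned len) {
    ptrdiff_t dist = out - from;

    if (dist < 0) {
        /* Promised non-overlapping: one big memcpy of the whole length. */
        memcpy(out, from, len);
        return out + len;
    }
    /* Positive dist may overlap; a forward byte copy keeps the LZ77
       "replicate earlier output" semantics. */
    while (len--)
        *out++ = *from++;
    return out;
}
```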
Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
In testing a SIMD vectorization for this, I wrote a gtest which stumbled
onto the fact that this had a bug on big endian: the initial CRC needed to be
byte swapped before it was mixed in.
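A minimal sketch of that kind of fix, assuming GCC/Clang builtins and byte-order macros (not the actual zlib-ng code):
```
#include <stdint.h>

/* Sketch: byte swap the incoming CRC on big-endian targets before it is
   folded into the word-oriented state. */
static inline uint32_t crc_prep_initial(uint32_t crc) {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return __builtin_bswap32(crc);
#else
    return crc;
#endif
}
```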
The MSVC compiler incorrectly optimizes the GET_CHUNK_MAG code on older
versions (observed with VS 17.11.x); this appears to be resolved in VS 17.13.2.
The compiler would optimize the code in such a way that it caused a
decompression failure. It only happens when the /Os flag is set.
A lot of alterations had to be made to keep this from being worse, and so
far it's not really better, either. I had to force inlining for the adler
routine, remove the x4 load instruction because pipelining stalled otherwise,
and use restrict pointers with a copy idiom so that GCC would inline a copy
routine for the tail (see the sketch below).
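A minimal sketch of that restrict-plus-copy-loop idiom (illustrative only, not the exact routine):
```
#include <stddef.h>
#include <stdint.h>

/* Sketch: with both pointers declared restrict, the compiler can assume no
   overlap and inline/vectorize the byte loop for the tail instead of calling
   out to memcpy(). */
static inline void copy_tail(uint8_t *restrict dst, const uint8_t *restrict src, size_t len) {
    while (len--)
        *dst++ = *src++;
}
```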
Still, we see a small benefit in benchmarks, particularly when run with
buffers the size of our window or larger. There's also the added benefit that
this will fix #1824.
Mark crc32_c and crc32_braid functions as internal, and remove prefix.
Reorder contents of generic_functions, and remove Z_INTERNAL hints from declarations.
Add test/benchmark output to indicate whether Chorba is used.
- Rename N and W to BRAID_N and BRAID_W
- Remove override capabilities for BRAID_N and BRAID_W
- Fix formatting in crc32_braid_tbl.h
- Make makecrct not rely on crc32_braid_p.h
Evidently this instruction, despite the intrinsic having a register operand,
is a memory-register instruction. There seems to be no alignment requirement
for the source operand. Because of this, compilers, when not optimizing, do
the unaligned load and then dump it back to the stack to do the broadcasting load.
In doing this, MSVC seems to be dumping to the stack with an aligned move at an
unaligned address, causing a segfault. GCC does not seem to make this mistake, as
it stashes to an aligned address.
If we're on Visual Studio 2015, let's just do the longer 9-cycle sequence of a
128-bit load followed by a vinserti128. This _should_ fix issue #1861.
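A minimal sketch of the workaround, assuming VS2015 maps to _MSC_VER <= 1900 (the cutoff and helper name are illustrative):
```
#include <immintrin.h>

/* Sketch: avoid _mm256_broadcastsi128_si256() (vbroadcasti128) on old MSVC,
   which can spill the operand to the stack with an aligned move at an
   unaligned address; a 128-bit load plus vinserti128 does the same job. */
static inline __m256i broadcast128(const void *p) {
#if defined(_MSC_VER) && _MSC_VER <= 1900
    __m128i lo = _mm_loadu_si128((const __m128i *)p);
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), lo, 1);
#else
    return _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i *)p));
#endif
}
```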
We have to disable the CRC32-VX implementation for some Clang versions
(18 <= version < 19.1.2) that generate bad code for the IBM S390 VGFMA intrinsics.
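A minimal sketch of such a version gate (the DISABLE_CRC32_VX name is illustrative, not the actual zlib-ng macro):
```
/* Sketch: skip the VGFMA-based CRC32-VX code for Clang 18.0.0 up to but not
   including 19.1.2, which miscompiles the S390 vector intrinsics. */
#if defined(__clang__)
#  define CLANG_VER (__clang_major__ * 10000 + __clang_minor__ * 100 + __clang_patchlevel__)
#  if CLANG_VER >= 180000 && CLANG_VER < 190102
#    define DISABLE_CRC32_VX 1
#  endif
#endif
```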
Warning treated as an error with GCC from Ubuntu 24.04:
```
/home/runner/work/dotnet_riscv/dotnet_riscv/runtime/src/native/external/zlib-ng/arch/riscv/riscv_features.c(25,33): error G6E97C40B: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses] [/home/runner/work/dotnet_riscv/dotnet_riscv/runtime/src/native/libs/build-native.proj]
```
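The change amounts to making the precedence explicit; a minimal sketch of the pattern (the expression is illustrative, not the one in riscv_features.c):
```
/* Sketch: -Wparentheses fires when && and || are mixed without grouping.
   Wrapping the && term keeps the same semantics and silences the warning. */
static int version_at_least(int major, int minor) {
    return (major > 6) || (major == 6 && minor >= 6);
}
```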
- Remove obsolete checks
- Fix checks that are inconsistent
- Stop compiling compare256/longest_match variants that never get called
- Improve how the generic compare256 functions are handled.
- Allow overriding OPTIMAL_CMP
This simplifies the code and avoids having a lot of code in the compiled library that can never be executed.
This fixes a rightful complaint from the alignment sanitizer that we
alias memory in an unaligned fashion. A nice added bonus is that this
improves performance a tiny bit on the larger buffers, perhaps due to
loops that idiomatically decrement a count and increment a single buffer
pointer rather than the maze of conditional pointer reassignments.
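A minimal sketch of the usual way to satisfy the alignment sanitizer (illustrative, not the exact zlib-ng helper): load through memcpy() instead of casting the byte pointer, which compilers lower to a single unaligned-safe load.
```
#include <stdint.h>
#include <string.h>

/* Sketch: read a 64-bit word from an arbitrarily aligned byte pointer without
   unaligned aliasing; the memcpy() is folded into one load on common targets. */
static inline uint64_t load64(const unsigned char *p) {
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}
```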
While here, let's write a unit test just for this. Since this is the only
variant that accesses memory in a potentially unaligned fashion that doesn't
explicitly go byte by byte or use intrinsics that don't require alignment,
we'll enable it only for this function for now. Adding more tests later if
need be should be possible. For everything else that isn't CRC, we're relying
on UBSan to hopefully catch things by chance.
No longer does the big iron of yore, which lacks SIMD-optimized loads, need
to search strings a byte at a time like primitive machines of the VAX era.
This guard was here mostly because the string comparison searched with "count
trailing zeros", which assumes an endianness. We can just conditionally use
"count leading zeros" on big endian and stop using the extremely naive C
implementation. This makes things a tad bit faster.
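A minimal sketch of the endianness-aware mismatch search (illustrative; assumes GCC/Clang builtins and a nonzero difference):
```
#include <stdint.h>

/* Sketch: locate the first differing byte within a 64-bit word. After XOR,
   the first difference sits at the low end on little endian (count trailing
   zeros) and at the high end on big endian (count leading zeros).
   Caller must ensure a != b, since ctz/clz of zero are undefined. */
static inline unsigned first_mismatch(uint64_t a, uint64_t b) {
    uint64_t diff = a ^ b;
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return (unsigned)(__builtin_clzll(diff) >> 3);
#else
    return (unsigned)(__builtin_ctzll(diff) >> 3);
#endif
}
```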
There are currently some overflow problems in the adler32_rvv
implementation, which can lead to wrong results for some input; these
problems are easily exhibited when running `git fsck` with zlib-ng
substituting the system zlib on a big git repository.
These problems and the solutions are the following:
- When the input data is long enough, the v_buf32_accu can overflow too.
Add it to the modulo code that happens per ~NMAX bytes.
- When the vector data is reduced to scalars, the resulting scalar
  value (and the processed length) may cause the calculation of sum2
  to overflow. Add mod BASE to all these reductions and to the initial
  calculation of sum2.
- When the remaining data is less than vl bytes, the code falls back to a
  scalar implementation; however, the sum2 and adler values have just been
  reduced from vectors and can be big enough that sum2 overflows in the
  scalar code. Modulo them before the scalar code to prevent such overflow
  (because vl is surely much smaller than NMAX); see the sketch after this
  list.
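A minimal sketch of why the pre-reduction matters for the scalar tail (illustrative only; BASE and the NMAX bound are the standard Adler-32 constants):
```
#include <stddef.h>
#include <stdint.h>

#define BASE 65521U   /* largest prime smaller than 2^16 */

/* Sketch: if adler or sum2 enter the tail un-reduced (e.g. near 2^32 after a
   vector reduction), even a few bytes overflow 32 bits. Reducing both first
   keeps the loop safe for any len below NMAX (~5552), which covers a sub-vl
   tail by a wide margin. */
static uint32_t adler32_tail(uint32_t adler, uint32_t sum2, const uint8_t *buf, size_t len) {
    adler %= BASE;
    sum2 %= BASE;
    while (len--) {
        adler += *buf++;
        sum2 += adler;
    }
    return ((sum2 % BASE) << 16) | (adler % BASE);
}
```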
Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
For reasons that aren't quite so clear, using the masked writes here
did not pipeline very well. Either setting up the mask stalled things
or masked moves have issues overlapping regular moves. Simply putting
the masked moves behind a branch that is rarely taken seemed to do the
trick in improving the ILP. While here, put masked loads behind the same
branch in case there were ever a hazard from overreading.
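A minimal sketch of the pattern (illustrative; assumes AVX-512VL/BW byte masks rather than the exact chunkset code):
```
#include <immintrin.h>
#include <stdint.h>

/* Sketch: keep the common full-width path free of mask setup; only the rare
   partial-vector path builds a mask and uses masked load/store. */
static inline void copy_upto32(uint8_t *dst, const uint8_t *src, unsigned len) {
    if (len >= 32) {
        _mm256_storeu_si256((__m256i *)dst, _mm256_loadu_si256((const __m256i *)src));
    } else {
        __mmask32 m = (__mmask32)(((uint32_t)1 << len) - 1);   /* len < 32 */
        _mm256_mask_storeu_epi8(dst, m, _mm256_maskz_loadu_epi8(m, src));
    }
}
```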
While these are technically different instructions, no such CPU exists
that has AVX2 that doesn't have BMI2. Enabling BMI2 allows us to
eliminate several flag stalls by having flagless versions of shifts, and
allows us to not clobber and move around GPRs so much in scalar code.
There's usually a sizeable benefit for enabling it. Since we're building
with BMI2 for AVX2 functions, let's also just make sure the CPU claims
to support it (just to cover our bases).
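A minimal sketch of the extra runtime check (illustrative; uses the GCC/Clang cpuid.h helper and CPUID leaf 7 bits, where AVX2 is EBX bit 5 and BMI2 is EBX bit 8):
```
#if defined(__GNUC__) || defined(__clang__)
#include <cpuid.h>

/* Sketch: only dispatch to the AVX2+BMI2-compiled functions when the CPU
   reports both feature bits. */
static int have_avx2_and_bmi2(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ((ebx >> 5) & 1) && ((ebx >> 8) & 1);
}
#endif
```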
It's unclear whether Raspberry Pi OS's shipped GCC doesn't properly detect
ACLE (/proc/cpuinfo claims to support AES), but in any case the
preprocessor macro for that flag is not defined with -march=native on a
Raspberry Pi 5. Unfortunately that means when built "WITH_NATIVE", we do
not get a fast CRC function. The CRC32 preprocessor macro _IS_ defined,
and the auto detection when built without NATIVE support does properly
get dispatched to. Since we only need the scalar CRC32 and not the polynomial
stuff anyhow, let's make it an || condition rather than a && one.
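A minimal sketch of the relaxed gate (the macro names here are assumptions for illustration, not the exact zlib-ng condition):
```
/* Sketch: accept the CRC32 feature macro on its own instead of also requiring
   the crypto/ACLE macro, since the scalar CRC32 instructions are all we need. */
#if defined(__ARM_FEATURE_CRC32) || defined(__ARM_FEATURE_AES)
#  define HAVE_ACLE_CRC32 1
#endif
```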
This takes advantage of the fact that on AVX512 architectures, masked
moves are incredibly cheap. There are many places where we have to
fallback to the safe C implementation of chunkcopy_safe because of the
assumed overwriting that occurs. We're able to sidestep most of the branching
needed here by simply controlling the bounds of our writes with a mask.
This gives us appreciable gains on a number of fronts. The first being
we're inlining a pretty hot function that was getting dispatched to
regularly. Another is that we're able to do a safe lagged copy of a
distance that is smaller, so CHUNKCOPY gets its teeth back here for
smaller sizes, without having to do another dispatch to a function.
We're also now doing two overlapping writes at once and letting the CPU
do its store forwarding. This was an enhancement @dougallj had suggested
a while back.
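A minimal sketch of bounding the writes with a mask (illustrative; assumes AVX-512BW byte-granular masked stores, not the actual chunkcopy_safe code):
```
#include <immintrin.h>
#include <stdint.h>

/* Sketch: cap a 64-byte copy at exactly len bytes with a write mask instead
   of dropping to a byte-safe C fallback. Assumes 0 < len <= 64. */
static inline uint8_t *chunkcopy_bounded(uint8_t *out, const uint8_t *from, unsigned len) {
    __mmask64 m = (len >= 64) ? ~(__mmask64)0 : (((__mmask64)1 << len) - 1);
    _mm512_mask_storeu_epi8(out, m, _mm512_maskz_loadu_epi8(m, from));
    return out + len;
}
```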
Additionally, the "half chunk mag" here is fundamentally less
complicated because it doesn't require sythensizing cross lane permutes
with a blend operation, so we can optimistically do that first if the
len is small enough that a full 32 byte chunk doesn't make any sense.
Put length 16 in the length-checking ladder and take care of it there,
since it's also a simple case to handle. We kind of went out of our way
to pretend 128-bit vectors didn't exist when using AVX2, but this can be
handled in a single instruction. Strangely, the intrinsic uses vector
register operands but the instruction itself assumes a memory operand
for the source. This also means we don't have to handle this case in our
"GET_CHUNK_MAG" function.