For reasons that aren't quite so clear, using the masked writes here
did not pipeline very well. Either setting up the mask stalled things
or masked moves have issues overlapping regular moves. Simply putting
the masked moves behind a branch that is rarely taken seemed to do the
trick and improved the ILP. While here, the masked loads were put behind
the same branch in case there was ever a hazard from overreading.
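Roughly, the shape of that change looks like the sketch below (AVX-512BW
intrinsics; rem, out, and from are illustrative names, not the actual code):

    #include <immintrin.h>
    #include <stdint.h>

    /* Illustrative only: keep the masked forms off the hot path so the
       common case is plain full-width loads and stores. */
    static inline void copy_chunk(uint8_t *out, const uint8_t *from, size_t rem) {
        if (rem < 64) {                                        /* rarely taken */
            __mmask64 k = ((uint64_t)1 << rem) - 1;            /* low rem bits set */
            __m512i v = _mm512_maskz_loadu_epi8(k, from);      /* masked load: no overread */
            _mm512_mask_storeu_epi8(out, k, v);                /* masked store: no overwrite */
        } else {
            _mm512_storeu_si512((void *)out,
                                _mm512_loadu_si512((const void *)from));
        }
    }
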
This takes advantage of the fact that on AVX512 architectures, masked
moves are incredibly cheap. There are many places where we have to
fall back to the safe C implementation of chunkcopy_safe because of the
assumed overwriting that occurs. We're able to sidestep most of the
branching needed here by simply controlling the bounds of our writes
with a mask.
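As a hedged sketch of the idea with 256-bit vectors (AVX-512VL/BW; safe_end,
out, and chunk are illustrative names), the store is clipped to the writable
span instead of branching into a scalar fallback:

    #include <immintrin.h>
    #include <stdint.h>

    /* Clip a 32-byte store to however many bytes remain writable. */
    static inline void store_bounded(uint8_t *out, const uint8_t *safe_end, __m256i chunk) {
        size_t writable = (size_t)(safe_end - out) + 1;   /* bytes we may touch */
        __mmask32 k = (writable >= 32) ? (__mmask32)0xFFFFFFFFu
                                       : (__mmask32)(((uint32_t)1 << writable) - 1);
        _mm256_mask_storeu_epi8(out, k, chunk);
    }
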
This gives us appreciable gains on a number of fronts. The first is
that we're inlining a pretty hot function that was getting dispatched
to regularly. Another is that we're able to do a safe lagged copy for
a smaller distance, so CHUNKCOPY gets its teeth back here for smaller
sizes without having to do another dispatch to a function.
We're also now doing two overlapping writes at once and letting the CPU
do its store forwarding. This was an enhancement @dougallj had suggested
a while back.
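One way to picture the overlapping-store idea (a sketch only, not the actual
code; chunk is assumed to already hold the repeating pattern built by the
permutation-table step described later, and adv is a multiple of dist):

    #include <immintrin.h>
    #include <stdint.h>

    /* Two full-width stores whose ranges overlap whenever dist does not divide
       32; the second store lands on a multiple of dist so the pattern stays
       phase-aligned, and later loads from this region can be serviced out of
       the store buffer via store forwarding. */
    static inline uint8_t *store_pair(uint8_t *out, __m256i chunk, size_t adv) {
        _mm256_storeu_si256((__m256i *)out, chunk);
        _mm256_storeu_si256((__m256i *)(out + adv), chunk);
        return out + 2 * adv;
    }
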
Additionally, the "half chunk mag" here is fundamentally less
complicated because it doesn't require synthesizing cross lane permutes
with a blend operation, so we can optimistically do that first if the
len is small enough that a full 32 byte chunk doesn't make any sense.
Put length 16 in the length checking ladder and take care of it there
since it's also a simple case to handle. We kind of went out of our way
to pretend 128 bit vectors didn't exist when using AVX2, but this can be
handled in a single instruction. Strangely, the intrinsic takes a vector
register operand while the instruction itself expects a memory operand
for the source. This also means we don't have to handle this case in our
"GET_CHUNK_MAG" function.
For most realistic use cases, this doesn't make a ton of difference.
However, for things which are highly compressible and enjoy very large
run length encodes in the window, this is a huge win.
We leverage a permutation table to swizzle the contents of the memory
chunk into a vector register and then splat that over memory with a fast
copy loop.
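In spirit (not the exact implementation), for a dist of at most 16 bytes this
looks something like the following; splat_pattern and the index-table setup
are illustrative stand-ins for the real lookup tables:

    #include <immintrin.h>
    #include <stdint.h>

    /* Replicate the dist-byte pattern preceding `out` across a vector with a
       table-driven byte shuffle, then splat it over the destination,
       advancing by the largest multiple of dist per store. */
    static uint8_t *splat_pattern(uint8_t *out, size_t dist, size_t len) {
        uint8_t idx[16];
        for (size_t i = 0; i < 16; i++)
            idx[i] = (uint8_t)(i % dist);     /* in practice this comes from a lookup table */
        __m128i perm = _mm_loadu_si128((const __m128i *)idx);
        /* reads up to 16 bytes behind/at out; chunk-style APIs assume slack,
           and the shuffle only ever selects the first dist (valid) bytes */
        __m128i src  = _mm_loadu_si128((const __m128i *)(out - dist));
        __m128i pat  = _mm_shuffle_epi8(src, perm);   /* repeating pattern, phase 0 */
        size_t adv = 16 - (16 % dist);                /* largest multiple of dist per store */
        while (len >= 16) {
            _mm_storeu_si128((__m128i *)out, pat);
            out += adv;
            len -= adv;
        }
        /* tail bytes handled elsewhere (e.g. a bounded/masked store) */
        return out;
    }
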
In essence, where this helps, it helps a lot. Where it doesn't, it does
no measurable damage to the runtime.
This commit also simplifies a chunkcopy_safe call for determining a
distance. Using labs is enough to give the same behavior as before,
with the added benefit that no predication is required _and_, most
importantly, static analysis by GCC's string fortification can't throw a
fit because it conveys better to the compiler that the input into
__builtin_memcpy will always be in range.
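A sketch of the shape of that simplification (names are illustrative, and
plain memcpy stands in for the builtin):

    #include <stdlib.h>
    #include <string.h>
    #include <stdint.h>

    /* Clamp the copy length by the absolute distance between the pointers, so
       the copied ranges can never overlap and the compiler can see that the
       memcpy length is always in range. No branch on pointer order is needed. */
    static inline uint8_t *copy_within(uint8_t *out, const uint8_t *from, uint64_t len) {
        uint64_t safelen = (uint64_t)labs((long)(out - from));
        uint64_t n = len < safelen ? len : safelen;
        memcpy(out, from, (size_t)n);
        return out + n;
    }
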
It would seem that on some platforms, namely those which are
!UNALIGNED64_OK, there was a likelihood of chunkmemset_safe_c copying all
the bytes before passing control flow to chunkcopy, a function which is
explicitly unsafe to be called with a zero length copy.
This fixes that bug for those platforms.
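A minimal sketch of the guard (structure assumed, not the verbatim fix, with
a function pointer standing in for the platform CHUNKCOPY routine):

    #include <stdint.h>

    /* Don't hand a zero-length copy to the chunked path: if the safe scalar
       copy already consumed everything, return early. */
    static uint8_t *finish_copy(uint8_t *out, const uint8_t *from, unsigned len,
                                uint8_t *(*chunkcopy)(uint8_t *, const uint8_t *, unsigned)) {
        if (len == 0)
            return out;
        return chunkcopy(out, from, len);
    }
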
This was found to have a significant impact on a highly compressible PNG
for both the encode and decode. Some deltas show performance improving
by 60% or more.
For the scenarios where the "dist" does not evenly divide our chunk
size, we simply repeat the bytes as many times as possible into our
vector registers. We then store the entire vector and advance by the
largest multiple of dist that fits in a chunk (dist times the quotient
of the chunk size divided by dist).
If dist happens to be 1, there's no reason to not just call memset from
libc (this is likely to be just as fast if not faster).
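For example (a sketch; out and len are illustrative names):

    #include <string.h>
    #include <stdint.h>

    /* A distance of 1 is just a byte fill: the byte right behind `out`
       repeated `len` times, which libc's memset handles as fast as anything. */
    static inline uint8_t *fill_dist1(uint8_t *out, size_t len) {
        memset(out, out[-1], len);
        return out + len;
    }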