This takes advantage of the fact that on AVX512 architectures, masked
moves are incredibly cheap. There are many places where we have to
fall back to the safe C implementation of chunkcopy_safe because of the
assumed overwriting that occurs. We're able to sidestep most of the
branching needed here by simply controlling the bounds of our writes with
a mask.
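A rough sketch of the idea, not the actual patch: the helper names below (masked_copy32_avx512, bounded_copy32) are hypothetical, and the sketch assumes GCC or Clang on x86-64 with AVX512BW and AVX512VL detected at runtime. The mask decides which bytes are loaded and stored, so a short copy near the end of the buffer needs no per-byte branching and no over-write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define SKETCH_HAVE_AVX512_PATH 1

/* Hypothetical helper: copy len (<= 32) bytes with one masked load/store. */
__attribute__((target("avx512f,avx512bw,avx512vl")))
static void masked_copy32_avx512(uint8_t *out, const uint8_t *from, size_t len) {
    /* Low len bits set: byte lane i is touched only when i < len. */
    __mmask32 k = (len >= 32) ? (__mmask32)0xFFFFFFFFu
                              : (__mmask32)((1u << len) - 1u);
    __m256i v = _mm256_maskz_loadu_epi8(k, from);  /* masked-off lanes are not read */
    _mm256_mask_storeu_epi8(out, k, v);            /* masked-off lanes are not written */
}
#endif

/* Hypothetical wrapper: use the masked path when the CPU supports it,
 * otherwise fall back to an exact-length memcpy. */
static void bounded_copy32(uint8_t *out, const uint8_t *from, size_t len) {
    assert(len <= 32);
#ifdef SKETCH_HAVE_AVX512_PATH
    if (__builtin_cpu_supports("avx512bw") && __builtin_cpu_supports("avx512vl")) {
        masked_copy32_avx512(out, from, len);
        return;
    }
#endif
    memcpy(out, from, len);
}
```

Either path writes exactly len bytes, which is the property the safe copy needs.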

The safe pointer that is computed is an exclusive bound, not an inclusive
one. While we were probably rarely, if ever, bitten by this, it still makes
sense to apply the limit properly.
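To illustrate the off-by-one, here is a minimal sketch with a hypothetical helper (clamp_to_safe is not a name from the code base). It assumes safe points one past the last writable byte, so the room left is safe - out, with no + 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: limit a copy length to the bytes left before safe.
 * safe is treated as exclusive, so the byte at safe itself is never written. */
static size_t clamp_to_safe(const uint8_t *out, const uint8_t *safe, size_t len) {
    assert(out <= safe);
    size_t room = (size_t)(safe - out);  /* exclusive bound: no "+ 1" here */
    return len < room ? len : room;
}
```

Treating the same pointer as inclusive would yield room + 1 and allow one byte to be written past the limit.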

If the output buffer and the window buffer are the same memory
allocation, we cannot make the assumption that chunkunroll does, namely
that it is okay to overwrite the output buffer.
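A small sketch of why the over-write matters, using hypothetical helpers (copy_chunked, copy_exact, copy_match) and an assumed chunk size of 8 bytes. The chunked copy rounds up to whole chunks and so clobbers bytes past out + len; when those bytes are live window data, only the exact-length copy is safe. The sketch assumes the source and destination do not overlap within a chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CHUNK 8  /* assumed chunk width for illustration */

/* Rounds the copy up to whole chunks: may write up to SKETCH_CHUNK - 1
 * bytes past out + len (and read as far past from + len). */
static void copy_chunked(uint8_t *out, const uint8_t *from, size_t len) {
    while (len > 0) {
        uint8_t tmp[SKETCH_CHUNK];
        memcpy(tmp, from, SKETCH_CHUNK);
        memcpy(out, tmp, SKETCH_CHUNK);
        out += SKETCH_CHUNK;
        from += SKETCH_CHUNK;
        len = len > SKETCH_CHUNK ? len - SKETCH_CHUNK : 0;
    }
}

/* Writes exactly len bytes and nothing more. */
static void copy_exact(uint8_t *out, const uint8_t *from, size_t len) {
    while (len--) *out++ = *from++;
}

/* Hypothetical dispatcher: pick the exact copy when the output buffer
 * doubles as the window, since bytes past out + len may still be needed. */
static void copy_match(uint8_t *out, const uint8_t *from, size_t len, int out_is_window) {
    assert(out != NULL && from != NULL);
    if (out_is_window)
        copy_exact(out, from, len);
    else
        copy_chunked(out, from, len);
}
```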

This should reduce the cost of indirection that occurs when calling
functable chunk copying functions inside inflate_fast. It should also allow
the compiler to optimize the inflate fast path for the specific
architecture.
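As a sketch of the difference, under assumed names (functable_sketch, chunkcopy_generic, and the CHUNKCOPY macro are all hypothetical here): the first loop calls the copy routine through a function pointer on every iteration, while the second binds it at compile time so the compiler is free to inline it into the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for an architecture-specific chunk copy. */
static inline uint8_t *chunkcopy_generic(uint8_t *out, const uint8_t *from, size_t len) {
    memcpy(out, from, len);
    return out + len;
}

typedef uint8_t *(*chunkcopy_fn)(uint8_t *, const uint8_t *, size_t);
struct functable_sketch { chunkcopy_fn chunkcopy; };

/* Indirect: every call goes through the table, so it cannot be inlined. */
static uint8_t *copy_runs_indirect(const struct functable_sketch *ft, uint8_t *out,
                                   const uint8_t *from, size_t run, size_t count) {
    assert(ft != NULL && ft->chunkcopy != NULL);
    for (size_t i = 0; i < count; i++)
        out = ft->chunkcopy(out, from + i * run, run);
    return out;
}

/* Direct: the copy routine is fixed when this variant is compiled, as it
 * would be if the loop body were built once per architecture. */
#define CHUNKCOPY chunkcopy_generic
static uint8_t *copy_runs_direct(uint8_t *out, const uint8_t *from,
                                 size_t run, size_t count) {
    for (size_t i = 0; i < count; i++)
        out = CHUNKCOPY(out, from + i * run, run);
    return out;
}
```

Both variants produce the same bytes; only the call mechanism differs.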