As it turns out, trying to peel off the remainder with so many branches
caused the code size to inflate a bit too much that this function
wouldn't inline without some fairly aggressive optimization flags. Only
catching vector sized chunks here makes the loop body small enough and
having the byte by byte copy idiom at the bottom gives the compiler some
flexibility that it is likely to do something there.
Inflate used to allocate state during init, but window would be allocated
when/if needed and could be resized and that required a new free/alloc round.
- Now, we allocate state and a 32K window during init, allowing the latency cost
of allocs to be done during init instead of at one or more times later.
- Total memory allocation is about the same when requesting a 32K window, but
if now window or a smaller window was requested, then it is an increase.
- While doing alloc(), we now store pointer to corresponding free(), avoiding crashes
with applications that incorrectly set alloc/free pointers after running init function.
- After init has succeeded, inflate will no longer possibly fail due to a failing malloc.
Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
This should reduce the cost of indirection that occurs when calling functable
chunk copying functions inside inflate_fast. It should also allow the compiler
to optimize the inflate fast path for the specific architecture.
There is no hardware control for DFLTCC window size, and because of
that supporting small windows for deflate is not trivial: one has to
make sure that DFLTCC does not emit large distances, which most likely
entails somehow trimming the window and/or input in order to make sure
that whave + avail_in <= wsize.
But inflate is much easier: one only has to allocate enough space. Do
that in dfltcc_alloc_window(), and also introduce ZCOPY_WINDOW() in
order to copy everything, not just what the software implementation
cares about.
After this change, software and hardware window formats no longer
match: the software will use wbits and wsize, and the hardware will use
HB_BITS and HB_SIZE. Unlike deflate, inflate does not switch between
software and hardware implementations mid-stream, which leaves only
inflateSetDictionary() and inflateGetDictionary() interesting.
For most realistic use cases, this doesn't make a ton of difference.
However, for things which are highly compressible and enjoy very large
run length encodes in the window, this is a huge win.
We leverage a permutation table to swizzle the contents of the memory
chunk into a vector register and then splat that over memory with a fast
copy loop.
In essence, where this helps, it helps a lot. Where it doesn't, it does
no measurable damage to the runtime.
This commit also simplifies a chunkcopy_safe call for determining a
distance. Using labs is enough to give the same behavior as before,
with the added benefit that no predication is required _and_, most
importantly, static analysis by GCC's string fortification can't throw a
fit because it conveys better to the compiler that the input into
builtin_memcpy will always be in range.
Currently deflate and inflate both use a common state struct. There are
several variables in this struct that we don't need for inflate, and
more may be coming in the future. Therefore split them in two separate
structs. This in turn requires splitting ZALLOC_STATE and ZCOPY_STATE
macros.
inflate_p.h(244,18): warning C4018: '>': signed/unsigned mismatch
inflate_p.h(234,38): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data
inffast.c
inflate_p.h(244,18): warning C4018: '>': signed/unsigned mismatch
inflate_p.h(234,38): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data
inflate.c
inflate_p.h(244,18): warning C4018: '>': signed/unsigned mismatch
inflate_p.h(234,38): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data
This was found to have a significant impact on a highly compressible PNG
for both the encode and decode. Some deltas show performance improving
as much as 60%+.
For the scenarios where the "dist" is not an even modulus of our chunk
size, we simply repeat the bytes as many times as possible into our
vector registers. We then copy the entire vector and then advance the
quotient of our chunksize divided by our dist value.
If dist happens to be 1, there's no reason to not just call memset from
libc (this is likely to be just as fast if not faster).
Stop relying on software and hardware inflate window formats being the
same and act the way we already do for deflate: provide and implement
window-related hooks.
Another possibility would be to use an in-line history buffer (by not
setting HBT_CIRCULAR), but this would require an extra memmove().
Also fix a couple corner cases in the software implementation of
inflateGetDictionary() and inflateSetDictionary().
This commit significantly improves inflate performance by reorganizing the window buffer into a contiguous window and pending output buffer. The goal of this layout is to reduce branching, improve cache locality, and enable for the use of crc folding with gzip input.
The window buffer is allocated as a multiple of the user-selected window size. In this commit, a factor of 2 is utilized.
The layout of the window buffer is divided into two sections. The first section, window offset [0, wsize), is reserved for history that has already been output. The second section, window offset [wsize, 2 * wsize), is reserved for buffering pending output that hasn't been flushed to the user's output buffer yet.
The history section grows downwards, towards the window offset of 0. The pending output section grows upwards, towards the end of the buffer. As a result, all of the possible distance/length data that may need to be copied is contiguous. This removes the need to stitch together output from 2 separate buffers.
In the case of gzip input, crc folding is used to copy the pending output to the user's buffers.
Co-authored-by: Nathan Moinvaziri <nathan@nathanm.com>