Inflate used to allocate state during init, but the window was allocated only
when/if needed, and resizing it required another free/alloc round.
- Now, we allocate state and a 32K window during init, allowing the latency cost
of allocs to be done during init instead of at one or more times later.
- Total memory allocation is about the same when requesting a 32K window, but
if no window or a smaller window was requested, then it is an increase.
- While doing alloc(), we now store a pointer to the corresponding free(), avoiding crashes
with applications that incorrectly set alloc/free pointers after running the init function.
- After init has succeeded, inflate can no longer fail due to a failing malloc.
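
A minimal sketch of the "store the free() pointer at alloc time" idea, assuming a simplified stream struct. All names here (sketch_stream, sketch_state, sketch_init, sketch_end and the default allocator wrappers) are hypothetical and are not the zlib-ng API:

```c
#include <stddef.h>
#include <stdlib.h>

typedef void *(*alloc_fn)(void *opaque, unsigned items, unsigned size);
typedef void (*free_fn)(void *opaque, void *ptr);

/* Hypothetical internal state: remembers which free() matches its alloc(). */
struct sketch_state {
    free_fn saved_free;
};

/* Hypothetical user-visible stream with user-settable alloc/free pointers. */
typedef struct {
    alloc_fn zalloc;
    free_fn zfree;
    void *opaque;
    struct sketch_state *state;
} sketch_stream;

static void *sketch_default_alloc(void *opaque, unsigned items, unsigned size) {
    (void)opaque;
    return calloc(items, size);
}

static void sketch_default_free(void *opaque, void *ptr) {
    (void)opaque;
    free(ptr);
}

/* Allocate the state once and capture the free function that pairs with it. */
static int sketch_init(sketch_stream *strm) {
    struct sketch_state *s = strm->zalloc(strm->opaque, 1, sizeof(*s));
    if (s == NULL)
        return -1;
    s->saved_free = strm->zfree;
    strm->state = s;
    return 0;
}

/* Release via the saved pointer, ignoring whatever strm->zfree holds now. */
static void sketch_end(sketch_stream *strm) {
    if (strm->state != NULL) {
        free_fn f = strm->state->saved_free;
        f(strm->opaque, strm->state);
        strm->state = NULL;
    }
}
```

With this arrangement, an application that overwrites zfree after init still gets its state released by the function that was in place when the memory was allocated.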
Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
This should reduce the cost of indirection that occurs when calling functable
chunk copying functions inside inflate_fast. It should also allow the compiler
to optimize the inflate fast path for the specific architecture.
This increases the size of the `codes` array by 1920 bytes (33%), but
improves performance a little. Root table size is still limited by the
maximum code length in use, so tiny files typically see no change to
table-building time, as they don't use longer codes.
Currently deflate and inflate both use a common state struct. There are
several variables in this struct that we don't need for inflate, and
more may be coming in the future. Therefore split them into two separate
structs. This in turn requires splitting the ZALLOC_STATE and ZCOPY_STATE
macros.
This commit significantly improves inflate performance by reorganizing the window buffer into a contiguous window and pending output buffer. The goal of this layout is to reduce branching, improve cache locality, and enable the use of crc folding with gzip input.
The window buffer is allocated as a multiple of the user-selected window size. In this commit, a factor of 2 is utilized.
The layout of the window buffer is divided into two sections. The first section, window offset [0, wsize), is reserved for history that has already been output. The second section, window offset [wsize, 2 * wsize), is reserved for buffering pending output that hasn't been flushed to the user's output buffer yet.
The history section grows downwards, towards the window offset of 0. The pending output section grows upwards, towards the end of the buffer. As a result, all of the possible distance/length data that may need to be copied is contiguous. This removes the need to stitch together output from 2 separate buffers.
In the case of gzip input, crc folding is used to copy the pending output to the user's buffers.
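
The layout described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this commit: the struct and function names (win_t, win_put_literal, win_copy_match, win_flush) are hypothetical, the buffer is exactly 2 * wsize bytes, and the caller's output buffer is assumed large enough to hold all pending bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical window: history ends at offset wsize, pending output starts there. */
typedef struct {
    unsigned char *window; /* 2 * wsize bytes */
    size_t wsize;
    size_t whave;          /* valid history bytes in [wsize - whave, wsize) */
    size_t pending;        /* unflushed bytes in [wsize, wsize + pending) */
} win_t;

static void win_put_literal(win_t *w, unsigned char c) {
    assert(w->pending < w->wsize);
    w->window[w->wsize + w->pending++] = c;
}

/* History and pending output are adjacent, so the source is one contiguous run. */
static void win_copy_match(win_t *w, size_t dist, size_t len) {
    unsigned char *dst = w->window + w->wsize + w->pending;
    const unsigned char *src = dst - dist;
    assert(dist >= 1 && dist <= w->whave + w->pending);
    assert(len <= w->wsize - w->pending);
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i]; /* byte-wise copy so overlapping matches repeat correctly */
    w->pending += len;
}

/* Copy pending bytes to the user, then slide the newest bytes down into history. */
static size_t win_flush(win_t *w, unsigned char *out) {
    size_t n = w->pending;
    size_t total = w->whave + n;
    size_t keep = total < w->wsize ? total : w->wsize;
    memcpy(out, w->window + w->wsize, n);
    memmove(w->window + w->wsize - keep, w->window + w->wsize + n - keep, keep);
    w->whave = keep;
    w->pending = 0;
    return n;
}
```

In this sketch the plain memcpy in win_flush marks the spot where, for gzip input, a combined copy-and-checksum routine could be used instead.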
Co-authored-by: Nathan Moinvaziri <nathan@nathanm.com>
infback.c:200:13: runtime error: left shift of 255 by 24 places cannot be represented in type 'int'
624: SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /Users/runner/work/zlib-ng/zlib-ng/infback.c:200:13 in
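
For illustration only (this is not the code at infback.c:200, and load_le32 is a hypothetical helper), the class of problem reported above arises because an unsigned char operand is promoted to int before the shift, and 255 << 24 does not fit in a 32-bit int. Casting to an unsigned 32-bit type first keeps the shift well defined:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: assemble a 32-bit value from four little-endian bytes. */
static uint32_t load_le32(const unsigned char *p) {
    assert(p != NULL);
    /* Without the casts, p[3] << 24 would be evaluated in (signed) int
       and overflow for byte values of 128 and above. */
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```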
Based on a patch by Nigel Tao:
e0ff1f330c
This patch makes unzipping of files up to 1.2x faster on x86_64. The other part
(1.3x speedup) of the patch by Nigel Tao is unsafe as discussed in the review of
that pull request. zlib-ng already has a different way to optimize the memcpy
for that missing part.
The original patch was enabled only on little-endian machines. This patch adapts
the loading of 64 bits at a time to big endian machines.
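
One way to load 64 bits at a time on either byte order is sketched below. It assumes a runtime endianness probe and a manual byte swap; the function names (sketch_bswap64, sketch_load_64_le) are hypothetical, and the actual patch may well use compile-time detection or compiler builtins instead.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 64-bit value with three swap steps. */
static uint64_t sketch_bswap64(uint64_t v) {
    v = ((v & 0x00ff00ff00ff00ffULL) << 8) | ((v >> 8) & 0x00ff00ff00ff00ffULL);
    v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
    return (v << 32) | (v >> 32);
}

/* Load 8 input bytes so that in[0] always ends up in the lowest 8 bits,
   which is the order the inflate bit accumulator consumes them in. */
static uint64_t sketch_load_64_le(const unsigned char *in) {
    const uint16_t probe = 1;
    uint64_t chunk;
    memcpy(&chunk, in, sizeof(chunk)); /* safe for unaligned input */
    if (*(const unsigned char *)&probe != 1)
        chunk = sketch_bswap64(chunk); /* big-endian host: swap to LE order */
    return chunk;
}
```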
Benchmarking notes from Hans Kristian Rosbach:
https://github.com/zlib-ng/zlib-ng/pull/224#issuecomment-444837182
Benchmark runs: 7, tested levels: 0-7, testfile 100M
develop at 796ad10 with -O3:
Level  Comp     Comptime min/avg/max  Decomptime min/avg/max
0      100.02%  0.01/0.01/0.02        0.08/0.09/0.11
1      47.08%   0.49/0.50/0.51        0.37/0.39/0.40
2      36.02%   1.10/1.12/1.13        0.39/0.39/0.40
3      34.77%   1.32/1.34/1.37        0.38/0.38/0.38
4      33.41%   1.50/1.53/1.56        0.37/0.37/0.38
5      33.07%   1.85/1.87/1.90        0.36/0.37/0.38
6      32.83%   2.54/2.57/2.61        0.36/0.37/0.38
avg    45.31%   1.28                  0.34
tot             62.60                 16.58
PR224 with -O3:
Level  Comp     Comptime min/avg/max  Decomptime min/avg/max
0      100.02%  0.01/0.01/0.02        0.09/0.09/0.10
1      47.08%   0.49/0.50/0.51        0.37/0.37/0.38
2      36.02%   1.09/1.11/1.13        0.38/0.38/0.39
3      34.77%   1.32/1.34/1.38        0.35/0.36/0.38
4      33.41%   1.49/1.52/1.54        0.36/0.36/0.37
5      33.07%   1.85/1.88/1.93        0.35/0.36/0.37
6      32.83%   2.55/2.58/2.65        0.35/0.35/0.36
avg    45.31%   1.28                  0.33
tot             62.48                 16.02
So I see about a 5.4% speedup on my x86_64 machine, not quite the 1.2x speedup
but a nice speedup nevertheless. This benchmark measures the total execution
time of minigzip, so that might have caused some inefficiencies.
At -O2, I only see a 2.7% speedup.
to co-exist in an application that has been linked to something that
depends on stock zlib. Previously, that would cause random problems
since there is no way to guarantee what zlib version is being used
for each dynamically linked function.
Add the corresponding zlib-ng.h.
Tests, example and minigzip will not compile before they have been
adapted to use the correct functions as well.
Either duplicate them, so we have minigzip-ng.c for example, or add
compile-time detection in the source code.
* local -> static
* Normalize and cleanup line-endings
* Fix warnings under Visual Studio.
* Whitespace cleanup
***
This patch has been edited to merge cleanly and to exclude type changes.
Based on 8d7a7c3b82c6e38734bd504dac800b148ab410d0 "Type Cleanup"
This commit was cherry-picked but left incomplete, resulting in a few
problems with gcc on 64-bit Windows.
This reverts commit edd7a72e05.
Conflicts:
arch/x86/crc_folding.c
arch/x86/fill_window_sse.c
deflate.c
deflate.h
match.c
trees.c