A bug fix in zlib 1.2.12 resulted in a slight slowdown (1-2%) of
deflate. This commit provides the option to #define LIT_MEM, which
uses more memory to reverse most of that slowdown. The memory for
the pending buffer and symbol buffers is increased by 25%, which
increases the total memory usage with the default parameters by
about 6%.
madler/zlib#ac8f12c97d1afd9bafa9c710f827d40a407d3266
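The 25% and 6% figures can be checked with a little arithmetic. The sketch below is not code from zlib; deflate_big_allocs() is a hypothetical helper, and it assumes the default windowBits of 15 and memLevel of 8, 16-bit hash table entries, and a pending/symbol buffer of 4 bytes per entry without LIT_MEM versus 5 bytes with it.

```c
#include <stddef.h>

/* Hypothetical helper, not part of zlib: bytes used by the four large
 * deflate allocations. sym_bytes_per_entry is assumed to be 4 without
 * LIT_MEM and 5 with it, which matches the 25% figure. */
static size_t deflate_big_allocs(int window_bits, int mem_level,
                                 size_t sym_bytes_per_entry)
{
    size_t w_size      = (size_t)1 << window_bits;     /* 32768 by default */
    size_t hash_size   = (size_t)1 << (mem_level + 7); /* 32768 by default */
    size_t lit_bufsize = (size_t)1 << (mem_level + 6); /* 16384 by default */

    size_t window  = w_size * 2;     /* two bytes per window position */
    size_t prev    = w_size * 2;     /* one 16-bit entry per position */
    size_t head    = hash_size * 2;  /* one 16-bit entry per hash bucket */
    size_t pending = lit_bufsize * sym_bytes_per_entry;

    return window + prev + head + pending;
}
```

Under these assumptions the total goes from 262144 to 278528 bytes, an increase of 16384 bytes or 6.25%, in line with the "about 6%" above.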
https://github.com/powturbo/TurboBench links zlib and zlib-ng into the
same binary, causing non-static symbol conflicts. Fix by using PREFIX()
for flush_pending(), bi_reverse(), inflate_ensure_window() and all of
the IBM Z symbols.
Note: do not use an explicit zng_, since one of the long-term goals is
to be able to link two versions of zlib-ng into the same binary for
benchmarking [1].
[1] https://github.com/zlib-ng/zlib-ng/pull/1248#issuecomment-1096648932
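A minimal sketch of the prefixing idea, assuming a configurable prefix rather than a hard-coded one. SYM_PREFIX, PASTE, PASTE2 and bi_reverse_demo are hypothetical names chosen for illustration; the actual PREFIX() definition in zlib-ng may differ.

```c
/* Hypothetical build-time prefix; a real build would set this from its
 * configuration (e.g. -DSYM_PREFIX=...) instead of relying on the default. */
#ifndef SYM_PREFIX
#  define SYM_PREFIX zng_
#endif

/* Two-level expansion so SYM_PREFIX is expanded before token pasting. */
#define PASTE2(a, b) a##b
#define PASTE(a, b)  PASTE2(a, b)
#define PREFIX(name) PASTE(SYM_PREFIX, name)

/* With the default prefix this defines zng_bi_reverse_demo().
 * Reverses the low len bits of code (len >= 1). */
unsigned PREFIX(bi_reverse_demo)(unsigned code, int len)
{
    unsigned res = 0;
    do {
        res |= code & 1;
        code >>= 1;
        res <<= 1;
    } while (--len > 0);
    return res >> 1;
}
```

Because every non-static symbol goes through PREFIX(), two builds with different SYM_PREFIX values export disjoint symbol names and can be linked into one binary.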
deflate.c(1575,67): warning C4244: 'function': conversion from 'uint32_t' to 'unsigned char', possible loss of data
deflate_fast.c(60,94): warning C4244: 'function': conversion from 'uint32_t' to 'unsigned char', possible loss of data
deflate_medium.c(39,102): warning C4244: 'function': conversion from 'int' to 'unsigned char', possible loss of data
deflate_slow.c(75,101): warning C4244: 'function': conversion from 'unsigned int' to 'unsigned char', possible loss of data
Some gcc versions complain that parameter c is always less than
MAX_MATCH-MIN_MATCH, and therefore the assertion that checks for this
is useless, but in reality some day MIN_MATCH and MAX_MATCH can change.
So disable the warning around the assertion.
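The approach can be sketched as follows. This is an illustration only: check_match_length() is a hypothetical function, and -Wtype-limits is an assumption about which warning is involved, since the message above does not name it.

```c
#include <assert.h>

#define MIN_MATCH 3
#define MAX_MATCH 258

/* Hypothetical stand-in for the checked function; c is the match length
 * minus MIN_MATCH, stored in a single byte. */
static int check_match_length(unsigned char c)
{
#if defined(__GNUC__)
#  pragma GCC diagnostic push
#  pragma GCC diagnostic ignored "-Wtype-limits" /* assumed warning name */
#endif
    /* Always true with today's constants, but kept in case MIN_MATCH or
     * MAX_MATCH ever change. */
    assert(c <= MAX_MATCH - MIN_MATCH);
#if defined(__GNUC__)
#  pragma GCC diagnostic pop
#endif
    return c + MIN_MATCH;
}
```

The push/pop pair limits the suppression to the assertion, so the warning stays active for the rest of the file.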
deflate_p.h(42,37): warning C4244: '=': conversion from 'unsigned int' to 'unsigned char', possible loss of data
deflate_p.h(43,42): warning C4244: '=': conversion from 'unsigned int' to 'unsigned char', possible loss of data
This makes the checks for ARM CPU features as inexpensive as on the x86 side
by calling the runtime feature detection once in deflate/inflate init and then
storing the result in a global variable.
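A minimal sketch of the detect-once pattern. All names are hypothetical, and probe_crc32_support() is a placeholder for a real query such as reading hardware capability bits from the operating system.

```c
/* Hypothetical names throughout. */
static int features_checked = 0;
static int probe_calls = 0;      /* only here to show the probe runs once */
int arm_has_crc32_demo = 0;      /* global read by the hot paths */

/* Placeholder for the expensive runtime query. */
static int probe_crc32_support(void)
{
    probe_calls++;
    return 1;
}

/* Called from the deflate/inflate init paths. Not thread-safe as written;
 * a real implementation would need to consider concurrent initialization. */
void check_features_once(void)
{
    if (features_checked)
        return;
    arm_has_crc32_demo = probe_crc32_support();
    features_checked = 1;
}
```

After initialization, checking a feature costs a single load of a global variable instead of a call into the detection code.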
The current version of insert_string_c and variations for sse2, arm, and aarch64
in zlib-ng has changed semantics from the original code of INSERT_STRING macro
in zlib:
#define INSERT_STRING(s, str, match_head) \
(UPDATE_HASH(s, s->ins_h, s->window[(str) + (MIN_MATCH-1)]), \
match_head = s->prev[(str) & s->w_mask] = s->head[s->ins_h], \
s->head[s->ins_h] = (Pos)(str))
The code of INSERT_STRING assigns match_head with the content of s->head[s->ins_h].
In zlib-ng, the assignment to match_head happens in the caller of insert_string().
zlib-ng's insert_string_*() functions return 0 instead of str+idx in case of
a collision, i.e., when s->head[s->ins_h] == str+idx.
The effect of returning 0 instead of the content of s->head[s->ins_h] is that
the search for a longest_match through s->prev[] chains will be cut short when
arriving at 0. This leads to a shorter compression time at the expense of a
worse compression rate: returning 0 cuts out the search space.
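The INSERT_STRING semantics described above can be sketched as a function that always returns the previous chain head. This is an illustration, not the zlib-ng implementation: demo_state, demo_hash() and demo_insert_string() are hypothetical, and the hash is a placeholder rather than UPDATE_HASH.

```c
#include <stdint.h>

#define HASH_BITS 15
#define HASH_SIZE (1u << HASH_BITS)
#define W_SIZE    32768u
#define W_MASK    (W_SIZE - 1)

typedef uint16_t Pos;

typedef struct {
    const uint8_t *window;
    Pos head[HASH_SIZE];  /* most recent position for each hash value */
    Pos prev[W_SIZE];     /* previous position with the same hash */
} demo_state;

/* Placeholder hash over three bytes; not the hash used by zlib. */
static uint32_t demo_hash(const uint8_t *p)
{
    uint32_t v = p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
    return (v * 2654435761u) >> (32 - HASH_BITS);
}

/* Link str into its hash chain and return the old head, so the caller
 * can walk the full chain through prev[]. */
static Pos demo_insert_string(demo_state *s, uint32_t str)
{
    uint32_t h = demo_hash(s->window + str);
    Pos match_head = s->head[h];
    s->prev[str & W_MASK] = match_head;
    s->head[h] = (Pos)str;
    return match_head;
}
```

Returning the old head unconditionally keeps the chain intact for the match search, whereas a 0 return would end the search at that point.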
With this patch:
Performance counter stats for './minigzip -9 llvm.tar':
13422.379017 task-clock (msec) # 1.000 CPUs utilized
20 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
130 page-faults # 0.010 K/sec
58,926,104,511 cycles # 4.390 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
77,543,740,646 instructions # 1.32 insns per cycle
17,158,892,214 branches # 1278.379 M/sec
198,433,680 branch-misses # 1.16% of all branches
13.423365095 seconds time elapsed
45408 -rw-rw-r-- 1 spop spop 46493896 Dec 11 11:47 llvm.tar.gz
Without this patch the compressed file is larger:
Performance counter stats for './minigzip -9 llvm.tar':
13459.342312 task-clock (msec) # 1.000 CPUs utilized
25 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
129 page-faults # 0.010 K/sec
59,088,391,808 cycles # 4.390 GHz
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
77,600,766,958 instructions # 1.31 insns per cycle
17,486,130,785 branches # 1299.182 M/sec
196,281,761 branch-misses # 1.12% of all branches
13.463512830 seconds time elapsed
45408 -rw-rw-r-- 1 spop spop 46493896 Dec 11 11:48 llvm.tar.gz
The struct contains pointers to select functions to be used by the
rest of zlib, and the init function selects what functions will be
used depending on what optimizations have been compiled in and what
instruction-sets are available at runtime.
Tests done on a Haswell CPU running minigzip -6 compression of a
40M file show a 2.5% decrease in branches and a 25-30% reduction
in iTLB-loads. The reduction in iTLB-loads is likely mostly due to
the inability to inline functions. This also causes a slight
performance regression of around 1%, but it might still be worth it
to make it much easier to implement new optimized functions for
various architectures and instruction sets.
The performance penalty will get smaller for functions that get more
alternative implementations to choose from, since there is no need
to add more branches to every call of the function.
Today insert_string has one branch to choose between insert_string_sse
and insert_string_c, but if we also added, for example, insert_string_sse4,
that would need another branch, and it would probably at some point
hinder effective inlining too.
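A minimal sketch of such a function table, assuming a single slot. The struct layout, the checksum functions and demo_cpu_has_fast_path are all hypothetical and stand in for the real optimized variants and runtime detection.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*checksum_fn)(const uint8_t *buf, size_t len);

/* Hypothetical table with a single slot. */
struct demo_functable {
    checksum_fn checksum;
};

/* Portable fallback: sum the bytes one at a time. */
static uint32_t checksum_c(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Stand-in for an optimized variant: same result, four bytes per step. */
static uint32_t checksum_unrolled(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= len; i += 4)
        sum += (uint32_t)buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
    for (; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Would be set by runtime CPU detection; assumed here. */
static int demo_cpu_has_fast_path = 0;

static struct demo_functable functable;

/* Select the implementation once; callers then use functable.checksum(). */
void demo_functable_init(void)
{
    functable.checksum = checksum_c;
    if (demo_cpu_has_fast_path)
        functable.checksum = checksum_unrolled;
}
```

Adding a third variant only changes demo_functable_init(); the call sites still make one indirect call with no extra branches.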
The previous code slid the window and the hash table and copied
every input byte three times in order to just write the data as
stored blocks with no compression. This commit minimizes sliding
and copying, especially for large input and output buffers.
Level 0 compression is now more than 20 times faster than before
the commit.
Most of the speedup is due to deferring hash table slides until
deflateParams() is called to change the compression level away
from 0. More speedup is due to copying directly from next_in to
next_out when the amounts of available input data and output space
permit it, avoiding the intermediate pending buffer. Additionally,
only the last 32K of the used input data is copied back to the
sliding window when large input buffers are provided.
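The direct-copy step can be sketched roughly as below. This is not the code from the commit: demo_stream and copy_stored_direct() are hypothetical, and writing the stored block header, updating the checksum and updating the sliding window are all left out.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced version of a stream's in/out bookkeeping. */
typedef struct {
    const unsigned char *next_in;
    size_t avail_in;
    unsigned char *next_out;
    size_t avail_out;
} demo_stream;

/* Copy as much as both buffers allow straight from next_in to next_out,
 * bypassing any intermediate pending buffer. Returns the bytes copied. */
static size_t copy_stored_direct(demo_stream *strm)
{
    size_t n = strm->avail_in < strm->avail_out ? strm->avail_in
                                                : strm->avail_out;
    if (n > 65535)
        n = 65535;  /* a stored block's length field is 16 bits */

    memcpy(strm->next_out, strm->next_in, n);
    strm->next_in   += n;
    strm->avail_in  -= n;
    strm->next_out  += n;
    strm->avail_out -= n;
    return n;
}
```

Each input byte is touched once here, compared with the three copies per byte described for the previous code.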
** Partial merge of this commit, based on a8c94e9f5a3b9d3c62182bcf84e72304a3c1a6e5
Excludes changes to fill_window_sse.c, changes to fill_window_c() in deflate.c
and several unrelated changes in the commit.
* local -> static
* Normalize and cleanup line-endings
* Fix warnings under Visual Studio.
* Whitespace cleanup
***
This patch has been edited to merge cleanly and to exclude type changes.
Based on 8d7a7c3b82c6e38734bd504dac800b148ab410d0 "Type Cleanup"
* Separate common inlines and macros to deflate_p.h
* Separate deflate_fast related code to deflate_fast.c
* Separate deflate_medium related code to deflate_medium.c
* Separate deflate_slow related code to deflate_slow.c