Deflate used to call the allocation function 5 times during init.
- 5 calls to the external alloc function are now reduced to 1
- Handling alignment of allocated buffers is simplified
- Efforts to align the allocated buffer now need to happen only once.
- Individual buffers are ordered so that they have natural sequential alignment.
- Due to reduced losses to alignment, we allocate less memory in total.
- While doing alloc(), we now store a pointer to the corresponding free(), avoiding crashes
with applications that incorrectly set alloc/free pointers after running the init function.
- Removed the need for extra padding after the window; chunked reads can now go beyond the
window buffer without causing a segfault.
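A minimal sketch of the idea, using hypothetical names rather than the actual zlib-ng fields: one zalloc call covers every sub-buffer, the buffers are carved out largest-alignment-first so a single adjustment aligns them all, and the free() pointer in effect at init time is captured so that a later, mismatched strm->zfree cannot crash the teardown.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical bundle; the real zlib-ng state uses different names and sizes. */
    typedef struct {
        void *block;                                   /* the single allocation */
        void (*saved_zfree)(void *opaque, void *ptr);  /* free() captured at init */
        void *opaque;
        unsigned char *window;                         /* sub-buffers carved from block */
        uint16_t      *prev;
        uint16_t      *head;
        unsigned char *pending_buf;
    } alloc_bundle;

    static int alloc_all(alloc_bundle *b,
                         void *(*zalloc)(void *opaque, unsigned items, unsigned size),
                         void (*zfree)(void *opaque, void *ptr), void *opaque,
                         size_t window_sz, size_t prev_sz, size_t head_sz, size_t pend_sz) {
        /* One external allocation, padded so the whole block can be aligned once. */
        size_t total = 64 + window_sz + prev_sz + head_sz + pend_sz;
        unsigned char *base = (unsigned char *)zalloc(opaque, 1, (unsigned)total);
        if (base == NULL)
            return -1;

        b->block = base;
        b->saved_zfree = zfree;   /* remembered now; a bogus strm->zfree set later is ignored */
        b->opaque = opaque;

        /* Align once, then lay the buffers out so each lands on a natural boundary. */
        unsigned char *p = (unsigned char *)(((uintptr_t)base + 63) & ~(uintptr_t)63);
        b->window      = p;              p += window_sz;
        b->prev        = (uint16_t *)p;  p += prev_sz;
        b->head        = (uint16_t *)p;  p += head_sz;
        b->pending_buf = p;
        return 0;
    }

    static void free_all(alloc_bundle *b) {
        /* One free() for everything, through the pointer captured at init. */
        b->saved_zfree(b->opaque, b->block);
    }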
Co-authored-by: Ilya Leoshkevich <iii@linux.ibm.com>
Fix the align attribute; the MS compiler warns: 'internal_state': Alignment specifier is less than actual alignment (16), and will be ignored.
Increasing the alignment fixes the warning.
When building with clang-cl, the compiler produces the following warning:
zlib-ng/deflate.h(287,3): warning : attribute 'align' is ignored, place it after "struct" to apply attribute to type declaration [-Wignored-attributes]
zlib-ng/zbuild.h(196,34): note: expanded from macro 'ALIGNED_'
Repositioning the align attribute after "struct" fixes the warning and aligns `deflate_state` correctly.
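An illustrative before/after of the attribute placement, with a stand-in for the ALIGNED_() macro from zbuild.h; the member list and the alignment value are placeholders, not the real deflate.h declaration.

    /* Stand-in for zbuild.h's ALIGNED_() macro (assumed expansion). */
    #if defined(_MSC_VER) && !defined(__clang__)
    #  define ALIGNED_(x) __declspec(align(x))
    #else
    #  define ALIGNED_(x) __attribute__((aligned(x)))
    #endif

    /* Before: with clang-cl, an attribute placed ahead of "struct" is reported
     * as ignored for the type declaration. */
    typedef ALIGNED_(16) struct internal_state_before {
        int dummy;  /* real members elided */
    } deflate_state_before;

    /* After: the attribute placed directly after "struct" applies to the type
     * itself, so the struct is actually aligned. */
    typedef struct ALIGNED_(16) internal_state_after {
        int dummy;  /* real members elided */
    } deflate_state_after;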
A bug fix in zlib 1.2.12 resulted in a slight slowdown (1-2%) of
deflate. This commit provides the option to #define LIT_MEM, which
uses more memory to reverse most of that slowdown. The memory for
the pending buffer and symbol buffers is increased by 25%, which
increases the total memory usage with the default parameters by
about 6%.
madler/zlib#ac8f12c97d1afd9bafa9c710f827d40a407d3266
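For reference, a hedged sketch of how such a LIT_MEM switch can scale the shared buffer, with names modeled on zlib's deflate.h; the upstream commit above is the authority on the exact layout.

    /* When LIT_MEM is defined, the pending/symbol buffer allocation grows from
     * 4 to 5 units of lit_bufsize, i.e. by 25%, trading memory for the
     * pre-1.2.12 deflate speed. */
    #ifdef LIT_MEM
    #  define LIT_BUFS 5
    #else
    #  define LIT_BUFS 4
    #endif

    /* deflateInit2_() then allocates something along these lines:
     *   s->pending_buf = (unsigned char *)ZALLOC(strm, s->lit_bufsize, LIT_BUFS);
     */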
If gzip support has been disabled during compilation, then also
consider gzip-related states as invalid in deflateStateCheck.
The gzip state definitions can also be removed.
This change leads to a failure in test/example, and I am not sure
what the GZIP conditional is trying to achieve. All gzip-related
functions are still defined in zlib.h.
An alternative approach is to remove the GZIP define.
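A rough sketch of the intended check, modeled on zlib's deflateStateCheck; the exact list of states and guards in the real function may differ.

    #include "zlib.h"
    #include "deflate.h"   /* deflate_state and the *_STATE values */

    static int deflateStateCheck(z_stream *strm) {
        deflate_state *s;
        if (strm == Z_NULL || strm->zalloc == (alloc_func)0 || strm->zfree == (free_func)0)
            return 1;
        s = strm->state;
        if (s == Z_NULL || s->strm != strm ||
            (s->status != INIT_STATE &&
    #ifdef GZIP
             /* gzip-related states are only valid when gzip support is compiled in;
              * without GZIP they are treated as invalid here. */
             s->status != GZIP_STATE &&
             s->status != EXTRA_STATE &&
             s->status != NAME_STATE &&
             s->status != COMMENT_STATE &&
             s->status != HCRC_STATE &&
    #endif
             s->status != BUSY_STATE &&
             s->status != FINISH_STATE))
            return 1;
        return 0;
    }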
An interesting revelation while benchmarking all of this is that our
chunkmemset_avx seems to be slower in a lot of use cases than
chunkmemset_sse. That will be an interesting function to attempt to
optimize.
Right now though, we're basically beating Google on all PNG decode and
encode benchmarks. There are some variations of flags that can
basically have us trading blows, but we're as much as 14% faster
than Chromium's zlib patches.
While we're here, add a more direct benchmark of the folded copy method
versus the explicit copy + checksum.
While we're here, also simplify the "fold" signature, as reducing the
number of rebases and horizontal sums did not prove to be meaningfully
faster (slower in many circumstances).
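Roughly, the two paths the new benchmark compares look like this; the fold-copy function name and signature here are assumptions for illustration, not the actual functable entry.

    #include <string.h>
    #include <stdint.h>
    #include "zlib.h"

    /* Assumed shape of a fold-copy routine: copy len bytes from src to dst
     * while updating the Adler-32 checksum in the same pass. */
    extern uint32_t adler32_fold_copy(uint32_t adler, uint8_t *dst,
                                      const uint8_t *src, size_t len);

    static uint32_t copy_then_checksum(uint8_t *dst, const uint8_t *src, size_t len) {
        /* Explicit copy + checksum: the data is walked twice. */
        memcpy(dst, src, len);
        return (uint32_t)adler32(1, dst, (uInt)len);
    }

    static uint32_t folded_copy(uint8_t *dst, const uint8_t *src, size_t len) {
        /* Folded variant: a single pass does both the copy and the checksum. */
        return adler32_fold_copy(1, dst, src, len);
    }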
We are protecting the folded version's usage behind a lot of preprocessor
macros, as the other methods are not yet implemented, and calling this
version would implicitly bypass the faster adler implementations.
When more versions are written for faster vectorizations, the functable
entries will be populated and preprocessor macros removed. This round,
the copy + checksum is not employing as many tricks as one would hope
with a "folded" checksum routine. The reason for this is the
particularly tricky case of dealing with unaligned buffers. Targets
whose CPUs do not impose a huge penalty for unaligned loads will end up
with a much faster implementation.
Fancier methods that minimized rebasing, while having the potential to
be faster, ended up being slower because the compiler structured the
code in a way that ended up either spilling to the stack or trampolining
out of a loop and back into it instead of just jumping over the first load
and store.
Revisiting this for AVX512, where more registers are abundant and more
advanced loads exist, may be prudent.
https://github.com/powturbo/TurboBench links zlib and zlib-ng into the
same binary, causing non-static symbol conflicts. Fix by using PREFIX()
for flush_pending(), bi_reverse(), inflate_ensure_window() and all of
the IBM Z symbols.
Note: do not use an explicit zng_, since one of the long-term goals is
to be able to link two versions of zlib-ng into the same binary for
benchmarking [1].
[1] https://github.com/zlib-ng/zlib-ng/pull/1248#issuecomment-1096648932
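Conceptually the fix routes the affected symbols through the existing PREFIX() machinery instead of hard-coding a prefix; a simplified, hypothetical illustration follows (the real macro and its compat/native expansions live in zbuild.h).

    /* Stand-in definition for illustration only; zbuild.h picks the prefix
     * for the current build flavor. */
    #ifndef PREFIX
    #  define PREFIX(name) zng_ ## name
    #endif

    /* Before: a bare non-static symbol like flush_pending() collides when
     * stock zlib and zlib-ng are linked into the same binary, as TurboBench does. */

    /* After: the declaration goes through PREFIX(), so each build flavor
     * exports its own distinct name without writing zng_ in the source. */
    void PREFIX(flush_pending)(void *strm);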
Also make the internal_state struct have a fixed size regardless of which features have been activated.
Internal_state is now always 6040 bytes on Linux/x86-64, and 5952 bytes on Linux/x86-32.
- Change window_size from unsigned long to unsigned int
- Change block_start from long to int
- Change high_water from unsigned long to unsigned int
- Reorder to promote cache locality in hot code and decrease holes.
On x86_64 this means the struct goes from:
/* size: 6008, cachelines: 94, members: 57 */
/* sum members: 5984, holes: 6, sum holes: 24 */
/* last cacheline: 56 bytes */
To:
/* size: 5984, cachelines: 94, members: 57 */
/* sum members: 5972, holes: 3, sum holes: 8 */
/* padding: 4 */
/* last cacheline: 32 bytes */
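Illustratively, the narrowed fields look like this; the excerpt is hypothetical, and the real declarations (with their cache-friendly ordering) are in deflate.h.

    /* Excerpt, illustrative only. Narrowing is safe because these values are
     * bounded by the window size, which fits comfortably in 32 bits. */
    struct internal_state_before {
        unsigned long window_size;
        long          block_start;
        unsigned long high_water;
    };

    struct internal_state_after {
        unsigned int  window_size;
        int           block_start;
        unsigned int  high_water;
    };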
Move TRIGGER_LEVEL to match_tpl.h since it is only used in longest match.
Use an early return inside the match loops instead of a cont variable.
Added back the two-variable check for platforms that don't support unaligned access.
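A condensed, hypothetical sketch of the control-flow change in the match loop; the real loop in match_tpl.h checks far more than a single byte per iteration.

    /* Before: a "cont" flag decides whether to keep scanning. */
    static unsigned match_with_flag(const unsigned char *scan, const unsigned char *match,
                                    unsigned best_len, unsigned max) {
        unsigned len = 0;
        int cont = 1;
        while (cont && len < max) {
            if (scan[len] != match[len])
                cont = 0;            /* leave the loop via the flag */
            else
                len++;
        }
        return len > best_len ? len : best_len;
    }

    /* After: return as soon as the result is known; no flag needed. */
    static unsigned match_early_return(const unsigned char *scan, const unsigned char *match,
                                       unsigned best_len, unsigned max) {
        unsigned len = 0;
        while (len < max) {
            if (scan[len] != match[len])
                return len > best_len ? len : best_len;   /* early return */
            len++;
        }
        return len > best_len ? len : best_len;
    }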