Commit Graph

59 Commits

Author SHA1 Message Date
Vladislav Shchapov
5401b24a16 Allow disabling runtime CPU features detection in tests and benchmarks
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-04-04 13:47:02 +02:00
Vladislav Shchapov
ac25a2ea6a Split CPU feature checks and CPU-specific function prototypes, and reduce include dependencies.
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-02-22 20:11:46 +01:00
Nathan Moinvaziri
8c8cca7638 Remove extern keyword from cpu_feature function declarations. 2024-01-30 20:50:05 +01:00
Nathan Moinvaziri
379eda2e80 Remove type declarations for z_stream/zng_stream from cpu_features. 2024-01-30 20:50:05 +01:00
Nathan Moinvaziri
8e0e24cd18 Split cpu_features.h by architecture. 2024-01-30 20:50:05 +01:00
Vladislav Shchapov
1aa53f40fc Improve x86 intrinsics dependencies.
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2024-01-25 10:21:49 +01:00
Hans Kristian Rosbach
06895bc1b3 Move crc32 C fallbacks to arch/generic 2024-01-19 15:22:34 +01:00
Hans Kristian Rosbach
4e132cc0ec Move adler32 C fallbacks to arch/generic 2024-01-19 15:22:34 +01:00
Simon Hosie
f3211aba34 Add adler32_fold_copy_rvv implementation. 2023-11-28 10:25:35 +01:00
alexsifivetw
fe6aaedaf8 General optimized chunkset 2023-09-28 00:14:26 +02:00
Cameron Cawley
16fe1f885e Add ARMv6 version of slide_hash 2023-09-16 11:11:18 +02:00
alexsifivetw
6eed7416ed Optimize adler32 using rvv 2023-07-16 12:44:25 +02:00
alexsifivetw
2f4ebe2bb6 Optimize slide_hash using RVV 2023-06-23 19:44:22 +02:00
alexsifivetw
de1b640ffb Optimize compare256 with rvv 2023-06-13 12:25:48 +02:00
Alex Chiang
c3cdf434f3 Add supporting RISC-V cross compilation workflows
Add RISC-V cross-compilation test
Enable RVV support at compile time
2023-05-12 16:57:32 +02:00
Cameron Cawley
38aa575129 Ensure that unaligned compare256 variants are only used on little endian systems 2023-04-25 12:07:55 +02:00
Cameron Cawley
1ae7b0545d Rename chunkset_avx to chunkset_avx2 2023-04-19 00:35:28 +02:00
Cameron Cawley
b1aafe5c67 Clean up SSE4.2 detection 2023-04-15 15:22:36 +02:00
Cameron Cawley
b09215f75a Enable use of _mm_shuffle_epi8 on machines without SSE4.1 2023-04-01 17:27:49 +02:00
lawadr
39008be53b Add member to cpu_features struct if empty
When WITH_OPTIM is off, the cpu_features struct is empty. This is not
allowed in standard C and causes a build failure with various compilers,
including MSVC.

This adds a dummy char member to the struct if it would otherwise be
empty.
2023-03-27 20:06:07 +02:00
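A minimal sketch of that fix, with illustrative member names rather than the actual zlib-ng declarations:

    /* Sketch only; field names are hypothetical. */
    struct cpu_features {
    #ifdef WITH_OPTIM
        int has_sse42;            /* real feature flags when optimizations are enabled */
        int has_pclmulqdq;
    #else
        char empty_struct_dummy;  /* ISO C requires at least one member; MSVC rejects an empty struct */
    #endif
    };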
Vladislav Shchapov
20d8fa8af1 Replace global CPU feature flag variables with local variable in init_functable
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2023-03-06 13:26:09 +01:00
Vladislav Shchapov
fdb87d63a5 Split crc32 pclmulqdq and vpclmulqdq implementations
Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>
2023-02-24 13:25:54 +01:00
Hans Kristian Rosbach
7e1d80742e Reduce the amount of different defines required for arch-specific optimizations.
Also removed a reference to a nonexistent adler32_sse41 in test/test_adler32.cc.
2023-02-17 15:11:25 +01:00
Pavel P
3e75a5c981 Correct inflate_fast function signature 2023-02-08 15:22:22 +01:00
Nathan Moinvaziri
c72cd309ca Remove unused chunk memory functions from functable. 2023-02-05 17:51:46 +01:00
Nathan Moinvaziri
aa1109bb2e Use arch-specific versions of inflate_fast.
This should reduce the cost of indirection that occurs when calling functable
chunk copying functions inside inflate_fast. It should also allow the compiler
to optimize the inflate fast path for the specific architecture.
2023-02-05 17:51:46 +01:00
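A generic illustration of the indirection being removed (hypothetical names, not zlib-ng source): a shared inflate_fast pays an indirect call through the functable for every chunk copy, while an arch-specific inflate_fast calls its own helper directly, so the compiler can inline it into the hot loop.

    #include <stdint.h>

    typedef uint8_t *(*chunkcopy_fn)(uint8_t *out, const uint8_t *from, unsigned len);

    struct functable_s {
        chunkcopy_fn chunkcopy;   /* resolved once at startup */
    };

    static uint8_t *chunkcopy_c(uint8_t *out, const uint8_t *from, unsigned len) {
        while (len--) *out++ = *from++;
        return out;
    }

    static struct functable_s functable = { chunkcopy_c };

    /* Shared path: indirect call, opaque to the optimizer. */
    uint8_t *copy_via_table(uint8_t *out, const uint8_t *from, unsigned len) {
        return functable.chunkcopy(out, from, len);
    }

    /* Arch-specific path: direct call, a candidate for inlining. */
    uint8_t *copy_direct(uint8_t *out, const uint8_t *from, unsigned len) {
        return chunkcopy_c(out, from, len);
    }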
Cameron Cawley
1ab443812a Use size_t instead of uint64_t for len in all adler32 functions 2023-01-22 00:58:12 +01:00
Cameron Cawley
23e4305932 Use size_t instead of uint64_t for len in all crc32 functions 2023-01-22 00:58:12 +01:00
Nathan Moinvaziri
b047c7247f Prefix shared functions to prevent symbol conflict when linking native api against compat api. 2023-01-09 15:10:11 +01:00
Nathan Moinvaziri
2ca4a77761 Used fixed width uint8_t for crc32 and adler32 function declarations. 2022-06-24 15:12:00 +02:00
Nathan Moinvaziri
5f370cd887 Use uint64_t instead of size_t for len in adler32 to be consistent with crc32. 2022-06-24 15:12:00 +02:00
Nathan Moinvaziri
7e243e4436 Fix MSVC possible loss of data warning in crc32_pclmulqdq by converting len types to use uint64_t.
arch\x86\crc32_fold_pclmulqdq.c(604,43): warning C4244: 'function':
  conversion from 'uint64_t' to 'size_t', possible loss of data
2022-06-24 15:12:00 +02:00
Nathan Moinvaziri
201188691a Remove unused chunkcopy_safe function prototypes. 2022-06-07 12:47:00 +02:00
Nathan Moinvaziri
843c16c87a Move crc32 fold functions into templates. Don't store xmm_crc_part between runs because it is automatically folded into the checksum in partial_fold.
Co-authored-by: Adam Stylinski <kungfujesus06@gmail.com>
2022-06-04 11:29:34 +02:00
Nathan Moinvaziri
a6155234a2 Speed up software CRC-32 computation by a factor of 1.5 to 3.
Use the interleaved method of Kadatch and Jenkins in order to make
use of pipelined instructions through multiple ALUs in a single
core. This also speeds up and simplifies the combination of CRCs,
and updates the functions to pre-calculate and use an operator for
CRC combination.

Co-authored-by: Nathan Moinvaziri <nathan@nathanm.com>
2022-05-25 12:04:35 +02:00
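The braided code itself is lengthy; as a rough, hedged illustration of the same principle of keeping several independent table lookups in flight per iteration, here is a classic slicing-by-4 CRC-32. It is not the zlib-ng implementation, only a simpler table-based relative that shows how a word-at-a-time loop exposes instruction-level parallelism.

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t tab[4][256];

    /* Build the four slicing tables for the reflected CRC-32 polynomial.
     * Call once before crc32_slice4(). */
    static void make_tables(void) {
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            tab[0][i] = c;
        }
        for (uint32_t i = 0; i < 256; i++) {
            tab[1][i] = (tab[0][i] >> 8) ^ tab[0][tab[0][i] & 0xFF];
            tab[2][i] = (tab[1][i] >> 8) ^ tab[0][tab[1][i] & 0xFF];
            tab[3][i] = (tab[2][i] >> 8) ^ tab[0][tab[2][i] & 0xFF];
        }
    }

    uint32_t crc32_slice4(uint32_t crc, const uint8_t *buf, size_t len) {
        crc = ~crc;
        while (len >= 4) {
            /* Fold in four bytes at once; the four lookups below are
             * independent and can execute in parallel on separate ALUs. */
            crc ^= (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
                   ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
            crc = tab[3][crc & 0xFF] ^ tab[2][(crc >> 8) & 0xFF] ^
                  tab[1][(crc >> 16) & 0xFF] ^ tab[0][crc >> 24];
            buf += 4;
            len -= 4;
        }
        while (len--)  /* byte-at-a-time tail */
            crc = (crc >> 8) ^ tab[0][(crc ^ *buf++) & 0xFF];
        return ~crc;
    }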
Adam Stylinski
d79984b5bc Adding avx512_vnni inline + copy elision
An interesting revelation while benchmarking all of this is that our
chunkmemset_avx seems to be slower in a lot of use cases than
chunkmemset_sse.  That will be an interesting function to attempt to
optimize.

Right now though, we're basically beating Google for all PNG decode and
encode benchmarks.  There are some variations of flags that can
basically have us trading blows, but we're about as much as 14% faster
than Chromium's zlib patches.

While we're here, add a more direct benchmark of the folded copy method
versus the explicit copy + checksum.
2022-05-23 16:13:39 +02:00
Adam Stylinski
b8269bb7d4 Added inlined AVX512 adler checksum + copy
While we're here, also simplify the "fold" signature, as reducing the
number of rebases and horizontal sums did not prove to be meaningfully
faster (slower in many circumstances).
2022-05-23 16:13:39 +02:00
Adam Stylinski
2f2b7e9d69 Add AVX2 inline copy + adler implementation
This was pretty much an across-the-board win for performance, but the wins
are very data dependent and it sort of depends on what copy runs look
like.  On our less than realistic data in benchmark_zlib_apps, the
decode test saw some of the bigger gains, ranging anywhere from 6 to 11%
when compiled with AVX2 on a Cascade Lake CPU (and with only AVX2
enabled).  The decode on realistic imagery enjoyed smaller gains,
somewhere between 2 and 4%.

Interestingly, there was one outlier on encode, at level 5.  The best
theory for this is that the copy runs for that particular compression
level were such that glibc's ERMS aware memmove implementation managed
to marginally outpace the copy during the checksum with the move rep str
sequence thanks to clever microcoding on Intel's part. It's hard to say
for sure but the most standout difference between the two perf profiles
was more time spent in memmove (which is expected, as it's calling
memcpy instead of copying the bytes during the checksum).

There's the distinct possibility that the AVX2 checksums could be
marginally improved by one level of unrolling (like what's done in the
SSE3 implementation).  The AVX512 implementations are certainly getting
gains from this but it's not appropriate to append this optimization in
this series of commits.
2022-05-23 16:13:39 +02:00
Adam Stylinski
21f461e238 Adding an SSE42 optimized copy + adler checksum implementation
Its use is guarded by a number of preprocessor macros, as the other
methods are not yet implemented and calling this version would
implicitly bypass the faster adler implementations.

When more versions are written for faster vectorizations, the functable
entries will be populated and preprocessor macros removed. This round,
the copy + checksum is not employing as many tricks as one would hope
with a "folded" checksum routine.  The reason for this is the
particularly tricky case of dealing with unaligned buffers.  Targets
whose CPUs do not carry a huge penalty for unaligned loads will be able
to use a much faster implementation.

Fancier methods that minimized rebasing, while having the potential to
be faster, ended up being slower because the compiler structured the
code in a way that either spilled to the stack or trampolined out of
the loop and back into it, instead of just jumping over the first load
and store.

Revisiting this for AVX512, where more registers are abundant and more
advanced loads exist, may be prudent.
2022-05-23 16:13:39 +02:00
Adam Stylinski
ef0cf5ca17 Improved chunkset substantially where it's heavily used
For most realistic use cases, this doesn't make a ton of difference.
However, for things which are highly compressible and enjoy very large
run length encodes in the window, this is a huge win.

We leverage a permutation table to swizzle the contents of the memory
chunk into a vector register and then splat that over memory with a fast
copy loop.

In essence, where this helps, it helps a lot.  Where it doesn't, it does
no measurable damage to the runtime.

This commit also simplifies a chunkcopy_safe call for determining a
distance.  Using labs is enough to give the same behavior as before,
with the added benefit that no predication is required _and_, most
importantly, static analysis by GCC's string fortification can't throw a
fit because it conveys better to the compiler that the input into
builtin_memcpy will always be in range.
2022-05-23 16:13:29 +02:00
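A hedged sketch of the permutation-table trick (a hypothetical helper, not the actual chunkset_* code): build a pshufb mask that repeats the dist-byte pattern across a 16-byte register, then splat that register over the output, advancing by a whole number of pattern repetitions per store so the phase stays consistent even for awkward distances like 3 or 7.

    #include <immintrin.h>   /* requires SSSE3 for _mm_shuffle_epi8 */
    #include <stddef.h>
    #include <stdint.h>

    /* Fill len bytes at out with the repeating dist-byte pattern at pat.
     * Assumes 1 <= dist <= 16 and at least 16 readable bytes at pat
     * (true inside deflate's sliding window). */
    static void splat_pattern(uint8_t *out, const uint8_t *pat, unsigned dist, size_t len) {
        uint8_t idx[16];
        for (unsigned i = 0; i < 16; i++)
            idx[i] = (uint8_t)(i % dist);                /* permutation table entry */
        __m128i mask  = _mm_loadu_si128((const __m128i *)idx);
        __m128i src   = _mm_loadu_si128((const __m128i *)pat);
        __m128i chunk = _mm_shuffle_epi8(src, mask);     /* swizzle the pattern across the vector */
        size_t adv = 16 - (16 % dist);                   /* whole repetitions per 16-byte store */
        while (len >= 16) {
            _mm_storeu_si128((__m128i *)out, chunk);     /* fast copy loop */
            out += adv;
            len -= adv;
        }
        for (size_t i = 0; i < len; i++)                 /* scalar tail */
            out[i] = pat[i % dist];
    }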
Matheus Castanho
02d10b252c Implement power9 version of compare256.
Co-authored-by: Nathan Moinvaziri <nathan@nathanm.com>
2022-05-07 14:06:42 +02:00
Nathan Moinvaziri
48f346e806 Implement neon version of compare256.
Co-authored-by: Adam Stylinski <kungfujesus06@gmail.com>
2022-05-06 12:19:35 +02:00
Nathan Moinvaziri
f8a7f264cf Fixed warning about strict prototypes for cpu_check_features. 2022-05-05 14:55:13 +02:00
Ilya Leoshkevich
9be98893aa Use PREFIX() for some of the Z_INTERNAL symbols
https://github.com/powturbo/TurboBench links zlib and zlib-ng into the
same binary, causing non-static symbol conflicts. Fix by using PREFIX()
for flush_pending(), bi_reverse(), inflate_ensure_window() and all of
the IBM Z symbols.

Note: do not use an explicit zng_, since one of the long-term goals is
to be able to link two versions of zlib-ng into the same binary for
benchmarking [1].

[1] https://github.com/zlib-ng/zlib-ng/pull/1248#issuecomment-1096648932
2022-04-27 10:37:43 +02:00
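A simplified sketch of how a PREFIX-style macro avoids such clashes; the real definitions live in zbuild.h and depend on ZLIB_COMPAT, and the signature below is illustrative only.

    #ifdef ZLIB_COMPAT
    #  define PREFIX(name) name            /* keep classic zlib-compatible symbol names */
    #else
    #  define PREFIX(name) zng_ ## name    /* native build: internal symbols get the zng_ prefix */
    #endif

    /* Declares either inflate_ensure_window or zng_inflate_ensure_window. */
    int PREFIX(inflate_ensure_window)(void *state);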
Nathan Moinvaziri
8163322781 Allow SSE2 and AVX2 functions with -DWITH_UNALIGNED=OFF. Even though they use unaligned loads, they don't result in undefined behavior. 2022-03-31 16:11:25 +02:00
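For context, an illustration (not zlib-ng source) of why this is safe: dereferencing a misaligned pointer is undefined behavior in C, but memcpy and the unaligned-load intrinsics are defined for any alignment.

    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>
    #include <string.h>

    uint32_t load32_deref(const unsigned char *p) {
        return *(const uint32_t *)p;                 /* undefined behavior if p is misaligned */
    }

    uint32_t load32_memcpy(const unsigned char *p) {
        uint32_t v;
        memcpy(&v, p, sizeof(v));                    /* well-defined for any alignment */
        return v;
    }

    __m128i load128_unaligned(const unsigned char *p) {
        return _mm_loadu_si128((const __m128i *)p);  /* movdqu: defined for any alignment */
    }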
Adam Stylinski
7db13e652a Rename adler32_sse41 to adler32_ssse3
As it turns out, the sum of absolute differences instruction _did_ exist
in SSSE3 all along. SSE41 introduced a stranger, less commonly used
variation of the sum of absolute difference instruction.  Knowing this,
the old SSSE3 method can be axed entirely and the SSE41 method can now
be used on CPUs only having SSSE3.

Removing this extra functable entry shrinks the code and allows for a
simpler planned refactor later for the adler checksum and copy elision.
2022-03-23 11:31:27 +01:00
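The primitive this relies on, psadbw (_mm_sad_epu8) against a zero vector, horizontally sums bytes and is available on any SSSE3-capable CPU. A small illustration of just that primitive, not the zlib-ng adler32 code:

    #include <emmintrin.h>
    #include <stdint.h>

    /* Sum 16 unsigned bytes: psadbw against zero yields one 16-bit sum per
     * 8-byte half; add the two halves. */
    static uint32_t sum16_bytes(const uint8_t *p) {
        __m128i v   = _mm_loadu_si128((const __m128i *)p);
        __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());
        return (uint32_t)_mm_cvtsi128_si32(sad) +
               (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(sad, 8));
    }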
Nathan Moinvaziri
7069f42c9c Fixed missing checks around compare256 and longest_match definitions. 2022-03-23 11:30:54 +01:00
Adam Stylinski
2a19125a7d Use pclmulqdq accelerated CRC for exported function
We were already using this internally for our CRC calculations; however,
the exported function to CRC-checksum an arbitrary stream of bytes was
still using a generic, table-based C version. The accelerated version is
now called when len is at least 64 bytes.
2022-03-08 11:09:20 +01:00
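A hedged sketch of the dispatch being described, using hypothetical helper names in place of the real functable entries:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical prototypes standing in for the real implementations. */
    uint32_t crc32_generic(uint32_t crc, const unsigned char *buf, size_t len);    /* table-based C fallback */
    uint32_t crc32_pclmulqdq(uint32_t crc, const unsigned char *buf, size_t len);  /* carry-less multiply version */
    int cpu_has_pclmulqdq(void);

    /* Exported checksum: short inputs stay on the generic path; inputs of at
     * least 64 bytes take the accelerated path, as described above. */
    uint32_t crc32_dispatch(uint32_t crc, const unsigned char *buf, size_t len) {
        if (len >= 64 && cpu_has_pclmulqdq())
            return crc32_pclmulqdq(crc, buf, len);
        return crc32_generic(crc, buf, len);
    }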
Adam Stylinski
b3260fd0c8 Axe the SSE4 compare256 functions 2022-02-11 09:56:19 +01:00
Adam Stylinski
eaa00cd791 Write an SSE2 optimized compare256
The SSE4 variant uses the unfortunate string comparison instructions from
SSE4.2, which not only work on fewer CPUs but are often slower
than the SSE2 counterparts except in very specific circumstances.

This version should be ~2x faster than unaligned_64 for larger strings
and about half the performance of AVX2 comparisons on identical
hardware.

This version is meant to supplement pre-AVX hardware. Because of this,
we're performing 1 extra load + compare at the beginning. In the event
that we're doing a full 256-byte comparison (completely equal strings),
this will result in 2 extra SIMD comparisons if the inputs are unaligned.
Given that the loads will be absorbed by L1, this isn't super likely to
be a giant penalty, but for something like a first- or second-generation
Core i, where unaligned loads aren't nearly as expensive, this is going
to be _marginally_ slower in the worst case.  This allows us to have half the
loads be aligned, so that the compiler can elide the load and compare by
using a register relative pcmpeqb.
2022-02-11 09:56:19 +01:00
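A minimal SSE2 sketch in the spirit of this commit (not the exact zlib-ng routine, which also applies the extra aligned-load trick described above): compare 16 bytes at a time and use the movemask of the byte-equality result to locate the first mismatch. Both inputs are assumed to have at least 256 readable bytes.

    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>

    /* Returns the length of the common prefix of src0 and src1, capped at 256. */
    static uint32_t compare256_sse2_sketch(const uint8_t *src0, const uint8_t *src1) {
        uint32_t len = 0;
        do {
            __m128i a = _mm_loadu_si128((const __m128i *)(src0 + len));
            __m128i b = _mm_loadu_si128((const __m128i *)(src1 + len));
            unsigned eq = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
            if (eq != 0xFFFF) {
                unsigned diff = ~eq & 0xFFFF;            /* lowest set bit = first mismatch */
    #if defined(__GNUC__) || defined(__clang__)
                return len + (uint32_t)__builtin_ctz(diff);
    #else
                uint32_t i = 0;
                while (!(diff & 1)) { diff >>= 1; i++; }
                return len + i;
    #endif
            }
            len += 16;
        } while (len < 256);
        return 256;
    }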