zlib-ng

mirror of https://github.com/GerbilSoft/zlib-ng.git synced 2025-06-18 11:35:35 -04:00

Author	SHA1	Message	Date
Vladislav Shchapov	5401b24a16	Allow disabling runtime CPU features detection in tests and benchmarks Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>	2024-04-04 13:47:02 +02:00
Vladislav Shchapov	ac25a2ea6a	Split CPU features checks and CPU-specific function prototypes and reduce include-dependencies. Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>	2024-02-22 20:11:46 +01:00
Nathan Moinvaziri	8c8cca7638	Remove extern keyword from cpu_feature function declarations.	2024-01-30 20:50:05 +01:00
Nathan Moinvaziri	379eda2e80	Remove type declarations for z_stream/zng_stream from cpu_features.	2024-01-30 20:50:05 +01:00
Nathan Moinvaziri	8e0e24cd18	Split cpu_features.h by architecture.	2024-01-30 20:50:05 +01:00
Vladislav Shchapov	1aa53f40fc	Improve x86 intrinsics dependencies. Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>	2024-01-25 10:21:49 +01:00
Hans Kristian Rosbach	06895bc1b3	Move crc32 C fallbacks to arch/generic	2024-01-19 15:22:34 +01:00
Hans Kristian Rosbach	4e132cc0ec	Move adler32 C fallbacks to arch/generic	2024-01-19 15:22:34 +01:00
Simon Hosie	f3211aba34	Add adler32_fold_copy_rvv implementation.	2023-11-28 10:25:35 +01:00
alexsifivetw	fe6aaedaf8	General optimized chunkset	2023-09-28 00:14:26 +02:00
Cameron Cawley	16fe1f885e	Add ARMv6 version of slide_hash	2023-09-16 11:11:18 +02:00
alexsifivetw	6eed7416ed	Optimize adler32 using rvv	2023-07-16 12:44:25 +02:00
alexsifivetw	2f4ebe2bb6	Optimize slide_hash using RVV	2023-06-23 19:44:22 +02:00
alexsifivetw	de1b640ffb	Optimize compare256 with rvv	2023-06-13 12:25:48 +02:00
Alex Chiang	c3cdf434f3	Add supporting RISC-V cross compilation workflows Add RISC-V cross-compilation test Enable RVV support at compile time	2023-05-12 16:57:32 +02:00
Cameron Cawley	38aa575129	Ensure that unaligned compare256 variants are only used on little endian systems	2023-04-25 12:07:55 +02:00
Cameron Cawley	1ae7b0545d	Rename chunkset_avx to chunkset_avx2	2023-04-19 00:35:28 +02:00
Cameron Cawley	b1aafe5c67	Clean up SSE4.2 detection	2023-04-15 15:22:36 +02:00
Cameron Cawley	b09215f75a	Enable use of _mm_shuffle_epi8 on machines without SSE4.1	2023-04-01 17:27:49 +02:00
lawadr	39008be53b	Add member to cpu_features struct if empty When WITH_OPTIM is off, the cpu_features struct is empty. This is not allowed in standard C and causes a build failure with various compilers, including MSVC. This adds a dummy char member to the struct if it would otherwise be empty.	2023-03-27 20:06:07 +02:00
Vladislav Shchapov	20d8fa8af1	Replace global CPU feature flag variables with local variable in init_functable Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>	2023-03-06 13:26:09 +01:00
Vladislav Shchapov	fdb87d63a5	Split crc32 pclmulqdq and vpclmulqdq implementations Signed-off-by: Vladislav Shchapov <vladislav@shchapov.ru>	2023-02-24 13:25:54 +01:00
Hans Kristian Rosbach	7e1d80742e	Reduce the amount of different defines required for arch-specific optimizations. Also removed a reference to a nonexistant adler32_sse41 in test/test_adler32.cc.	2023-02-17 15:11:25 +01:00
Pavel P	3e75a5c981	Correct inflate_fast function signature	2023-02-08 15:22:22 +01:00
Nathan Moinvaziri	c72cd309ca	Remove unused chunk memory functions from functable.	2023-02-05 17:51:46 +01:00
Nathan Moinvaziri	aa1109bb2e	Use arch-specific versions of inflate_fast. This should reduce the cost of indirection that occurs when calling functable chunk copying functions inside inflate_fast. It should also allow the compiler to optimize the inflate fast path for the specific architecture.	2023-02-05 17:51:46 +01:00
Cameron Cawley	1ab443812a	Use size_t instead of uint64_t for len in all adler32 functions	2023-01-22 00:58:12 +01:00
Cameron Cawley	23e4305932	Use size_t instead of uint64_t for len in all crc32 functions	2023-01-22 00:58:12 +01:00
Nathan Moinvaziri	b047c7247f	Prefix shared functions to prevent symbol conflict when linking native api against compat api.	2023-01-09 15:10:11 +01:00
Nathan Moinvaziri	2ca4a77761	Used fixed width uint8_t for crc32 and adler32 function declarations.	2022-06-24 15:12:00 +02:00
Nathan Moinvaziri	5f370cd887	Use uint64_t instead of size_t for len in adler32 to be consistent with crc32.	2022-06-24 15:12:00 +02:00
Nathan Moinvaziri	7e243e4436	Fix MSVC possible loss of data warning in crc32_pclmulqdq by converting len types to use uint64_t. arch\x86\crc32_fold_pclmulqdq.c(604,43): warning C4244: 'function': conversion from 'uint64_t' to 'size_t', possible loss of data	2022-06-24 15:12:00 +02:00
Nathan Moinvaziri	201188691a	Remove unused chunkcopy_safe function prototypes.	2022-06-07 12:47:00 +02:00
Nathan Moinvaziri	843c16c87a	Move crc32 fold functions into templates. Don't store xmm_crc_part between runs because it is automatically folded into the checksum in partial_fold. Co-authored-by: Adam Stylinski <kungfujesus06@gmail.com>	2022-06-04 11:29:34 +02:00
Nathan Moinvaziri	a6155234a2	Speed up software CRC-32 computation by a factor of 1.5 to 3. Use the interleaved method of Kadatch and Jenkins in order to make use of pipelined instructions through multiple ALUs in a single core. This also speeds up and simplifies the combination of CRCs, and updates the functions to pre-calculate and use an operator for CRC combination. Co-authored-by: Nathan Moinvaziri <nathan@nathanm.com>	2022-05-25 12:04:35 +02:00
Adam Stylinski	d79984b5bc	Adding avx512_vnni inline + copy elision Interesting revelation while benchmarking all of this is that our chunkmemset_avx seems to be slower in a lot of use cases than chunkmemset_sse. That will be an interesting function to attempt to optimize. Right now though, we're basically beating google for all PNG decode and encode benchmarks. There are some variations of flags that can basically have us trading blows, but we're about as much as 14% faster than chromium's zlib patches. While we're here, add a more direct benchmark of the folded copy method versus the explicit copy + checksum.	2022-05-23 16:13:39 +02:00
Adam Stylinski	b8269bb7d4	Added inlined AVX512 adler checksum + copy While we're here, also simplfy the "fold" signature, as reducing the number of rebases and horizontal sums did not prove to be meaningfully faster (slower in many circumstances).	2022-05-23 16:13:39 +02:00
Adam Stylinski	2f2b7e9d69	Add AVX2 inline copy + adler implementation This was pretty much across the board wins for performance, but the wins are very data dependent and it sort of depends on what copy runs look like. On our less than realistic data in benchmark_zlib_apps, the decode test saw some of the bigger gains, ranging anywhere from 6 to 11% when compiled with AVX2 on a Cascade Lake CPU (and with only AVX2 enabled). The decode on realistic imagery enjoyed smaller gains, somewhere between 2 and 4%. Interestingly, there was one outlier on encode, at level 5. The best theory for this is that the copy runs for that particular compression level were such that glibc's ERMS aware memmove implementation managed to marginally outpace the copy during the checksum with the move rep str sequence thanks to clever microcoding on Intel's part. It's hard to say for sure but the most standout difference between the two perf profiles was more time spent in memmove (which is expected, as it's calling memcpy instead of copying the bytes during the checksum). There's the distinct possibility that the AVX2 checksums could be marginally improved by one level of unrolling (like what's done in the SSE3 implementation). The AVX512 implementations are certainly getting gains from this but it's not appropriate to append this optimization in this series of commits.	2022-05-23 16:13:39 +02:00
Adam Stylinski	21f461e238	Adding an SSE42 optimized copy + adler checksum implementation We are protecting its usage around a lot of preprocessor macros as the other methods are not yet implemented and calling this version bypasses the faster adler implementations implicitly. When more versions are written for faster vectorizations, the functable entries will be populated and preprocessor macros removed. This round, the copy + checksum is not employing as many tricks as one would hope with a "folded" checksum routine. The reason for this is the particularly tricky case of dealing with unaligned buffers. The implementations which don't have CPUs in the mix that have a huge penalty for unaligned loads will have a much faster implementation. Fancier methods that minimized rebasing, while having the potential to be faster, ended up being slower because the compiler structured the code in a way that ended up either spilling to the stack or trampolining out of a loop and back in it instead of just jumping over the first load and store. Revisiting this for AVX512, where more registers are abundant and more advanced loads exist, may be prudent.	2022-05-23 16:13:39 +02:00
Adam Stylinski	ef0cf5ca17	Improved chunkset substantially where it's heavily used For most realistic use cases, this doesn't make a ton of difference. However, for things which are highly compressible and enjoy very large run length encodes in the window, this is a huge win. We leverage a permutation table to swizzle the contents of the memory chunk into a vector register and then splat that over memory with a fast copy loop. In essence, where this helps, it helps a lot. Where it doesn't, it does no measurable damage to the runtime. This commit also simplifies a chunkcopy_safe call for determining a distance. Using labs is enough to give the same behavior as before, with the added benefit that no predication is required _and_, most importantly, static analysis by GCC's string fortification can't throw a fit because it conveys better to the compiler that the input into builtin_memcpy will always be in range.	2022-05-23 16:13:29 +02:00
Matheus Castanho	02d10b252c	Implement power9 version of compare256. Co-authored-by: Nathan Moinvaziri <nathan@nathanm.com>	2022-05-07 14:06:42 +02:00
Nathan Moinvaziri	48f346e806	Implement neon version of compare256. Co-authored-by: Adam Stylinski <kungfujesus06@gmail.com>	2022-05-06 12:19:35 +02:00
Nathan Moinvaziri	f8a7f264cf	Fixed warning about strict prototypes for cpu_check_features.	2022-05-05 14:55:13 +02:00
Ilya Leoshkevich	9be98893aa	Use PREFIX() for some of the Z_INTERNAL symbols https://github.com/powturbo/TurboBench links zlib and zlib-ng into the same binary, causing non-static symbol conflicts. Fix by using PREFIX() for flush_pending(), bi_reverse(), inflate_ensure_window() and all of the IBM Z symbols. Note: do not use an explicit zng_, since one of the long-term goals is to be able to link two versions of zlib-ng into the same binary for benchmarking [1]. [1] https://github.com/zlib-ng/zlib-ng/pull/1248#issuecomment-1096648932	2022-04-27 10:37:43 +02:00
Nathan Moinvaziri	8163322781	Allow SSE2 and AVX2 functions with -DWITH_UNALIGNED=OFF. Even though they use unaligned loads, they don't result in undefined behavior.	2022-03-31 16:11:25 +02:00
Adam Stylinski	7db13e652a	Rename adler32_sse41 to adler32_ssse3 As it turns out, the sum of absolute differences instruction _did_ exist in SSSE3 all along. SSE41 introduced a stranger, less commonly used variation of the sum of absolute difference instruction. Knowing this, the old SSSE3 method can be axed entirely and the SSE41 method can now be used on CPUs only having SSSE3. Removing this extra functable entry shrinks the code and allows for a simpler planned refactor later for the adler checksum and copy elision.	2022-03-23 11:31:27 +01:00
Nathan Moinvaziri	7069f42c9c	Fixed missing checks around compare256 and longest_match definitions.	2022-03-23 11:30:54 +01:00
Adam Stylinski	2a19125a7d	Use pclmulqdq accelerated CRC for exported function We were already using this internally for our CRC calculations, however the exported function to CRC checksum any arbitrary stream of bytes was still using a generic C based version that leveraged tables. This function is now called when len is at least 64 bytes.	2022-03-08 11:09:20 +01:00
Adam Stylinski	b3260fd0c8	Axe the SSE4 compare256 functions	2022-02-11 09:56:19 +01:00
Adam Stylinski	eaa00cd791	Write an SSE2 optimized compare256 The SSE4 variant uses the unfortunate string comparison instructions from SSE4.2 which not only don't work on as many CPUs but, are often slower than the SSE2 counterparts except in very specific circumstances. This version should be ~2x faster than unaligned_64 for larger strings and about half the performance of AVX2 comparisons on identical hardware. This version is meant to supplement pre AVX hardware. Because of this, we're performing 1 extra load + compare at the beginning. In the event that we're doing a full 256 byte comparison (completely equal strings), this will result in 2 extra SIMD comparisons if the inputs are unaligned. Given that the loads will be absorbed by L1, this isn't super likely to be a giant penalty but for something like a core-i first or second gen, where unaligned loads aren't nearly as expensive, this going to be _marginally_ slower in the worst case. This allows us to have half the loads be aligned, so that the compiler can elide the load and compare by using a register relative pcmpeqb.	2022-02-11 09:56:19 +01:00

1 2

59 Commits