zlib-ng

mirror of https://github.com/GerbilSoft/zlib-ng.git synced 2025-06-18 11:35:35 -04:00

Author	SHA1	Message	Date
Hans Kristian Rosbach	ed30965e29	Replace DO1/DO8 macros	2025-02-18 23:59:16 +01:00
Cameron Cawley	1ab443812a	Use size_t instead of uint64_t for len in all adler32 functions	2023-01-22 00:58:12 +01:00
Nathan Moinvaziri	2ca4a77761	Used fixed width uint8_t for crc32 and adler32 function declarations.	2022-06-24 15:12:00 +02:00
Nathan Moinvaziri	5f370cd887	Use uint64_t instead of size_t for len in adler32 to be consistent with crc32.	2022-06-24 15:12:00 +02:00
Adam Stylinski	d79984b5bc	Adding avx512_vnni inline + copy elision. Interesting revelation while benchmarking all of this is that our chunkmemset_avx seems to be slower in a lot of use cases than chunkmemset_sse. That will be an interesting function to attempt to optimize. Right now though, we're basically beating google for all PNG decode and encode benchmarks. There are some variations of flags that can basically have us trading blows, but we're about as much as 14% faster than chromium's zlib patches. While we're here, add a more direct benchmark of the folded copy method versus the explicit copy + checksum.	2022-05-23 16:13:39 +02:00
Adam Stylinski	21f461e238	Adding an SSE42 optimized copy + adler checksum implementation. We are protecting its usage around a lot of preprocessor macros as the other methods are not yet implemented and calling this version bypasses the faster adler implementations implicitly. When more versions are written for faster vectorizations, the functable entries will be populated and preprocessor macros removed. This round, the copy + checksum is not employing as many tricks as one would hope with a "folded" checksum routine. The reason for this is the particularly tricky case of dealing with unaligned buffers. The implementations which don't have CPUs in the mix that have a huge penalty for unaligned loads will have a much faster implementation. Fancier methods that minimized rebasing, while having the potential to be faster, ended up being slower because the compiler structured the code in a way that ended up either spilling to the stack or trampolining out of a loop and back in it instead of just jumping over the first load and store. Revisiting this for AVX512, where more registers are abundant and more advanced loads exist, may be prudent.	2022-05-23 16:13:39 +02:00
Nathan Moinvaziri	330445c51b	Reuse adler32_len_64 in adler32_c.	2021-12-02 09:27:21 +01:00
Nathan Moinvaziri	d5419d68ea	Mod adler and sum2 when calculating adler32 for short lengths.	2021-06-12 18:45:54 +02:00
Nathan Moinvaziri	193d8fd7df	Remove NO_DIVIDE from adler32.	2020-08-16 17:37:04 +02:00
Matheus Castanho	09d3134a6e	Add adler32_len_64 for length < 64. Add adler32_len_64 to adler32_p.h to allow reuse by other adler32 implementations that may need it.	2020-06-25 15:29:54 +02:00
Matheus Castanho	7e63f5237f	Move DO* macro definitions to adler32_p.h. Add new generic definitions of DO* macros used by adler32 algorithms to adler32_p.h to allow reuse by other adler32 implementations.	2020-06-25 15:29:54 +02:00
Sebastian Pop	098f73a45e	cleanup arm/adler32_neon.c code	2019-04-04 10:13:26 +02:00
Sebastian Pop	3ac4f5de06	only call NEON adler32 for more than 16 bytes. Improves performance of inflate by up to 6% on an A-73 Hikey running at 2.36 GHz when executing the chromium benchmark on the snappy data set. In a few cases inflate is slower by up to 0.8%. Overall performance of inflate is better by about 0.3%.	2019-04-04 10:13:26 +02:00

13 Commits
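
Several of the commits above revolve around two ideas: reducing the two Adler-32 running sums modulo 65521 (d5419d68ea), and "folding" a buffer copy into the checksum pass so the data is read only once (21f461e238). The following is a minimal scalar sketch of both ideas in C. It is not the zlib-ng implementation: the names adler32_copy_sketch, ADLER_BASE and ADLER_NMAX are hypothetical and chosen for illustration, and the real code uses SIMD variants selected through the functable. The block size of 5552 is the usual bound for which the 32-bit sums cannot overflow before being reduced.

```c
#include <stddef.h>
#include <stdint.h>

#define ADLER_BASE 65521U /* largest prime below 65536 */
#define ADLER_NMAX 5552U  /* bytes that can be summed before a modulo is required */

/* Hypothetical helper (not a zlib-ng function): update an Adler-32 checksum
 * over src while copying each byte to dst in the same pass.
 * Pass dst == NULL to checksum without copying. The initial checksum is 1. */
static uint32_t adler32_copy_sketch(uint32_t adler, uint8_t *dst,
                                    const uint8_t *src, size_t len) {
    uint32_t a = adler & 0xffff;         /* low half: 1 + sum of bytes */
    uint32_t b = (adler >> 16) & 0xffff; /* high half: sum of the running a values */

    while (len > 0) {
        /* Process at most ADLER_NMAX bytes between reductions. */
        size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
        len -= n;

        for (size_t i = 0; i < n; i++) {
            uint8_t byte = src[i];
            if (dst != NULL)
                dst[i] = byte; /* the "folded" copy: store while summing */
            a += byte;
            b += a;
        }

        src += n;
        if (dst != NULL)
            dst += n;

        /* Reduce both sums so they fit in 16 bits again. */
        a %= ADLER_BASE;
        b %= ADLER_BASE;
    }

    return (b << 16) | a;
}
```

Calling adler32_copy_sketch(1, out, in, len) checksums and copies in one pass, and the returned value can be passed back in as adler to continue over a following buffer.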