Commit Graph

13 Commits

Author SHA1 Message Date
Hans Kristian Rosbach
ed30965e29 Replace DO1/DO8 macros 2025-02-18 23:59:16 +01:00
Cameron Cawley
1ab443812a Use size_t instead of uint64_t for len in all adler32 functions 2023-01-22 00:58:12 +01:00
Nathan Moinvaziri
2ca4a77761 Used fixed width uint8_t for crc32 and adler32 function declarations. 2022-06-24 15:12:00 +02:00
Nathan Moinvaziri
5f370cd887 Use uint64_t instead of size_t for len in adler32 to be consistent with crc32. 2022-06-24 15:12:00 +02:00
Adam Stylinski
d79984b5bc Adding avx512_vnni inline + copy elision
An interesting revelation while benchmarking all of this is that our
chunkmemset_avx seems to be slower than chunkmemset_sse in a lot of use
cases.  That will be an interesting function to attempt to optimize.

Right now, though, we're basically beating Google on all PNG decode and
encode benchmarks.  Some combinations of flags have us trading blows,
but we're up to about 14% faster than Chromium's zlib patches.

While we're here, add a more direct benchmark of the folded copy method
versus the explicit copy + checksum.
2022-05-23 16:13:39 +02:00
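For context, the benchmark mentioned above compares two strategies: an explicit copy followed by a separate checksum pass, and the single-pass "folded" copy + checksum. A minimal sketch of the explicit baseline, assuming a zlib-style adler32() entry point (the names and harness here are illustrative, not the code from this commit):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Assumed checksum entry point with a zlib-style signature. */
    extern uint32_t adler32(uint32_t adler, const uint8_t *buf, size_t len);

    /* (1) Explicit copy then checksum: two passes over the data. */
    static uint32_t copy_then_checksum(uint32_t adler, uint8_t *dst,
                                       const uint8_t *src, size_t len) {
        memcpy(dst, src, len);
        return adler32(adler, dst, len);
    }

    /* (2) The folded variant does the copy and the checksum in one pass;
     *     see the scalar sketch after the next commit below. */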
Adam Stylinski
21f461e238 Adding an SSE42 optimized copy + adler checksum implementation
We are guarding its usage behind a number of preprocessor macros, as the
other methods are not yet implemented and calling this version would
implicitly bypass the faster adler implementations.

When more versions are written for faster vectorizations, the functable
entries will be populated and the preprocessor macros removed. For this
round, the copy + checksum does not employ as many tricks as one would
hope for from a "folded" checksum routine.  The reason is the
particularly tricky case of dealing with unaligned buffers.
Implementations targeting CPUs that don't pay a huge penalty for
unaligned loads can be made much faster.

Fancier methods that minimized rebasing, while having the potential to
be faster, ended up being slower because the compiler structured the
code in a way that either spilled to the stack or trampolined out of the
loop and back into it instead of simply jumping over the first load
and store.

Revisiting this for AVX512, where registers are more abundant and more
advanced loads exist, may be prudent.
2022-05-23 16:13:39 +02:00
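For reference, "folding" the copy into the checksum means computing the Adler-32 running sums while the data is being copied, instead of doing a memcpy followed by a separate checksum pass. A minimal scalar sketch of the idea (this is not the SSE42 implementation added in this commit; names and structure are illustrative):

    #include <stddef.h>
    #include <stdint.h>

    #define ADLER_BASE 65521u /* largest prime smaller than 65536 */
    #define ADLER_NMAX 5552u  /* max bytes before sum2 could overflow 32 bits */

    /* Illustrative scalar "folded" copy + Adler-32: copy src to dst while
     * accumulating the checksum in the same pass. */
    static uint32_t adler32_copy_fold(uint32_t adler, uint8_t *dst,
                                      const uint8_t *src, size_t len) {
        uint32_t s1 = adler & 0xffff;
        uint32_t s2 = (adler >> 16) & 0xffff;

        while (len > 0) {
            size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;
            len -= n;
            while (n--) {
                *dst = *src++;   /* copy the byte */
                s1 += *dst++;    /* and checksum it in the same pass */
                s2 += s1;
            }
            s1 %= ADLER_BASE;
            s2 %= ADLER_BASE;
        }
        return (s2 << 16) | s1;
    }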
Nathan Moinvaziri
330445c51b Reuse adler32_len_64 in adler32_c. 2021-12-02 09:27:21 +01:00
Nathan Moinvaziri
d5419d68ea Mod adler and sum2 when calculating adler32 for short lengths. 2021-06-12 18:45:54 +02:00
Nathan Moinvaziri
193d8fd7df Remove NO_DIVIDE from adler32. 2020-08-16 17:37:04 +02:00
Matheus Castanho
09d3134a6e Add adler32_len_64 for length < 64
Add adler32_len_64 to adler32_p.h to allow reuse by other adler32
implementations that may need it.
2020-06-25 15:29:54 +02:00
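As a rough sketch of what such a short-length helper can look like: for fewer than NMAX (5552) bytes the running sums cannot overflow 32 bits, so both modulo operations can be deferred to the very end. The exact signature and unrolling in adler32_p.h may differ from this illustration:

    #include <stddef.h>
    #include <stdint.h>

    #define BASE 65521u /* largest prime smaller than 65536 */

    /* Hypothetical short-length Adler-32 helper: processes len < NMAX bytes
     * and takes the modulo only once at the end. */
    static inline uint32_t adler32_short(uint32_t adler, uint32_t sum2,
                                         const uint8_t *buf, size_t len) {
        while (len--) {
            adler += *buf++;
            sum2 += adler;
        }
        adler %= BASE;
        sum2 %= BASE;
        return adler | (sum2 << 16);
    }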
Matheus Castanho
7e63f5237f Move DO* macro definitions to adler32_p.h
Add new generic definitions of DO* macros used by adler32 algorithms to
adler32_p.h to allow reuse by other adler32 implementations.
2020-06-25 15:29:54 +02:00
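The DO* macros are the classic zlib-style unrolling helpers for the scalar inner loop: each step adds a byte to the low sum and folds the low sum into the high sum, doubling up to process 16 bytes per iteration. A sketch of the pattern (the exact definitions in adler32_p.h may take different parameters):

    /* Zlib-style unrolled Adler-32 inner-loop macros (illustrative). */
    #define DO1(sum1, sum2, buf, i)  { (sum1) += (buf)[(i)]; (sum2) += (sum1); }
    #define DO2(sum1, sum2, buf, i)  { DO1(sum1, sum2, buf, i); DO1(sum1, sum2, buf, (i)+1); }
    #define DO4(sum1, sum2, buf, i)  { DO2(sum1, sum2, buf, i); DO2(sum1, sum2, buf, (i)+2); }
    #define DO8(sum1, sum2, buf, i)  { DO4(sum1, sum2, buf, i); DO4(sum1, sum2, buf, (i)+4); }
    #define DO16(sum1, sum2, buf)    { DO8(sum1, sum2, buf, 0); DO8(sum1, sum2, buf, 8); }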
Sebastian Pop
098f73a45e cleanup arm/adler32_neon.c code 2019-04-04 10:13:26 +02:00
Sebastian Pop
3ac4f5de06 only call NEON adler32 for more than 16 bytes
Improves inflate performance by up to 6% on an A-73 HiKey running at 2.36 GHz
when executing the Chromium benchmark on the Snappy data set.  In a few cases
inflate is slower, by up to 0.8%.  Overall, inflate performance improves by
about 0.3%.
2019-04-04 10:13:26 +02:00
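The threshold acts as a simple dispatch guard: for very small inputs the cost of setting up the NEON loop outweighs its benefit, so the scalar path is used instead. A hedged sketch of the idea (function names are placeholders, not the exact zlib-ng symbols):

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed entry points: a scalar fallback and the NEON kernel. */
    uint32_t adler32_c(uint32_t adler, const uint8_t *buf, size_t len);
    uint32_t adler32_neon_kernel(uint32_t adler, const uint8_t *buf, size_t len);

    uint32_t adler32_neon(uint32_t adler, const uint8_t *buf, size_t len) {
        /* Below 16 bytes, vector setup overhead dominates: use the scalar path. */
        if (len < 16)
            return adler32_c(adler, buf, len);
        return adler32_neon_kernel(adler, buf, len);
    }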