The version that's currently in the generic implementation for 32768
byte buffers leverages the stack. It manages to autovectorize but
unfortunately the trips to the stack hurt its performance for CPUs which
need this the most. This version is explicitly SIMD vectorized and
doesn't use trips to the stack. In my testing it's ~10% faster than the
"small" variant, and about 42% faster than the "32768" variant.
crc_table is made using a four-byte integer (when that can be
determined). However get_crc_table() returned a pointer to an
unsigned long, which could be eight bytes. This fixes that by
creating a new z_crc_t type for the crc_table.
This type is also used for the BYFOUR crc calculations that depend
on a four-byte type. The four-byte type can now be determined by
./configure, which also solves a problem where ./configure --solo
would never use BYFOUR. No the Z_U4 #define indicates that four-
byte integer was found either by ./configure or by zconf.h.