zstd is extremely impressive, both in its speed and compression ratio. However, it is surprising that it does not attempt to detect whether simple byte-swizzling can improve compression. Tools like Blosc exploit this. In the example below I have data that stores 32-bit uints and 32-bit floats in a 3:1 ratio. Using zstd-1.3.8, I compress the raw data and then the same data with its bytes swizzled so that the order goes from 12341234... to 1111...2222...3333...4444.... The same data gets dramatically better compression if the bytes are re-ordered.
$ zstd RAW.mz3
RAW.mz3 : 56.80% (2578036 => 1464255 bytes, RAW.mz3.zst)
$ zstd Swizzle.mz3
Swizzle.mz3 : 38.53% (2578036 => 993357 bytes, Swizzle.mz3.zst)
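For anyone who wants to reproduce the re-ordering, here is a minimal sketch of the swizzle in Python. The function names `swizzle` and `unswizzle` are hypothetical, and the sketch assumes fixed-width elements with a buffer length that is a multiple of the element width. It is not necessarily how the `Swizzle.mz3` file above was produced.

```python
def swizzle(data: bytes, width: int = 4) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on.

    With width=4, 12341234... becomes 11...22...33...44...
    """
    if len(data) % width != 0:
        raise ValueError("data length must be a multiple of the element width")
    # data[i::width] selects byte i of each width-byte element.
    return b"".join(data[i::width] for i in range(width))


def unswizzle(data: bytes, width: int = 4) -> bytes:
    """Invert swizzle(): interleave the byte planes back into elements."""
    if len(data) % width != 0:
        raise ValueError("data length must be a multiple of the element width")
    n = len(data) // width  # number of elements, i.e. length of each plane
    out = bytearray(len(data))
    for i in range(width):
        # Plane i holds byte i of every element; scatter it back in place.
        out[i::width] = data[i * n:(i + 1) * n]
    return bytes(out)
```

The swizzled buffer can then be written to disk and passed to `zstd` as in the commands above; the reader has to apply `unswizzle` with the same width after decompression.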
I realize there are other tools like Blosc that do this, but having an option for zstandard to detect whether each block would benefit from swizzling would be great for data scientists, as zstd is quickly becoming the new standard compression algorithm. Naively, it seems easy to detect if swizzling will improve a block: as the compressor processes the block, it can calculate the variance of the prediction error using Welford's online algorithm. This would compare the accuracy of a 1-back predictor (default compression) with 2-back (uint16), 3-back (RGB), 4-back (uint32, single), and 8-back (double) predictors. If one of these has a dramatically lower variance, it suggests that the data should be swizzled with that width before being compressed.

This would carry a penalty (e.g. half the speed for compression), and the swizzling would need to be stored in the header. However, as an option it would be great to have this built into the standard. Most scientists (and I think most developers) want to save data in the simplest way possible, so they do not consider byte-swizzling their data. Having an option for zstd to automatically take advantage of these patterns would be tremendous. In many cases it might even accelerate decompression/transmission/reading, as the size of the compressed data stream is reduced.
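To make the detection idea concrete, here is a rough sketch in Python. It is only an illustration of the heuristic described above, not anything zstd does today. The names `welford_variance`, `residual_variance` and `suggest_width`, and the `margin` threshold of 0.5, are my own assumptions and would need tuning on real data.

```python
def welford_variance(values) -> float:
    """Population variance in one pass using Welford's online algorithm."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n if n else 0.0


def residual_variance(block: bytes, lag: int) -> float:
    """Variance of the error when each byte is predicted by the byte `lag` back."""
    # Differences are wrapped into -128..127 so that 255 -> 0 counts as +1.
    residuals = (
        ((block[i] - block[i - lag] + 128) & 0xFF) - 128
        for i in range(lag, len(block))
    )
    return welford_variance(residuals)


def suggest_width(block: bytes, lags=(1, 2, 3, 4, 8), margin: float = 0.5) -> int:
    """Return a swizzle width, or 1 if no lag clearly beats the 1-back predictor.

    `margin` is an arbitrary threshold chosen for illustration: a lag must
    have less than margin * (1-back variance) to be suggested.
    """
    variances = {lag: residual_variance(block, lag) for lag in lags}
    best = min(variances, key=variances.get)
    if best != 1 and variances[best] < margin * variances.get(1, float("inf")):
        return best
    return 1
```

For example, a block of little-endian uint32 counters packed with `struct.pack` has a very small 4-back residual variance compared to the 1-back case, so `suggest_width` picks 4. Real data with mixed types, like the uint/float mix in my file, may give a less clear-cut answer, which is why a per-block decision would be useful.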