Skip to content

Compression ratio regression in dictionary + streaming API mode (src size unknown) #2442

@phiresky

Description

@phiresky

When using the streaming compression API using a dictionary, there were two regressions between 1.4.5 and 1.4.7 that make dictionary compression unusable at least for my use case:

A 120kB file compressed with a 20kB dictionary gives the following sizes (all at level 19):

  • 1.4.5 without a dictionary: 29517 bytes
  • 1.4.5 with a dictionary: 24177 bytes
  • 1.4.8 with a dictionary when using ZSTD_compress_usingCDict api: 23866 bytes
  • 1.4.8 with a dictionary when using streaming api: 76455 bytes (!!)

In total, in my use case of a sqlite extension for transparent row-level compression https://github.com/phiresky/sqlite-zstd , this causes a dataset of 2GB of rows of around 30kB each compressed individually to only compress to 1GB instead of down to ~100MB as it did before.

I did a bisect on the code base and then investigated for a while, and the reason for the regression are these two commits:

After 48ef15f, the result goes up from 23000 bytes to 28904 bytes, then after d5c688e (both by @terrelln ), the size goes up to 76455.

The reason is that when the streaming API is used, the srcSize is set to ZSTD_CONTENTSIZE_UNKNOWN. Then, in ZSTD_adjustCParams_internal the srcSize is set to 513 bytes, and the dictSize is set to 0 (because the mode is ZSTD_cpm_attachDict:

if (dictSize && srcSize == ZSTD_CONTENTSIZE_UNKNOWN)
srcSize = minSrcSize;
switch (mode) {
case ZSTD_cpm_noAttachDict:
case ZSTD_cpm_unknown:
case ZSTD_cpm_createCDict:
break;
case ZSTD_cpm_attachDict:
dictSize = 0;
break;

Setting these values then causes the windowSize to go down to 1024, which means that the 120kB file is compressed in individual 1024 byte segments.

Removing the dictSize = 0 assignment above in the current dev branch in causes the windowSize to be 20kB (exactly to fit the dictionary) which reduces the compressed size to 29kB again, but to get it down to 23819 bytes something like this is needed:

diff --git a/lib/compress/zstd_compress.c b/lib/compress/zstd_compress.c
index 386b051d..c84c0b25 100644
--- a/lib/compress/zstd_compress.c
+++ b/lib/compress/zstd_compress.c
@@ -1189,7 +1189,7 @@ ZSTD_adjustCParams_internal(ZSTD_compressionParameters cPar,
     assert(ZSTD_checkCParams(cPar)==0);
 
     if (dictSize && srcSize == ZSTD_CONTENTSIZE_UNKNOWN)
-        srcSize = minSrcSize;
+        srcSize = 100 * dictSize;
 
     switch (mode) {
     case ZSTD_cpm_noAttachDict:
@@ -1197,7 +1197,7 @@ ZSTD_adjustCParams_internal(ZSTD_compressionParameters cPar,
     case ZSTD_cpm_createCDict:
         break;
     case ZSTD_cpm_attachDict:
-        dictSize = 0;
+        // dictSize = 0;
         break;
     default:
         assert(0);

Though really I don't see why the size of the source dictionary should influence the window size at all, and I also don't see why when a dictionary is there the source size is assumed to be 513 bytes.

I originally reported this here: gyscos/zstd-rs#100 But seems like it's unrelated to the Rust wrapper.

The CLI does not have this problem since it uses known source sizes (I guess).

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions