For ZSTD_compressStream2(), let cdict take compression level priority#2303
Conversation
…hen using compressStream2
lib/compress/zstd_compress.c
Outdated
| { | ||
| ZSTD_compressionParameters cParams = ZSTD_getCParams_internal(compressionLevel, ZSTD_CONTENTSIZE_UNKNOWN, dictSize); | ||
| return ZSTD_createCDict_advanced(dict, dictSize, | ||
| ZSTD_CDict* cdict = ZSTD_createCDict_advanced(dict, dictSize, |
There was a problem hiding this comment.
minor : coding style :
prefer ZSTD_CDict* const cdict =
which indicates that, once created, the cdict pointer will never change.
lib/compress/zstd_compress.c
Outdated
| ZSTD_memset(&cctx->prefixDict, 0, sizeof(cctx->prefixDict)); /* single usage */ | ||
| assert(prefixDict.dict==NULL || cctx->cdict==NULL); /* only one can be set */ | ||
| if (cctx->cdict) | ||
| cctx->requestedParams.compressionLevel = cctx->cdict->compressionLevel; /* let cdict take priority in terms of compression level */ |
There was a problem hiding this comment.
This operation overwrites a parameter set by the user.
We generally try to avoid that,
it can have surprising impacts in later usages of cctx since these parameters are sticky.
That's also the reason for the existence of appliedParams,
which is where dynamic adaptations are performed,
there they do not pollute user's explicit requests, nor survive the current session.
There was a problem hiding this comment.
I've looked into this issue some more - it seems like this behavior also isn't consistent between ZSTD_compress_usingCDict() and ZSTD_compressStream2(). ZSTD_compress_usingCDict() doesn't seem to have this issue, and always will use the CDict compression level.
The reason is: in ZSTD_compress_usingCDict(), we essentially have:
ZSTD_CCtx_params params = cctx->requestedParams
[...]
params.cParams = ZSTD_getCParamsFromCDict() OR params.cParams = ZSTD_getCParams()
// where both these functions take cdict compression level into account
Whereas, in ZSTD_compressStream2(), we have
ZSTD_CCtx_params params = cctx->requestedParams
params.cParams = ZSTD_getCParamsFromCCtxParams(&cctx->requestedParams, cctx->pledgedSrcSizePlusOne-1, 0 /*dictSize*/));
// followed immediately (within ZSTD_resetCStream_internal) by
params.cParams = ZSTD_getCParamsFromCCtxParams(¶ms, pledgedSrcSize, dictSize);
// in neither case is cdict compression level taken into account
I guess I'm a little perplexed here - what's the point of calling ZSTD_getCParamsFromCCtxParams() twice back-to-back, with what are essentially the same arguments (with the exception of dictSize)? Doesn't the second call just overwrite?
There was a problem hiding this comment.
It's likely redundant in this case.
But params is also used for other purposes when ZSTD_MULTITHREAD is defined :
- It makes it possible to adjust
params.nbWorkerswithout modifyingcctx->requestedParams - It's a parameter of
ZSTDMT_initCStream_internal(), where it's presumed correctly adjusted.
Also, consider that calling ZSTD_getCParamsFromCCtxParams() twice is not very significant in term of performance, making this issue minor.
Nevertheless, if you believe you can improve the situation, you are welcomed.
|
2 minor comments regarding the PR's presentation :
|
| FORWARD_IF_ERROR( ZSTD_initLocalDict(cctx) , ""); /* Init the local dict if present. */ | ||
| ZSTD_memset(&cctx->prefixDict, 0, sizeof(cctx->prefixDict)); /* single usage */ | ||
| assert(prefixDict.dict==NULL || cctx->cdict==NULL); /* only one can be set */ | ||
| if (cctx->cdict) |
There was a problem hiding this comment.
After some more digging, the reason for the cdict passing on the correct clevel for filesizes + dictsizes that do not qualify for reloading is that downstream in ZSTD_compressBegin_internal(), the CCtx is readjusted using ZSTD_resetCCtx_usingCDict() with the compression level of the cdict.
However, in the case of reloading, the function called to adjust the cctx is ZSTD_resetCCtx_internal() which doesn't take into account the cdict.
So I think the mitigation is to adjust the params upstream (which is a copy of requestedParams), and later on if we need to reload the dict and call ZSTD_resetCCtx_internal(), this very params which may have the cdict compression level, will be copied into the cctx->appliedParams during ZSTD_resetCCtx_internal(), and in this case, we don't modify the cctx->requestedParams if the cctx needs to be used in the future for some reason.
I'm not sure if this is exactly what you meant by using the appliedParams, but it does seem like the appliedParams are mainly touched in the ZSTD_resetCCtx_internal(), and in this case we are essentially modifying appliedParams by modifying params.
There was a problem hiding this comment.
I think that's fine.
appliedParams is the place where the final result of parameter adjustments will be stored.
Before reaching that point, appliedParams could also be used as a form of "temporary storage", where intermediate adjustment results are saved.
However, this could prove messy and confusing, hence separating the 2 topics, by creating and using a dedicated params structure for the purpose of temporary storage and intermediate adjustment is fine, probably clearer.
|
Are there any further follow-up/changes required on this PR to merge? |
This PR does resolves two issues brought up in #2300, discovered by @animalize:
cdictwas used during compression, that cdict's compression level was not respected. Now thecdictwill take priority in terms of compression level, but will not changecctx->requestedParams.ZSTD_createCDict_byReference()to use the overloadedcompressionLevel = 0behavior.No unit tests included in this PR, but tested locally (and by @animalize in #2300) to confirm that using
ZSTD_compressStream2()does in fact take on thecdictcompression level when compressing files that would cause a dictionary reload, and that everything else works as usual.