Conversation
terrelln approved these changes on Jan 3, 2023
fix #3328
In situations where the alphabet size is very small, the evaluation of literal costs by the Optimal Parser is initially incorrect. It takes some time to converge, during which compression is less efficient. This is especially important for small files, because there is not enough data to converge, so most of the parsing decisions are based on incorrect metrics. After this patch, the scenario of #3328 is fixed, delivering the expected 29-byte compressed size (the smallest known compressed size).
Comparing level 19 to level 22 and expecting a strictly better result from level 22 is not guaranteed: levels 19 and 22 are very close to each other, especially for small files, so any noise in the final compression result can make this test fail. Level 22 could instead be compared to something much lower, such as level 15. Level 19 is still required anyway, because there is a clamping test which depends on it. Removed level 22, kept level 19.
Merged
fix #3228
In situations where the source's alphabet size is very small, the evaluation of literal costs from the Optimal Parser is initially incorrect. It takes some time to converge, during which compression is less efficient.
This is especially important for small files, where there is not enough data for the estimate to converge, so most of the parsing decisions are based on incorrect metrics.
After this patch, the scenario of #3228 is fixed,
delivering the expected 29-byte compressed size (the smallest known compressed size, down from 54 bytes).
On other "regular" data, this patch tends to be generally positive, though this is an average and the differences remain pretty small.
The patch seems to impact text data more, likely because it prunes absent alphabet symbols much faster.
On binary data with a full alphabet, results are more balanced, typically varying by merely a few bytes compared to dev, making it essentially a non-event.
Since this modification only applies to high compression modes, the speed impact is insignificant.
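The convergence problem described above can be illustrated with a small, self-contained sketch. This is not zstd's actual code (the real logic lives in lib/compress/zstd_opt.c and works with scaled integer prices); the function and the seeding strategy here are simplified assumptions, chosen only to show why a flat prior over all 256 byte values makes literals look far too expensive when the source uses a tiny alphabet:

```python
import math

ALPHABET = 256  # number of possible byte values

def literal_cost_bits(freqs, symbol):
    """Estimated price of emitting one literal, in bits: -log2(p(symbol))."""
    total = sum(freqs)
    return -math.log2(freqs[symbol] / total)

# Naive initialization: pretend every one of the 256 byte values is present.
flat = [1] * ALPHABET

# After observing a 2-symbol source ("ab" repeated 32 times),
# the counts start to concentrate on the symbols actually present.
observed = [1] * ALPHABET
for b in b"ab" * 32:
    observed[b] += 1

print(literal_cost_bits(flat, ord('a')))      # 8.0 bits: far too expensive
print(literal_cost_bits(observed, ord('a')))  # ~3.28 bits: converging toward ~1 bit
```

With the flat prior, every literal is priced at 8 bits, so the parser undervalues literal runs and picks suboptimal sequences; only after enough data has been seen does the estimate approach the true ~1 bit per symbol of a 2-symbol source. Pruning absent symbols faster, as this patch does, shortens that window, which matters most on small inputs.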