fix 22294 & 22262 by Cyan4973 · Pull Request #2151 · facebook/zstd

Cyan4973 · 2020-05-19T02:48:51Z

Issue 22294, which fires assert(offset_1 <= current +1) within compress_fast_extDict(),
and issue 22262, which fires assert(offset_1 <= dictAndPrefixLength) within compress_fast_dictMatchState(),
are both consequences of a new ldm behavior introduced with the --patch-from capability (and therefore not present in earlier versions):

Previously, ldm would not load any dictionary when in multithreading mode
- This is still the case when using the "one-ingestion" strategy
This has been updated as part of the --patch-from capability
- Loading the dictionary into the ldm is necessary in order to catch long-distance correlations
- Note that only the "streaming" mode has been updated, as it's the only mode relevant for --patch-from
However, there is a subtle twist: the --patch-from change makes the ldm ingest the whole dictionary as content, irrespective of whether it is a "raw" dictionary (content only) or a "full" dictionary (header and entropy tables)
- As a consequence, when the dictionary is "full", the ldm loads the entropy tables as "content", and may find matches in them (purely by chance)
- Once the ldm has found such a match, it passes the following literals section to the regular match finder (compress_fast() in this case), with the match's offset passed along as the repeat code
- That offset now points beyond the dictionary content, which is all that was (correctly) loaded into the regular match finder. As a consequence, the offset underflows the index.
- This is caught by the assert(). Without the assert(), the resulting pointer is invalid and leads to a segfault.
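
To make the mismatch concrete, here is a minimal sketch of the bookkeeping, not the actual zstd code. The `dict_model` type and all function names are hypothetical; the only assumption taken from the description above is that the regular match finder holds the dictionary content alone, while the ldm had ingested header, entropy tables and content.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a "full" dictionary: header + entropy tables, then content. */
typedef struct {
    uint32_t headerLen;   /* header and entropy tables, not real content */
    uint32_t contentLen;  /* content section of the dictionary */
} dict_model;

/* Largest offset the regular match finder can resolve at input position pos:
 * it only loaded the dictionary content. */
static uint32_t max_offset_regular(const dict_model* d, uint32_t pos)
{
    return d->contentLen + pos;
}

/* Largest offset the ldm could produce when it ingests the whole dictionary. */
static uint32_t max_offset_ldm_full(const dict_model* d, uint32_t pos)
{
    return d->headerLen + d->contentLen + pos;
}

/* Returns 1 if a repeat offset stays inside what the regular match finder holds. */
static int rep_offset_is_valid(const dict_model* d, uint32_t pos, uint32_t offset)
{
    return offset >= 1 && offset <= max_offset_regular(d, pos);
}

/* Index of the repeat match in the combined (content + input) space.
 * Without the assert, an oversized offset would wrap around in unsigned arithmetic. */
static uint32_t resolve_rep_index(const dict_model* d, uint32_t pos, uint32_t offset)
{
    assert(rep_offset_is_valid(d, pos, offset));
    return max_offset_regular(d, pos) - offset;
}
```

With a 100-byte header, 1000 bytes of content and an input position of 50, any ldm offset between 1051 and 1150 is legal from the ldm's point of view but fails the check on the regular match finder's side.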

This PR does the minimum needed to control the damage:
Now ldm only loads the dictionary if it is labelled a "raw" dictionary, as it is the only case which matters for --patch-from. If the dictionary is labelled "auto" or "full", it is not loaded at all.
This was the behavior of ldm before the --patch-from mode.
It also avoids expanding the scope and creating new scenarios, that would have to be fuzzed.
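
The rule can be sketched as a small predicate. This is an illustration only, not the code in this PR: the enum and function names below are hypothetical, chosen to mirror the "auto" / "raw" / "full" labels used above (zstd's own header exposes the corresponding values through `ZSTD_dictContentType_e`).

```c
#include <stddef.h>

/* Hypothetical labels mirroring "auto", "raw" and "full" dictionaries. */
typedef enum {
    DCT_AUTO,         /* type to be detected from the dictionary bytes */
    DCT_RAW_CONTENT,  /* content only, no header or entropy tables */
    DCT_FULL_DICT     /* header + entropy tables + content */
} dict_content_type;

/* Returns 1 if the ldm should ingest the dictionary, 0 otherwise.
 * Only an explicitly "raw" dictionary qualifies; "auto" and "full"
 * are skipped entirely, which matches the pre --patch-from behavior. */
static int ldm_should_load_dict(const void* dict, size_t dictSize,
                                dict_content_type type)
{
    if (dict == NULL || dictSize == 0) return 0;  /* nothing to load */
    return type == DCT_RAW_CONTENT;
}
```

Rejecting "auto" as well as "full" keeps the decision independent of any content sniffing: the ldm never sees bytes that might turn out to be entropy tables.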

This PR doesn't have a test case attached.
It's a bit tricky to generate, as it requires a "full" dictionary, with a matching pattern directly in the encoded entropy section of the header (which looks like random bytes), so it's unclear how to generate this case intentionally.

But we can add the known cases as "golden files" to our regression corpus.

terrelln · 2020-05-19T03:58:29Z

Just to verify, this fixes both offset_1 bugs, right?

Cyan4973 · 2020-05-19T03:59:07Z

Right

fix 22294

43e3607

facebook-github-bot added the CLA Signed label May 19, 2020

Cyan4973 changed the title ~~fix 22294~~ fix 22294 & 22262 May 19, 2020

terrelln approved these changes May 19, 2020

Cyan4973 merged commit fdc56ba into dev May 19, 2020

Cyan4973 deleted the fix22294 branch November 19, 2020 01:22
