Skip to content

[contrib] Edit distance match finder#2036

Merged
bimbashrestha merged 1 commit intofacebook:devfrom
bimbashrestha:edist
Mar 23, 2020
Merged

[contrib] Edit distance match finder#2036
bimbashrestha merged 1 commit intofacebook:devfrom
bimbashrestha:edist

Conversation

@bimbashrestha
Copy link
Contributor

Turned out to not provide enough utility to make up for the additional CPU usage. Including this in contrib with some light docs in case this becomes interesting again.

The one place where this outperforms zstd-19 for compression ratio consistently is when compressing a small file (<10KB) with a small and similar in content dictionary (a use case which doesn't really exist as far as I can tell). There are very few overlapping matches in those cases so edit distance basically does as good of a job as can be done.

For large files, this is 5-10X slower than zstd-19 and 2-3X worse in terms of compression ratio. I Think most of the loss is in being unable to find overlapping matches.

References:

An O(ND) Difference Algorithm and Its Variations:
http://www.xmailserver.org/diff2.pdf

An Algorithm for Differential File Comparison:
https://www.cs.dartmouth.edu/~doug/diff.pdf

A File Comparison Program:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.189.70&rep=rep1&type=pdf

More information and references on diff algorithms on this page:
https://wiki.c2.com/?DiffAlgorithm

@bimbashrestha
Copy link
Contributor Author

Ping. Should we merge this or leave it out? Both seems reasonable to me. Might be useful if someone ever wants to experiment with edit distance based match finding in the future.

@Cyan4973
Copy link
Contributor

I think it's fine to merge it.
We could still remove it in some future if it feels superfluous, but for the time being, the learning seems interesting to archive considering our activity on this topic.

@bimbashrestha bimbashrestha merged commit 40414f9 into facebook:dev Mar 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants