[libzstd] Split out zstd_fast dict match state function by terrelln · Pull Request #1563 · facebook/zstd

terrelln · 2019-03-29T16:45:50Z

I'm working on optimizing the fast search algorithm, but it is hard to work with the two functions combined. Each function wants a different optimization.

This patch has no behavior changes; it just splits the two functions out exactly as-is. A second PR will optimize these functions.
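As a rough illustration of this kind of behavior-preserving split, here is a minimal sketch in C. It is not the zstd source: `search_mode`, `dict_contains`, and the `count_hits_*` functions are hypothetical names, and the toy "search" only counts trivial hits. It shows one combined body that branches on a mode flag, next to two separate functions that compute the same results.

```c
#include <stddef.h>

/* Hypothetical names for illustration only; this is not the zstd source. */
typedef enum { MODE_NO_DICT, MODE_DICT_MATCH_STATE } search_mode;

/* Returns 1 if byte c occurs in dict[0..dictLen), else 0. */
static int dict_contains(const unsigned char* dict, size_t dictLen,
                         unsigned char c)
{
    for (size_t j = 0; j < dictLen; ++j)
        if (dict[j] == c) return 1;
    return 0;
}

/* Before: one body serves both modes and branches on `mode` in the loop. */
static size_t count_hits_combined(const unsigned char* src, size_t n,
                                  const unsigned char* dict, size_t dictLen,
                                  search_mode mode)
{
    size_t hits = 0;
    for (size_t i = 1; i < n; ++i) {
        if (src[i] == src[i - 1]) { ++hits; continue; }
        if (mode == MODE_DICT_MATCH_STATE && dict_contains(dict, dictLen, src[i]))
            ++hits;
    }
    return hits;
}

/* After: the no-dictionary path on its own, with no mode branch. */
static size_t count_hits_no_dict(const unsigned char* src, size_t n)
{
    size_t hits = 0;
    for (size_t i = 1; i < n; ++i)
        if (src[i] == src[i - 1]) ++hits;
    return hits;
}

/* After: the dictionary path on its own; same results as the combined
 * function called with MODE_DICT_MATCH_STATE. */
static size_t count_hits_dict(const unsigned char* src, size_t n,
                              const unsigned char* dict, size_t dictLen)
{
    size_t hits = 0;
    for (size_t i = 1; i < n; ++i) {
        if (src[i] == src[i - 1]) { ++hits; continue; }
        if (dict_contains(dict, dictLen, src[i])) ++hits;
    }
    return hits;
}
```

Once the two paths are separate, each loop can be restructured independently without risking the other.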

I benchmarked with clang and gcc on my Intel i9, and with clang on my Mac, on silesia and on the GitHub dataset with a dictionary. The change is neutral to slightly positive for performance (up to 5 MB/s faster).

@gbtucker

This PR is based on top of PR facebook#1563. The optimization is to process two input pointers per loop. It is based on ideas from [igzip] level 1, and talking to @gbtucker.

| Platform                | Silesia | Enwik8 |
|-------------------------|---------|--------|
| OSX clang-10            | +5.6%   | +7.5%  |
| i9 5 GHz gcc-8          | +4.7%   | +6.8%  |
| i9 5 GHz clang-7        | +5.5%   | +7.9%  |
| Skylake 2.4 GHz gcc-4.8 | +4.8%   | +5.6%  |
| Skylake 2.4 GHz clang-7 | +6.0%   | +8.5%  |

We lose 0.4% of compression ratio on Silesia. We gain 0.2% of compression ratio on enwik8. I also tested on the GitHub and hg-commands datasets without a dictionary, and we gain a small amount of compression ratio on each, as well as speed.

[igzip]: https://github.com/01org/isa-l/tree/master/igzip
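To make "process two input pointers per loop" concrete, here is a minimal sketch in C of a toy hash-table match finder that examines two adjacent positions in each iteration. It is neither the zstd nor the igzip code: `find_first_match_x2`, `hash4`, `read32`, and the table size are assumptions chosen for illustration, and the function only reports the first 4-byte match instead of emitting sequences.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy parameters; not taken from zstd or igzip. */
#define HASH_LOG   12
#define TABLE_SIZE (1u << HASH_LOG)
#define MIN_MATCH  4
#define EMPTY      0xFFFFFFFFu

static uint32_t read32(const uint8_t* p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));   /* unaligned-safe 4-byte load */
    return v;
}

static uint32_t hash4(const uint8_t* p)
{
    return (read32(p) * 2654435761u) >> (32 - HASH_LOG);
}

/* Scans src[0..n) and returns the first position whose 4 bytes also occur
 * at an earlier position recorded in the hash table; that earlier position
 * is written to *matchPos. Returns n if no match is found.
 * Assumes n < 0xFFFFFFFF so positions fit in 32 bits. */
static size_t find_first_match_x2(const uint8_t* src, size_t n, size_t* matchPos)
{
    uint32_t table[TABLE_SIZE];
    size_t ip = 0;
    size_t last;

    if (n < MIN_MATCH) return n;
    last = n - MIN_MATCH;                 /* last valid start position */
    memset(table, 0xFF, sizeof(table));   /* every slot starts as EMPTY */

    /* Main loop: ip and ip + 1 are both valid start positions. The two
     * hashes and table lookups are independent, so the CPU can overlap them. */
    while (ip < last) {
        size_t const ip0 = ip;
        size_t const ip1 = ip + 1;
        uint32_t const h0 = hash4(src + ip0);
        uint32_t const h1 = hash4(src + ip1);
        uint32_t const c0 = table[h0];
        table[h0] = (uint32_t)ip0;
        uint32_t const c1 = table[h1];    /* may be ip0 if h0 == h1 */
        table[h1] = (uint32_t)ip1;

        if (c0 != EMPTY && read32(src + c0) == read32(src + ip0)) {
            *matchPos = c0;
            return ip0;
        }
        if (c1 != EMPTY && read32(src + c1) == read32(src + ip1)) {
            *matchPos = c1;
            return ip1;
        }
        ip += 2;
    }

    /* Tail: at most one start position is left over. */
    if (ip == last) {
        uint32_t const c = table[hash4(src + ip)];
        if (c != EMPTY && read32(src + c) == read32(src + ip)) {
            *matchPos = c;
            return ip;
        }
    }
    return n;
}
```

For example, on the 8-byte input `"aaaaaaaa"` the first iteration hashes positions 0 and 1 to the same slot, so position 1 matches position 0; on `"abcdefgh"` no 4-byte sequence repeats and the function returns `n`.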

Cyan4973 · 2019-03-29T19:40:33Z

Looks good; it is more readable.
It might even help the compiler a bit, as I measured a ~1% speed improvement in simple benchmarks.

@gbtucker

This PR is based on top of PR facebook#1563. The optimization is to process two input pointers per loop. It is based on ideas from [igzip] level 1, and talking to @gbtucker.

| Platform                | Silesia | Enwik8 |
|-------------------------|---------|--------|
| OSX clang-10            | +5.3%   | +5.4%  |
| i9 5 GHz gcc-8          | +6.6%   | +6.6%  |
| i9 5 GHz clang-7        | +8.0%   | +8.0%  |
| Skylake 2.4 GHz gcc-4.8 | +6.3%   | +7.9%  |
| Skylake 2.4 GHz clang-7 | +6.2%   | +7.5%  |

Testing on all Silesia files on my Intel i9-9900k with gcc-8:

| Silesia File | Ratio Change | Speed Change |
|--------------|--------------|--------------|
| silesia.tar  | +0.17%       | +6.6%        |
| dickens      | +0.25%       | +7.0%        |
| mozilla      | +0.02%       | +6.8%        |
| mr           | -0.30%       | +10.9%       |
| nci          | +1.28%       | +4.5%        |
| ooffice      | -0.35%       | +10.7%       |
| osdb         | +0.75%       | +9.8%        |
| reymont      | +0.65%       | +4.6%        |
| samba        | +0.70%       | +5.9%        |
| sao          | -0.01%       | +14.0%       |
| webster      | +0.30%       | +5.5%        |
| xml          | +0.92%       | +5.3%        |
| x-ray        | -0.00%       | +1.4%        |

Same tests on Calgary. For brevity, I've only included files where compression ratio regressed or was much better.

| Calgary File | Ratio Change | Speed Change |
|--------------|--------------|--------------|
| calgary.tar  | +0.30%       | +7.1%        |
| geo          | -0.14%       | +25.0%       |
| obj1         | -0.46%       | +15.2%       |
| obj2         | -0.18%       | +6.0%        |
| pic          | +1.80%       | +9.3%        |
| trans        | -0.35%       | +5.5%        |

We gain 0.1% of compression ratio on Silesia. We gain 0.3% of compression ratio on enwik8. I also tested on the GitHub and hg-commands datasets without a dictionary, and we gain a small amount of compression ratio on each, as well as speed.

I tested the negative compression levels on Silesia on my Intel i9-9900k with gcc-8:

| Level | Ratio Change | Speed Change |
|-------|--------------|--------------|
| -1    | +0.13%       | +6.4%        |
| -2    | +4.6%        | -1.5%        |
| -3    | +7.5%        | -4.8%        |
| -4    | +8.5%        | -6.9%        |
| -5    | +9.1%        | -9.1%        |

Roughly, the negative levels now scale half as quickly. E.g. the new level -16 is roughly equivalent to the old level -8, but a bit quicker and smaller. If you don't think this is the right trade-off, we can change it to multiply the step size by 2, instead of adding 1. I think this makes sense, because it gives a bit slower ratio decay.

[igzip]: https://github.com/01org/isa-l/tree/master/igzip

Split out zstd_fast dict match state function

f00407b

facebook-github-bot added the CLA Signed label Mar 29, 2019

terrelln mentioned this pull request Mar 29, 2019

[libzstd] Speed up single segment zstd_fast by 5% #1562

Merged

Cyan4973 approved these changes Mar 29, 2019

terrelln merged commit 425ce55 into facebook:dev Mar 29, 2019

terrelln merged 1 commit into facebook:dev from terrelln:dms-sep