zstd: x86 assembler implementation of sequenceDecs.executeSimple by WojciechMula · Pull Request #531 · klauspost/compress

WojciechMula · 2022-03-11T17:43:10Z

This PR adds plain x86 and x86 with BMI2 implementations of sequenceDecs.executeSimple. Part of #515.

I extracted the function executeSimple to handle the cases where neither history nor a dictionary is used. A quick check showed that such cases account for 83% of all calls under go test and 99% under go test -bench ., so they are the vast majority. Of course, we may consider handling all cases in another PR (but after completing #529).
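To make the "no history, no dictionary" case concrete, here is a minimal Go sketch of what executing sequences looks like when every match refers only to bytes already written to the current output. The `seq` type, its field names, and this `executeSimple` signature are simplified assumptions for illustration; they are not the package's actual declarations, and the real code is more involved.

```go
package main

import (
	"errors"
	"fmt"
)

// seq is a hypothetical, simplified sequence: copy ll literal bytes,
// then copy ml bytes starting mo bytes back in the output.
type seq struct {
	ll, ml, mo int
}

// executeSimple applies seqs using only bytes already present in out.
// Any match that would reach before the start of out is rejected, because
// that would require a history buffer or dictionary.
func executeSimple(out, literals []byte, seqs []seq) ([]byte, error) {
	for _, s := range seqs {
		if s.ll > len(literals) {
			return nil, errors.New("not enough literals")
		}
		out = append(out, literals[:s.ll]...)
		literals = literals[s.ll:]

		if s.ml > 0 && (s.mo <= 0 || s.mo > len(out)) {
			return nil, errors.New("match offset outside current output")
		}
		start := len(out) - s.mo
		// Byte-wise copy so that overlapping matches (mo < ml) repeat
		// freshly written bytes.
		for i := 0; i < s.ml; i++ {
			out = append(out, out[start+i])
		}
	}
	// Literals left after the last sequence are appended as-is.
	return append(out, literals...), nil
}

func main() {
	out, err := executeSimple(nil, []byte("abcXYZ"),
		[]seq{{ll: 3, ml: 6, mo: 3}, {ll: 1, ml: 2, mo: 1}})
	fmt.Println(string(out), err) // prints "abcabcabcXXXYZ <nil>"
}
```

The offset check is the only place where history would matter: once a match can reach before the start of the output, a second source buffer is needed, and that is the part this simple path leaves out.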

~~As always, I'm marking it as a draft, as some tests fail. I will figure out what's wrong, likely as usual I missed something silly.~~ [fixed (I was right, it was silly)]

Below are preliminary benchmark results from an IceLake machine: noasm vs GOAMD64=v3. The branch is currently built on top of #528, so we see the combined performance boost from x86 BMI use in both decode and execute.

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            260.20       348.57       1.34x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         211.67       251.78       1.19x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           250.34       298.27       1.19x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         233.78       373.57       1.60x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          209.63       322.12       1.54x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       6952.38      7619.23      1.10x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             363.22       431.34       1.19x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        366.19       458.76       1.25x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               261.76       386.54       1.48x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            221.62       350.06       1.58x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              262.00       389.86       1.49x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            231.21       361.89       1.57x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             209.42       335.43       1.60x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          9356.96      10893.21     1.16x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                388.64       539.45       1.39x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           357.89       444.40       1.24x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      226.70       360.78       1.59x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      224.89       354.12       1.57x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       239.08       369.24       1.54x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         219.11       329.76       1.50x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9348.18      10886.74     1.16x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1052.83      1292.48      1.23x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 432.47       419.43       0.97x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 264.89       324.29       1.22x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 184.00       207.14       1.13x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  182.83       205.27       1.12x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    165.99       192.78       1.16x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       401.53       562.99       1.40x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       382.26       558.41       1.46x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        413.76       587.31       1.42x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          389.76       540.86       1.39x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9334.02      10878.93     1.17x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9349.18      10882.05     1.16x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9357.66      10897.99     1.16x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9356.77      10893.79     1.16x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9332.25      10890.70     1.17x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     1502.94      2665.02      1.77x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     1431.25      2625.59      1.83x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      1443.33      2770.91      1.92x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        1518.36      2649.41      1.74x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         67904.31     97302.23     1.43x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             5855.79      5776.46      0.99x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                4075.41      4028.22      0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                1468.79      1620.82      1.10x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                996.97       1099.55      1.10x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 1014.07      1097.88      1.08x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   823.45       943.41       1.15x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      1845.55      2886.39      1.56x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      1687.15      2817.55      1.67x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       1911.29      3005.57      1.57x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         1730.33      2881.95      1.67x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        67953.72     94562.80     1.39x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    67918.38     97067.32     1.43x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    67942.13     95589.74     1.41x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     67922.47     96824.04     1.43x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       67918.38     95890.17     1.41x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       1517.85      2819.81      1.86x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    1277.90      2531.84      1.98x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      1523.41      2878.54      1.89x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    1306.50      2575.44      1.97x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     1145.33      2332.57      2.04x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  68109.62     95841.69     1.41x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        2821.39      4603.53      1.63x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   2019.39      2534.19      1.25x
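
For reading the table: the speedup column appears to be new MB/s divided by old MB/s, rounded to two decimals. A small Go sketch reproduces one row; the helper name `speedup` is made up here and is not part of any benchmarking tool.

```go
package main

import (
	"fmt"
	"math"
)

// speedup returns new/old throughput rounded to two decimals.
// Values above 1 mean the new build is faster.
func speedup(oldMBs, newMBs float64) float64 {
	return math.Round(newMBs/oldMBs*100) / 100
}

func main() {
	// Numbers from the DecodeAllParallel/alice29.txt.zst row.
	fmt.Printf("%.2fx\n", speedup(1145.33, 2332.57)) // prints "2.04x"
}
```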

klauspost

Just a few initial thoughts.

WojciechMula · 2022-03-17T08:40:30Z

I fixed the code, so the tests pass. Next I'm going to prepare an avo generator and, as mentioned earlier, a version that supports a history buffer.

WojciechMula · 2022-03-17T15:12:49Z

I'll fix the CI error once pushing to GitHub works again.

The fresh results from IceLake:

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            262.80       389.16       1.48x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        961.16       1371.73      1.43x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         214.94       253.26       1.18x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           255.51       308.32       1.21x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         241.40       430.23       1.78x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          210.31       369.20       1.76x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             2103.10      2669.35      1.27x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       3489.03      3905.64      1.12x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       7743.20      7735.32      1.00x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             391.60       430.36       1.10x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 739.86       1128.08      1.52x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        393.55       464.23       1.18x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               262.92       453.93       1.73x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           978.87       1411.59      1.44x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            223.07       407.63       1.83x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              264.86       480.07       1.81x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            232.85       414.14       1.78x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             209.49       396.72       1.89x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1592.88      1507.85      0.95x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          4067.58      4665.96      1.15x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          10899.76     10878.23     1.00x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                421.35       620.59       1.47x
BenchmarkDecoder_DecodeAll/html.zst-16                                    726.37       1102.25      1.52x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           389.13       456.14       1.17x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      241.73       409.74       1.70x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      225.08       414.40       1.84x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       244.65       444.58       1.82x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         224.23       422.26       1.88x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          10883.95     10887.70     1.00x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          266.96       462.63       1.73x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           383.27       540.68       1.41x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             642.99       660.01       1.03x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1194.00      1114.33      0.93x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1286.46      1240.22      0.96x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1118.93      1049.82      0.94x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 438.99       398.25       0.91x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 304.22       322.32       1.06x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 189.27       206.34       1.09x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  188.57       205.80       1.09x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    175.16       197.99       1.13x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       470.55       747.39       1.59x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       437.31       725.62       1.66x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        485.12       777.39       1.60x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          419.17       723.73       1.73x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         10891.35     10878.36     1.00x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         262.69       462.56       1.76x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          391.25       547.51       1.40x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            649.28       661.32       1.02x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1796.02      1810.06      1.01x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1623.16      1523.89      0.94x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     1999.14      1993.01      1.00x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1687.37      1490.37      0.88x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     10859.45     10839.70     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     10884.23     10887.07     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      10895.83     10893.26     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        10903.41     10887.14     1.00x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     1720.41      2872.19      1.67x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     1534.87      2859.31      1.86x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      1693.59      3145.73      1.86x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        1609.07      3026.55      1.88x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         94790.63     94180.16     0.99x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         1604.08      2807.05      1.75x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          2479.13      3560.26      1.44x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            5001.69      5334.60      1.07x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             6439.60      5671.60      0.88x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             5781.42      5403.87      0.93x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              5598.84      5830.59      1.04x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                3958.76      3927.56      0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                1485.24      1501.19      1.01x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                981.87       1027.91      1.05x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 1001.71      1049.43      1.05x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   840.46       880.14       1.05x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      2084.80      3294.23      1.58x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      1952.01      3190.45      1.63x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       2104.22      3386.36      1.61x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         1977.31      3172.92      1.60x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        94630.31     97016.26     1.03x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        1602.96      2873.16      1.79x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         2544.39      3624.77      1.42x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           4971.90      5320.62      1.07x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   9113.58      9584.72      1.05x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   9319.28      9187.18      0.99x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    10207.10     10810.94     1.06x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      12153.69     10909.15     0.90x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    94506.95     96586.11     1.02x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    96830.36     95727.43     0.99x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     94506.00     96802.95     1.02x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       96394.59     94526.33     0.98x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       1738.60      3116.46      1.79x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   5850.84      8302.28      1.42x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    1523.50      2811.58      1.85x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      1789.78      3285.39      1.84x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    1530.76      2815.69      1.84x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     1319.04      2564.55      1.94x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        12006.13     11221.63     0.93x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  23686.04     28255.57     1.19x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  97496.84     96280.42     0.99x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        3382.53      5038.57      1.49x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            4594.61      6895.99      1.50x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   2222.76      2518.68      1.13x

I also tried copying via AVX registers, but it makes no big difference (old is the version measured above, new uses AVX copies). IIRC I observed a better speedup in my early decodeSync implementation, so maybe we'll see a nicer AVX impact for that function.

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            389.16       387.73       1.00x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        1371.73      1509.79      1.10x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         253.26       251.65       0.99x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           308.32       309.41       1.00x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         430.23       425.06       0.99x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          369.20       368.70       1.00x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             2669.35      2849.05      1.07x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       3905.64      3942.06      1.01x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       7735.32      7317.12      0.95x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             430.36       435.71       1.01x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 1128.08      1269.40      1.13x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        464.23       464.20       1.00x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               453.93       455.29       1.00x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           1411.59      1549.50      1.10x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            407.63       400.44       0.98x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              480.07       480.59       1.00x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            414.14       409.31       0.99x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             396.72       393.11       0.99x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1507.85      1550.74      1.03x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          4665.96      4661.04      1.00x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          10878.23     10889.61     1.00x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                620.59       662.65       1.07x
BenchmarkDecoder_DecodeAll/html.zst-16                                    1102.25      1206.46      1.09x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           456.14       454.81       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      409.74       406.33       0.99x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      414.40       409.36       0.99x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       444.58       436.08       0.98x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         422.26       400.60       0.95x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          10887.70     10880.90     1.00x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          462.63       457.38       0.99x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           540.68       577.30       1.07x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             660.01       665.51       1.01x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1114.33      1111.60      1.00x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1240.22      1226.62      0.99x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1049.82      1050.40      1.00x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 398.25       421.64       1.06x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 322.32       322.19       1.00x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 206.34       206.22       1.00x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  205.80       205.78       1.00x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    197.99       197.75       1.00x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       747.39       757.97       1.01x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       725.62       738.49       1.02x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        777.39       780.45       1.00x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          723.73       732.83       1.01x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         10878.36     10882.18     1.00x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         462.56       457.38       0.99x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          547.51       586.03       1.07x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            661.32       671.31       1.02x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1810.06      1796.73      0.99x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1523.89      1520.33      1.00x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     1993.01      1994.58      1.00x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1490.37      1510.76      1.01x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     10839.70     10878.24     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     10887.07     10881.96     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      10893.26     10876.39     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        10887.14     10869.43     1.00x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     2872.19      2869.56      1.00x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     2859.31      2859.63      1.00x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      3145.73      3138.93      1.00x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        3026.55      2979.04      0.98x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         94180.16     96195.02     1.02x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         2807.05      2852.21      1.02x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          3560.26      3654.30      1.03x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            5334.60      5360.62      1.00x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             5671.60      5976.83      1.05x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             5403.87      5442.14      1.01x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              5830.59      5567.97      0.95x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                3927.56      3890.56      0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                1501.19      1496.00      1.00x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                1027.91      1032.50      1.00x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 1049.43      1045.79      1.00x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   880.14       890.81       1.01x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      3294.23      3256.52      0.99x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      3190.45      3236.51      1.01x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       3386.36      3402.11      1.00x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         3172.92      3294.31      1.04x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        97016.26     96479.74     0.99x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        2873.16      2820.33      0.98x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         3624.77      3714.61      1.02x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           5320.62      5362.39      1.01x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   9584.72      9561.04      1.00x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   9187.18      9273.60      1.01x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    10810.94     10689.81     0.99x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      10909.15     11269.85     1.03x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    96586.11     93617.83     0.97x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    95727.43     95778.64     1.00x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     96802.95     94356.05     0.97x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       94526.33     95557.69     1.01x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       3116.46      3170.17      1.02x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   8302.28      8659.01      1.04x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    2811.58      2803.59      1.00x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      3285.39      3302.83      1.01x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    2815.69      2801.65      1.00x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     2564.55      2572.79      1.00x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        11221.63     11359.84     1.01x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  28255.57     28395.81     1.00x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  96280.42     97816.82     1.02x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        5038.57      5257.19      1.04x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            6895.99      7177.71      1.04x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   2518.68      2526.22      1.00x
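
As a rough illustration of what "copying via wider registers" changes, the Go sketch below copies in fixed-width blocks and counts how many blocks are moved for 16-byte and 32-byte widths. This is only a model of the idea: `copyChunked` is a hypothetical helper, the rounded-up last block is an assumption about how such copies are commonly written, and nothing here is taken from the PR's assembly.

```go
package main

import "fmt"

// copyChunked copies at least n bytes from src to dst in blocks of width
// bytes and returns how many blocks it moved. The last block is rounded up,
// so both slices need room for n rounded up to a multiple of width, and
// dst bytes past n get overwritten.
func copyChunked(dst, src []byte, n, width int) int {
	blocks := 0
	for i := 0; i < n; i += width {
		copy(dst[i:i+width], src[i:i+width])
		blocks++
	}
	return blocks
}

func main() {
	src := make([]byte, 64)
	for i := range src {
		src[i] = byte(i)
	}
	dst := make([]byte, 64)
	for _, n := range []int{12, 40} {
		// length, blocks at width 16, blocks at width 32
		fmt.Println(n, copyChunked(dst, src, n, 16), copyChunked(dst, src, n, 32))
	}
}
```

In this model a short copy takes a single block at either width, so a wider register only helps on longer copies. That is one possible reading of the flat numbers above, not something the measurements here establish.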

WojciechMula · 2022-03-18T09:06:24Z

Re: performing all memcpy inside asm -- overall, it's better, with a few minimal regressions. Below is a comparison with the version that uses a threshold.

benchmark                                                                 old ns/op     new ns/op     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            3845758       3705646       -3.64%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        717232        707392        -1.37%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         15624308      14384005      -7.94%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           11343992      10469989      -7.70%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         2401338       2226784       -7.27%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          3382540       3136416       -7.28%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1248554       1187123       -4.92%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       210902        200434        -4.96%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       126255        129317        +2.43%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             13099079      12317570      -5.97%
BenchmarkDecoder_DecoderSmall/html.zst-16                                 752231        694420        -7.69%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        71887         68355         -4.91%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               416562        398134        -4.42%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           93967         86363         -8.09%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            1222869       1149552       -6.00%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              904607        857917        -5.16%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            298161        292852        -1.78%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             377384        375006        -0.63%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                268596        270104        +0.56%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          21586         22154         +2.63%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          11289         11289         +0.00%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                1111436       1101481       -0.90%
BenchmarkDecoder_DecodeAll/html.zst-16                                    92141         92000         -0.15%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           8793          8839          +0.52%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      933695        921954        -1.26%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      917850        906344        -1.25%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       873901        854356        -2.24%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         938215        924998        -1.41%
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9179          9185          +0.07%
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          212470        214724        +1.06%
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           182574        173646        -4.89%
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             151173        142107        -6.00%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              3585          3376          -5.83%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              3231          3004          -7.03%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               3811          3518          -7.69%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 9320          10107         +8.44%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 4710          4753          +0.91%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 7313          7343          +0.41%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  7365          7387          +0.30%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    7798          7865          +0.86%
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       59135         59060         -0.13%
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       60437         59676         -1.26%
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        57637         56678         -1.66%
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          59459         60085         +1.05%
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9184          9187          +0.03%
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         214467        216055        +0.74%
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          181248        171735        -5.25%
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            150610        144955        -3.75%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    27940         27779         -0.58%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    33186         32154         -3.11%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     25435         25301         -0.53%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       34400         32698         -4.95%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9178          9181          +0.03%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9171          9244          +0.80%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9168          9169          +0.01%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9168          9167          -0.01%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     133506        136480        +2.23%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     135875        136753        +0.65%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      122832        123723        +0.73%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        129654        130045        +0.30%
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         1030          1061          +3.01%
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         34976         35498         +1.49%
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          27700         27392         -1.11%
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            18733         18889         +0.83%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             709           692           -2.40%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             710           731           +2.94%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              704           701           -0.51%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                1082          1045          -3.42%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                984           990           +0.57%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                1413          1438          +1.77%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 1402          1431          +2.07%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   1629          1590          -2.39%
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      13464         12853         -4.54%
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      13848         13384         -3.35%
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       12912         12593         -2.47%
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         13633         12993         -4.69%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        1045          1041          -0.38%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        34715         35369         +1.88%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         27240         27145         -0.35%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           18629         18428         -1.08%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   5246          4946          -5.72%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   5455          5116          -6.21%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    4669          4433          -5.05%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      4538          4209          -7.25%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    1048          1082          +3.24%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    1035          1037          +0.19%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     1065          1044          -1.97%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       1031          1046          +1.45%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       58481         58428         -0.09%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   14104         14052         -0.37%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    169695        169866        +0.10%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      127942        128813        +0.68%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    44027         43909         -0.27%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     58501         57549         -1.63%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        35924         36489         +1.57%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  3568          3695          +3.56%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  1247          1268          +1.68%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        137809        139703        +1.37%
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            14733         14636         -0.66%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1587          1585          -0.13%

klauspost · 2022-03-18T09:07:29Z

@WojciechMula Looks good 👍🏼

WojciechMula · 2022-03-18T09:23:56Z

> @WojciechMula Looks good 👍🏼

@klauspost sorry, I copied the wrong file, but the actual results are still quite good IMHO.

WojciechMula · 2022-03-18T09:31:39Z

Re: copying in 32-byte chunks (two SSE registers). There are some nice speed-ups, but I feel like there are more regressions.

benchmark                                                                 old ns/op     new ns/op     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            3705646       3660047       -1.23%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        707392        597417        -15.55%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         14384005      14440584      +0.39%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           10469989      10535106      +0.62%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         2226784       2225881       -0.04%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          3136416       3129899       -0.21%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1187123       1101201       -7.24%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       200434        201599        +0.58%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       129317        124227        -3.94%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             12317570      12448499      +1.06%
BenchmarkDecoder_DecoderSmall/html.zst-16                                 694420        607620        -12.50%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        68355         68594         +0.35%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               398134        391978        -1.55%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           86363         73203         -15.24%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            1149552       1152439       +0.25%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              857917        853468        -0.52%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            292852        291907        -0.32%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             375006        370662        -1.16%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                270104        257457        -4.68%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          22154         21781         -1.68%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          11289         11297         +0.07%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                1101481       1024586       -6.98%
BenchmarkDecoder_DecodeAll/html.zst-16                                    92000         79330         -13.77%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           8839          8848          +0.10%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      921954        923764        +0.20%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      906344        907516        +0.13%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       854356        859642        +0.62%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         924998        944844        +2.15%
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9185          9228          +0.47%
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          214724        216704        +0.92%
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           173646        171949        -0.98%
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             142107        144852        +1.93%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              3376          3363          -0.39%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              3004          3005          +0.03%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               3518          3548          +0.85%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 10107         10219         +1.11%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 4753          4801          +1.01%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 7343          7374          +0.42%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  7387          7410          +0.31%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    7865          7753          -1.42%
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       59060         57314         -2.96%
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       59676         59008         -1.12%
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        56678         56282         -0.70%
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          60085         58173         -3.18%
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9187          9195          +0.09%
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         216055        216945        +0.41%
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          171735        169016        -1.58%
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            144955        144113        -0.58%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    27779         27478         -1.08%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    32154         32151         -0.01%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     25301         25266         -0.14%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       32698         32649         -0.15%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9181          9177          -0.04%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9244          9173          -0.77%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9169          9173          +0.04%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9167          9210          +0.47%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     136480        136638        +0.12%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     136753        136618        -0.10%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      123723        124277        +0.45%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        130045        131254        +0.93%
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         1061          1056          -0.47%
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         35498         36316         +2.30%
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          27392         27312         -0.29%
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            18889         18301         -3.11%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             692           635           -8.21%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             731           677           -7.44%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              701           747           +6.55%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                1045          1045          +0.00%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                990           1003          +1.33%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                1438          1459          +1.46%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 1431          1438          +0.49%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   1590          1653          +3.96%
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      12853         12687         -1.29%
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      13384         13105         -2.08%
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       12593         12362         -1.83%
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         12993         12778         -1.65%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        1041          1049          +0.77%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        35369         36168         +2.26%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         27145         27216         +0.26%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           18428         18328         -0.54%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   4946          4934          -0.24%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   5116          5046          -1.37%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    4433          4383          -1.13%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      4209          4194          -0.36%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    1082          1042          -3.70%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    1037          1052          +1.45%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     1044          1050          +0.57%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       1046          1054          +0.76%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       58428         57842         -1.00%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   14052         13157         -6.37%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    169866        168361        -0.89%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      128813        126611        -1.71%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    43909         44199         +0.66%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     57549         58327         +1.35%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        36489         35778         -1.95%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  3695          3605          -2.44%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  1268          1277          +0.71%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        139703        134718        -3.57%
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            14636         13840         -5.44%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1585          1587          +0.13%

klauspost · 2022-03-18T09:36:51Z

> copying in 32-byte chunks - 2xSSE reg.

Yeah, it will over-copy quite a bit, since most matches/lits will be <16 bytes.

We could add some logic that uses one or the other, based on seqSize/nSeqs/2, which gives the average copied size per sequence element. That is some fine-tuning we can look at once we have it running.
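A minimal sketch of that selection logic, assuming the block-level totals are available before execution starts. The helper name `chooseCopyWidth`, its parameters (modelled on the `seqSize` and sequence-count names in the comment above), and the 16-byte threshold are all hypothetical and not taken from the PR; the threshold in particular would need benchmarking.

```go
package main

import "fmt"

// chooseCopyWidth is a hypothetical helper, not part of klauspost/compress.
// seqSize is the total number of bytes produced by the sequences of a block,
// nSeqs is the number of sequences. Every sequence performs one literal copy
// and one match copy, hence the division by 2.
func chooseCopyWidth(seqSize, nSeqs int) int {
	if nSeqs == 0 {
		return 16
	}
	avg := seqSize / nSeqs / 2 // average bytes per copy
	if avg > 16 {              // assumed threshold, untested
		return 32
	}
	return 16
}

func main() {
	fmt.Println(chooseCopyWidth(4096, 400))  // short copies on average
	fmt.Println(chooseCopyWidth(65536, 512)) // long copies on average
}
```

The decision is made once per block, so the cost is a couple of integer divisions outside the hot loop.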

WojciechMula · 2022-03-18T09:40:56Z

> copying in 32-byte chunks - 2xSSE reg.
>
> Yeah, it will over-copy quite a bit, since most matches/lits will be <16 bytes.
>
> We could add some logic that uses one or the other, based on seqSize/nSeqs/2, which gives the average copied size per sequence element. That is some fine-tuning we can look at once we have it running.

I fully agree. So, I'm going to revert this change and squash the commits, and then it'll be ready to merge. I'd like to add the history support in another PR, as I see there's some more work and we'd likely need another specialisation. Does that sound fine?

WojciechMula · 2022-03-18T09:45:51Z

> Yeah, it will over-copy quite a bit, since most matches/lits will be <16 bytes.

BTW this is on my TODO list: check if having a separate path for ml <= 16 && ll <= 16 would bring any speedup.
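To make the idea concrete, here is a rough Go sketch of such a split, not the assembler from the linked branch. The `seq` type, the `pad` constant and the `execute` function are hypothetical. The extra `mo >= 16` condition is an assumption added here so that the fixed 16-byte match copy never reads bytes that have not been written yet. Both buffers are assumed to carry 16 bytes of slack so over-copying stays in bounds, and trailing literals after the last sequence are ignored.

```go
package main

import "fmt"

// seq holds one decoded sequence: literal length, match length, match offset.
type seq struct{ ll, ml, mo int }

// pad is the slack assumed past the logical end of both buffers.
const pad = 16

// execute writes the output of seqs into out and returns the bytes produced.
func execute(out, lits []byte, seqs []seq) int {
	pos, lit := 0, 0
	for _, s := range seqs {
		if s.ll <= 16 && s.ml <= 16 && s.mo >= 16 {
			// Fast path: two fixed-width copies, no length-dependent loop.
			// Bytes written past ll/ml are overwritten by later copies.
			copy(out[pos:pos+16], lits[lit:lit+16])
			pos += s.ll
			copy(out[pos:pos+16], out[pos-s.mo:pos-s.mo+16])
			pos += s.ml
		} else {
			// General path: exact literal copy, then a byte-wise match copy
			// so overlapping matches (mo < ml) replicate correctly.
			copy(out[pos:], lits[lit:lit+s.ll])
			pos += s.ll
			for i := 0; i < s.ml; i++ {
				out[pos+i] = out[pos-s.mo+i]
			}
			pos += s.ml
		}
		lit += s.ll
	}
	return pos
}

func demo() string {
	lits := make([]byte, 20+pad)
	copy(lits, "abcdefghijklmnopqrst")
	out := make([]byte, 30+pad)
	// First sequence takes the fast path, second one the general path.
	n := execute(out, lits, []seq{{16, 4, 16}, {4, 6, 2}})
	return string(out[:n])
}

func main() {
	fmt.Println(demo())
}
```

Whether the extra branch pays off depends on how predictable it is for the input at hand, which is what the benchmarks have to show.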

The method `sequenceDecs.executeSimple` is a simplified `sequenceDecs.execute` for cases where both the history and the dictionary are empty. These cases account for 83% of all calls to `execute` when running `go test`; in benchmarks the figure is 99%.

- allocate padded out buffer
- remove unneeded Go fallback code
- add missing checks
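For readers unfamiliar with the split, the following plain-Go sketch illustrates what the simple case means: with no history and no dictionary, every match can only refer back into the output produced so far. It is an illustration only. The names `seqVals`, `executeSimple` and `execute` and their signatures are simplified assumptions and do not mirror the real methods, and the general path is deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
)

// seqVals holds literal length, match length and match offset (hypothetical type).
type seqVals struct{ ll, ml, mo int }

var (
	errLiterals = errors.New("not enough literals")
	errOffset   = errors.New("match offset reaches before start of output")
	errGeneral  = errors.New("general path omitted in this sketch")
)

// executeSimple handles the case without history or dictionary.
func executeSimple(out, literals []byte, seqs []seqVals) ([]byte, error) {
	for _, s := range seqs {
		if s.ll > len(literals) {
			return nil, errLiterals
		}
		out = append(out, literals[:s.ll]...)
		literals = literals[s.ll:]
		if s.ml == 0 {
			continue
		}
		// Without history, a valid offset must stay inside out.
		if s.mo <= 0 || s.mo > len(out) {
			return nil, errOffset
		}
		start := len(out) - s.mo
		for i := 0; i < s.ml; i++ {
			out = append(out, out[start+i]) // byte-wise, overlap-safe
		}
	}
	// Literals left after the last sequence are appended verbatim.
	return append(out, literals...), nil
}

// execute dispatches to the simple variant when it is applicable.
func execute(out, literals, hist, dict []byte, seqs []seqVals) ([]byte, error) {
	if len(hist) == 0 && len(dict) == 0 {
		return executeSimple(out, literals, seqs)
	}
	return nil, errGeneral
}

func main() {
	res, err := execute(nil, []byte("abcXY"), nil, nil, []seqVals{{3, 6, 3}})
	fmt.Println(string(res), err)
}
```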

WojciechMula · 2022-03-18T14:56:21Z

> BTW this is on my TODO list: check if having a separate path for ml <= 16 && ll <= 16 would bring any speedup.

I checked this (see: https://github.com/WojciechMula/compress/tree/asm-seqdec-execute-small-ll-ml), but the results are not too promising.

benchmark                                                                 old ns/op     new ns/op     delta
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            4910541       4928996       +0.38%
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        892585        895982        +0.38%
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         16483076      16496218      +0.08%
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           12319991      12291384      -0.23%
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         3796248       3787498       -0.23%
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          5244475       5235340       -0.17%
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1408008       1410128       +0.15%
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       218693        217900        -0.36%
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       126754        129718        +2.34%
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             13654478      13626124      -0.21%
BenchmarkDecoder_DecoderSmall/html.zst-16                                 1002381       997243        -0.51%
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        77231         77215         -0.02%
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               591680        592955        +0.22%
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           108071        108376        +0.28%
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            1890810       1893191       +0.13%
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              1414366       1416131       +0.12%
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            467514        468130        +0.13%
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             633571        633612        +0.01%
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                233570        230213        -1.44%
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          23447         23390         -0.24%
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          11285         11282         -0.03%
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                1574397       1571079       -0.21%
BenchmarkDecoder_DecodeAll/html.zst-16                                    122176        122674        +0.41%
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           9677          9683          +0.06%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      1438379       1440529       +0.15%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      1489496       1491006       +0.10%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       1404763       1407061       +0.16%
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         1456927       1456647       -0.02%
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9176          9180          +0.04%
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          340675        342640        +0.58%
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           245581        243651        -0.79%
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             149857        149481        -0.25%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              3136          3135          -0.03%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              2951          2933          -0.61%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               3433          3430          -0.09%
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 9135          9065          -0.77%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 4874          4865          -0.18%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 7717          7736          +0.25%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  7712          7725          +0.17%
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    7934          7930          -0.05%
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       84743         85289         +0.64%
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       91615         92230         +0.67%
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        84535         84413         -0.14%
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          100836        100875        +0.04%
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9183          9179          -0.04%
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         344552        344648        +0.03%
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          240341        240617        +0.11%
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            149246        148939        -0.21%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    27111         27197         +0.32%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    30150         30184         +0.11%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     24562         24555         -0.03%
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       29256         29230         -0.09%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9221          9176          -0.49%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9166          9170          +0.04%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9165          9175          +0.11%
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9169          9183          +0.15%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     152267        189930        +24.73%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     160717        167896        +4.47%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      147805        147261        -0.37%
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        156921        157078        +0.10%
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         1028          1039          +1.07%
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         41962         38990         -7.08%
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          28733         28836         +0.36%
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            18227         18202         -0.14%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             494           482           -2.49%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             493           514           +4.34%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              513           530           +3.35%
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                872           883           +1.30%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                714           709           -0.76%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                920           874           -4.99%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 878           862           -1.89%
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   971           1006          +3.60%
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      11973         13636         +13.89%
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      13805         14058         +1.83%
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       11533         11204         -2.85%
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         12706         13511         +6.34%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        1034          1025          -0.87%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        40561         39360         -2.96%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         28381         28373         -0.03%
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           18197         18181         -0.09%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   4236          4221          -0.35%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   4331          4321          -0.23%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    3518          3474          -1.25%
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      4080          4084          +0.10%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    1043          1042          -0.10%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    1031          1031          +0.00%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     1026          1027          +0.10%
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       1026          1026          +0.00%
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       80514         80448         -0.08%
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   15532         15755         +1.44%
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    271086        275307        +1.56%
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      199163        203038        +1.95%
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    64857         66247         +2.14%
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     88381         89690         +1.48%
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        36946         37122         +0.48%
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  3533          3584          +1.44%
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  1248          1242          -0.48%
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        200022        200791        +0.38%
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            18201         18425         +1.23%
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1364          1362          -0.15%

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            300.28       299.16       1.00x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        1062.87      1058.84      1.00x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         233.87       233.68       1.00x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           277.11       277.76       1.00x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         263.80       264.40       1.00x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          232.00       232.40       1.00x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             2327.26      2323.76      1.00x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       3745.89      3759.52      1.00x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       7768.95      7591.43      0.98x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             411.34       412.20       1.00x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 817.25       821.47       1.01x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        422.22       422.30       1.00x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               311.52       310.85       1.00x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           1097.31      1094.23      1.00x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            254.84       254.52       1.00x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              301.73       301.35       1.00x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            267.75       267.40       1.00x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             240.05       240.03       1.00x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1753.65      1779.22      1.01x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          4367.32      4377.95      1.00x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          10907.81     10910.55     1.00x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                445.94       446.88       1.00x
BenchmarkDecoder_DecodeAll/html.zst-16                                    838.13       834.73       1.00x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           421.21       420.96       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      269.72       269.32       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      260.47       260.20       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       276.18       275.73       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         266.29       266.34       1.00x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          10898.34     10893.26     1.00x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          293.54       291.86       0.99x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           407.21       410.43       1.01x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             667.32       669.00       1.00x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1312.69      1313.05      1.00x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1394.82      1403.49      1.01x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1198.89      1200.15      1.00x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 450.60       454.04       1.01x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 317.63       318.20       1.00x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 200.59       200.09       1.00x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  200.72       200.40       1.00x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    195.11       195.22       1.00x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       524.84       521.48       0.99x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       485.48       482.24       0.99x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        526.14       526.90       1.00x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          441.08       440.91       1.00x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         10890.18     10894.51     1.00x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         290.24       290.16       1.00x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          416.09       415.61       1.00x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            670.05       671.44       1.00x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1888.55      1882.53      1.00x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1698.17      1696.27      1.00x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     2084.54      2085.12      1.00x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1750.06      1751.60      1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     10844.95     10898.87     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     10910.65     10905.45     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      10911.41     10899.94     1.00x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        10907.20     10890.57     1.00x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     2547.93      2042.67      0.80x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     2413.95      2310.74      0.96x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      2624.83      2634.54      1.00x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        2472.35      2469.88      1.00x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         97232.19     96248.03     0.99x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         2383.18      2564.87      1.08x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          3480.45      3468.02      1.00x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            5486.50      5494.05      1.00x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             8335.65      8548.33      1.03x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             8349.16      8001.89      0.96x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              8028.43      7768.11      0.97x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                4721.19      4660.98      0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                2166.81      2183.27      1.01x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                1682.17      1770.51      1.05x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 1762.65      1796.65      1.02x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   1594.28      1539.18      0.97x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      3714.64      3261.66      0.88x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      3221.78      3163.86      0.98x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       3856.44      3969.58      1.03x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         3500.36      3291.81      0.94x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        96752.02     97597.02     1.01x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        2465.52      2540.70      1.03x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         3523.59      3524.62      1.00x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           5495.63      5500.48      1.00x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   12086.40     12129.97     1.00x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   11820.68     11848.48     1.00x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    14553.53     14736.61     1.01x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      12547.69     12537.70     1.00x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    95859.32     96007.18     1.00x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    97022.49     96960.91     1.00x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     97432.91     97351.12     1.00x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       97465.45     97440.05     1.00x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       2289.30      2291.17      1.00x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   7635.15      7527.23      0.99x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    1777.52      1750.27      0.98x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      2142.74      2101.85      0.98x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    1930.07      1889.58      0.98x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     1720.83      1695.72      0.99x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        11086.54     11034.03     1.00x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  28986.75     28569.63     0.99x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  98648.47     99085.95     1.00x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        3510.05      3496.61      1.00x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            5626.21      5557.62      0.99x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   2988.52      2993.51      1.00x
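
As a reading aid for the tables above: the percentage column appears consistent with `(new - old) / old`, and the speedup column with `new / old` on the MB/s figures. The sketch below checks both against two rows copied from the tables; the function names `deltaPercent` and `speedup` are made up for illustration and are not part of any benchmark tool.

```go
package main

import "fmt"

// deltaPercent returns the relative change from old to new, in percent.
// Hypothetical helper, named here only for illustration.
func deltaPercent(old, new float64) float64 {
	return (new - old) / old * 100
}

// speedup returns the throughput ratio new/old (values above 1 mean faster).
// Hypothetical helper, named here only for illustration.
func speedup(old, new float64) float64 {
	return new / old
}

func main() {
	// Row: DecodeAllFilesP/html.txt/fastest-16 (first table).
	fmt.Printf("%+.2f%%\n", deltaPercent(11973, 13636))
	// Row: DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16 (MB/s table).
	fmt.Printf("%.2fx\n", speedup(2547.93, 2042.67))
}
```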

klauspost · 2022-03-18T15:51:22Z

> Method sequenceDecs.executeSimple is simplified sequenceDecs.execute
> for cases when both the history and dictionary are empty. These
> cases are 83% of all calls to execute when run go test.
> In benchmarks it is 99%.

Took this out, since it isn't true: non-stream benchmarks hit decodeSync, not decode+execute.

The code is largely unused for now, but a foundation we can build on.
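
For readers unfamiliar with the "no history, no dictionary" case, here is a minimal Go sketch of what executing sequences looks like when every match can only refer back into the output produced so far. It is an illustration under stated assumptions, not the code from this PR: the `seq` type, `executeNoHistory`, and the error values are hypothetical names, and the real implementation works on preallocated buffers and assembler rather than `append`.

```go
package main

import (
	"errors"
	"fmt"
)

// seq is a hypothetical, simplified sequence: emit litLen literal bytes,
// then copy matchLen bytes starting offset bytes back in the output.
type seq struct {
	litLen, matchLen, offset int
}

var (
	errLiterals  = errors.New("sequence needs more literals than available")
	errBadOffset = errors.New("match offset reaches before start of output")
)

// executeNoHistory applies seqs to literals. Because there is no history
// or dictionary, a valid offset can never exceed the bytes already written.
func executeNoHistory(seqs []seq, literals []byte) ([]byte, error) {
	var out []byte
	for _, s := range seqs {
		if s.litLen < 0 || s.litLen > len(literals) {
			return nil, errLiterals
		}
		out = append(out, literals[:s.litLen]...)
		literals = literals[s.litLen:]

		if s.matchLen == 0 {
			continue
		}
		if s.offset <= 0 || s.offset > len(out) {
			return nil, errBadOffset
		}
		start := len(out) - s.offset
		// Copy byte by byte so overlapping matches (offset < matchLen)
		// repeat the bytes that were just written.
		for i := 0; i < s.matchLen; i++ {
			out = append(out, out[start+i])
		}
	}
	// Literals left over after the last sequence go at the end.
	return append(out, literals...), nil
}

func main() {
	out, err := executeNoHistory(
		[]seq{{litLen: 3, matchLen: 6, offset: 3}, {litLen: 1, matchLen: 2, offset: 1}},
		[]byte("abcXYZ"),
	)
	fmt.Println(string(out), err)
}
```

The single bounds check on `offset` is where the simplification shows: with history or a dictionary, an offset larger than the current output would have to be resolved against those buffers instead of being rejected.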

klauspost reviewed Mar 14, 2022


Comment thread zstd/seqdec_amd64.go

Comment thread zstd/seqdec_amd64.s Outdated

Comment thread zstd/seqdec_amd64.s Outdated

WojciechMula force-pushed the asm-seqdec-execute branch from 9d30996 to 7643f8e Compare March 17, 2022 14:15

klauspost reviewed Mar 17, 2022


Comment thread internal/cpuinfo/cpuinfo_amd64.s Outdated

Comment thread zstd/_generate/gen.go

Comment thread zstd/_generate/gen.go

Comment thread zstd/_generate/gen.go

Comment thread zstd/_generate/gen.go Outdated

Comment thread zstd/_generate/gen.go Outdated

zstd: x86 assembler implementation of sequenceDecs.executeSimple

3295600

Method sequenceDecs.executeSimple is simplified sequenceDecs.execute for cases when both the history and dictionary are empty. These cases are 83% of all calls to `execute` when run `go test`. In benchmarks it is 99%.

klauspost reviewed Mar 18, 2022


Comment thread zstd/seqdec_amd64.go Outdated

Comment thread zstd/seqdec_amd64.go Outdated

WojciechMula force-pushed the asm-seqdec-execute branch from 2cb26c5 to 3295600 Compare March 18, 2022 10:08

WojciechMula marked this pull request as ready for review March 18, 2022 10:08

WojciechMula added 2 commits March 18, 2022 13:59

Review fixes:

25b7f32

- allocate padded out buffer
- remove not needed Go fallback code
- add missing checks

Simplify litPosition calculation

c35f4ae

klauspost approved these changes Mar 18, 2022


Comment thread zstd/seqdec.go

klauspost merged commit 8dc799d into klauspost:master Mar 18, 2022

WojciechMula deleted the asm-seqdec-execute branch March 18, 2022 15:55
