zstd: x86 assembler implementation of sequenceDecs.executeSimple#531
zstd: x86 assembler implementation of sequenceDecs.executeSimple#531klauspost merged 3 commits intoklauspost:masterfrom
Conversation
klauspost
left a comment
There was a problem hiding this comment.
Just a few initial thoughts.
|
I fixed the code, so the tests pass. Now going to prepare an |
9d30996 to
7643f8e
Compare
|
I'll fix the CI error once pushing to github will be working. The fresh results from IceLake: I also tried copying via AVX registers, but there is no big difference. IIRC I observed better speedup in my early decodeSync implementation. So, maybe for that function, we'll see a nicer AVX impact. |
|
Re: performing all memcpy inside asm -- overall, it's better, few minimal regressions. Comparison with the version with a threshold. |
|
@WojciechMula Looks good 👍🏼 |
@klauspost sorry, I copied a wrong file - but the actual results are still quite good IMHO |
|
Re: copying in 32-byte chunks - 2xSSE reg. There are some nice speed-ups, but I feel like there are more regressions. |
Yeah, it will over-copy quite a bit, since most matches/lits will be <16 bytes. We could add some logic that uses one or the other, based on |
I fully agree. So, I'm going to revert this change, squash the commits and it'll be ready to merge. I'd like to add the history support in another PR, as I see there's some more work and likely we'd need another specialisation. Does it sound fine? |
BTW this is on my TODO list: check if having a separate path for |
Method sequenceDecs.executeSimple is simplified sequenceDecs.execute for cases when both the history and dictionary are empty. These cases are 83% of all calls to `execute` when run `go test`. In benchmarks it is 99%.
2cb26c5 to
3295600
Compare
- allocate padded out buffer - remove not needed Go fallback code - add missing checks
I checked this (see: https://github.com/WojciechMula/compress/tree/asm-seqdec-execute-small-ll-ml), but the results are not too promising. |
Took this out, since it isn't true, since non-stream benchmarks hits decodeSync, not decode+execute. The code is largely unused for now, but a foundation we can build on. |
This is plain x86 and x86 with BMI2 implementation of
sequenceDecs.executeSimple. Part of #515.I extracted function
executeSimpleto handle cases when no history nor dictionary is used. My quick check showed that forgo testsuch cases is 83% of all calls, while forgo test -bench .it's 99%. Thus, it's the vast majority of cases. Of course, we may consider handling all cases in another PR (but after completing #529).As always, I'm marking it as a draft, as some tests fail. I will figure out what's wrong, likely as usual I missed something silly.[fixed (I was right, it was silly)]Below are preliminary benchmark results from IceLake machine: it's noasm vs GOARM64=v3. Currently, the branch is built on top of #528, thus we see the combined performance boost from x86 BMI use in both
decodeandexecute.