Performance improvements for AMD64 and Generic #205

Merged: pjbgf merged 3 commits into main from performance on Apr 27, 2026
pjbgf (Owner) commented Apr 27, 2026

Performance optimisations:

  • Replace four calls to VPEXTRD with VPSHUFD + VMOVDQU.
  • Pass intermediate values by reference rather than by value.

Benchstat

                                    │         before       │                   after                   │
                                    │        sec/op        │        sec/op         vs base             │
CalculateDvMask/go-24                 0.0000004000n ±  25%   0.0000003500n ± 186%        ~ (p=1.000 n=6)
CalculateDvMask/cgo-24                0.0000008000n ± 313%   0.0000008500n ± 288%        ~ (p=0.617 n=6)
Hash8Bytes/sha1-24                           55.99n ±   4%          56.21n ±   5%        ~ (p=0.699 n=6)
Hash8Bytes/sha1cd_native-24                 102.20n ±   4%          92.41n ±   2%   -9.57% (p=0.002 n=6)
Hash8Bytes/sha1cd_generic-24                 158.8n ±   1%          155.4n ±   3%   -2.17% (p=0.026 n=6)
Hash8Bytes/sha1cd_cgo-24                     587.3n ±  10%          600.8n ±  19%        ~ (p=0.485 n=6)
Hash320Bytes/sha1-24                         170.8n ±   6%          173.0n ±   5%        ~ (p=0.937 n=6)
Hash320Bytes/sha1cd_native-24                484.6n ±  10%          421.0n ±   2%  -13.12% (p=0.002 n=6)
Hash320Bytes/sha1cd_generic-24               886.5n ±  12%          805.8n ±   3%   -9.10% (p=0.002 n=6)
Hash320Bytes/sha1cd_cgo-24                  1087.5n ±   4%          943.7n ±  26%        ~ (p=0.394 n=6)
Hash1K/sha1-24                               433.2n ±   8%          439.1n ±   1%        ~ (p=0.255 n=6)
Hash1K/sha1cd_native-24                      1.282µ ±   2%          1.147µ ±   2%  -10.60% (p=0.002 n=6)
Hash1K/sha1cd_generic-24                     2.365µ ±   3%          2.284µ ±   2%   -3.42% (p=0.004 n=6)
Hash1K/sha1cd_cgo-24                         1.895µ ±  15%          1.648µ ±   9%        ~ (p=0.061 n=6)
Hash8K/sha1-24                               3.148µ ±   1%          3.169µ ±   1%        ~ (p=0.513 n=6)
Hash8K/sha1cd_native-24                      9.497µ ±   2%          8.466µ ±   1%  -10.86% (p=0.002 n=6)
Hash8K/sha1cd_generic-24                     17.38µ ±   0%          16.67µ ±   1%   -4.12% (p=0.002 n=6)
Hash8K/sha1cd_cgo-24                         9.606µ ±   4%          9.718µ ±  17%        ~ (p=0.240 n=6)
HashWithCollision/sha1cd_native-24           1.915µ ±   2%          1.676µ ±   8%  -12.48% (p=0.002 n=6)
HashWithCollision/sha1cd_generic-24          2.625µ ±   2%          2.403µ ±   2%   -8.46% (p=0.002 n=6)
HashWithCollision/sha1cd_cgo-24              1.961µ ±   5%          2.012µ ±  21%        ~ (p=0.258 n=6)
geomean                                      142.4n                 134.8n          -5.29%

                                    │     before      │                after               │
                                    │      B/op       │     B/op      vs base              │
CalculateDvMask/go-24                    0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
CalculateDvMask/cgo-24                   0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8Bytes/sha1-24                       0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8Bytes/sha1cd_native-24              0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8Bytes/sha1cd_generic-24             0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8Bytes/sha1cd_cgo-24               2.625Ki ± 0%     2.625Ki ± 0%       ~ (p=1.000 n=6) ¹
Hash320Bytes/sha1-24                     0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash320Bytes/sha1cd_native-24            0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash320Bytes/sha1cd_generic-24           0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash320Bytes/sha1cd_cgo-24             2.625Ki ± 0%     2.625Ki ± 0%       ~ (p=1.000 n=6) ¹
Hash1K/sha1-24                           0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash1K/sha1cd_native-24                  0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash1K/sha1cd_generic-24                 0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash1K/sha1cd_cgo-24                   2.625Ki ± 0%     2.625Ki ± 0%       ~ (p=1.000 n=6) ¹
Hash8K/sha1-24                           0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8K/sha1cd_native-24                  0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8K/sha1cd_generic-24                 0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8K/sha1cd_cgo-24                   2.625Ki ± 0%     2.625Ki ± 0%       ~ (p=1.000 n=6) ¹
HashWithCollision/sha1cd_native-24       0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
HashWithCollision/sha1cd_generic-24      0.000 ± 0%       0.000 ± 0%       ~ (p=1.000 n=6) ¹
HashWithCollision/sha1cd_cgo-24        2.625Ki ± 0%     2.625Ki ± 0%       ~ (p=1.000 n=6) ¹
geomean                                             ²                 +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                    │     before      │                after              │
                                    │    allocs/op    │ allocs/op   vs base               │
CalculateDvMask/go-24                    0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
CalculateDvMask/cgo-24                   0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8Bytes/sha1-24                       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8Bytes/sha1cd_native-24              0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8Bytes/sha1cd_generic-24             0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8Bytes/sha1cd_cgo-24                 1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash320Bytes/sha1-24                     0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash320Bytes/sha1cd_native-24            0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash320Bytes/sha1cd_generic-24           0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash320Bytes/sha1cd_cgo-24               1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash1K/sha1-24                           0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash1K/sha1cd_native-24                  0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash1K/sha1cd_generic-24                 0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash1K/sha1cd_cgo-24                     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8K/sha1-24                           0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8K/sha1cd_native-24                  0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8K/sha1cd_generic-24                 0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
Hash8K/sha1cd_cgo-24                     1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=6) ¹
HashWithCollision/sha1cd_native-24       0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
HashWithCollision/sha1cd_generic-24      0.000 ± 0%     0.000 ± 0%       ~ (p=1.000 n=6) ¹
HashWithCollision/sha1cd_cgo-24          1.000 ± 0%     1.000 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                             ²               +0.00%               ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                    │     before      │                after                │
                                    │       B/s       │      B/s       vs base              │
Hash8Bytes/sha1-24                      136.3Mi ±  4%   135.7Mi ±  5%        ~ (p=0.615 n=6)
Hash8Bytes/sha1cd_native-24             74.65Mi ±  3%   82.55Mi ±  2%  +10.59% (p=0.002 n=6)
Hash8Bytes/sha1cd_generic-24            48.04Mi ±  1%   49.11Mi ±  2%   +2.24% (p=0.026 n=6)
Hash8Bytes/sha1cd_cgo-24                12.99Mi ±  9%   12.70Mi ± 16%        ~ (p=0.485 n=6)
Hash320Bytes/sha1-24                    1.745Gi ±  6%   1.722Gi ±  4%        ~ (p=0.937 n=6)
Hash320Bytes/sha1cd_native-24           629.7Mi ±  9%   724.8Mi ±  2%  +15.10% (p=0.002 n=6)
Hash320Bytes/sha1cd_generic-24          344.3Mi ± 11%   378.7Mi ±  3%   +9.98% (p=0.002 n=6)
Hash320Bytes/sha1cd_cgo-24              280.5Mi ±  4%   326.2Mi ± 21%        ~ (p=0.394 n=6)
Hash1K/sha1-24                          2.201Gi ±  8%   2.172Gi ±  2%        ~ (p=0.240 n=6)
Hash1K/sha1cd_native-24                 761.4Mi ±  1%   851.9Mi ±  2%  +11.87% (p=0.002 n=6)
Hash1K/sha1cd_generic-24                412.9Mi ±  3%   427.6Mi ±  2%   +3.56% (p=0.004 n=6)
Hash1K/sha1cd_cgo-24                    516.2Mi ± 14%   593.2Mi ±  8%        ~ (p=0.065 n=6)
Hash8K/sha1-24                          2.424Gi ±  1%   2.408Gi ±  1%        ~ (p=0.589 n=6)
Hash8K/sha1cd_native-24                 822.7Mi ±  2%   922.8Mi ±  1%  +12.17% (p=0.002 n=6)
Hash8K/sha1cd_generic-24                449.4Mi ±  0%   468.7Mi ±  2%   +4.30% (p=0.002 n=6)
Hash8K/sha1cd_cgo-24                    813.3Mi ±  4%   803.9Mi ± 15%        ~ (p=0.240 n=6)
HashWithCollision/sha1cd_native-24      318.8Mi ±  2%   364.3Mi ±  7%  +14.27% (p=0.002 n=6)
HashWithCollision/sha1cd_generic-24     232.6Mi ±  2%   254.0Mi ±  2%   +9.22% (p=0.002 n=6)
HashWithCollision/sha1cd_cgo-24         311.3Mi ±  5%   303.3Mi ± 17%        ~ (p=0.310 n=6)
geomean                                 363.0Mi         384.2Mi         +5.83%
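
The B/s table follows from the sec/op table: throughput is bytes processed per operation divided by time per operation, shown with binary prefixes. A minimal sketch of that conversion, assuming Hash8Bytes hashes 8 bytes per op and Hash8K hashes 8192 bytes per op (inferred from the benchmark names, not confirmed by the PR); `throughputMiB` is a hypothetical helper, not part of the library:

```go
package main

import "fmt"

// throughputMiB converts bytes per op and nanoseconds per op into MiB/s.
// Hypothetical helper for illustration only.
func throughputMiB(bytesPerOp, nsPerOp float64) float64 {
	bytesPerSec := bytesPerOp / (nsPerOp * 1e-9)
	return bytesPerSec / (1 << 20)
}

func main() {
	// Hash8Bytes/sha1, "before" column: assumed 8 bytes at 55.99 ns/op.
	fmt.Printf("%.1f MiB/s\n", throughputMiB(8, 55.99)) // prints 136.3 MiB/s

	// Hash8K/sha1cd_native, "after" column: assumed 8192 bytes at 8.466 µs/op.
	fmt.Printf("%.1f MiB/s\n", throughputMiB(8192, 8466)) // prints 922.8 MiB/s
}
```

Both results match the corresponding rows in the table above, which suggests the two tables are two views of the same samples rather than independent measurements.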

pjbgf added 3 commits April 27, 2026 17:11
Avoids per-block copies of [80]uint32 m1 (320B), DvInfo.Dm (320B),
[3][5]uint32 cs (60B) and [5]uint32 h (20B) through checkCollision,
hasCollided, rectifyCompressionState, and ubc.CalculateDvMask.

Hash8K (SHA-NI) -12.6%, Hash1K -14.5%, Hash320Bytes -11.6%.

Assisted-by: Claude Opus 4.7 <[email protected]>
Signed-off-by: Paulo Gomes <[email protected]>
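
The copies described in the commit message above come from Go's value semantics: an array parameter such as [80]uint32 is copied in full (320 bytes) on every call unless the compiler inlines the call and elides the copy. A minimal sketch of the difference, using hypothetical function names rather than the library's actual checkCollision or hasCollided signatures:

```go
package main

import "fmt"

// xorByValue receives a full copy of the 320-byte array on each call.
// Hypothetical function, not taken from the library.
func xorByValue(m [80]uint32) uint32 {
	var acc uint32
	for _, w := range m {
		acc ^= w
	}
	return acc
}

// xorByPointer receives only a pointer; the array itself is not copied.
// Hypothetical function, not taken from the library.
func xorByPointer(m *[80]uint32) uint32 {
	var acc uint32
	for i := range m {
		acc ^= m[i]
	}
	return acc
}

func main() {
	var m [80]uint32
	for i := range m {
		m[i] = uint32(i)
	}
	// Both variants compute the same result; only the calling cost differs.
	fmt.Println(xorByValue(m) == xorByPointer(&m)) // prints true
}
```

How much this saves in practice depends on inlining and escape analysis, so the effect is best confirmed with benchmarks such as the ones reported in this PR.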

Four VPEXTRD instructions, which serialize on the source register, are
replaced with one VPSHUFD (reverse dword order) plus one VMOVDQU. The
memory layout is unchanged and fewer store-port µops are issued.

Hash320Bytes (SHA-NI) -4.0%, Hash1K -2.3%, Hash8K -1.6%.

Entire-Checkpoint: d009daa58401

Assisted-by: Claude Opus 4.7 <[email protected]>
Signed-off-by: Paulo Gomes <[email protected]>
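
The equivalence claimed in the commit message above can be modelled in plain Go. This is a sketch of the instruction semantics only, not the PR's assembly. It assumes the four extracts wrote lanes 3, 2, 1, 0 to consecutive dwords, and it uses the immediate 0x1B, which is the VPSHUFD selector that reverses dword order. The `xmm` type and all function names are hypothetical:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// xmm models a 128-bit register as four 32-bit lanes; lane 0 is the lowest dword.
type xmm [4]uint32

// pshufd models VPSHUFD: destination lane i takes the source lane selected
// by bits [2i+1:2i] of the immediate.
func pshufd(src xmm, imm uint8) xmm {
	var dst xmm
	for i := range dst {
		dst[i] = src[(imm>>(2*i))&3]
	}
	return dst
}

// storeExtracts models four VPEXTRD stores, one 32-bit store per lane,
// writing lanes 3, 2, 1, 0 to consecutive dwords (assumed ordering).
func storeExtracts(dst []byte, r xmm) {
	for i := 0; i < 4; i++ {
		binary.LittleEndian.PutUint32(dst[4*i:], r[3-i])
	}
}

// storeShuffled models one VPSHUFD $0x1B followed by a single 16-byte
// VMOVDQU store of the shuffled register.
func storeShuffled(dst []byte, r xmm) {
	s := pshufd(r, 0x1B)
	for i, lane := range s {
		binary.LittleEndian.PutUint32(dst[4*i:], lane)
	}
}

func main() {
	r := xmm{0x00000000, 0x11111111, 0x22222222, 0x33333333}
	a, b := make([]byte, 16), make([]byte, 16)
	storeExtracts(a, r)
	storeShuffled(b, r)
	fmt.Println(bytes.Equal(a, b)) // prints true
}
```

The model only shows that both sequences produce the same 16 bytes in memory; the µop and store-port savings are a property of the hardware and cannot be observed from this simulation.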
Signed-off-by: Paulo Gomes <[email protected]>
pjbgf changed the title from "Performance" to "Performance improvements for AMD64 and Generic" on Apr 27, 2026
pjbgf merged commit ba31b91 into main on Apr 27, 2026
13 checks passed
pjbgf deleted the performance branch on April 27, 2026 21:05