adding a length computation benchmark by lemire · Pull Request #901 · simdutf/simdutf

lemire · 2026-01-06T18:15:16Z

This is a PR ON TOP of @anonrig's PR #887

The purpose here is to anchor the discussion with respect to performance, so that everyone understands what is happening.

To test this out, do the following:

cmake -B build -D SIMDUTF_BENCHMARKS=ON -D CMAKE_BUILD_TYPE=Release
cmake --build build --target benchmark_base64

Use a test file. Any would do, but you can create one like so:

 base64 -i ./README.md -b 76 > test.base64

Next run the benchmarks, first the decoding benchmark:

./build/benchmarks/base64/benchmark_base64 -d test.base64 -f simdutf

Here -f simdutf applies a filter so we only run the simdutf functions.

Then benchmark the length functions:

 ./build/benchmarks/base64/benchmark_base64 -L test.base64

Here is what I get on my macbook...

./build/benchmarks/base64/benchmark_base64 -d test.base64 -f simdutf

# current system detected as arm64.
# loading files: .
# volume: 182408 bytes
# max length: 182408 bytes
# number of inputs: 1
# decode
# the base64 data contains spaces, so we cannot use straight libbase64::base64_decode directly
simdutf::arm64                                :  15.82 GB/s  7.27 % 
simdutf::arm64 (accept garbage)               :  13.77 GB/s  6.65 % 

 ./build/benchmarks/base64/benchmark_base64 -L test.base64         

# current system detected as arm64.
# loading files: .
# volume: 182409 bytes
# max length: 182409 bytes
# number of inputs: 1
# lengths
# Benchmark only simdutf length functions (maximal and exact)
simdutf::arm64_maximal_binary_length_from_base64 :  10838.88 GB/s  inf % 
simdutf::arm64_binary_length_from_base64      :   8.31 GB/s  3.06 %

So you see here that maximal_binary_length_from_base64 is effectively free, while binary_length_from_base64 is not.
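To illustrate why the maximal variant can be effectively free, here is a minimal Python sketch. It is not simdutf's implementation: the helper names `maximal_length` and `exact_length` are hypothetical, and the sketch assumes that the upper bound only needs the input size and the trailing padding, whereas an exact count has to visit every byte to discount whitespace such as the line breaks in test.base64.

```python
WHITESPACE = b" \t\n\r\f"

def decoded_size(significant_chars: int) -> int:
    # Every 4 base64 characters carry 3 bytes; a tail of 2 or 3
    # characters carries 1 or 2 bytes respectively.
    full, rem = divmod(significant_chars, 4)
    return full * 3 + (rem - 1 if rem > 1 else 0)

def maximal_length(data: bytes) -> int:
    """Upper bound in O(1): looks only at len(data) and the last two bytes."""
    n = len(data)
    if n > 0 and data[n - 1:n] == b"=":
        n -= 1
        if n > 0 and data[n - 1:n] == b"=":
            n -= 1
    return decoded_size(n)

def exact_length(data: bytes) -> int:
    """Exact size in O(n): every byte is inspected to skip whitespace.
    Simplification (assumption): any '=' is treated as padding."""
    n = sum(1 for b in data if b not in WHITESPACE and b != ord("="))
    return decoded_size(n)

sample = b"SGVs\nbG8="  # "Hello" with a line break in the middle
print(maximal_length(sample), exact_length(sample))
```

The constant-time bound overestimates when the input contains whitespace, which is the price paid for not scanning the data.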

Suppose you combine the decoding function with the maximal function... then we get 15.82 GB/s (unchanged).

But if you combine it with the new function you get 1/(1/8.31 + 1/15.82), or about 5.5 GB/s. That is, we reduce the speed by nearly a factor of three. It is not a small effect.
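The arithmetic above can be checked with a short Python sketch. It assumes the length pass and the decoding pass run one after the other over the same input with no overlap, so the time per byte adds up; `combined_throughput` is a hypothetical helper name, and the two speeds are the arm64 figures from the listings above.

```python
def combined_throughput(*speeds_gbps: float) -> float:
    """Throughput of several passes run back to back over the same input."""
    return 1.0 / sum(1.0 / s for s in speeds_gbps)

decode = 15.82  # simdutf::arm64 decoding, GB/s
exact = 8.31    # simdutf::arm64_binary_length_from_base64, GB/s

both = combined_throughput(exact, decode)
slowdown = decode / both
print(f"{both:.2f} GB/s, {slowdown:.2f}x slower than decoding alone")
```

The slowdown works out to 1 + 15.82/8.31, so it grows with the ratio of decoding speed to length-computation speed.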

lemire · 2026-01-06T18:16:18Z

@anonrig I leave this up to you to merge this.

anonrig

perfect - amazing work

anonrig · 2026-01-06T18:20:34Z

(now we know, my code is slow!)

erikcorry · 2026-01-06T19:55:27Z

The AVX2 and AVX-512 versions are much faster though. On a 135090 byte input with my sanitizer-incompatible version:

simdutf::icelake_binary_length_from_base64    :  130.62 GB/s  13.52 %   4.49 GHz   0.03 c/b   0.17 i/b   5.07 i/c 
simdutf::haswell_binary_length_from_base64    :  67.02 GB/s  13.05 %   4.37 GHz   0.07 c/b   0.31 i/b   4.83 i/c 
simdutf::fallback_binary_length_from_base64   :   6.62 GB/s  16.63 %   3.83 GHz   0.58 c/b   3.32 i/b   5.73 i/c

So on Icelake

On a tiny 17 byte input the SIMD versions are the same speed as the scalar versions, which suggests that a version that is sanitizer-friendly would be just as fast.

Getting the length is thus not a big part of a getting-the-length-then-decoding task.

# volume: 135090 bytes
# max length: 135090 bytes
# number of inputs: 1
# decode
# the base64 data contains spaces, so we cannot use straight libbase64::base64_decode directly
simdutf::icelake                              :  14.59 GB/s  14.69 %   3.93 GHz   0.27 c/b   0.51 i/b   1.88 i/c 
simdutf::icelake (accept garbage)             :  13.05 GB/s  10.99 %   3.84 GHz   0.29 c/b   0.47 i/b   1.59 i/c 
simdutf::haswell                              :  12.36 GB/s  27.33 %   3.90 GHz   0.32 c/b   1.21 i/b   3.83 i/c 
simdutf::haswell (accept garbage)             :  12.05 GB/s  15.39 %   4.14 GHz   0.34 c/b   1.22 i/b   3.55 i/c 
simdutf::westmere                             :   9.93 GB/s  21.90 %   4.00 GHz   0.40 c/b   2.34 i/b   5.81 i/c 
simdutf::westmere (accept garbage)            :   9.24 GB/s  15.21 %   4.19 GHz   0.45 c/b   2.43 i/b   5.36 i/c 
simdutf::fallback                             :   3.82 GB/s  22.04 %   4.02 GHz   1.05 c/b   6.83 i/b   6.49 i/c 
simdutf::fallback (accept garbage)            :   2.41 GB/s  20.23 %   4.05 GHz   1.68 c/b   9.59 i/b   5.70 i/c
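To put a number on how much of a length-then-decode task goes to the length pass, the same additive-time model can be applied to the x64 figures above. This is a rough sketch: it assumes the two passes do not overlap, it pairs each length kernel with the strict decoder of the same name, and the length figures come from the sanitizer-incompatible version. `length_share` is a hypothetical helper name.

```python
# (length GB/s, decode GB/s) taken from the two listings above
measured = {
    "icelake": (130.62, 14.59),
    "haswell": (67.02, 12.36),
    "fallback": (6.62, 3.82),
}

def length_share(length_gbps: float, decode_gbps: float) -> float:
    """Fraction of the total time spent in the length pass."""
    t_len = 1.0 / length_gbps
    t_dec = 1.0 / decode_gbps
    return t_len / (t_len + t_dec)

shares = {kernel: length_share(*speeds) for kernel, speeds in measured.items()}
for kernel, share in shares.items():
    print(f"{kernel}: {share:.0%} of the time is spent computing the length")
```

Under these assumptions the length pass takes roughly 10% of the time on Icelake and 16% on Haswell, compared with about two thirds for the arm64 numbers reported earlier in the thread.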

adding a length computation benchmark

e9db1f7

lemire marked this pull request as ready for review January 6, 2026 18:15

lemire requested review from anonrig and erikcorry January 6, 2026 18:15

anonrig approved these changes Jan 6, 2026

anonrig merged commit 96a192e into yagiz/add-binary-length-base64 Jan 6, 2026
47 checks passed

anonrig pushed a commit that referenced this pull request Jan 30, 2026

adding a length computation benchmark (#901)

13b3582

anonrig merged 1 commit into yagiz/add-binary-length-base64 from lemire/add-binary-length-base64-benchmark
