add simdutf::binary_length_from_base64 by anonrig · Pull Request #887 · simdutf/simdutf

anonrig · 2025-12-30T16:37:47Z

Fixes #867

lemire · 2025-12-31T22:08:10Z

@anonrig I like @erikcorry's comment here tremendously. But we should then not take an option parameter because, among the possible options, is that we ignore everythings (as Node.js's atob does). Further, it is clear form what @erikcorry is describing that we do no validation of any kind...

So we would just say something like 'we return the exact output size assuming that the base64 input is valid [obviously] and ignoring ASCII white-space characters.'

With Erik's simplified approach, we can even implement accelerated functions in the various kernels as needed, and it becomes a SIMD programming 101 problem (beat the autovectorizer).

(Note that the exact function can be done sanely within our kernels without a PhD in SIMD programming... but it is going to involve non trivial arithmetic... we already have the code... but it is still going to be slower than what Erik is proposing we implement.)

lemire · 2025-12-31T22:10:19Z

This PR does not have to include my comment above about how this is often going to be a suboptimal idea. But I will want to throw it in later because I think we need have people be aware of the tradeoff. (This function here is fine as long as the engineer is aware of the tradeoff.)

lemire · 2026-01-03T21:24:46Z

@anonrig Warning: Paul (@pauldreik) turned everything constexpr so this may interact badly with this PR.

anonrig · 2026-01-05T15:55:39Z

@lemire No worries. I'll rebase and address the reviews and ready the PR for the next round of reviews.

erikcorry · 2026-01-05T16:35:19Z

+```cpp
+std::vector<char> buffer(simdutf::binary_length_from_base64(base64.data(), base64.size()));
+simdutf::result r = simdutf::base64_to_binary(base64.data(), base64.size(), buffer.data());
+if(r.error) {


Check explicitly against SUCCESS, not with implicit bool conversion. Also missing space after 'if'.

erikcorry · 2026-01-05T16:53:57Z

+simdutf_warn_unused size_t binary_length_from_base64(const char * input, size_t length) noexcept;
+
+/**
+ * Compute the binary length from a base64 input with ASCII spaces.


Same changes as above.

erikcorry · 2026-01-05T17:01:11Z

+  #endif // SIMDUTF_SPAN
+
+/**
+ * Compute the binary length from a base64 input with ASCII spaces.


erikcorry · 2026-01-05T17:01:38Z

      const char16_t *input, size_t length) const noexcept;

+  /**
+   * Compute the binary length from a base64 input with ASCII spaces.


erikcorry · 2026-01-05T17:01:45Z

+  binary_length_from_base64(const char *input, size_t length) const noexcept;
+
+  /**
+   * Compute the binary length from a base64 input with ASCII spaces.


erikcorry · 2026-01-05T17:29:09Z


 #endif

+// Tests for binary_length_from_base64


Some of the tests should take invalid input if only to verify that there is no crash or buffer overflow. Eg. Something that only consists of '=' and spaces.

I agree but I think that can be added later, if and when vectorization is implemented.

erikcorry · 2026-01-06T16:55:11Z

+      simdutf::binary_length_from_base64(input16.data(), input16.size()), 3);
+
+  std::u16string input16_spaces = u"  Y  Q  =  =  ";
+  ASSERT_EQUAL(simdutf::binary_length_from_base64(input16_spaces.data(),


These inputs are so short I don't think they are testing an actual SIMD implementation, since they drop right through to the scalar cleanup. At least I wrote a broken version of the 16 bit version in another PR and the tests passed.

lemire · 2026-01-06T19:02:30Z

I'll mark this PR as approved. We could merge, but before turning it into a release, I think we want to do more performance work. I don't have time right now, but later...

Performance numbers reproduced from my other PR.

To test this out, do the following:

cmake -B build -D SIMDUTF_BENCHMARKS=ON -D CMAKE_BUILD_TYPE=Release
cmake --build build --target benchmark_base64

Use a test file. Any would not, but you can create one like so:

 base64 -i ./README.md -b 76 > test.base64

Next run the benchmarks, first the decoding benchmark:

./build/benchmarks/base64/benchmark_base64 -d test.base64 -f simdutf

Here -f simdutf applies a filter so we only run the simdutf functions.

Then benchmark the length functions:

 ./build/benchmarks/base64/benchmark_base64 -L test.base64

Here is what I get on my macbook...

./build/benchmarks/base64/benchmark_base64 -d test.base64 -f simdutf

# current system detected as arm64.
# loading files: .
# volume: 182408 bytes
# max length: 182408 bytes
# number of inputs: 1
# decode
# the base64 data contains spaces, so we cannot use straight libbase64::base64_decode directly
simdutf::arm64                                :  15.82 GB/s  7.27 % 
simdutf::arm64 (accept garbage)               :  13.77 GB/s  6.65 % 

 ./build/benchmarks/base64/benchmark_base64 -L test.base64         

# current system detected as arm64.
# loading files: .
# volume: 182409 bytes
# max length: 182409 bytes
# number of inputs: 1
# lengths
# Benchmark only simdutf length functions (maximal and exact)
simdutf::arm64_maximal_binary_length_from_base64 :  10838.88 GB/s  inf % 
simdutf::arm64_binary_length_from_base64      :   8.31 GB/s  3.06 %

So you see here that maximal_binary_length_from_base64 is effectively free, while binary_length_from_base64 is not.

Suppose you combine the decoding function with the maximal function... then we get 15.82 GB/s (unchanged).

But if you combine it with the new function you get 1/(1/8.31 + 1/15.82) or 5.5 GB/s. That is, we reduce by a factor of three the speed. It is not a small effect.

pauldreik · 2026-01-06T19:38:02Z

@anonrig before we merge this, I think there should be a span overload and it should be constexpr. ping me if you want help or me to do it. it should be easy.

anonrig · 2026-01-06T20:42:47Z

@anonrig before we merge this, I think there should be a span overload and it should be constexpr. ping me if you want help or me to do it. it should be easy.

@pauldreik Thanks! I'll add it now.

@erikcorry ~~Would you mind if I follow up on your remaining recommendations after landing this PR?~~ Actually, I'll address them now.

pauldreik

I suggest the following test:

commit b504e11f606889bff895d21a85afedc4a9751689 (HEAD -> add-binary-length-base64)
Author: Paul Dreik <[email protected]>
Date:   Tue Jan 6 22:47:38 2026 +0100

    add test for binary_length_from_base64

diff --git a/tests/constexpr_base64_tests.cpp b/tests/constexpr_base64_tests.cpp
index 75324a6e..ed2c90f1 100644
--- a/tests/constexpr_base64_tests.cpp
+++ b/tests/constexpr_base64_tests.cpp
@@ -43,6 +43,35 @@ TEST(compile_time_maximal_binary_length16) {
                 binary.size());
 }
 
+TEST(compile_time_binary_length_from_base64) {
+  using namespace std::string_view_literals;
+
+  // empty input
+  static_assert(simdutf::binary_length_from_base64(""sv) == 0);
+  static_assert(simdutf::binary_length_from_base64(" "sv) == 0);
+
+  // increasing length of "a" repeated
+  static_assert(simdutf::binary_length_from_base64("YQ=="sv) == 1);
+  static_assert(simdutf::binary_length_from_base64("YWE="sv) == 2);
+  static_assert(simdutf::binary_length_from_base64("YWFh"sv) == 3);
+  static_assert(simdutf::binary_length_from_base64("YWFhYQ=="sv) == 4);
+  static_assert(simdutf::binary_length_from_base64("YWFhYWE="sv) == 5);
+
+  // all these are base64 of 'a', mixed with whitespace in different ways
+  constexpr std::array mixedwithspaces{
+      " YQ=="sv,     //
+      "Y Q=="sv,     //
+      "YQ =="sv,     //
+      "YQ= ="sv,     //
+      "YQ== "sv,     //
+      " Y Q = = "sv, //
+      " YQ = ="sv,   //
+  };
+  static_assert(std::ranges::all_of(mixedwithspaces, [](auto s) {
+    return simdutf::binary_length_from_base64(s) == 1;
+  }));
+}
+
 namespace {
 
 template <auto input>

pauldreik · 2026-01-07T06:33:59Z

good, please also add constexpr tests for the utf16 variant!

Co-authored-by: Erik Corry <[email protected]>

anonrig · 2026-01-30T20:17:21Z

Sorry for the late response. I've added the last remaining tests, rebased and force pushed. I think this pull-request is good to go. @lemire would you mind landing this pull-request if you think it's good to go (and when tests succeed)?

lemire · 2026-01-30T21:25:43Z

@anonrig I am happy to approve this. Obviously, it will require more work to get the optimized performance, but it is fine to do it as follow up PRs. I see that @pauldreik has a 'requested changes' status and I think that @erikcorry had comments, so I will approve but give time to these fine folks to react before we merge for good.

Merging early next week if nobody has anything to add ?

lemire

Approving, but let us make sure @pauldreik is happy before we merge.

pauldreik · 2026-01-31T10:56:21Z


 #endif

+// Tests for binary_length_from_base64


I agree but I think that can be added later, if and when vectorization is implemented.

pauldreik · 2026-01-31T10:58:35Z

+ * @param length        the length of the base64 input in 16-bit units
+ * @return number of binary bytes
+ */
+simdutf_warn_unused size_t binary_length_from_base64(const char16_t * input, size_t length) noexcept;


I agree.

But I also think this is a general problem not belonging to this PR - the documentation is duplicated. Do we need the function signatures in the README ? Or can they be auto generated/validated.

anonrig · 2026-02-27T18:42:27Z

Can we merge this? Sorry I haven't been super active with this pull-request.

pauldreik · 2026-02-28T06:51:48Z

Can we merge this? Sorry I haven't been super active with this pull-request.

yes!

anonrig requested a review from lemire December 30, 2025 16:37

lemire reviewed Dec 30, 2025

View reviewed changes

Comment thread README.md Outdated

erikcorry reviewed Dec 31, 2025

View reviewed changes

Comment thread src/scalar/base64.h Outdated

Comment thread README.md Outdated

anonrig force-pushed the yagiz/add-binary-length-base64 branch from 24aacbe to 1951bf2 Compare January 5, 2026 16:07

anonrig requested review from erikcorry and lemire January 5, 2026 16:07

anonrig force-pushed the yagiz/add-binary-length-base64 branch from 1951bf2 to e4afd9a Compare January 5, 2026 16:17

erikcorry approved these changes Jan 5, 2026

View reviewed changes

erikcorry reviewed Jan 5, 2026

View reviewed changes

erikcorry reviewed Jan 6, 2026

View reviewed changes

Comment thread include/simdutf/scalar/base64.h Outdated

erikcorry reviewed Jan 6, 2026

View reviewed changes

lemire mentioned this pull request Jan 6, 2026

adding a length computation benchmark #901

Merged

anonrig requested a review from pauldreik January 6, 2026 20:45

pauldreik requested changes Jan 6, 2026

View reviewed changes

Comment thread include/simdutf/implementation.h Outdated

anonrig requested a review from pauldreik January 6, 2026 22:00

anonrig and others added 7 commits January 30, 2026 15:16

add simdutf::binary_length_from_base64

d2acd0d

adding a length computation benchmark (#901)

13b3582

Apply suggestions from code review

23fd958

Co-authored-by: Erik Corry <[email protected]>

address constexpr/consteval

558d684

address doc recommendations

c81b549

address pr reviews

bf80c35

add utf16 tests

32c0869

anonrig force-pushed the yagiz/add-binary-length-base64 branch from 6211533 to 32c0869 Compare January 30, 2026 20:16

lemire approved these changes Jan 30, 2026

View reviewed changes

pauldreik requested changes Jan 31, 2026

View reviewed changes

address pr reviews

6d9c9ed

anonrig requested a review from pauldreik February 9, 2026 12:34

fixup! address pr reviews

0de2f4e

anonrig merged commit 6f5cefb into master Feb 28, 2026
108 checks passed

Conversation

anonrig commented Dec 30, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lemire commented Dec 31, 2025

Uh oh!

lemire commented Dec 31, 2025

Uh oh!

lemire commented Jan 3, 2026

Uh oh!

anonrig commented Jan 5, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lemire commented Jan 6, 2026

Uh oh!

pauldreik commented Jan 6, 2026

Uh oh!

anonrig commented Jan 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pauldreik left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pauldreik commented Jan 7, 2026

Uh oh!

anonrig commented Jan 30, 2026

Uh oh!

lemire commented Jan 30, 2026

Uh oh!

lemire left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anonrig commented Feb 27, 2026

Uh oh!

pauldreik commented Feb 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

anonrig commented Jan 6, 2026 •

edited

Loading