Skip to content

add simdutf::binary_length_from_base64#887

Merged
anonrig merged 9 commits intomasterfrom
yagiz/add-binary-length-base64
Feb 28, 2026
Merged

add simdutf::binary_length_from_base64#887
anonrig merged 9 commits intomasterfrom
yagiz/add-binary-length-base64

Conversation

@anonrig
Copy link
Copy Markdown
Member

@anonrig anonrig commented Dec 30, 2025

Fixes #867

@anonrig anonrig requested a review from lemire December 30, 2025 16:37
Comment thread README.md Outdated
Comment thread src/scalar/base64.h Outdated
Comment thread README.md Outdated
@lemire
Copy link
Copy Markdown
Member

lemire commented Dec 31, 2025

@anonrig I like @erikcorry's comment here tremendously. But we should then not take an option parameter because, among the possible options, is that we ignore everythings (as Node.js's atob does). Further, it is clear form what @erikcorry is describing that we do no validation of any kind...

So we would just say something like 'we return the exact output size assuming that the base64 input is valid [obviously] and ignoring ASCII white-space characters.'

With Erik's simplified approach, we can even implement accelerated functions in the various kernels as needed, and it becomes a SIMD programming 101 problem (beat the autovectorizer).

(Note that the exact function can be done sanely within our kernels without a PhD in SIMD programming... but it is going to involve non trivial arithmetic... we already have the code... but it is still going to be slower than what Erik is proposing we implement.)

@lemire
Copy link
Copy Markdown
Member

lemire commented Dec 31, 2025

This PR does not have to include my comment above about how this is often going to be a suboptimal idea. But I will want to throw it in later because I think we need have people be aware of the tradeoff. (This function here is fine as long as the engineer is aware of the tradeoff.)

@lemire
Copy link
Copy Markdown
Member

lemire commented Jan 3, 2026

@anonrig Warning: Paul (@pauldreik) turned everything constexpr so this may interact badly with this PR.

@anonrig
Copy link
Copy Markdown
Member Author

anonrig commented Jan 5, 2026

@lemire No worries. I'll rebase and address the reviews and ready the PR for the next round of reviews.

@anonrig anonrig force-pushed the yagiz/add-binary-length-base64 branch from 24aacbe to 1951bf2 Compare January 5, 2026 16:07
@anonrig anonrig requested review from erikcorry and lemire January 5, 2026 16:07
@anonrig anonrig force-pushed the yagiz/add-binary-length-base64 branch from 1951bf2 to e4afd9a Compare January 5, 2026 16:17
Comment thread README.md Outdated
```cpp
std::vector<char> buffer(simdutf::binary_length_from_base64(base64.data(), base64.size()));
simdutf::result r = simdutf::base64_to_binary(base64.data(), base64.size(), buffer.data());
if(r.error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check explicitly against SUCCESS, not with implicit bool conversion. Also missing space after 'if'.

Comment thread README.md Outdated
Comment thread README.md Outdated
Comment thread README.md Outdated
Comment thread README.md Outdated
simdutf_warn_unused size_t binary_length_from_base64(const char * input, size_t length) noexcept;

/**
* Compute the binary length from a base64 input with ASCII spaces.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same changes as above.

Comment thread include/simdutf/implementation.h Outdated
#endif // SIMDUTF_SPAN

/**
* Compute the binary length from a base64 input with ASCII spaces.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above.

const char16_t *input, size_t length) const noexcept;

/**
* Compute the binary length from a base64 input with ASCII spaces.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above

binary_length_from_base64(const char *input, size_t length) const noexcept;

/**
* Compute the binary length from a base64 input with ASCII spaces.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above

Comment thread include/simdutf/scalar/base64.h Outdated
Comment thread include/simdutf/scalar/base64.h Outdated
Comment thread tests/base64_tests.cpp

#endif

// Tests for binary_length_from_base64
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the tests should take invalid input if only to verify that there is no crash or buffer overflow. Eg. Something that only consists of '=' and spaces.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but I think that can be added later, if and when vectorization is implemented.

Comment thread include/simdutf/scalar/base64.h Outdated
Comment thread tests/base64_tests.cpp
simdutf::binary_length_from_base64(input16.data(), input16.size()), 3);

std::u16string input16_spaces = u" Y Q = = ";
ASSERT_EQUAL(simdutf::binary_length_from_base64(input16_spaces.data(),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These inputs are so short I don't think they are testing an actual SIMD implementation, since they drop right through to the scalar cleanup. At least I wrote a broken version of the 16 bit version in another PR and the tests passed.

@lemire
Copy link
Copy Markdown
Member

lemire commented Jan 6, 2026

I'll mark this PR as approved. We could merge, but before turning it into a release, I think we want to do more performance work. I don't have time right now, but later...


Performance numbers reproduced from my other PR.


To test this out, do the following:

cmake -B build -D SIMDUTF_BENCHMARKS=ON -D CMAKE_BUILD_TYPE=Release
cmake --build build --target benchmark_base64

Use a test file. Any would not, but you can create one like so:

 base64 -i ./README.md -b 76 > test.base64

Next run the benchmarks, first the decoding benchmark:

./build/benchmarks/base64/benchmark_base64 -d test.base64 -f simdutf

Here -f simdutf applies a filter so we only run the simdutf functions.

Then benchmark the length functions:

 ./build/benchmarks/base64/benchmark_base64 -L test.base64

Here is what I get on my macbook...

./build/benchmarks/base64/benchmark_base64 -d test.base64 -f simdutf

# current system detected as arm64.
# loading files: .
# volume: 182408 bytes
# max length: 182408 bytes
# number of inputs: 1
# decode
# the base64 data contains spaces, so we cannot use straight libbase64::base64_decode directly
simdutf::arm64                                :  15.82 GB/s  7.27 % 
simdutf::arm64 (accept garbage)               :  13.77 GB/s  6.65 % 

 ./build/benchmarks/base64/benchmark_base64 -L test.base64         

# current system detected as arm64.
# loading files: .
# volume: 182409 bytes
# max length: 182409 bytes
# number of inputs: 1
# lengths
# Benchmark only simdutf length functions (maximal and exact)
simdutf::arm64_maximal_binary_length_from_base64 :  10838.88 GB/s  inf % 
simdutf::arm64_binary_length_from_base64      :   8.31 GB/s  3.06 % 

So you see here that maximal_binary_length_from_base64 is effectively free, while binary_length_from_base64 is not.

Suppose you combine the decoding function with the maximal function... then we get 15.82 GB/s (unchanged).

But if you combine it with the new function you get 1/(1/8.31 + 1/15.82) or 5.5 GB/s. That is, we reduce by a factor of three the speed. It is not a small effect.

@pauldreik
Copy link
Copy Markdown
Collaborator

@anonrig before we merge this, I think there should be a span overload and it should be constexpr. ping me if you want help or me to do it. it should be easy.

@anonrig
Copy link
Copy Markdown
Member Author

anonrig commented Jan 6, 2026

@anonrig before we merge this, I think there should be a span overload and it should be constexpr. ping me if you want help or me to do it. it should be easy.

@pauldreik Thanks! I'll add it now.

@erikcorry Would you mind if I follow up on your remaining recommendations after landing this PR? Actually, I'll address them now.

@anonrig anonrig requested a review from pauldreik January 6, 2026 20:45
Copy link
Copy Markdown
Collaborator

@pauldreik pauldreik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest the following test:

commit b504e11f606889bff895d21a85afedc4a9751689 (HEAD -> add-binary-length-base64)
Author: Paul Dreik <[email protected]>
Date:   Tue Jan 6 22:47:38 2026 +0100

    add test for binary_length_from_base64

diff --git a/tests/constexpr_base64_tests.cpp b/tests/constexpr_base64_tests.cpp
index 75324a6e..ed2c90f1 100644
--- a/tests/constexpr_base64_tests.cpp
+++ b/tests/constexpr_base64_tests.cpp
@@ -43,6 +43,35 @@ TEST(compile_time_maximal_binary_length16) {
                 binary.size());
 }
 
+TEST(compile_time_binary_length_from_base64) {
+  using namespace std::string_view_literals;
+
+  // empty input
+  static_assert(simdutf::binary_length_from_base64(""sv) == 0);
+  static_assert(simdutf::binary_length_from_base64(" "sv) == 0);
+
+  // increasing length of "a" repeated
+  static_assert(simdutf::binary_length_from_base64("YQ=="sv) == 1);
+  static_assert(simdutf::binary_length_from_base64("YWE="sv) == 2);
+  static_assert(simdutf::binary_length_from_base64("YWFh"sv) == 3);
+  static_assert(simdutf::binary_length_from_base64("YWFhYQ=="sv) == 4);
+  static_assert(simdutf::binary_length_from_base64("YWFhYWE="sv) == 5);
+
+  // all these are base64 of 'a', mixed with whitespace in different ways
+  constexpr std::array mixedwithspaces{
+      " YQ=="sv,     //
+      "Y Q=="sv,     //
+      "YQ =="sv,     //
+      "YQ= ="sv,     //
+      "YQ== "sv,     //
+      " Y Q = = "sv, //
+      " YQ = ="sv,   //
+  };
+  static_assert(std::ranges::all_of(mixedwithspaces, [](auto s) {
+    return simdutf::binary_length_from_base64(s) == 1;
+  }));
+}
+
 namespace {
 
 template <auto input>

Comment thread include/simdutf/implementation.h Outdated
@anonrig anonrig requested a review from pauldreik January 6, 2026 22:00
@pauldreik
Copy link
Copy Markdown
Collaborator

good, please also add constexpr tests for the utf16 variant!

@anonrig anonrig force-pushed the yagiz/add-binary-length-base64 branch from 6211533 to 32c0869 Compare January 30, 2026 20:16
@anonrig
Copy link
Copy Markdown
Member Author

anonrig commented Jan 30, 2026

Sorry for the late response. I've added the last remaining tests, rebased and force pushed. I think this pull-request is good to go. @lemire would you mind landing this pull-request if you think it's good to go (and when tests succeed)?

@lemire
Copy link
Copy Markdown
Member

lemire commented Jan 30, 2026

@anonrig I am happy to approve this. Obviously, it will require more work to get the optimized performance, but it is fine to do it as follow up PRs. I see that @pauldreik has a 'requested changes' status and I think that @erikcorry had comments, so I will approve but give time to these fine folks to react before we merge for good.

Merging early next week if nobody has anything to add ?

Copy link
Copy Markdown
Member

@lemire lemire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, but let us make sure @pauldreik is happy before we merge.

Comment thread include/simdutf/implementation.h Outdated
Comment thread include/simdutf/scalar/base64.h Outdated
Comment thread tests/base64_tests.cpp

#endif

// Tests for binary_length_from_base64
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but I think that can be added later, if and when vectorization is implemented.

Comment thread README.md
* @param length the length of the base64 input in 16-bit units
* @return number of binary bytes
*/
simdutf_warn_unused size_t binary_length_from_base64(const char16_t * input, size_t length) noexcept;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

But I also think this is a general problem not belonging to this PR - the documentation is duplicated. Do we need the function signatures in the README ? Or can they be auto generated/validated.

@anonrig anonrig requested a review from pauldreik February 9, 2026 12:34
@anonrig
Copy link
Copy Markdown
Member Author

anonrig commented Feb 27, 2026

Can we merge this? Sorry I haven't been super active with this pull-request.

@pauldreik
Copy link
Copy Markdown
Collaborator

Can we merge this? Sorry I haven't been super active with this pull-request.

yes!

@anonrig anonrig merged commit 6f5cefb into master Feb 28, 2026
108 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement binary_length_from_base64

4 participants