add simdutf::binary_length_from_base64#887
Conversation
|
@anonrig I like @erikcorry's comment here tremendously. But we should then not take an option parameter because, among the possible options, is that we ignore everythings (as Node.js's atob does). Further, it is clear form what @erikcorry is describing that we do no validation of any kind... So we would just say something like 'we return the exact output size assuming that the base64 input is valid [obviously] and ignoring ASCII white-space characters.' With Erik's simplified approach, we can even implement accelerated functions in the various kernels as needed, and it becomes a SIMD programming 101 problem (beat the autovectorizer). (Note that the exact function can be done sanely within our kernels without a PhD in SIMD programming... but it is going to involve non trivial arithmetic... we already have the code... but it is still going to be slower than what Erik is proposing we implement.) |
|
This PR does not have to include my comment above about how this is often going to be a suboptimal idea. But I will want to throw it in later because I think we need have people be aware of the tradeoff. (This function here is fine as long as the engineer is aware of the tradeoff.) |
|
@anonrig Warning: Paul (@pauldreik) turned everything constexpr so this may interact badly with this PR. |
|
@lemire No worries. I'll rebase and address the reviews and ready the PR for the next round of reviews. |
24aacbe to
1951bf2
Compare
1951bf2 to
e4afd9a
Compare
| ```cpp | ||
| std::vector<char> buffer(simdutf::binary_length_from_base64(base64.data(), base64.size())); | ||
| simdutf::result r = simdutf::base64_to_binary(base64.data(), base64.size(), buffer.data()); | ||
| if(r.error) { |
There was a problem hiding this comment.
Check explicitly against SUCCESS, not with implicit bool conversion. Also missing space after 'if'.
| simdutf_warn_unused size_t binary_length_from_base64(const char * input, size_t length) noexcept; | ||
|
|
||
| /** | ||
| * Compute the binary length from a base64 input with ASCII spaces. |
| #endif // SIMDUTF_SPAN | ||
|
|
||
| /** | ||
| * Compute the binary length from a base64 input with ASCII spaces. |
| const char16_t *input, size_t length) const noexcept; | ||
|
|
||
| /** | ||
| * Compute the binary length from a base64 input with ASCII spaces. |
| binary_length_from_base64(const char *input, size_t length) const noexcept; | ||
|
|
||
| /** | ||
| * Compute the binary length from a base64 input with ASCII spaces. |
|
|
||
| #endif | ||
|
|
||
| // Tests for binary_length_from_base64 |
There was a problem hiding this comment.
Some of the tests should take invalid input if only to verify that there is no crash or buffer overflow. Eg. Something that only consists of '=' and spaces.
There was a problem hiding this comment.
I agree but I think that can be added later, if and when vectorization is implemented.
| simdutf::binary_length_from_base64(input16.data(), input16.size()), 3); | ||
|
|
||
| std::u16string input16_spaces = u" Y Q = = "; | ||
| ASSERT_EQUAL(simdutf::binary_length_from_base64(input16_spaces.data(), |
There was a problem hiding this comment.
These inputs are so short I don't think they are testing an actual SIMD implementation, since they drop right through to the scalar cleanup. At least I wrote a broken version of the 16 bit version in another PR and the tests passed.
|
I'll mark this PR as approved. We could merge, but before turning it into a release, I think we want to do more performance work. I don't have time right now, but later... Performance numbers reproduced from my other PR. To test this out, do the following: Use a test file. Any would not, but you can create one like so: Next run the benchmarks, first the decoding benchmark: Here Then benchmark the length functions: Here is what I get on my macbook... So you see here that Suppose you combine the decoding function with the maximal function... then we get But if you combine it with the new function you get 1/(1/8.31 + 1/15.82) or |
|
@anonrig before we merge this, I think there should be a span overload and it should be constexpr. ping me if you want help or me to do it. it should be easy. |
@pauldreik Thanks! I'll add it now. @erikcorry |
pauldreik
left a comment
There was a problem hiding this comment.
I suggest the following test:
commit b504e11f606889bff895d21a85afedc4a9751689 (HEAD -> add-binary-length-base64)
Author: Paul Dreik <[email protected]>
Date: Tue Jan 6 22:47:38 2026 +0100
add test for binary_length_from_base64
diff --git a/tests/constexpr_base64_tests.cpp b/tests/constexpr_base64_tests.cpp
index 75324a6e..ed2c90f1 100644
--- a/tests/constexpr_base64_tests.cpp
+++ b/tests/constexpr_base64_tests.cpp
@@ -43,6 +43,35 @@ TEST(compile_time_maximal_binary_length16) {
binary.size());
}
+TEST(compile_time_binary_length_from_base64) {
+ using namespace std::string_view_literals;
+
+ // empty input
+ static_assert(simdutf::binary_length_from_base64(""sv) == 0);
+ static_assert(simdutf::binary_length_from_base64(" "sv) == 0);
+
+ // increasing length of "a" repeated
+ static_assert(simdutf::binary_length_from_base64("YQ=="sv) == 1);
+ static_assert(simdutf::binary_length_from_base64("YWE="sv) == 2);
+ static_assert(simdutf::binary_length_from_base64("YWFh"sv) == 3);
+ static_assert(simdutf::binary_length_from_base64("YWFhYQ=="sv) == 4);
+ static_assert(simdutf::binary_length_from_base64("YWFhYWE="sv) == 5);
+
+ // all these are base64 of 'a', mixed with whitespace in different ways
+ constexpr std::array mixedwithspaces{
+ " YQ=="sv, //
+ "Y Q=="sv, //
+ "YQ =="sv, //
+ "YQ= ="sv, //
+ "YQ== "sv, //
+ " Y Q = = "sv, //
+ " YQ = ="sv, //
+ };
+ static_assert(std::ranges::all_of(mixedwithspaces, [](auto s) {
+ return simdutf::binary_length_from_base64(s) == 1;
+ }));
+}
+
namespace {
template <auto input>|
good, please also add constexpr tests for the utf16 variant! |
Co-authored-by: Erik Corry <[email protected]>
6211533 to
32c0869
Compare
|
Sorry for the late response. I've added the last remaining tests, rebased and force pushed. I think this pull-request is good to go. @lemire would you mind landing this pull-request if you think it's good to go (and when tests succeed)? |
|
@anonrig I am happy to approve this. Obviously, it will require more work to get the optimized performance, but it is fine to do it as follow up PRs. I see that @pauldreik has a 'requested changes' status and I think that @erikcorry had comments, so I will approve but give time to these fine folks to react before we merge for good. Merging early next week if nobody has anything to add ? |
lemire
left a comment
There was a problem hiding this comment.
Approving, but let us make sure @pauldreik is happy before we merge.
|
|
||
| #endif | ||
|
|
||
| // Tests for binary_length_from_base64 |
There was a problem hiding this comment.
I agree but I think that can be added later, if and when vectorization is implemented.
| * @param length the length of the base64 input in 16-bit units | ||
| * @return number of binary bytes | ||
| */ | ||
| simdutf_warn_unused size_t binary_length_from_base64(const char16_t * input, size_t length) noexcept; |
There was a problem hiding this comment.
I agree.
But I also think this is a general problem not belonging to this PR - the documentation is duplicated. Do we need the function signatures in the README ? Or can they be auto generated/validated.
|
Can we merge this? Sorry I haven't been super active with this pull-request. |
yes! |
Fixes #867