Skip to content

add convert_utf16_to_utf8_with_replacement#936

Merged
pauldreik merged 7 commits intosimdutf:masterfrom
mertcanaltin:mert/convert-utf16-to-utf8-with-replacement
Feb 11, 2026
Merged

add convert_utf16_to_utf8_with_replacement#936
pauldreik merged 7 commits intosimdutf:masterfrom
mertcanaltin:mert/convert-utf16-to-utf8-with-replacement

Conversation

@mertcanaltin
Copy link
Copy Markdown
Member

utf8_length_from_utf16_with_replacement was already added in v7.7.0, but the actual conversion function was missing. This PR adds it — converting directly to UTF-8 in a single call, without an intermediate buffer, by replacing broken surrogates with U+FFFD.

In Node.js, in encoding_binding.cc, in places like TextEncoder.encodeInto and similar locations, the string received from the user arrives in UTF-16 format, but it may contain corrupted surrogates You need to convert this to UTF-8, but convert_utf16_to_utf8 throws an error when it encounters a corrupted surrogate.

- node/src/encoding_binding.cc line 144
- node/src/encoding_binding.cc line 283

@anonrig @lemire

@mertcanaltin
Copy link
Copy Markdown
Member Author

I will inspect ubuntu s390x problem

@lemire
Copy link
Copy Markdown
Member

lemire commented Feb 7, 2026

I think that's a great PR. We'll just have to continue the work with later PRs by providng optimized functions (per kernel). We should also add it to the fuzzer. This can be done later.

@mertcanaltin
Copy link
Copy Markdown
Member Author

The s390x failure was a big-endian issue in the tests. The hardcoded char16_t values (like {0xD83D, 0xDE00}) are stored in native byte order, but convert_utf16le_to_utf8_with_replacement expects little-endian data in memory. On s390x the bytes are swapped, so the function sees regular BMP characters instead of a surrogate pair. Fixed by wrapping test inputs with to_utf16le(), same pattern used by the other UTF-16LE tests.

Comment thread tests/convert_utf16_to_utf8_with_replacement_tests.cpp Outdated
Copy link
Copy Markdown
Collaborator

@pauldreik pauldreik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see at least one constexpr test, so we know it is callable in a constexpr context.

@mertcanaltin
Copy link
Copy Markdown
Member Author

I would like to see at least one constexpr test, so we know it is callable in a constexpr context.

Sure, I will send today

@lemire
Copy link
Copy Markdown
Member

lemire commented Feb 10, 2026

running tests.

@mertcanaltin
Copy link
Copy Markdown
Member Author

running tests.

Sorry, I fixed a minor linter error. I think We'll need to restart the tests.

@lemire
Copy link
Copy Markdown
Member

lemire commented Feb 10, 2026

Tests restarted.

@mertcanaltin
Copy link
Copy Markdown
Member Author

They all succeeded.

@lemire
Copy link
Copy Markdown
Member

lemire commented Feb 10, 2026

Let us give @pauldreik some time to get back to this issue.

Copy link
Copy Markdown
Collaborator

@pauldreik pauldreik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good! just a minor thing, but fine to merge anyway.

Comment thread tests/convert_utf16_to_utf8_with_replacement_tests.cpp Outdated
@mertcanaltin
Copy link
Copy Markdown
Member Author

Solved, thanks for review

@pauldreik pauldreik merged commit 600a5c9 into simdutf:master Feb 11, 2026
84 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants