add convert_utf16_to_utf8_with_replacement#936
Conversation
|
I will inspect ubuntu s390x problem |
|
I think that's a great PR. We'll just have to continue the work with later PRs by providng optimized functions (per kernel). We should also add it to the fuzzer. This can be done later. |
|
The s390x failure was a big-endian issue in the tests. The hardcoded char16_t values (like {0xD83D, 0xDE00}) are stored in native byte order, but convert_utf16le_to_utf8_with_replacement expects little-endian data in memory. On s390x the bytes are swapped, so the function sees regular BMP characters instead of a surrogate pair. Fixed by wrapping test inputs with to_utf16le(), same pattern used by the other UTF-16LE tests. |
pauldreik
left a comment
There was a problem hiding this comment.
I would like to see at least one constexpr test, so we know it is callable in a constexpr context.
Sure, I will send today |
|
running tests. |
Sorry, I fixed a minor linter error. I think We'll need to restart the tests. |
|
Tests restarted. |
|
They all succeeded. |
|
Let us give @pauldreik some time to get back to this issue. |
pauldreik
left a comment
There was a problem hiding this comment.
this looks good! just a minor thing, but fine to merge anyway.
|
Solved, thanks for review |
utf8_length_from_utf16_with_replacement was already added in v7.7.0, but the actual conversion function was missing. This PR adds it — converting directly to UTF-8 in a single call, without an intermediate buffer, by replacing broken surrogates with U+FFFD.
In Node.js, in encoding_binding.cc, in places like TextEncoder.encodeInto and similar locations, the string received from the user arrives in UTF-16 format, but it may contain corrupted surrogates You need to convert this to UTF-8, but convert_utf16_to_utf8 throws an error when it encounters a corrupted surrogate.
- node/src/encoding_binding.cc line 144
- node/src/encoding_binding.cc line 283
@anonrig @lemire