add convert_utf16_to_utf8_with_replacement by mertcanaltin · Pull Request #936 · simdutf/simdutf

mertcanaltin · 2026-02-07T12:06:21Z

utf8_length_from_utf16_with_replacement was already added in v7.7.0, but the actual conversion function was missing. This PR adds it: the new function converts directly to UTF-8 in a single call, without an intermediate buffer, replacing broken surrogates with U+FFFD.
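For readers unfamiliar with the replacement behaviour, here is a minimal scalar sketch of the idea. It is not simdutf's implementation or API: the function name `utf16_to_utf8_lossy` is hypothetical, the input is assumed to be in native byte order, and the output goes into a `std::string` rather than a caller-provided buffer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical scalar reference, NOT simdutf's implementation:
// converts native-endian UTF-16 to UTF-8 and emits U+FFFD (EF BF BD)
// for every unpaired surrogate.
std::string utf16_to_utf8_lossy(const char16_t *in, std::size_t len) {
  std::string out;
  for (std::size_t i = 0; i < len; ++i) {
    char32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
      if (i + 1 < len && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
        // Valid pair: combine both halves into one code point.
        cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
        ++i;  // consume the low half
      } else {
        cp = 0xFFFD;  // lone high surrogate
      }
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
      cp = 0xFFFD;  // lone low surrogate
    }
    // Standard UTF-8 encoding of the resulting code point.
    if (cp < 0x80) {
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

With this sketch, the pair {0xD83D, 0xDE00} becomes the four bytes F0 9F 98 80, while a lone 0xD83D becomes the three bytes EF BF BD.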

In Node.js's encoding_binding.cc, in places such as TextEncoder.encodeInto, the string received from the user arrives as UTF-16 but may contain corrupted (unpaired) surrogates. It has to be converted to UTF-8, but convert_utf16_to_utf8 reports failure when it encounters a corrupted surrogate.

- node/src/encoding_binding.cc line 144
- node/src/encoding_binding.cc line 283

@anonrig @lemire

mertcanaltin · 2026-02-07T15:53:12Z

I will look into the Ubuntu s390x failure.

lemire · 2026-02-07T18:16:38Z

I think that's a great PR. We'll just have to continue the work in later PRs by providing optimized functions (per kernel). We should also add it to the fuzzer. This can be done later.

mertcanaltin · 2026-02-08T16:32:56Z

The s390x failure was a big-endian issue in the tests. The hardcoded char16_t values (like {0xD83D, 0xDE00}) are stored in native byte order, but convert_utf16le_to_utf8_with_replacement expects little-endian data in memory. On s390x the bytes are swapped, so the function sees regular BMP characters instead of a surrogate pair. Fixed by wrapping test inputs with to_utf16le(), same pattern used by the other UTF-16LE tests.
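The byte-order effect described above can be illustrated with a small sketch. The helpers below are hypothetical and only model the byte exchange. They are not simdutf's `to_utf16le()` test helper.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; NOT simdutf's to_utf16le().
constexpr std::uint16_t swap_bytes(std::uint16_t v) {
  return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

constexpr bool is_surrogate(std::uint16_t v) {
  return v >= 0xD800 && v <= 0xDFFF;
}

// What a little-endian decoder reads when a 16-bit value was stored
// in big-endian byte order: the two bytes are exchanged.
constexpr std::uint16_t seen_as_le_on_big_endian(std::uint16_t native) {
  return swap_bytes(native);
}
```

Under this model, 0xD83D is read as 0x3DD8 and 0xDE00 as 0x00DE. Neither value lies in the surrogate range, so the decoder sees two ordinary BMP code units instead of a pair. Swapping the bytes of the test input on big-endian hosts restores the intended values.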

pauldreik

I would like to see at least one constexpr test, so we know it is callable in a constexpr context.
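A constexpr test usually takes the form of a `static_assert` over a call that is evaluated at compile time. The sketch below shows the pattern with a hypothetical stand-in function, `utf8_len_lossy`. It is not simdutf's API and not the test added in this PR. It assumes C++14 or later and native-endian input.

```cpp
#include <cstddef>

// Hypothetical stand-in, NOT simdutf's API: number of UTF-8 bytes needed
// for native-endian UTF-16 input when every unpaired surrogate is
// replaced by U+FFFD (3 bytes in UTF-8).
constexpr std::size_t utf8_len_lossy(const char16_t *in, std::size_t len) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < len; ++i) {
    const char16_t c = in[i];
    if (c < 0x80) {
      n += 1;
    } else if (c < 0x800) {
      n += 2;
    } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < len &&
               in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
      n += 4;  // valid surrogate pair
      ++i;     // consume the low half
    } else {
      n += 3;  // other BMP code unit, or lone surrogate -> U+FFFD
    }
  }
  return n;
}

constexpr char16_t kPair[] = {0xD83D, 0xDE00};
constexpr char16_t kLone[] = {u'A', 0xD83D, u'B'};

// These only compile if the call is usable in a constant expression.
static_assert(utf8_len_lossy(kPair, 2) == 4, "pair encodes to 4 bytes");
static_assert(utf8_len_lossy(kLone, 3) == 5, "1 + 3 (U+FFFD) + 1 bytes");
```

If the function were not usable in a constant expression, the `static_assert` lines would fail to compile, which is what makes such a test useful.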

mertcanaltin · 2026-02-10T07:20:39Z

> I would like to see at least one constexpr test, so we know it is callable in a constexpr context.

Sure, I will send today

lemire · 2026-02-10T15:45:23Z

running tests.

mertcanaltin · 2026-02-10T15:50:12Z

> running tests.

Sorry, I fixed a minor linter error. I think we'll need to restart the tests.

lemire · 2026-02-10T19:46:47Z

Tests restarted.

mertcanaltin · 2026-02-10T20:23:48Z

They all succeeded.

lemire · 2026-02-10T20:26:58Z

Let us give @pauldreik some time to get back to this issue.

pauldreik

This looks good! Just a minor thing, but fine to merge anyway.

mertcanaltin · 2026-02-11T05:44:11Z

Solved, thanks for the review.

mertcanaltin added 2 commits February 7, 2026 14:58

add convert_utf16_to_utf8_with_replacement

4056585

lint

88e23a9

use to_utf16le for input in surrogate pair tests

15dfeca

mertcanaltin commented Feb 9, 2026

Comment thread tests/convert_utf16_to_utf8_with_replacement_tests.cpp Outdated

Update tests/convert_utf16_to_utf8_with_replacement_tests.cpp

a35036e

pauldreik requested changes Feb 10, 2026

add constexpr test

0eefc61

mertcanaltin requested a review from pauldreik February 10, 2026 13:26

lint

6954ecb

lemire approved these changes Feb 10, 2026

pauldreik approved these changes Feb 11, 2026

Comment thread tests/convert_utf16_to_utf8_with_replacement_tests.cpp Outdated

use output size

72e77a3

pauldreik merged commit 600a5c9 into simdutf:master Feb 11, 2026
84 checks passed

BrewTestBot mentioned this pull request Mar 7, 2026

simdutf 8.1.0 Homebrew/homebrew-core#271054

Merged
