Reconcile UTF-8 behavior in `utf8ToWideCharParser.cpp`

Outcomes from #3320... (this is all related to `utf8ToWideCharParser.cpp`)

1. When we encounter obviously invalid UTF-8 (wrong number of continuation bytes for a lead, lead without continuations, etc.), we straight up discard them from the stream in `_InvolvedParse`.
- We need to consider replacing them with U+FFFD or having the option to do so.
2. When we get not-obviously invalid UTF-8 (non-minimal forms like 0xC0 0x80 for null), we didn't detect and remove those.
- We now (as of #3320) replace them with U+FFFD, though that's inconsistent with eating them above. This was chosen as a quick fix to be easy to port into the Windows 20H1 cycle during our bug fixing phase.
- We should reconcile this difference with part 1 above (being able to toggle both of them on or off)
3. We seem to need to walk though this entire sequence multiple to many times inside our code and then again inside the kernel as the `MultiByteToWideChar` call eventually thunks to `RtlUTF8ToUnicodeN`.
- The reason we have to walk through it ourselves is to attempt to preserve partial sequences that just happen to fall across a call boundary. If the client application emits one UTF-8 character at a time, we still want to aggregate them and turn it into the correct result if they properly gave us "3-sequence-lead, continuation, continuation" in 3 separate calls. `MultiByteToWideChar` will not do this. It will either error (MB_ERR_INVALID_CHARS) or replace (U+FFFD).
- The call to `RtlUTF8ToUnicodeN` is a kernel syscall on a hot path in our codebase that is quite probably slowing us down. 

Given that we're already doing almost everything required to understand the UTF-8 sequence such that we can store partials across calls, perhaps we just add the final two steps of:
1. Detecting non-minimal forms of UTF-8 as invalid
2. Just doing the conversion to UTF-8 ourselves in user mode as it's just an algorithmic translation and we're already most of the way there bit twiddling to identify sequence length and continuations

This issue represents investigating:
1. Is there a serious performance detriment to the kernel syscall for UTF-8 conversion that will start biting us especially hard as more and more things need UTF-8 <--> UTF-16?
2. How hard is it for us to detect the non-minimal forms?
3. Can we just finish the algorithm on our side in user mode relatively simply?
4. Can we decide on a replacement strategy and make it consistent between clearly invalid sequences and non-minimal forms?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reconcile UTF-8 behavior in `utf8ToWideCharParser.cpp` #3378

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Reconcile UTF-8 behavior in utf8ToWideCharParser.cpp #3378

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Reconcile UTF-8 behavior in `utf8ToWideCharParser.cpp` #3378