
Reconcile UTF-8 behavior in utf8ToWideCharParser.cpp #3378

@miniksa


Outcomes from #3320... (this is all related to utf8ToWideCharParser.cpp)

  1. When we encounter obviously invalid UTF-8 (wrong number of continuation bytes for a lead, lead without continuations, etc.), we straight up discard it from the stream in _InvolvedParse.
  • We need to consider replacing it with U+FFFD, or having the option to do so.
  2. When we get not-obviously invalid UTF-8 (non-minimal forms like 0xC0 0x80 for null), we don't detect and remove it.
  3. We seem to walk through the entire sequence multiple times inside our code, and then it gets walked again inside the kernel as the MultiByteToWideChar call eventually thunks to RtlUTF8ToUnicodeN.
  • The reason we have to walk through it ourselves is to preserve partial sequences that happen to fall across a call boundary. If the client application emits one UTF-8 byte at a time, we still want to aggregate the bytes and produce the correct result when it properly gave us "3-sequence-lead, continuation, continuation" in 3 separate calls. MultiByteToWideChar will not do this. It will either error (MB_ERR_INVALID_CHARS) or replace (U+FFFD).
  • The call to RtlUTF8ToUnicodeN is a kernel syscall on a hot path in our codebase that is quite probably slowing us down.
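To make the "partial sequence across a call boundary" case concrete, here is a minimal sketch of the hold-back idea. `Utf8ChunkBuffer` and its members are hypothetical names for illustration, not the actual Utf8ToWideCharParser code: it only decides how many trailing bytes to keep for the next call and does no validation of its own.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not the real Utf8ToWideCharParser): holds back an
// incomplete trailing UTF-8 sequence so it can be completed by the next call.
class Utf8ChunkBuffer
{
public:
    // Returns the bytes that are safe to convert now.
    std::string Feed(const std::string& chunk)
    {
        std::string data = _pending;
        data.append(chunk);
        _pending.clear();

        const size_t tail = _IncompleteTailLength(data);
        _pending.assign(data, data.size() - tail, tail);
        data.resize(data.size() - tail);
        assert(_pending.size() < 4); // a partial sequence is at most 3 bytes
        return data;
    }

private:
    // Length implied by a lead byte, or 0 for a continuation/invalid byte.
    static size_t _SequenceLength(unsigned char lead)
    {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 0;
    }

    // Looks back at most 3 bytes for a lead byte whose sequence is still
    // missing continuation bytes; returns how many bytes to hold back.
    static size_t _IncompleteTailLength(const std::string& data)
    {
        const size_t limit = data.size() < 3 ? data.size() : 3;
        for (size_t back = 1; back <= limit; ++back)
        {
            const auto byte = static_cast<unsigned char>(data[data.size() - back]);
            if ((byte & 0xC0) == 0x80) continue; // continuation, keep looking
            const size_t need = _SequenceLength(byte);
            return need > back ? back : 0;
        }
        return 0;
    }

    std::string _pending;
};
```

Feeding the three bytes of U+20AC (0xE2 0x82 0xAC) in three separate calls returns nothing for the first two calls and the whole sequence on the third. Invalid bytes pass straight through, so whatever converter runs next still has to deal with them.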

Given that we're already doing almost everything required to understand the UTF-8 sequence such that we can store partials across calls, perhaps we just add the final two steps of:

  1. Detecting non-minimal forms of UTF-8 as invalid
  2. Just doing the conversion to UTF-16 ourselves in user mode, as it's just an algorithmic translation and we're already most of the way there with the bit twiddling to identify sequence length and continuations
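As a rough illustration of those two steps, the sketch below decodes one sequence, rejects non-minimal forms by comparing the decoded value against the smallest code point that legitimately needs that length, and then emits UTF-16. `DecodeOne`, `AppendUtf16`, and `kInvalid` are made-up names for this sketch, and it assumes the caller has already dealt with partial sequences at the end of the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

constexpr char32_t kInvalid = 0xFFFFFFFF; // hypothetical "not a code point" sentinel

// Decodes the sequence starting at data[pos]. On success, advances pos past
// the sequence and returns the code point. On failure, returns kInvalid and
// leaves pos unchanged.
char32_t DecodeOne(const std::string& data, size_t& pos)
{
    assert(pos < data.size());
    const auto lead = static_cast<unsigned char>(data[pos]);

    size_t length = 0;
    char32_t cp = 0;
    char32_t minimum = 0; // smallest code point allowed for this length
    if (lead < 0x80)                { length = 1; cp = lead;        minimum = 0; }
    else if ((lead & 0xE0) == 0xC0) { length = 2; cp = lead & 0x1F; minimum = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { length = 3; cp = lead & 0x0F; minimum = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { length = 4; cp = lead & 0x07; minimum = 0x10000; }
    else return kInvalid; // stray continuation byte or invalid lead

    if (data.size() - pos < length) return kInvalid; // truncated sequence
    for (size_t i = 1; i < length; ++i)
    {
        const auto byte = static_cast<unsigned char>(data[pos + i]);
        if ((byte & 0xC0) != 0x80) return kInvalid; // not a continuation byte
        cp = (cp << 6) | (byte & 0x3F);
    }

    if (cp < minimum) return kInvalid;                  // non-minimal form
    if (cp > 0x10FFFF) return kInvalid;                 // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return kInvalid;  // encoded surrogate

    pos += length;
    return cp;
}

// Appends a valid code point to a UTF-16 string, using a surrogate pair
// for anything above the Basic Multilingual Plane.
void AppendUtf16(std::u16string& out, char32_t cp)
{
    if (cp < 0x10000)
    {
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}
```

With this shape, 0xC0 0x80 decodes to 0, which is below the 2-byte minimum of 0x80, so it is rejected the same way as any other malformed input.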

This issue represents investigating:

  1. Is there a serious performance detriment to the kernel syscall for UTF-8 conversion that will start biting us especially hard as more and more things need UTF-8 <--> UTF-16?
  2. How hard is it for us to detect the non-minimal forms?
  3. Can we just finish the algorithm on our side in user mode relatively simply?
  4. Can we decide on a replacement strategy and make it consistent between clearly invalid sequences and non-minimal forms?
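For question 4, one possible shape is a single well-formedness check that covers both clearly invalid and non-minimal input, with the discard-or-replace decision made in exactly one place. The sketch below is an assumption, not existing conhost code: `InvalidPolicy`, `WellFormedLength`, and `Sanitize` are hypothetical names, and it emits one U+FFFD per rejected byte, which is only one of several defensible granularities. The second-byte ranges follow the Unicode Standard's table of well-formed UTF-8 byte sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class InvalidPolicy { Discard, Replace }; // hypothetical policy switch

// Returns the length of the well-formed UTF-8 sequence at data[pos], or 0 if
// the bytes there are malformed, truncated, non-minimal, or a surrogate.
size_t WellFormedLength(const std::string& data, size_t pos)
{
    assert(pos < data.size());
    const auto b0 = static_cast<unsigned char>(data[pos]);
    if (b0 < 0x80) return 1;

    size_t length = 0;
    unsigned char lo = 0x80, hi = 0xBF; // allowed range for the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF)
    {
        length = 2; // 0xC0 and 0xC1 can only start non-minimal forms
    }
    else if (b0 >= 0xE0 && b0 <= 0xEF)
    {
        length = 3;
        if (b0 == 0xE0) lo = 0xA0; // below this is non-minimal
        if (b0 == 0xED) hi = 0x9F; // above this is a surrogate
    }
    else if (b0 >= 0xF0 && b0 <= 0xF4)
    {
        length = 4;
        if (b0 == 0xF0) lo = 0x90; // below this is non-minimal
        if (b0 == 0xF4) hi = 0x8F; // above this exceeds U+10FFFF
    }
    else
    {
        return 0; // 0x80..0xC1 and 0xF5..0xFF are never valid leads
    }

    if (data.size() - pos < length) return 0;
    for (size_t i = 1; i < length; ++i)
    {
        const auto b = static_cast<unsigned char>(data[pos + i]);
        if (b < lo || b > hi) return 0;
        lo = 0x80; // later continuation bytes use the full range
        hi = 0xBF;
    }
    return length;
}

// Copies well-formed sequences through; every rejected byte is either dropped
// or replaced with U+FFFD (0xEF 0xBF 0xBD in UTF-8), depending on the policy.
std::string Sanitize(const std::string& data, InvalidPolicy policy)
{
    std::string out;
    size_t pos = 0;
    while (pos < data.size())
    {
        const size_t length = WellFormedLength(data, pos);
        if (length != 0)
        {
            out.append(data, pos, length);
            pos += length;
            continue;
        }
        if (policy == InvalidPolicy::Replace) out.append("\xEF\xBF\xBD");
        ++pos;
    }
    return out;
}
```

Because both kinds of bad input fall out of the same `WellFormedLength` check, switching between the current discard behavior and U+FFFD replacement does not need separate code paths for non-minimal forms.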


    Labels

    • Area-Output: Related to output processing (inserting text into buffer, retrieving buffer text, etc.)
    • Issue-Task: It's a feature request, but it doesn't really need a major design.
    • Needs-Tag-Fix: Doesn't match tag requirements
    • Product-Conhost: For issues in the Console codebase
    • Resolution-Fix-Committed: Fix is checked in, but it might be 3-4 weeks until a release.
