-
Notifications
You must be signed in to change notification settings - Fork 9.1k
Closed
Labels
Area-OutputRelated to output processing (inserting text into buffer, retrieving buffer text, etc.)Related to output processing (inserting text into buffer, retrieving buffer text, etc.)Issue-TaskIt's a feature request, but it doesn't really need a major design.It's a feature request, but it doesn't really need a major design.Needs-Tag-FixDoesn't match tag requirementsDoesn't match tag requirementsProduct-ConhostFor issues in the Console codebaseFor issues in the Console codebaseResolution-Fix-CommittedFix is checked in, but it might be 3-4 weeks until a release.Fix is checked in, but it might be 3-4 weeks until a release.
Milestone
Description
Outcomes from #3320... (this is all related to utf8ToWideCharParser.cpp)
- When we encounter obviously invalid UTF-8 (wrong number of continuation bytes for a lead, lead without continuations, etc.), we straight up discard them from the stream in
_InvolvedParse.
- We need to consider replacing them with U+FFFD or having the option to do so.
- When we get not-obviously invalid UTF-8 (non-minimal forms like 0xC0 0x80 for null), we didn't detect and remove those.
- We now (as of Writing random output to console output handle fails with no last error set #3320) replace them with U+FFFD, though that's inconsistent with eating them above. This was chosen as a quick fix to be easy to port into the Windows 20H1 cycle during our bug fixing phase.
- We should reconcile this difference with part 1 above (being able to toggle both of them on or off)
- We seem to need to walk though this entire sequence multiple to many times inside our code and then again inside the kernel as the
MultiByteToWideCharcall eventually thunks toRtlUTF8ToUnicodeN.
- The reason we have to walk through it ourselves is to attempt to preserve partial sequences that just happen to fall across a call boundary. If the client application emits one UTF-8 character at a time, we still want to aggregate them and turn it into the correct result if they properly gave us "3-sequence-lead, continuation, continuation" in 3 separate calls.
MultiByteToWideCharwill not do this. It will either error (MB_ERR_INVALID_CHARS) or replace (U+FFFD). - The call to
RtlUTF8ToUnicodeNis a kernel syscall on a hot path in our codebase that is quite probably slowing us down.
Given that we're already doing almost everything required to understand the UTF-8 sequence such that we can store partials across calls, perhaps we just add the final two steps of:
- Detecting non-minimal forms of UTF-8 as invalid
- Just doing the conversion to UTF-8 ourselves in user mode as it's just an algorithmic translation and we're already most of the way there bit twiddling to identify sequence length and continuations
This issue represents investigating:
- Is there a serious performance detriment to the kernel syscall for UTF-8 conversion that will start biting us especially hard as more and more things need UTF-8 <--> UTF-16?
- How hard is it for us to detect the non-minimal forms?
- Can we just finish the algorithm on our side in user mode relatively simply?
- Can we decide on a replacement strategy and make it consistent between clearly invalid sequences and non-minimal forms?
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
Area-OutputRelated to output processing (inserting text into buffer, retrieving buffer text, etc.)Related to output processing (inserting text into buffer, retrieving buffer text, etc.)Issue-TaskIt's a feature request, but it doesn't really need a major design.It's a feature request, but it doesn't really need a major design.Needs-Tag-FixDoesn't match tag requirementsDoesn't match tag requirementsProduct-ConhostFor issues in the Console codebaseFor issues in the Console codebaseResolution-Fix-CommittedFix is checked in, but it might be 3-4 weeks until a release.Fix is checked in, but it might be 3-4 weeks until a release.