ReadConsoleW fails with non-BMP characters #4628

@eryksun

Environment

Microsoft Windows [Version 10.0.18363.657]
conhost.exe builtin console, V2
wt.exe terminal, V0.9.433.0

Steps to reproduce

readsp.zip

Extract, compile, and run the attached readsp.c program under the V2 console. This program exercises writing a non-BMP character directly to the input buffer via WriteConsoleInputW and reading it back via ReadConsoleW, first with echo enabled and then with it disabled. Run the program with -v (e.g. readsp -v) to show the input key-event records that each step tries to read. It tries a normal key down/up event pair as well as the Alt+Numpad sequence that the console uses for pasted text. The latter uses 6 key events per wide character and thus 12 key events for a surrogate pair. I included the paste sequence to try to clarify a related issue in which manually pasting a non-BMP character produces a different incorrect result, but it didn't help. I'll discuss that related issue in a comment, in case it's all due to the same underlying issue.

Expected behavior

ReadConsoleW should be able to correctly read supplementary-plane (i.e. non-BMP) characters such as "😞" (U+1F61E), regardless of whether they are typed or pasted into the terminal window, or written directly to the input buffer, or whether echo is enabled. Since the wide-character API uses 16-bit characters, the non-BMP character should be read as a UTF-16 surrogate pair, e.g. U+1F61E should be encoded as {0xD83D, 0xDE1E}.
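To make the expected encoding concrete, here is a minimal sketch of standard UTF-16 encoding for a single code point. The function name `utf16_encode` is made up for illustration and is not taken from readsp.c.

```c
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-16 (illustrative helper, not part
 * of readsp.c). Writes 1 or 2 code units to out and returns that count,
 * or returns 0 if cp is not a valid Unicode scalar value.
 */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Reject values beyond U+10FFFF and the surrogate range itself. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    /* BMP characters fit in a single 16-bit code unit. */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Supplementary planes: split the 20-bit offset into two 10-bit halves. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low (trail) surrogate */
    return 2;
}
```

Calling `utf16_encode(0x1F61E, units)` returns 2 and fills `units` with {0xD83D, 0xDE1E}, which is the pair ReadConsoleW is expected to return.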

ReadConsoleW works as expected with the legacy (V1) console. For example:

Test normal with ECHO ON
😞
stream (4): L"\ud83d\ude1e\u000d\u000a"
screen: L"\ud83d\ude1e        "

Test paste with ECHO ON
😞
stream (4): L"\ud83d\ude1e\u000d\u000a"
screen: L"\ud83d\ude1e        "

Test normal with ECHO OFF
stream (4): L"\ud83d\ude1e\u000d\u000a"

Test paste with ECHO OFF
stream (4): L"\ud83d\ude1e\u000d\u000a"
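The "stream" and "screen" lines show each UTF-16 code unit as a `\uXXXX` escape. readsp.c is attached rather than quoted here, so the following is only an assumed sketch of how such a line could be formatted; `format_utf16` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format buf[0..len) as a C wide-string literal with every code unit
 * escaped, e.g. L"\ud83d\ude1e". Hypothetical helper; the real readsp.c
 * may format its output differently.
 * Returns the number of characters written (excluding the terminating
 * NUL), or -1 if dst is too small.
 */
int format_utf16(const uint16_t *buf, size_t len, char *dst, size_t dst_size)
{
    /* L" (2) + 6 chars per code unit + closing quote (1) + NUL (1). */
    size_t need = 2 + 6 * len + 1 + 1;
    if (dst_size < need)
        return -1;

    char *p = dst;
    p += sprintf(p, "L\"");
    for (size_t i = 0; i < len; i++) {
        /* Each escape is exactly 6 characters: backslash, 'u', 4 hex digits. */
        p += snprintf(p, 7, "\\u%04x", (unsigned)buf[i]);
    }
    p += sprintf(p, "\"");
    return (int)(p - dst);
}
```

Escaping every code unit, including printable ones, keeps lone surrogates and control characters such as CR and LF visible in the log.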

It almost works correctly with Windows Terminal version 0.9.433.0:

Test normal with ECHO ON
��
stream (4) = L"\ud83d\ude1e\u000d\u000a"
screen = L"\ufffd\ufffd        "

Test paste with ECHO ON
��
stream (4) = L"\ud83d\ude1e\u000d\u000a"
screen = L"\ufffd\ufffd        "

Test normal with ECHO OFF
stream (4) = L"\ud83d\ude1e\u000d\u000a"

Test paste with ECHO OFF
stream (4) = L"\ud83d\ude1e\u000d\u000a"

Apparently a cooked read under Windows Terminal has a bug in which a non-BMP character gets echoed as two replacement characters, U+FFFD. But at least the ReadConsoleW result is correct.

Actual behavior

In the V2 console output below, not only does the cooked read fail with ERROR_INVALID_PARAMETER (87) when echo is enabled, but the echoed text contains only the first (high) surrogate code of the surrogate pair, 0xD83D.

Test normal with ECHO ON
�
ReadConsoleW failed (87)
screen: L"\ud83d         "

Test paste with ECHO ON
�
ReadConsoleW failed (87)
screen: L"\ud83d         "

Test normal with ECHO OFF
stream (4): L"\ud83d\ude1e\u000d\u000a"

Test paste with ECHO OFF
stream (4): L"\ud83d\ude1e\u000d\u000a"

Since a lone surrogate is not a valid Unicode character, I've replaced it in the output pasted above with the Unicode replacement character, U+FFFD. The "screen" text, which gets read directly from the screen buffer, shows that the code actually displayed on the console is 0xD83D.
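For reference, substituting U+FFFD for an unpaired surrogate can be sketched as follows. This is an illustrative helper with a made-up name, `replace_lone_surrogates`, and is not code from readsp.c or the console.

```c
#include <stddef.h>
#include <stdint.h>

/* High (lead) surrogates occupy 0xD800..0xDBFF. */
static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* Low (trail) surrogates occupy 0xDC00..0xDFFF. */
static int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/*
 * Replace every unpaired surrogate in buf[0..len) with U+FFFD, in place.
 * A high surrogate immediately followed by a low surrogate is a valid
 * pair and is left untouched. Returns the number of code units replaced.
 */
size_t replace_lone_surrogates(uint16_t *buf, size_t len)
{
    size_t replaced = 0;

    for (size_t i = 0; i < len; i++) {
        if (is_high_surrogate(buf[i])) {
            if (i + 1 < len && is_low_surrogate(buf[i + 1])) {
                i++; /* valid pair: skip the low half as well */
                continue;
            }
            buf[i] = 0xFFFD; /* high surrogate with no partner */
            replaced++;
        } else if (is_low_surrogate(buf[i])) {
            buf[i] = 0xFFFD; /* low surrogate with no preceding high */
            replaced++;
        }
    }
    return replaced;
}
```

Applied to the echoed screen text, which begins with 0xD83D followed by spaces, this replaces exactly one code unit. Applied to the full pair {0xD83D, 0xDE1E}, it changes nothing.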

Metadata

Labels

Area-Server: Down in the muck of API call servicing, interprocess communication, eventing, etc.
In-PR: This issue has a related PR
Issue-Bug: It either shouldn't be doing this or needs an investigation.
Needs-Tag-Fix: Doesn't match tag requirements
Priority-2: A description (P2)
Product-Conhost: For issues in the Console codebase
