Fix a major stdin wakeup race condition by lhecker · Pull Request #18816 · microsoft/terminal

lhecker · 2025-04-19T00:41:04Z

The conhost v2 rewrite from a decade ago introduced a race condition:
Previously, we would acquire and hold the global console lock while servicing
a console API call. If the call cannot be completed a wait task is enqueued,
while the lock is held. The v2 rewrite then split the project up into a
"server" and "host" component (which remain to this day). The "host" would
hold the console lock, while the "server" was responsible for enqueueing wait
tasks outside of the console lock. Without any form of synchronization,
any operations on the waiter list would then of course introduce a race
condition. In conhost this primarily meant keyboard/mouse input, because that
runs on the separate Win32 window thread. For Windows Terminal it primarily
meant the VT input thread.

I do not know why this issue is so extremely noticeable specifically when we
respond to DSC CPR requests, but I'm also not surprised: I suspect that the
overall performance issues that conhost had for a long time, meant that most
things it did were slower than allocating the wait task.
Now that both conhost and Windows Terminal became orders of magnitudes faster
over the last few years, it probably just so happens that the DSC CPR request
takes almost exactly as many cycles to complete as allocating the wait task
does, hence perfectly reproducing the race condition.

There's also a slight chance that this is actually a regression from my ConPTY
rewrite #17510, but I fail to see what that would be. Regardless of that,
I'm 100% certain though, that this is a bug that has existed in v0.1.

Closes #18117
Closes #18800

Validation Steps Performed

See repro in regression post 1.21: TUI app cursor position blocked until a key is hit - was working fine before #18800. In other words:
- Continuously emit DSC CPR sequences
- ...read the response from stdin
- ...and print the response to stdout
- Doesn't deadlock randomly anymore ✅
Feature & Unit tests ✅

DHowett

okay so if I understand it properly, this moves the construction and registration of the waiter out of DoWriteConsole/WriteConsoleWImplHelper and into the three places that would be impacted by waiters instead.

I can't tell where it gets made thread-safe or handles the console lock, but I haven't read all of it yet

DHowett · 2025-04-21T22:46:02Z

src/server/ApiDispatchers.cpp

-            *pbReplyPending = TRUE;
-            hr = CONSOLE_STATUS_WAIT;
-        }
+        hr = S_OK;


wait, who handles this hr? It doesn't get up to the calling application, does it?

okay nevermind! i finally see it - all of the returns for CONSOLE_STATUS_WAIT inside host get transformed into S_OK right here. you can ignore my prior comments and write a poem about whatever the heck you want

Sorry, yes, I was interrupted as I was writing comments. I was very confused about reading this particular bit of code, because the other 2 places return S_OK in this case. I later found that the returned hr is actually completely ignored pbReplyPending is true. Still, better be consistent.

DHowett · 2025-04-21T22:48:42Z

src/host/_stream.cpp

+        auto& gci = ServiceLocator::LocateGlobals().getConsoleInformation();
+        if (WI_IsAnyFlagSet(gci.Flags, (CONSOLE_SUSPENDED | CONSOLE_SELECTING | CONSOLE_SCROLLBAR_TRACKING)))
+        {
+            std::ignore = ConsoleWaitQueue::s_CreateWait(pWaitReplyMessage, new WriteData(context, std::wstring{ buffer }, gci.OutputCP));


raw use of new: this and the others are only provably safe if s_CreateWait is noexcept; otherwise we could construct a waiter and it would be leaked if it throws, right?

This is correct, but also already the case, as the previous code used release() to hand off a raw unowned pointer to s_CreateWait. Effectively, the safety of the function has not changed: It's still incorrect. This is also why I'm ignoring the returned HRESULT, because it's similarly unuseful.

I would prefer refactoring this as part of "Server v2" aka In-Process-ConpTY.

DHowett · 2025-04-21T22:50:35Z

src/host/server.h

-#define CONSOLE_STATUS_WAIT 0xC0030001
-#define CONSOLE_STATUS_READ_COMPLETE 0xC0030002
-#define CONSOLE_STATUS_WAIT_NO_BLOCK 0xC0030003
+#define CONSOLE_STATUS_WAIT ((HRESULT)0xC0030001)


i vaguely recall that these are NTSTATUSes

I did this at some point for some reason to get it to compile, but I forgot why. I reverted it.

I do believe that they're NTSTATUSes as well, but ironically they're mostly used as HRESULTs in most places. I think it would be worth standardizing our code on HRESULT, given that conhost runs in user-mode... right?

Seeing the CI failure, I remembered why I did this. 😄
The test fails to build, because an untyped hex number is a signed literal but HRESULTs are unsigned.

DHowett

my only concern is about the unguarded news :)

DHowett · 2025-04-22T20:23:10Z

src/server/ApiDispatchers.cpp

-            *pbReplyPending = TRUE;
-            hr = CONSOLE_STATUS_WAIT;
-        }
+        hr = S_OK;


okay nevermind! i finally see it - all of the returns for CONSOLE_STATUS_WAIT inside host get transformed into S_OK right here. you can ignore my prior comments and write a poem about whatever the heck you want

lhecker · 2025-04-22T20:56:48Z

I force-"unpushed" the last commit.

The conhost v2 rewrite from a decade ago introduced a race condition: Previously, we would acquire and hold the global console lock while servicing a console API call. If the call cannot be completed a wait task is enqueued, while the lock is held. The v2 rewrite then split the project up into a "server" and "host" component (which remain to this day). The "host" would hold the console lock, while the "server" was responsible for enqueueing wait tasks _outside of the console lock_. Without any form of synchronization, any operations on the waiter list would then of course introduce a race condition. In conhost this primarily meant keyboard/mouse input, because that runs on the separate Win32 window thread. For Windows Terminal it primarily meant the VT input thread. I do not know why this issue is so extremely noticeable specifically when we respond to DSC CPR requests, but I'm also not surprised: I suspect that the overall performance issues that conhost had for a long time, meant that most things it did were slower than allocating the wait task. Now that both conhost and Windows Terminal became orders of magnitudes faster over the last few years, it probably just so happens that the DSC CPR request takes almost exactly as many cycles to complete as allocating the wait task does, hence perfectly reproducing the race condition. There's also a slight chance that this is actually a regression from my ConPTY rewrite #17510, but I fail to see what that would be. Regardless of that, I'm 100% certain though, that this is a bug that has existed in v0.1. Closes #18117 Closes #18800 ## Validation Steps Performed * See repro in #18800. In other words: * Continuously emit DSC CPR sequences * ...read the response from stdin * ...and print the response to stdout * Doesn't deadlock randomly anymore ✅ * Feature & Unit tests ✅ (cherry picked from commit 2992421) Service-Card-Id: PVTI_lADOAF3p4s4AxadtzgZiUlo Service-Version: 1.23

The conhost v2 rewrite from a decade ago introduced a race condition: Previously, we would acquire and hold the global console lock while servicing a console API call. If the call cannot be completed a wait task is enqueued, while the lock is held. The v2 rewrite then split the project up into a "server" and "host" component (which remain to this day). The "host" would hold the console lock, while the "server" was responsible for enqueueing wait tasks _outside of the console lock_. Without any form of synchronization, any operations on the waiter list would then of course introduce a race condition. In conhost this primarily meant keyboard/mouse input, because that runs on the separate Win32 window thread. For Windows Terminal it primarily meant the VT input thread. I do not know why this issue is so extremely noticeable specifically when we respond to DSC CPR requests, but I'm also not surprised: I suspect that the overall performance issues that conhost had for a long time, meant that most things it did were slower than allocating the wait task. Now that both conhost and Windows Terminal became orders of magnitudes faster over the last few years, it probably just so happens that the DSC CPR request takes almost exactly as many cycles to complete as allocating the wait task does, hence perfectly reproducing the race condition. There's also a slight chance that this is actually a regression from my ConPTY rewrite #17510, but I fail to see what that would be. Regardless of that, I'm 100% certain though, that this is a bug that has existed in v0.1. Closes #18117 Closes #18800 ## Validation Steps Performed * See repro in #18800. In other words: * Continuously emit DSC CPR sequences * ...read the response from stdin * ...and print the response to stdout * Doesn't deadlock randomly anymore ✅ * Feature & Unit tests ✅ (cherry picked from commit 2992421) Service-Card-Id: PVTI_lADOAF3p4s4AmhmQzgUXwj0 PVTI_lADOAF3p4s4AmhmQzgZiUlw Service-Version: 1.22

Fix a major stdin wakeup race condition

a8c8a9b

lhecker added this to 1.23 Servicing Pipeline, Inbox Servicing Pipeline and 1.22 Servicing Pipeline Apr 19, 2025

github-project-automation bot moved this to To Cherry Pick in Inbox Servicing Pipeline Apr 19, 2025

github-project-automation bot moved this to To Cherry Pick in 1.23 Servicing Pipeline Apr 19, 2025

github-project-automation bot moved this to To Cherry Pick in 1.22 Servicing Pipeline Apr 19, 2025

DHowett reviewed Apr 21, 2025

View reviewed changes

DHowett approved these changes Apr 22, 2025

View reviewed changes

lhecker force-pushed the dev/lhecker/18800-input-wake-race branch from 58a1e29 to a8c8a9b Compare April 22, 2025 20:56

DHowett moved this from To Cherry Pick to To Consider in 1.23 Servicing Pipeline Apr 22, 2025

DHowett moved this from To Cherry Pick to To Consider in 1.22 Servicing Pipeline Apr 22, 2025

DHowett enabled auto-merge (squash) April 22, 2025 23:39

lhecker mentioned this pull request Apr 23, 2025

frequently experience unresposive situation #18803

Closed

DHowett approved these changes Apr 23, 2025

View reviewed changes

DHowett merged commit 2992421 into main Apr 23, 2025
27 checks passed

DHowett deleted the dev/lhecker/18800-input-wake-race branch April 23, 2025 20:27

DHowett moved this from To Consider to To Cherry Pick in 1.23 Servicing Pipeline Apr 24, 2025

DHowett moved this from To Cherry Pick to Cherry Picked in 1.23 Servicing Pipeline Apr 24, 2025

DHowett moved this from To Consider to To Cherry Pick in 1.22 Servicing Pipeline May 7, 2025

DHowett moved this from To Cherry Pick to Cherry Picked in 1.22 Servicing Pipeline May 7, 2025

cracyc mentioned this pull request Jun 11, 2025

Problems with cursor position escape #18117

Closed

ldemailly mentioned this pull request Aug 5, 2025

windows: ansipixels event loop broken in windows terminal 1.22 and newer (was fine before) fortio/terminal#100

Closed

lhecker mentioned this pull request Aug 13, 2025

Freeze when pasting large text into vim (WSL) #16921

Open

lhecker mentioned this pull request Aug 27, 2025

Escape sequences like DA responses can end up fragmented across multiple TCP/IP packets #19282

Closed

DHowett moved this from Cherry Picked to Shipped in 1.23 Servicing Pipeline Sep 5, 2025

diffuty mentioned this pull request Sep 12, 2025

Receiving duplicated response for a single device attribute request #19345

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix a major stdin wakeup race condition#18816

Fix a major stdin wakeup race condition#18816
DHowett merged 1 commit intomainfrom
dev/lhecker/18800-input-wake-race

lhecker commented Apr 19, 2025 •

edited

Loading

Uh oh!

DHowett left a comment

Uh oh!

DHowett Apr 21, 2025

Uh oh!

DHowett Apr 22, 2025

Uh oh!

lhecker Apr 22, 2025

Uh oh!

DHowett Apr 21, 2025

Uh oh!

lhecker Apr 22, 2025

Uh oh!

DHowett Apr 21, 2025

Uh oh!

lhecker Apr 22, 2025

Uh oh!

lhecker Apr 22, 2025

Uh oh!

DHowett left a comment

Uh oh!

DHowett Apr 22, 2025

Uh oh!

lhecker commented Apr 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lhecker commented Apr 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Validation Steps Performed

Uh oh!

DHowett left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

DHowett left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lhecker commented Apr 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lhecker commented Apr 19, 2025 •

edited

Loading