Skip to content

Fix a major stdin wakeup race condition#18816

Merged
DHowett merged 1 commit intomainfrom
dev/lhecker/18800-input-wake-race
Apr 23, 2025
Merged

Fix a major stdin wakeup race condition#18816
DHowett merged 1 commit intomainfrom
dev/lhecker/18800-input-wake-race

Conversation

@lhecker
Copy link
Member

@lhecker lhecker commented Apr 19, 2025

The conhost v2 rewrite from a decade ago introduced a race condition:
Previously, we would acquire and hold the global console lock while servicing
a console API call. If the call cannot be completed a wait task is enqueued,
while the lock is held. The v2 rewrite then split the project up into a
"server" and "host" component (which remain to this day). The "host" would
hold the console lock, while the "server" was responsible for enqueueing wait
tasks outside of the console lock. Without any form of synchronization,
any operations on the waiter list would then of course introduce a race
condition. In conhost this primarily meant keyboard/mouse input, because that
runs on the separate Win32 window thread. For Windows Terminal it primarily
meant the VT input thread.

I do not know why this issue is so extremely noticeable specifically when we
respond to DSC CPR requests, but I'm also not surprised: I suspect that the
overall performance issues that conhost had for a long time, meant that most
things it did were slower than allocating the wait task.
Now that both conhost and Windows Terminal became orders of magnitudes faster
over the last few years, it probably just so happens that the DSC CPR request
takes almost exactly as many cycles to complete as allocating the wait task
does, hence perfectly reproducing the race condition.

There's also a slight chance that this is actually a regression from my ConPTY
rewrite #17510, but I fail to see what that would be. Regardless of that,
I'm 100% certain though, that this is a bug that has existed in v0.1.

Closes #18117
Closes #18800

Validation Steps Performed

@lhecker lhecker added Product-Conhost For issues in the Console codebase Product-Conpty For console issues specifically related to conpty Product-Terminal The new Windows Terminal. Issue-Bug It either shouldn't be doing this or needs an investigation. Priority-1 A description (P1) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. labels Apr 19, 2025
@github-project-automation github-project-automation bot moved this to To Cherry Pick in 1.23 Servicing Pipeline Apr 19, 2025
@github-project-automation github-project-automation bot moved this to To Cherry Pick in 1.22 Servicing Pipeline Apr 19, 2025
Copy link
Member

@DHowett DHowett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay so if I understand it properly, this moves the construction and registration of the waiter out of DoWriteConsole/WriteConsoleWImplHelper and into the three places that would be impacted by waiters instead.

I can't tell where it gets made thread-safe or handles the console lock, but I haven't read all of it yet

*pbReplyPending = TRUE;
hr = CONSOLE_STATUS_WAIT;
}
hr = S_OK;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, who handles this hr? It doesn't get up to the calling application, does it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay nevermind! i finally see it - all of the returns for CONSOLE_STATUS_WAIT inside host get transformed into S_OK right here. you can ignore my prior comments and write a poem about whatever the heck you want

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, yes, I was interrupted as I was writing comments. I was very confused about reading this particular bit of code, because the other 2 places return S_OK in this case. I later found that the returned hr is actually completely ignored pbReplyPending is true. Still, better be consistent.

auto& gci = ServiceLocator::LocateGlobals().getConsoleInformation();
if (WI_IsAnyFlagSet(gci.Flags, (CONSOLE_SUSPENDED | CONSOLE_SELECTING | CONSOLE_SCROLLBAR_TRACKING)))
{
std::ignore = ConsoleWaitQueue::s_CreateWait(pWaitReplyMessage, new WriteData(context, std::wstring{ buffer }, gci.OutputCP));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raw use of new: this and the others are only provably safe if s_CreateWait is noexcept; otherwise we could construct a waiter and it would be leaked if it throws, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, but also already the case, as the previous code used release() to hand off a raw unowned pointer to s_CreateWait. Effectively, the safety of the function has not changed: It's still incorrect. This is also why I'm ignoring the returned HRESULT, because it's similarly unuseful.

I would prefer refactoring this as part of "Server v2" aka In-Process-ConpTY.

#define CONSOLE_STATUS_WAIT 0xC0030001
#define CONSOLE_STATUS_READ_COMPLETE 0xC0030002
#define CONSOLE_STATUS_WAIT_NO_BLOCK 0xC0030003
#define CONSOLE_STATUS_WAIT ((HRESULT)0xC0030001)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i vaguely recall that these are NTSTATUSes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this at some point for some reason to get it to compile, but I forgot why. I reverted it.

I do believe that they're NTSTATUSes as well, but ironically they're mostly used as HRESULTs in most places. I think it would be worth standardizing our code on HRESULT, given that conhost runs in user-mode... right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing the CI failure, I remembered why I did this. 😄
The test fails to build, because an untyped hex number is a signed literal but HRESULTs are unsigned.

Copy link
Member

@DHowett DHowett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my only concern is about the unguarded news :)

*pbReplyPending = TRUE;
hr = CONSOLE_STATUS_WAIT;
}
hr = S_OK;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay nevermind! i finally see it - all of the returns for CONSOLE_STATUS_WAIT inside host get transformed into S_OK right here. you can ignore my prior comments and write a poem about whatever the heck you want

@lhecker lhecker force-pushed the dev/lhecker/18800-input-wake-race branch from 58a1e29 to a8c8a9b Compare April 22, 2025 20:56
@lhecker
Copy link
Member Author

lhecker commented Apr 22, 2025

I force-"unpushed" the last commit.

@DHowett DHowett moved this from To Cherry Pick to To Consider in 1.23 Servicing Pipeline Apr 22, 2025
@DHowett DHowett moved this from To Cherry Pick to To Consider in 1.22 Servicing Pipeline Apr 22, 2025
@DHowett DHowett enabled auto-merge (squash) April 22, 2025 23:39
@DHowett DHowett merged commit 2992421 into main Apr 23, 2025
27 checks passed
@DHowett DHowett deleted the dev/lhecker/18800-input-wake-race branch April 23, 2025 20:27
@DHowett DHowett moved this from To Consider to To Cherry Pick in 1.23 Servicing Pipeline Apr 24, 2025
@DHowett DHowett moved this from To Cherry Pick to Cherry Picked in 1.23 Servicing Pipeline Apr 24, 2025
DHowett pushed a commit that referenced this pull request Apr 24, 2025
The conhost v2 rewrite from a decade ago introduced a race condition:
Previously, we would acquire and hold the global console lock while
servicing
a console API call. If the call cannot be completed a wait task is
enqueued,
while the lock is held. The v2 rewrite then split the project up into a
"server" and "host" component (which remain to this day). The "host"
would
hold the console lock, while the "server" was responsible for enqueueing
wait
tasks _outside of the console lock_. Without any form of
synchronization,
any operations on the waiter list would then of course introduce a race
condition. In conhost this primarily meant keyboard/mouse input, because
that
runs on the separate Win32 window thread. For Windows Terminal it
primarily
meant the VT input thread.

I do not know why this issue is so extremely noticeable specifically
when we
respond to DSC CPR requests, but I'm also not surprised: I suspect that
the
overall performance issues that conhost had for a long time, meant that
most
things it did were slower than allocating the wait task.
Now that both conhost and Windows Terminal became orders of magnitudes
faster
over the last few years, it probably just so happens that the DSC CPR
request
takes almost exactly as many cycles to complete as allocating the wait
task
does, hence perfectly reproducing the race condition.

There's also a slight chance that this is actually a regression from my
ConPTY
rewrite #17510, but I fail to see what that would be. Regardless of
that,
I'm 100% certain though, that this is a bug that has existed in v0.1.

Closes #18117
Closes #18800

## Validation Steps Performed
* See repro in #18800. In other words:
  * Continuously emit DSC CPR sequences
  * ...read the response from stdin
  * ...and print the response to stdout
  * Doesn't deadlock randomly anymore ✅
* Feature & Unit tests ✅

(cherry picked from commit 2992421)
Service-Card-Id: PVTI_lADOAF3p4s4AxadtzgZiUlo
Service-Version: 1.23
@DHowett DHowett moved this from To Consider to To Cherry Pick in 1.22 Servicing Pipeline May 7, 2025
@DHowett DHowett moved this from To Cherry Pick to Cherry Picked in 1.22 Servicing Pipeline May 7, 2025
DHowett pushed a commit that referenced this pull request May 7, 2025
The conhost v2 rewrite from a decade ago introduced a race condition:
Previously, we would acquire and hold the global console lock while
servicing
a console API call. If the call cannot be completed a wait task is
enqueued,
while the lock is held. The v2 rewrite then split the project up into a
"server" and "host" component (which remain to this day). The "host"
would
hold the console lock, while the "server" was responsible for enqueueing
wait
tasks _outside of the console lock_. Without any form of
synchronization,
any operations on the waiter list would then of course introduce a race
condition. In conhost this primarily meant keyboard/mouse input, because
that
runs on the separate Win32 window thread. For Windows Terminal it
primarily
meant the VT input thread.

I do not know why this issue is so extremely noticeable specifically
when we
respond to DSC CPR requests, but I'm also not surprised: I suspect that
the
overall performance issues that conhost had for a long time, meant that
most
things it did were slower than allocating the wait task.
Now that both conhost and Windows Terminal became orders of magnitudes
faster
over the last few years, it probably just so happens that the DSC CPR
request
takes almost exactly as many cycles to complete as allocating the wait
task
does, hence perfectly reproducing the race condition.

There's also a slight chance that this is actually a regression from my
ConPTY
rewrite #17510, but I fail to see what that would be. Regardless of
that,
I'm 100% certain though, that this is a bug that has existed in v0.1.

Closes #18117
Closes #18800

## Validation Steps Performed
* See repro in #18800. In other words:
  * Continuously emit DSC CPR sequences
  * ...read the response from stdin
  * ...and print the response to stdout
  * Doesn't deadlock randomly anymore ✅
* Feature & Unit tests ✅

(cherry picked from commit 2992421)
Service-Card-Id: PVTI_lADOAF3p4s4AmhmQzgUXwj0 PVTI_lADOAF3p4s4AmhmQzgZiUlw
Service-Version: 1.22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Priority-1 A description (P1) Product-Conhost For issues in the Console codebase Product-Conpty For console issues specifically related to conpty Product-Terminal The new Windows Terminal.

Projects

Status: Cherry Picked
Status: To Cherry Pick

Development

Successfully merging this pull request may close these issues.

regression post 1.21: TUI app cursor position blocked until a key is hit - was working fine before Problems with cursor position escape

2 participants