Fix a major stdin wakeup race condition #18816
Conversation
DHowett
left a comment
okay so if I understand it properly, this moves the construction and registration of the waiter out of DoWriteConsole/WriteConsoleWImplHelper and into the three places that would be impacted by waiters instead.
I can't tell where it gets made thread-safe or handles the console lock, but I haven't read all of it yet
```cpp
*pbReplyPending = TRUE;
hr = CONSOLE_STATUS_WAIT;
}
hr = S_OK;
```
wait, who handles this hr? It doesn't get up to the calling application, does it?
okay nevermind! i finally see it - all of the returns for CONSOLE_STATUS_WAIT inside host get transformed into S_OK right here. you can ignore my prior comments and write a poem about whatever the heck you want
Sorry, yes, I was interrupted as I was writing comments. I was very confused reading this particular bit of code, because the other 2 places return S_OK in this case. I later found that the returned hr is actually completely ignored when pbReplyPending is true. Still, better to be consistent.
```cpp
auto& gci = ServiceLocator::LocateGlobals().getConsoleInformation();
if (WI_IsAnyFlagSet(gci.Flags, (CONSOLE_SUSPENDED | CONSOLE_SELECTING | CONSOLE_SCROLLBAR_TRACKING)))
{
    std::ignore = ConsoleWaitQueue::s_CreateWait(pWaitReplyMessage, new WriteData(context, std::wstring{ buffer }, gci.OutputCP));
```
raw use of new: this and the others are only provably safe if s_CreateWait is noexcept; otherwise we could construct a waiter and it would be leaked if it throws, right?
This is correct, but also already the case, as the previous code used release() to hand off a raw unowned pointer to s_CreateWait. Effectively, the safety of the function has not changed: it's still incorrect. This is also why I'm ignoring the returned HRESULT, because it's similarly not useful.
I would prefer refactoring this as part of "Server v2" aka In-Process-ConpTY.
```cpp
#define CONSOLE_STATUS_WAIT 0xC0030001
#define CONSOLE_STATUS_READ_COMPLETE 0xC0030002
#define CONSOLE_STATUS_WAIT_NO_BLOCK 0xC0030003
#define CONSOLE_STATUS_WAIT ((HRESULT)0xC0030001)
```
i vaguely recall that these are NTSTATUSes
I did this at some point for some reason to get it to compile, but I forgot why. I reverted it.
I do believe that they're NTSTATUSes as well, but ironically they're mostly used as HRESULTs in most places. I think it would be worth standardizing our code on HRESULT, given that conhost runs in user-mode... right?
Seeing the CI failure, I remembered why I did this. 😄
The test fails to build because an untyped hex literal like 0xC0030001 doesn't fit in a signed int and therefore has type unsigned int, while HRESULT is a signed long, so comparing the two produces a signed/unsigned mismatch.
DHowett
left a comment
my only concern is about the unguarded news :)
Force-pushed 58a1e29 to a8c8a9b.

I force-"unpushed" the last commit.
(Cherry-picked from commit 2992421 to the 1.23 and 1.22 servicing branches.)
The conhost v2 rewrite from a decade ago introduced a race condition:
Previously, we would acquire and hold the global console lock while servicing
a console API call. If the call cannot be completed, a wait task is enqueued
while the lock is held. The v2 rewrite then split the project up into a
"server" and "host" component (which remain to this day). The "host" would
hold the console lock, while the "server" was responsible for enqueueing wait
tasks outside of the console lock. Without any form of synchronization,
any operations on the waiter list would then of course introduce a race
condition. In conhost this primarily meant keyboard/mouse input, because that
runs on the separate Win32 window thread. For Windows Terminal it primarily
meant the VT input thread.
I do not know why this issue is so extremely noticeable specifically when we
respond to DSR CPR requests, but I'm also not surprised: I suspect that the
overall performance issues conhost had for a long time meant that most
things it did were slower than allocating the wait task.
Now that both conhost and Windows Terminal have become orders of magnitude
faster over the last few years, it probably just so happens that the DSR CPR
request takes almost exactly as many cycles to complete as allocating the wait
task does, hence perfectly reproducing the race condition.
There's also a slight chance that this is actually a regression from my ConPTY
rewrite #17510, but I fail to see what that would be. Regardless, I'm
100% certain that this is a bug that has existed since v0.1.
Closes #18117
Closes #18800
## Validation Steps Performed

* See repro in #18800. In other words:
  * Continuously emit DSR CPR sequences
  * ...read the response from stdin
  * ...and print the response to stdout
  * Doesn't deadlock randomly anymore ✅
* Feature & Unit tests ✅