server: avoid deadlock when initing additional stores #107124
Merged
craig[bot] merged 2 commits into cockroachdb:master on Jul 26, 2023
Member
c66125f to 0b32a28
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
0b32a28 to 9e5df2e
erikgrinaker
approved these changes
Jul 24, 2023
Contributor
Nah, I might be wrong about that. I thought we moved the startup there, but it looks like we didn't.
We need to start node liveness before waiting for additional store init. Otherwise, we can end up in a situation where each node is sitting on the channel and nobody has started their liveness yet. The sender to the channel will first have to get an Increment through KV, but if nobody acquires the lease (since nobody's heartbeat loop is running), this will never happen.
In practice, there is *usually* no deadlock because the lease acquisition path typically performs a synchronous heartbeat of the node's own liveness entry (regardless of whether liveness has been started yet). But there is also another path where someone else's epoch needs to be incremented, and this path also checks whether the node itself is live, which it won't necessarily be (its liveness loop is not running yet).
Fixes cockroachdb#106706
Epic: None
Release note (bug fix): a rare (!) situation in which nodes would get stuck during start-up was addressed. This is unlikely to have been encountered by production users. If so, it would manifest itself through a stack frame sitting on a select in `waitForAdditionalStoreInit` for extended periods of time (i.e. minutes).
Relying on sync heartbeats is a problem: they were very effective at hiding the deadlock in cockroachdb#106706. So, at least in our testing, allow sync heartbeats only once async heartbeats are also running.
Epic: none
Release note: None
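To make the idea in this commit message concrete, here is a minimal sketch of gating synchronous heartbeats on the asynchronous heartbeat loop having been started. Everything in it is an assumption made for illustration: `gatedLiveness`, the `strict` knob, and `errLoopNotStarted` are hypothetical names, not CockroachDB's actual types or the change made in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errLoopNotStarted is a hypothetical error used only in this sketch.
var errLoopNotStarted = errors.New("liveness: heartbeat loop not started")

// gatedLiveness is a hypothetical stand-in for a node liveness component.
// It is not CockroachDB's NodeLiveness.
type gatedLiveness struct {
	loopStarted atomic.Bool // set once the async heartbeat loop is running
	strict      bool        // assumed testing knob: reject sync heartbeats before Start
	heartbeats  int         // number of sync heartbeats that were allowed through
}

// Start models launching the async heartbeat loop.
func (l *gatedLiveness) Start() { l.loopStarted.Store(true) }

// HeartbeatSync models a synchronous heartbeat. With strict set, it refuses
// to run until the async loop has been started, so a missing Start call
// surfaces as an error instead of being silently papered over.
func (l *gatedLiveness) HeartbeatSync() error {
	if l.strict && !l.loopStarted.Load() {
		return errLoopNotStarted
	}
	l.heartbeats++
	return nil
}

func main() {
	l := &gatedLiveness{strict: true}
	fmt.Println("before Start:", l.HeartbeatSync())
	l.Start()
	fmt.Println("after Start:", l.HeartbeatSync())
}
```

With a gate like this enabled in tests, an ordering bug of the kind fixed here would fail fast rather than being masked by a sync heartbeat that happens to succeed.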
9e5df2e to ec326ca
Member
Author
TFTR!
bors r=erikgrinaker
Contributor
Build succeeded:
We need to start node liveness before waiting for additional store init.
Otherwise, we can end up in a situation where each node is sitting on the
channel and nobody has started their liveness yet. The sender to the channel
will first have to get an Increment through KV, but if nobody acquires the
lease (since nobody's heartbeat loop is running), this will never happen.
In practice, there is usually no deadlock because the lease acquisition path
typically performs a synchronous heartbeat of the node's own liveness entry
(regardless of whether liveness has been started yet). But there is also
another path where someone else's epoch needs to be incremented, and this path
also checks whether the node itself is live, which it won't necessarily be
(its liveness loop is not running yet).
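The ordering problem described above can be illustrated with a small, self-contained simulation. This is a sketch under stated assumptions, not the server code: `liveness`, `increment`, `initAdditionalStores`, `nodeStartup`, and `runCluster` are hypothetical stand-ins, the KV Increment is modeled as "succeeds only if at least one node's heartbeat loop is running", and a timeout stands in for a node that would otherwise wait indefinitely.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// liveness is a hypothetical stand-in for a node's heartbeat loop.
type liveness struct{ started atomic.Bool }

func (l *liveness) Start() { l.started.Store(true) }

// increment models the KV Increment: in this sketch it can only succeed if
// some node is live and can therefore acquire the lease.
func increment(cluster []*liveness) bool {
	for _, n := range cluster {
		if n.started.Load() {
			return true
		}
	}
	return false
}

// initAdditionalStores models the sender: it must get an Increment through
// before it can signal the channel the startup path is waiting on.
func initAdditionalStores(cluster []*liveness, done chan<- struct{}, stop <-chan struct{}) {
	for {
		if increment(cluster) {
			close(done)
			return
		}
		select {
		case <-stop:
			return
		case <-time.After(time.Millisecond): // retry
		}
	}
}

// nodeStartup reports whether the wait for additional store init finished.
// The timeout stands in for "stuck for minutes" in a real cluster.
func nodeStartup(self *liveness, cluster []*liveness, livenessFirst bool, timeout time.Duration) bool {
	if livenessFirst {
		self.Start() // fixed ordering: liveness before the wait
	}
	done := make(chan struct{})
	stop := make(chan struct{})
	defer close(stop)
	go initAdditionalStores(cluster, done, stop)
	select {
	case <-done:
		self.Start() // buggy ordering: liveness only after the wait
		return true
	case <-time.After(timeout):
		return false
	}
}

// runCluster starts n nodes concurrently and returns how many got past the wait.
func runCluster(n int, livenessFirst bool) int {
	cluster := make([]*liveness, n)
	for i := range cluster {
		cluster[i] = &liveness{}
	}
	var wg sync.WaitGroup
	var ok atomic.Int32
	for _, node := range cluster {
		wg.Add(1)
		go func(self *liveness) {
			defer wg.Done()
			if nodeStartup(self, cluster, livenessFirst, 200*time.Millisecond) {
				ok.Add(1)
			}
		}(node)
	}
	wg.Wait()
	return int(ok.Load())
}

func main() {
	fmt.Println("liveness first:", runCluster(3, true), "of 3 nodes got past the wait")
	fmt.Println("wait first:", runCluster(3, false), "of 3 nodes got past the wait")
}
```

In this model, starting liveness first lets every node proceed, while waiting first leaves every node blocked on its select until the timeout. The model deliberately omits the synchronous heartbeat that usually hides the problem in practice.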
Fixes #106706
Epic: None
Release note (bug fix): a rare (!) situation in which nodes would get stuck
during start-up was addressed. This is unlikely to have been encountered by
production users. If so, it would manifest itself through a stack frame
sitting on a select in waitForAdditionalStoreInit for extended periods of
time (i.e. minutes).