Skip to content

server: avoid deadlock when initing additional stores#107124

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
tbg:TestAddNewStoresToExistingNodes-again
Jul 26, 2023
Merged

server: avoid deadlock when initing additional stores#107124
craig[bot] merged 2 commits intocockroachdb:masterfrom
tbg:TestAddNewStoresToExistingNodes-again

Conversation

@tbg
Copy link
Copy Markdown
Member

@tbg tbg commented Jul 18, 2023

We need to start node liveness before waiting for additional store init.

Otherwise, we can end up in a situation where each node is sitting on the
channel and nobody has started their liveness yet. The sender to the channel
will first have to get an Increment through KV, but if nobody acquires the
lease (since nobody's heartbeat loop is running), this will never happen.

In practice, most of the time, there is no deadlock because the lease
acquisition path performs a synchronous heartbeat to the own entry in most
cases (ignoring the fact that liveness hasn't been started yet). But there is
also another path where someone else's epoch needs to be incremented, and this
path also checks if the node itself is live - which it won't necessarily be
(liveness loop is not running yet).

Fixes #106706

Epic: None
Release note (bug fix): a rare (!) situation in which nodes would get stuck
during start-up was addressed. This is unlikely to have been encountered by
production users This is unlikely to have been encountered by users. If so, it
would manifest itself through a stack frame sitting on a select in
waitForAdditionalStoreInit for extended periods of time (i.e. minutes).

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@tbg tbg force-pushed the TestAddNewStoresToExistingNodes-again branch from c66125f to 0b32a28 Compare July 20, 2023 14:13
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jul 20, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@tbg tbg force-pushed the TestAddNewStoresToExistingNodes-again branch from 0b32a28 to 9e5df2e Compare July 21, 2023 09:01
@tbg tbg added the db-cy-23 label Jul 21, 2023
@tbg tbg marked this pull request as ready for review July 21, 2023 12:00
@tbg tbg requested a review from a team as a code owner July 21, 2023 12:00
@tbg tbg requested a review from erikgrinaker July 21, 2023 12:00
Copy link
Copy Markdown
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need a release note, this was broken recently in #103601.

LGTM once comments on #107265 are addressed. Thanks!

@erikgrinaker
Copy link
Copy Markdown
Contributor

We don't need a release note, this was broken recently in #103601.

Nah, I might be wrong about that. I thought we moved the startup there, but looks like we didn't.

tbg added 2 commits July 26, 2023 11:46
Otherwise, we can end up in a situation where each node is sitting on the
channel and nobody has started their liveness yet. The sender to the channel
will first have to get an Increment through KV, but if nobody acquires the
lease (since nobody's heartbeat loop is running), this will never happen.

In practice, *most of the time*, there is no deadlock because the lease
acquisition path performs a synchronous heartbeat to the own entry in most
cases (ignoring the fact that liveness hasn't been started yet). But there is
also another path where someone else's epoch needs to be incremented, and this
path also checks if the node itself is live - which it won't necessarily be
(liveness loop is not running yet).

Fixes cockroachdb#106706

Epic: None
Release note (bug fix): a rare (!) situation in which nodes would get stuck
during start-up was addressed. This is unlikely to have been encountered by
production users This is unlikely to have been encountered by users. If so, it
would manifest itself through a stack frame sitting on a select in
`waitForAdditionalStoreInit` for extended periods of time (i.e. minutes).
If we rely on sync heartbeats, there's an issue. They were very effective in
hiding the problem in cockroachdb#106706 so at least in our testing, allow sync heartbeats
only once there are also async heartbeats.

Epic: none
Release note: None
@tbg tbg force-pushed the TestAddNewStoresToExistingNodes-again branch from 9e5df2e to ec326ca Compare July 26, 2023 09:46
@tbg
Copy link
Copy Markdown
Member Author

tbg commented Jul 26, 2023

TFTR!

bors r=erikgrinaker

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 26, 2023

Build succeeded:

@craig craig bot merged commit 8c52fea into cockroachdb:master Jul 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

server: TestAddNewStoresToExistingNodes failed

3 participants