Backport - Fix replica can't finish failover when config epoch is outdated (#2178) to 7.2 by ranshid · Pull Request #2232 · valkey-io/valkey

ranshid · 2025-06-18T07:55:38Z

When the primary changes its config epoch and then goes down immediately,
the replica may not update the config epoch in time. Although we
broadcast the change in the cluster (see #1813), there may be a race in
the network or in the code. In this case, the replica will never finish
the failover, because other primaries will refuse to vote while the
replica's slot config epoch is old.

We need a way to let the replica finish the failover in this case.

When the primary refuses to vote because the replica's config epoch is
less than the dead primary's config epoch, it can send an UPDATE packet
to the replica to inform the replica about the dead primary. The UPDATE
message contains information about the dead primary's config epoch and
owned slots. The failover will time out, but later the replica can try
again with the updated config epoch and it can succeed.
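
The flow described above can be illustrated with a small toy model. This is a sketch for intuition only, not the Valkey implementation: the class and method names (`Voter`, `Replica`, `Update`, `handle_auth_request`, `try_failover`) are hypothetical, the epoch and slot values are made up, and every other voting condition (current epoch, previous votes, timeouts) is left out. It assumes the voter compares the config epoch carried in the vote request with the config epoch of the slot owner it knows about.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster node as seen by a voter (hypothetical, simplified)."""
    name: str
    config_epoch: int
    slots: set = field(default_factory=set)


@dataclass(frozen=True)
class Update:
    """Stand-in for the UPDATE packet: owner's config epoch and owned slots."""
    owner: str
    config_epoch: int
    slots: frozenset


class Voter:
    """A voting primary holding its own view of slot ownership."""

    def __init__(self, slot_owner):
        self.slot_owner = slot_owner  # slot number -> Node

    def handle_auth_request(self, claimed_slots, request_config_epoch):
        """Return (vote_granted, update_or_None) for a failover vote request."""
        for slot in sorted(claimed_slots):
            owner = self.slot_owner.get(slot)
            if owner is None or owner.config_epoch <= request_config_epoch:
                continue
            # The requester's epoch is older than the known slot owner's:
            # refuse the vote, and tell the requester about that owner.
            return False, Update(owner.name, owner.config_epoch,
                                 frozenset(owner.slots))
        return True, None


class Replica:
    """A replica holding a possibly stale view of its primary's config."""

    def __init__(self, primary_config_epoch, slots):
        self.primary_config_epoch = primary_config_epoch
        self.slots = set(slots)

    def apply_update(self, update):
        # Only accept information that is newer than what we already have.
        if update.config_epoch > self.primary_config_epoch:
            self.primary_config_epoch = update.config_epoch
            self.slots = set(update.slots)

    def try_failover(self, voter):
        granted, update = voter.handle_auth_request(
            self.slots, self.primary_config_epoch)
        if update is not None:
            self.apply_update(update)
        return granted


# The dead primary bumped its config epoch to 5 just before going down.
dead_primary = Node("primary-a", config_epoch=5, slots={0, 1, 2})
voter = Voter({slot: dead_primary for slot in dead_primary.slots})

# The replica missed the bump and still believes the epoch is 4.
replica = Replica(primary_config_epoch=4, slots={0, 1, 2})

first = replica.try_failover(voter)   # refused, but an Update is delivered
second = replica.try_failover(voter)  # retried with the refreshed epoch
```

In this model the first attempt is refused and the second is granted, which mirrors the behavior described above: the first failover times out, and the retry can succeed once the replica has the updated config epoch.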

Fixes #2169.

Signed-off-by: Ran Shidlansik <[email protected]>

enjoy-binbin · 2025-06-18T08:09:07Z

The test fails because of other changes around the test suite; see the top comment in #2210.
We decided to drop the test to avoid a major backport, unless other changes rely on it.

codecov · 2025-06-18T08:30:59Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (5dc6632) to head (baeafc4).
Report is 1 commits behind head on 7.2.

ranshid requested a review from enjoy-binbin June 18, 2025 07:55

fix spelling

5bd8179

Signed-off-by: Ran Shidlansik <[email protected]>

enjoy-binbin approved these changes Jun 18, 2025

drop test changes

baeafc4

Signed-off-by: Ran Shidlansik <[email protected]>

ranshid merged commit 525551a into valkey-io:7.2 Jun 18, 2025
15 checks passed

ranshid merged 3 commits into valkey-io:7.2 from ranshid:backport-2178-to-7.2
