
Avoid shard id update of replica if not matching with primary shard id #573

Merged
hpatro merged 5 commits into valkey-io:unstable from hpatro:shard_id_divergence
Apr 18, 2025

Conversation

@hpatro
Contributor

@hpatro hpatro commented May 29, 2024

During cluster setup, the shard id is established through extensions data propagation. If the engine crashes or restarts while the reconciliation of the shard id is in progress, the cluster configuration file can be left corrupted, which causes the engine restart to fail.

Scenario:

Let's say there are two nodes in a cluster, Node A and Node B. All admin operations are performed on Node B. Node A and Node B finish the handshake but haven't shared the extensions information yet. Node B is made a replica of Node A. As part of sharing the slaveof information, Node B also shares its temporary shard id. During regular packet processing on Node A, while handling the replication information, the shard id of Node A gets applied to Node B. Then, during extensions processing on Node A, the shard id passed by Node B is applied, which diverges from the shard id of Node A. A crash/restart at that point leaves the cluster configuration file in an unrecoverable, corrupted state.

[Diagram: sequence of events on Node A and Node B leading to the shard id divergence]
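The rule behind the fix can be sketched as below. This is a minimal illustration, not the actual valkey code: the struct and the function `maybe_update_shard_id` are hypothetical stand-ins, but the rule matches the change described above: a replica's announced shard id is accepted only if it matches its primary's shard id.

```c
#include <assert.h>
#include <string.h>

#define SHARD_ID_LEN 40

/* Minimal stand-in for the real cluster node structure; illustrative only. */
typedef struct node {
    char shard_id[SHARD_ID_LEN + 1];
    struct node *replicaof; /* NULL when the node is a primary */
} node;

/* Sketch of the fix: a primary's shard id is always accepted, but a
 * replica's announced shard id is applied only when it agrees with the
 * shard id already recorded for its primary. A diverging (e.g. temporary)
 * shard id is ignored, letting the cluster bus converge on one shard id
 * per shard. */
static int maybe_update_shard_id(node *n, const char *announced) {
    if (n->replicaof != NULL &&
        strcmp(announced, n->replicaof->shard_id) != 0) {
        return 0; /* reject: diverges from the primary's shard id */
    }
    strncpy(n->shard_id, announced, SHARD_ID_LEN);
    n->shard_id[SHARD_ID_LEN] = '\0';
    return 1;
}
```

The effect is an ordering guarantee: the primary's shard id is updated first, and replicas can only follow it.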

@hpatro hpatro requested review from PingXie and enjoy-binbin May 29, 2024 21:07
@hpatro hpatro force-pushed the shard_id_divergence branch from 7cabc57 to 1714613 May 29, 2024 21:10
@PingXie
Member

PingXie commented May 31, 2024

I am not sure I understand the event sequence that leads to a corrupt state. Can you elaborate?

The change makes sense to me. Essentially with this change there is now an order in which the shard-id is updated in a shard: primary first and replicas next.

btw, this change also requires us to sequence the assignment of the primary before the invocation of updateShardId. This seems to be the case already at https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L3092 and https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L5194.

There are some timeout failures in the test pass though. That is a bit surprising.

@hpatro hpatro requested a review from madolson June 3, 2024 19:16
@hpatro
Contributor Author

hpatro commented Jun 3, 2024

The scenario is slightly difficult to explain; I've tried my best to depict it (updated the main comment). @PingXie / @madolson, please have a look.

Member

@enjoy-binbin enjoy-binbin left a comment


With the top comment picture, I think I now understand the case. The changes LGTM; btw the tests seem to keep failing.

@codecov

codecov bot commented Jun 4, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 71.06%. Comparing base (89d4577) to head (bd1a0bb).
Report is 5 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 80.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #573      +/-   ##
============================================
+ Coverage     70.97%   71.06%   +0.09%     
============================================
  Files           123      123              
  Lines         65937    65937              
============================================
+ Hits          46797    46857      +60     
+ Misses        19140    19080      -60     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.20% <80.00%> (+0.26%) ⬆️

... and 15 files with indirect coverage changes


@hpatro
Contributor Author

hpatro commented Jun 4, 2024

unit/cluster/manual-takeover seems to get stuck on the CI. Unable to reproduce locally so far. Trying to understand why it sometimes gets stuck with this change.

@hpatro
Contributor Author

hpatro commented Jun 10, 2024

There are some timeout failures in the test pass though. That is a bit surprising.

From further investigation, the timeout failure stems from an infinite while loop within this block:

clusterNode *clusterNodeGetPrimary(clusterNode *node) {
    while (node->replicaof != NULL) node = node->replicaof;
    return node;
}

https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L5855C1-L5858C2

Looks like there could be a temporary invalid state in the cluster where nodes point to each other as primary/replica. We could take two approaches to this infinite loop:

  1. Deep dive into why the invalid state is reached (cyclic replication state).
  2. Avoid the loop entirely, as chained replication isn't a valid configuration in cluster mode.
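For illustration, option 2 could look like the bounded walk below. This is a sketch, not the actual patch: `clusterNodeGetPrimarySafe`, the trimmed-down struct, and the hop bound are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed-down stand-in for the real clusterNode; illustrative only. */
typedef struct clusterNode {
    struct clusterNode *replicaof; /* NULL for a primary */
} clusterNode;

/* Defensive variant of clusterNodeGetPrimary: chained replication is not
 * a valid configuration in cluster mode, so a well-formed topology
 * resolves the primary in at most one hop. Capping the walk means a
 * transient A<->B replicaof cycle returns NULL instead of hanging. */
clusterNode *clusterNodeGetPrimarySafe(clusterNode *node) {
    int hops = 16; /* arbitrary small bound; 1 suffices for a valid state */
    while (node->replicaof != NULL) {
        if (--hops == 0) return NULL; /* cycle or invalid chain detected */
        node = node->replicaof;
    }
    return node;
}
```

The caller then has to handle a NULL result, which is where the debug-assert versus ignore-and-wait discussion below comes in.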

@madolson
Member

Deep dive into why the invalid state is reached (cyclic replication state).

We have had multiple of these issues in the past, and I think we always tried to figure it out. Maybe we should use this chance to add a helper method for setting the replicaof so that we check for loops.
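Such a helper might look like the sketch below. The name `clusterSetReplicaOf`, the trimmed-down struct, and the return-value convention are hypothetical; the idea is simply to centralize the `replicaof` assignment behind a cycle check.

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed-down stand-in for the real clusterNode; illustrative only. */
typedef struct clusterNode {
    struct clusterNode *replicaof; /* NULL for a primary */
} clusterNode;

/* Hypothetical helper in the spirit of the suggestion above: every
 * replicaof assignment goes through here, so a cycle can never be
 * written. Returns 0 and leaves the node untouched if the assignment
 * would close a loop (e.g. A->B while B->A). The walk up from `primary`
 * terminates because, by this same invariant, no cycle exists yet. */
static int clusterSetReplicaOf(clusterNode *node, clusterNode *primary) {
    for (clusterNode *p = primary; p != NULL; p = p->replicaof) {
        if (p == node) {
            /* In a test build this could be a debug assert; in production
             * we ignore the update and wait for gossip to correct it. */
            return 0;
        }
    }
    node->replicaof = primary;
    return 1;
}
```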

@hpatro
Contributor Author

hpatro commented Jun 12, 2024

Deep dive into why the invalid state is reached (cyclic replication state).

We have had multiple of these issues in the past, and I think we always tried to figure it out. Maybe we should use this chance to add a helper method for setting the replicaof so that we check for loops.

And if we detect a loop, do we crash?

@madolson
Member

And if we detect a loop, do we crash?

Maybe we debug-assert crash (as in, only crash during a test). For normal production, we unwind: maybe ignore it and wait for the other node to update us.

Member

@PingXie PingXie left a comment


LGTM overall, but it would be great if you could provide some more context in the code comment (left review feedback too).

@PingXie
Member

PingXie commented Jun 13, 2024

The scenario is slightly difficult to explain; I've tried my best to depict it (updated the main comment). @PingXie / @madolson, please have a look.

Great diagram! Thanks @hpatro. This helps a lot.

@PingXie
Member

PingXie commented Jun 13, 2024

And if we detect a loop, do we crash?

Maybe we debug-assert crash (as in, only crash during a test). For normal production, we unwind: maybe ignore it and wait for the other node to update us.

debugAssert is reasonable but I don't think we should crash the server just because there is a loop. In fact, we have logic to break the loop already. I will suggest a fix in #609

@madolson
Member

madolson commented Jul 1, 2024

@hpatro Sorry for taking so long to circle back on this, the DCO was failing last time and I forgot to ping you to update. I think this is good to merge otherwise.

@hpatro hpatro force-pushed the shard_id_divergence branch from 69a7d96 to 770cfa9 July 1, 2024 18:39
@hpatro
Contributor Author

hpatro commented Jul 1, 2024

@madolson Had to force push. PTAL.

@madolson
Member

madolson commented Jul 2, 2024

@madolson madolson added the release-notes This issue should get a line item in the release notes label Jul 2, 2024
Member

@madolson madolson left a comment


LGTM, just want to wait for some more comprehensive tests.

@PingXie
Member

PingXie commented Jul 3, 2024

We actually hit the replication cycle assert rather consistently in the test run @madolson shared above. This is something that I haven't seen before.

*** Crash report found in valkey_2/log.txt ***
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
44713:M 02 Jul 2024 22:52:22.118 # === ASSERTION FAILED ===
44713:M 02 Jul 2024 22:52:22.118 # ==> cluster_legacy.c:5879 'primary->replicaof == ((void *)0)' is not true

@hpatro
Contributor Author

hpatro commented Jul 3, 2024

We actually hit the replication cycle assert rather consistently in the test run @madolson shared above. This is something that I haven't seen before.

*** Crash report found in valkey_2/log.txt ***
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
44713:M 02 Jul 2024 22:52:22.118 # === ASSERTION FAILED ===
44713:M 02 Jul 2024 22:52:22.118 # ==> cluster_legacy.c:5879 'primary->replicaof == ((void *)0)' is not true

Yeah, this change invokes the API more frequently. Someone needs to deep dive further to understand how we reach this state.

@madolson
Member

madolson commented Jul 6, 2024

Yeah, this change invokes the API more frequently. Someone needs to deep dive further to understand how we reach this state.

I deep dived it with an AWS engineer last week, I have a partial fix and will post it early next week.

@hpatro hpatro added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Apr 17, 2025
@hpatro
Contributor Author

hpatro commented Apr 18, 2025

@PingXie / @enjoy-binbin Any further comments, or shall we merge it?

@PingXie
Member

PingXie commented Apr 18, 2025

Finally! Go for it 😄

@hpatro hpatro merged commit 05d8fd4 into valkey-io:unstable Apr 18, 2025
51 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to To be backported in Valkey 8.0 Apr 18, 2025
@github-project-automation github-project-automation bot moved this from In Progress to To be backported in Valkey 8.1 Apr 18, 2025
@hpatro hpatro mentioned this pull request Apr 21, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request Apr 21, 2025
Avoid shard id update of replica if not matching with primary shard id (valkey-io#573)

During cluster setup, the shard id gets established through extensions data propagation, and if the engine crashes/restarts while the reconciliation of the shard id is in progress, there is a possibility of a corrupted config file with a temporary shard id stored in the cluster. With this fix, a replica sharing a temporary shard id is ignored, allowing the cluster bus to converge on a single shard id for a primary and its replicas.

Signed-off-by: Harkrishn Patro <[email protected]>
murphyjacob4 pushed commits to murphyjacob4/valkey that referenced this pull request Apr 21 and Apr 23, 2025 (same commit message as above, with an additional sign-off by the backporter)
nitaicaro pushed commits to nitaicaro/valkey that referenced this pull request Apr 22, 2025 (same commit message as above)
@zuiderkwast zuiderkwast moved this from To be backported to 8.0.3 in Valkey 8.0 Apr 22, 2025
@zuiderkwast zuiderkwast moved this from To be backported to 8.1.1 in Valkey 8.1 Apr 22, 2025
zuiderkwast pushed commits that referenced this pull request Apr 23, 2025 (same commit message as above)
hwware pushed a commit to wuranxx/valkey that referenced this pull request Apr 24, 2025 (same commit message as above, with an additional sign-off by the backporter)

Labels

  • cluster
  • release-notes: This issue should get a line item in the release notes
  • run-extra-tests: Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: 8.0.3
Status: 8.1.1


5 participants