
Avoid shard id update of replica if not matching with primary shard id #573

Merged
hpatro merged 5 commits into valkey-io:unstable from hpatro:shard_id_divergence
Apr 18, 2025

Conversation

@hpatro
Contributor

@hpatro hpatro commented May 29, 2024

During cluster setup, the shard id is established through extensions data propagation. If the engine crashes or restarts while the reconciliation of the shard id is in progress, the cluster configuration file can be left corrupted, which causes the engine restart to fail.

Scenario:

Let's say there are two nodes in a cluster, Node A and Node B. All admin operations are performed on Node B. Node A and Node B finish the handshake but haven't shared the extensions information yet. Node B is made a replica of Node A. As part of sharing the slaveof information, Node B also shares its temporary shard id. During regular packet processing on Node A, while handling the replication information, the shard id of Node A gets applied to Node B. Then, during extensions processing on Node A, the shard id passed by Node B is applied, which diverges from the shard id of Node A. A crash/restart at that point leaves the cluster configuration file in an unrecoverable, corrupted state.

[Diagram: sequence of events on Node A and Node B leading to the shard id divergence]
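The rule behind the fix can be sketched as below. This is a minimal illustration, not the actual valkey code: the struct and the function `maybe_update_shard_id` are hypothetical stand-ins, but the rule matches the change described above: a replica's announced shard id is accepted only if it matches its primary's shard id.

```c
#include <assert.h>
#include <string.h>

#define SHARD_ID_LEN 40

/* Minimal stand-in for the real cluster node structure; illustrative only. */
typedef struct node {
    char shard_id[SHARD_ID_LEN + 1];
    struct node *replicaof; /* NULL when the node is a primary */
} node;

/* Sketch of the fix: a primary's shard id is always accepted, but a
 * replica's announced shard id is applied only when it agrees with the
 * shard id already recorded for its primary. A diverging (e.g. temporary)
 * shard id is ignored, letting the cluster bus converge on one shard id
 * per shard. */
static int maybe_update_shard_id(node *n, const char *announced) {
    if (n->replicaof != NULL &&
        strcmp(announced, n->replicaof->shard_id) != 0) {
        return 0; /* reject: diverges from the primary's shard id */
    }
    strncpy(n->shard_id, announced, SHARD_ID_LEN);
    n->shard_id[SHARD_ID_LEN] = '\0';
    return 1;
}
```

The effect is an ordering guarantee: the primary's shard id is updated first, and replicas can only follow it.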

@hpatro hpatro requested review from PingXie and enjoy-binbin May 29, 2024 21:07
@hpatro hpatro force-pushed the shard_id_divergence branch from 7cabc57 to 1714613 May 29, 2024 21:10
@PingXie
Member

PingXie commented May 31, 2024

I am not sure I understand the event sequence that leads to a corrupt state. Can you elaborate?

The change makes sense to me. Essentially with this change there is now an order in which the shard-id is updated in a shard: primary first and replicas next.

btw, this change also requires us to sequence the assignment of the primary before the invocation of updateShardId. This seems to be the case already at https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L3092 and https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L5194.

There are some timeout failures in the test pass though. That is a bit surprising.

@hpatro hpatro requested a review from madolson June 3, 2024 19:16
@hpatro
Contributor Author

hpatro commented Jun 3, 2024

The scenario is slightly difficult to explain; I've tried my best to depict it (updated the main comment). @PingXie / @madolson, please have a look.

Member

@enjoy-binbin enjoy-binbin left a comment


With the top comment picture, I think I now understand the case. The changes LGTM; btw the tests seem to keep failing.

@codecov

codecov bot commented Jun 4, 2024

Codecov Report

Attention: Patch coverage is 80.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 71.06%. Comparing base (89d4577) to head (bd1a0bb).
Report is 5 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 80.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #573      +/-   ##
============================================
+ Coverage     70.97%   71.06%   +0.09%     
============================================
  Files           123      123              
  Lines         65937    65937              
============================================
+ Hits          46797    46857      +60     
+ Misses        19140    19080      -60     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.20% <80.00%> (+0.26%) ⬆️

... and 15 files with indirect coverage changes


@hpatro
Contributor Author

hpatro commented Jun 4, 2024

unit/cluster/manual-takeover seems to get stuck on the CI. Unable to reproduce locally so far. Trying to understand why it sometimes gets stuck with this change.

@hpatro
Contributor Author

hpatro commented Jun 10, 2024

There are some timeout failures in the test pass though. That is a bit surprising.

From further investigation, the timeout failure stems from an infinite while loop within this block:

clusterNode *clusterNodeGetPrimary(clusterNode *node) {
    while (node->replicaof != NULL) node = node->replicaof;
    return node;
}

https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L5855C1-L5858C2

Looks like there could be a temporary invalid state in the cluster where nodes point to each other as primary/replica. We could take two approaches to this infinite loop:

  1. Deep dive into why the invalid state is reached (cyclic replication state).
  2. Avoid the loop entirely, as chained replication isn't a valid configuration in cluster mode.
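For illustration, option 2 could look like the bounded walk below. This is a sketch, not the actual patch: `clusterNodeGetPrimarySafe`, the trimmed-down struct, and the hop bound are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed-down stand-in for the real clusterNode; illustrative only. */
typedef struct clusterNode {
    struct clusterNode *replicaof; /* NULL for a primary */
} clusterNode;

/* Defensive variant of clusterNodeGetPrimary: chained replication is not
 * a valid configuration in cluster mode, so a well-formed topology
 * resolves the primary in at most one hop. Capping the walk means a
 * transient A<->B replicaof cycle returns NULL instead of hanging. */
clusterNode *clusterNodeGetPrimarySafe(clusterNode *node) {
    int hops = 16; /* arbitrary small bound; 1 suffices for a valid state */
    while (node->replicaof != NULL) {
        if (--hops == 0) return NULL; /* cycle or invalid chain detected */
        node = node->replicaof;
    }
    return node;
}
```

The caller then has to handle a NULL result, which is where the debug-assert versus ignore-and-wait discussion below comes in.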

@madolson
Member

Deep dive into why the invalid state is reached (cyclic replication state).

We have had multiple of these issues in the past, and I think we always tried to figure it out. Maybe we should use this chance to add a helper method for setting the replicaof so that we check for loops.
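Such a helper might look like the sketch below. The name `clusterSetReplicaOf`, the trimmed-down struct, and the return-value convention are hypothetical; the idea is simply to centralize the `replicaof` assignment behind a cycle check.

```c
#include <assert.h>
#include <stddef.h>

/* Trimmed-down stand-in for the real clusterNode; illustrative only. */
typedef struct clusterNode {
    struct clusterNode *replicaof; /* NULL for a primary */
} clusterNode;

/* Hypothetical helper in the spirit of the suggestion above: every
 * replicaof assignment goes through here, so a cycle can never be
 * written. Returns 0 and leaves the node untouched if the assignment
 * would close a loop (e.g. A->B while B->A). The walk up from `primary`
 * terminates because, by this same invariant, no cycle exists yet. */
static int clusterSetReplicaOf(clusterNode *node, clusterNode *primary) {
    for (clusterNode *p = primary; p != NULL; p = p->replicaof) {
        if (p == node) {
            /* In a test build this could be a debug assert; in production
             * we ignore the update and wait for gossip to correct it. */
            return 0;
        }
    }
    node->replicaof = primary;
    return 1;
}
```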

@hpatro
Contributor Author

hpatro commented Jun 12, 2024

Deep dive into why the invalid state is reached (cyclic replication state).

We have had multiple of these issues in the past, and I think we always tried to figure it out. Maybe we should use this chance to add a helper method for setting the replicaof so that we check for loops.

And if we detect a loop, do we crash?

@madolson
Member

And if we detect a loop, do we crash?

Maybe we debug-assert crash (as in, only crash during a test). For normal production, we unwind: maybe ignore it and wait for the other node to update us.

Member

@PingXie PingXie left a comment


LGTM overall, but it would be great if you could provide some more context in the code comment (left review feedback too).

@PingXie
Member

PingXie commented Jun 13, 2024

The scenario is slightly difficult to explain; I've tried my best to depict it (updated the main comment). @PingXie / @madolson, please have a look.

Great diagram! Thanks @hpatro. This helps a lot.

@PingXie
Member

PingXie commented Jun 13, 2024

And if we detect a loop, do we crash?

Maybe we debug-assert crash (as in, only crash during a test). For normal production, we unwind: maybe ignore it and wait for the other node to update us.

debugAssert is reasonable but I don't think we should crash the server just because there is a loop. In fact, we have logic to break the loop already. I will suggest a fix in #609

@madolson
Member

madolson commented Jul 1, 2024

@hpatro Sorry for taking so long to circle back on this, the DCO was failing last time and I forgot to ping you to update. I think this is good to merge otherwise.

@hpatro hpatro force-pushed the shard_id_divergence branch from 69a7d96 to 770cfa9 July 1, 2024 18:39
@hpatro
Contributor Author

hpatro commented Jul 1, 2024

@madolson Had to force push. PTAL.

@madolson
Member

madolson commented Jul 2, 2024

@madolson madolson added the release-notes This issue should get a line item in the release notes label Jul 2, 2024
Member

@madolson madolson left a comment


LGTM, just want to wait for some more comprehensive tests.

@PingXie
Member

PingXie commented Jul 3, 2024

We actually hit the replication cycle assert rather consistently in the test run @madolson shared above. This is something that I haven't seen before.

*** Crash report found in valkey_2/log.txt ***
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
44713:M 02 Jul 2024 22:52:22.118 # === ASSERTION FAILED ===
44713:M 02 Jul 2024 22:52:22.118 # ==> cluster_legacy.c:5879 'primary->replicaof == ((void *)0)' is not true

@hpatro
Contributor Author

hpatro commented Jul 3, 2024

We actually hit the replication cycle assert rather consistently in the test run @madolson shared above. This is something that I haven't seen before.

*** Crash report found in valkey_2/log.txt ***
=== VALKEY BUG REPORT START: Cut & paste starting from here ===
44713:M 02 Jul 2024 22:52:22.118 # === ASSERTION FAILED ===
44713:M 02 Jul 2024 22:52:22.118 # ==> cluster_legacy.c:5879 'primary->replicaof == ((void *)0)' is not true

Yeah, this change invokes the API more frequently. Someone needs to deep dive further to understand how we reach this state.

@madolson
Member

madolson commented Jul 6, 2024

Yeah, this change invokes the API more frequently. Someone needs to deep dive further to understand how we reach this state.

I deep dived it with an AWS engineer last week, I have a partial fix and will post it early next week.

@hpatro hpatro added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Apr 17, 2025
@hpatro
Contributor Author

hpatro commented Apr 18, 2025

@PingXie / @enjoy-binbin Any further comments, or shall we merge it?

@PingXie
Member

PingXie commented Apr 18, 2025

Finally! Go for it 😄

@hpatro hpatro merged commit 05d8fd4 into valkey-io:unstable Apr 18, 2025
51 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to To be backported in Valkey 8.0 Apr 18, 2025
@github-project-automation github-project-automation bot moved this from In Progress to To be backported in Valkey 8.1 Apr 18, 2025
@hpatro hpatro mentioned this pull request Apr 21, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request Apr 21, 2025
Avoid shard id update of replica if not matching with primary shard id (valkey-io#573)

During cluster setup, the shard id gets established through extensions data propagation, and if the engine crashes/restarts while the reconciliation of the shard id is in progress, there is a possibility of a corrupted config file with a temporary shard id stored in the cluster. With this fix, a replica sharing a temporary shard id is ignored, allowing the cluster bus to converge on a single shard id for a primary and its replicas.

Signed-off-by: Harkrishn Patro <[email protected]>
murphyjacob4 pushed commits to murphyjacob4/valkey that referenced this pull request Apr 21 and Apr 23, 2025 (same commit message as above, with an additional sign-off by the backporter)
nitaicaro pushed commits to nitaicaro/valkey that referenced this pull request Apr 22, 2025 (same commit message as above)
@zuiderkwast zuiderkwast moved this from To be backported to 8.0.3 in Valkey 8.0 Apr 22, 2025
@zuiderkwast zuiderkwast moved this from To be backported to 8.1.1 in Valkey 8.1 Apr 22, 2025
zuiderkwast pushed commits that referenced this pull request Apr 23, 2025 (same commit message as above)
hwware pushed a commit to wuranxx/valkey that referenced this pull request Apr 24, 2025 (same commit message as above, with an additional sign-off by the backporter)

Labels

  • cluster
  • release-notes: This issue should get a line item in the release notes
  • run-extra-tests: Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: 8.0.3
Status: 8.1.1


5 participants