Avoid shard id update of replica if not matching with primary shard id #573
hpatro merged 5 commits into valkey-io:unstable from
Conversation
Force-pushed 7cabc57 to 1714613
I am not sure I understand the event sequence that leads to a corrupt state. Can you elaborate? The change makes sense to me. Essentially, with this change there is now an order in which the shard id is updated within a shard: primary first, replicas next. BTW, this change also requires us to sequence the assignment of the primary before the invocation of
There are some timeout failures in the test pass though; that is a bit surprising.
enjoy-binbin left a comment
With the top comment picture, I think I now understand the case. The changes LGTM. BTW, the tests seem to keep failing.
Codecov Report. Attention: Patch coverage is
Additional details and impacted files:
@@             Coverage Diff              @@
##           unstable     #573      +/-  ##
============================================
+ Coverage     70.97%   71.06%    +0.09%
============================================
  Files           123      123
  Lines         65937    65937
============================================
+ Hits          46797    46857       +60
+ Misses        19140    19080       -60
From further investigation, the timeout failure comes from an infinite while loop within this block. It looks like there can be a temporary invalid state in the cluster where nodes point to each other as primary/replica. We could take two approaches to this infinite loop:
We have had several of these issues in the past, and I think we always tried to figure them out case by case. Maybe we should use this chance to add a helper method for setting the replicaof so that we check for loops.
And if we detect a loop, do we crash?
Maybe we debug-assert crash (as in, only crash during a test). For normal production, maybe we ignore it and wait for the other node to update us.
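The helper discussed above can be sketched roughly as follows. This is a minimal illustration, not the actual Valkey implementation: the `clusterNode` struct here is a stand-in with only the one field the check needs, and the function names (`wouldCreateReplicationCycle`, `setReplicaOfSafe`) are hypothetical. The walk up the primary chain terminates as long as every assignment goes through the helper, since the existing chain is then guaranteed acyclic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cluster node: only the
 * field relevant to cycle detection is modeled. */
typedef struct clusterNode {
    struct clusterNode *replicaof; /* primary this node replicates from, or NULL */
} clusterNode;

/* Return true if making `node` a replica of `primary` would close a
 * replication cycle. Walk the primary chain starting from `primary`;
 * if we ever reach `node`, the assignment would create a loop. */
bool wouldCreateReplicationCycle(clusterNode *node, clusterNode *primary) {
    for (clusterNode *p = primary; p != NULL; p = p->replicaof) {
        if (p == node) return true;
    }
    return false;
}

/* Helper that refuses the assignment (rather than crashing) when a
 * loop is detected, leaving the topology to be corrected by a later
 * update from the other node, as suggested in the discussion. */
bool setReplicaOfSafe(clusterNode *node, clusterNode *primary) {
    if (wouldCreateReplicationCycle(node, primary)) return false;
    node->replicaof = primary;
    return true;
}
```

Under this scheme a self-loop (a node made a replica of itself) or a mutual A↔B loop is rejected at assignment time instead of spinning forever in the walk.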
PingXie left a comment
LGTM overall, but it would be great if you could provide some more context in the code comment (left review feedback too).
debugAssert is reasonable, but I don't think we should crash the server just because there is a loop. In fact, we already have logic to break the loop. I will suggest a fix in #609
@hpatro Sorry for taking so long to circle back on this. The DCO was failing last time and I forgot to ping you to update it. I think this is good to merge otherwise.
Signed-off-by: Harkrishn Patro <[email protected]>
Force-pushed 69a7d96 to 770cfa9
@madolson Had to force push. PTAL. |
madolson left a comment
LGTM, just want to wait for some more comprehensive tests.
We actually hit the replication cycle assert rather consistently in the test run @madolson shared above. This is something I haven't seen before.
Yeah, this change invokes the API more frequently. Someone needs to dig deeper to understand how we reach this state.
I dug into it with an AWS engineer last week. I have a partial fix and will post it early next week.
@PingXie / @enjoy-binbin Any further comments, or shall we merge it?
Finally! Go for it 😄 |
valkey-io#573) During cluster setup, the shard id gets established through extensions data propagation, and if the engine crashes/restarts while the reconciliation of the shard id is in progress, there is a possibility of a corrupted config file with a temporary shard id stored in the cluster. With this fix, a replica sharing a temporary shard id is ignored, which allows the cluster bus to converge on a single shard id for a primary and its replicas. Signed-off-by: Harkrishn Patro <[email protected]>
During cluster setup, the shard id gets established through extensions data propagation. If the engine crashes/restarts while the reconciliation of the shard id is in progress, the config file can be left corrupted, which leads to failure of the engine restart.
Scenario:
Let's say there are two nodes in a cluster, Node A and Node B, and all admin operations are performed on Node B. Node A and Node B finish the handshake but have not yet shared the extensions information. Node B is made a replica of Node A. As part of Node B sharing the slaveof information, it also shares its temporary shard id. During regular packet processing on Node A, while handling the replication information, the shard id of Node A gets applied to Node B. Then, during extensions processing on Node A, the shard id passed by Node B is applied, which diverges from the shard id of Node A. A crash/restart after that leaves the cluster configuration file in an unrecoverable corrupted state.
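The guard the fix describes can be sketched as follows. This is a simplified illustration under stated assumptions, not Valkey's actual code: the `node` struct and the function `maybeUpdateShardId` are hypothetical, and the 40-character shard id length mirrors the hex id convention but is an assumption here. The idea is simply that a shard id gossiped for a replica is ignored unless it agrees with the shard id already established on that replica's primary.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SHARD_ID_LEN 40 /* assumed length; the real constant lives in cluster.h */

/* Hypothetical, simplified node model with just the fields the check needs. */
typedef struct node {
    char shard_id[SHARD_ID_LEN + 1];
    struct node *primary; /* NULL if this node is itself a primary */
} node;

/* Accept a gossiped shard id for `n` only when `n` is a primary, or when
 * the id matches the shard id already established on `n`'s primary.
 * Otherwise the (possibly temporary) id is ignored, letting the cluster
 * bus converge on the primary's shard id for the whole shard. */
bool maybeUpdateShardId(node *n, const char *gossiped_shard_id) {
    if (n->primary != NULL &&
        strcmp(gossiped_shard_id, n->primary->shard_id) != 0) {
        return false; /* replica disagrees with its primary: ignore */
    }
    snprintf(n->shard_id, sizeof(n->shard_id), "%s", gossiped_shard_id);
    return true;
}
```

With this ordering, the primary's shard id is the single source of truth and a replica can never persist a divergent temporary id into the config file.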