Fix potential infinite loop in clusterNodeGetPrimary#651
Merged

PingXie merged 2 commits into valkey-io:unstable on Jun 14, 2024
Conversation
The original function could enter an infinite loop if there was a cycle in the replica chain. This change adds a check to break the loop if the node's replicaof eventually points back to itself. Additionally, a debugAssert has been added to ensure the replica/primary chain forms a tree structure and does not have any cycles. Signed-off-by: Ping Xie <[email protected]>
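The fix described above can be sketched as follows. This is a minimal, hypothetical reconstruction, not the actual Valkey source: the `clusterNode` struct is reduced to the one field the PR discusses (`replicaof`), and the cycle check simply bails out when the walk returns to the starting node.

```c
#include <stddef.h>

/* Minimal stand-in for Valkey's cluster node struct; only the field
 * relevant to this PR is modeled (hypothetical, for illustration). */
typedef struct clusterNode {
    struct clusterNode *replicaof; /* NULL when this node is a primary */
} clusterNode;

/* Walk up the replicaof chain to find the primary. A sketch of the fix:
 * without the cycle check, a corrupted chain (A -> B -> A) would make
 * this loop forever. */
clusterNode *clusterNodeGetPrimary(clusterNode *node) {
    clusterNode *start = node;
    while (node->replicaof != NULL) {
        node = node->replicaof;
        if (node == start) break; /* cycle detected: stop walking */
    }
    return node;
}
```

On a well-formed chain this returns the root primary; on a cycle it terminates and returns the starting node instead of spinning, which is the behavior the added `debugAssert` is meant to catch in debug builds.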
a1cba28 to 61aa3a3
Codecov report: all modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff            @@
##           unstable     #651    +/- ##
========================================
+ Coverage     70.19%   70.22%   +0.03%
========================================
  Files           110      110
  Lines         60049    60051       +2
========================================
+ Hits          42149    42170      +21
+ Misses        17900    17881      -19
```
enjoy-binbin approved these changes on Jun 14, 2024
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request on Feb 25, 2025
There is a failure in the daily run:

```
=== ASSERTION FAILED ===
==> cluster_legacy.c:6588 'primary->replicaof == ((void *)0)' is not true
```

These are the logs (annotations inline):

```
- i am fd4318562665b4490ccc86e7f7988017cf960371 and myself become a replica,
- 63c0167232dae95cdcc0a1568cd5368ac3b99f5 is the new primary
27867:M 24 Feb 2025 00:19:11.011 * Failover auth granted to 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () for epoch 9
27867:M 24 Feb 2025 00:19:11.039 * Configuration change detected. Reconfiguring myself as a replica of node 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () in shard c5f6b2a9c74cabd4d1e54d1130dc9cb9419bf76f
27867:S 24 Feb 2025 00:19:11.039 * Before turning into a replica, using my own primary parameters to synthesize a cached primary: I may be able to synchronize with the new primary with just a partial transfer.
27867:S 24 Feb 2025 00:19:11.039 * Connecting to PRIMARY 127.0.0.1:23654
27867:S 24 Feb 2025 00:19:11.039 * PRIMARY <-> REPLICA sync started
- in here myself got a stale message, but we still process the packet and cause this issue
27867:S 24 Feb 2025 00:19:11.040 * Ignore stale message from 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () in shard c5f6b2a9c74cabd4d1e54d1130dc9cb9419bf76f; gossip config epoch: 8, current config epoch: 9
27867:S 24 Feb 2025 00:19:11.040 * Node 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () is now a replica of node fd4318562665b4490ccc86e7f7988017cf960371 () in shard c5f6b2a9c74cabd4d1e54d1130dc9cb9419bf76f
```

We can see that this node received a stale message, but still processed it, changed its role, and created a primary-replica chain loop. The reason (the following text is copied from valkey-io#651) is that in some rare cases, slot config updates (via either PING/PONG or UPDATE) can be delivered out of order, as illustrated below:

1. To keep the discussion simple, let's assume we have 2 shards, shard a and shard b. Let's also assume there are two slots in total, with shard a owning slot 1 and shard b owning slot 2.
2. Shard a has two nodes: primary A and replica A*; shard b has primary B and replica B*.
3. A manual failover was initiated on A*, and A* just won the election.
4. A* announces to the world that it now owns slot 1 using PING messages. These PING messages are queued in the outgoing buffer to every other node in the cluster, namely A, B, and B*.
5. Keep in mind that there is no ordering in the delivery of these PING messages. For the stale PING message to appear, we need the following events in the exact order laid out:
   a. An old PING message from before A* became the new primary is still queued in A*'s outgoing buffer to A. This later becomes the stale message, which says A* is a replica of A. It is followed by A*'s election-winning announcement PING message.
   b. B or B* processes A*'s election-winning announcement PING message and sets slots[1]=A*.
   c. A sends a PING message to B (or B*). Since A hasn't learnt that A* won the election, it claims that it owns slot 1, but with a lower epoch than B has on slot 1. This leads to B sending an UPDATE to A directly, saying A* is the new owner of slot 1 with a higher epoch.
   d. A receives the UPDATE from B and executes clusterUpdateSlotsConfigWith. A now realizes that it is a replica of A*, hence setting myself->replicaof to A*.
   e. Finally, the pre-failover PING message queued up in A*'s outgoing buffer to A is delivered and processed by A, out of order.
   f. This stale PING message creates the replication loop.

Signed-off-by: Binbin <[email protected]>
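The guard implied by the log line "Ignore stale message ...; gossip config epoch: 8, current config epoch: 9" can be sketched as a simple epoch comparison. This is a hypothetical illustration of the principle, not Valkey's actual function: a gossip claim carrying a config epoch lower than the one we already recorded must be dropped before any role change is applied.

```c
#include <stdint.h>

/* Sketch of a stale-gossip check (hypothetical helper, for illustration):
 * config epochs are monotonically increasing per shard, so a sender whose
 * announced epoch is below our recorded epoch is reporting an old view
 * (e.g. a pre-failover PING delivered out of order) and must be ignored. */
int gossip_is_stale(uint64_t gossip_config_epoch,
                    uint64_t current_config_epoch) {
    return gossip_config_epoch < current_config_epoch;
}
```

In the failure above, the packet carried gossip config epoch 8 while the node already knew epoch 9, so the check flags it as stale; the bug was that the role change was still applied after logging the warning.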
madolson pushed a commit that referenced this pull request on Apr 15, 2025
(Commit message as in the Feb 25, 2025 entry above.) Closes #1015. Signed-off-by: Binbin <[email protected]>
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request on Apr 18, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request on Apr 18, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request on Apr 18, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request on Apr 21, 2025
nitaicaro pushed a commit to nitaicaro/valkey that referenced this pull request on Apr 22, 2025
nitaicaro pushed a commit to nitaicaro/valkey that referenced this pull request on Apr 22, 2025
nitaicaro pushed a commit to nitaicaro/valkey that referenced this pull request on Apr 22, 2025
murphyjacob4
pushed a commit
to murphyjacob4/valkey
that referenced
this pull request
Apr 23, 2025
murphyjacob4
pushed a commit
to murphyjacob4/valkey
that referenced
this pull request
Apr 23, 2025
murphyjacob4
pushed a commit
to murphyjacob4/valkey
that referenced
this pull request
Apr 23, 2025
murphyjacob4
pushed a commit
to murphyjacob4/valkey
that referenced
this pull request
Apr 23, 2025
murphyjacob4
pushed a commit
to murphyjacob4/valkey
that referenced
this pull request
Apr 23, 2025
zuiderkwast
pushed a commit
that referenced
this pull request
Apr 23, 2025
murphyjacob4
pushed a commit
to murphyjacob4/valkey
that referenced
this pull request
Apr 23, 2025
zuiderkwast
pushed a commit
that referenced
this pull request
Apr 23, 2025
hwware
pushed a commit
to wuranxx/valkey
that referenced
this pull request
Apr 24, 2025
There is a failure in the daily:

```
=== ASSERTION FAILED ===
==> cluster_legacy.c:6588 'primary->replicaof == ((void *)0)' is not true
```

These are the logs:

```
- I am fd4318562665b4490ccc86e7f7988017cf960371 and I become a replica;
  763c0167232dae95cdcc0a1568cd5368ac3b99f5 is the new primary
27867:M 24 Feb 2025 00:19:11.011 * Failover auth granted to 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () for epoch 9
27867:M 24 Feb 2025 00:19:11.039 * Configuration change detected. Reconfiguring myself as a replica of node 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () in shard c5f6b2a9c74cabd4d1e54d1130dc9cb9419bf76f
27867:S 24 Feb 2025 00:19:11.039 * Before turning into a replica, using my own primary parameters to synthesize a cached primary: I may be able to synchronize with the new primary with just a partial transfer.
27867:S 24 Feb 2025 00:19:11.039 * Connecting to PRIMARY 127.0.0.1:23654
27867:S 24 Feb 2025 00:19:11.039 * PRIMARY <-> REPLICA sync started
- Here the node receives a stale message, but we still process the packet, which causes this issue
27867:S 24 Feb 2025 00:19:11.040 * Ignore stale message from 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () in shard c5f6b2a9c74cabd4d1e54d1130dc9cb9419bf76f; gossip config epoch: 8, current config epoch: 9
27867:S 24 Feb 2025 00:19:11.040 * Node 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () is now a replica of node fd4318562665b4490ccc86e7f7988017cf960371 () in shard c5f6b2a9c74cabd4d1e54d1130dc9cb9419bf76f
```

We can see that the node received a stale message but still processed it, changed its role, and ended up with a primary-replica chain loop.

The reason is the following (this text is copied from valkey-io#651). In some rare cases, slot config updates (via either PING/PONG or UPDATE) can be delivered out of order, as illustrated below:

```
1. To keep the discussion simple, let's assume we have 2 shards, shard a
   and shard b. Let's also assume there are two slots in total, with shard a
   owning slot 1 and shard b owning slot 2.
2. Shard a has two nodes: primary A and replica A*; shard b has primary B
   and replica B*.
3. A manual failover was initiated on A* and A* just wins the election.
4. A* announces to the world that it now owns slot 1 using PING messages.
   These PING messages are queued in the outgoing buffer to every other
   node in the cluster, namely, A, B, and B*.
5. Keep in mind that there is no ordering in the delivery of these PING
   messages. For the stale PING message to appear, we need the following
   events in the exact order as they are laid out.
   a. An old PING message, sent before A* became the new primary, is still
      queued in A*'s outgoing buffer to A. This later becomes the stale
      message, which says A* is a replica of A. It is followed by A*'s
      election-winning announcement PING message.
   b. B or B* processes A*'s election-winning announcement PING message
      and sets slots[1]=A*.
   c. A sends a PING message to B (or B*). Since A hasn't learnt that A*
      won the election, it claims that it owns slot 1, but with a lower
      epoch than B has on slot 1. This leads to B sending an UPDATE to A
      directly, saying A* is the new owner of slot 1 with a higher epoch.
   d. A receives the UPDATE from B and executes clusterUpdateSlotsConfigWith.
      A now realizes that it is a replica of A*, hence setting
      myself->replicaof to A*.
   e. Finally, the pre-failover PING message queued up in A*'s outgoing
      buffer to A is delivered and processed, out of order though, by A.
   f. This stale PING message creates the replication loop.
```

Closes valkey-io#1015.

---------

Signed-off-by: Binbin <[email protected]>
Signed-off-by: hwware <[email protected]>
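The "Ignore stale message" log line above shows the staleness guard in action: a claim about a node that carries a config epoch lower than the one already recorded locally must predate the current view and is dropped. A minimal sketch of that comparison (a simplified illustration with a hypothetical helper name; the real check lives in `clusterProcessPacket` in `cluster_legacy.c`):

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether an incoming claim about a node is stale.
 * Config epochs only move forward, so a claim carrying a config
 * epoch lower than the one we already recorded for that node
 * predates our current view and should be ignored.
 * (Hypothetical helper name, for illustration only.) */
static bool is_stale_claim(uint64_t claimed_config_epoch,
                           uint64_t known_config_epoch) {
    return claimed_config_epoch < known_config_epoch;
}
```

In the logged failure the gossip config epoch was 8 while the current config epoch was 9, so the message is correctly reported as stale; the bug was that parts of it were still acted upon afterwards.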
hieu2102
added a commit
to hieu2102/valkey
that referenced
this pull request
Nov 12, 2025
Signed-off-by: hieu2102 <[email protected]>
rjd15372
pushed a commit
that referenced
this pull request
Nov 13, 2025
Issue #609 is marked as completed (fixed by #651); however, the fix is only present in versions 8.0 and above. This PR backports the fix into the 7.2 branch.

Signed-off-by: hieu2102 <[email protected]>
The original function could enter an infinite loop if there was a cycle in the replica chain. This change adds a check that breaks out of the loop once the `replicaof` chain points back to the starting node. Additionally, a `debugAssert` has been added to ensure that the replica/primary chain forms a tree structure and does not contain any cycles.
Fix #609
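The shape of the fix can be sketched as follows. This is a simplified illustration, not the exact code from `src/cluster_legacy.c`: the `clusterNode` struct is reduced to the one field involved, and a plain `assert` stands in for the project's `debugAssert`:

```c
#include <assert.h>
#include <stddef.h>

/* Reduced clusterNode: only the field relevant to primary lookup. */
typedef struct clusterNode {
    struct clusterNode *replicaof;
} clusterNode;

/* Walk the replicaof chain up to the topmost primary. If the chain
 * ever leads back to the starting node, a cycle exists; break out
 * and return the starting node instead of looping forever. */
clusterNode *clusterNodeGetPrimary(clusterNode *node) {
    clusterNode *primary = node;
    while (primary->replicaof != NULL) {
        primary = primary->replicaof;
        if (primary == node) break; /* cycle detected: bail out */
    }
    /* In a well-formed cluster the chain is a tree, so the node we
     * stop at has no primary of its own (unless we bailed on a cycle). */
    assert(primary->replicaof == NULL || primary == node);
    return primary;
}
```

With a healthy chain `B -> A` the walk terminates at `A` as before; with a corrupted chain `A <-> B` the old code spun forever, while the guarded version returns the starting node, and the assertion makes the tree invariant explicit in debug builds.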