
Ignore stale gossip packets that arrive out of order#1777

Merged
madolson merged 5 commits into valkey-io:unstable from enjoy-binbin:fix_chain_loop
Apr 15, 2025

Conversation


@enjoy-binbin enjoy-binbin commented Feb 25, 2025

There is a failure in the daily:

```
=== ASSERTION FAILED ===
==> cluster_legacy.c:6588 'primary->replicaof == ((void *)0)' is not true
```

These are the logs:

```
- I am fd4318562665b4490ccc86e7f7988017cf960371 and I become a replica;
- 763c0167232dae95cdcc0a1568cd5368ac3b99f5 is the new primary
27867:M 24 Feb 2025 00:19:11.011 * Failover auth granted to 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () for epoch 9
27867:M 24 Feb 2025 00:19:11.039 * Configuration change detected. Reconfiguring myself as a replica of node 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () in shard c5f6b2a9c74cabd4d1e54d1130dc9cb9419bf76f
27867:S 24 Feb 2025 00:19:11.039 * Before turning into a replica, using my own primary parameters to synthesize a cached primary: I may be able to synchronize with the new primary with just a partial transfer.
27867:S 24 Feb 2025 00:19:11.039 * Connecting to PRIMARY 127.0.0.1:23654
27867:S 24 Feb 2025 00:19:11.039 * PRIMARY <-> REPLICA sync started

- here I got a stale message, but we still processed the packet, which causes this issue
27867:S 24 Feb 2025 00:19:11.040 * Ignore stale message from 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () in shard c5f6b2a9c74cabd4d1e54d1130dc9cb9419bf76f; gossip config epoch: 8, current config epoch: 9
27867:S 24 Feb 2025 00:19:11.040 * Node 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () is now a replica of node fd4318562665b4490ccc86e7f7988017cf960371 () in shard c5f6b2a9c74cabd4d1e54d1130dc9cb9419bf76f
```

We can see that we got a stale message, but we still processed it, which changed
the role and caused a primary-replica chain loop.

The reason is as follows (this text is copied from #651):

In some rare cases, slot config updates (via either PING/PONG or UPDATE)
can be delivered out of order, as illustrated below:

1. To keep the discussion simple, let's assume we have 2 shards, shard a
   and shard b. Let's also assume there are two slots in total with shard
   a owning slot 1 and shard b owning slot 2.
2. Shard a has two nodes: primary A and replica A*; shard b has primary
   B and replica B*.
3. A manual failover was initiated on A* and A* just wins the election.
4. A* announces to the world that it now owns slot 1 using PING messages.
   These PING messages are queued in the outgoing buffer to every other
   node in the cluster, namely, A, B, and B*.
5. Keep in mind that there is no ordering in the delivery of these PING
   messages. For the stale PING message to appear, we need the following
   events in the exact order as they are laid out.
a. An old PING message, sent before A* became the new primary, is still
   queued in A*'s outgoing buffer to A. This later becomes the stale
   message, which says A* is a replica of A. It is followed by A*'s
   election-winning announcement PING message.
b. B or B* processes A*'s election-winning announcement PING message
   and sets slots[1]=A*.
c. A sends a PING message to B (or B*). Since A hasn't learnt that A*
   won the election, it claims that it owns slot 1, but with a lower
   epoch than B has on slot 1. This leads to B sending an UPDATE to
   A directly, saying A* is the new owner of slot 1 with a higher epoch.
d. A receives the UPDATE from B and executes clusterUpdateSlotsConfigWith.
   A now realizes that it is a replica of A*, hence setting myself->replicaof
   to A*.
e. Finally, the pre-failover PING message queued up in A*'s outgoing
   buffer to A is delivered to A and processed, out of order.
f. This stale PING message creates the replication loop.
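The core of the fix is an epoch comparison: a gossip packet that carries a config epoch lower than the one we already hold for the sender must have been queued before the sender's last config change, so it can be dropped. A minimal sketch of that guard, with a hypothetical `node_view` struct and `gossip_is_stale` helper (the real check lives in cluster_legacy.c and considers more state):

```c
#include <stdbool.h>
#include <stdint.h>

/* What we currently believe about the sending node (hypothetical,
 * heavily simplified; the real clusterNode struct is much larger). */
typedef struct {
    uint64_t config_epoch; /* highest config epoch observed for this node */
} node_view;

/* A gossip packet is stale when the config epoch it carries is lower
 * than the epoch we already hold for the sender: it predates the
 * sender's most recent config change and must be ignored. */
static bool gossip_is_stale(const node_view *known, uint64_t claimed_epoch) {
    return claimed_epoch < known->config_epoch;
}
```

In the logs above, the local node already knows the sender at config epoch 9, while the packet claims epoch 8, so the packet is stale and its role information must not be applied.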

Closes #1015.



madolson commented Mar 3, 2025

The test failures seem concerning to me. I spent some time on Friday trying to understand what might be causing them to fail but didn't figure it out. I wonder if we should still process the packet, but check and refuse to set it as our primary if doing so causes a loop instead?
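The alternative suggested here, processing the packet but refusing a role change that closes a loop, could be sketched like this (hypothetical `node` struct and `would_create_loop` helper; this is not the code the PR landed):

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal hypothetical node: only the replication link matters here. */
typedef struct node {
    struct node *replicaof; /* NULL when the node is a primary */
} node;

/* Would setting me->replicaof = candidate create a primary-replica
 * chain loop? Walk candidate's replication chain; if it leads back to
 * me, refuse the assignment. Assumes the existing chain is loop-free,
 * so the walk terminates. */
static bool would_create_loop(const node *me, const node *candidate) {
    for (const node *n = candidate; n != NULL; n = n->replicaof) {
        if (n == me) return true;
    }
    return false;
}
```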

@enjoy-binbin (Member Author)

The test does these things:

  1. R 3 does a RESET, so it becomes an empty primary
  2. R 3 does a MEET and joins the cluster as a primary
  3. R 3 does a REPLICATE and becomes a replica
```
    test "New replica inherits migrating slot" {
        # Reset R3 to turn it into an empty node
        assert_equal [get_open_slots 3] "\[609->-$R1_id\]"
        assert_equal {OK} [R 3 CLUSTER RESET]
        assert_not_equal [get_open_slots 3] "\[609->-$R1_id\]"
        # Add R3 back as a replica of R0
        assert_equal {OK} [R 3 CLUSTER MEET [srv 0 "host"] [srv 0 "port"]]
        wait_for_role 0 master
        assert_equal {OK} [R 3 CLUSTER REPLICATE $R0_id]
        wait_for_role 3 slave
        # Validate that R3 now sees slot 609 open
        wait_for_slot_state 3 "\[609->-$R1_id\]"
    }
```

And the reason seems to be:

  1. In the other node's view, R 3 is sender_last_reported_as_primary, and its config epoch is, say, 3.
  2. R 3 is now a replica, so sender_claimed_config_epoch is its primary's config epoch, say 1.
  3. In the code, the other node will then always ignore the packet, and the test fails.
```
                    if (sender_claimed_primary && areInSameShard(sender_claimed_primary, sender)) {
                        /* `sender` was a primary and was in the same shard as its new primary */
                        if (sender->configEpoch > sender_claimed_config_epoch) {
                            serverLog(LL_NOTICE,
                                      "Ignore stale message from %.40s (%s) in shard %.40s;"
                                      " gossip config epoch: %llu, current config epoch: %llu",
                                      sender->name, sender->human_nodename, sender->shard_id,
                                      (unsigned long long)sender_claimed_config_epoch,
                                      (unsigned long long)sender->configEpoch);
                        } else {
```

So I think in this case, we should check the sender's primary's config epoch?
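The proposal in the last sentence can be sketched as follows (hypothetical helper name; the point is only which epoch to compare against, not the surrounding packet-handling code). When the sender now claims to be a replica, its own last-known epoch is primary-era and stale by construction, so the comparison should use the epoch we already hold for the sender's claimed primary:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical staleness test for the RESET/MEET/REPLICATE case above.
 * The sender was last reported as a primary (we hold its old epoch,
 * e.g. 3) but now gossips as a replica of a primary with epoch 1.
 * Comparing against the sender's own stale epoch (3 > 1) would wrongly
 * flag every such packet as stale; compare against the epoch we hold
 * for the sender's claimed primary instead. */
static bool replica_claim_is_stale(uint64_t known_primary_epoch,
                                   uint64_t claimed_primary_epoch) {
    return claimed_primary_epoch < known_primary_epoch;
}
```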

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Mar 4, 2025

codecov bot commented Mar 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.87%. Comparing base (7a2d50a) to head (6bbe4b0).
⚠️ Report is 370 commits behind head on unstable.

Additional details and impacted files
```
@@             Coverage Diff              @@
##           unstable    #1777      +/-   ##
============================================
- Coverage     70.97%   70.87%   -0.10%
============================================
  Files           123      123
  Lines         65651    65657       +6
============================================
- Hits          46593    46533      -60
- Misses        19058    19124      +66
```
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.04% <100.00%> (+0.24%) ⬆️

... and 17 files with indirect coverage changes


enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Mar 4, 2025
This is somehow related to valkey-io#974 and valkey-io#1777. When the epoch changes,
we should save the configuration file and broadcast a PONG as much
as possible.

For example, if a primary goes down after bumping the epoch, its replicas
may initiate a failover, but the other primaries may refuse to vote
because the epoch of the replica has not been updated.

Or, for example, if for some reason we bump the epoch and it is not
updated in time across the cluster, it may affect the judgment of
message staleness.

Signed-off-by: Binbin <[email protected]>
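The behavior this commit message describes could be sketched as follows (hypothetical `cluster_state` struct and flag names; the real code persists nodes.conf and broadcasts a PONG rather than setting flags):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state: an epoch change should trigger both persisting
 * the configuration file and broadcasting a PONG to peers. */
typedef struct {
    uint64_t config_epoch;
    bool save_pending;      /* nodes.conf needs to be written */
    bool broadcast_pending; /* a PONG broadcast is due */
} cluster_state;

/* Bump the epoch and mark both follow-up actions; epochs never move
 * backwards, so a smaller or equal value is ignored. */
static void bump_config_epoch(cluster_state *st, uint64_t new_epoch) {
    if (new_epoch <= st->config_epoch) return;
    st->config_epoch = new_epoch;
    st->save_pending = true;
    st->broadcast_pending = true;
}
```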
enjoy-binbin added a commit that referenced this pull request Mar 11, 2025
This is somehow related to #974 and #1777. When the epoch changes,
we should save the configuration file and broadcast a PONG as much
as possible.

For example, if a primary goes down after bumping the epoch, its replicas
may initiate a failover, but the other primaries may refuse to vote
because the epoch of the replica has not been updated.

Or, for example, if for some reason we bump the epoch and it is not
updated in time across the cluster, it may affect the judgment of
message staleness.

These broadcasts are expensive in large clusters, but none of these
seem high frequency so it should be fine.

---------

Signed-off-by: Binbin <[email protected]>
zuiderkwast pushed a commit that referenced this pull request Mar 18, 2025
xbasel pushed a commit to xbasel/valkey that referenced this pull request Mar 27, 2025
xbasel pushed a commit to xbasel/valkey that referenced this pull request Mar 27, 2025
zarkash-aws pushed a commit to zarkash-aws/valkey that referenced this pull request Apr 6, 2025
murphyjacob4 pushed a commit to enjoy-binbin/valkey that referenced this pull request Apr 13, 2025
@zuiderkwast zuiderkwast added the release-notes This issue should get a line item in the release notes label Apr 15, 2025
@madolson madolson merged commit d1053ac into valkey-io:unstable Apr 15, 2025
58 checks passed
@github-project-automation github-project-automation bot moved this to To be backported in Valkey 8.0 Apr 15, 2025
@github-project-automation github-project-automation bot moved this to To be backported in Valkey 8.1 Apr 15, 2025
@enjoy-binbin enjoy-binbin deleted the fix_chain_loop branch April 15, 2025 19:40
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request Apr 18, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request Apr 18, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request Apr 18, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request Apr 21, 2025
nitaicaro pushed a commit to nitaicaro/valkey that referenced this pull request Apr 22, 2025
nitaicaro pushed a commit to nitaicaro/valkey that referenced this pull request Apr 22, 2025
nitaicaro pushed a commit to nitaicaro/valkey that referenced this pull request Apr 22, 2025
@zuiderkwast zuiderkwast moved this from To be backported to 8.0.3 in Valkey 8.0 Apr 22, 2025
@zuiderkwast zuiderkwast moved this from To be backported to 8.1.1 in Valkey 8.1 Apr 22, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request Apr 23, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request Apr 23, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request Apr 23, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request Apr 23, 2025
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request Apr 23, 2025
zuiderkwast pushed a commit that referenced this pull request Apr 23, 2025
There is a failure in the daily:
```
=== ASSERTION FAILED ===
==> cluster_legacy.c:6588 'primary->replicaof == ((void *)0)' is not true
```

This is the logs:
```
- i am fd4318562665b4490ccc86e7f7988017cf960371 and myself become a replica,
- 63c0167232dae95cdcc0a1568cd5368ac3b99f5 is the new primary
27867:M 24 Feb 2025 00:19:11.011 * Failover auth granted to 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () for epoch 9
27867:M 24 Feb 2025 00:19:11.039 * Configuration change detected. Reconfiguring myself as a replica of node 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () in shard c5f6b2a9c74cabd4d1e54d1130dc9cb9419bf76f
27867:S 24 Feb 2025 00:19:11.039 * Before turning into a replica, using my own primary parameters to synthesize a cached primary: I may be able to synchronize with the new primary with just a partial transfer.
27867:S 24 Feb 2025 00:19:11.039 * Connecting to PRIMARY 127.0.0.1:23654
27867:S 24 Feb 2025 00:19:11.039 * PRIMARY <-> REPLICA sync started

- in here myself got an stale message, but we still process the packet and cause this issue
27867:S 24 Feb 2025 00:19:11.040 * Ignore stale message from 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () in shard c5f6b2a9c74cabd4d1e54d1130dc9cb9419bf76f; gossip config epoch: 8, current config epoch: 9
27867:S 24 Feb 2025 00:19:11.040 * Node 763c0167232dae95cdcc0a1568cd5368ac3b99f5 () is now a replica of node fd4318562665b4490ccc86e7f7988017cf960371 () in shard c5f6b2a9c74cabd4d1e54d1130dc9cb9419bf76f
```

We can see myself got a stale message, but we still process it, and
changed
the role and cause a primary replica chain loop.

The reason is that, this text is copy from
#651.

In some rare case, slot config updates (via either PING/PONG or UPDATE)
can be delivered out of order as illustrated below:
```
1. To keep the discussion simple, let's assume we have 2 shards, shard a
   and shard b. Let's also assume there are two slots in total with shard
   a owning slot 1 and shard b owning slot 2.
2. Shard a has two nodes: primary A and replica A*; shard b has primary
   B and replica B*.
3. A manual failover was initiated on A* and A* has just won the election.
4. A* announces to the world that it now owns slot 1 using PING messages.
   These PING messages are queued in the outgoing buffer to every other
   node in the cluster, namely, A, B, and B*.
5. Keep in mind that there is no ordering in the delivery of these PING
   messages. For the stale PING message to appear, we need the following
   events in the exact order as they are laid out.
a. An old PING message before A* becomes the new primary is still queued
   in A*'s outgoing buffer to A. This later becomes the stale message,
   which says A* is a replica of A. It is followed by A*'s election
   winning announcement PING message.
b. B or B* processes A*'s election-winning announcement PING message
   and sets slots[1]=A*.
c. A sends a PING message to B (or B*). Since A hasn't learnt that A*
   wins the election, it claims that it owns slot 1 but with a lower
   epoch than B has on slot 1. This leads to B sending an UPDATE to
   A directly saying A* is the new owner of slot 1 with a higher epoch.
d. A receives the UPDATE from B and executes clusterUpdateSlotsConfigWith.
   A now realizes that it is a replica of A* hence setting myself->replicaof
   to A*.
e. Finally, the pre-failover PING message queued up in A*'s outgoing
   buffer to A is delivered and processed, out of order though, to A.
f. This stale PING message creates the replication loop.
```

Closes #1015.

---------

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
murphyjacob4 pushed a commit to murphyjacob4/valkey that referenced this pull request Apr 23, 2025
zuiderkwast pushed a commit that referenced this pull request Apr 23, 2025
hwware pushed a commit to wuranxx/valkey that referenced this pull request Apr 24, 2025

Labels

cluster
release-notes (This issue should get a line item in the release notes)
run-extra-tests (Run extra tests on this PR; runs all tests from daily except valgrind and RESP)

Projects

Status: 8.0.3
Status: 8.1.1

Development

Successfully merging this pull request may close these issues.

Stale PONG message causes incorrect replicaof updates leading to replicaof loops
