
Fix cluster slot migration test#10495

Merged
madolson merged 1 commit into redis:unstable from enjoy-binbin:fix_cluster_test on Mar 31, 2022

Conversation

@enjoy-binbin
Contributor

There are three timing issues:

1. `[ERR] Not all 16384 slots are covered by nodes.`
   We can see that new_owner doesn't have the old_owner's slot information.
   I guess the cluster is unstable after SETSLOT; using wait_for_condition
   with `--cluster check` makes sure the check in reshard won't fail.
```
[exception]: Executing test client: >>> Performing Cluster Check (using node 127.0.0.1:21586)
M: def6b00d3ab27bdfde1e0c92a159022b7ed0006b 127.0.0.1:21586
   slots:[12182] (1 slots) master
M: 045c64c41bb18fc7f69e7db0ebd4737bc88dd865 127.0.0.1:21587
   slots:[10923-12181],[12183-16383] (5460 slots) master
M: b36176d5d999d43c1d385adab10c097418dc7c23 127.0.0.1:21589
   slots:[0-5460] (5461 slots) master
```
2. `clusterManagerMoveSlot failed: CLUSTERDOWN The cluster is down`
   During the reshard, clusterManagerMoveSlot may fail in MIGRATE
   because the cluster is down.

3. `Expected 'MOVED 12182 xxx' to be equal to 'CLUSTERDOWN The cluster is down'`
   After the reshard, the cluster is unstable, so we also need to wait for it.

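The retry pattern behind the fix can be sketched outside the Tcl harness. Below is a minimal Python analogue, where `wait_for_condition` mirrors the test helper and `FakeClusterCheck` is an illustrative stand-in for `redis-cli --cluster check` (neither is the real test code):

```python
import time

def wait_for_condition(max_tries, delay_s, cond):
    """Retry cond() until it returns True, like the Tcl test helper."""
    for _ in range(max_tries):
        if cond():
            return True
        time.sleep(delay_s)
    raise RuntimeError("Not all 16384 slots are covered by nodes")

class FakeClusterCheck:
    """Illustrative stand-in for `redis-cli --cluster check`: it fails
    for the first few polls, then succeeds once gossip has converged."""
    def __init__(self, ready_after):
        self.ready_after = ready_after
        self.polls = 0
    def __call__(self):
        self.polls += 1
        return self.polls >= self.ready_after

check = FakeClusterCheck(ready_after=3)
print(wait_for_condition(50, 0.01, check))  # True after 3 polls
```

The point is that a single check right after SETSLOT can observe a transient gap in slot coverage; retrying until the check passes removes the race.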
@enjoy-binbin
Contributor Author

The first issue: https://github.com/redis/redis/runs/5746629116?check_suite_focus=true#step:6:4259

```
[exception]: Executing test client: >>> Performing Cluster Check (using node 127.0.0.1:21586)
M: def6b00d3ab27bdfde1e0c92a159022b7ed0006b 127.0.0.1:21586
   slots:[12182] (1 slots) master
M: 045c64c41bb18fc7f69e7db0ebd4737bc88dd865 127.0.0.1:21587
   slots:[10923-12181],[12183-16383] (5460 slots) master
M: b36176d5d999d43c1d385adab10c097418dc7c23 127.0.0.1:21589
   slots:[0-5460] (5461 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[ERR] Not all 16384 slots are covered by nodes.
```

The other two issues were found in local tests; the tests were run locally many times without failure.


@zuiderkwast left a comment


Thanks for fixing my mess! ^_^

@madolson

madolson commented Mar 30, 2022

@madolson madolson merged commit a3075ca into redis:unstable Mar 31, 2022
@enjoy-binbin enjoy-binbin deleted the fix_cluster_test branch March 31, 2022 03:15
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 19, 2022
A timing issue like this was reported in freebsd daily CI:
```
*** [err]: Sanity test push cmd after resharding in tests/unit/cluster/cli.tcl
Expected 'CLUSTERDOWN The cluster is down' to match '*MOVED*'
```

We additionally wait for each node to reach a consensus on the cluster
state in wait_for_condition to avoid the cluster down error.

The fix is just like redis#10495; quoting madolson's comment:
Cluster check just verifies that the config state is self-consistent,
while waiting for cluster_state to be okay is an independent check that all
the nodes actually believe each other are healthy.

At the same time, I noticed that unit/moduleapi/cluster.tcl has the exact
same test and may have the same problem, so it was also modified.
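The consensus wait described above can be sketched the same way: poll every node's CLUSTER INFO until each one reports `cluster_state:ok`. A minimal Python sketch with mocked nodes (`FakeNode` and `parse_info` are illustrative; the real test parses redis-cli output in Tcl):

```python
import time

def parse_info(raw):
    """Parse CLUSTER INFO-style "key:value" lines into a dict."""
    return dict(line.split(":", 1) for line in raw.splitlines() if ":" in line)

class FakeNode:
    """Mock node whose cluster_state flips to ok after a few polls."""
    def __init__(self, ready_after):
        self.ready_after = ready_after
        self.polls = 0
    def cluster_info(self):
        self.polls += 1
        state = "ok" if self.polls >= self.ready_after else "fail"
        return f"cluster_state:{state}\ncluster_slots_assigned:16384"

def wait_for_cluster_state_ok(nodes, max_tries=100, delay_s=0.0):
    """Return True only once every node reports cluster_state:ok."""
    for _ in range(max_tries):
        if all(parse_info(n.cluster_info())["cluster_state"] == "ok"
               for n in nodes):
            return True
        time.sleep(delay_s)
    raise RuntimeError("CLUSTERDOWN The cluster is down")

nodes = [FakeNode(1), FakeNode(4), FakeNode(2)]
print(wait_for_cluster_state_ok(nodes))
```

This is the "independent check" from the quoted comment: a passing cluster check means each node's config is self-consistent, while this loop confirms every node actually considers the cluster healthy.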
madolson pushed a commit that referenced this pull request Jul 19, 2022
oranagra pushed a commit that referenced this pull request Dec 12, 2022
(cherry picked from commit 5ce64ab)
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
Fix three timing issues in the test
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023