
Fix cluster slot migration test#10495

Merged
madolson merged 1 commit into redis:unstable from enjoy-binbin:fix_cluster_test on Mar 31, 2022

Conversation

@enjoy-binbin
Contributor

There are three timing issues:

1. `[ERR] Not all 16384 slots are covered by nodes.`
   We can see that new_owner doesn't have the old_owner's slot information.
   I guess the cluster is unstable after SETSLOT; using wait_for_condition
   with `--cluster check` makes sure the check in reshard won't fail.
```
[exception]: Executing test client: >>> Performing Cluster Check (using node 127.0.0.1:21586)
M: def6b00d3ab27bdfde1e0c92a159022b7ed0006b 127.0.0.1:21586
   slots:[12182] (1 slots) master
M: 045c64c41bb18fc7f69e7db0ebd4737bc88dd865 127.0.0.1:21587
   slots:[10923-12181],[12183-16383] (5460 slots) master
M: b36176d5d999d43c1d385adab10c097418dc7c23 127.0.0.1:21589
   slots:[0-5460] (5461 slots) master
```
2. `clusterManagerMoveSlot failed: CLUSTERDOWN The cluster is down`
   During the reshard, clusterManagerMoveSlot may fail in MIGRATE
   because the cluster is down.

3. `Expected 'MOVED 12182 xxx' to be equal to 'CLUSTERDOWN The cluster is down'`
   After the reshard, the cluster is unstable, so we also need to wait for it.

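The retry pattern behind the fix can be sketched outside the Tcl harness. Below is a minimal Python analogue, where `wait_for_condition` mirrors the test helper and `FakeClusterCheck` is an illustrative stand-in for `redis-cli --cluster check` (neither is the real test code):

```python
import time

def wait_for_condition(max_tries, delay_s, cond):
    """Retry cond() until it returns True, like the Tcl test helper."""
    for _ in range(max_tries):
        if cond():
            return True
        time.sleep(delay_s)
    raise RuntimeError("Not all 16384 slots are covered by nodes")

class FakeClusterCheck:
    """Illustrative stand-in for `redis-cli --cluster check`: it fails
    for the first few polls, then succeeds once gossip has converged."""
    def __init__(self, ready_after):
        self.ready_after = ready_after
        self.polls = 0
    def __call__(self):
        self.polls += 1
        return self.polls >= self.ready_after

check = FakeClusterCheck(ready_after=3)
print(wait_for_condition(50, 0.01, check))  # True after 3 polls
```

The point is that a single check right after SETSLOT can observe a transient gap in slot coverage; retrying until the check passes removes the race.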
@enjoy-binbin
Contributor Author

The first issue: https://github.com/redis/redis/runs/5746629116?check_suite_focus=true#step:6:4259

```
[exception]: Executing test client: >>> Performing Cluster Check (using node 127.0.0.1:21586)
M: def6b00d3ab27bdfde1e0c92a159022b7ed0006b 127.0.0.1:21586
   slots:[12182] (1 slots) master
M: 045c64c41bb18fc7f69e7db0ebd4737bc88dd865 127.0.0.1:21587
   slots:[10923-12181],[12183-16383] (5460 slots) master
M: b36176d5d999d43c1d385adab10c097418dc7c23 127.0.0.1:21589
   slots:[0-5460] (5461 slots) master
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[ERR] Not all 16384 slots are covered by nodes.
```

The other two issues were found in local tests; the tests were run locally many times without failure.


@zuiderkwast left a comment


Thanks for fixing my mess! ^_^

@madolson

madolson commented Mar 30, 2022

@madolson madolson merged commit a3075ca into redis:unstable Mar 31, 2022
@enjoy-binbin enjoy-binbin deleted the fix_cluster_test branch March 31, 2022 03:15
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 19, 2022
A timing issue like this was reported in freebsd daily CI:
```
*** [err]: Sanity test push cmd after resharding in tests/unit/cluster/cli.tcl
Expected 'CLUSTERDOWN The cluster is down' to match '*MOVED*'
```

We additionally wait for each node to reach a consensus on the cluster
state in wait_for_condition to avoid the cluster down error.

The fix is just like redis#10495; quoting madolson's comment:
Cluster check just verifies that the config state is self-consistent,
while waiting for cluster_state to be okay is an independent check that all
the nodes actually believe each other are healthy.

At the same time, I noticed that unit/moduleapi/cluster.tcl has the exact
same test and may have the same problem, so it was also modified.
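The consensus wait described above can be sketched the same way: poll every node's CLUSTER INFO until each one reports `cluster_state:ok`. A minimal Python sketch with mocked nodes (`FakeNode` and `parse_info` are illustrative; the real test parses redis-cli output in Tcl):

```python
import time

def parse_info(raw):
    """Parse CLUSTER INFO-style "key:value" lines into a dict."""
    return dict(line.split(":", 1) for line in raw.splitlines() if ":" in line)

class FakeNode:
    """Mock node whose cluster_state flips to ok after a few polls."""
    def __init__(self, ready_after):
        self.ready_after = ready_after
        self.polls = 0
    def cluster_info(self):
        self.polls += 1
        state = "ok" if self.polls >= self.ready_after else "fail"
        return f"cluster_state:{state}\ncluster_slots_assigned:16384"

def wait_for_cluster_state_ok(nodes, max_tries=100, delay_s=0.0):
    """Return True only once every node reports cluster_state:ok."""
    for _ in range(max_tries):
        if all(parse_info(n.cluster_info())["cluster_state"] == "ok"
               for n in nodes):
            return True
        time.sleep(delay_s)
    raise RuntimeError("CLUSTERDOWN The cluster is down")

nodes = [FakeNode(1), FakeNode(4), FakeNode(2)]
print(wait_for_cluster_state_ok(nodes))
```

This is the "independent check" from the quoted comment: a passing cluster check means each node's config is self-consistent, while this loop confirms every node actually considers the cluster healthy.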
madolson pushed a commit that referenced this pull request Jul 19, 2022
oranagra pushed a commit that referenced this pull request Dec 12, 2022
(cherry picked from commit 5ce64ab)
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
Fix three timing issues in the test
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023