Fix cluster slot migration test #10495
Merged
madolson merged 1 commit into redis:unstable on Mar 31, 2022
Conversation
There are three timing issues:

1. `[ERR] Not all 16384 slots are covered by nodes.`

   We can see that new_owner doesn't have the old_owner's slot information. I guess the cluster is unstable after SETSLOT, so we use `wait_for_condition` with `--cluster check` to make sure the check in the reshard won't fail.

   ```
   [exception]: Executing test client:
   >>> Performing Cluster Check (using node 127.0.0.1:21586)
   M: def6b00d3ab27bdfde1e0c92a159022b7ed0006b 127.0.0.1:21586
      slots:[12182] (1 slots) master
   M: 045c64c41bb18fc7f69e7db0ebd4737bc88dd865 127.0.0.1:21587
      slots:[10923-12181],[12183-16383] (5460 slots) master
   M: b36176d5d999d43c1d385adab10c097418dc7c23 127.0.0.1:21589
      slots:[0-5460] (5461 slots) master
   ```

2. `clusterManagerMoveSlot failed: CLUSTERDOWN The cluster is down`

   During the reshard, `clusterManagerMoveSlot` may fail in MIGRATE because the cluster is down.

3. `Expected 'MOVED 12182 xxx' to be equal to 'CLUSTERDOWN The cluster is down'`

   After the reshard, the cluster is unstable, so we also need to wait for it.
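The first fix can be sketched in the test suite's own Tcl idiom. This is a hedged sketch, not the exact patch: `wait_for_condition` and `fail` are the redis test harness helpers, and the `redis-cli` path and port are illustrative (the port is taken from the log above).

```tcl
# Sketch only -- not the exact patch from this PR.
# Retry `redis-cli --cluster check` until the slot config has
# propagated after SETSLOT, instead of running the reshard
# against a stale cluster view. `catch` returns non-zero when
# the check command exits with an error (e.g. uncovered slots).
wait_for_condition 1000 50 {
    [catch {exec src/redis-cli --cluster check 127.0.0.1:21586}] == 0
} else {
    fail "Cluster check still failing after SETSLOT"
}
```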
Contributor
Author
The first issue: https://github.com/redis/redis/runs/5746629116?check_suite_focus=true#step:6:4259. The other two were found in local tests; with the fix, the tests ran locally many times without failure.
zuiderkwast
approved these changes
Mar 30, 2022
Contributor
zuiderkwast
left a comment
Thanks for fixing my mess! ^_^
Contributor
madolson
approved these changes
Mar 31, 2022
enjoy-binbin
added a commit
to enjoy-binbin/redis
that referenced
this pull request
Jul 19, 2022
A timing issue like this was reported in the freebsd daily CI:

```
*** [err]: Sanity test push cmd after resharding in tests/unit/cluster/cli.tcl
Expected 'CLUSTERDOWN The cluster is down' to match '*MOVED*'
```

We additionally wait in `wait_for_condition` for each node to reach a consensus on the cluster state, to avoid the cluster down error.

The fix is just like redis#10495; quoting madolson's comment: "Cluster check just verifies that the config state is self-consistent; waiting for cluster_state to be okay is an independent check that all the nodes actually believe each other are healthy."

At the same time, I noticed that unit/moduleapi/cluster.tcl has an identical test that may have the same problem, so it is also modified.
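The consensus wait described above can be sketched in the harness's Tcl idiom. This is a sketch, not the exact patch: it assumes the suite's `wait_for_condition`, `fail`, and per-node `R` command helpers, plus a `$node_count` variable for the cluster size.

```tcl
# Sketch, assuming the redis Tcl test harness. Poll every node
# until it reports cluster_state:ok, so that no node is still in
# CLUSTERDOWN when the sanity test starts pushing commands.
for {set id 0} {$id < $node_count} {incr id} {
    wait_for_condition 1000 50 {
        [string match "*cluster_state:ok*" [R $id CLUSTER INFO]]
    } else {
        fail "Node $id did not converge to cluster_state:ok"
    }
}
```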
madolson
pushed a commit
that referenced
this pull request
Jul 19, 2022
oranagra
pushed a commit
that referenced
this pull request
Dec 12, 2022
(cherry picked from commit 5ce64ab)
enjoy-binbin
added a commit
to enjoy-binbin/redis
that referenced
this pull request
Jul 31, 2023
Fix three timing issues in the test
enjoy-binbin
added a commit
to enjoy-binbin/redis
that referenced
this pull request
Jul 31, 2023