Fix flaky cluster tests in 24-links.tcl #10157
Merged
madolson merged 3 commits into redis:unstable on Jan 24, 2022
madolson
reviewed
Jan 23, 2022
madolson
reviewed
Jan 24, 2022
madolson
approved these changes
Jan 24, 2022
ny0312
commented
Jan 24, 2022
Comment on lines +30 to 31
set nodes [get_cluster_nodes $id]
set links [get_cluster_links $id]
Contributor
Author
Assuming the cluster is unstable, instead of getting nodes and links again, IMO we should get them while waiting for the condition to be met.
proc number_of_peers {id nodes} {
    # `nodes` is the NAME of the caller's variable; it receives the node list.
    upvar $nodes n
    set n [get_cluster_nodes $id]
    return [expr {[llength $n] - 1}]
}
proc number_of_links {id links} {
    # `links` is the NAME of the caller's variable; it receives the link list.
    upvar $links l
    set l [get_cluster_links $id]
    return [llength $l]
}
test "Each node has two links with each peer" {
    foreach_redis_id id {
        set nodes {}
        set links {}
        # Assert that from point of view of each node, there are two links for
        # each peer. It might take a while for cluster to stabilize so wait up
        # to 5 seconds.
        wait_for_condition 50 100 {
            [number_of_peers $id nodes]*2 == [number_of_links $id links]
        } else {
            assert_equal [expr {[number_of_peers $id nodes]*2}] [number_of_links $id links]
        }
        # Then check if there are two entries in `$links` for each entry in `$nodes`
Contributor
Why would the cluster be unstable? I was under the impression that the cluster was still establishing connections as a part of the meeting process, which is why not all links were established. In steady state it shouldn't be unstable.
Member
@ny0312 I noticed another sporadic failure in this test: maybe you can look into it.
Fix two flaky tests introduced in PR #9774
For test "Each node has two links with each peer"
This test seems to be flaky because the cluster is not yet stable, and sometimes a node doesn't have both inbound and outbound connections established with every peer.
This failure is rarer than the next one. The fix is to add retries.
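The retry idea can be sketched in plain Tcl as follows. This is only an illustration under stated assumptions, not the test suite's actual `wait_for_condition` helper: `poll_until`, `fake_link_count`, and the simulated node with 3 peers are hypothetical names invented for this example.

```tcl
# Hypothetical polling helper: evaluate `cond` in the caller's scope up to
# `maxtries` times, sleeping `delay_ms` between failed attempts.
# Returns 1 if the condition became true, 0 if every attempt failed.
proc poll_until {maxtries delay_ms cond} {
    for {set i 0} {$i < $maxtries} {incr i} {
        if {[uplevel 1 [list expr $cond]]} {
            return 1
        }
        after $delay_ms
    }
    return 0
}

# Simulated node with 3 peers whose links come up one per poll, standing in
# for a cluster that is still establishing connections (an assumption).
set peers 3
set established 0
proc fake_link_count {} {
    global established
    if {$established < 6} { incr established }
    return $established
}

# 50 tries x 10 ms here; the test in this PR waits 50 x 100 ms (up to 5 seconds).
set ok [poll_until 50 10 {$peers * 2 == [fake_link_count]}]
puts "stable=$ok links=$established"
```

The condition is re-evaluated on every attempt, so the link count is re-read each time rather than sampled once before the cluster has settled.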
For test "Disconnect link when send buffer limit reached"
There were two sources of failure.
1. Redis was getting OOM-killed by the kernel after running out of swap. In the test, I was allowing cluster link buffers to grow up to 32MB. There are 20 Redis nodes running in parallel in cluster tests. That proved to be too much for the FreeBSD test environment used by the daily runs.
Example failure link: https://github.com/redis/redis/runs/4733591841?check_suite_focus=true
The fix is to use a smaller cluster link buffer limit and fill it up by repeatedly sending smallish messages. This approach should adapt to different test environments.
2. I was assuming that as soon as I send a large PUBLISH command to fill up a cluster link, the link will be freed. In reality, the link only gets freed in the next clusterCron run, whenever that happens. My test was not accounting for this race condition.
Example failure link: https://github.com/redis/redis/runs/4829401183?check_suite_focus=true#step:9:630
The fix is to wait for 0.5s before checking whether the link has been freed.
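Both fixes for the send-buffer test can be illustrated with a self-contained simulation. Everything in it is hypothetical: the `queue_message` and `fake_cluster_cron` procs, the 1KB message size, the 64KB limit, and the 100ms cron period are assumptions made for the sketch. It does not talk to Redis and is not the code from this PR.

```tcl
# Simulated link state. All names and numbers are assumptions.
set limit    [expr {64 * 1024}]  ;# small buffer limit instead of 32MB
set msg_size 1024                ;# one "smallish" message
set queued   0                   ;# bytes sitting in the send buffer
set sent     0                   ;# number of messages queued so far
set freed    0                   ;# 1 once the link has been freed

# Stand-in for publishing one message to a peer that is not reading.
proc queue_message {size} {
    global queued sent
    incr queued $size
    incr sent
}

# Stand-in for the periodic cron: frees the link if it is over the limit,
# then reschedules itself.
proc fake_cluster_cron {} {
    global queued limit freed
    if {$queued > $limit} { set freed 1 }
    after 100 fake_cluster_cron
}

# Fix 1: fill the buffer with many small messages until it exceeds the limit.
while {$queued <= $limit} {
    queue_message $msg_size
}

# The link is not freed synchronously; only the next cron run frees it.
set freed_immediately $freed
after 100 fake_cluster_cron

# Fix 2: wait 0.5s (while servicing timers) before checking the link state.
after 500 {set wait_done 1}
vwait wait_done
puts "sent=$sent freed_immediately=$freed_immediately freed_after_wait=$freed"
```

Checking `freed` right after the fill loop would reproduce the race described above; the check only succeeds after the simulated cron has had a chance to run.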