Fix flaky cluster tests in 24-links.tcl by ny0312 · Pull Request #10157 · redis/redis

ny0312 · 2022-01-20T21:38:24Z

Fix two flaky tests introduced in PR #9774

For test "Each node has two links with each peer"

00:39:21> Each node has two links with each peer: FAILED: Expected 19*2 eq 37 (context: type eval line 11 cmd {assert {$num_peers*2 eq $num_links}} proc ::foreach_instance_id)
(Jumping to next unit after error)

This test seems to be flaky because the cluster is not yet stable, and sometimes a node doesn't have both inbound and outbound connections established with every peer.

This failure is rarer than the next one. The fix is to add retries.
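As a rough illustration of the retry idea, here is a self-contained sketch. The helper `retry_until` and the simulated `count_links` are hypothetical names made up for this example; the real test relies on the Redis test suite's own helpers, and the numbers below only mimic the 19-peer failure shown in the log.

```tcl
# Hypothetical helper, not part of the Redis test suite: re-evaluate a
# condition until it holds or the retry budget is exhausted.
proc retry_until {max_tries delay_ms cond} {
    for {set i 0} {$i < $max_tries} {incr i} {
        if {[uplevel 1 [list expr $cond]]} {
            return 1
        }
        after $delay_ms
    }
    return 0
}

# Simulated node: one more link shows up on every poll, standing in for a
# cluster that is still connecting to its peers.
set num_peers 19
set num_links 30
proc count_links {} {
    global num_links
    if {$num_links < 38} { incr num_links }
    return $num_links
}

# Poll up to 50 times, 10 ms apart, until every peer has two links.
set ok [retry_until 50 10 {$num_peers * 2 == [count_links]}]
puts "stable: $ok, links: $num_links"
```

A single assertion taken at an arbitrary moment would fail here, while the polling version passes once the simulated links catch up.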

For test "Disconnect link when send buffer limit reached"

There were two sources of failure.

 00:47:22> Disconnect link when send buffer limit reached: error writing "sock802fbc590": broken pipe
    while executing
"$primary1 publish channel [prepare_value [expr 30*1024*1024]]"

Redis was getting OOM-killed by the kernel because the machine ran out of swap. In the test, I was allowing cluster link buffers to grow up to 32MB. There are 20 Redis nodes running in parallel in cluster tests. That proved to be too much for the FreeBSD test environment used by the daily runs.

Example failure link: https://github.com/redis/redis/runs/4733591841?check_suite_focus=true

The fix is to use a smaller cluster link buffer limit and fill the buffer by repeatedly sending smallish messages. This approach should adapt to different test environments.
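The fill-by-small-messages approach might look roughly like the sketch below. It is a simulation only: `send_chunk`, the 32KB limit, and the 1KB chunk size are assumptions chosen for illustration, not the values or commands used in the actual test.

```tcl
# Simulated link state; all names and sizes are hypothetical stand-ins.
set limit [expr {32 * 1024}]   ;# small buffer limit, in bytes
set chunk 1024                 ;# size of each smallish message
set queued 0
set sent 0

# Pretend to queue one message on the link's send buffer.
proc send_chunk {size} {
    global queued
    incr queued $size
}

# Keep sending small messages until the queue grows past the limit,
# with an upper bound so the loop cannot run forever.
while {$queued <= $limit && $sent < 1000} {
    send_chunk $chunk
    incr sent
}
puts "sent $sent messages, queued $queued bytes"
```

Because the loop stops as soon as the limit is crossed, peak memory stays close to the configured limit instead of depending on one large payload.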

00:46:57> Disconnect link when send buffer limit reached: FAILED: Expected [get_info_field [::redis::redisHandle1876 cluster info] total_cluster_links_buffer_limit_exceeded] eq 1 (context: type eval line 36 cmd {assert {[get_info_field [$primary1 cluster info] total_cluster_links_buffer_limit_exceeded] eq 1}} proc ::test)

I was assuming that as soon as I send a large PUBLISH command to fill up a cluster link, the link would be freed. In reality, the link is only freed in the next clusterCron run, whenever that happens. My test was not accounting for this race condition.

Example failure link: https://github.com/redis/redis/runs/4829401183?check_suite_focus=true#step:9:630

The fix is to wait 0.5s before checking whether the link has been freed.
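The race can be reproduced in miniature with a timer standing in for clusterCron. This is a simulation under stated assumptions: `overflow_link`, the 100 ms tick, and the `::link_freed` flag are invented for the example and do not exist in Redis.

```tcl
# Simulation only: the "cron" is a timer that frees the link 100 ms later.
set ::link_freed 0
proc overflow_link {} {
    # The link is only scheduled for freeing here; the actual free happens
    # on the next simulated cron tick.
    after 100 {set ::link_freed 1}
}

overflow_link
set checked_immediately $::link_freed   ;# the cron has not run yet

# Wait 500 ms while letting the event loop run the pending timer.
after 500 {set ::waited 1}
vwait ::waited
set checked_after_wait $::link_freed    ;# the cron tick has fired by now

puts "immediately: $checked_immediately, after wait: $checked_after_wait"
```

In the real test the server is a separate process, so a plain blocking sleep is enough; the `vwait` is needed here only because the simulated cron runs in the same interpreter.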

tests/cluster/tests/24-links.tcl

ny0312 · 2022-01-24T19:44:45Z

+        set nodes [get_cluster_nodes $id]
        set links [get_cluster_links $id]


Assuming the cluster is unstable, instead of getting nodes and links again, IMO we should get them while waiting for the condition to be met.

    proc number_of_peers {id nodes} {
        upvar $nodes n
        set n [get_cluster_nodes $id]
        return [expr [llength $n] - 1]
    }

    proc number_of_links {id links} {
        upvar $links l
        set l [get_cluster_links $id]
        return [llength $l]
    }

    test "Each node has two links with each peer" {
        foreach_redis_id id {
            set nodes {}
            set links {}

            # Assert that from point of view of each node, there are two links for
            # each peer. It might take a while for cluster to stabilize so wait up
            # to 5 seconds.
            wait_for_condition 50 100 {
                [number_of_peers $id $nodes]*2 == [number_of_links $id $links]
            } else {
                assert_equal [expr [number_of_peers $id $nodes]*2] [number_of_links $id $links]
            }

            # Then check if there are two entries in `$links` for each entry in `$nodes`

madolson · Jan 24, 2022

Why would the cluster be unstable? I was under the impression that the cluster was still establishing connections as a part of the meeting process, which is why not all links were established. In steady state it shouldn't be unstable.

oranagra · 2022-05-24T05:40:11Z

@ny0312 I noticed another sporadic failure in this test:
https://github.com/redis/redis/runs/6564848175?check_suite_focus=true (happens with test-sanitizer-address (gcc), which is slow)

00:45:47> Each node has two links with each peer: FAILED: Expected 0 eq 1 (context: type eval line 30 cmd {assert {$to eq 1}} proc ::foreach_instance_id)

Maybe you can look into it.

Fix flaky cluster test "Disconnect link when send buffer limit reached"

2a5583d

ny0312 mentioned this pull request Jan 20, 2022

Introduce memory management on cluster link buffers #9774

Merged

oranagra assigned madolson Jan 21, 2022

Fix flaky test "Each node has two links with each peer"

6bae4d9

ny0312 changed the title ~~Fix flaky cluster test "Disconnect link when send buffer limit reached"~~ Fix flaky cluster tests in 24-links.tcl Jan 21, 2022

oranagra added the 7.0-RC1-must-have label Jan 23, 2022

madolson reviewed Jan 23, 2022

madolson reviewed Jan 24, 2022

Address CR feedback

32b1b1a

madolson approved these changes Jan 24, 2022

madolson merged commit b40a9ba into redis:unstable Jan 24, 2022

madolson merged 3 commits into redis:unstable from ny0312:fix-cluster-link-buffer-limit-test
