Skip to content

libnetwork/networkdb: fix most flaky loopback tests#49938

Merged
thaJeztah merged 3 commits intomoby:masterfrom
corhere:libn/fix-networkdb-loopback-tests
May 8, 2025
Merged

libnetwork/networkdb: fix most flaky loopback tests#49938
thaJeztah merged 3 commits intomoby:masterfrom
corhere:libn/fix-networkdb-loopback-tests

Conversation

@corhere
Copy link
Contributor

@corhere corhere commented May 7, 2025

- What I did

- How I did it
See commit messages

- How to verify it
Trial by fire: remove the flaky flag from the tests

- Human readable description for the release notes

- A picture of a cute animal (not mandatory but encouraged)

corhere added 3 commits May 7, 2025 11:38
The NetworkDB unit tests instantiate clusters which communicate over
loopback where every "node" listens on a distinct localhost port. The
tests make use of a NetworkDB configuration knob to set the port. When
the NetworkDB configuration's BindPort field is set to a nonzero value,
its memberlist instance is configured to bind to the specified port
number. However, the advertise port is left at the
memberlist.DefaultLANConfig() default value of 7946. Because of this,
nodes would be unable to contact any of the other nodes in the cluster
learned by gossip as the gossiped addresseses specify the wrong ports!
The flaky tests passed as often as they did thanks to the robustness of
the memberlist module: NetworkDB gossip and and memberlist node
liveness-probe pings to unreachable nodes can all be relayed through
the reachable nodes, the nodes on the bootstrap join list.

Make the NetworkDB unit tests less flaky by setting each node's
advertise port to the bind port.

The daemon is unaffected by this oversight as it unconditionally uses
the default listen port of 7946, which aligns with the advertise port.

Signed-off-by: Cory Snider <[email protected]>
NetworkDB defaults to binding to the unspecified address for gossip
communications, with no advertise address set. In this configuration,
the memberlist instance listens on all network interfaces and picks one
of the host's public IP addresses as the advertise address.
The NetworkDB unit tests don't override this default, leaving them
vulnerable to flaking out as a result of rogue network traffic
perturbing the test, or the inferred advertise address not being useable
for loopback testing. And macOS prompts for permission to allow the test
executable to listen on public interfaces every time it is rebuilt.

Modify the NetworkDB tests to explicitly bind to, advertise, and join
ports on 127.0.0.1 to make the tests more robust to flakes in CI and
more convenient to run locally.

Signed-off-by: Cory Snider <[email protected]>
The loopback-test fixes seem to be sufficient to resolve the flakiness
of all the tests aside from TestFlakyNetworkDBIslands.

Signed-off-by: Cory Snider <[email protected]>
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

db := launchNode(t, localConfig)
if i != 0 {
assert.Check(t, db.Join([]string{fmt.Sprintf("localhost:%d", db.config.BindPort-1)}))
assert.Check(t, db.Join([]string{net.JoinHostPort(db.config.AdvertiseAddr, strconv.Itoa(db.config.BindPort-1))}))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be great if the db had an accessor to get what it's listening on (I was looking at the -1 a few minutes before realising this is to join the "previous" DB instance in the loop).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

db.config.BindPort is that accessor. The -1 is to get db N to join the cluster by dialing db N-1's address.

@thaJeztah thaJeztah merged commit e205701 into moby:master May 8, 2025
153 checks passed
@corhere corhere deleted the libn/fix-networkdb-loopback-tests branch May 8, 2025 15:09
@thaJeztah thaJeztah added this to the 28.2.0 milestone May 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestNetworkDBCRUDMediumCluster (libnetwork)

3 participants