libnetwork/networkdb: make TestNetworkDBIslands not flaky#50343
Merged
thaJeztah merged 2 commits intomoby:masterfrom Jul 8, 2025
Merged
libnetwork/networkdb: make TestNetworkDBIslands not flaky#50343thaJeztah merged 2 commits intomoby:masterfrom
thaJeztah merged 2 commits intomoby:masterfrom
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR fixes flakiness in the TestNetworkDBIslands integration test and corrects the cluster rejoin logic to distinguish nodes by both IP and port.
- Renamed and updated
TestNetworkDBIslandsto accept nodes marked as left or failed in any combination. - Switched bootstrap parsing in
rejoinClusterBootStrapto usenetip.ParseAddrPortand compare full address-port pairs.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| libnetwork/networkdb/networkdb_test.go | Renamed test, unified left/failed count checks, and updated log messages |
| libnetwork/networkdb/cluster.go | Replaced net.SplitHostPort/net.ParseIP usage with netip parsing and full port comparisons |
The rejoinClusterBootStrap periodic task rejoins with the bootstrap nodes if none of them are members of the cluster. It correlates the cluster nodes with the bootstrap list by comparing IP addresses, ignoring ports. In normal operation this works out fine as every node has a unique IP address, but in unit tests every node listens on a distinct port of 127.0.0.1. This situation causes the check to incorrectly filter out all nodes from the list, mistaking them for the local node. Filter out the local node using pointer equality of the *node to avoid any ambiguity. Correlate the remote nodes by IP:port so that the check behaves the same in tests and in production. Signed-off-by: Cory Snider <[email protected]>
With rejoinClusterBootStrap fixed in tests, split clusters should reliably self-heal in tests as well as production. Work around the other source of flakiness in TestNetworkDBIslands: timing out waiting for a failed node to transition to gracefully left. This flake happens when one of the leaving nodes sends its NodeLeft message to the other leaving node, and the second is shut down before it has a chance to rebroadcast the message to the remaining nodes. The proper fix would be to leverage memberlist's own bookkeeping instead of duplicating it poorly with user messages, but doing so requires a change in the memberlist module. Instead have the test check that the sum of failed+left nodes is expected instead of waiting for all nodes to have failed==3 && left==0. Signed-off-by: Cory Snider <[email protected]>
d8032a7 to
aff444d
Compare
thaJeztah
reviewed
Jul 8, 2025
| // bootstrap IPs are usually IP:port from the Join | ||
| var bootstrapIP net.IP | ||
| ipStr, _, err := net.SplitHostPort(bootIP) | ||
| bootstrapIP, err := netip.ParseAddrPort(bootIP) |
Member
There was a problem hiding this comment.
TIL netip.ParseAddrPort, nice!
(Looks like it only is for ip-address:port, not hostname:port though, so I guess in many of our cases, we'd still need net.SplitHostPort or variants thereof)
Member
|
Had to kick CI because GitHub was mis-behaving. This one is unrelated (also known flaky test); |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
- What I did
- How I did it
I identified a bug in the rejoin logic that TestNetworkDBIslands is supposed to be validating which only manifests in tests as that is the only time where more than one node can be advertising the same IP address, only differing by port. I updated the logic to use both the IP address and port number when identifying nodes so that it behaves the same in tests as in production. I also updated the test itself to be more tolerant of nodes being spuriously marked as failed.
- How to verify it
go test -count=...- Human readable description for the release notes
- A picture of a cute animal (not mandatory but encouraged)