cilium-cli: Validate NodePort readiness on all node IP addresses#40812
Conversation
There was a problem hiding this comment.
- Insufficient timeouts: 2-second connect timeout was too aggressive for IPv6 NodePort connections
- IPv6 networking delays: IPv6 connections can take longer to establish than IPv4
I am a bit skeptical about these findings. Establishing a connection should not take 2 seconds, especially when all of it is effective on the same node (no internet involved). If they do, then it seems to me that something is going wrong (packets being dropped, causing TCP level retires, or something related) and increasing the timeout may just mask that.
I also have never heard of IPv6 being slower than IPv4 to establish connections. I don't think we should accept that statement without something to back that up.
Missing retry logic: Single-attempt connections failed on transient network issues
This is essentially saying that we expect Cilium, the thing we are testing to fail. I don't think that is right. These sorts of tests are exactly here to catch issues in Cilium. If the first connection we make always fails or frequently fails then we should investigate why and not plaster over it with retries. We want all connections to work, not just at least one out of three.
Premature testing: Tests started before NodePorts were fully ready
This sounds more like the right direction. I think this is more likely to be the real fix.
4538a18 to
4527a2f
Compare
a2d1e38 to
083d5d1
Compare
083d5d1 to
d60f535
Compare
d60f535 to
ab2dc2a
Compare
|
/test |
ab2dc2a to
4ee5bec
Compare
4ee5bec to
c0ea472
Compare
|
@viktor-kurchenko are we good to go ahead with merge? |
c0ea472 to
707c054
Compare
|
/test |
Yes, it's marked as |
|
I see there's been some updates, but the PR title and description (release note) both talk about dropping IPv6-specific timeouts and I don't see that in the code change. Related, in the commit message:
If the need is removed, does that mean there should be another change to remove them? Does it make sense to track that with an issue and follow up to remove those checks or were you planning to look into that as part of this PR? |
Updated the pr title, description, commit message and release note. The current approach is simpler: The fix enhances the existing upfront NodePort validation to check all node IP addresses (NodeInternalIP and NodeExternalIP) instead of just HostIP. This covers IPv6 addresses in dual-stack clusters that were previously unchecked and were causing the intermittent timeouts.
In an earlier version of this PR I added per-test |
2b00fb3 to
2c0b93c
Compare
|
/test |
2c0b93c to
6f7772e
Compare
The upfront NodePort readiness check only validated HostIP (typically one IP per node), but tests target all addresses in node.Status.Addresses which can include multiple IPs per node (e.g., IPv4 and IPv6 in dual-stack clusters). This caused intermittent failures (exit code 28) when tests tried IPv6 NodePorts that were never validated for readiness. Fix by enhancing the upfront validation in deployment.go to check NodePort readiness on all node IP addresses (InternalIP and ExternalIP types) during the setup phase, ensuring all addresses that will be tested are ready before tests execute. Fixes: cilium#39793 Signed-off-by: Ashwin Pillai <[email protected]>
6f7772e to
9607dac
Compare
|
/test |
|
Looks like this resulted in #43749. |
PR cilium#40812 added validation for NodePorts on all node IP addresses (internal and external) to fix dual-stack clusters. However, on GKE, external IPs are not reachable from inside the cluster due to firewall restrictions. The connectivity tests already skip external IPs on GKE (service.go), but the upfront validation added by cilium#40812 did not, causing timeouts during test setup. This commit adds the same GKE-aware external IP skip logic to the NodePort validation phase, making it consistent with the actual test behavior. Fixes: cilium#43749 See-also: cilium#39793 Signed-off-by: Ash <[email protected]>
PR cilium#40812 added validation for NodePorts on all node IP addresses (internal and external) to fix dual-stack clusters. However, on GKE, external IPs are not reachable from inside the cluster due to firewall restrictions. The connectivity tests already skip external IPs on GKE, but the upfront validation added by cilium#40812 did not, causing timeouts during test setup. This commit adds the same GKE-aware external IP skip logic to the NodePort validation phase, making it consistent with the actual test behavior. Fixes: cilium#43749 See-also: cilium#39793 Signed-off-by: Ashwin Pillai <[email protected]>
PR cilium#40812 added validation for NodePorts on all node IP addresses (internal and external) to fix dual-stack clusters. However, on GKE, external IPs are not reachable from inside the cluster due to firewall restrictions. The connectivity tests already skip external IPs on GKE, but the upfront validation added by cilium#40812 did not, causing timeouts during test setup. This commit adds the same GKE-aware external IP skip logic to the NodePort validation phase, making it consistent with the actual test behavior. Fixes: cilium#43749 See-also: cilium#39793 Signed-off-by: Ashwin Pillai <[email protected]>
PR #40812 added validation for NodePorts on all node IP addresses (internal and external) to fix dual-stack clusters. However, on GKE, external IPs are not reachable from inside the cluster due to firewall restrictions. The connectivity tests already skip external IPs on GKE, but the upfront validation added by #40812 did not, causing timeouts during test setup. This commit adds the same GKE-aware external IP skip logic to the NodePort validation phase, making it consistent with the actual test behavior. Fixes: #43749 See-also: #39793 Signed-off-by: Ashwin Pillai <[email protected]>
…0 ) (#584) This PR contains the following updates: | Package | Update | Change | |---|---|---| | [aqua:cilium/cilium-cli](https://redirect.github.com/cilium/cilium-cli) | minor | `0.18.9` → `0.19.0` | --- > [!WARNING] > Some dependencies could not be looked up. Check the Dependency Dashboard for more information. --- ### Release Notes <details> <summary>cilium/cilium-cli (aqua:cilium/cilium-cli)</summary> ### [`v0.19.0`](https://redirect.github.com/cilium/cilium-cli/releases/tag/v0.19.0) [Compare Source](https://redirect.github.com/cilium/cilium-cli/compare/v0.18.9...v0.19.0) ## Summary of Changes **CI Changes:** - Add L7 policy traffic disruption tests ([cilium/cilium#42150](https://redirect.github.com/cilium/cilium/issues/42150), [@​fristonio](https://redirect.github.com/fristonio)) - Cilium-cli SNI connectivity tests now retry expected successful operations to recover from failures due to external upstream issues. ([cilium/cilium#42980](https://redirect.github.com/cilium/cilium/issues/42980), [@​jrajahalme](https://redirect.github.com/jrajahalme)) - cli: connectivity: fix typo in L7 LB tests ([cilium/cilium#43610](https://redirect.github.com/cilium/cilium/issues/43610), [@​julianwiedmann](https://redirect.github.com/julianwiedmann)) - Fix intermittent NodePort connectivity test timeouts in dual-stack clusters by validating NodePort readiness on all node IP addresses during test setup. ([cilium/cilium#40812](https://redirect.github.com/cilium/cilium/issues/40812), [@​pillai-ashwin](https://redirect.github.com/pillai-ashwin)) - tests: remove identity manager from ignored error messages ([cilium/cilium#42982](https://redirect.github.com/cilium/cilium/issues/42982), [@​odinuge](https://redirect.github.com/odinuge)) **Misc Changes:** - chore(deps): update all-dependencies (main) ([cilium/cilium#43169](https://redirect.github.com/cilium/cilium/issues/43169), [@​cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot]) - chore(deps): update all-dependencies (main) ([cilium/cilium#43456](https://redirect.github.com/cilium/cilium/issues/43456), [@​cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot]) - chore(deps): update all-dependencies (main) ([cilium/cilium#43508](https://redirect.github.com/cilium/cilium/issues/43508), [@​cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot]) - chore(deps): update base-images (main) ([cilium/cilium#43457](https://redirect.github.com/cilium/cilium/issues/43457), [@​cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot]) - chore(deps): update base-images (main) ([cilium/cilium#43538](https://redirect.github.com/cilium/cilium/issues/43538), [@​cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot]) - chore(deps): update docker.io/library/golang:1.25.5 docker digest to [`a22b2e6`](https://redirect.github.com/cilium/cilium-cli/commit/a22b2e6) (main) ([cilium/cilium#43303](https://redirect.github.com/cilium/cilium/issues/43303), [@​cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot]) - chore(deps): update go to v1.25.5 (main) ([cilium/cilium#43173](https://redirect.github.com/cilium/cilium/issues/43173), [@​cilium-renovate](https://redirect.github.com/cilium-renovate)\[bot]) - cilium-cli/connectivity: remove matcher for bpf/init.sh errors ([cilium/cilium#43109](https://redirect.github.com/cilium/cilium/issues/43109), [@​tklauser](https://redirect.github.com/tklauser)) - cilium-cli: convert net.IP to netip.Addr ([cilium/cilium#42371](https://redirect.github.com/cilium/cilium/issues/42371), [@​phuhung273](https://redirect.github.com/phuhung273)) - cli: Update `network-perf` image ref ([cilium/cilium#43297](https://redirect.github.com/cilium/cilium/issues/43297), [@​HadrienPatte](https://redirect.github.com/HadrienPatte)) - chore(deps): update golangci/golangci-lint-action action to v9.2.0 by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3148](https://redirect.github.com/cilium/cilium-cli/pull/3148) - Update stable release to v0.18.9 by [@​michi-covalent](https://redirect.github.com/michi-covalent) in [#​3149](https://redirect.github.com/cilium/cilium-cli/pull/3149) - chore(deps): update golangci/golangci-lint docker tag to v2.7.0 by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3151](https://redirect.github.com/cilium/cilium-cli/pull/3151) - chore(deps): update go to v1.25.5 by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3153](https://redirect.github.com/cilium/cilium-cli/pull/3153) - ci: clean up disk space in release workflow by [@​tklauser](https://redirect.github.com/tklauser) in [#​3154](https://redirect.github.com/cilium/cilium-cli/pull/3154) - chore(deps): update actions/stale action to v10.1.1 by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3150](https://redirect.github.com/cilium/cilium-cli/pull/3150) - chore(deps): update gcr.io/distroless/static:latest docker digest to [`4b2a093`](https://redirect.github.com/cilium/cilium-cli/commit/4b2a093) by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3152](https://redirect.github.com/cilium/cilium-cli/pull/3152) - chore(deps): update golangci/golangci-lint docker tag to v2.7.2 by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3155](https://redirect.github.com/cilium/cilium-cli/pull/3155) - chore(deps): update docker.io/library/golang:1.25.5 docker digest to [`a22b2e6`](https://redirect.github.com/cilium/cilium-cli/commit/a22b2e6) by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3156](https://redirect.github.com/cilium/cilium-cli/pull/3156) - chore(deps): update actions/upload-artifact action to v6 by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3157](https://redirect.github.com/cilium/cilium-cli/pull/3157) - chore(deps): update docker.io/library/golang:1.25.5 docker digest to [`36b4f45`](https://redirect.github.com/cilium/cilium-cli/commit/36b4f45) by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3160](https://redirect.github.com/cilium/cilium-cli/pull/3160) - chore(deps): update dependency cilium/cilium to v1.18.5 by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3159](https://redirect.github.com/cilium/cilium-cli/pull/3159) - chore(deps): update dependency kubernetes-sigs/kind to v0.31.0 by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3158](https://redirect.github.com/cilium/cilium-cli/pull/3158) - chore(deps): update docker/setup-buildx-action action to v3.12.0 by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3162](https://redirect.github.com/cilium/cilium-cli/pull/3162) - chore(deps): update golangci/golangci-lint docker tag to v2.8.0 by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3163](https://redirect.github.com/cilium/cilium-cli/pull/3163) - chore(deps): update docker.io/library/golang:1.25.5 docker digest to [`6cc2338`](https://redirect.github.com/cilium/cilium-cli/commit/6cc2338) by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3164](https://redirect.github.com/cilium/cilium-cli/pull/3164) - chore(deps): update gcr.io/distroless/static:latest docker digest to [`cd64bec`](https://redirect.github.com/cilium/cilium-cli/commit/cd64bec) by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3165](https://redirect.github.com/cilium/cilium-cli/pull/3165) - chore(deps): update actions/setup-go action to v6.2.0 by [@​renovate](https://redirect.github.com/renovate)\[bot] in [#​3166](https://redirect.github.com/cilium/cilium-cli/pull/3166) - Prepare for v0.19.0 release by [@​tklauser](https://redirect.github.com/tklauser) in [#​3167](https://redirect.github.com/cilium/cilium-cli/pull/3167) **Full Changelog**: <cilium/cilium-cli@v0.18.9...v0.19.0> </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://redirect.github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0Mi44MS4yIiwidXBkYXRlZEluVmVyIjoiNDIuODEuMyIsInRhcmdldEJyYW5jaCI6Im1haW4iLCJsYWJlbHMiOlsidHlwZS9taW5vciJdfQ==--> Co-authored-by: zocimek-renovate[bot] <134739422+zocimek-renovate[bot]@users.noreply.github.com>
Please ensure your pull request adheres to the following guidelines:
description and a
Fixes: #XXXline if the commit addresses a particularGitHub issue.
Fixes: <commit-id>tag, thenplease add the commit author[s] as reviewer[s] to this issue.
Description
Fixes intermittent
pod-to-local-nodeportconnectivity test failures (exit code 28 - connection timeout) in dual-stack clusters.Root Cause
The upfront NodePort readiness check in
deployment.goonly validatedHostIP(one IP per node, typically IPv4), but tests iterate through all addresses innode.Status.Addresses[]which includes both IPv4 and IPv6 in dual-stack clusters.This meant IPv6 NodePorts were never validated for readiness, causing timeouts when tests tried them.
Fix
Enhance the upfront NodePort validation to check all node IP addresses (NodeInternalIP and NodeExternalIP types) instead of just HostIP. This ensures all addresses that will be tested are validated during setup.
Fixes: #39793