CI: Cilium E2E Upgrade: no-interrupted-connections #37520
Closed as not planned
Labels
area/CI: Continuous Integration testing issue or flake
ci/flake: This is a known failure that occurs in the tree. Please investigate me!
dependencies: Pull requests that update a dependency file
stale: The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.
That workflow is failing very often on main: https://github.com/cilium/cilium/actions/workflows/tests-e2e-upgrade.yaml?query=event%3Aschedule
This happens during the upgrade from v1.17.
I checked one of those failures and didn't see any packet drops from Cilium or from XFRM (which is disabled anyway, but I'm a bit paranoid). So it seems like the drops come from somewhere else in the kernel.
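For reference, a hedged sketch of the kind of drop checks described above. The commands are standard (`cilium monitor` from the agent, the kernel's XFRM counters in `/proc`), but availability depends on where you run them, so each is guarded:

```shell
#!/bin/sh
# Best-effort sketch of the drop checks mentioned above; assumes a shell on
# a node or inside the cilium-agent pod. Commands are guarded so the script
# still exits 0 where the tools are absent.

# Cilium-level drops: stream drop notifications from the datapath
# (only runs if the cilium CLI is present).
command -v cilium >/dev/null 2>&1 && cilium monitor --type drop

# Kernel XFRM (IPsec) error counters: non-zero XfrmIn*/XfrmOut* values
# would indicate IPsec-related drops. XFRM is disabled in these runs,
# so these should stay at zero or be absent entirely.
grep Xfrm /proc/net/xfrm_stat 2>/dev/null || true
```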
First occurrence is on Feb. 2nd at 2:05pm. That's a Sunday. Pull requests merged on Thursday & Friday are: https://github.com/cilium/cilium/pulls?page=1&q=is%3Apr+merged%3A2025-01-30..2025-02-02+is%3Aclosed+-label%3Akind%2Fbackports.
The affected configs are: 7, 8, 12, 15, 19, 20, 21, 22, 23. I went through the list and compared the configs: they all have BPF NodePort enabled (though not necessarily the rest of the KPR options), and they all run on bpf or bpf-next kernels. That makes #37406 a good suspect among the candidates. I sent a revert at #37485 and confirmed, by running the workflow 10 times, that it fixes the flake.