CI: cilium-e2e-upgrade failure: client-egress-tls-sni/pod-to-world: exit code 28 on PR with policy fix #40627

@anubhabMajumdar

Introduction

I documented the failures in depth on the PR where the runs were failing: #40519 (comment)

Failed run

https://github.com/cilium/cilium/actions/runs/16281632742/job/45972300970

Output

❌ 2/22 tests failed (2/202 actions), 7 tests skipped, 0 scenarios skipped:
Test [client-egress-tls-sni]:
  🟥 client-egress-tls-sni/pod-to-world:https-to-one.one.one.one.-ipv4-0: cilium-test-5/client-645b68dcf7-2jmrw (10.244.3.30) -> one.one.one.one.-https (one.one.one.one.:443): command "curl --silent --fail --show-error --connect-timeout 2 --max-time 10 -4 -H Host: one.one.one.one -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code}\n --output /dev/null https://one.one.one.one.:443" failed: command failed (pod=cilium-test-5/client-645b68dcf7-2jmrw, container=client): command terminated with exit code 28
    ⛑️ The following owners are responsible for reliability of the testsuite: 
        - @cilium/proxy (pod-to-world)
        - @cilium/ci-structure (.github/workflows/tests-e2e-upgrade.yaml)
Test [client-egress-tls-sni-wildcard]:
  🟥 client-egress-tls-sni-wildcard/pod-to-world:https-to-one.one.one.one.-ipv4-1: cilium-test-5/client2-66475877c6-fbw7t (10.244.3.68) -> one.one.one.one.-https (one.one.one.one.:443): command "curl --silent --fail --show-error --connect-timeout 2 --max-time 10 -4 -H Host: one.one.one.one -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code}\n --output /dev/null https://one.one.one.one.:443" failed: command failed (pod=cilium-test-5/client2-66475877c6-fbw7t, container=client2): command terminated with exit code 28
    ⛑️ The following owners are responsible for reliability of the testsuite: 
        - @cilium/proxy (pod-to-world)
        - @cilium/ci-structure (.github/workflows/tests-e2e-upgrade.yaml)

Attachments

Link to all sysdumps: https://github.com/cilium/cilium/actions/runs/16281632742/artifacts/3531894952

The sysdump from the failed run is attached as a zip:

cilium-sysdump-conn-disrupt-test-cilium-upgrade-6-concurrent-20250715-013221.zip

Brief analysis of the failure - #40519 (comment)

Slack discussion

There's a long Slack discussion about the failure - https://cilium.slack.com/archives/CDLN7836J/p1752524978378719?thread_ts=1752163611.807769&cid=CDLN7836J

More details

I am trying to check in this change: cb40019.

Here are the details of the bug and what I have found so far:

The test fails only during upgrade, and it fails consistently (though it passes occasionally). I was trying to determine whether a problem syncing the policy to the eBPF map was causing the curl failures. Given that the test passes consistently on retry, I think either:

  • the client-egress-tls-sni* tests are flaky and retrying is the right approach (but at the test level), or
  • there is an existing bug in L7 that this change exposes
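To make the first option concrete, here is a minimal sketch of what a test-level retry could look like. It is written in Python for illustration only and is not the cilium-cli implementation; `run_cmd` and `retry_on_timeout` are hypothetical helper names. The one fact it relies on is that curl exit code 28 means the operation timed out, which matches the `--connect-timeout 2 --max-time 10` flags in the failing command.

```python
import subprocess
import time

# curl exits with 28 when the operation times out.
CURL_TIMEOUT_EXIT = 28


def run_cmd(cmd):
    """Run a command and return its exit code (hypothetical helper)."""
    return subprocess.run(cmd, capture_output=True).returncode


def retry_on_timeout(cmd, attempts=3, delay_s=1.0, run=run_cmd, sleep=time.sleep):
    """Retry cmd only while it exits with the curl timeout code.

    Returns the list of exit codes observed, so a caller can tell a
    first-try pass apart from a pass that needed retries.
    `run` and `sleep` are injectable to keep the sketch testable
    without network access.
    """
    codes = []
    for attempt in range(attempts):
        code = run(cmd)
        codes.append(code)
        # Any result other than a timeout (success or a different
        # failure) is final: do not retry it.
        if code != CURL_TIMEOUT_EXIT:
            break
        # Wait before the next attempt, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay_s)
    return codes
```

Retrying only on exit code 28 keeps the retry narrow: other curl failures would still fail the test immediately, and returning every observed exit code means a pass-on-retry stays visible rather than being silently absorbed. That matters if the second option (an existing L7 bug) turns out to be the real cause.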

Labels

area/CI (Continuous Integration testing issue or flake), ci/flake (This is a known failure that occurs in the tree. Please investigate me!)
