test(policy): address timeouts and flakiness #14773

alpeb · 2025-12-02T00:25:31Z

Policy tests are very flaky. One of the main culprits is that the watcher sometimes misses the service account creation event, which blocks `await_service_account` until it times out after 60s. We already allow up to 3 retries when calling `cargo nextest`, but these tests run sequentially, so the 60s timeouts accumulate until we hit the CI job timeout at 20min.

This change first lowers the service account creation timeout to 15s, on the understanding that the watcher either catches the event quickly or else would block indefinitely. It is therefore better to fail faster and trigger the test retry as soon as possible.
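To make the accumulation concrete, here is a rough worst-case calculation. It is only a sketch: the function names are hypothetical, it assumes "up to 3 retries" means up to 4 attempts per test (consistent with the `n/4` counters in the output below), and it ignores all time spent actually running tests.

```rust
use std::time::Duration;

/// Worst-case time a single test spends blocked if the watcher misses the
/// event on every attempt: one initial attempt plus `retries` retries.
fn worst_case_blocked(timeout: Duration, retries: u32) -> Duration {
    timeout * (retries + 1)
}

/// How many missed-event timeouts fit into the CI job budget, ignoring
/// everything else the job does. `timeout` must be at least one second.
fn timeouts_within_budget(budget: Duration, timeout: Duration) -> u64 {
    budget.as_secs() / timeout.as_secs()
}
```

Under these assumptions a single unlucky test can block for 240s with the 60s timeout but only 60s with the 15s timeout, and the 20min job budget absorbs 20 missed events at 60s versus 80 at 15s.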

With this change, test-policy (v1.34, linkerd, experimental) is finally passing, taking 17m due to the large number of retries:

     Summary [ 779.409s] 151 tests run: 151 passed (15 flaky), 0 skipped
   FLAKY 2/4 [   0.069s] linkerd-policy-test::admit_network_authentication rejects_invalid_cidr
   FLAKY 3/4 [  15.019s] linkerd-policy-test::e2e_audit ns_audit
   FLAKY 2/4 [  10.964s] linkerd-policy-test::e2e_authorization_policy targets_route
   FLAKY 3/4 [   3.830s] linkerd-policy-test::e2e_egress_network default_traffic_policy_http_allow
   FLAKY 2/4 [  37.004s] linkerd-policy-test::e2e_http_local_ratelimit_policy ratelimit_total
   FLAKY 2/4 [   7.947s] linkerd-policy-test::e2e_server_authorization network
   FLAKY 2/4 [   0.142s] linkerd-policy-test::inbound_http_route_status inbound_accepted_parent
   FLAKY 2/4 [   0.167s] linkerd-policy-test::inbound_http_route_status inbound_multiple_parents
   FLAKY 2/4 [   1.681s] linkerd-policy-test::outbound_api multiple_routes
   FLAKY 2/4 [   1.013s] linkerd-policy-test::outbound_api routes_without_backends
   FLAKY 3/4 [   1.153s] linkerd-policy-test::outbound_api service_with_routes_with_cross_namespace_backend
   FLAKY 2/4 [   0.282s] linkerd-policy-test::outbound_api_failure_accrual consecutive_failure_accrual
   FLAKY 3/4 [   0.290s] linkerd-policy-test::outbound_api_failure_accrual default_failure_accrual
   FLAKY 2/4 [   0.354s] linkerd-policy-test::outbound_api_http http_route_gateway_timeouts
   FLAKY 2/4 [   0.740s] linkerd-policy-test::outbound_api_http http_route_retries_and_timeouts

After having measured this, we also added a check in await_service_account to bypass the watcher logic if the SA is already in place. This resulted in the same tests taking only 12m with far less flakiness:
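The shape of that check can be sketched as follows. This is not the code from the PR: it uses a plain `std::sync::mpsc` channel as a stand-in for the Kubernetes watch stream, and `await_named`, `Outcome`, and the `exists` closure are hypothetical names chosen for illustration.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum Outcome {
    AlreadyPresent,
    SeenByWatcher,
    TimedOut,
}

/// Check for the resource first; only fall back to waiting on watch events
/// (modelled here as a channel of created names) if it is not there yet.
fn await_named<F>(name: &str, exists: F, events: &Receiver<String>, timeout: Duration) -> Outcome
where
    F: Fn(&str) -> bool,
{
    // Fast path: the resource was created before we started waiting,
    // so no event would ever arrive for it.
    if exists(name) {
        return Outcome::AlreadyPresent;
    }

    // Slow path: wait for a matching event until the overall deadline.
    let deadline = Instant::now() + timeout;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match events.recv_timeout(remaining) {
            Ok(created) if created == name => return Outcome::SeenByWatcher,
            Ok(_) => continue, // event for some other resource
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
                return Outcome::TimedOut
            }
        }
    }
}
```

In this sketch the channel exists before the check runs, so an event sent between the check and the wait is still delivered. A real watch that is only started after the check does not automatically have that property.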

     Summary [ 517.330s] 151 tests run: 151 passed (2 flaky), 0 skipped
   FLAKY 2/4 [   0.941s] linkerd-policy-test::outbound_api routes_without_backends
   FLAKY 2/4 [   0.459s] linkerd-policy-test::outbound_api_tcp multiple_tcp_routes


adleong

I wonder if the reduction in flakiness we get here is just because checking for the service account with a synchronous call takes time, giving the namespace more time to be persisted. I.e. I wonder if the service account get is roughly equivalent to a sleep here.

I also wonder if waiting for the namespace to show up in a watch could get rid of the flakiness entirely by guaranteeing that Kubernetes is ready for us to initiate the namespaced service account watch.
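A minimal sketch of that ordering, again with `std::sync::mpsc` channels standing in for watch streams. The names `wait_for`, `await_namespace_then_sa`, and `start_sa_watch` are hypothetical, not taken from the test code. The point is only that the service account watch is not started until the namespace has been observed.

```rust
use std::sync::mpsc::Receiver;
use std::time::Duration;

/// Block until `name` shows up on `events`. The timeout applies to each
/// receive, so this gives up after `timeout` without any new event.
fn wait_for(events: &Receiver<String>, name: &str, timeout: Duration) -> bool {
    loop {
        match events.recv_timeout(timeout) {
            Ok(seen) if seen == name => return true,
            Ok(_) => continue, // unrelated event, keep waiting
            Err(_) => return false, // timed out or sender dropped
        }
    }
}

/// Wait for the namespace first, and only then start and await the
/// namespaced service account watch.
fn await_namespace_then_sa<F>(
    ns_events: &Receiver<String>,
    start_sa_watch: F,
    ns: &str,
    sa: &str,
    timeout: Duration,
) -> bool
where
    F: FnOnce(&str) -> Receiver<String>,
{
    if !wait_for(ns_events, ns, timeout) {
        return false; // namespace never appeared; the SA watch is never started
    }
    let sa_events = start_sa_watch(ns);
    wait_for(&sa_events, sa, timeout)
}
```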


alpeb · 2025-12-04T18:25:32Z

Ok, I finally added the namespace watcher and refactored things a bit to avoid repetition, and CI is doing great. I also had to incorporate #14777, which just surfaced and was blocking CI.

Policy tests are very flaky. Currently one of the main culprits is that service account creation sometimes isn't caught as an event by the watcher, blocking `await_service_account` until it times out after 60s. We already have in place up to 3 retries when calling `cargo nextest`, but these tests are sequential and the 60s timeouts start accumulating until we reach the CI job timeout at 20min.

This change first lowers the service account creation timeout down to 15s, understanding that if the watcher catches that event it will do pretty quickly or else block indefinitely. So better to fail faster and trigger the test retry ASAP.

With this change, `test-policy (v1.34, linkerd, experimental)` is finally passing, taking 17m due to the large number of retries:

```
  Summary [ 779.409s] 151 tests run: 151 passed (15 flaky), 0 skipped
FLAKY 2/4 [   0.069s] linkerd-policy-test::admit_network_authentication rejects_invalid_cidr
FLAKY 3/4 [  15.019s] linkerd-policy-test::e2e_audit ns_audit
FLAKY 2/4 [  10.964s] linkerd-policy-test::e2e_authorization_policy targets_route
FLAKY 3/4 [   3.830s] linkerd-policy-test::e2e_egress_network default_traffic_policy_http_allow
FLAKY 2/4 [  37.004s] linkerd-policy-test::e2e_http_local_ratelimit_policy ratelimit_total
FLAKY 2/4 [   7.947s] linkerd-policy-test::e2e_server_authorization network
FLAKY 2/4 [   0.142s] linkerd-policy-test::inbound_http_route_status inbound_accepted_parent
FLAKY 2/4 [   0.167s] linkerd-policy-test::inbound_http_route_status inbound_multiple_parents
FLAKY 2/4 [   1.681s] linkerd-policy-test::outbound_api multiple_routes
FLAKY 2/4 [   1.013s] linkerd-policy-test::outbound_api routes_without_backends
FLAKY 3/4 [   1.153s] linkerd-policy-test::outbound_api service_with_routes_with_cross_namespace_backend
FLAKY 2/4 [   0.282s] linkerd-policy-test::outbound_api_failure_accrual consecutive_failure_accrual
FLAKY 3/4 [   0.290s] linkerd-policy-test::outbound_api_failure_accrual default_failure_accrual
FLAKY 2/4 [   0.354s] linkerd-policy-test::outbound_api_http http_route_gateway_timeouts
FLAKY 2/4 [   0.740s] linkerd-policy-test::outbound_api_http http_route_retries_and_timeouts
```

After having measured this, we also added a watcher for the namespace, vastly reducing flakiness:

```
  Summary [ 517.330s] 151 tests run: 151 passed (2 flaky), 0 skipped
FLAKY 2/4 [   0.941s] linkerd-policy-test::outbound_api routes_without_backends
FLAKY 2/4 [   0.459s] linkerd-policy-test::outbound_api_tcp multiple_tcp_routes
```

Finally, the jaeger chart version has to be pinned to an earlier one as the latest one is presenting some breaking changes that are making the tracing test fail.

Signed-off-by: Ivan Porta <[email protected]>

alpeb requested a review from a team as a code owner December 2, 2025 00:25

before triggering watch, check if SA already exists

8f02571

alpeb changed the title ~~test(policy): avoid timeouts due to flakiness~~ test(policy): address timeouts and flakiness Dec 2, 2025

adleong reviewed Dec 2, 2025


alpeb added 3 commits December 3, 2025 15:27

testing replacing initial SA check with sleep

9c7a3c1

test adding ns watch

aa37033

refactor watches

a95df41

adleong approved these changes Dec 3, 2025


fix(test): Pin jaeger helm chart version to v3

bee63d2

Signed-off-by: Scott Fleener <[email protected]>

sfleen approved these changes Dec 4, 2025


alpeb merged commit 4dfd4ec into main Dec 4, 2025
74 of 76 checks passed

alpeb deleted the alpeb/policy-test-flakiness branch December 4, 2025 18:27

sfleen mentioned this pull request Dec 4, 2025

fix(test): Pin jaeger helm chart version to v3 #14777

Closed
