test(policy): address timeouts and flakiness #14773
Policy tests are very flaky. Currently, one of the main culprits is that service account creation sometimes isn't caught as an event by the watcher, which blocks `await_service_account` until it times out after 60s. We already allow up to 3 retries when calling `cargo nextest`, but these tests run sequentially, so the 60s timeouts accumulate until we hit the CI job timeout at 20min. This change lowers the service account creation timeout to 15s, on the understanding that the watcher either catches the event quickly or blocks indefinitely. So it is better to fail faster and trigger the test retry ASAP.
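As a rough illustration of the fail-fast reasoning above, here is a minimal std-only sketch. It is not the code in this PR: `WatchEvent`, `await_created` and `worst_case_wait` are hypothetical names, and an `mpsc` channel stands in for the Kubernetes watch. It also assumes that "up to 3 retries" means 4 attempts in total.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical stand-in for a Kubernetes watch event.
#[derive(Debug, PartialEq)]
pub enum WatchEvent {
    ServiceAccountCreated(String),
}

/// Block until the next event arrives or `timeout` elapses.
/// If the watcher missed the event, this can only ever time out,
/// so a shorter timeout just surfaces the failure sooner.
pub fn await_created(
    events: &Receiver<WatchEvent>,
    timeout: Duration,
) -> Result<WatchEvent, RecvTimeoutError> {
    events.recv_timeout(timeout)
}

/// Upper bound on the time one test spends blocked when every
/// attempt (first run plus retries) times out.
pub fn worst_case_wait(timeout: Duration, attempts: u32) -> Duration {
    timeout * attempts
}
```

Under these assumptions, a 60s timeout can block a single test for up to 240s across 4 attempts, while a 15s timeout caps that at 60s.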
adleong
left a comment
I wonder if the reduction in flakiness we get here is just because checking for the service account with a synchronous call takes time, allowing for more time for the namespace to be persisted. I.e. I wonder if the service account get is roughly equivalent to a sleep here.
I also wonder if awaiting the namespace in a watch could get rid of the flakiness entirely, by guaranteeing that Kubernetes is ready for us to initiate the namespaced service account watch.
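The ordering suggested here could look roughly like the sketch below. It is a std-only model, not the test harness code: `Event`, `WaitError`, `await_matching` and `await_namespace_then_service_account` are hypothetical names, `mpsc` channels stand in for Kubernetes watches, and the `start_sa_watch` closure represents initiating the namespaced service account watch only after the namespace has been observed.

```rust
use std::sync::mpsc::Receiver;
use std::time::{Duration, Instant};

/// Hypothetical stand-ins for Kubernetes watch events.
#[derive(Debug, PartialEq)]
pub enum Event {
    NamespaceCreated(String),
    ServiceAccountCreated { namespace: String, name: String },
}

#[derive(Debug, PartialEq)]
pub enum WaitError {
    NamespaceTimeout,
    ServiceAccountTimeout,
}

/// Drain events until one matches or the deadline passes.
fn await_matching<F>(rx: &Receiver<Event>, timeout: Duration, mut matches: F) -> bool
where
    F: FnMut(&Event) -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(ev) if matches(&ev) => return true,
            Ok(_) => continue,       // unrelated event, keep waiting
            Err(_) => return false,  // timed out or watch closed
        }
    }
}

/// Wait for the namespace first, and only then start the
/// service account watch and wait on it.
pub fn await_namespace_then_service_account<W>(
    ns_events: &Receiver<Event>,
    start_sa_watch: W,
    namespace: &str,
    sa_name: &str,
    timeout: Duration,
) -> Result<(), WaitError>
where
    W: FnOnce() -> Receiver<Event>,
{
    let ns_seen = await_matching(ns_events, timeout, |ev| {
        matches!(ev, Event::NamespaceCreated(n) if n == namespace)
    });
    if !ns_seen {
        return Err(WaitError::NamespaceTimeout);
    }

    // The namespaced watch is created only after the namespace is visible.
    let sa_events = start_sa_watch();
    let sa_seen = await_matching(&sa_events, timeout, |ev| {
        matches!(
            ev,
            Event::ServiceAccountCreated { namespace: ns, name }
                if ns == namespace && name == sa_name
        )
    });
    if sa_seen {
        Ok(())
    } else {
        Err(WaitError::ServiceAccountTimeout)
    }
}
```

Returning a distinct error per phase makes it visible which of the two waits timed out, which would help tell a slow namespace apart from a missed service account event.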
Signed-off-by: Scott Fleener <[email protected]>
Ok, I finally added the namespace watcher and refactored things a bit to avoid repetition, and CI is doing great. I also had to incorporate #14777, which just surfaced and was blocking CI.
Policy tests are very flaky. Currently, one of the main culprits is that service account creation sometimes isn't caught as an event by the watcher, which blocks `await_service_account` until it times out after 60s. We already allow up to 3 retries when calling `cargo nextest`, but these tests run sequentially, so the 60s timeouts accumulate until we hit the CI job timeout at 20min.
This change first lowers the service account creation timeout to 15s, on the understanding that the watcher either catches the event quickly or blocks indefinitely. So it is better to fail faster and trigger the test retry ASAP.
With this change, `test-policy (v1.34, linkerd, experimental)` is finally passing, taking 17m due to the large number of retries:
```
Summary [ 779.409s] 151 tests run: 151 passed (15 flaky), 0 skipped
FLAKY 2/4 [ 0.069s] linkerd-policy-test::admit_network_authentication rejects_invalid_cidr
FLAKY 3/4 [ 15.019s] linkerd-policy-test::e2e_audit ns_audit
FLAKY 2/4 [ 10.964s] linkerd-policy-test::e2e_authorization_policy targets_route
FLAKY 3/4 [ 3.830s] linkerd-policy-test::e2e_egress_network default_traffic_policy_http_allow
FLAKY 2/4 [ 37.004s] linkerd-policy-test::e2e_http_local_ratelimit_policy ratelimit_total
FLAKY 2/4 [ 7.947s] linkerd-policy-test::e2e_server_authorization network
FLAKY 2/4 [ 0.142s] linkerd-policy-test::inbound_http_route_status inbound_accepted_parent
FLAKY 2/4 [ 0.167s] linkerd-policy-test::inbound_http_route_status inbound_multiple_parents
FLAKY 2/4 [ 1.681s] linkerd-policy-test::outbound_api multiple_routes
FLAKY 2/4 [ 1.013s] linkerd-policy-test::outbound_api routes_without_backends
FLAKY 3/4 [ 1.153s] linkerd-policy-test::outbound_api service_with_routes_with_cross_namespace_backend
FLAKY 2/4 [ 0.282s] linkerd-policy-test::outbound_api_failure_accrual consecutive_failure_accrual
FLAKY 3/4 [ 0.290s] linkerd-policy-test::outbound_api_failure_accrual default_failure_accrual
FLAKY 2/4 [ 0.354s] linkerd-policy-test::outbound_api_http http_route_gateway_timeouts
FLAKY 2/4 [ 0.740s] linkerd-policy-test::outbound_api_http http_route_retries_and_timeouts
```
After measuring this, we also added a watcher for the namespace, which vastly reduced flakiness:
```
Summary [ 517.330s] 151 tests run: 151 passed (2 flaky), 0 skipped
FLAKY 2/4 [ 0.941s] linkerd-policy-test::outbound_api routes_without_backends
FLAKY 2/4 [ 0.459s] linkerd-policy-test::outbound_api_tcp multiple_tcp_routes
```
Finally, the jaeger chart version has to be pinned to an earlier release, because the latest one introduces breaking changes that make the tracing test fail.
Signed-off-by: Ivan Porta <[email protected]>
Policy tests are very flaky. Currently, one of the main culprits is that service account creation sometimes isn't caught as an event by the watcher, which blocks `await_service_account` until it times out after 60s. We already allow up to 3 retries when calling `cargo nextest`, but these tests run sequentially, so the 60s timeouts accumulate until we hit the CI job timeout at 20min.
This change first lowers the service account creation timeout to 15s, on the understanding that the watcher either catches the event quickly or blocks indefinitely. So it is better to fail faster and trigger the test retry ASAP.
With this change, `test-policy (v1.34, linkerd, experimental)` is finally passing, taking 17m due to the large number of retries.
After measuring this, we also added a check in `await_service_account` to bypass the watcher logic if the SA is already in place. This resulted in the same tests taking only 12m with far less flakiness.
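The bypass described in this commit message might be sketched as follows. This is a std-only illustration under assumptions, not the actual `await_service_account` from the test suite: the `exists` closure stands in for a direct lookup of the service account, the `created` channel stands in for the watcher, and `Outcome` is a hypothetical type.

```rust
use std::sync::mpsc::Receiver;
use std::time::Duration;

/// Hypothetical result type describing how the wait ended.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    AlreadyPresent,
    SeenByWatcher,
    TimedOut,
}

/// Check for the service account directly first, and only fall
/// back to waiting on the watcher when it is not there yet.
pub fn await_service_account<F>(
    exists: F,
    created: &Receiver<()>,
    timeout: Duration,
) -> Outcome
where
    F: FnOnce() -> bool,
{
    if exists() {
        // Nothing to wait for, so a missed watch event cannot block us.
        return Outcome::AlreadyPresent;
    }
    match created.recv_timeout(timeout) {
        Ok(()) => Outcome::SeenByWatcher,
        Err(_) => Outcome::TimedOut,
    }
}
```

Note that in this simplified model an account created between the direct check and the wait is still covered only if the watcher channel was opened beforehand.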