
In the external-prometheus-deep integration test, wait for prom to roll out#6524

Merged
Pothulapati merged 1 commit into main from alpeb/test-ext-prom-flake
Jul 20, 2021

Conversation

@alpeb
Member

@alpeb alpeb commented Jul 19, 2021

Fixes #6511

In the external-prometheus-deep integration test, make sure we wait for the rollout of the external prometheus instance before proceeding.

Also remove the special logic around if TestHelper.ExternalPrometheus() in the helm-related tests because we know we're using the embedded linkerd prometheus instance in those tests.

@alpeb alpeb requested a review from a team as a code owner July 19, 2021 20:57
@codecov-commenter

codecov-commenter commented Jul 19, 2021

Codecov Report

Merging #6524 (7508b11) into main (9b3d120) will increase coverage by 0.34%.
The diff coverage is n/a.


```diff
@@            Coverage Diff             @@
##             main    #6524      +/-   ##
==========================================
+ Coverage   44.48%   44.82%   +0.34%
==========================================
  Files         183      183
  Lines       17639    23859    +6220
  Branches      290      290
==========================================
+ Hits         7846    10694    +2848
- Misses       9056    12412    +3356
- Partials      737      753      +16
```

Flag       | Coverage Δ
-----------|---------------------
golang     | 43.45% <ø> (+0.89%) ⬆️
javascript | 66.76% <ø> (ø)
unittests  | 44.82% <ø> (+0.34%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/profiles/proto.go 67.30% <0.00%> (-4.79%) ⬇️
cli/cmd/install-cni-plugin.go 65.80% <0.00%> (-4.54%) ⬇️
pkg/k8s/api.go 28.26% <0.00%> (-3.78%) ⬇️
controller/k8s/test_helper.go 88.88% <0.00%> (-3.12%) ⬇️
...luster/service-mirror/cluster_watcher_test_util.go 83.67% <0.00%> (-2.36%) ⬇️
viz/metrics-api/test_helper.go 91.28% <0.00%> (-2.01%) ⬇️
cli/cmd/uninstall.go 19.04% <0.00%> (-1.89%) ⬇️
controller/api/destination/watcher/prometheus.go 86.73% <0.00%> (-1.84%) ⬇️
controller/webhook/server.go 17.11% <0.00%> (-1.71%) ⬇️
pkg/k8s/fake.go 77.88% <0.00%> (-1.64%) ⬇️
... and 134 more


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9b3d120...7508b11.

Contributor

@Pothulapati Pothulapati left a comment


LGTM

@Pothulapati Pothulapati merged commit 0874b1d into main Jul 20, 2021
@Pothulapati Pothulapati deleted the alpeb/test-ext-prom-flake branch July 20, 2021 05:26
alpeb added a commit that referenced this pull request Jul 29, 2021
Another attempt at fixing #6511

Even after #6524, we continued experiencing discrepancies on the
linkerd-edges integration test. The problem ended up being the external
prometheus instance not being injected. The injector logs revealed this:

```console
2021-07-29T13:57:10.2497460Z time="2021-07-29T13:54:15Z" level=info msg="caches synced"
2021-07-29T13:57:10.2498191Z time="2021-07-29T13:54:15Z" level=info msg="starting admin server on :9995"
2021-07-29T13:57:10.2498935Z time="2021-07-29T13:54:15Z" level=info msg="listening at :8443"
2021-07-29T13:57:10.2499945Z time="2021-07-29T13:54:18Z" level=info msg="received admission review request 2b7b4970-db40-4bda-895b-bb2e95e98265"
2021-07-29T13:57:10.2511751Z time="2021-07-29T13:54:18Z" level=debug msg="admission request: &AdmissionRequest{UID:2b7b4970-db40-4bda-895b-bb2e95e98265,Kind:/v1, Kind=Service,Resource:{ v1 services},SubResource:,Name:metrics-api,Namespace:linkerd-viz...
```

Usually the webhook server starts first ("listening at :8443") followed
by the admin server, but in this case it happened the other way around.
The admin server backs the readiness probe, so k8s considered the
injector ready before it could accept webhook requests. Since the
webhook failure policy is Ignore by default, the prometheus pod
creation event was sometimes missed entirely; the log above shows the
injector first processing the pods created afterwards, which are the
viz ones.
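The Ignore failure policy lives in the injector's webhook configuration. An illustrative fragment is shown below; the resource and webhook names are assumptions based on linkerd's conventions, not copied from this PR:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: linkerd-proxy-injector-webhook-config
webhooks:
  - name: linkerd-proxy-injector.linkerd.io
    # With Ignore, the API server admits the pod even when the webhook
    # is unreachable -- which is how an un-injected prometheus pod
    # could slip through during the startup race described above.
    failurePolicy: Ignore
```

The alternative, `failurePolicy: Fail`, would reject pod creation while the webhook is down, trading availability for injection guarantees.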

In this fix we start the webhook server first, then block on the k8s
API sync (which gives the webhook time to come up), and finally start
the admin server.
alpeb added a commit that referenced this pull request Jul 29, 2021
sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 1, 2021
sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 1, 2021
sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 13, 2021
sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 13, 2021

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

external-prometheus-deep TestDirectEdges is flakey in CI

4 participants