Skip to content

Fix external-prometheus integration test flakyness#6575

Merged
alpeb merged 1 commit intomainfrom
alpeb/injector-race
Jul 29, 2021
Merged

Fix external-prometheus integration test flakyness#6575
alpeb merged 1 commit intomainfrom
alpeb/injector-race

Conversation

@alpeb
Copy link
Member

@alpeb alpeb commented Jul 29, 2021

Another attempt at fixing #6511

Even after #6524, we continued experiencing discrepancies on the
linkerd-edges integration test. The problem ended up being the external
prometheus instance not being injected. The injector logs revealed this:

2021-07-29T13:57:10.2497460Z time="2021-07-29T13:54:15Z" level=info msg="caches synced"
2021-07-29T13:57:10.2498191Z time="2021-07-29T13:54:15Z" level=info msg="starting admin server on :9995"
2021-07-29T13:57:10.2498935Z time="2021-07-29T13:54:15Z" level=info msg="listening at :8443"
2021-07-29T13:57:10.2499945Z time="2021-07-29T13:54:18Z" level=info msg="received admission review request 2b7b4970-db40-4bda-895b-bb2e95e98265"
2021-07-29T13:57:10.2511751Z time="2021-07-29T13:54:18Z" level=debug msg="admission request: &AdmissionRequest{UID:2b7b4970-db40-4bda-895b-bb2e95e98265,Kind:/v1, Kind=Service,Resource:{ v1 services},SubResource:,Name:metrics-api,Namespace:linkerd-viz...

Usually one expects the webhook server to start first ("listening at
:8443") and then the admin server, but in this case it happened the
other way around. The admin server serves the readiness probe, so k8s
was signaled that the injector was ready before it could listen to
webhook requests, and given the WebhookFailurePolicy is Ignore by
default, sometimes this was causing for the prometheus pod creation
event to get missed, and we see in the log above that it starts by
processing the pods that are created afterwards, which are the viz ones.

In this fix we start first the webhook server, then block on the syncing
of the k8s API, which should give enough time for the webhook to be up,
and finally we start the admin server.

Another attempt at fixing #6511

Even after #6524, we continued experiencing discrepancies on the
linkerd-edges integration test. The problem ended up being the external
prometheus instance not being injected. The injector logs revealed this:

```console
2021-07-29T13:57:10.2497460Z time="2021-07-29T13:54:15Z" level=info msg="caches synced"
2021-07-29T13:57:10.2498191Z time="2021-07-29T13:54:15Z" level=info msg="starting admin server on :9995"
2021-07-29T13:57:10.2498935Z time="2021-07-29T13:54:15Z" level=info msg="listening at :8443"
2021-07-29T13:57:10.2499945Z time="2021-07-29T13:54:18Z" level=info msg="received admission review request 2b7b4970-db40-4bda-895b-bb2e95e98265"
2021-07-29T13:57:10.2511751Z time="2021-07-29T13:54:18Z" level=debug msg="admission request: &AdmissionRequest{UID:2b7b4970-db40-4bda-895b-bb2e95e98265,Kind:/v1, Kind=Service,Resource:{ v1 services},SubResource:,Name:metrics-api,Namespace:linkerd-viz...
```

Usually one expects the webhook server to start first ("listening at
:8443") and then the admin server, but in this case it happened the
other way around. The admin server serves the readiness probe, so k8s
was signaled that the injector was ready before it could listen to
webhook requests, and given the WebhookFailurePolicy is Ignore by
default, sometimes this was causing for the prometheus pod creation
event to get missed, and we see in the log above that it starts by
processing the pods that are created afterwards, which are the viz ones.

In this fix we start first the webhook server, then block on the syncing
of the k8s API, which should give enough time for the webhook to be up,
and finally we start the admin server.
@alpeb alpeb requested a review from a team as a code owner July 29, 2021 15:59
@codecov-commenter
Copy link

codecov-commenter commented Jul 29, 2021

Codecov Report

Merging #6575 (f42d883) into main (ca1077b) will not change coverage.
The diff coverage is 0.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #6575   +/-   ##
=======================================
  Coverage   44.86%   44.86%           
=======================================
  Files         182      182           
  Lines       23866    23866           
  Branches      290      290           
=======================================
  Hits        10707    10707           
  Misses      12407    12407           
  Partials      752      752           
Flag Coverage Δ
golang 43.50% <0.00%> (ø)
javascript 66.76% <ø> (ø)
unittests 44.86% <0.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
controller/webhook/launcher.go 0.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ca1077b...f42d883. Read the comment docs.

Copy link
Contributor

@dadjeibaah dadjeibaah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📦 🚢

@alpeb alpeb merged commit d9e9013 into main Jul 29, 2021
@alpeb alpeb deleted the alpeb/injector-race branch July 29, 2021 18:29
sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 1, 2021
Another attempt at fixing linkerd#6511

Even after linkerd#6524, we continued experiencing discrepancies on the
linkerd-edges integration test. The problem ended up being the external
prometheus instance not being injected. The injector logs revealed this:

```console
2021-07-29T13:57:10.2497460Z time="2021-07-29T13:54:15Z" level=info msg="caches synced"
2021-07-29T13:57:10.2498191Z time="2021-07-29T13:54:15Z" level=info msg="starting admin server on :9995"
2021-07-29T13:57:10.2498935Z time="2021-07-29T13:54:15Z" level=info msg="listening at :8443"
2021-07-29T13:57:10.2499945Z time="2021-07-29T13:54:18Z" level=info msg="received admission review request 2b7b4970-db40-4bda-895b-bb2e95e98265"
2021-07-29T13:57:10.2511751Z time="2021-07-29T13:54:18Z" level=debug msg="admission request: &AdmissionRequest{UID:2b7b4970-db40-4bda-895b-bb2e95e98265,Kind:/v1, Kind=Service,Resource:{ v1 services},SubResource:,Name:metrics-api,Namespace:linkerd-viz...
```

Usually one expects the webhook server to start first ("listening at
:8443") and then the admin server, but in this case it happened the
other way around. The admin server serves the readiness probe, so k8s
was signaled that the injector was ready before it could listen to
webhook requests, and given the WebhookFailurePolicy is Ignore by
default, sometimes this was causing for the prometheus pod creation
event to get missed, and we see in the log above that it starts by
processing the pods that are created afterwards, which are the viz ones.

In this fix we start first the webhook server, then block on the syncing
of the k8s API, which should give enough time for the webhook to be up,
and finally we start the admin server.
sannimichaelse pushed a commit to sannimichaelse/linkerd2 that referenced this pull request Aug 13, 2021
Another attempt at fixing linkerd#6511

Even after linkerd#6524, we continued experiencing discrepancies on the
linkerd-edges integration test. The problem ended up being the external
prometheus instance not being injected. The injector logs revealed this:

```console
2021-07-29T13:57:10.2497460Z time="2021-07-29T13:54:15Z" level=info msg="caches synced"
2021-07-29T13:57:10.2498191Z time="2021-07-29T13:54:15Z" level=info msg="starting admin server on :9995"
2021-07-29T13:57:10.2498935Z time="2021-07-29T13:54:15Z" level=info msg="listening at :8443"
2021-07-29T13:57:10.2499945Z time="2021-07-29T13:54:18Z" level=info msg="received admission review request 2b7b4970-db40-4bda-895b-bb2e95e98265"
2021-07-29T13:57:10.2511751Z time="2021-07-29T13:54:18Z" level=debug msg="admission request: &AdmissionRequest{UID:2b7b4970-db40-4bda-895b-bb2e95e98265,Kind:/v1, Kind=Service,Resource:{ v1 services},SubResource:,Name:metrics-api,Namespace:linkerd-viz...
```

Usually one expects the webhook server to start first ("listening at
:8443") and then the admin server, but in this case it happened the
other way around. The admin server serves the readiness probe, so k8s
was signaled that the injector was ready before it could listen to
webhook requests, and given the WebhookFailurePolicy is Ignore by
default, sometimes this was causing for the prometheus pod creation
event to get missed, and we see in the log above that it starts by
processing the pods that are created afterwards, which are the viz ones.

In this fix we start first the webhook server, then block on the syncing
of the k8s API, which should give enough time for the webhook to be up,
and finally we start the admin server.

Signed-off-by: Sanni Michael <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants