Skip to content

Fixed linkerd check not finding Prometheus#4797

Merged
alpeb merged 3 commits intomainfrom
alpeb/wait-for-prom
Jul 27, 2020
Merged

Fixed linkerd check not finding Prometheus#4797
alpeb merged 3 commits intomainfrom
alpeb/wait-for-prom

Conversation

@alpeb
Copy link
Member

@alpeb alpeb commented Jul 24, 2020

The Problem

linkerd check run right after install is failing because it can't find the Prometheus Pod.

The Cause

The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready.

Since #4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries.

The Fix

The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind.

This PR adds retry functionality to the runCheckRPC() function, making sure the final output remains the same

It also temporarily disables the upgrade-edge integration test because after installing edge-20.7.4 linkerd check will fail because of this.

## The Problem

`linkerd check` run right after install is failing because it can't find the Prometheus Pod.

## The Cause

The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready.

Since #4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries.

## The Fix

The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind.

This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same

It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.
@alpeb alpeb requested a review from a team as a code owner July 24, 2020 22:07
Copy link
Contributor

@kleimkuhler kleimkuhler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good and I've tested that this works. I do have one question before approving.

Copy link
Contributor

@Pothulapati Pothulapati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I scaled linkerd-prometheus to 0 and ran the linkerd check, and the check command ran for 5m4s

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
| waiting for check to complete

but once prometheus comes back up, the check succeeds

retryDeadline: hc.RetryDeadline,
description: "control plane self-check",
hintAnchor: "l5d-api-control-api",
surfaceErrorOnRetry: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason this has to be false? As its hard to understand which check we are waiting for with just \ waiting for check to complete?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the following output, by making it true

- Error calling Prometheus from the control plane: server_error: server error: 503

This seems better than the waiting for check right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you know, prometheus takes a long while to become available, so users that run linker check right after install will always see that error. I thought it was less scary for them to just let them know we're waiting for the check. If prometheus doesn't come up by the time-out, the real error is shown:

linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
× [prometheus] control plane can talk to Prometheus
    Error calling Prometheus from the control plane: server_error: server error: 503
    see https://linkerd.io/checks/#l5d-api-control-api for hints

Status check results are ×

Copy link
Contributor

@kleimkuhler kleimkuhler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@Pothulapati Pothulapati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍

@alpeb alpeb merged commit 2aea222 into main Jul 27, 2020
@alpeb alpeb deleted the alpeb/wait-for-prom branch July 27, 2020 16:54
han-so1omon pushed a commit to han-so1omon/linkerd2 that referenced this pull request Jul 28, 2020
* Fixed `linkerd check` not finding Prometheus

## The Problem

`linkerd check` run right after install is failing because it can't find the Prometheus Pod.

## The Cause

The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready.

Since linkerd#4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries.

## The Fix

The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind.

This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same

It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.
han-so1omon pushed a commit to han-so1omon/linkerd2 that referenced this pull request Jul 28, 2020
* Fixed `linkerd check` not finding Prometheus

## The Problem

`linkerd check` run right after install is failing because it can't find the Prometheus Pod.

## The Cause

The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready.

Since linkerd#4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries.

## The Fix

The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind.

This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same

It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.

Signed-off-by: Eric Solomon <[email protected]>
alpeb added a commit that referenced this pull request Jul 31, 2020
Followup to #4797

That test was temporarily disabled until the prometheus check in
`linkerd check` got fixed in #4797 and made it into edge-20.7.5
alpeb added a commit that referenced this pull request Jul 31, 2020
Followup to #4797

That test was temporarily disabled until the prometheus check in
`linkerd check` got fixed in #4797 and made it into edge-20.7.5
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants