Skip to content

Remove/Relax prometheus related checks#4724

Merged
ihcsim merged 4 commits intomainfrom
tarun/prom-checks
Jul 20, 2020
Merged

Remove/Relax prometheus related checks#4724
ihcsim merged 4 commits intomainfrom
tarun/prom-checks

Conversation

@Pothulapati
Copy link
Contributor

@Pothulapati Pothulapati commented Jul 7, 2020

Implements second part of linkerd/rfc#16, #4375

Now that prometheus is an add-on (i.e #4362 ), There can be cases where prometheus is
disabled at which the check should show a warning but not fail. This
decouples the tight depedency.

This changes the following checks:

  • Removes serviceAccount and pod checks in the CLI.
  • Relaxes linkerd-api checks to only check for prometheus access when
    the URL is not empty. This should work seamlessly with external
    prometheus as that URL will be passed and it performs the same
    check.
  • Updates queryProm function to perform a check, to not crash but
    report a failure to the dashboard and cli

Signed-off-by: Tarun Pothulapati [email protected]

Now that prometheus is an add-on, There can be cases where prometheus is
disabled at which the check should show a warning but not fail. This
decouples the tight depedency.

This changes the following checks:

- Removes serviceAccount and pod checks in the CLI.
- Relaxes `linkerd-api` checks to only check for prometheus access when
the URL is not empty. This should work seamlessly with external
prometheus as that URL will be passed and it performs the same
check.

Signed-off-by: Tarun Pothulapati <[email protected]>
@Pothulapati Pothulapati requested a review from a team as a code owner July 7, 2020 05:29
@alpeb
Copy link
Member

alpeb commented Jul 8, 2020

How feasible would it be to move those checks to its own section under healthcheck_addons.go so prometheus gets own checks section in the linkerd check output?

@Pothulapati
Copy link
Contributor Author

@alpeb Yep, That's the plan :)

Copy link
Member

@alpeb alpeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me 👍
Looking forward for the other changes to have the public-api endpoints fail gracefully when there is no prometheus 😉

@Pothulapati
Copy link
Contributor Author

@alpeb Updated the controller instance to fail gracefully when there is no prometheus, works as expected both with CLI and as well as the dashboard.

Copy link
Member

@alpeb alpeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM 👍

Copy link
Member

@zaharidichev zaharidichev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ihcsim ihcsim merged commit b7e9507 into main Jul 20, 2020
@ihcsim ihcsim deleted the tarun/prom-checks branch July 20, 2020 21:24
alpeb added a commit that referenced this pull request Jul 24, 2020
## The Problem

`linkerd check` run right after install is failing because it can't find the Prometheus Pod.

## The Cause

The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready.

Since #4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries.

## The Fix

The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind.

This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same

It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.
alpeb added a commit that referenced this pull request Jul 27, 2020
* Fixed `linkerd check` not finding Prometheus

## The Problem

`linkerd check` run right after install is failing because it can't find the Prometheus Pod.

## The Cause

The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready.

Since #4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries.

## The Fix

The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind.

This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same

It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.
han-so1omon pushed a commit to han-so1omon/linkerd2 that referenced this pull request Jul 28, 2020
* Fixed `linkerd check` not finding Prometheus

## The Problem

`linkerd check` run right after install is failing because it can't find the Prometheus Pod.

## The Cause

The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready.

Since linkerd#4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries.

## The Fix

The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind.

This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same

It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.
han-so1omon pushed a commit to han-so1omon/linkerd2 that referenced this pull request Jul 28, 2020
* Fixed `linkerd check` not finding Prometheus

## The Problem

`linkerd check` run right after install is failing because it can't find the Prometheus Pod.

## The Cause

The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready.

Since linkerd#4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries.

## The Fix

The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind.

This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same

It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.

Signed-off-by: Eric Solomon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants