Fixed linkerd check not finding Prometheus#4797
Conversation
## The Problem `linkerd check` run right after install is failing because it can't find the Prometheus Pod. ## The Cause The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready. Since #4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries. ## The Fix The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind. This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.
kleimkuhler
left a comment
There was a problem hiding this comment.
This looks good and I've tested that this works. I do have one question before approving.
Pothulapati
left a comment
There was a problem hiding this comment.
I scaled linkerd-prometheus to 0 and ran the linkerd check, and the check command ran for 5m4s
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
| waiting for check to complete
but once prometheus comes back up, the check succeeds
| retryDeadline: hc.RetryDeadline, | ||
| description: "control plane self-check", | ||
| hintAnchor: "l5d-api-control-api", | ||
| surfaceErrorOnRetry: false, |
There was a problem hiding this comment.
Any reason this has to be false? As its hard to understand which check we are waiting for with just \ waiting for check to complete?
There was a problem hiding this comment.
I get the following output, by making it true
- Error calling Prometheus from the control plane: server_error: server error: 503
This seems better than the waiting for check right?
There was a problem hiding this comment.
As you know, prometheus takes a long while to become available, so users that run linker check right after install will always see that error. I thought it was less scary for them to just let them know we're waiting for the check. If prometheus doesn't come up by the time-out, the real error is shown:
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
× [prometheus] control plane can talk to Prometheus
Error calling Prometheus from the control plane: server_error: server error: 503
see https://linkerd.io/checks/#l5d-api-control-api for hints
Status check results are ×
* Fixed `linkerd check` not finding Prometheus ## The Problem `linkerd check` run right after install is failing because it can't find the Prometheus Pod. ## The Cause The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready. Since linkerd#4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries. ## The Fix The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind. This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.
* Fixed `linkerd check` not finding Prometheus ## The Problem `linkerd check` run right after install is failing because it can't find the Prometheus Pod. ## The Cause The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready. Since linkerd#4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries. ## The Fix The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind. This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this. Signed-off-by: Eric Solomon <[email protected]>
The Problem
linkerd checkrun right after install is failing because it can't find the Prometheus Pod.The Cause
The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready.
Since #4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries.
The Fix
The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind.
This PR adds retry functionality to the
runCheckRPC()function, making sure the final output remains the sameIt also temporarily disables the
upgrade-edgeintegration test because after installing edge-20.7.4linkerd checkwill fail because of this.