Add data plane check for metrics Prometheus#1635
Conversation
klingerf
left a comment
There was a problem hiding this comment.
⭐️ This looks great -- I'm really happy you can use the ListPods endpoint to perform this check. Left some small comments, but nothing blocking.
pkg/healthcheck/healthcheck.go
Outdated
| } | ||
|
|
||
| if errMsg != "" { | ||
| return errors.New(errMsg) |
There was a problem hiding this comment.
Mind switching this to fmt.Errorf? That's more consistent with how we produce errors throughout the rest of this file.
| return err | ||
| } | ||
| return hc.kubeAPI.CheckProxyVersion(pods, hc.latestVersion) | ||
| return hc.kubeAPI.CheckProxyVersion(hc.dataPlanePods, hc.latestVersion) |
pkg/healthcheck/healthcheck.go
Outdated
| category: LinkerdDataPlaneCategory, | ||
| description: "data plane proxy metrics are present in Prometheus", | ||
| retry: hc.ShouldRetry, | ||
| fatal: true, |
There was a problem hiding this comment.
I don't think this check needs to be fatal. A fatal check cancels all remaining checks, but if metrics aren't present in prometheus, I think it's still ok to continue with the remaining checks. I tend to categorize a check as fatal only if subsequent checks depend some side effect of the current check.
| errMsg = fmt.Sprintf("Data plane metrics not found for %s. ", strings.Join(onlyInK8s, ", ")) | ||
| } | ||
| if len(onlyInProm) > 0 { | ||
| errMsg += fmt.Sprintf("Found data plane metrics for %s, but not found in Kubernetes.", strings.Join(onlyInProm, ", ")) |
There was a problem hiding this comment.
Innnnteresting. I'm not sure this error will ever fire, since the ListPods endpoint builds its list of pods from the Kubernetes API, so if a host is not found in the Kubernetes, it also won't exist in the response from the ListPods endpoint. But it certainly doesn't hurt to leave this in here just in case something is really wonky.
There was a problem hiding this comment.
Agreed! I put this here primarily as a guard in case the behavior of ListPods ever changes.
There was a problem hiding this comment.
I was thinking maybe we could have used the prometheus endpoint api/v1/targets's results and compare that with the data plane pods to verify that data plane pods are in fact reporting stats. But, considering that we would have to deal with unmarshalling JSON and other things to make the endpoint's result consumable, this is probably a better way to do this.
There was a problem hiding this comment.
@dadjeibaah Yup I definitely thought about that approach. I think if ListPods at some point changes in behavior, we could revisit which Prometheus endpoint we use.
pkg/healthcheck/healthcheck.go
Outdated
| return nil | ||
| } | ||
|
|
||
| func validateDataPlanePodReporting(k8sPods []v1.Pod, promPods []*pb.Pod, targetNamespace string) error { |
There was a problem hiding this comment.
If you feel inclined, I think it would be pretty straightforward to add unit tests for this method in pkg/healthcheck/healthcheck_test.go. If you look in that file you'll see existing unit tests for the rest of the validate* funcs in this file.
The `linkerd check` command was not validating whether data plane proxies were successfully reporting metrics to Prometheus. Introduce a new check that validates data plane proxies are found in Prometheus. This is made possible via the existing `ListPods` endpoint in the public API, which includes an `Added` field, indicating a pod's metrics were found in Prometheus. Fixes #1517 Signed-off-by: Andrew Seigner <[email protected]>
6b903fe to
732bcfd
Compare
|
Changes LGTM! thanks for adding this to the check command 📦 |
* master: Move more info from the tap table into the expanded row (linkerd#1641) `linkerd check` sends params on version check (linkerd#1642) Bikeshed the tap and top icons (linkerd#1637) Add link to tap each row in top table (linkerd#1643) Bump default check retry time to 5 minutes (linkerd#1645) Make wait=true a default option for check and dashboard (linkerd#1640) Add version check to Grafana dashboard (linkerd#1638) Add data plane check for metrics Prometheus (linkerd#1635)
The
linkerd checkcommand was not validating whether data planeproxies were successfully reporting metrics to Prometheus.
Introduce a new check that validates data plane proxies are found in
Prometheus. This is made possible via the existing
ListPodsendpointin the public API, which includes an
Addedfield, indicating a pod'smetrics were found in Prometheus.
Fixes #1517
Signed-off-by: Andrew Seigner [email protected]