Skip to content

Add data plane check for metrics Prometheus#1635

Merged
siggy merged 1 commit intomasterfrom
siggy/pod-reporting-check
Sep 13, 2018
Merged

Add data plane check for metrics Prometheus#1635
siggy merged 1 commit intomasterfrom
siggy/pod-reporting-check

Conversation

@siggy
Copy link
Member

@siggy siggy commented Sep 12, 2018

The linkerd check command was not validating whether data plane
proxies were successfully reporting metrics to Prometheus.

Introduce a new check that validates data plane proxies are found in
Prometheus. This is made possible via the existing ListPods endpoint
in the public API, which includes an Added field, indicating a pod's
metrics were found in Prometheus.

Fixes #1517

Signed-off-by: Andrew Seigner [email protected]

@siggy siggy added the area/cli label Sep 12, 2018
@siggy siggy self-assigned this Sep 12, 2018
Copy link
Contributor

@klingerf klingerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⭐️ This looks great -- I'm really happy you can use the ListPods endpoint to perform this check. Left some small comments, but nothing blocking.

}

if errMsg != "" {
return errors.New(errMsg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind switching this to fmt.Errorf? That's more consistent with how we produce errors throughout the rest of this file.

return err
}
return hc.kubeAPI.CheckProxyVersion(pods, hc.latestVersion)
return hc.kubeAPI.CheckProxyVersion(hc.dataPlanePods, hc.latestVersion)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 👍 👍

category: LinkerdDataPlaneCategory,
description: "data plane proxy metrics are present in Prometheus",
retry: hc.ShouldRetry,
fatal: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this check needs to be fatal. A fatal check cancels all remaining checks, but if metrics aren't present in prometheus, I think it's still ok to continue with the remaining checks. I tend to categorize a check as fatal only if subsequent checks depend some side effect of the current check.

errMsg = fmt.Sprintf("Data plane metrics not found for %s. ", strings.Join(onlyInK8s, ", "))
}
if len(onlyInProm) > 0 {
errMsg += fmt.Sprintf("Found data plane metrics for %s, but not found in Kubernetes.", strings.Join(onlyInProm, ", "))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Innnnteresting. I'm not sure this error will ever fire, since the ListPods endpoint builds its list of pods from the Kubernetes API, so if a host is not found in the Kubernetes, it also won't exist in the response from the ListPods endpoint. But it certainly doesn't hurt to leave this in here just in case something is really wonky.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! I put this here primarily as a guard in case the behavior of ListPods ever changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking maybe we could have used the prometheus endpoint api/v1/targets's results and compare that with the data plane pods to verify that data plane pods are in fact reporting stats. But, considering that we would have to deal with unmarshalling JSON and other things to make the endpoint's result consumable, this is probably a better way to do this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dadjeibaah Yup I definitely thought about that approach. I think if ListPods at some point changes in behavior, we could revisit which Prometheus endpoint we use.

return nil
}

func validateDataPlanePodReporting(k8sPods []v1.Pod, promPods []*pb.Pod, targetNamespace string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you feel inclined, I think it would be pretty straightforward to add unit tests for this method in pkg/healthcheck/healthcheck_test.go. If you look in that file you'll see existing unit tests for the rest of the validate* funcs in this file.

The `linkerd check` command was not validating whether data plane
proxies were successfully reporting metrics to Prometheus.

Introduce a new check that validates data plane proxies are found in
Prometheus. This is made possible via the existing `ListPods` endpoint
in the public API, which includes an `Added` field, indicating a pod's
metrics were found in Prometheus.

Fixes #1517

Signed-off-by: Andrew Seigner <[email protected]>
@siggy siggy force-pushed the siggy/pod-reporting-check branch from 6b903fe to 732bcfd Compare September 13, 2018 18:47
@dadjeibaah
Copy link
Contributor

Changes LGTM! thanks for adding this to the check command 📦

@siggy siggy merged commit 7c70531 into master Sep 13, 2018
@siggy siggy deleted the siggy/pod-reporting-check branch September 13, 2018 20:02
zachalbert added a commit to zachalbert/linkerd2 that referenced this pull request Sep 15, 2018
* master:
  Move more info from the tap table into the expanded row (linkerd#1641)
  `linkerd check` sends params on version check (linkerd#1642)
  Bikeshed the tap and top icons (linkerd#1637)
  Add link to tap each row in top table (linkerd#1643)
  Bump default check retry time to 5 minutes (linkerd#1645)
  Make wait=true a default option for check and dashboard (linkerd#1640)
  Add version check to Grafana dashboard (linkerd#1638)
  Add data plane check for metrics Prometheus (linkerd#1635)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants