
[Feature] CRD for AlertRule and an operator to manage Alert Rules #619

Closed

Conversation

@kevinjqiu (Contributor) commented Sep 14, 2017

See issue #616

Here's what an AlertRule CRD looks like:

apiVersion: monitoring.coreos.com/v1alpha1
kind: Alertrule
metadata:
  name: alertmanager-config-inconsistent
  labels:
    app: alertmanager
    prometheus: prometheus
spec:
  definition: |
    ALERT AlertmanagerConfigInconsistent
      IF   count_values by (service) ("config_hash", alertmanager_config_hash)
         / on(service) group_left
           label_replace(prometheus_operator_alertmanager_spec_replicas, "service", "alertmanager-$1", "alertmanager", "(.*)") != 1
      FOR 5m
      LABELS {
        severity = "critical"
      }
      ANNOTATIONS {
        summary = "Alertmanager configurations are inconsistent",
        description = "The configuration of the instances of the Alertmanager cluster `{{$labels.service}}` are out of sync."
      }

When an AlertRule object is created, the operator creates a corresponding ConfigMap, which gets picked up by Prometheus if its labels match the Prometheus resource's ruleSelector.
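
For context, a rough sketch of the ruleSelector side (label names and the Prometheus apiVersion here are illustrative assumptions, not taken from this PR): Prometheus only loads rule ConfigMaps whose labels match its ruleSelector, so the ConfigMap generated from an AlertRule needs to carry matching labels.

apiVersion: monitoring.coreos.com/v1   # apiVersion depends on the operator version in use
kind: Prometheus
metadata:
  name: prometheus
spec:
  replicas: 2
  # Only ConfigMaps whose labels match this selector are loaded as rule files;
  # the ConfigMap generated from an AlertRule object has to carry these labels.
  ruleSelector:
    matchLabels:
      role: alert-rules        # illustrative label, use whatever your setup expects
      prometheus: prometheus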

Here's an asciinema demo: https://asciinema.org/a/3BwHT6nT5IqqrEWODho7B7F6K

Let me know what you think.

I'll add some e2e tests if you think this is a useful feature for the prometheus-operator.

@coreosbot

Can one of the admins verify this patch?

@brancz (Contributor) commented Sep 15, 2017

Wow that's a lot of code, looks like you know your way around the Prometheus Operator code now 🙂. I'm somewhat confused though how this is different from having ConfigMaps? As far as I understand the code, it creates a ConfigMap from the AlertRule type.

Regarding templating (although I realize this PR doesn't implement it yet): I know that this went very much sideways with other monitoring systems, so I would be very careful with the implementation of such a feature. I personally believe that a proper tool that spits out the "templated" rules is better suited, as it gives better visibility and comprehension than a transparently templated rule, where one can't be sure whether it's a templated rule or hand-written.

@kevinjqiu (Contributor, Author) commented Sep 15, 2017

Wow that's a lot of code

Most of it is from the Alertmanager operator, thanks to Golang's lack of generics :)

I'm somewhat confused though how this is different from having ConfigMaps?

Fair question. I wanted to make Prometheus alert rules a Kubernetes-native object, instead of just a bunch of ConfigMaps. Right now, the AlertRule CRD is a thin wrapper around a .rules file, but having an intermediary representation (the AlertRule CRD) allows us to build further abstraction over alert rule files, e.g.:

...
spec:
  name: AlertmanagerConfigInconsistent
  condition: |-
      count_values by (service) ("config_hash", alertmanager_config_hash)
         / on(service) group_left
           label_replace(prometheus_operator_alertmanager_spec_replicas, "service", "alertmanager-$1", "alertmanager", "(.*)") != 1
  duration: 5m
  labels:
    severity: critical
  annotations:
    summary: Alertmanager configurations are inconsistent
    description: The configuration of the instances of the Alertmanager cluster `{{$labels.service}}` are out of sync.

With regard to templating, I'm sorry, I don't quite understand your comment here. Could you give an example? I'm thinking of leaving it entirely up to Helm to template an AlertRule manifest file. Here's an example of an alert rule in a Helm chart:

{{- if .Values.enabled }}
apiVersion: monitoring.coreos.com/v1alpha1
kind: Alertrule
metadata:
  name: {{ template "fullname" . }}-alertmanager
  labels:
    chart: {{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
{{- if .Values.labels }}
{{ .Values.labels | toYaml | indent 4 }}
{{ end }}
spec:
  definition: |
    ALERT AlertmanagerConfigInconsistent
      IF   count_values by (service) ("config_hash", alertmanager_config_hash)
         / on(service) group_left
           label_replace(prometheus_operator_alertmanager_spec_replicas, "service", "alertmanager-$1", "alertmanager", "(.*)") != 1
      FOR {{ .Values.configInconsistentDuration }}
      LABELS {
          severity = "critical",
          environment = {{ .Values.labels.environment | quote }}
        }
      ANNOTATIONS {
        summary = "Alertmanager configurations are inconsistent",
        description = "The configuration of the instances of the Alertmanager cluster {{`{{$labels.service}}`}} are out of sync."
      }
{{ end }}

@brancz (Contributor) commented Sep 15, 2017

Your example of templating would work just as well with a ConfigMap. What I'm saying is that I'd prefer an external tool that actually generates the alerting rules; putting them into version control and then applying them as plain ConfigMaps is going to cause less pain when one needs to investigate what is wrong with a particular alerting rule. Having a system that dynamically pieces together a bunch of rules and templates them seems like a nice feature, but comes at a high cost when actually troubleshooting it.
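
(For illustration, a minimal sketch of that workflow, with hypothetical names and labels: the generated rules file is committed to version control as a plain ConfigMap and applied unchanged, so what Prometheus loads is exactly what is in git.)

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alert-rules   # hypothetical name
  labels:
    role: alert-rules            # must match the Prometheus resource's ruleSelector
    prometheus: prometheus
data:
  alertmanager.rules: |
    ALERT AlertmanagerConfigInconsistent
      IF   count_values by (service) ("config_hash", alertmanager_config_hash)
         / on(service) group_left
           label_replace(prometheus_operator_alertmanager_spec_replicas, "service", "alertmanager-$1", "alertmanager", "(.*)") != 1
      FOR 5m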

@kevinjqiu (Contributor, Author)

@brancz I hear you. I'm fine with closing it. All is not lost, as at least now I know how the prometheus-operator works at the code level :)

@kevinjqiu kevinjqiu closed this Sep 15, 2017