
Prometheus Alertmanager + extra args + rules#4220

Merged
alpeb merged 1 commit intolinkerd:masterfrom
naseemkullah:prom-config
May 4, 2020

Conversation

@naseemkullah
Contributor

@naseemkullah naseemkullah commented Mar 31, 2020

This PR allows users to consume Linkerd metrics directly from the Linkerd Prometheus to set up alerts; furthermore, it provides flexibility by means of extra command-line arguments.

Add configurable Prometheus rules values.

These are set as empty by default, but allow the user to configure recording and alerting rules on the linkerd prometheus.

Add prometheusAlertmanagers value.

This allows configuring a list of statically targeted Alertmanager instances.

Allow addition of extra arguments.

This allows end-user flexibility for options such as log format. Rather than bubbling every such config option up into Helm values, extra arguments provide more flexibility.
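As a rough sketch, the three additions described above might look like this in values.yaml (key names follow the PR description; the merged chart may differ):

```yaml
# Illustrative values.yaml excerpt; key names follow the PR description
# and may not exactly match the merged chart.
prometheusRules: {}                  # recording/alerting rules, empty by default
prometheusAlertmanagers:             # statically targeted Alertmanager instances
  - scheme: http
    static_configs:
      - targets:
          - "alertmanager.linkerd.svc:9093"
prometheusExtraArgs:                 # rendered as extra CLI flags, e.g. --log.format=json
  log.format: json
```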

@naseemkullah naseemkullah changed the title Prometheus Alertmanager + extra args Prometheus Alertmanager + extra args + rules Apr 1, 2020
@alpeb
Member

alpeb commented Apr 6, 2020

Thanks @naseemkullah !

Regarding the alerting_rules.yaml and recording_rules.yaml parts, have you considered loading them through separate ConfigMaps? Instead of loading them through the main linkerd-prometheus-config CM, we could allow users to define as many separate CMs as they want, and Linkerd would mount them under /etc/prometheus for Prometheus to load them.
So instead of declaring in values.yaml the content of those extra configs, we would just declare the list of extra CMs to load, e.g. in a prometheusExtraConfigs array. And then we'd modify prometheus.yaml so that it mounts them all. That might imply mounting them in subpaths and modifying the rule_files directive so it crawls down those subpaths.

I think decoupling these configs from the main Linkerd values.yaml might be more flexible. LMKWYT.
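A minimal sketch of the idea, assuming a prometheusExtraConfigs list (the name and mount layout are illustrative, not a final API):

```yaml
# values.yaml (illustrative): names of pre-existing ConfigMaps to mount
# under subpaths of /etc/prometheus.
prometheusExtraConfigs:
  - my-recording-rules
  - my-alerting-rules
```

with prometheus.yml's rule_files directive globbing across the mounted subpaths:

```yaml
# prometheus.yml fragment (illustrative)
rule_files:
  - /etc/prometheus/*/*_rules.yml
  - /etc/prometheus/*/*_rules.yaml
```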

Member

@zaharidichev zaharidichev left a comment


I agree with @alpeb wrt decoupling

@naseemkullah
Contributor Author

Thanks for the feedback @alpeb and @zaharidichev, I'll go with that approach then.

@naseemkullah
Contributor Author

Just pushed a change, but I have not had a chance to test it yet.

@naseemkullah naseemkullah force-pushed the prom-config branch 2 times, most recently from abbcd7f to 95d077d on April 8, 2020 13:45
Member

@alpeb alpeb left a comment


Thanks for the quick update @naseemkullah 👍
I managed to test this successfully after a few changes that I explain below.

@naseemkullah
Contributor Author

Just removed the leftover values and fixed indentations @alpeb , pardon the force-push!

@naseemkullah naseemkullah force-pushed the prom-config branch 2 times, most recently from 85d5174 to 9d179a8 on April 10, 2020 04:28
Member

@alpeb alpeb left a comment


I've added some new comments below.
Also, I forgot to mention that you also need to update pkg/charts/linkerd2/values.go to reference these new properties you added (that's used when installing through the CLI, not Helm).

@Pothulapati
Contributor

Hey Naseem,

Thanks for taking this up! I've been trying to fix this problem in a broader sense (using the current add-on model), and have an RFC up for the same. It would be really awesome to have your reviews/suggestions there as an end user/contributor.

I have the following question:

  • Any particular reasoning for dividing configuration into separate configmaps other than to have the state stored during upgrades? In my RFC, as we already have linkerd-values for storing add-on configuration details I was planning to use the same and not separate them into configmaps. WDYT?

Also, the RFC model isn't much different from this PR, except that we get better separation of configuration there, i.e. prometheus.*, and it's also trying to address the Bring-Your-Own-Prometheus use case. If the RFC isn't missing something that you want, should we directly work on that?

@naseemkullah
Contributor Author

naseemkullah commented Apr 20, 2020

Hey Naseem,

Thanks for taking this up! I've been trying to fix this problem in a broader sense (using the current add-on model), and have an RFC up for the same. It would be really awesome to have your reviews/suggestions there as an end user/contributor.

I have the following question:

  • Any particular reasoning for dividing configuration into separate configmaps other than to have the state stored during upgrades? In my RFC, as we already have linkerd-values for storing add-on configuration details I was planning to use the same and not separate them into configmaps. WDYT?

Hi @Pothulapati, thanks for bringing up these questions.

The original PR had a prometheusRules value, but @alpeb and @zaharidichev advised the ConfigMap approach as per their comments #4220 (comment) and #4220 (review). Either approach would allow a user to define alerting and recording rules; the latter just requires that the ConfigMaps pre-exist.

Also, the RFC model isn't much different from this PR, except that we get better separation of configuration there, i.e. prometheus.*,

The flat values (e.g. prometheusSomeValue) were already in place, so the pattern was followed. I'm not sure what is meant by better separation of configuration, though?

and it's also trying to address the Bring-Your-Own-Prometheus use case. If the RFC isn't missing something that you want, should we directly work on that?

Seeing that this PR was opened quite a while ago, effort was put into it, and most importantly it is ready to be merged. The RFC can fill in any gaps and continue to extract Prometheus into an add-on, if that is OK?

@naseemkullah
Contributor Author

naseemkullah commented Apr 20, 2020

cc @alpeb Can we please merge this if it is good to go? Thanks!

@naseemkullah naseemkullah force-pushed the prom-config branch 2 times, most recently from 8e4c8d3 to f5bc580 on April 20, 2020 10:55
@alpeb
Member

alpeb commented Apr 20, 2020

Any particular reasoning for dividing configuration into separate configmaps other than to have the state stored during upgrades? In my RFC, as we already have linkerd-values for storing add-on configuration details I was planning to use the same and not separate them into configmaps.

@Pothulapati Just for decoupling those configs from the rest, in generic ConfigMaps instead of something Linkerd-specific like linkerd-values, which also deals with the other add-on configs.

@naseemkullah I was waiting for the pkg/charts/linkerd2/values.go change to be addressed before approving this PR (mentioned in my last comment).

Once Prometheus becomes an add-on, its config items will change to something similar to what we have for tracing in values.yaml. If we merge the changes from this PR, things will have to change again when the RFC changes get merged. So I propose we do merge this PR but not announce the changes, and then the RFC should pick up these changes when the config migrates to the add-on. Sounds good to you @Pothulapati?

@Pothulapati
Contributor

instead of something Linkerd-specific like linkerd-values, which also deals with the other add-on configs.

Yep, that makes sense. Also, I'm okay with merging this, as the RFC implementation could take some time to get merged!

@naseemkullah
Contributor Author

@naseemkullah I was waiting for the pkg/charts/linkerd2/values.go change to be addressed before approving this PR (mentioned in my last comment).

Oh right, I'll have that in shortly! Thanks.

Member

@alpeb alpeb left a comment


@naseemkullah To pass the test, we need to reset some values, and also add a workaround for the mergo lib we depend on for merging values when using both values.yaml and values-ha.yaml. Please apply the following patch:

diff --git a/pkg/charts/linkerd2/values_test.go b/pkg/charts/linkerd2/values_test.go
index 11c3f304..323a3930 100644
--- a/pkg/charts/linkerd2/values_test.go
+++ b/pkg/charts/linkerd2/values_test.go
@@ -14,24 +14,27 @@ func TestNewValues(t *testing.T) {
        testVersion := "linkerd-dev"
 
        expected := &Values{
-               Stage:                       "",
-               ControllerImage:             "gcr.io/linkerd-io/controller",
-               ControllerImageVersion:      testVersion,
-               WebImage:                    "gcr.io/linkerd-io/web",
-               PrometheusImage:             "prom/prometheus:v2.15.2",
-               GrafanaImage:                "gcr.io/linkerd-io/grafana",
-               ControllerReplicas:          1,
-               ControllerLogLevel:          "info",
-               PrometheusLogLevel:          "info",
-               ControllerUID:               2103,
-               EnableH2Upgrade:             true,
-               EnablePodAntiAffinity:       false,
-               WebhookFailurePolicy:        "Ignore",
-               OmitWebhookSideEffects:      false,
-               RestrictDashboardPrivileges: false,
-               DisableHeartBeat:            false,
-               HeartbeatSchedule:           "0 0 * * *",
-               InstallNamespace:            true,
+               Stage:                         "",
+               ControllerImage:               "gcr.io/linkerd-io/controller",
+               ControllerImageVersion:        testVersion,
+               WebImage:                      "gcr.io/linkerd-io/web",
+               PrometheusImage:               "prom/prometheus:v2.15.2",
+               GrafanaImage:                  "gcr.io/linkerd-io/grafana",
+               ControllerReplicas:            1,
+               ControllerLogLevel:            "info",
+               PrometheusLogLevel:            "info",
+               PrometheusExtraArgs:           map[string]string{},
+               PrometheusAlertmanagers:       []interface{}{},
+               PrometheusRuleConfigMapMounts: []PrometheusRuleConfigMapMount{},
+               ControllerUID:                 2103,
+               EnableH2Upgrade:               true,
+               EnablePodAntiAffinity:         false,
+               WebhookFailurePolicy:          "Ignore",
+               OmitWebhookSideEffects:        false,
+               RestrictDashboardPrivileges:   false,
+               DisableHeartBeat:              false,
+               HeartbeatSchedule:             "0 0 * * *",
+               InstallNamespace:              true,
                Global: &Global{
                        Namespace:                "linkerd",
                        ClusterDomain:            "cluster.local",
@@ -228,6 +231,11 @@ func TestNewValues(t *testing.T) {
                // Make Add-On Values nil to not have to check for their defaults
                actual.Tracing = nil
 
+               // workaround for mergo, which resets these to []interface{}(nil)
+               // and []PrometheusRuleConfigMapMount(nil)
+               actual.PrometheusAlertmanagers = []interface{}{}
+               actual.PrometheusRuleConfigMapMounts = []PrometheusRuleConfigMapMount{}
+
                if !reflect.DeepEqual(expected, actual) {
                        t.Errorf("Mismatch Helm HA defaults.\nExpected: %+v\nActual: %+v", expected, actual)
                }

@naseemkullah
Contributor Author

expected := &Values{

Thanks @alpeb, how did you reset the values for expected := &Values{?

Changes applied, hoping tests will pass now.

@naseemkullah naseemkullah force-pushed the prom-config branch 2 times, most recently from 0dad051 to e2907de on April 23, 2020 14:57
@naseemkullah
Contributor Author

Worth noting: after applying the patch, `go test ./cli/cmd/... --update` had to be run twice to succeed 🤔 ... it failed on the first try but succeeded on the second 🤷‍♂️

@naseemkullah
Contributor Author

😭
Still failing.

--- FAIL: TestNewValues (0.02s)
    --- FAIL: TestNewValues/HA (0.01s)
        values_test.go:241: Mismatch Helm HA defaults.
            Expected: &{Stage: ControllerImage:gcr.io/linkerd-io/controller ControllerImageVersion:linkerd-dev WebImage:gcr.io/linkerd-io/web PrometheusImage:prom/prometheus:v2.15.2 GrafanaImage:gcr.io/linkerd-io/grafana ControllerReplicas:3 ControllerLogLevel:info PrometheusLogLevel:info PrometheusExtraArgs:map[] PrometheusAlertmanagers:[] PrometheusRuleConfigMapMounts:[] ControllerUID:2103 EnableH2Upgrade:true EnablePodAntiAffinity:true WebhookFailurePolicy:Fail OmitWebhookSideEffects:false RestrictDashboardPrivileges:false DisableHeartBeat:false HeartbeatSchedule:0 0 * * * InstallNamespace:true Configs:{Global: Proxy: Install:} Global:0xc0000eed00 Identity:0xc000010890 Dashboard:0xc0004c8d40 DebugContainer:0xc0000108a0 ProxyInjector:0xc0000108b0 ProfileValidator:0xc0000108c0 Tap:0xc0000108d0 NodeSelector:map[beta.kubernetes.io/os:linux] SMIMetrics:0xc00000f720 DestinationResources:0xc00003d2c0 GrafanaResources:0xc00003d340 HeartbeatResources:0xc00003d2c0 IdentityResources:0xc00003d380 PrometheusResources:0xc00003d3c0 ProxyInjectorResources:0xc00003d2c0 PublicAPIResources:0xc00003d2c0 SPValidatorResources:0xc00003d2c0 TapResources:0xc00003d2c0 WebResources:0xc00003d2c0 Tracing:map[]}
            Actual: &{Stage: ControllerImage:gcr.io/linkerd-io/controller ControllerImageVersion:linkerd-dev WebImage:gcr.io/linkerd-io/web PrometheusImage:prom/prometheus:v2.15.2 GrafanaImage:gcr.io/linkerd-io/grafana ControllerReplicas:3 ControllerLogLevel:info PrometheusLogLevel:info PrometheusExtraArgs:map[] PrometheusAlertmanagers:[] PrometheusRuleConfigMapMounts:[] ControllerUID:2103 EnableH2Upgrade:true EnablePodAntiAffinity:true WebhookFailurePolicy:Fail OmitWebhookSideEffects:false RestrictDashboardPrivileges:false DisableHeartBeat:false HeartbeatSchedule:0 0 * * * InstallNamespace:true Configs:{Global: Proxy: Install:} Global:0xc0000ef300 Identity:0xc000010910 Dashboard:0xc0004c9750 DebugContainer:0xc000010920 ProxyInjector:0xc000010980 ProfileValidator:0xc000010970 Tap:0xc000010990 NodeSelector:map[beta.kubernetes.io/os:linux] SMIMetrics:0xc0001641c0 DestinationResources:0xc00003d000 GrafanaResources:0xc00003d080 HeartbeatResources:0xc00003d140 IdentityResources:0xc00003ce80 PrometheusResources:0xc00003cec0 ProxyInjectorResources:0xc00003cf40 PublicAPIResources:0xc00003d040 SPValidatorResources:0xc00003cf80 TapResources:0xc00003d1c0 WebResources:0xc00003d200 Tracing:map[]}
FAIL

@alpeb
Member

alpeb commented Apr 23, 2020

@naseemkullah my fault; the mergo workaround should have gone inside `t.Run("HA", func(t *testing.T) {`, right after the call to `NewValues(true)`

Member

@alpeb alpeb left a comment


Thanks for the followup @naseemkullah . Just one more thing. The config map is rendering funny unless we trim .Values.prometheusAlertmanagers, so it would need something like this:

{{- toYaml .Values.prometheusAlertmanagers | trim | nindent 8 }}

@naseemkullah
Contributor Author

{{- toYaml .Values.prometheusAlertmanagers | trim | nindent 8 }}

Hi @alpeb, thanks for your guidance thus far.

The current rendering of

prometheusAlertmanagers:
  - scheme: http
    static_configs:
    - targets:
      - "alertmanager.linkerd.svc:9093"
  - scheme: https
    static_configs:
    - targets:
      - alertmanager.mydomain.com

is

  prometheus.yml: |-
    alerting:
      alertmanagers:
        - scheme: http
          static_configs:
          - targets:
            - alertmanager.linkerd.svc:9093
        - scheme: https
          static_configs:
          - targets:
            - alertmanager.mydomain.com

which seems fine, in what case does it render funny?

@alpeb
Member

alpeb commented Apr 27, 2020

It renders fine locally, e.g. using helm template, but when actually installing with helm install if you ask for it through kubectl -n linkerd get cm linkerd-prometheus-config -oyaml you'll see there are issues.

@naseemkullah naseemkullah force-pushed the prom-config branch 3 times, most recently from 8de3d78 to 759eaef on April 27, 2020 19:53
@naseemkullah naseemkullah requested a review from alpeb on April 29, 2020 10:45
Member

@alpeb alpeb left a comment


Thanks @naseemkullah for the fast turnaround; from my perspective this is good to go 👍

| `prometheusAlertmanagers` | Alertmanager instances the Prometheus server sends alerts to, configured via the static_configs parameter. | `[]` |
| `prometheusImage` | Docker image for the Prometheus container | `prom/prometheus:v2.15.2` |
| `prometheusLogLevel` | Log level for Prometheus | `info` |
| `prometheusExtraArgs` | Extra command line options for Prometheus | `{}` |
Contributor


nit: Can you move this up, as we followed alphabetical order here?

Contributor Author


Sure! Done.

Contributor

@Pothulapati Pothulapati left a comment


Overall works great! Thanks for the awesome work!

@naseemkullah
Contributor Author

Thanks @Pothulapati, my absolute pleasure!

This allows end-user flexibility for options such as log format. Rather than bubbling every such config option up into Helm values, extra arguments provide more flexibility.

Add prometheusAlertmanagers value: allows configuring a list of statically targeted Alertmanager instances.
Use rule ConfigMaps for Prometheus rules: they take a list of {name,subPath,configMap} values and are mounted accordingly. Provided the subPaths end with _rules.yml or _rules.yaml, they will be loaded by Prometheus per prometheus.yml's rule_files entries.

Signed-off-by: Naseem <[email protected]>
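The {name,subPath,configMap} shape described above can be sketched as follows (values are illustrative; the referenced ConfigMaps must already exist):

```yaml
# Illustrative values.yaml excerpt for the rule-ConfigMap approach.
prometheusRuleConfigMapMounts:
  - name: alerting-rules
    subPath: alerting_rules.yml      # must end in _rules.yml or _rules.yaml
    configMap: linkerd-prometheus-rules
```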
@alpeb
Member

alpeb commented May 4, 2020

Hey @naseemkullah , please fill out this form if you'd like to receive some well-earned Linkerd swag! 😉

@alpeb alpeb merged commit 6aa1e76 into linkerd:master May 4, 2020
@naseemkullah naseemkullah deleted the prom-config branch May 4, 2020 19:07
@naseemkullah
Contributor Author

Hey @naseemkullah , please fill out this form if you'd like to receive some well-earned Linkerd swag! 😉

Thanks @alpeb!
