
Prometheus Alertmanager + extra args + rules#4220

Merged
alpeb merged 1 commit intolinkerd:masterfrom
naseemkullah:prom-config
May 4, 2020

Conversation

@naseemkullah
Contributor

@naseemkullah naseemkullah commented Mar 31, 2020

This PR allows users to consume Linkerd metrics directly from the Linkerd Prometheus to set up alerts; furthermore, it provides flexibility by means of extra command-line arguments.

Add configurable Prometheus rules values.

These are set as empty by default, but allow the user to configure recording and alerting rules on the linkerd prometheus.

Add prometheusAlertmanagers value.

This allows configuring a list of statically targeted Alertmanager instances.

Allow addition of extra arguments.

This allows end-user flexibility for options such as log format. Rather than bubbling every such config option up into Helm values, extra arguments provide more flexibility.
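As a rough sketch, the three additions described above might look like this in values.yaml (key names follow the PR description; the merged chart may differ):

```yaml
# Illustrative values.yaml excerpt; key names follow the PR description
# and may not exactly match the merged chart.
prometheusRules: {}                  # recording/alerting rules, empty by default
prometheusAlertmanagers:             # statically targeted Alertmanager instances
  - scheme: http
    static_configs:
      - targets:
          - "alertmanager.linkerd.svc:9093"
prometheusExtraArgs:                 # rendered as extra CLI flags, e.g. --log.format=json
  log.format: json
```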

@naseemkullah naseemkullah changed the title Prometheus Alertmanager + extra args Prometheus Alertmanager + extra args + rules Apr 1, 2020
@alpeb
Member

alpeb commented Apr 6, 2020

Thanks @naseemkullah !

Regarding the alerting_rules.yaml and recording_rules.yaml parts, have you considered loading them through separate ConfigMaps? Instead of loading them through the main linkerd-prometheus-config CM, we could allow users to define as many separate CMs as they want, and Linkerd would mount them under /etc/prometheus for Prometheus to load them.
So instead of declaring in values.yaml the content of those extra configs, we would just declare the list of extra CMs to load, e.g. in a prometheusExtraConfigs array. And then we'd modify prometheus.yaml so that it mounts them all. That might imply mounting them in subpaths and modifying the rule_files directive so it crawls down those subpaths.

I think decoupling these configs from the main Linkerd values.yaml might be more flexible. LMKWYT.
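A minimal sketch of the idea, assuming a prometheusExtraConfigs list (the name and mount layout are illustrative, not a final API):

```yaml
# values.yaml (illustrative): names of pre-existing ConfigMaps to mount
# under subpaths of /etc/prometheus.
prometheusExtraConfigs:
  - my-recording-rules
  - my-alerting-rules
```

with prometheus.yml's rule_files directive globbing across the mounted subpaths:

```yaml
# prometheus.yml fragment (illustrative)
rule_files:
  - /etc/prometheus/*/*_rules.yml
  - /etc/prometheus/*/*_rules.yaml
```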

Member

@zaharidichev zaharidichev left a comment


I agree with @alpeb wrt decoupling

@naseemkullah
Contributor Author

Thanks for the feedback @alpeb and @zaharidichev, I'll go with that approach then.

@naseemkullah
Contributor Author

Just pushed a change, but I have not had a chance to test it yet.

@naseemkullah naseemkullah force-pushed the prom-config branch 2 times, most recently from abbcd7f to 95d077d on April 8, 2020 13:45
Member

@alpeb alpeb left a comment


Thanks for the quick update @naseemkullah 👍
I managed to test this successfully after a few changes that I explain below.

@naseemkullah
Contributor Author

Just removed the leftover values and fixed indentations @alpeb , pardon the force-push!

@naseemkullah naseemkullah force-pushed the prom-config branch 2 times, most recently from 85d5174 to 9d179a8 on April 10, 2020 04:28
Member

@alpeb alpeb left a comment


I've added some new comments below.
Also, I forgot to mention that you also need to update pkg/charts/linkerd2/values.go to reference these new properties you added (that's used when installing through the CLI, not Helm).

@Pothulapati
Contributor

Hey Naseem,

Thanks for taking this up! I've been trying to fix this problem in a broader sense (using the current add-on model), and have an RFC up for the same. It would be really awesome to have your reviews/suggestions there as an end user/contributor.

I have the following question:

  • Any particular reasoning for dividing configuration into separate configmaps other than to have the state stored during upgrades? In my RFC, as we already have linkerd-values for storing add-on configuration details I was planning to use the same and not separate them into configmaps. WDYT?

Also, the RFC model isn't much different from this PR, except that we get better separation of configuration there, i.e. prometheus.*, and it's also trying to address the Bring-Your-Own-Prometheus use case. If the RFC isn't missing something that you want, should we directly work on that?

@naseemkullah
Contributor Author

naseemkullah commented Apr 20, 2020

Hey Naseem,

Thanks for taking this up! I've been trying to fix this problem in a broader sense (using the current add-on model), and have an RFC up for the same. It would be really awesome to have your reviews/suggestions there as an end user/contributor.

I have the following question:

  • Any particular reasoning for dividing configuration into separate configmaps other than to have the state stored during upgrades? In my RFC, as we already have linkerd-values for storing add-on configuration details I was planning to use the same and not separate them into configmaps. WDYT?

Hi @Pothulapati, thanks for bringing up these questions.

The original PR had a prometheusRules value, but @alpeb and @zaharidichev advised the ConfigMap approach as per their comments #4220 (comment) and #4220 (review). Either approach would allow a user to define alerting and recording rules; the latter just requires that the ConfigMaps pre-exist.

Also, the RFC model isn't much different from this PR, except that we get better separation of configuration there, i.e. prometheus.*,

The flat values (e.g. prometheusSomeValue) were already in place, so the pattern was followed. I'm not sure what is meant by better separation of configuration, though?

and it's also trying to address the Bring-Your-Own-Prometheus use case. If the RFC isn't missing something that you want, should we directly work on that?

Seeing that this PR was opened quite a while ago, effort was put into it, and most importantly it is ready to be merged. The RFC can fill in any gaps and continue to extract Prometheus into an add-on, if that is OK?

@naseemkullah
Contributor Author

naseemkullah commented Apr 20, 2020

cc @alpeb Can we please merge this if it is good to go? Thanks!

@naseemkullah naseemkullah force-pushed the prom-config branch 2 times, most recently from 8e4c8d3 to f5bc580 on April 20, 2020 10:55
@alpeb
Member

alpeb commented Apr 20, 2020

Any particular reasoning for dividing configuration into separate configmaps other than to have the state stored during upgrades? In my RFC, as we already have linkerd-values for storing add-on configuration details I was planning to use the same and not separate them into configmaps.

@Pothulapati Just for decoupling those configs from the rest, in generic ConfigMaps instead of something Linkerd-specific like linkerd-values, which also deals with the other add-on configs.

@naseemkullah I was waiting for the pkg/charts/linkerd2/values.go change to be addressed before approving this PR (mentioned in my last comment).

Once Prometheus becomes an add-on, its config items will change to something similar to what we have for tracing in values.yaml. If we merge the changes from this PR, things will have to change again when the RFC changes get merged. So I propose we do merge this PR but not announce the changes, and then the RFC should pick up these changes when the config migrates to the add-on. Sounds good to you @Pothulapati?

@Pothulapati
Contributor

instead of something Linkerd-specific like linkerd-values, which also deals with the other add-on configs.

Yep, that makes sense. Also, I'm okay with merging this, as the RFC implementation could take some time to get merged!

@naseemkullah
Contributor Author

@naseemkullah I was waiting for the pkg/charts/linkerd2/values.go change to be addressed before approving this PR (mentioned in my last comment).

Oh right, I'll have that in shortly! Thanks.

Member

@alpeb alpeb left a comment


@naseemkullah To pass the test, we need to reset some values, and also add a workaround for the mergo lib we depend on for merging values when using both values.yaml and values-ha.yaml. Please apply the following patch:

diff --git a/pkg/charts/linkerd2/values_test.go b/pkg/charts/linkerd2/values_test.go
index 11c3f304..323a3930 100644
--- a/pkg/charts/linkerd2/values_test.go
+++ b/pkg/charts/linkerd2/values_test.go
@@ -14,24 +14,27 @@ func TestNewValues(t *testing.T) {
        testVersion := "linkerd-dev"
 
        expected := &Values{
-               Stage:                       "",
-               ControllerImage:             "gcr.io/linkerd-io/controller",
-               ControllerImageVersion:      testVersion,
-               WebImage:                    "gcr.io/linkerd-io/web",
-               PrometheusImage:             "prom/prometheus:v2.15.2",
-               GrafanaImage:                "gcr.io/linkerd-io/grafana",
-               ControllerReplicas:          1,
-               ControllerLogLevel:          "info",
-               PrometheusLogLevel:          "info",
-               ControllerUID:               2103,
-               EnableH2Upgrade:             true,
-               EnablePodAntiAffinity:       false,
-               WebhookFailurePolicy:        "Ignore",
-               OmitWebhookSideEffects:      false,
-               RestrictDashboardPrivileges: false,
-               DisableHeartBeat:            false,
-               HeartbeatSchedule:           "0 0 * * *",
-               InstallNamespace:            true,
+               Stage:                         "",
+               ControllerImage:               "gcr.io/linkerd-io/controller",
+               ControllerImageVersion:        testVersion,
+               WebImage:                      "gcr.io/linkerd-io/web",
+               PrometheusImage:               "prom/prometheus:v2.15.2",
+               GrafanaImage:                  "gcr.io/linkerd-io/grafana",
+               ControllerReplicas:            1,
+               ControllerLogLevel:            "info",
+               PrometheusLogLevel:            "info",
+               PrometheusExtraArgs:           map[string]string{},
+               PrometheusAlertmanagers:       []interface{}{},
+               PrometheusRuleConfigMapMounts: []PrometheusRuleConfigMapMount{},
+               ControllerUID:                 2103,
+               EnableH2Upgrade:               true,
+               EnablePodAntiAffinity:         false,
+               WebhookFailurePolicy:          "Ignore",
+               OmitWebhookSideEffects:        false,
+               RestrictDashboardPrivileges:   false,
+               DisableHeartBeat:              false,
+               HeartbeatSchedule:             "0 0 * * *",
+               InstallNamespace:              true,
                Global: &Global{
                        Namespace:                "linkerd",
                        ClusterDomain:            "cluster.local",
@@ -228,6 +231,11 @@ func TestNewValues(t *testing.T) {
                // Make Add-On Values nil to not have to check for their defaults
                actual.Tracing = nil
 
+               // workaround for mergo, which resets these to []interface{}(nil)
+               // and []PrometheusRuleConfigMapMount(nil)
+               actual.PrometheusAlertmanagers = []interface{}{}
+               actual.PrometheusRuleConfigMapMounts = []PrometheusRuleConfigMapMount{}
+
                if !reflect.DeepEqual(expected, actual) {
                        t.Errorf("Mismatch Helm HA defaults.\nExpected: %+v\nActual: %+v", expected, actual)
                }

@naseemkullah
Contributor Author

expected := &Values{

Thanks @alpeb, how did you reset the values for expected := &Values{?

Changes applied, hoping tests will pass now.

@naseemkullah naseemkullah force-pushed the prom-config branch 2 times, most recently from 0dad051 to e2907de on April 23, 2020 14:57
@naseemkullah
Contributor Author

Worth noting: after applying the patch, `go test ./cli/cmd/... --update` had to be run twice to succeed 🤔 ... it failed on the first try but succeeded on the second 🤷‍♂️

@naseemkullah
Contributor Author

😭
Still failing.

--- FAIL: TestNewValues (0.02s)
    --- FAIL: TestNewValues/HA (0.01s)
        values_test.go:241: Mismatch Helm HA defaults.
            Expected: &{Stage: ControllerImage:gcr.io/linkerd-io/controller ControllerImageVersion:linkerd-dev WebImage:gcr.io/linkerd-io/web PrometheusImage:prom/prometheus:v2.15.2 GrafanaImage:gcr.io/linkerd-io/grafana ControllerReplicas:3 ControllerLogLevel:info PrometheusLogLevel:info PrometheusExtraArgs:map[] PrometheusAlertmanagers:[] PrometheusRuleConfigMapMounts:[] ControllerUID:2103 EnableH2Upgrade:true EnablePodAntiAffinity:true WebhookFailurePolicy:Fail OmitWebhookSideEffects:false RestrictDashboardPrivileges:false DisableHeartBeat:false HeartbeatSchedule:0 0 * * * InstallNamespace:true Configs:{Global: Proxy: Install:} Global:0xc0000eed00 Identity:0xc000010890 Dashboard:0xc0004c8d40 DebugContainer:0xc0000108a0 ProxyInjector:0xc0000108b0 ProfileValidator:0xc0000108c0 Tap:0xc0000108d0 NodeSelector:map[beta.kubernetes.io/os:linux] SMIMetrics:0xc00000f720 DestinationResources:0xc00003d2c0 GrafanaResources:0xc00003d340 HeartbeatResources:0xc00003d2c0 IdentityResources:0xc00003d380 PrometheusResources:0xc00003d3c0 ProxyInjectorResources:0xc00003d2c0 PublicAPIResources:0xc00003d2c0 SPValidatorResources:0xc00003d2c0 TapResources:0xc00003d2c0 WebResources:0xc00003d2c0 Tracing:map[]}
            Actual: &{Stage: ControllerImage:gcr.io/linkerd-io/controller ControllerImageVersion:linkerd-dev WebImage:gcr.io/linkerd-io/web PrometheusImage:prom/prometheus:v2.15.2 GrafanaImage:gcr.io/linkerd-io/grafana ControllerReplicas:3 ControllerLogLevel:info PrometheusLogLevel:info PrometheusExtraArgs:map[] PrometheusAlertmanagers:[] PrometheusRuleConfigMapMounts:[] ControllerUID:2103 EnableH2Upgrade:true EnablePodAntiAffinity:true WebhookFailurePolicy:Fail OmitWebhookSideEffects:false RestrictDashboardPrivileges:false DisableHeartBeat:false HeartbeatSchedule:0 0 * * * InstallNamespace:true Configs:{Global: Proxy: Install:} Global:0xc0000ef300 Identity:0xc000010910 Dashboard:0xc0004c9750 DebugContainer:0xc000010920 ProxyInjector:0xc000010980 ProfileValidator:0xc000010970 Tap:0xc000010990 NodeSelector:map[beta.kubernetes.io/os:linux] SMIMetrics:0xc0001641c0 DestinationResources:0xc00003d000 GrafanaResources:0xc00003d080 HeartbeatResources:0xc00003d140 IdentityResources:0xc00003ce80 PrometheusResources:0xc00003cec0 ProxyInjectorResources:0xc00003cf40 PublicAPIResources:0xc00003d040 SPValidatorResources:0xc00003cf80 TapResources:0xc00003d1c0 WebResources:0xc00003d200 Tracing:map[]}
FAIL

@alpeb
Member

alpeb commented Apr 23, 2020

@naseemkullah my fault; the mergo workaround should have gone inside `t.Run("HA", func(t *testing.T) {`, right after the call to `NewValues(true)`

Member

@alpeb alpeb left a comment


Thanks for the followup @naseemkullah . Just one more thing. The config map is rendering funny unless we trim .Values.prometheusAlertmanagers, so it would need something like this:

{{- toYaml .Values.prometheusAlertmanagers | trim | nindent 8 }}

@naseemkullah
Contributor Author

{{- toYaml .Values.prometheusAlertmanagers | trim | nindent 8 }}

Hi @alpeb, thanks for your guidance thus far.

The current rendering of

prometheusAlertmanagers:
  - scheme: http
    static_configs:
    - targets:
      - "alertmanager.linkerd.svc:9093"
  - scheme: https
    static_configs:
    - targets:
      - alertmanager.mydomain.com

is

  prometheus.yml: |-
    alerting:
      alertmanagers:
        - scheme: http
          static_configs:
          - targets:
            - alertmanager.linkerd.svc:9093
        - scheme: https
          static_configs:
          - targets:
            - alertmanager.mydomain.com

which seems fine, in what case does it render funny?

@alpeb
Member

alpeb commented Apr 27, 2020

It renders fine locally, e.g. using helm template, but when actually installing with helm install if you ask for it through kubectl -n linkerd get cm linkerd-prometheus-config -oyaml you'll see there are issues.

@naseemkullah naseemkullah force-pushed the prom-config branch 3 times, most recently from 8de3d78 to 759eaef on April 27, 2020 19:53
@naseemkullah naseemkullah requested a review from alpeb on April 29, 2020 10:45
Member

@alpeb alpeb left a comment


Thanks @naseemkullah for the fast turnaround; from my perspective this is good to go 👍

| `prometheusAlertmanagers` | Alertmanager instances the Prometheus server sends alerts to, configured via the static_configs parameter. | `[]` |
| `prometheusImage` | Docker image for the Prometheus container | `prom/prometheus:v2.15.2` |
| `prometheusLogLevel` | Log level for Prometheus | `info` |
| `prometheusExtraArgs` | Extra command line options for Prometheus | `{}` |
Contributor


nit: Can you move this up, as we followed alphabetical order here?

Contributor Author


Sure! Done.

Contributor

@Pothulapati Pothulapati left a comment


Overall works great! Thanks for the awesome work!

@naseemkullah
Contributor Author

Thanks @Pothulapati, my absolute pleasure!

This allows end-user flexibility for options such as log format. Rather than bubbling every such config option up into Helm values, extra arguments provide more flexibility.

Add prometheusAlertmanagers value: allows configuring a list of statically targeted Alertmanager instances.
Use rule ConfigMaps for Prometheus rules: they take a list of {name,subPath,configMap} values and are mounted accordingly. Provided the subPaths end with _rules.yml or _rules.yaml, they will be loaded by Prometheus per prometheus.yml's rule_files entries.

Signed-off-by: Naseem <[email protected]>
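The {name,subPath,configMap} shape described above can be sketched as follows (values are illustrative; the referenced ConfigMaps must already exist):

```yaml
# Illustrative values.yaml excerpt for the rule-ConfigMap approach.
prometheusRuleConfigMapMounts:
  - name: alerting-rules
    subPath: alerting_rules.yml      # must end in _rules.yml or _rules.yaml
    configMap: linkerd-prometheus-rules
```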
@alpeb
Member

alpeb commented May 4, 2020

Hey @naseemkullah , please fill out this form if you'd like to receive some well-earned Linkerd swag! 😉

@alpeb alpeb merged commit 6aa1e76 into linkerd:master May 4, 2020
@naseemkullah naseemkullah deleted the prom-config branch May 4, 2020 19:07
@naseemkullah
Contributor Author

Hey @naseemkullah , please fill out this form if you'd like to receive some well-earned Linkerd swag! 😉

Thanks @alpeb!
